I have a valid XML file (valid because a browser can parse it) that I am trying to parse using JDOM2. The code ran fine for other XML files, but for this particular file it throws the following exception on the builder.build() line: "com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence."
My code is as follows:
import java.io.*;
import java.util.*;
import java.net.*;
import org.jdom2.*;
import org.jdom2.input.*;
import org.jdom2.output.*;
import org.jdom2.adapters.*;
public class Test
{
    public static void main(String st[])
    {
        String results="N.A.";
        SAXBuilder builder = new SAXBuilder();
        Document doc;
        results = scrapeSite().trim();
        try
        {
            doc = builder.build(new ByteArrayInputStream(results.getBytes()));
        }
        catch(JDOMException e)
        {
            System.out.println(e.toString());
        }
        catch(IOException e)
        {
            System.out.println(e.toString());
        }
    }
    public static String scrapeSite()
    {
        String temp="";
        try
        {
            URL url = new URL("http://msu-footprints.org/2011/Aditya/search_5.xml");
            URLConnection conn = url.openConnection();
            conn.setAllowUserInteraction(false);
            InputStream urlStream = url.openStream();
            BufferedReader br = new BufferedReader(new InputStreamReader(urlStream));
            String t = br.readLine();
            while(t!=null)
            {
                temp = temp + t;
                t = br.readLine();
            }
        }
        catch(IOException e)
        {
            System.out.println(e.toString());
        }
        return temp;
    }
}
As jtahlborn points out, you should always treat XML as bytes, letting the parser work out the encoding.
But more than that, you should never use the no-argument String.getBytes() to get the bytes of a string: it encodes with the platform default charset, so the result varies from machine to machine and may not be what you think it is.
In this case you can just read the bytes from the site directly. But even if you were constructing XML in a string and then handing it to a parser as a byte sequence (or, more likely, writing the bytes to a file), you would want to specify the encoding so that it matches the encoding the XML declares, which by default is UTF-8.
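A minimal sketch of encoding a string with an explicit charset. The class name `ExplicitEncoding`, the helper `toUtf8Bytes`, and the sample XML are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets.UTF_8` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class ExplicitEncoding {
    // Encode the XML text with the same encoding its declaration names (UTF-8 here),
    // instead of relying on the platform default used by the no-argument getBytes().
    static byte[] toUtf8Bytes(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample document containing one non-ASCII character.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><note>caf\u00e9</note>";
        byte[] bytes = toUtf8Bytes(xml);
        // U+00E9 needs two bytes in UTF-8, so the byte count exceeds the char count by one.
        System.out.println(xml.length() + " chars, " + bytes.length + " bytes");
    }
}
```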
Likewise, if for some reason you need to use a Writer or Reader, you must specify the encoding to write or read in.
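As a rough illustration, the sketch below passes the charset explicitly to `OutputStreamWriter` and `InputStreamReader`. The class and method names are hypothetical, and in-memory streams stand in for a real file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitReaderWriter {
    // Write text as UTF-8 bytes; the charset is given explicitly to the Writer.
    static byte[] write(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read UTF-8 bytes back into a String; the charset is given explicitly to the Reader.
    static String read(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "<note>caf\u00e9</note>";
        // Both sides agree on UTF-8, so the text survives the round trip.
        System.out.println(read(write(text)).equals(text));
    }
}
```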
If you need to construct XML, a good way is to use the XMLStreamWriter class.
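A minimal sketch of building a small document with `XMLStreamWriter` from the JDK's `javax.xml.stream` package. The element name `note` and the class `StreamWriterDemo` are invented for the example. The same encoding is named both when creating the writer and in the XML declaration, so the bytes and the declaration agree.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StreamWriterDemo {
    // Build <note>message</note> as UTF-8 bytes; the writer escapes markup characters.
    static byte[] buildXml(String message) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("note");
            w.writeCharacters(message);
            w.writeEndElement();
            w.writeEndDocument();
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = buildXml("a < b");
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

The resulting byte array can be handed straight to a parser or written to a file without ever passing through a default-charset conversion.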
Why are you reading the XML into a String with a Reader? You are corrupting the XML before you parse it. Treat XML as bytes, not chars.
And why are you reading the whole URL InputStream just to convert it into another ByteArrayInputStream? You can reduce that to about 2 lines of code by passing the URL InputStream directly to the builder (not to mention avoiding the additional memory issues caused by reading the entire stream into memory).
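To make the corruption concrete, here is a small sketch of what a decode/encode round trip with the wrong charset does to UTF-8 bytes. It assumes the platform default charset was windows-1252, which the question does not state. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCorruption {
    // Decode bytes to a String and encode them back with the same charset,
    // which is what an InputStreamReader followed by getBytes() amounts to.
    static byte[] roundTrip(byte[] original, Charset assumed) {
        return new String(original, assumed).getBytes(assumed);
    }

    public static void main(String[] args) {
        // U+201D (right double quotation mark) is the three bytes E2 80 9D in UTF-8.
        byte[] utf8 = "\u201d".getBytes(StandardCharsets.UTF_8);
        // Assumption: the default charset on the failing machine was windows-1252.
        byte[] damaged = roundTrip(utf8, Charset.forName("windows-1252"));
        System.out.println("unchanged: " + Arrays.equals(utf8, damaged));
        for (byte b : damaged) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

Byte 0x9D has no character assigned in windows-1252, so it cannot survive the round trip. The first two bytes of the sequence come back intact and the third does not, which would be consistent with the "Invalid byte 3 of 3-byte UTF-8 sequence" error.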
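A sketch of that shorter version, assuming JDOM2 is on the classpath. `SAXBuilder.build(InputStream)` hands the raw bytes to the parser, which works out the encoding from the XML declaration. The class name `DirectParse` and the helper `parse` are made up; the URL is the one from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;

class DirectParse {
    // Pass the byte stream straight to the builder: no Reader, no String, no getBytes().
    static Document parse(InputStream in) {
        try {
            return new SAXBuilder().build(in);
        } catch (JDOMException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://msu-footprints.org/2011/Aditya/search_5.xml");
        try (InputStream in = url.openStream()) {
            Document doc = parse(in);
            System.out.println(doc.getRootElement().getName());
        }
    }
}
```

SAXBuilder also has a `build(URL)` overload, which would shorten this further if you do not need control over the stream.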