How do I make eclipse print out weird characters i

Posted 2019-03-20 07:16

Question:

So I'm trying to make my program output a text file with a list of names. Some of the names have weird characters, such as Åström.

I grabbed this list of names from a web page that is encoded in UTF-8, or at least I'm pretty sure it is, because the page source says

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

This is what I've tried so far:

public static void write(List<String> list) throws IOException  {
        Writer out = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
        try {
            for (int i=0;i<list.size();i++) {
                try {
                    byte[] utf8Bytes = list.get(i).getBytes("UTF-8");
                    out.write(new String(utf8Bytes, "UTF-8"));
                } catch (UnsupportedEncodingException e) {
                    e.printStackTrace();
                }

                out.write(System.getProperty("line.separator"));

            }
        }
        finally {
            out.close();
        }
    }

and I'm a little confused as to why it's not working. The output I get is "Åström", which is very weird.

Can someone please point me in the right direction? Thanks!

And on another unrelated note, is there an easier way to write a new line to a text file besides the clunky

out.write(System.getProperty("line.separator"));

that I have? I saw that online somewhere and it works, but I was just wondering if there was a cleaner way.

Answer 1:

Set your Eclipse > Preferences > General > Workspace > Text file encoding to UTF-8.



Answer 2:

The content is indeed in UTF-8, and it looks fine when printed to the console. What may be causing the problem is the decoding and re-encoding of the string, which is unnecessary. Instead of writing to the OutputStreamWriter directly, try using a java.io.PrintWriter. Its println methods print the string followed by the system line separator. It would look something like:

printWriter.println(list.get(i));
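
As a minimal sketch of that suggestion, the method below wraps a UTF-8 writer in a PrintWriter so that println supplies the line separator. The class name NameFileWriter, the Path parameter, and the sample names are assumptions made for illustration; they are not from the question. The charset is passed explicitly because a PrintWriter built from just a file name may fall back to the platform default encoding.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the name and the target path are illustrative only.
final class NameFileWriter {

    // Writes one name per line, encoding the characters as UTF-8.
    static void write(List<String> names, Path target) throws IOException {
        // The charset is given explicitly, so the platform default is never consulted.
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            for (String name : names) {
                out.println(name); // println appends the system line separator
            }
            // PrintWriter swallows IOExceptions, so ask it whether anything failed.
            if (out.checkError()) {
                throw new IOException("Failed to write " + target);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file independent of the editor's encoding.
        write(Arrays.asList("\u00C5str\u00F6m", "Smith"), Paths.get("test.txt"));
    }
}
```

No manual getBytes or new String calls are needed here; the writer does the encoding.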

Also, when opening the file to check it, try using a browser. Browsers let you choose the encoding after opening the file, so you can quickly try several encodings and see which one is really being used.
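
Another way to see which encoding was really used is to look at the raw bytes. The sketch below is only an illustration (EncodingCheck, toHex, and misread are made-up names): it prints the UTF-8 bytes of the name as hex, then decodes those same bytes with ISO-8859-1 to mimic a viewer that guesses the wrong encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for inspecting encoded text.
final class EncodingCheck {

    // Returns the bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes UTF-8 bytes with the wrong charset, as a mis-guessing viewer would.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "\u00C5str\u00F6m"; // the name from the question, written with escapes
        // prints: C3 85 73 74 72 C3 B6 6D
        System.out.println(toHex(name.getBytes(StandardCharsets.UTF_8)));
        // 6 characters become 8 when each UTF-8 byte is read as one character
        System.out.println(misread(name).length());
    }
}
```

To inspect the actual output file, the result of Files.readAllBytes(Paths.get("test.txt")) could be passed to toHex. If the two-byte sequences C3 85 and C3 B6 are there, the file is valid UTF-8 and the problem lies with the viewer.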



Answer 3:

Notepad is not a particularly feature-rich editor. It will attempt to guess the document encoding, sometimes with unexpected results. "Plain text" documents don't carry any metadata about their encoding, which gives them certain limitations. Windows apps (Notepad included) often rely on the byte order mark (U+FEFF, or "\uFEFF" in Java strings) to determine whether the encoding is a Unicode format. That might help out Notepad; it's going to be useless for your web page problem.
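
If the file only needs to open cleanly in Notepad, one option is to write the byte order mark as the first character. This is a sketch under that assumption; BomWriter and writeWithBom are hypothetical names, and other tools that read the file may not expect the extra three bytes.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the name is illustrative only.
final class BomWriter {

    // Writes the text as UTF-8, preceded by a byte order mark.
    static void writeWithBom(Path target, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // U+FEFF is encoded in UTF-8 as the bytes EF BB BF
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        writeWithBom(Paths.get("test.txt"), "\u00C5str\u00F6m" + System.lineSeparator());
    }
}
```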

The HTML 4 spec defines how output encoding should be set. You should set the Content-Type HTTP header in addition to specifying the meta encoding.

You don't mention what you're using in your web app. A servlet should set the content type by calling setContentType("text/html; charset=UTF-8"); a JSP should use the page directive to do the same. Other view technologies will provide similar mechanisms.


byte[] utf8Bytes = list.get(i).getBytes("UTF-8");
out.write(new String(utf8Bytes, "UTF-8"));

This code performs some useless operations; it transcodes character data from UTF-16 to UTF-8, then back from UTF-8 to UTF-16, then writes data to a Writer (which will transcode the UTF-16 to UTF-8 again). This code is equivalent:

String str = list.get(i);
out.write(str);

Use a PrintWriter to get newline support.
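
The claim that the round trip changes nothing can be checked directly. The sketch below uses a made-up class name, RoundTripDemo, and assumes the string holds well-formed text (no unpaired surrogates), which is the case for ordinary names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is illustrative only.
final class RoundTripDemo {

    // Encodes to UTF-8 and decodes again, exactly as the question's code does.
    static String roundTrip(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00C5str\u00F6m";
        // prints: true
        System.out.println(roundTrip(name).equals(name));
    }
}
```

Since the result equals the input, the round trip is not what garbles the output; it is just extra work.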

