How to correctly read url content with utf8 chars?

2019-02-16 03:13发布

问题:

    public class URLReader {
         public static byte[] read(String from, String to, String string){
          try {
           String text = "http://translate.google.com/translate_a/t?"+
                        "client=o&text="+URLEncoder.encode(string, "UTF-8")+
                        "&hl=en&sl="+from+"&tl="+to+"";

           URL url = new URL(text);
           BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), "UTF-8"));
           String json = in.readLine();
           byte[] bytes = json.getBytes("UTF-8");
           in.close();
           return bytes;
                    //return text.getBytes();
          }
          catch (Exception e) {
           return null;
          }
         }
        }

and:

public class AbcServlet extends HttpServlet {
 public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
  resp.setContentType("text/plain;charset=UTF-8");
  resp.getWriter().println(new String(URLReader.read("pl", "en", "koń")));
 }
}

When I run this i get:{"sentences"[{"trans":"end","orig":"koďż˝","translit":"","src_translit":""}],"src":"pl","server_time":30} so utf doesnt work correctly but if i return encoded url: http://translate.google.com/translate_a/t?client=o&text=ko%C5%84&hl=en&sl=pl&tl=en and paste at url bar i get correctly:{"sentences":[{"trans":"horse","orig":"koń","translit":"","src_translit":""}],"dict":[{"pos":"noun","terms":["horse"]}],"src":"pl","server_time":76}

回答1:

byte[] bytes = json.getBytes("UTF-8");

gives you a UTF-8 bytes sequences so URLReader.read also give you UTF-8 bytes sequences

but you tried to decode with without specifying the encoder, i.e. new String(URLReader.read("pl", "en", "koń")) so Java will use your system default encoding to decode (which is not UTF-8)

Try :

new String(URLReader.read("pl", "en", "koń"), "UTF-8")

Update

Here is fully working code on my machine:

public class URLReader {

    public static byte[] read(String from, String to, String string) {
        try {
            String text = "http://translate.google.com/translate_a/t?"
                    + "client=o&text=" + URLEncoder.encode(string, "UTF-8")
                    + "&hl=en&sl=" + from + "&tl=" + to + "";
            URL url = new URL(text);
            URLConnection conn = url.openConnection();
            // Look like faking the request coming from Web browser solve 403 error
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String json = in.readLine();
            byte[] bytes = json.getBytes("UTF-8");
            in.close();
            return bytes;
            //return text.getBytes();
        } catch (Exception e) {
            System.out.println(e);
            // becarful with returning null. subsequence call will return NullPointException.
            return null;
        }
    }
}

Don't forget to escape ń to \u0144. Java compiler may not compile Unicode text properly so it is good idea to write it in plain ASCII.

public class AbcServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain;charset=UTF-8");
        byte[] read = URLReader.read("pl", "en", "ko\u0144");
        resp.getOutputStream().write(read) ;
    }
}