Safe HTML form accept charset?

Posted 2019-08-02 11:30

I ran into a parameter encoding issue when submitting a form with the GET method (I can't use POST). Some accented characters were not escaped as the server expected in the URL, because my page was UTF-8. The Spring controller received garbled characters instead.

I solved this issue by setting accept-charset="ISO-8859-1" on my form, but now I am wondering which charset is safe for all server/browser combinations. Is there a recommended charset for my forms and GET URLs?

3 answers
Emotional °昔
#2 · 2019-08-02 12:15

The problem is that URLs are always encoded as 7-bit ASCII. Because your form sends values outside the standard ASCII set via GET, you have several issues going on:

  1. URLs have practical length limits (roughly 2,000 characters in some browsers and servers), so your form values might be getting truncated.
  2. If a user enters characters outside the ISO-8859-1 charset you set in the form's accept-charset attribute, they will not be encoded correctly into the URL. The browser first converts the characters to bytes in the chosen encoding and then percent-encodes those bytes as ASCII, so any character that is not in the ISO-8859-1 set cannot be represented correctly.
  3. By default the browser encodes the characters in your URL using the page encoding, which comes from the meta tag or, if the server sends one, from the HTTP Content-Type header, which overrides the meta tag. The default encoding for HTML5 pages is UTF-8, but your form's accept-charset overrides that. Either way, the browser replaces each non-ASCII character with "%" followed by the hexadecimal digits of its bytes in the page's encoding or, in your case, the form's encoding. That is what is sent to the server, so look at your URL to see what has been sent.
  4. When your URL reaches the server, it arrives as 7-bit ASCII, so you first need to read the string as ASCII and then decode it with the page encoding or, in your case, the form's accept-charset to get the true values.
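The encode and decode steps in points 2 to 4 can be tried out in plain Java. This is only a rough sketch, assuming Java 10 or later (for the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode`); the class and method names are made up for illustration, and `URLEncoder` approximates what a browser does for a GET form rather than reproducing it exactly.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any framework.
class FormEncodingDemo {
    // Percent-encode a form value using the bytes of the given charset.
    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    // Server side: percent-decode with the charset the form actually used.
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    // True if every character of the value exists in the given charset.
    static boolean representable(String value, Charset charset) {
        return charset.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00E4"; // the letter a with diaeresis
        String euro = "\u20AC";    // the euro sign

        System.out.println(encode(aUmlaut, StandardCharsets.UTF_8));      // %C3%A4
        System.out.println(encode(aUmlaut, StandardCharsets.ISO_8859_1)); // %E4

        // The euro sign has no ISO-8859-1 code, but UTF-8 covers it.
        System.out.println(representable(euro, StandardCharsets.ISO_8859_1)); // false
        System.out.println(representable(euro, StandardCharsets.UTF_8));      // true

        // Decoding with the matching charset restores the original value.
        System.out.println(decode("%C3%A4", StandardCharsets.UTF_8).equals(aUmlaut)); // true
    }
}
```

The same character produces different percent sequences depending on the charset, which is why the server has to decode with the charset the form used.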

I recommend you remove the form encoding, use the page's UTF-8 setting for broader character support, and add the two meta tags below to make sure you are sending back UTF-8 encoded data. UTF-8 includes all the characters you need and is easily decoded on the server, as described by the other posters.

<meta charset="utf-8" />
<meta content="text/html; charset=utf-8" http-equiv="content-type" />
Explosion°爆炸
#3 · 2019-08-02 12:16

nickdos is right. Another way of doing this is to use the meta tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

Also keep in mind that the server-side code handling the data must use the same encoding.

Example:

Use stringParamer.getBytes("utf-8") instead of stringParamer.getBytes(), which falls back to the JVM's default charset.
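As a small illustration of why the explicit charset matters, the sketch below prints the bytes the same string produces under two charsets. The class and method names are made up for this example; it uses the `StandardCharsets` constants, which avoid the checked `UnsupportedEncodingException` thrown by the string-name overload.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class GetBytesDemo {
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00E4"; // the letter a with diaeresis

        // Two bytes in UTF-8 (0xC3 0xA4), shown as signed Java bytes.
        System.out.println(Arrays.toString(utf8Bytes(aUmlaut)));   // [-61, -92]
        // One byte in ISO-8859-1 (0xE4).
        System.out.println(Arrays.toString(latin1Bytes(aUmlaut))); // [-28]
    }
}
```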

And when using Spring make sure the correct encoding is configured for message converters in the DispatcherServlet's configuration file (XYZ_-servlet.xml), e.g.:

<bean id="stringHttpMessageConverter" class="org.springframework.http.converter.StringHttpMessageConverter">
    <property name="supportedMediaTypes" value = "text/plain;charset=UTF-8"/>
</bean>
老娘就宠你
#4 · 2019-08-02 12:20

This is frustrating (to put it mildly) with servlets. The standard URL encoding must use UTF-8 yet servlets not only default to ISO-8859-1 but don't offer any way to change that with code.

Sure, you can call req.setCharacterEncoding("UTF-8") before you read anything, but for some ungodly reason this only affects the request body, not query string parameters. There is nothing in the servlet request interface to specify the encoding used for query string parameters.

Using ISO-8859-1 in your form is a hack. This ancient encoding will cause more problems than it solves, especially since browsers do not really support ISO-8859-1 and always treat it as Windows-1252, whereas servlets treat ISO-8859-1 as ISO-8859-1. You will be screwed beyond belief if you go with this.
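The mismatch is easy to see with a byte in the 0x80 to 0x9F range, where the two encodings differ. The sketch below is illustrative only: it assumes Java 10 or later and that the JVM ships the windows-1252 charset (it is not one of the guaranteed standard charsets, though common JDKs include it). The class name is made up.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class Latin1VsCp1252Demo {
    // Assumption: this JVM provides windows-1252.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        // A browser treating ISO-8859-1 as Windows-1252 sends the euro sign as %80.
        String browserView = decode("%80", WINDOWS_1252);
        String servletView = decode("%80", StandardCharsets.ISO_8859_1);

        // Windows-1252 maps 0x80 to the euro sign, U+20AC.
        System.out.println(Integer.toHexString(browserView.codePointAt(0))); // 20ac
        // ISO-8859-1 maps 0x80 to the control character U+0080.
        System.out.println(Integer.toHexString(servletView.codePointAt(0))); // 80
    }
}
```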

To change this in Tomcat, for example, you can use the URIEncoding attribute on your <Connector> element:

<Connector ... URIEncoding="UTF-8" ... />

If your container has no such setting, or you can't change its settings for some other reason, you can still make it work, because ISO-8859-1 decoding retains the full information of the original bytes.

String correct = new String(request.getParameter("test").getBytes("ISO-8859-1"), "UTF-8");

So let's say test=ä and everything is set correctly, so the browser encodes it as test=%C3%A4. Your servlet will incorrectly decode it as ISO-8859-1 and give you the resulting string "Ã¤". If you apply the correction, you can get ä back:

System.out.println(new String("Ã¤".getBytes("ISO-8859-1"), "UTF-8").equals("ä"));
//true
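The whole round trip can be reproduced without a servlet container. The following is a minimal sketch, assuming Java 10 or later; `containerDecode` only simulates a container that decodes query strings as ISO-8859-1, and all names are hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class QueryParamRepairDemo {
    // Simulates a container that percent-decodes query strings as ISO-8859-1.
    static String containerDecode(String rawValue) {
        return URLDecoder.decode(rawValue, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes, then decodes them as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // ISO-8859-1 maps each of the 256 byte values to exactly one char,
    // so decoding and re-encoding returns the same bytes.
    static boolean latin1RoundTripIsLossless() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String misdecoded = containerDecode("%C3%A4");
        System.out.println(misdecoded.equals("\u00C3\u00A4"));   // true: two chars, the mojibake
        System.out.println(repair(misdecoded).equals("\u00E4")); // true: the original letter
        System.out.println(latin1RoundTripIsLossless());         // true
    }
}
```

Apply such a repair only when the container really did decode as ISO-8859-1; running it on a value that was already decoded correctly would corrupt it.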