Hibernate/JPA import.sql utf8 characters corrupted

2019-03-25 10:07发布

问题:

I'm using import.sql to write my development data to DB. I'm using MySQL Server 5.5 and my persistence.xml is here:

<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0"
xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
<persistence-unit name="MobilHM" transaction-type="RESOURCE_LOCAL">
    <provider>org.hibernate.ejb.HibernatePersistence</provider>
    <class>tr.com.stigma.db.entity.Doctor</class>
    <class>tr.com.stigma.db.entity.Patient</class>
    <class>tr.com.stigma.db.entity.Record</class>
    <class>tr.com.stigma.db.entity.User</class>
    <properties>
        <property name="hibernate.hbm2ddl.auto" value="create" />
        <property name="hibernate.show_sql" value="true" />
        <property name="hibernate.format_sql" value="true" />
        <!-- Auto detect annotation model classes -->
        <property name="hibernate.archive.autodetection" value="class" />
        <!-- Datasource -->
        <property name="hibernate.connection.driver_class" value="com.mysql.jdbc.Driver" />
        <property name="hibernate.connection.username" value="mobilhm" />
        <property name="hibernate.connection.password" value="mobilhm" />
        <property name="hibernate.connection.url" value="jdbc:mysql://localhost/mobilhm" />
        <property name="hibernate.dialect" value="org.hibernate.dialect.MySQLDialect" />
    </properties>
</persistence-unit>

Some characters in my import.sql is not shown correctly in DB. For example character ü becomes ü in db. Default charset in mysql is utf-8 and I'm creating tables like

CREATE TABLE doctor (doctorId int unsigned NOT NULL AUTO_INCREMENT, name varchar(45) NOT NULL, surname varchar(45) NOT NULL, PRIMARY KEY (doctorId)) ENGINE=InnoDB DEFAULT CHARSET=utf8;

It is weird that if I import using Mysql import/export manager data is correct but using hibernate.hbm2ddl.auto=create makes characters corrupted.

How can I solve this?

Edit: Also I've tried adding

<property name="hibernate.connection.useUnicode" value="true" />
<property name="hibernate.connection.characterEncoding"
            value="UTF-8" />
<property name="hibernate.connection.charSet" value="UTF-8" />

to persistence.xml. But it didn't help.

Fix: I've solved it eventually. I'm using Tomcat and that is the point of corruption not hibernate or mysql. I've started it with set JAVA_OPTS=-Dfile.encoding=UTF-8 command and my problem goes away.

The title of question became misleading now. Sorry for that.

回答1:

When creating the reader for that file, Hibernate uses new InputStreamReader(stream); directly, without explicit encoding (the default execution platform charset encoding is assumed/used).

So, in other words, your import.sql file must be in the default execution platform charset encoding.

There is an old (2006!) open issue for this, in case one wishes to send a patch: https://hibernate.atlassian.net/browse/HBX-711


Options to fix:

  • Add -Dfile.encoding=UTF-8 to the JAVA_OPTS environment variable, such as:

    # Linux/Unix
    export JAVA_OPTS=-Dfile.encoding=UTF-8
    # Windows
    set JAVA_OPTS=-Dfile.encoding=UTF-8
    
    # Attention, check before if your JAVA_OPTS doesn't already have a value. If so,
    # then it should be
    export JAVA_OPTS=$JAVA_OPTS -Dfile.encoding=UTF-8
    # or
    set JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8
    
  • Set a property in your Maven plugin (could be surefire, failsafe or other, depending on how do you run the code that imports the hibernate file). Example for surefire:

    <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-surefire-plugin</artifactId>
       <configuration>
          <argLine>-Dfile.encoding=UTF8</argLine>
       </configuration>
    </plugin>
    
  • If gradle: To add this property in gradle add systemProperty 'file.encoding', 'UTF-8' to task configuration block. (Thanks @meztihn)



回答2:

I'm using import.sql to populate database on test phase and this link has helped me to solve encoding problem: http://javacimrman.blogspot.ru/2011/07/hibernate-importsql-encoding-when.html.



回答3:

Here's a reliable solution without setting any system property.

We assume that import files are encoded with UTF-8 but Java default charset is different, let's say latin1.

1) Define a custom class for import_files_sql_extractor hibernate.hbm2ddl.import_files_sql_extractor=com.pragmasphere.hibernate.CustomSqlExtractor

2) fix the invalid strings read by hibernate in the implementation.

package com.pragmasphere.hibernate;

import org.hibernate.tool.hbm2ddl.MultipleLinesSqlCommandExtractor;

import java.io.IOError;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class CustomSqlExtractor extends MultipleLinesSqlCommandExtractor {

    private final String SOURCE_CHARSET = "UTF-8";

    @Override
    public String[] extractCommands(final Reader reader) {
        String[] lines = super.extractCommands(reader);

        Charset charset = Charset.defaultCharset();
        if (!charset.equals(Charset.forName(SOURCE_CHARSET))) {
            for (int i = 0; i < lines.length; i++) {
                try {
                    lines[i] = new String(lines[i].getBytes(), SOURCE_CHARSET);
                } catch (UnsupportedEncodingException e) {
                    throw new IOError(e);
                }
            }
        }

        return lines;
    }
}

You can change value of SOURCE_CHARSET with another encoding used by import files.