Cannot store Euro-sign into LOB String property wi

2019-02-12 17:26发布

问题:

I am having trouble writing and reading back special characters like the Euro-sign (€) into LOB String properties in PostgreSQL 8.4 with Hibernate 3.6.10.

What I know is that PostgreSQL provides two distinct ways to store large character objects in a column of a table. They can be stored either directly into that table column or indirectly in a separate table (it's actually called pg_largeobject). In the latter case, the column holds a reference (OID) to the row in pg_largeobject.

The default behaviour in Hibernate 3.6.10 is the indirect OID approach. However, it is possible to add an extra annotation @org.hibernate.annotations.Type(type="org.hibernate.type.TextType") to the Lob property to get the direct storage behaviour.

Both approaches work fine, except for the moment that I want to work with special characters like the Euro sign (€). In that case the direct storage mechanism keeps working, but the indirect storage mechanism breaks.

I'd like to demonstrate that with an example. I created a test entity with 2 @Lob properties. One follows the direct storage principle, the other the indirect storage:

@Basic
@Lob
@Column(name = "CLOB_VALUE_INDIRECT_STORAGE", length = 2147483647)
public String getClobValueIndirectStorage()

and

@Basic
@Lob
@org.hibernate.annotations.Type(type="org.hibernate.type.TextType")
@Column(name = "CLOB_VALUE_DIRECT_STORAGE", length = 2147483647)
public String getClobValueDirectStorage()

If I create an entity, populate both properties with the Euro sign and then persist it towards the database I see the following when I do a SELECT I see

 id | clob_value_direct_storage | clob_value_indirect_storage
----+---------------------------+----------------------------
  6 | €                         | 910579                     

If I then query the table pg_largeobject I see:

  loid  | pageno | data
--------+--------+------
 910579 |      0 | \254

The 'data' column of pg_largeobject is of type bytea, which means that the information is stored as raw bytes. The expression '\254' represents one single byte and in UTF-8 represents the character '¬'. This is exactly the value that I get back when I load the entity back from the database.

The Euro sign in UTF-8 consists of 3 bytes, so I would have expected the 'data' column to have 3 bytes and not 1.

This does not only occur for the Euro sign, but for many special characters. Is this a problem in Hibernate? Or the JDBC driver? Is there a way I can tweak this behaviour?

Thanks in advance,
Kind regards,
Franck de Bruijn

回答1:

After a lot of digging around in the source code of Hibernate and the PostgreSQL JDBC driver I managed to find the root cause of the problem. In the end the write() method of the BlobOutputStream (provided by the JDBC driver) is invoked to write the contents of the Clob into the database. This method looks like this:

public void write(int b) throws java.io.IOException
{
    checkClosed();
    try
    {
        if (bpos >= bsize)
        {
            lo.write(buf);
            bpos = 0;
        }
        buf[bpos++] = (byte)b;
    }
    catch (SQLException se)
    {
        throw new IOException(se.toString());
    }
}

This method takes an 'int' (32 bits/4 bytes) as argument and converts it to a 'byte' (8 bits/1 byte) effectively losing 3 bytes of information. String representations within Java are UTF-16 encoded, meaning that each character is represented by 16 bits/2 bytes. The Euro-sign has the int value 8364. After conversion to byte, the value 172 remains (in octet representation 254).

I am not sure what now the best resolution is to this problem. IMHO the JDBC driver should be responsible for encoding/decoding the Java UTF-16 characters to whatever encoding the database needs. However, I do not see any tweaking possibilities in the JDBC driver code to alter its behaviour (and I do not want to write and maintain my own JDBC driver code).

Therefore, I extended Hibernate with a custom ClobType and managed to convert the UTF-16 characters to UTF-8 before writing to the database and vice-versa when retrieving the Clob.

The solutions is too large to just simple paste in this answer. If you are interested, drop me a line, and I send it to you.

Cheers, Franck