I am trying to store variable-length string expressions from a file which contains special characters, like ø, æ, and å. Here is my code:
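import h5py as h5
file = h5.File('deleteme.hdf5', 'a')
# Variable-length string dtype (h5py stores it as UTF-8)
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text", (1,), dtype=dt)
# Store the text as an attribute on the dataset
dset.attrs[str(1)] = "some text with ø, æ, å"
# Close the file so everything is flushed to disk before inspecting it
file.close()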
However, the text does not appear to be stored properly. The stored data contains the text:
"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"
How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8
Edit:
The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.
You should try storing your data in UTF-8 format by doing the following:
To encode in UTF-8 format (before storing with h5py), call encode('utf-8') on the text, which returns a byte string.
Then, to decode, call decode('utf-8') on those bytes, which returns the original text.
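A minimal sketch of that encode/decode round trip, assuming Python 3 (where str holds Unicode text and bytes holds the encoded form). The variable names are only illustrative:

```python
# Text containing non-ASCII characters, as in the question.
text = "some text with ø, æ, å"

# Encode the Unicode text to UTF-8 bytes; each of ø, æ, å becomes two bytes.
encoded = text.encode("utf-8")

# Decode the bytes back to Unicode text.
decoded = encoded.decode("utf-8")

print(encoded)  # b'some text with \xc3\xb8, \xc3\xa6, \xc3\xa5'
print(decoded)  # some text with ø, æ, å
```

Note that ø is the byte pair 0xC3 0xB8 in UTF-8, which is relevant when reading raw dumps of the file.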
Hope it helps!
EDIT
When you open files and want them read as UTF-8, you can pass the encoding parameter to the function that opens the file.
This should help ensure the contents of the original file are decoded properly.
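As a sketch of what this could look like, assuming Python 3, where the built-in open() accepts an encoding argument (on Python 2, io.open or codecs.open play the same role). The helper name read_utf8 and the temporary file are hypothetical and only there to make the example self-contained:

```python
import os
import tempfile


def read_utf8(path):
    """Read a whole text file, decoding its bytes as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Write a small sample file so the example can run on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("some text with ø, æ, å\n")

    # Reading with an explicit encoding avoids relying on the platform default.
    content = read_utf8(path)

print(content)
```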
Source: python-notes
With:
I see:
That is, h5py does see/interpret the strings as Unicode, both when writing and when reading.
With the dump utility:
Note that in both cases the datatype is marked UTF8.
That's what the docs say:
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character.
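To see why the dump is harmless, note that each escape in the question's output is consistent with a single UTF-8 byte printed as a sign-extended 32-bit value in octal: \37777777703 is 0xFFFFFFC3, whose low byte 0xC3 is the first byte of ø, and \37777777670 ends in 0xB8, the second. The sketch below is not part of h5py or h5dump; decode_h5dump_escapes is a hypothetical helper that assumes this escape format and masks each value back to one byte:

```python
import re


def decode_h5dump_escapes(dumped):
    """Rebuild text from a dump that shows non-ASCII bytes as octal escapes.

    Assumes each backslash-octal escape stands for one byte (possibly
    sign-extended), and that the bytes together form valid UTF-8.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]+)", dumped):
        # Copy the plain text before this escape unchanged.
        out += dumped[pos:match.start()].encode("utf-8")
        # Keep only the low 8 bits, undoing the sign extension.
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += dumped[pos:].encode("utf-8")
    return out.decode("utf-8")


# The string reported by h5dump in the question.
dumped = (r"some text with \37777777703\37777777670, "
          r"\37777777703\37777777646,\37777777703\37777777645")
print(decode_h5dump_escapes(dumped))  # some text with ø, æ,å
```

The recovered bytes decode cleanly as UTF-8, which matches the conclusion that the data in the file is fine and only its display in the dump looks odd.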