Storing string datasets in hdf5 with unicode

I am trying to store variable-length strings read from a file that contains special characters such as ø, æ, and å. Here is my code:

import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"

However the text is not stored properly. The data stored contains text:

"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"

How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8

Edit:

The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.

Tags: python-3.x utf-8 h5py

2 Answers

Ridiculous、

Reply 2 · 2019-08-16 18:09

You should try storing your data in UTF-8 format by doing the following:

To encode as UTF-8 (before storing with h5py), do this in Python 3:

"æ".encode("utf-8")

which returns:

b'\xc3\xa6'

Then, to decode, call decode on the bytes object (in Python 3 a str has no decode method):

b'\xc3\xa6'.decode("utf-8")

which returns:

'æ'

Hope it helps!

EDIT

When you open a text file that is encoded in UTF-8, you can pass the encoding parameter to the built-in open function:

f = open(fname, encoding="utf-8")

This should ensure that the original file is decoded properly when it is read.

Source: python-notes
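As a minimal sketch of the point above, the following round-trips the question's sample text through a temporary file using an explicit encoding. The file name input.txt and the temporary directory are assumptions for illustration only; they stand in for whatever file the text is actually read from.

```python
import os
import tempfile

text = "some text with ø, æ, å"

# Hypothetical stand-in for the real input file: a small UTF-8 text file.
with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "input.txt")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(text)

    # An explicit encoding yields str objects regardless of the platform default.
    with open(fname, encoding="utf-8") as f:
        read_back = f.read()

    # Binary mode shows the raw UTF-8 bytes that are actually on disk.
    with open(fname, "rb") as f:
        raw = f.read()

print(read_back == text)  # the decoded text matches what was written
print(raw)                # ø is stored as the two bytes \xc3\xb8
```

The str that comes back from open(..., encoding="utf-8") can be assigned to a vlen=str dataset or attribute directly, without a manual encode step.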


Evening l夕情丶

Reply 3 · 2019-08-16 18:15

With:

import numpy as np
import h5py as h5
file = h5.File('deleteme.hdf5','w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(3,),dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()

file = h5.File('deleteme.hdf5','r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()

I see:

$ python3 stack44661467.py 
['ø' 'æ' 'å']
some text with ø, æ, å

That is, h5py does interpret the strings as Unicode, both when writing and when reading.

With the dump utility:

$ h5dump deleteme.hdf5 
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777703\37777777670", "\37777777703\37777777646",
      (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}

Note that in both cases the datatype is marked as UTF8:

     DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }

That's what the docs say:

http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8

They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.

Let h5py (or another reader) worry about interpreting \37777777703\37777777670 as the proper Unicode character.
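To check by hand that the dump really holds UTF-8, note that each escape appears to be one byte printed as a sign-extended 32-bit value in octal: 0o37777777703 & 0xFF is 0xC3, and 0o37777777670 & 0xFF is 0xB8, which together are the UTF-8 encoding of ø. The sketch below applies that reading to the dumped strings. The function decode_h5dump_escapes is a hypothetical helper, not part of h5py or the HDF5 tools, and it assumes the escapes are always 11 octal digits, as in the dump above.

```python
import re

# Assumption: every escape is a backslash followed by 11 octal digits, as in the dump above.
ESCAPE = re.compile(r"\\([0-7]{11})")

def decode_h5dump_escapes(dumped: str) -> str:
    """Hypothetical helper: rebuild the text from h5dump-style octal escapes."""
    out = bytearray()
    pos = 0
    for match in ESCAPE.finditer(dumped):
        # Copy the plain text between escapes unchanged.
        out += dumped[pos:match.start()].encode("utf-8")
        # Keep only the low 8 bits of the sign-extended value.
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += dumped[pos:].encode("utf-8")
    return out.decode("utf-8")

dumped = (r"some text with \37777777703\37777777670, "
          r"\37777777703\37777777646, \37777777703\37777777645")
decoded = decode_h5dump_escapes(dumped)
print(decoded == "some text with ø, æ, å")
```

This is only a way to confirm what the dump contains; for normal use, reading through h5py as shown above is enough.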
