Python: numpy.genfromtxt - Need column names that

Posted 2019-08-14 03:32

Question:

I am working on importing CSV files with numpy.genfromtxt.

The data to be imported has a header of column names, and some of those column names contain characters that genfromtxt considers invalid. Specifically, some of the names contain "#" and " ". The input data cannot be changed as it is generated by other sources that I do not control.

Using names=True and comments=None, I am unable to bring in all of the column names that I need.

I've tried overriding numpy.lib.NameValidator.deletechars=None, but this does not affect the NameValidator class instance that is actually in use.

I understand that deletechars exists due to the recarray potential to access a field as if it were an attribute. However, I simply must be able to read in column names that include invalid characters, even if the characters are stripped off when read in.

Is there a way to force the NameValidator to not check for invalid characters, or to modify the characters it checks for? I am unable to modify numpy/lib/_iotools.py as I am not root and it would be bad to modify a shared installation.

Answer 1:

You do not explicitly state that numpy.genfromtxt is a hard requirement, so let me suggest that you try asciitable.

This module has a way to replace certain entries before parsing: http://cxc.harvard.edu/contrib/asciitable/#replace-bad-or-missing-values

And you can also define your own readers based on the existing ones: http://cxc.harvard.edu/contrib/asciitable/#advanced-table-reading

The asciitable readers return numpy arrays, so you should be able to swap it in for the functions you currently use more or less directly.



Answer 2:

NameValidator uses its default set for deletechars if it is constructed with deletechars=None, but if you pass in a non-None set it uses that instead. np.genfromtxt takes a deletechars parameter, which it passes on to NameValidator.

So, you should be able to write

np.genfromtxt(..., deletechars=set())

for an empty set, or some subset of the default set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<"""):

deletechars = np.lib._iotools.NameValidator.defaultdeletechars - set("# ")
np.genfromtxt(..., deletechars=deletechars)
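
As a minimal sketch of the call above in context (the sample data and column names here are made up for illustration, and replace_space=" " is an extra assumption on my part so that spaces inside a name are kept rather than turned into underscores):

```python
import io
import numpy as np

# Hypothetical sample data: header names contain '#' and a space.
csv_text = "id#,my value\n1,2.5\n2,3.5\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,
    comments=None,       # do not treat '#' as the start of a comment
    deletechars=set(),   # pass an empty set instead of the default characters
    replace_space=" ",   # assumption: keep inner spaces instead of using '_'
    encoding="utf-8",
)

names = data.dtype.names
print(names)
print(data["my value"])
```

Fields whose names contain "#" or " " cannot be reached as recarray attributes, but indexing by name, as in data["my value"], should still work.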


Answer 3:

IMHO, genfromtxt is often used in cases where some simpler solutions would do.

So, unless you have troublesome datasets (missing entries, multiple unknown column types), you're better off coding a quick and dirty parser (i.e., skip some rows, parse the header, read the rest, and reorganize at the end).
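
A rough sketch of such a parser, using only the standard library, could look like the following. The helper name read_table, the skip_rows parameter, and the sample data are all hypothetical, and the sketch assumes every data cell is numeric and every column name is unique:

```python
import csv
import io


def read_table(lines, skip_rows=0):
    """Hypothetical helper: return (names, columns) from CSV text lines."""
    reader = csv.reader(lines)
    for _ in range(skip_rows):      # skip leading rows that are not data
        next(reader)
    # Parse the header; only surrounding whitespace is trimmed.
    names = [name.strip() for name in next(reader)]
    columns = {name: [] for name in names}
    for row in reader:
        if not row:                 # ignore blank lines
            continue
        for name, cell in zip(names, row):
            columns[name].append(float(cell))
    return names, columns


# Made-up input: one junk line, then a header containing '#' and ' '.
sample = io.StringIO("generated by some tool\nid#,my value\n1,2.5\n2,3.5\n")
names, columns = read_table(sample, skip_rows=1)
print(names)
print(columns["my value"])
```

The "reorganize at the end" step is left as a plain dict of lists here; each list could then be wrapped in a numpy array if that is what the rest of the code expects.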

Now, if you really need genfromtxt, @ecatmur rightly pointed out that the deletechars argument of genfromtxt is passed to _iotools.NameValidator to construct the set of characters to delete. Using deletechars=None tells NameValidator to use a default set. The first thing to try is to pass an empty set or '' instead of deletechars=None.

Note that, no matter what, double quotes (") and leading or trailing spaces will be removed, and names that end up identical will be differentiated with a numeric suffix:

>>> import numpy as np
>>> fields = ["blah", "'blah'", "\"blah\"", "#blah", "blah "]
>>> np.lib._iotools.NameValidator(deletechars='').validate(fields)
('blah', "'blah'", 'blah_1', '#blah', 'blah_2')

Once cleaned, the third and last entries would collide with the first and give three columns named blah, so the validator renames them blah_1 and blah_2.

If this doesn't suit you, I'm afraid you're hitting a block: there's no current way to tell genfromtxt to accept a customized NameValidator. That could be a good idea, though, so you may want to raise the point on numpy's mailing list.