I am working on importing CSV files with numpy.genfromtxt.
The data to be imported has a header of column names, and some of those column names contain characters that genfromtxt considers invalid. Specifically, some of the names contain "#" and " ". The input data cannot be changed as it is generated by other sources that I do not control.
Using names=True and comments=None, I am unable to bring in all of the column names that I need.
I've tried overriding numpy.lib.NameValidator.deletechars=None, but this does not affect the NameValidator class instance that is actually in use.
I understand that deletechars exists because a recarray can expose a field as if it were an attribute. However, I simply must be able to read in column names that include invalid characters, even if the characters are stripped off when read in.
Is there a way to force the NameValidator to not check for invalid characters, or to modify the characters it checks for? I am unable to modify numpy/lib/_iotools.py as I am not root and it would be bad to modify a shared installation.
You do not explicitly state that numpy.genfromtxt is a hard requirement, so let me suggest that you try asciitable.
This module has a way to replace certain entries before parsing: http://cxc.harvard.edu/contrib/asciitable/#replace-bad-or-missing-values
And you can also define your own readers based on the existing ones: http://cxc.harvard.edu/contrib/asciitable/#advanced-table-reading
The asciitable readers return numpy arrays, so you should be able to replace the functions you currently use more or less directly with asciitable.
IMHO, genfromtxt is often used in cases where a simpler solution would do. So, unless you have some troublesome datasets (missing entries, multiple unknown column types), you're better off coding a quick and dirty parser (i.e., skip some rows, parse the header, read the rest and reorganize at the end).
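As a rough illustration of such a quick and dirty parser, here is a minimal sketch using only the standard library. The helper name quick_parse, the sample data and the assumption that every data value is numeric are mine, not part of the original question.

```python
import csv
import io


def quick_parse(text, skip_rows=0):
    """Hypothetical helper: return (names, columns) parsed from CSV text."""
    reader = csv.reader(io.StringIO(text))
    for _ in range(skip_rows):
        next(reader)  # skip leading rows that are not part of the table
    names = next(reader)  # header kept verbatim, '#' and spaces included
    # Assumption: every remaining non-empty row holds only numeric values.
    rows = [[float(value) for value in row] for row in reader if row]
    # Reorganize row-wise data into one list per column, keyed by raw name.
    columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    return names, columns


sample = "generated by some tool\ntime,flow #1,rate (m/s)\n0,1.5,10\n1,2.5,20\n"
names, columns = quick_parse(sample, skip_rows=1)
print(names)
print(columns["flow #1"])
```

Since the names are only used as dictionary keys, no character has to be stripped. Each column list can be handed to numpy.array afterwards if arrays are needed.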
Now, if you really need genfromtxt, @ecatmur rightly pointed out that the deletechars argument of genfromtxt is passed to _iotools.NameValidator to construct the set of characters to delete. Using deletechars=None tells NameValidator to use a default set. A first thing to try is to pass not deletechars=None but an empty set or ''.
Note that no matter what, double quotes (") and trailing spaces will be deleted, and identical names will be differentiated: entries that would otherwise all result in columns named blah are renamed.
If this doesn't suit you, I'm afraid you're hitting a block: there's currently no way to tell genfromtxt to accept a customized NameValidator. That could be a good idea, though, so you may want to raise the point on numpy's mailing list.
NameValidator
will use its default set for deletechars if constructed with deletechars=None, but if you pass in a non-None set then it will use that. And np.genfromtxt takes a deletechars parameter which it passes to NameValidator.
So you should be able to pass deletechars=set() for an empty set, or some subset of the default set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""") with the characters you need to keep removed from it.
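A minimal sketch of passing an empty deletechars set to genfromtxt follows. The helper name read_with_raw_names and the sample data are made up for illustration, and replace_space=" " is an extra assumption on my part: it keeps spaces in names instead of turning them into the default underscore.

```python
import io

import numpy as np


def read_with_raw_names(text):
    """Hypothetical helper: parse CSV text, keeping header names nearly verbatim."""
    return np.genfromtxt(
        io.StringIO(text),
        delimiter=",",
        names=True,         # take field names from the first row
        comments=None,      # do not treat '#' as a comment marker
        deletechars=set(),  # empty set: nothing is stripped except '"'
        replace_space=" ",  # assumption: keep spaces rather than use '_'
    )


data = read_with_raw_names("time,flow #1,rate (m/s)\n0,1.5,10\n1,2.5,20\n")
print(data.dtype.names)
print(data["flow #1"])

# Double quotes and trailing spaces are still removed, and names that
# collide afterwards get a numeric suffix.
dupes = read_with_raw_names('blah#,blah",blah \n1,2,3\n4,5,6\n')
print(dupes.dtype.names)
```

With these arguments the first call should keep "flow #1" and "rate (m/s)" as field names, so the columns are reachable by indexing (data["flow #1"]) though not as recarray attributes. The second call shows the renaming of colliding names.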