I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:
- integer
- floating point
- date
- boolean
- string
My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.
I am particularly concerned about the date data type - is there a Python module for parsing common date idioms? (Obviously I will not be able to detect them all.)
What about integers and floats?
Dateutil comes to mind for parsing dates.
For integers and floats you could always try a cast in a try/except block:
>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'dsa'
>>> try:
...     cg = float(g)
... except ValueError:
...     print("g is not a float")
...
g is not a float
>>>
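Putting the two ideas together, a single-value type guesser can be built from successive conversion attempts. This is just a sketch: `guess_type` is a made-up helper, and the list of date formats it tries is an assumption you would extend for your data.

```python
import datetime

# Hypothetical helper (not from any library): guess the type of a
# single CSV string by attempting conversions in order of strictness.
def guess_type(s):
    try:
        int(s)
        return "integer"
    except ValueError:
        pass
    try:
        float(s)
        return "floating point"
    except ValueError:
        pass
    if s.lower() in ("true", "false"):
        return "boolean"
    # Try a few common date layouts; this list is an assumption.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"):
        try:
            datetime.datetime.strptime(s, fmt)
            return "date"
        except ValueError:
            pass
    return "string"
```

Trying `int` before `float` matters: `int("2.5")` raises ValueError, so `"2.5"` falls through to the float test, while `"9"` is caught as an integer first.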
ast.literal_eval() can get the easy ones.
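For instance, a minimal converter built on it might look like this; `convert` is an illustrative name, not a standard function. `ast.literal_eval()` handles ints, floats, and booleans in one call, and raises on anything that is not a Python literal, so unparseable values stay strings.

```python
import ast

# ast.literal_eval() safely evaluates Python literals, so it covers
# the "easy" types (int, float, bool) in one shot; anything it cannot
# parse raises ValueError or SyntaxError and is kept as a string.
def convert(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return s
```

Note that it will not recognize dates: `"2010-01-01"` parses as an arithmetic expression, which `literal_eval` rejects, so dates still need separate handling.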
> The data types I am currently interested in are...
These do not exist in a CSV file. The data is only strings. Only. Nothing more.
> test a sample of rows
Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.
The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.
If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in US or European format (there's no way to know whether 1/1/10 is US or European).
try:
    x = datetime.datetime.strptime(row[0], some_format)
except ValueError:
    pass  # column is not a valid date
If column 2 is supposed to be a float, you can only do this:
try:
    y = float(row[1])
except ValueError:
    pass  # column is not a valid float
If column 3 is supposed to be an int, you can only do this:
try:
    z = int(row[2])
except ValueError:
    pass  # column is not a valid int
There is no way to "discover" whether the CSV has floating-point digit strings except by calling float() on each row. If a row fails, then someone prepared the file improperly.
Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.
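That one-pass approach might be sketched as follows. The column layout (date, float, int) and the date format are assumptions chosen purely for illustration:

```python
import csv
import datetime

# Assumed layout for illustration: row = [date, float, int].
# Convert while processing; rows that fail conversion were
# prepared improperly and are set aside.
def process(lines):
    good, bad = [], []
    for row in csv.reader(lines):
        try:
            when = datetime.datetime.strptime(row[0], "%Y-%m-%d")
            amount = float(row[1])
            count = int(row[2])
        except (ValueError, IndexError):
            bad.append(row)  # conversion failed: bad input row
        else:
            good.append((when, amount, count))
    return good, bad
```

Conversion and validation happen in the same pass, so there is no separate "discovery" step to maintain.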
Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.
You may be interested in this Python library, which does exactly this kind of type guessing on general Python data as well as CSV and XLS files:
- https://github.com/okfn/messytables
- https://messytables.readthedocs.org/ - docs
It happily scales to very large files, streaming data off the internet, etc.
There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
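The idea behind that algorithm can be caricatured as voting: try each candidate conversion on a sample of a column's values and pick the type that succeeds most often. This sketch is not messytables' actual code; the single date format and the candidate list are assumptions.

```python
import datetime

# Illustrative predicates, one per candidate type (not messytables' API).
def is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def is_date(s):
    try:
        datetime.datetime.strptime(s, "%Y-%m-%d")  # assumed format
        return True
    except ValueError:
        return False

# Vote over a sample of the column: each value "votes" for every
# type it converts to; the most specific type with the top score wins.
def guess_column_type(values):
    candidates = [("integer", is_int), ("date", is_date),
                  ("floating point", is_float)]
    scores = {name: sum(test(v) for v in values)
              for name, test in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "string"
```

Listing "integer" before "floating point" breaks ties in favor of the stricter type, since every integer string also converts to float.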