I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:
- integer
- floating point
- date
- boolean
- string
My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.
I am particularly concerned about the date data type - is there a Python module for parsing common date idioms? (Obviously I will not be able to detect them all.)
What about integers and floats?
Dateutil comes to mind for parsing dates.
For integers and floats you could always try a cast in a try/except block:
>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'dsa'
>>> try:
...     cg = float(g)
... except ValueError:
...     print("g is not a float")
...
g is not a float
>>>
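Putting the two ideas together, a single-value type guesser can be built from successive conversion attempts. This is just a sketch: `guess_type` is a made-up helper, and the list of date formats it tries is an assumption you would extend for your data.

```python
import datetime

# Hypothetical helper (not from any library): guess the type of a
# single CSV string by attempting conversions in order of strictness.
def guess_type(s):
    try:
        int(s)
        return "integer"
    except ValueError:
        pass
    try:
        float(s)
        return "floating point"
    except ValueError:
        pass
    if s.lower() in ("true", "false"):
        return "boolean"
    # Try a few common date layouts; this list is an assumption.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"):
        try:
            datetime.datetime.strptime(s, fmt)
            return "date"
        except ValueError:
            pass
    return "string"
```

Trying `int` before `float` matters: `int("2.5")` raises ValueError, so `"2.5"` falls through to the float test, while `"9"` is caught as an integer first.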
ast.literal_eval() can get the easy ones.
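For instance, a minimal converter built on it might look like this; `convert` is an illustrative name, not a standard function. `ast.literal_eval()` handles ints, floats, and booleans in one call, and raises on anything that is not a Python literal, so unparseable values stay strings.

```python
import ast

# ast.literal_eval() safely evaluates Python literals, so it covers
# the "easy" types (int, float, bool) in one shot; anything it cannot
# parse raises ValueError or SyntaxError and is kept as a string.
def convert(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return s
```

Note that it will not recognize dates: `"2010-01-01"` parses as an arithmetic expression, which `literal_eval` rejects, so dates still need separate handling.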
> The data types I am currently interested in are...
These do not exist in a CSV file. The data is only strings. Only. Nothing more.
> test a sample of rows
Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.
The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.
If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in US or European format (there's no way to know whether 1/1/10 is US or European).
try:
    x = datetime.datetime.strptime(row[0], some_format)
except ValueError:
    pass  # column is not a valid date
If column 2 is supposed to be a float, you can only do this:
try:
    y = float(row[1])
except ValueError:
    pass  # column is not a valid float
If column 3 is supposed to be an int, you can only do this:
try:
    z = int(row[2])
except ValueError:
    pass  # column is not a valid int
There is no way to "discover" whether the CSV has floating-point digit strings except by calling float() on each row. If a row fails, then someone prepared the file improperly.
Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.
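That one-pass approach might be sketched as follows. The column layout (date, float, int) and the date format are assumptions chosen purely for illustration:

```python
import csv
import datetime

# Assumed layout for illustration: row = [date, float, int].
# Convert while processing; rows that fail conversion were
# prepared improperly and are set aside.
def process(lines):
    good, bad = [], []
    for row in csv.reader(lines):
        try:
            when = datetime.datetime.strptime(row[0], "%Y-%m-%d")
            amount = float(row[1])
            count = int(row[2])
        except (ValueError, IndexError):
            bad.append(row)  # conversion failed: bad input row
        else:
            good.append((when, amount, count))
    return good, bad
```

Conversion and validation happen in the same pass, so there is no separate "discovery" step to maintain.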
Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.
You may be interested in this Python library, which does exactly this kind of type guessing on general Python data as well as CSV and XLS files:
- https://github.com/okfn/messytables
- https://messytables.readthedocs.org/ - docs
It happily scales to very large files, streaming data off the internet, etc.
There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)
The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
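The idea behind that algorithm can be caricatured as voting: try each candidate conversion on a sample of a column's values and pick the type that succeeds most often. This sketch is not messytables' actual code; the single date format and the candidate list are assumptions.

```python
import datetime

# Illustrative predicates, one per candidate type (not messytables' API).
def is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def is_date(s):
    try:
        datetime.datetime.strptime(s, "%Y-%m-%d")  # assumed format
        return True
    except ValueError:
        return False

# Vote over a sample of the column: each value "votes" for every
# type it converts to; the most specific type with the top score wins.
def guess_column_type(values):
    candidates = [("integer", is_int), ("date", is_date),
                  ("floating point", is_float)]
    scores = {name: sum(test(v) for v in values)
              for name, test in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "string"
```

Listing "integer" before "floating point" breaks ties in favor of the stricter type, since every integer string also converts to float.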