I ran across a new problem today. Some data that I'm working with looks like this (in a CSV file):
Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Felony - Drug,1989,30-31,31,"3,861","1,176","328,235"
Male,White,Felony - Other,1989,30-31,31,"2,626",800,"328,235"
Male,White,Felony - Property,1989,30-31,31,"3,468","1,057","328,235"
Male,White,Felony - Violent/Sex,1989,30-31,31,"3,150",960,"328,235"
Male,White,Misdemeanor,1989,30-31,31,"46,441","14,149","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"
It's hard to see the problem, so let me highlight the second-to-last column:
"18,141"
"1,176"
800
"1,057"
960
"14,149"
0
The problem is that values containing commas are exported as quoted strings, while values without commas are exported as bare numbers. To be clear, the data should be interpreted as:
18141
1176
800
1057
960
14149
0
That is, it should all be interpreted as numeric values.
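For reference, CSV itself carries no type information. A minimal sketch with Python's standard csv module (the row below is copied from the sample above) shows that every field arrives as a string; the quotes only keep the embedded comma from being read as a field delimiter:

```python
import csv
import io

# One row copied from the sample data above.
row = 'Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"\n'

# csv.reader returns every field as a string; quoting only affects
# how the embedded comma is parsed, not the field's type.
fields = next(csv.reader(io.StringIO(row)))
print(fields[-2])  # '18,141' -- still a string, comma intact
```

So the "types" in question are entirely the reading application's guess.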
This pattern makes me think that some "standard" application is exporting data like this. For the moment, let's say that it is Excel.
Is there any effective way to import flat files with this varying data type within the same column? Both R (read_csv from the readr library) and Python's pandas (read_csv), using their default settings, interpreted this data by:
- Presuming the values should all be numbers (regardless of whether or not quotes are present in all "cells").
- Presuming that the commas, therefore, must be European-style decimal separators (a comma instead of the US period).
So, both packages interpreted that column as follows:
18.141 1.176 800 1.057 960 14.149 0
In a way, it's impressive that both R (read_csv from readr) and pandas (read_csv) could handle this incongruity and get the guesses almost right.
However, is there a flag that I can set, or a package out there, which can handle this sort of thing? For instance, a flag to say "strip quoted commas; they are most certainly not European-style decimal separators."
If not, is there enough of a need to contribute to this via forking either of their GitHub repos?
pandas.read_csv has a thousands parameter which you can set to ',' so that pandas will parse your column as numbers, treating the commas as group separators.
There is also a converters parameter that takes a dictionary of columns and corresponding functions to apply to each value. You can use it for more complex preprocessing; something like this does the same thing:
pd.read_csv('data.csv', converters={'column_name': lambda x: int(x.replace(',',''))})
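To make the thousands approach concrete, here is a minimal, self-contained sketch using two rows of the sample data (the header names are invented for illustration; the original file appears to have none):

```python
import io
import pandas as pd

# Header names here are made up; the original file has no header row.
csv_data = '''sex,race,offense,year,age_group,age,count,subcount,total
Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"
'''

# thousands=',' tells pandas that commas inside numbers are group
# separators, so quoted values like "18,141" parse as the integer 18141.
df = pd.read_csv(io.StringIO(csv_data), thousands=',')
print(df['subcount'].tolist())  # [18141, 0]
print(df['subcount'].dtype)     # int64
```

Note that thousands applies to every column, which is usually what you want here; converters is the per-column alternative when only some columns need cleaning.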