I ran across a new problem today. Some data that I'm working with looks like this (in a CSV file):
Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Felony - Drug,1989,30-31,31,"3,861","1,176","328,235"
Male,White,Felony - Other,1989,30-31,31,"2,626",800,"328,235"
Male,White,Felony - Property,1989,30-31,31,"3,468","1,057","328,235"
Male,White,Felony - Violent/Sex,1989,30-31,31,"3,150",960,"328,235"
Male,White,Misdemeanor,1989,30-31,31,"46,441","14,149","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"
It's hard to see the problem, so let me highlight the second-to-last column:
"18,141"
"1,176"
800
"1,057"
960
"14,149"
0
The problem is that values containing commas are exported as quoted strings, while values without commas are exported as bare numbers. To be clear, the data should be interpreted as:
18141
1176
800
1057
960
14149
0
That is, it should all be interpreted as numeric values.
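For reference, CSV itself carries no type information. A minimal sketch with Python's standard csv module (the row below is copied from the sample above) shows that every field arrives as a string; the quotes only keep the embedded comma from being read as a field delimiter:

```python
import csv
import io

# One row copied from the sample data above.
row = 'Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"\n'

# csv.reader returns every field as a string; quoting only affects
# how the embedded comma is parsed, not the field's type.
fields = next(csv.reader(io.StringIO(row)))
print(fields[-2])  # '18,141' -- still a string, comma intact
```

So the "types" in question are entirely the reading application's guess.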
This pattern makes me think that some "standard" application is exporting data like this. For the moment, let's say that it is Excel.
Is there any effective way to import flat files with this varying data type within the same column? Both R (read_csv from the readr library) and Python's pandas (read_csv), using their default settings, interpreted this data by:
- Presuming the values should all be numbers (regardless of whether or not quotes are present in all "cells").
- Presuming that the commas, therefore, must be European-style decimal separators (a comma instead of the US period).
So, both packages interpreted that column as follows:
18.141 1.176 800 1.057 960 14.149 0
In a way, it's impressive that both R (read_csv from readr) and pandas (read_csv) could handle this incongruity and get the guesses almost right.
However, is there a flag that I can set, or a package out there, which can handle this sort of thing? For instance, a flag to say "strip quoted commas; they are most certainly not European-style decimal separators."
If not, is there enough of a need to contribute to this via forking either of their GitHub repos?
pandas.read_csv has a thousands parameter which you can set to ',' so that pandas will parse your column as numbers, treating the commas as group separators.
There is also a converters parameter that takes a dictionary of columns and corresponding functions to apply to each value. You can use it for more complex preprocessing; something like this does the same thing:
pd.read_csv('data.csv', converters={'column_name': lambda x: int(x.replace(',',''))})
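To make the thousands approach concrete, here is a minimal, self-contained sketch using two rows of the sample data (the header names are invented for illustration; the original file appears to have none):

```python
import io
import pandas as pd

# Header names here are made up; the original file has no header row.
csv_data = '''sex,race,offense,year,age_group,age,count,subcount,total
Male,White,All Combined,1989,30-31,31,"59,546","18,141","328,235"
Male,White,Status,1989,30-31,31,0,0,"328,235"
'''

# thousands=',' tells pandas that commas inside numbers are group
# separators, so quoted values like "18,141" parse as the integer 18141.
df = pd.read_csv(io.StringIO(csv_data), thousands=',')
print(df['subcount'].tolist())  # [18141, 0]
print(df['subcount'].dtype)     # int64
```

Note that thousands applies to every column, which is usually what you want here; converters is the per-column alternative when only some columns need cleaning.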