I have a .csv with the following structure:
date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc
I am trying to read them into a pandas DataFrame using the following code:
import pandas as pd
data = pd.read_csv('my_awesome_csv.csv', parse_dates=[[0,1]],
                   infer_datetime_format=True)
This works just fine, except that I would like to control the data types in each column. When I check the dtypes in the interpreter, I discover that the quoted numbers are not recognized as numbers, dollar amounts or otherwise.
In [10]: data.dtypes
Out[10]:
date_begin_date_end    object
name                   object
name_code               int64
active_accounts        object   # Problem, I want this to be a number
transaction_amount     object   # Ditto, I want this to be a number (it's a dollar amount)
transaction_count      object   # Still a number!
dtype: object
I have done some snooping around in the pandas read_csv documentation, but haven't found a way to declare column types for values that are stored as strings with commas and dollar signs in the csv. My ultimate goal is to be able to do arithmetic operations on the values in these columns.
Any thoughts?
You could use vectorized string methods to parse those columns after the call to read_csv:
import pandas as pd
import decimal
D = decimal.Decimal
data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

# strip thousands separators, then cast to int
for col in ('active_accounts', 'transaction_count'):
    data[col] = data[col].str.replace(',', '').astype(int)

# keep digits, sign and decimal point only; map(D) builds a Decimal per value
data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '', regex=True)
                              .map(D))
print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object

(transaction_amount still shows up as object because NumPy has no native Decimal dtype; the column holds Python decimal.Decimal objects.)
print(data)
yields
date_begin_date_end name name_code active_accounts \
0 1/1/2008 1/31/2008 Name_1 1001 123456
1 2/1/2008 2/29/2008 Name_1 1001 43210
2 3/1/2008 3/31/2008 Name_1 1001 485079
3 12/1/2008 12/31/2008 Name_1 1001 87543
4 1/1/2008 1/31/2008 Name_2 1002 268456
5 2/1/2008 2/29/2008 Name_2 1002 53210
transaction_amount transaction_count
0 7890123.45 67890
1 987654.32 109876
2 1265789433.98 777888
3 432098987 87987
4 890123.45 97890
5 987654.32 109876
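Once those columns are numeric, the arithmetic you're after works directly. A quick sketch (the particular aggregations are just illustrations, not anything from your post):

totals = data.groupby('name')[['active_accounts', 'transaction_count']].sum()
print(totals)                                               # yearly totals per name
print(data['transaction_count'] / data['active_accounts'])  # transactions per account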
PS. read_csv does have a converters parameter with which you could supply a function to parse the problematic columns. These functions are called once per string, so if you have a lot of rows that could require a lot of Python function calls. Handling the columns with vectorized string methods, as shown above, should be far quicker.
import pandas as pd
import re
import decimal
D = decimal.Decimal
def make_parser(cls):
    def parse_commas(text):
        # keep only digits, sign and decimal point, then convert
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True,
                   converters={4: to_int, 5: to_decimal, 6: to_int})
print(data)
yields
date_begin_date_end name name_code active_accounts \
0 1/1/2008 1/31/2008 Name_1 1001 123456
1 2/1/2008 2/29/2008 Name_1 1001 43210
2 3/1/2008 3/31/2008 Name_1 1001 485079
3 12/1/2008 12/31/2008 Name_1 1001 87543
4 1/1/2008 1/31/2008 Name_2 1002 268456
5 2/1/2008 2/29/2008 Name_2 1002 53210
transaction_amount transaction_count
0 7890123.45 67890
1 987654.32 109876
2 1265789433.98 777888
3 432098987 87987
4 890123.45 97890
5 987654.32 109876
and the values in the transaction_amount column are decimal.Decimals:
In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')
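Since Decimal supports exact arithmetic, sums on that column keep exact cents. A quick sketch (note that mixing Decimals with plain floats raises a TypeError, so divide by a Decimal):

total = data['transaction_amount'].sum()
print(total)             # a decimal.Decimal
print(total / D('12'))   # e.g. a monthly average; use a Decimal divisor, not a float

As an aside, if you only needed the comma-grouped columns, read_csv's thousands=',' argument would parse those as integers at read time, though it won't strip the dollar signs from transaction_amount.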