Importing financial data into Python Pandas using

2019-02-16 19:33发布

问题:

I have a .csv with the following structure:

date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc

I am trying to read them into into a pandas dataframe by using the following code:

import pandas as pd

data = pd.read_csv('my_awesome_csv.csv'),parse_dates=[[0,1]],
                   infer_datetime_format=True)

This works just fine except that I would like to control the data types in each column. When I run the following code in the interpreter I discover that the numbers in quotes do not get recognized as numbers, either dollars or otherwise.

In [10]: data.dtypes
Out[10]: 
date_begin_date_end       object
name                      object
name_code                  int64
active_accounts           object  # Problem, I want this to be a number
transaction_amount        object  # Ditto, I want this to be a number (it's a dollar amount)
transaction_count         object  # Still a number!
dtype: object

I have done some snooping around in the Pandas csv documentation but haven't found what I'm looking for about declaring types that are amounts when they are saved as strings with commas and dollar signs in the csv. My ultimate goal here is to be able to do some arithmetic operations on the values in these columns.

Any thoughts?

回答1:

You could use vectorized string methods to parse those columns after the call to read_csv:

import pandas as pd
import decimal
D = decimal.Decimal

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

for col in ('active_accounts', 'transaction_count'):
    data[col] = data[col].str.replace(r',', '').astype(int)

data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '').astype(D))


print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876  

PS. read_csv does have a converters parameter with which you could supply a function to parse the problematic columns. These functions are called once for each string. If you have a lot of rows, that could be require a lot of Python function calls. Handling the columns using vectorized string methods, as shown above should be far quicker.

import pandas as pd
import re
import decimal
D = decimal.Decimal

def make_parser(cls):
    def parse_commas(text):
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True
                   , converters={4: to_int, 5: to_decimal, 6: to_int})

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876  

and the values in the transaction_amount column are decimal.Decimals:

In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')