Python Pandas Decimal Mark EU to US

Posted 2019-07-10 02:22

Question:

I read the posts about EU-to-US decimal mark conversion, and they helped a lot, but I still feel I need some help from the experts. My data comes from an ERP system with numbers in a format like "1'000'000,32", and I'd simply like to convert it to something like "1000000.32" for further processing in Pandas.

My current solution for obtaining the US format from the EU one looks like this:

...
# read_csv and merge, clean .. different CSV files
# result = merge(some_DataFrame_EU_format, ...)
...
result.to_csv(path, sep=';')
result = read_csv(path, sep=';',
                  converters={'column_name': lambda x: float(x.replace('.', '').replace(',', '.'))})
...
result.to_csv(path, sep=';')

I had the feeling this was a slow way to change the ',' to '.' because of the read_csv/to_csv round trip (and the disk), so I wanted to try the .replace method directly on the DataFrame to save some processing time.

My initial attempt was something like the code below (which I read elsewhere on this forum):

result['column_name'] = result['column_name'].replace('.', '')
result['column_name'] = result['column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

That did not work and resulted in an 'invalid literal for float' error.
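(For context: Series.replace matches whole cell values by default, so the separator characters are never found; passing regex=True switches it to substring substitution. A minimal sketch of that variant, using the apostrophe-grouped values quoted at the top of the question and the same hypothetical 'column_name' column:)

import pandas as pd

# Hypothetical sample in the EU format quoted in the question
result = pd.DataFrame({'column_name': ["1'000'000,32", "2'500,00"]})

# By default Series.replace matches whole cell values; regex=True makes it
# substitute substrings: drop the grouping mark, then turn ',' into '.'.
result['column_name'] = (
    result['column_name']
    .replace({"'": "", ",": "."}, regex=True)
    .astype(float)
)

print(result['column_name'])  # 1000000.32 and 2500.0 as float64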

So I moved to:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '')
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

The above worked, but to my surprise it turned out to be about three times slower than the read_csv/converters solution. Grouping the replaces, as below, helped somewhat:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '').replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

I have read the fine manuals and I know that read_csv is optimized, but I really did not expect a read/write/read/write cycle to be three times faster than a for loop!

Do you think it might be worth working on this further? Any suggestions? Or is it better to stay with the repeated write/read/write approach?

My file is about 30k lines x 150 columns; the read/write/read(convert)/write takes about 18 seconds, while the .ix loop takes over 52 seconds in its first form (and 32 seconds with the grouped .replace).

What is your experience converting DataFrames from EU to US format? Any suggested methods to improve this? What about 'mapping' or 'locale'? Might they be faster?
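(On the 'mapping' idea: a Series.map with a plain Python converter avoids the per-row .ix indexing and the disk round trip entirely. A minimal sketch, assuming the apostrophe-grouped format quoted above and a hypothetical 'column_name' column; the standard library's locale.atof could replace the converter, but only if a system locale matching this exact ERP format is installed:)

import pandas as pd

# Hypothetical sample in the ERP's EU format
result = pd.DataFrame({'column_name': ["1'000'000,32", "2'500,00"]})

def to_us(x):
    # Drop the grouping mark, turn the decimal comma into a point
    return float(x.replace("'", "").replace(",", "."))

# map() applies the converter element-wise, with no per-row .ix indexing
result['column_name'] = result['column_name'].map(to_us)

print(result['column_name'])  # 1000000.32 and 2500.0 as float64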

Thank you so much, Fabio.

P.S. I realize I was 'verbose' and not 'pythonic' enough. Sorry, I'm still learning :-)

Answer 1:

In fact there are thousands and decimal parameters in read_csv (see the pandas read_csv documentation), but unfortunately the two don't yet work together (see the related GitHub issue).
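(If your pandas version does accept both together, the conversion can happen at parse time. A sketch, assuming the apostrophe grouping from the question; the file name and separator are assumptions:)

import pandas as pd

# thousands strips the grouping character, decimal tells the parser which
# character marks the decimal point (supported together in recent pandas).
result = pd.read_csv('erp_export.csv', sep=';',
                     thousands="'", decimal=',')

print(result.dtypes)  # EU-formatted columns should come back as float64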



Answer 2:

Thank you so much for your great suggestions and help, Andy and Jeff! You helped a lot :-)

First I went back to the original data with an editor. In some files I saw that the system had probably applied some kind of automatic conversion, so I downloaded the same dataset again with the 'unconverted' option and avoided opening/saving the files with Excel or other programs; I used text editors only. At that point I made the read_csv lighter by dropping the converters and grouped the replaces as Jeff suggested.

The real case is a bit longer than the example provided and includes some stripping (spaces), column deletion, string concatenation, renaming/replacing, etc. The decimal marks are replaced in three columns: USD Sales, Qty, and the USD_EUR exchange rate. From these, EUR sales and EUR unit prices are calculated. In the initial file there is also, for some other reason, a '-' in front of the exchange rate that has to be fixed ("-", ""). The result is:

result = pd.read_csv(path, sep=';', thousands='.')
col = ['qty', 'sales', 'rate']
result[col] = result[col].apply(lambda x: x.str.replace(".", "").str.replace(",", "."))
result['sales_localcurrency'] = abs(result['sales'].astype(float) / result['rate'].astype(float))
result['sales_localcurrency_unit'] = result['sales_localcurrency'] / result['qty'].astype(float)
result.to_csv(path, sep=';')
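(The '-' in front of the exchange rate mentioned above is not shown in the snippet; one possible way to handle it, sketched here with a hypothetical 'rate' column and made-up values:)

import pandas as pd

# Hypothetical exchange-rate column with the stray leading '-'
result = pd.DataFrame({'rate': ['-1,0852', '-1,0799']})

# Drop the '-' and the decimal comma, then cast once to float
result['rate'] = (result['rate']
                  .str.replace('-', '')
                  .str.replace(',', '.')
                  .astype(float))

print(result['rate'])  # 1.0852 and 1.0799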

The 30'000 x 150 DataFrame is now processed in less than 15 seconds :-) :-) including all the other things I did not detail here (stripping, deleting, concatenating, ...). All the read/write/read/write steps were removed from the code by skipping the 'converters' during read_csv.

Thank you for your help :-)!

Bye Bye. Fabio.



Answer 3:

Create a frame with a value like the one you specified and write it to a CSV:

In [2]: df = DataFrame("100'100,32",index=range(30000),columns=range(150))

In [3]: df.iloc[0:5,0:5]
Out[3]: 
            0           1           2           3           4
0  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
1  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
2  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
3  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
4  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32

In [4]: df.to_csv('test.csv')

Read it in, no converters

In [5]: df = read_csv('../test.csv',index_col=0)

In [6]: %timeit read_csv('../test.csv',index_col=0)
1 loops, best of 3: 1e+03 ms per loop

In [7]: df
Out[7]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Columns: 150 entries, 0 to 149
dtypes: object(150)

In [8]: %timeit read_csv('../test.csv',index_col=0)
1 loops, best of 3: 1e+03 ms per loop

Do the string substitution column by column. You could restrict it to certain columns if you wish, by doing df[[ list of columns ]].apply(.....); a column-restricted sketch is included after the timings below.

In [9]: df.apply(lambda x: x.str.replace("'","").str.replace(",",".")).astype(float)
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Columns: 150 entries, 0 to 149
dtypes: float64(150)

In [10]: %timeit df.apply(lambda x: x.str.replace("'","").str.replace(",",".")).astype(float)
1 loops, best of 3: 4.77 s per loop

Total time a shade under 6s
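(A column-restricted version of the same idea might look like this; the column names and sample values here are hypothetical, not part of the original answer:)

import pandas as pd

df = pd.DataFrame({'qty': ["1'200", "3'400"],
                   'sales': ["100'100,32", "200'200,64"],
                   'comment': ['keep', 'as text']})

# Convert only the EU-formatted columns; 'comment' is left untouched
cols = ['qty', 'sales']
df[cols] = (df[cols]
            .apply(lambda s: s.str.replace("'", "").str.replace(",", "."))
            .astype(float))

print(df)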

FYI, there is a thousands separator option, but not a decimal one... hmm, that would be much faster....