pandas pivot table of sales

I have a list like below:

    saleid                              upc
0   155_02127453_20090616_135212_0021   02317639000000
1   155_02127453_20090616_135212_0021   00000000000888
2   155_01605733_20090616_135221_0016   00264850000000
3   155_01072401_20090616_135224_0010   02316877000000
4   155_01072401_20090616_135224_0010   05051969277205

It represents one customer (saleid) and the items he/she got (upc of the item)

What I want is to pivot this table to a form like below:

                                   02317639000000 00000000000888 00264850000000 02316877000000
155_02127453_20090616_135212_0021               1              1              0              0
155_01605733_20090616_135221_0016               0              0              1              0
155_01072401_20090616_135224_0010               0              0              0              0

So, columns are unique UPCs and rows are unique SALEIDs.

i read it like this:

tbl = pd.read_csv('tbl_sale_items.csv',sep=';',dtype={'saleid': np.str, 'upc': np.str})
tbl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid    object
upc       object
dtypes: object(2)
memory usage: 283.4+ MB

I have done some steps but not the correct ones!

tbl.pivot_table(columns=['upc'],aggfunc=pd.Series.nunique)
upc 00000000000000  00000000000109  00000000000116  00000000000123  00000000000130  00000000000147  00000000000154  00000000000161  00000000000178  00000000000185  ...
saleid  44950   287 26180   4881    1839    623 3347    7

EDIT: Im using the solution variation below:

chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv',sep=';',dtype={'saleid': np.str, 'upc': np.str}, chunksize=chunksize):
    print(f)
    t = pd.crosstab(chunk.saleid, chunk.upc)
    t.head(3)
    t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',header=True,sep=';',compression='bz2')
    f = f+1

the original file is extremely big to fit to memory after conversion. The above solution has the problem on not having all the columns on all the files as I'm reading chunks from the original file.

Question 2: is there a way to force all chunks to have the same columns?

标签： python csv pandas numpy

2条回答

聊天终结者

2楼-- · 2019-06-27 03:35

Option 1

df.groupby(['saleid', 'upc']).size().unstack(fill_value=0)

Option 2

pd.crosstab(df.saleid, df.upc)

Setup

from StringIO import StringIO
import pandas as pd

text = """    saleid                              upc
0   155_02127453_20090616_135212_0021   02317639000000
1   155_02127453_20090616_135212_0021   00000000000888
2   155_01605733_20090616_135221_0016   00264850000000
3   155_01072401_20090616_135224_0010   02316877000000
4   155_01072401_20090616_135224_0010   05051969277205"""

df = pd.read_csv(StringIO(text), delim_whitespace=True, dtype=str)
df

0人赞添加讨论(0) 举报

做个烂人

3楼-- · 2019-06-27 03:40

simple pivot_table() solution:

In [16]: df.pivot_table(index='saleid', columns='upc', aggfunc='size', fill_value=0)
Out[16]:
upc                                00000000000888  00264850000000  02316877000000  02317639000000  05051969277205
saleid
155_01072401_20090616_135224_0010               0               0               1               0               1
155_01605733_20090616_135221_0016               0               1               0               0               0
155_02127453_20090616_135212_0021               1               0               0               1               0

0人赞添加讨论(0) 举报

pandas pivot table of sales

Setup

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间