I have a list like below:
saleid upc
0 155_02127453_20090616_135212_0021 02317639000000
1 155_02127453_20090616_135212_0021 00000000000888
2 155_01605733_20090616_135221_0016 00264850000000
3 155_01072401_20090616_135224_0010 02316877000000
4 155_01072401_20090616_135224_0010 05051969277205
It represents one customer (saleid) and the items he/she got (upc of the item)
What I want is to pivot this table to a form like below:
02317639000000 00000000000888 00264850000000 02316877000000
155_02127453_20090616_135212_0021 1 1 0 0
155_01605733_20090616_135221_0016 0 0 1 0
155_01072401_20090616_135224_0010 0 0 0 0
So, columns are unique UPCs and rows are unique SALEIDs.
i read it like this:
tbl = pd.read_csv('tbl_sale_items.csv',sep=';',dtype={'saleid': np.str, 'upc': np.str})
tbl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid object
upc object
dtypes: object(2)
memory usage: 283.4+ MB
I have done some steps but not the correct ones!
tbl.pivot_table(columns=['upc'],aggfunc=pd.Series.nunique)
upc 00000000000000 00000000000109 00000000000116 00000000000123 00000000000130 00000000000147 00000000000154 00000000000161 00000000000178 00000000000185 ...
saleid 44950 287 26180 4881 1839 623 3347 7
EDIT: Im using the solution variation below:
chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv',sep=';',dtype={'saleid': np.str, 'upc': np.str}, chunksize=chunksize):
print(f)
t = pd.crosstab(chunk.saleid, chunk.upc)
t.head(3)
t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',header=True,sep=';',compression='bz2')
f = f+1
the original file is extremely big to fit to memory after conversion. The above solution has the problem on not having all the columns on all the files as I'm reading chunks from the original file.
Question 2: is there a way to force all chunks to have the same columns?
Option 1
Option 2
Setup
simple
pivot_table()
solution: