Handling a very big dataframe

Posted 2019-08-07 14:21

Question:

Right now I'm having trouble working out how to process my data and transform it into a dataframe. Basically, what I'm trying to do is read the data first

# column names assumed from the sample below; header=None alone would leave integer column names
data = pd.read_csv(querylog, sep=" ", header=None, names=["IP", "Query"])

then group it

query_group = data.groupby('Query')
ip_group = data.groupby('IP')

and lastly create a blank dataframe to map their values

df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups)))

index = 0
for name, group in ip_group:
    # set_value was removed in pandas 1.0; .at is the scalar equivalent
    df.at[index, 'IP'] = name
    index += 1
df = df.set_index('IP')

for index, row in data.iterrows():
    df.at[row['IP'], row['Query']] = 1
    print(index)
df = df.fillna(0)

So my problem is that ip_group can reach a size of 6,000 and query_group a size of 400,000, which makes the blank dataframe far too big for my memory to handle. Can anyone help me solve this issue? Any help is appreciated.
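For scale, a quick back-of-the-envelope estimate (using the sizes stated above, and assuming the default float64 dtype at 8 bytes per cell) shows why the dense frame cannot fit:

print(6000 * 400000 * 8 / 1024**3)  # ~17.9 GiB, before any pandas overhead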

A sample dataframe of the data would look like this

data = pd.DataFrame({"Query": ["google.com", "youtube.com", "facebook.com"],
                     "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"]})

and my expected output would look like this

               google.com  youtube.com  facebook.com
IP
192.168.0.104           1            0             1
192.168.0.103           0            1             0

Answer 1:

IIUC you can use get_dummies, though without the real data it is hard to pick the best solution:

df = pd.get_dummies(data.set_index('IP')['Query'])
print(df.groupby(df.index).sum())

Sample:

import pandas as pd

data = pd.DataFrame({"Query": ["a", "b", "c", "d", "a", "b"],
                     "IP": [1, 5, 4, 8, 3, 4]})
print(data)
   IP Query
0   1     a
1   5     b
2   4     c
3   8     d
4   3     a
5   4     b
# set the index from the IP column
data = data.set_index('IP')

# get dummies from the Query column
df = pd.get_dummies(data['Query'])
print(df)
    a  b  c  d
IP            
1   1  0  0  0
5   0  1  0  0
4   0  0  1  0
8   0  0  0  1
3   1  0  0  0
4   0  1  0  0

# group by the index and sum the columns
print(df.groupby(df.index).sum())
    a  b  c  d
IP            
1   1  0  0  0
3   1  0  0  0
4   0  1  1  0
5   0  1  0  0
8   0  0  0  1
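One caveat worth adding: if the same IP repeats the same query, sum() produces counts greater than 1. If a pure 0/1 indicator is wanted, max() collapses duplicates instead of counting them:

print(df.groupby(df.index).max())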

Try converting to int8 with astype, which here takes about a third of the memory (the info() calls below are run on the question's sample data):

pd.get_dummies(data['Query']).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes

import numpy as np

pd.get_dummies(data['Query']).astype(np.int8).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null int8
google.com      3 non-null int8
youtube.com     3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes

Or pass sparse=True so only the non-zero cells are stored (the older pandas used here returns a SparseDataFrame, as the output shows):

pd.get_dummies(data['Query'], sparse=True).info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
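
Putting the pieces together, here is a minimal end-to-end sketch on the question's sample data (assuming pandas 0.23 or later, where get_dummies accepts a dtype argument; for the full 6000 x 400000 case, adding sparse=True as above keeps the zeros out of memory):

import numpy as np
import pandas as pd

data = pd.DataFrame({"Query": ["google.com", "youtube.com", "facebook.com"],
                     "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"]})

# one int8 indicator column per distinct query
dummies = pd.get_dummies(data.set_index('IP')['Query'], dtype=np.int8)

# collapse rows sharing an IP; max() keeps a 0/1 result even with repeated queries
print(dummies.groupby(level=0).max())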