Right now I'm having trouble on how to process my data and transform it into a dataframe. Basically what I'm trying to do is to read the data first
data = pd.read_csv(querylog, sep=" ", header=None)
then group it
query_group = data.groupby('Query')
ip_group = data.groupby('IP')
and lastly create a blank dataframe to map their values
df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups)))
index = 0
for name, group in ip_group:
df.set_value(index, 'IP', name)
index += 1
df = df.set_index('IP')
for index, row in data.iterrows():
df.set_value(row['IP'], row['Query'], 1)
print(index)
df = df.fillna(0)
So my problem is that the ip_group can go up to a size of 6000 and query_group up to 400000 which would result in making a very big blank dataframe that my memory cannot handle. Can anyone help me on how to solve this issue? Any help is appreciated.
Sample dataframe of the data would look like this
data = pd.DataFrame( { "Query" : ["google.com", "youtube.com", "facebook.com"],
"IP" : ["192.168.0.104", "192.168.0.103","192.168.0.104"] } )
and my expected output would look like this
google.com youtube.com facebook.com
IP
192.168.0.104 1 0 1
192.168.0.103 0 1 0
IIUC you can use get_dummies
, but without data is problematic find the best solution:
df = pd.get_dummies(data.set_index('IP')['Query'])
print df.groupby(df.index).sum()
Sample:
import pandas as pd
data = pd.DataFrame( { "Query" : ["a", "b", "c", "d", "a" , "b"],
"IP" : [1,5,4,8,3,4] } )
print data
IP Query
0 1 a
1 5 b
2 4 c
3 8 d
4 3 a
5 4 b
#set index from column data
data = data.set_index('IP')
#get dummies from column Query
df = pd.get_dummies(data['Query'])
print df
a b c d
IP
1 1 0 0 0
5 0 1 0 0
4 0 0 1 0
8 0 0 0 1
3 1 0 0 0
4 0 1 0 0
#groupby by index and sum columns
print df.groupby(df.index).sum()
a b c d
IP
1 1 0 0 0
3 1 0 0 0
4 0 1 1 0
5 0 1 0 0
8 0 0 0 1
Try convert to int8
by astype
for savings 3 times less memory:
print pd.get_dummies(data['Query']).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com 3 non-null float64
google.com 3 non-null float64
youtube.com 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
print pd.get_dummies(data['Query']).astype(np.int8).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com 3 non-null int8
google.com 3 non-null int8
youtube.com 3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes
print pd.get_dummies(data['Query'], sparse=True).info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com 3 non-null float64
google.com 3 non-null float64
youtube.com 3 non-null float64