I have a very large dataset with more than 10 million records, which is difficult to use directly with any algorithm, so I am trying to restructure it. At the moment there are many records per customer, and I want to convert it to one record per customer.
Here is a sample of mock-up data:
d1 <- structure(
list(userid = c(64455670203, 64455670203, 64455670203, 64455670203, 64455670203, 64455670204, 64455670204, 64455670204, 64455670204, 64455670204),
day = c(1L, 1L, 2L, 3L, 3L, 2L, 2L, 3L, 4L, 4L),
channel = structure(
c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L),
.Label = c("dsp", "osr"),
class = "factor"
)
),
.Names = c("userid", "day", "channel"),
class = "data.frame",
row.names = c(NA, -10L)
)
I am planning to convert the data shown above into the following format:
d2 <- structure(
list(csm_id = c(64455670203, 64455670204),
dsp1 = c(2L, 0L),
dsp2 = c(1L, 1L),
dsp3 = c(1L, 0L),
dsp4 = 0:1,
osr1 = c(0L, 0L),
osr2 = 0:1,
osr3 = c(1L, 1L),
osr4 = 0:1
),
.Names = c("csm_id", "dsp1", "dsp2", "dsp3", "dsp4", "osr1", "osr2", "osr3", "osr4"),
class = "data.frame",
row.names = c(NA, -2L)
)
What I am trying to do is this: first find the distinct channels and distinct days in the dataset, then concatenate those two (channel followed by day) and use the resulting strings as the column names of the new dataset, as sketched below.
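For example, the column names I have in mind can be built from d1 like this (just a sketch of the naming step; the variable names are only for illustration):

channels <- levels(d1$channel)      # "dsp" "osr"
days     <- sort(unique(d1$day))    # 1 2 3 4

# every channel concatenated with every day gives the new column names
new_cols <- paste0(rep(channels, each = length(days)), days)
# "dsp1" "dsp2" "dsp3" "dsp4" "osr1" "osr2" "osr3" "osr4"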
I wrote simple R code along these lines, but it is really time consuming. Can anyone help me do this more efficiently?
How can I do the same operation in Python as well?
Thanks in advance.
Try a single cross-tabulation over the whole dataset instead of looping per customer.
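A sketch with base R's table(), assuming d1 as defined in the question (one possible approach; the names all_keys and key are only illustrative):

# all channel-day combinations, so that pairs that never occur
# (e.g. "osr1") still get a zero-filled column
all_keys <- paste0(rep(levels(d1$channel), each = length(unique(d1$day))),
                   sort(unique(d1$day)))

# one key per record, counted per user in a single pass
key <- factor(paste0(d1$channel, d1$day), levels = all_keys)
tab <- table(d1$userid, key)

# back to a data.frame with the user id as a regular column
d2 <- cbind(csm_id = as.numeric(rownames(tab)), as.data.frame.matrix(tab))
rownames(d2) <- NULL

On 10 million rows, data.table's dcast() does the same counting and is usually faster:

library(data.table)
setDT(d1)
d1[, chan_day := factor(paste0(channel, day), levels = all_keys)]
d2 <- dcast(d1, userid ~ chan_day, fun.aggregate = length,
            value.var = "day", drop = FALSE)
setnames(d2, "userid", "csm_id")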