Like many, I am new to R. I have a large data set (500M+ rows) which I have read with fread into a data.table, logStats, which has data like the following:
head(logStats,15)
time pid mean
1: 2014-03-10 00:00:00 998 3.570000
2: 2014-03-10 00:00:00 11 4.090000
3: 2014-03-10 00:00:00 345 3.380000
4: 2014-03-10 00:05:00 998 4.866667
5: 2014-03-10 00:05:00 11 3.677778
6: 2014-03-10 00:05:00 345 4.487500
7: 2014-03-10 00:10:00 345 4.833333
8: 2014-03-10 00:10:00 998 4.333333
9: 2014-03-10 00:10:00 11 6.977778
10: 2014-03-10 00:15:00 345 3.900000
11: 2014-03-10 00:15:00 998 3.200000
12: 2014-03-10 00:15:00 11 6.030000
13: 2014-03-10 00:20:00 998 4.550000
14: 2014-03-10 00:20:00 11 4.030000
15: 2014-03-10 00:20:00 345 6.060000
There is a second, very small data.table (360 rows) with two columns, which decodes a 'pid' value into something a bit more friendly to read. The 'pid' value can be either numeric or character.
For example:
pidLookupTable <- data.table(pid = c(998, 11, 345), pidName = c("Apple", "Banana", "Cinnamon"))
which produces:
pid pidName
1: 998 Apple
2: 11 Banana
3: 345 Cinnamon
I want an expression that adds a column to the data.table logStats containing the pidName for each row's pid.
I should get something like:
time pid mean pidName
1: 2014-03-10 00:00:00 998 3.570000 Apple
2: 2014-03-10 00:00:00 11 4.090000 Banana
3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
4: 2014-03-10 00:05:00 998 4.866667 Apple
5: 2014-03-10 00:05:00 11 3.677778 Banana
6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
8: 2014-03-10 00:10:00 998 4.333333 Apple
9: 2014-03-10 00:10:00 11 6.977778 Banana
10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
11: 2014-03-10 00:15:00 998 3.200000 Apple
12: 2014-03-10 00:15:00 11 6.030000 Banana
13: 2014-03-10 00:20:00 998 4.550000 Apple
14: 2014-03-10 00:20:00 11 4.030000 Banana
15: 2014-03-10 00:20:00 345 6.060000 Cinnamon
I wrote a function:
pidNameLookup <- function(x) {
  # return the pidName whose pid equals x
  return(pidLookupTable[pidLookupTable$pid == x, pidName])
}
and then ran:
logStats[,pidName:=pidNameLookup(pid)]
But this only converts the first 3 rows and puts NA for the rest of the values:
logStats[1:1000]
date time pid value timestamp mean pidName
1: 10-03-2014 00:00:12 998 5.5 2014-03-10 00:00:12 3.57 Apple
2: 10-03-2014 00:00:17 11 2.1 2014-03-10 00:00:17 4.09 Banana
3: 10-03-2014 00:00:22 345 5.7 2014-03-10 00:00:22 3.38 Cinnamon
4: 10-03-2014 00:00:47 998 1.0 2014-03-10 00:00:47 3.57 NA
5: 10-03-2014 00:00:55 11 0.3 2014-03-10 00:00:55 4.09 NA
---
996: 10-03-2014 02:49:37 345 0.7 2014-03-10 02:49:37 5.30 NA
997: 10-03-2014 02:50:01 998 9.9 2014-03-10 02:50:01 5.30 NA
998: 10-03-2014 02:50:08 11 7.0 2014-03-10 02:50:08 7.00 NA
999: 10-03-2014 02:50:18 345 2.4 2014-03-10 02:50:18 2.40 NA
1000: 10-03-2014 02:50:48 998 0.7 2014-03-10 02:50:48 5.30 NA
and gives me the warning message:
Warning message:
In pidLookupTable$pid == x
longer object length is not a multiple of shorter object length
The warning message and the incorrect result mean that I am doing something completely wrong.
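As far as I can tell, the warning comes from == comparing the two pid vectors position by position and recycling the shorter one, rather than looking each value up. Here is a minimal base R sketch of that behaviour; the names lookupPid and logPid are made up for illustration, and the values are taken from the example rows above:

```r
# Hypothetical vectors standing in for the two pid columns.
lookupPid <- c(998, 11, 345)             # pid column of the lookup table
logPid    <- c(998, 11, 345, 345, 998)   # five pid values in log order

# `==` works element by element and recycles the shorter vector, so
# this compares 998, 11, 345, 998, 11 against logPid position by position.
# 5 is not a multiple of 3, which is what triggers the warning.
mask <- suppressWarnings(lookupPid == logPid)

# match() instead looks up every log value in the whole lookup vector
# and returns the row of the lookup table where it was found.
position <- match(logPid, lookupPid)
```

So mask only says whether the values happen to line up at the same position, while position gives an index that could be used to pull the name for each row.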
Help!! This is driving me mental
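For reference, this is a rough sketch of the kind of expression I am after, written as a data.table update join. It is only a sketch under assumptions: it uses a tiny made-up sample instead of the real 500M rows, it assumes a data.table version that supports the on= argument, and it assumes pid has the same type (numeric here) in both tables. The column pidName2 is a made-up name for the match() variant.

```r
library(data.table)

# Tiny stand-ins for the real tables (values copied from the example).
logStats <- data.table(
  pid  = c(998, 11, 345, 345, 998),
  mean = c(3.57, 4.09, 3.38, 4.833333, 4.333333)
)
pidLookupTable <- data.table(
  pid     = c(998, 11, 345),
  pidName = c("Apple", "Banana", "Cinnamon")
)

# Update join: match rows on pid and copy pidName from the lookup
# table into logStats by reference; the row order of logStats is kept.
logStats[pidLookupTable, on = "pid", pidName := i.pidName]

# Alternative using match(): index the lookup names by position.
logStats[, pidName2 := pidLookupTable$pidName[match(pid, pidLookupTable$pid)]]
```

Any pid that is missing from the lookup table should end up with NA in the new column in both variants.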