I'm moving to R from Mathematica where I don't need to anticipate data structures during importation, in particular I do not need to anticipate the rectangularness of my data before import.
I have many files .csv
files formatted as follows:
tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham
Rows have differing lengths and will only contain strings.
In R, how should I approach this problem?
What Have You Tried?
I've tried with read.table
:
dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6 1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...
I interpret this from the documentation to be a singular column with each list of ingredients as a distinct row. I may extract the first three rows as follows, each row is of class
factor
but appears to contain more data than what I expect:
dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]
This is as far as I've pursued this problem for now, I would appreciate advice on suitability of read.table
for this data structure.
My goal is to group the data by the first element of each row, and analyse the difference between each type of recipe. In case it helps influence data structure advice, in Mathematica I would do the following:
dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]
Answer Discussion
@G.Grothendieck has provided a solution in using read.table
and subsequent processing using the reshape2
package - this seems tremendously useful and I'll investigate later. General advice here solved my issue, hence accept.
@MrFlick's suggestion of using the tm
package was useful for later analysis using DataframeSource
read.table Try read.table
with fill=TRUE
:
d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)
giving:
> d1
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper
4 okay olive_oil onion potato black_pepper
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham
read.table with NAs
or to fill the empty cells with NA values add na.strings = ""
:
d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")
giving:
> d2
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon <NA> <NA> <NA>
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper <NA> <NA> <NA>
4 okay olive_oil onion potato black_pepper <NA>
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham <NA>
long form
If you want it in long form:
library(reshape2)
long <- na.omit(melt(d2, id.var = c("id", "V1"))[-3])
long <- long[order(long$id), ]
giving:
> long
id V1 value
1 1 tasty chicken
7 1 tasty cinnamon
2 2 not_tasty butter
8 2 not_tasty pepper
14 2 not_tasty onion
20 2 not_tasty cardamom
26 2 not_tasty cayenne
3 3 tasty olive_oil
9 3 tasty pepper
4 4 okay olive_oil
10 4 okay onion
16 4 okay potato
22 4 okay black_pepper
5 5 not_tasty tomato
11 5 not_tasty fenugreek
17 5 not_tasty pepper
23 5 not_tasty onion
29 5 not_tasty potato
6 6 tasty butter
12 6 tasty cheese
18 6 tasty wheat
24 6 tasty ham
wide form 0/1 binary variables
To represent the variable portion as 0/1 binary variables try this:
wide <- cast(id + V1 ~ value, data = long)
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])
giving this:
list in a data frame
A different representation would be the following list in a data frame so that ag$value
is a list of character vectors:
ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]
giving:
> ag
id V1 value
4 1 tasty chicken, cinnamon
1 2 not_tasty butter, pepper, onion, cardamom, cayenne
5 3 tasty olive_oil, pepper
3 4 okay olive_oil, onion, potato, black_pepper
2 5 not_tasty tomato, fenugreek, pepper, onion, potato
6 6 tasty butter, cheese, wheat, ham
> str(ag)
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ V1 : chr "tasty" "not_tasty" "tasty" "okay" ...
$ value:List of 6
..$ 15: chr "chicken" "cinnamon"
..$ 1 : chr "butter" "pepper" "onion" "cardamom" ...
..$ 17: chr "olive_oil" "pepper"
..$ 11: chr "olive_oil" "onion" "potato" "black_pepper"
..$ 6 : chr "tomato" "fenugreek" "pepper" "onion" ...
..$ 19: chr "butter" "cheese" "wheat" "ham"
I don't think shoving your data into a data.frame or data.table is going to help you much since both those forms generally assume rectangular data. If you just want a list of character vectors, you can read them in with.
strsplit(readLines("data.csv"), ",")
It all depends on exactly what you are going to do with the data after you read it in. If you plan to use existing function, what input do they expect?
Sounds like you may be tracking terms in each of these recipes. Maybe the appropriate data structure would be a "corpus" from the tm
package for text mining.