Importing and analysing non-rectangular .csv files

2019-05-10 04:20发布

问题:

I'm moving to R from Mathematica where I don't need to anticipate data structures during importation, in particular I do not need to anticipate the rectangularness of my data before import.

I have many files .csv files formatted as follows:

tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham

Rows have differing lengths and will only contain strings.

In R, how should I approach this problem?

What Have You Tried?

I've tried with read.table:

dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6   1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...

I interpret this from the documentation to be a singular column with each list of ingredients as a distinct row. I may extract the first three rows as follows, each row is of class factor but appears to contain more data than what I expect:

dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]

This is as far as I've pursued this problem for now, I would appreciate advice on suitability of read.table for this data structure.

My goal is to group the data by the first element of each row, and analyse the difference between each type of recipe. In case it helps influence data structure advice, in Mathematica I would do the following:

dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]

Answer Discussion

@G.Grothendieck has provided a solution in using read.table and subsequent processing using the reshape2 package - this seems tremendously useful and I'll investigate later. General advice here solved my issue, hence accept.

@MrFlick's suggestion of using the tm package was useful for later analysis using DataframeSource

回答1:

read.table Try read.table with fill=TRUE:

d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)

giving:

> d1
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon                            
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper                            
4      okay olive_oil     onion potato black_pepper        
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham   

read.table with NAs

or to fill the empty cells with NA values add na.strings = "" :

d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

giving:

> d2
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon   <NA>         <NA>    <NA>
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper   <NA>         <NA>    <NA>
4      okay olive_oil     onion potato black_pepper    <NA>
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham    <NA>

long form

If you want it in long form:

library(reshape2)
long <- na.omit(melt(d2, id.var = c("id", "V1"))[-3])
long <- long[order(long$id), ]

giving:

> long
   id        V1        value
1   1     tasty      chicken
7   1     tasty     cinnamon
2   2 not_tasty       butter
8   2 not_tasty       pepper
14  2 not_tasty        onion
20  2 not_tasty     cardamom
26  2 not_tasty      cayenne
3   3     tasty    olive_oil
9   3     tasty       pepper
4   4      okay    olive_oil
10  4      okay        onion
16  4      okay       potato
22  4      okay black_pepper
5   5 not_tasty       tomato
11  5 not_tasty    fenugreek
17  5 not_tasty       pepper
23  5 not_tasty        onion
29  5 not_tasty       potato
6   6     tasty       butter
12  6     tasty       cheese
18  6     tasty        wheat
24  6     tasty          ham

wide form 0/1 binary variables

To represent the variable portion as 0/1 binary variables try this:

wide <- cast(id + V1 ~ value, data = long)
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])

giving this:

list in a data frame

A different representation would be the following list in a data frame so that ag$value is a list of character vectors:

ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]

giving:

> ag
  id        V1                                    value
4  1     tasty                        chicken, cinnamon
1  2 not_tasty butter, pepper, onion, cardamom, cayenne
5  3     tasty                        olive_oil, pepper
3  4      okay   olive_oil, onion, potato, black_pepper
2  5 not_tasty tomato, fenugreek, pepper, onion, potato
6  6     tasty               butter, cheese, wheat, ham

> str(ag)
'data.frame':   6 obs. of  3 variables:
 $ id   : int  1 2 3 4 5 6
 $ V1   : chr  "tasty" "not_tasty" "tasty" "okay" ...
 $ value:List of 6
  ..$ 15: chr  "chicken" "cinnamon"
  ..$ 1 : chr  "butter" "pepper" "onion" "cardamom" ...
  ..$ 17: chr  "olive_oil" "pepper"
  ..$ 11: chr  "olive_oil" "onion" "potato" "black_pepper"
  ..$ 6 : chr  "tomato" "fenugreek" "pepper" "onion" ...
  ..$ 19: chr  "butter" "cheese" "wheat" "ham"


回答2:

I don't think shoving your data into a data.frame or data.table is going to help you much since both those forms generally assume rectangular data. If you just want a list of character vectors, you can read them in with.

strsplit(readLines("data.csv"), ",")

It all depends on exactly what you are going to do with the data after you read it in. If you plan to use existing function, what input do they expect?

Sounds like you may be tracking terms in each of these recipes. Maybe the appropriate data structure would be a "corpus" from the tm package for text mining.



标签: r import