Question:
I have a large dataframe (14552 rows by 15 columns) containing billing data from 2001 to 2007. I have used sqlFetch to get 2008 data. In order to append the 2008 data to the data of the preceding 7 years one would do as follows
alltime <- rbind(alltime,all2008)
Unfortunately that generates
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
  invalid factor level, NAs generated
My guess is that there are some new patients whose names were not in the previous dataframe, so R does not know what level to give them. The same goes for previously unseen names in the referring doctor column.
The way R imports data and automatically works out what is numeric and what is not (and thereby makes it a factor) is wonderful, until you have to manipulate it further and then it becomes a pain. How do I overcome my problem elegantly?
Answer 1:
It could be caused by a mismatch of types between the two data.frames.
First of all, check the types (classes). For diagnostic purposes, do this:
new2old <- rbind( alltime, all2008 ) # this gives you a warning
old2new <- rbind( all2008, alltime ) # this should be without warning
cbind(
alltime = sapply( alltime, class),
all2008 = sapply( all2008, class),
new2old = sapply( new2old, class),
old2new = sapply( old2new, class)
)
I expect there to be a row that looks like this:
            alltime  all2008   new2old  old2new
...         ...      ...       ...      ...
some_column "factor" "numeric" "factor" "character"
...         ...      ...       ...      ...
If so, here is the explanation:
rbind doesn't check that the types match. If you read the rbind.data.frame code, you can see that the first argument determines the output types. If a column is a factor in the first data.frame, then the output column is a factor with levels unique(c(levels(x1),levels(x2))). But when that column isn't a factor in the second data.frame, levels(x2) is NULL, so the levels don't get extended.
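The mechanism described above can be reproduced on toy vectors. This is a minimal sketch with made-up values (old_col and new_col are hypothetical names), not the actual rbind.data.frame code:

```r
# Hypothetical columns: a factor from the old data, numeric from the new data
old_col <- factor(c("A1", "B2"))
new_col <- c(101, 102)

# levels() of a non-factor is NULL, so the combined level set gains nothing
levels(new_col)
combined_levels <- unique(c(levels(old_col), levels(new_col)))
combined_levels

# Assigning values outside the level set yields NA plus the
# "invalid factor level" warning
result <- factor(old_col, levels = combined_levels)
result[3:4] <- new_col
result
```

The last two elements of result come out as NA, which is the silent data loss the answer warns about.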
It means that your output data are wrong! There are NAs instead of the true values.
I suppose that either:
- you created your old data with another R/RODBC version, so the types were determined differently (different settings, maybe the decimal separator), or
- there are NULLs or some unusual data in the problematic column, e.g. someone changed the column in the database.
Solution:
Find the wrong column, find the reason why it's wrong, and fix it. Eliminate the cause, not the symptoms.
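One way to locate the offending column is to compare the column classes programmatically. The sketch below uses small made-up stand-ins for alltime and all2008 (the column names and values are hypothetical), and the helper first_class is not from the original answer:

```r
# Toy stand-ins for the two data frames (hypothetical data)
alltime <- data.frame(
  patient = c("Ann", "Bob"),
  amount  = factor(c("10,5", "20,0")),  # numbers misread as text, then factor
  stringsAsFactors = FALSE
)
all2008 <- data.frame(
  patient = c("Carol", "Dan"),
  amount  = c(30.5, 40.0),
  stringsAsFactors = FALSE
)

# First class of each column; vapply guarantees a character vector
first_class <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Compare only the columns present in both data frames
common   <- intersect(names(alltime), names(all2008))
mismatch <- first_class(alltime)[common] != first_class(all2008)[common]
names(mismatch)[mismatch]   # "amount"
```

Once the mismatched columns are known, the cause (import settings, unexpected values in the database) can be investigated column by column.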
Answer 2:
An "easy" way is to simply not have your strings converted to factors when importing text data.
Note that the read.{table,csv,...} functions take a stringsAsFactors parameter, which defaulted to TRUE before R 4.0.0 (the default has been FALSE since then). You can set it to FALSE while you're importing and rbind-ing your data.
If you'd like the column to be a factor at the end, you can do that too.
For example:
alltime <- read.table("alltime.txt", stringsAsFactors=FALSE)
all2008 <- read.table("all2008.txt", stringsAsFactors=FALSE)
alltime <- rbind(alltime, all2008)
# If you want the doctor column to be a factor, make it so:
alltime$doctor <- as.factor(alltime$doctor)
Answer 3:
1) Create the data frame with stringsAsFactors set to FALSE. This should resolve the factor issue.
2) Afterwards, don't use rbind: it messes up the column names if the data frame is empty. Simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
Without stringsAsFactors=F you get warnings; with it, the row is appended:
> df <- data.frame(a = character(0), b=character(0), c=numeric(0))
> df[nrow(df)+1,] <- c("d","gsgsgd",4)
Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
  invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
  invalid factor level, NAs generated
> df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
> df[nrow(df)+1,] <- c("d","gsgsgd",4)
> df
  a      b c
1 d gsgsgd 4
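One caveat with the session above: c("d","gsgsgd",4) is a character vector, so the 4 is stored as "4" and column c ends up as character. A minimal sketch of a variant that keeps the column types, assuming a list is acceptable as the new row (this variant is not part of the original answer):

```r
# Empty data frame with two character columns and one numeric column
df <- data.frame(a = character(0), b = character(0), c = numeric(0),
                 stringsAsFactors = FALSE)

# c("d", "gsgsgd", 4) would coerce 4 to "4"; a list keeps each element's type
df[nrow(df) + 1, ] <- list("d", "gsgsgd", 4)

str(df)
```

Here str(df) should report c as a numeric column rather than a character one.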
Answer 4:
As suggested in the previous answer, read the columns as character and do the conversion to factors after rbind.
sqlFetch (I assume from RODBC) also accepts the stringsAsFactors or the as.is argument to control the conversion of character columns.
The allowed values are the same as for read.table, e.g. as.is=TRUE or some column number.
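Since a database connection is not available here, the sketch below illustrates what as.is means using type.convert, the base R function that read.table relies on for this conversion. The commented-out sqlFetch call uses a hypothetical channel and table name, and its behaviour is assumed to follow the description above:

```r
raw <- c("Dr. X", "Dr. Y", "Dr. X")

# as.is = FALSE converts non-numeric strings to a factor
as_factor <- type.convert(raw, as.is = FALSE)
class(as_factor)   # "factor"

# as.is = TRUE leaves them as character
as_char <- type.convert(raw, as.is = TRUE)
class(as_char)     # "character"

# Assumed RODBC usage (hypothetical channel and table name, not run here):
# all2008 <- sqlFetch(channel, "billing2008", as.is = TRUE)
```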
Answer 5:
I had the same problem with type mismatches, especially with factors. I had to glue together two otherwise compatible datasets.
My solution is to convert the factors in both dataframes to "character". Then it works like a charm :-)
convert.factors.to.strings.in.dataframe <- function(dataframe)
{
    # Find the columns whose class is "factor"
    class.data <- sapply(dataframe, class)
    factor.vars <- class.data[class.data == "factor"]
    # Convert each of them to character
    for (colname in names(factor.vars))
    {
        dataframe[,colname] <- as.character(dataframe[,colname])
    }
    return (dataframe)
}
If you want to see the types in your two dataframes, run the following (change the variable names):
cbind("orig" = sapply(allSurveyData, class),
      "merge" = sapply(curSurveyDataMerge, class),
      "eq" = sapply(allSurveyData, class) == sapply(curSurveyDataMerge, class)
)
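The same idea can be written more compactly with lapply. This is a sketch on made-up data; factors_to_character, old and new are hypothetical names, not part of the original answer:

```r
# Convert every factor column of a data frame to character
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  df
}

# Toy data: the new frame contains a doctor not seen before
old <- data.frame(doctor = factor(c("X", "Y")), fee = c(10, 20))
new <- data.frame(doctor = factor("Z"), fee = 30)

combined <- rbind(factors_to_character(old), factors_to_character(new))

# Convert back to factor once all rows are in place
combined$doctor <- factor(combined$doctor)
levels(combined$doctor)
```

Assigning into df[] keeps the data frame structure, so column names and row names are preserved.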
Answer 6:
When you create the dataframe you have the choice of making your string columns factors (stringsAsFactors=T), or keeping them as strings.
For your case, don't make your string columns factors. Keep them as strings, then appending works fine. If you ultimately need them to be factors, do all the insertion and appending first as strings, then convert them to factors at the end.
If you make the string columns factors and then append rows containing unseen values, you get the warning you mentioned for each new unseen factor level, and that value gets replaced with NA...
> df <- data.frame(patient=c('Ann','Bob','Carol'), referring_doctor=c('X','Y','X'), stringsAsFactors=T)
> df
  patient referring_doctor
1     Ann                X
2     Bob                Y
3   Carol                X
> df <- rbind(df, c('Denise','Z'))
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "Denise") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "Z") :
  invalid factor level, NA generated
> df
  patient referring_doctor
1     Ann                X
2     Bob                Y
3   Carol                X
4    <NA>             <NA>
So don't make your string columns factors. Keep them as strings, then appending works fine:
> df <- data.frame(patient=c('Ann','Bob','Carol'), referring_doctor=c('X','Y','X'), stringsAsFactors=F)
> df <- rbind(df, c('Denise','Z'))
> df
  patient referring_doctor
1     Ann                X
2     Bob                Y
3   Carol                X
4  Denise                Z
To change the default behavior:
options(stringsAsFactors=F)
To convert individual columns to/from string or factor:
df$col <- as.character(df$col)
df$col <- as.factor(df$col)
Answer 7:
Here's a function that takes the common column names of two data frames and does an rbind. It finds the fields that are factors, adds the new factor levels, and then does the rbind. This should take care of any factor issues:
rbindCommonCols <- function(x, y) {
  # Keep only the columns present in both data frames
  # (drop = FALSE keeps a data frame even with a single common column)
  commonColNames <- intersect(colnames(x), colnames(y))
  x <- x[, commonColNames, drop = FALSE]
  y <- y[, commonColNames, drop = FALSE]
  # Columns that are a factor in at least one of the two data frames
  isFactorCol <- vapply(x, is.factor, logical(1)) | vapply(y, is.factor, logical(1))
  for (n in which(isFactorCol)) {
    # Make both sides factors, then give both the union of the levels
    x[, n] <- as.factor(x[, n])
    y[, n] <- as.factor(y[, n])
    allLevels <- unique(c(levels(x[, n]), levels(y[, n])))
    x[, n] <- factor(x[, n], levels = allLevels)
    y[, n] <- factor(y[, n], levels = allLevels)
  }
  rbind(x, y)
}
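The core of this approach, giving both sides the same level set before binding, can be sketched on a single column. The data below are made up and the variable names are hypothetical:

```r
# Toy data: doctor is a factor in x but a plain character column in y
x <- data.frame(doctor = factor(c("X", "Y")), fee = c(10, 20))
y <- data.frame(doctor = c("Z", "X"), fee = c(30, 40), stringsAsFactors = FALSE)

# Treat the column as a factor on both sides, then apply the union of levels
all_levels <- union(levels(factor(x$doctor)), levels(factor(y$doctor)))
x$doctor <- factor(x$doctor, levels = all_levels)
y$doctor <- factor(y$doctor, levels = all_levels)

res <- rbind(x, y)
levels(res$doctor)
```

Because every value in y is already a valid level, no NAs are introduced by the rbind.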