For several days now I've been stuck on a problem in R: making duplicated levels in multiple factor columns of a data frame unique using a loop. This is part of a larger project.
I have more than 200 SPSS data sets where the number of cases varies between 4,000 and 23,000 and the number of variables varies between 120 and 1,200 (an excerpt of one of the SPSS data sets can be found here). The files contain both numeric and factor variables, and many of the factor variables have duplicated levels. I have used read.spss from the foreign package to import them into data frames, keeping the value labels because I need them for further use. During the import R warns me about the duplicated levels in the factor columns:
> adn <- read.spss("/tmp/adn_110.sav", use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)
Warning messages:
1: In read.spss("/tmp/adn_110.sav", use.value.labels = TRUE, use.missings = TRUE, :
/tmp/adn_110.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
The data frame, exported as .RData, can be found here. When I use table (for example) to get the counts for each level of any factor column, all duplicated levels are displayed, but the counts for all duplicates are added to the first occurrence of the level and 0 is returned for the others:
> table(adn[["adn01"]], useNA = "ifany")
        Incorrect         Incorrect Partially correct Partially correct
                8                 0                 4                 0
          Correct              <NA>
                2                 1
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
I know I can easily treat the factor as.numeric when calling table. However, I need the level names displayed in the output. I can use make.unique to make the levels of an individual factor column unique, appending a number to the end of the duplicated levels:
> levels(adn[["adn01"]]) <- make.unique(levels(adn[["adn01"]]), sep = " ")
Works like a charm. Then table shows me the correct counts:
> table(adn[["adn01"]], useNA = "ifany")
          Incorrect         Incorrect 1   Partially correct
                  5                   3                   1
Partially correct 1             Correct                <NA>
                  3                   2                   1
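To make the renaming explicit, here is a minimal sketch of what make.unique does on a plain character vector. The vector `lbls` is a hypothetical stand-in for the level labels of adn01, not taken from the actual file:

```r
# Hypothetical label vector mimicking the duplicated value labels of adn01
lbls <- c("Incorrect", "Incorrect",
          "Partially correct", "Partially correct",
          "Correct")

# make.unique() leaves the first occurrence untouched and appends
# sep + a running number (starting at 1) to each later duplicate
unique_lbls <- make.unique(lbls, sep = " ")
print(unique_lbls)
```

The first occurrence of each label keeps its name, so only the later duplicates get a numeric suffix.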
However, doing this for each factor column in each of the more than 200 files, where the number of variables varies between 120 and 1,200, would be the mission of a lifetime. And if the files change I will have to redo everything. I naively thought looping through the columns would be easy. However, make.unique requires a character vector of names. I have tried the following:
> lapply(adn[ , 1:length(adn)], make.unique(as.vector(attr(adn[ , 1:length(adn)],
"levels"))))
Error in make.unique(as.vector(attr(adn[, 1:length(adn)], "levels"))) :
'names' must be a character vector
No luck. I have tried many other things in the last few days, including classical for loops. Still the same: 'names' must be a character vector. I guess the problem is in indexing the levels attribute of the columns, which is a list component, but I can't figure out what exactly. An additional issue may be that not all columns are factors. Can someone help?
EDIT:
The solution provided by akrun works perfectly. Thank you once again!
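For later readers, here is a minimal sketch of one way to do this. It is not necessarily the approach from akrun's answer. The attempt above fails because a data frame has no "levels" attribute of its own, so attr(...) returns NULL and make.unique(NULL) raises the error; the levels have to be taken from each column separately. The data frame `adn_toy` and its columns are hypothetical stand-ins, and its levels are already unique because current R versions refuse to create factors with duplicated levels:

```r
# Hypothetical toy data frame standing in for `adn` (mixed column types)
adn_toy <- data.frame(
  id    = 1:4,
  adn01 = factor(c("Incorrect", "Correct", "Incorrect", "Partially correct")),
  adn02 = factor(c("Yes", "No", "No", "Yes"))
)

# Logical index of the factor columns; non-factor columns are left alone
is_fac <- vapply(adn_toy, is.factor, logical(1))

# For each factor column, make its own levels unique and return the column
adn_toy[is_fac] <- lapply(adn_toy[is_fac], function(x) {
  levels(x) <- make.unique(levels(x), sep = " ")
  x
})

str(adn_toy)
```

Wrapping this in a function that takes a data frame would allow the same step to be applied to each of the imported files.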