How can I import SAS format files into R?

2019-04-15 07:42发布

问题:

I am trying to analyze data from the 2012-2013 NATS survey, from this location. There are three files in the zip folder there, labelled 2012-2013 NATS format.sas, formats.sas7bcat and nats2012.sas7bdat. The third file contains the actual data, but the second file contains the labels that go with the data; that is, as an example, if the variable 'Race' in the raw data file has categories 1,2,3 and 4, the labels show that these categories stand for 'Caucasian', 'African-American','Hispanic' and 'Other'. I have been able to import the sas7bdat file into R, using the 'sas7bdat' package, but when I try to do cross-tabulations, I am not able to see which category each cell represents. For example, if I try to do this:

table(SMOKSTATUS_R, RACEETHNIC)

What I get is:

RACEETHNIC
SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
           1  4045   455    55     7    63     0   675   393   373
           2  1183   222    38     2    26     0   217   255   154
           3 14480   957   238    14    95     3  1112   950   369
           4 23923  2532  1157    23   147     1  1755  3223   909
           5    81    18     4     0     1     0    11    17     9

As far as I can tell, the only way to inlcude the labels to the data is manually typing them in, but there are 240 variables and besides, there are labels currently existing, in the form of the format.sas7bcat file. Is there any way to import the format file into R, so that the labels can be attached to the variables? This is how it is done in SAS, but I do not have access t oSAS right now. Thanks for all the help.

回答1:

This should be a one-liner:

library('haven')
sas <- read_sas('nats2012.sas7bdat', 'formats.sas7bcat')

with(sas, table(SMOKSTATUS_R, RACEETHNIC))
#             RACEETHNIC
# SMOKSTATUS_R     1     2     3     4     5     6     7     8     9
#            1  4045   455    55     7    63     0   675   393   373
#            2  1183   222    38     2    26     0   217   255   154
#            3 14480   957   238    14    95     3  1112   950   369
#            4 23923  2532  1157    23   147     1  1755  3223   909
#            5    81    18     4     0     1     0    11    17     9

table(names(attr(sas[, 'SMOKSTATUS_R'], 'labels')[sas[, 'SMOKSTATUS_R']]),
      names(attr(sas[, 'RACEETHNIC'], 'labels')[sas[, 'RACEETHNIC']]))

#                          Amer. Indian, AK Nat. Only, Non-Hispanic
# Current everyday smoker                                        63
# Current some days smoker                                       26
# Former smoker                                                  95
# Never smoker                                                  147
# Unknown                                                         1

Use haven to read in the data, but that also gives you some useful attributes, namely the variable labels:

attributes(sas$SMOKSTATUS_R)
# $label
# [1] "SMOKER STATUS (4-level)"
# 
# $class
# [1] "labelled"
# 
# $labels
# Current everyday smoker Current some days smoker            Former smoker 
#                       1                        2                        3 
# Never smoker                  Unknown 
#            4                        5 
# 
# $is_na
# [1] FALSE FALSE FALSE FALSE FALSE

You can easily write this into a function to use more generally:

do_fmt <- function(x, fmt) {
  lbl <- if (!missing(fmt))
    unlist(unname(fmt)) else attr(x, 'labels')

  if (!is.null(lbl))
    tryCatch(names(lbl[match(unlist(x), lbl)]),
             error = function(e) {
               message(sprintf('formatting failed for %s', attr(x, 'label')),
                       domain = NA)
               x
             }) else x
}

table(do_fmt(sas[, 'SMOKSTATUS_R']),
      do_fmt(sas[, 'RACEETHNIC']))

#                          Amer. Indian, AK Nat. Only, Non-Hispanic
# Current everyday smoker                                        63
# Current some days smoker                                       26
# Former smoker                                                  95
# Never smoker                                                  147
# Unknown                                                         1

And apply to the entire data set

sas[] <- lapply(sas, do_fmt)
sas$SMOKSTATUS_R[1:4]
# [1] "Never smoker"  "Former smoker" "Former smoker" "Never smoker" 

Although sometimes this fails like below. This looks like something wrong with the haven package

attr(sas$SMOKTYPE, 'labels')
# INAPPLICABLE            REFUSED                 DK    NOT ASCERTAINED 
#     -4.00000           -0.62500           -0.50000           -0.46875 
# PREMADE CIGARETTES      ROLL-YOUR-OWN               BOTH 
#            1.00000            2.00000            3.00000 

So instead of this, you can parse the format.sas file with some simple regexes

locf <- function(x) {
  x <- data.frame(x, stringsAsFactors = FALSE)
  x[x == ''] <- NA
  indx <- !is.na(x)

  x[] <- lapply(seq_along(x), function(ii) {
    idx <- cumsum(indx[, ii])
    idx[idx == 0] <- NA
    x[, ii][indx[, ii]][idx]
  })
  x[, 1]
}

fmt <- readLines('~/desktop/2012-2013-NATS-Format/2012-2013-NATS-Format.sas')
## not sure if comments are allowed in the value definitions, but
## this will check for those in case
fmt <- gsub('\\*.*;|\\/\\*.*\\*\\/', '', fmt)

vars <- gsub('(?i)value\\W+(\\w*)|.', '\\1', fmt, perl = TRUE)
vars <- locf(vars)

regex <- '[\'\"].*[\'\"]|[\\w\\d-]+'
vals <- gsub(sprintf('(?i)\\s*(%s)\\s*(=)\\s*(%s)|.', regex, regex),
               '\\1\\2\\3', fmt, perl = TRUE)

View(dd <- na.omit(data.frame(values = vars, formats = vals,
                              stringsAsFactors = FALSE)))

sp <- split(dd$formats, dd$values)
sp <- lapply(sp, function(x) {
  x <- Filter(nzchar, x)
  x <- strsplit(x, '=')
  tw <- function(x) gsub('^\\s+|\\s+$', '', x)
  sapply(x, function(y)
    setNames(tw(y[1]), tw(y[2])))
})

So the smoke type formats (one of them that failed above), for example, gets parsed like this:

sp['A5_']
# $A5_
# 'INAPPLICABLE'            'REFUSED'                 'DK' 
#           "-1"                 "-7"                 "-8" 
# 'NOT ASCERTAINED' 'PREMADE CIGARETTES'      'ROLL-YOUR-OWN'  'BOTH' 
#              "-9"                  "1"                  "2"     "3" 

And then you can use the function again to apply to the data

table(do_fmt(sas['SMOKTYPE'], sp['A5_']))

# 'BOTH'                 'DK'       'INAPPLICABLE' 
#   736                   17                51857 
# 'PREMADE CIGARETTES'            'REFUSED'      'ROLL-YOUR-OWN' 
#                 7184                    2                  396 


回答2:

The formats.sas file should be readable and parseble into column label vectors, which you then apply as you would any column label vector.

If you're looking to label the categorical variables, which is presumably what you're mostly concerned about based on your question, this should be fairly straightforward. You'll see code that looks like this:

value RACEF
1 = 'Caucasian'
2 = 'African-American'
3 = 'Hispanic'
4 = 'Other'
;

You just need to parse that into a vector.

If you're lucky, their category format names will be identical to the column names (maybe with an F like I have in that example); if that's the case you can probably just work out how to apply them directly.

If it's not, you'll have to parse the second half of the program. It will consist of lines like this:

format
  race RACEF.
  gender SEXF.
  income INCRF.
...
;

That of course shows the relationship between column name and format name, and thus tells you which vector of column names you should use to label which column.



标签: r import sas