Using Stata Variable Labels in R

Posted 2020-02-19 08:37

Question:

I have a bunch of Stata .dta files that I would like to use in R.

My problem is that the variable names are not helpful to me as they are like "q0100," "q0565," "q0500," and "q0202." However, they are labelled like "psu," "number of pregnant," "head of household," and "waypoint."

I would like to be able to grab the labels ("psu," "waypoint," etc.) and use them as my variable/column names, as those will be easier for me to work with.

Is there a way to do this, either preferably in R, or through Stata itself? I know of read.dta in library(foreign) but don't know if it can convert the labels into variable names.

Answer 1:

R does not have a built-in way to handle variable labels. Personally, I think this is a disadvantage that should be fixed. Hmisc does provide some facility for handling variable labels, but the labels are only recognized by functions in that package. read.dta creates a data.frame with an attribute "var.labels" which contains the labeling information. You can then create a data dictionary from that.

> library(foreign)
> data(swiss)
> write.dta(swiss,swissfile <- tempfile())
> a <- read.dta(swissfile)
> 
> var.labels <- attr(a,"var.labels")
> 
> data.key <- data.frame(var.name=names(a),var.labels)
> data.key
          var.name       var.labels
1        Fertility        Fertility
2      Agriculture      Agriculture
3      Examination      Examination
4        Education        Education
5         Catholic         Catholic
6 Infant_Mortality Infant.Mortality

Of course this .dta file doesn't have very interesting labels, but yours should be more meaningful.
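To go one step further and actually use the labels as column names, something like the following might work. It is only a minimal sketch: the data frame `a` is built by hand as a stand-in for the result of read.dta, the values are made up, and make.names is just one possible way to turn free-text labels into valid, unique R names.

```r
# Stand-in for the result of read.dta(): a data frame whose "var.labels"
# attribute holds one label per column (data values here are made up).
a <- data.frame(q0100 = 1:3, q0565 = c(0, 2, 1), q0500 = c("x", "y", "z"))
attr(a, "var.labels") <- c("psu", "number of pregnant", "")

var.labels <- attr(a, "var.labels")

# Keep the original name wherever the label is missing or empty.
has.label <- !is.na(var.labels) & nzchar(var.labels)
new.names <- ifelse(has.label, var.labels, names(a))

# make.names() replaces spaces and other invalid characters with "."
# and, with unique = TRUE, disambiguates duplicate names.
names(a) <- make.names(new.names, unique = TRUE)
names(a)
```

Columns without a label keep their original name, so nothing ends up with an empty column name.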



Answer 2:

I would recommend that you use the new haven package (GitHub) for importing your data.

As Hadley Wickham mentions in the README.md file:

You always get a data frame, date times are converted to corresponding R classes and labelled vectors are returned as new labelled class. You can easily coerce to factors or replace labelled values with missings as appropriate. If you also use dplyr, you'll notice that large data frames are printed in a convenient way.

(emphasis mine)

If you use RStudio this will automatically display the labels under variable names in the View("data.frame") viewer pane (source).

Variable labels are attached as an attribute to each variable. These are not printed (because they tend to be long), but if you have a preview version of RStudio, you’ll see them in the revamped viewer pane.

You can install the package using:

install.packages("haven")

and import your Stata data using:

library(haven)
read_dta("path/to/file")

For more info see:

help("read_dta")
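Since the labels are stored as an attribute on each variable, they can also be pulled out and used as column names. The sketch below assumes the attribute is called "label" and builds a small data frame by hand (with made-up values) in place of a real read_dta() result, so it runs without haven installed.

```r
# Stand-in for the result of haven::read_dta(): each column carries its
# variable label in a "label" attribute (hand-built here, values made up).
d <- data.frame(q0100 = 1:3, q0202 = c(10.5, 11.2, 9.8))
attr(d$q0100, "label") <- "psu"
attr(d$q0202, "label") <- "waypoint"

# Collect one label per column, falling back to the column name if absent.
labels <- vapply(names(d), function(nm) {
  lab <- attr(d[[nm]], "label", exact = TRUE)
  if (is.null(lab)) nm else as.character(lab)
}, character(1), USE.NAMES = FALSE)

# Convert the labels into syntactically valid, unique column names.
names(d) <- make.names(labels, unique = TRUE)
names(d)
```

With real data, `d` would come from read_dta("path/to/file") instead of being constructed manually.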


Answer 3:

Here's a function to evaluate any expression you want with Stata variable labels:

#' Function to prettify the output of another function using a `var.labels` attribute
#' This is particularly useful in combination with read.dta et al.
#' @param dat A data.frame with attr `var.labels` giving descriptions of variables
#' @param expr An expression to evaluate with pretty var.labels
#' @return The result of the expression, with variable names replaced with their labels
#' @examples
#' testDF <- data.frame( a=seq(10),b=runif(10),c=rnorm(10) )
#' attr(testDF,"var.labels") <- c("Identifier","Important Data","Lies, Damn Lies, Statistics")
#' prettify( testDF, quote(str(dat)) )
prettify <- function( dat, expr ) {
  labels <- attr(dat,"var.labels")
  for(i in seq(ncol(dat))) colnames(dat)[i] <- labels[i]
  attr(dat,"var.labels") <- NULL
  eval( expr )
}

You can then prettify(testDF, quote(table(...))) or whatever you want.

See this thread for more info.



Answer 4:

You can convert the variable labels to variable names from within Stata before exporting the data to an R or text file.
As Ian mentions, variable labels usually do not make good variable names, but if you convert spaces and other characters to underscores and if your variable labels aren't too long, you can re-label your vars with the varlabels quite easily.

Below is an example using the built-in Stata dataset "cancer.dta" to replace all variable names with variable labels. Importantly, this code will not try to rename variables that have no variable label. Note that I also picked a dataset where there are lots of characters that aren't useful in naming a variable (e.g. =, 1, ', ., (), etc.). You can add any characters that might be lurking in your variable labels to the list in the 5th line, "local chars "..." ", and it will make the changes for you:

****************! BEGIN EXAMPLE
//copy and paste this code into a Stata do-file and click "do"//
sysuse  cancer, clear
desc
**
local chars "" " "(" ")" "." "1" "=" `"'"' "___" "__" "
ds, not(varlab "")    // <-- This will only select those vars with varlabs //
foreach v in `r(varlist)' {
    local `v'l "`:var lab `v''"
    **variables names cannot have spaces or other symbols, so::
        foreach s in `chars' {
    local `v'l: subinstr local `v'l "`s'" "_", all
              }
    rename `v' ``v'l'
    **make the variable names all lower case**
    cap rename ``v'l' `=lower("``v'l'")'
      }
desc
****************! END EXAMPLE
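For anyone who would rather do the same clean-up on the R side, a rough equivalent might look like this. It is a sketch that only loosely mirrors the Stata loop above: `label_to_name` is a hypothetical helper, the example labels and old names are illustrative, and unlike the Stata code it keeps digits and does not guard against names that start with a digit or against duplicates.

```r
# Turn free-text labels into lower-case, underscore-separated names:
# every run of characters other than letters and digits becomes one "_".
label_to_name <- function(label) {
  name <- gsub("[^A-Za-z0-9]+", "_", label)
  name <- gsub("^_+|_+$", "", name)  # drop leading/trailing underscores
  tolower(name)
}

# Illustrative labels and existing variable names (hypothetical values).
labels    <- c("Patient's age at start of exp.", "Drug type (1=placebo)", "")
old.names <- c("age", "drug", "died")

new.names <- label_to_name(labels)

# Leave a variable's name unchanged when its label is empty.
empty <- !nzchar(new.names)
new.names[empty] <- old.names[empty]
new.names
```

The resulting vector can then be assigned with names(dat) <- new.names on the imported data frame.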

You might also consider taking a look at Stat Transfer and its capabilities in converting Stata to R datafiles.



Answer 5:

When using the haven package:

If the data set you are importing is large, viewing the data in RStudio might not be optimal.

You can instead get a data.frame with column names, column labels and an indicator for whether the column is labelled:

library(haven)     # read_dta(), is.labelled()
library(magrittr)  # provides the %>% pipe
d <- read_dta("your_stata_data.dta")

vars <- data.frame(
                   "name" = names(d),
                   "label" = sapply(d, function(x) attr(x, "label"))  %>% as.character(),
                   "labelled" = sapply(d, is.labelled) )

Note: as.character is needed so that NULL labels are not dropped, which would otherwise leave the vectors with different lengths.
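One side effect of as.character on a list is that a missing label becomes the literal string "NULL". If a proper NA is preferred, a variant along these lines might work. It is a sketch only: `get_label` is a hypothetical helper, and the data frame is built by hand (made-up values, one column deliberately unlabelled) in place of a real read_dta() result.

```r
# Stand-in for a data frame imported with haven::read_dta(); the second
# column deliberately has no "label" attribute.
d <- data.frame(q0100 = 1:3, q0565 = c(0, 2, 1))
attr(d$q0100, "label") <- "psu"

# Return the column's label, or NA when the column has none.
get_label <- function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
}

# vapply() always yields exactly one string per column, so the two
# vectors below are guaranteed to have the same length.
vars <- data.frame(
  name  = names(d),
  label = vapply(d, get_label, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
vars
```

Unlabelled columns can then be found with is.na(vars$label).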