I am trying to analyze data from the 2012-2013 NATS survey, from this location. There are three files in the zip folder there, labelled 2012-2013 NATS format.sas, formats.sas7bcat and nats2012.sas7bdat. The third file contains the actual data, and the second file contains the labels that go with the data. For example, if the variable 'Race' in the raw data file has categories 1, 2, 3 and 4, the labels show that these categories stand for 'Caucasian', 'African-American', 'Hispanic' and 'Other'. I have been able to import the sas7bdat file into R using the 'sas7bdat' package, but when I do cross-tabulations, I cannot see which category each cell represents. For example, if I try this:
table(SMOKSTATUS_R, RACEETHNIC)
What I get is:
            RACEETHNIC
SMOKSTATUS_R     1    2    3  4   5 6    7    8   9
           1  4045  455   55  7  63 0  675  393 373
           2  1183  222   38  2  26 0  217  255 154
           3 14480  957  238 14  95 3 1112  950 369
           4 23923 2532 1157 23 147 1 1755 3223 909
           5    81   18    4  0   1 0   11   17   9
As far as I can tell, the only way to include the labels in the data is to type them in manually, but there are 240 variables, and besides, the labels already exist in the form of the formats.sas7bcat file. Is there any way to import the format file into R, so that the labels can be attached to the variables? This is how it is done in SAS, but I do not have access to SAS right now. Thanks for all the help.
The formats.sas file should be readable and parseable into column label vectors, which you then apply as you would any column label vector.
If you're looking to label the categorical variables, which is presumably what you're mostly concerned about based on your question, this should be fairly straightforward. You'll see code that looks like this:
You just need to parse that into a vector.
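As a rough sketch of that parsing step in R: the SAS excerpt below is a made-up example (the format name RACEF and the labels are taken from the question's illustration, not from the real file), and the helper name parse_value_block is hypothetical. It assumes integer codes and one `code = "label"` pair per entry; ranges or special SAS values would need extra handling.

```r
# Hypothetical excerpt of a PROC FORMAT value block; the real file may differ.
sas_block <- '
VALUE RACEF
  1 = "Caucasian"
  2 = "African-American"
  3 = "Hispanic"
  4 = "Other"
;'

# Hypothetical helper: turn `code = "label"` pairs into a named numeric vector.
parse_value_block <- function(txt) {
  # Match an integer code, an equals sign, and a single- or double-quoted label.
  pattern <- "(-?[0-9]+)\\s*=\\s*(\"[^\"]*\"|'[^']*')"
  pairs <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
  codes <- as.numeric(sub("\\s*=.*$", "", pairs))   # text before the "="
  quoted <- sub("^[^=]*=\\s*", "", pairs)           # text after the "="
  labels <- substr(quoted, 2, nchar(quoted) - 1)    # drop surrounding quotes
  setNames(codes, labels)
}

race_labels <- parse_value_block(sas_block)

# Usage: turn raw codes into a labelled factor.
race <- factor(c(1, 3, 2), levels = race_labels, labels = names(race_labels))
```

The names of the vector are the labels and the values are the codes, which makes it easy to pass straight to factor().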
If you're lucky, their category format names will be identical to the column names (maybe with an F like I have in that example); if that's the case you can probably just work out how to apply them directly.
If it's not, you'll have to parse the second half of the program. It will consist of lines like this:
That of course shows the relationship between column name and format name, and thus tells you which vector of column names you should use to label which column.
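For illustration, here is one way that mapping could be extracted in R. The two FORMAT statements are invented for the example (the format names RACEF and SMOKF are assumptions), and the regex only handles the simple one-variable-per-statement case.

```r
# Hypothetical FORMAT statements; the real program may list several
# variables per statement, which this sketch does not handle.
sas_lines <- c(
  "FORMAT RACEETHNIC RACEF. ;",
  "FORMAT SMOKSTATUS_R SMOKF. ;",
  "RUN;"
)

# Capture the variable name and the format name (without its trailing dot).
m <- regmatches(
  sas_lines,
  regexec("^\\s*FORMAT\\s+(\\w+)\\s+(\\w+)\\.", sas_lines, ignore.case = TRUE)
)
m <- Filter(length, m)  # drop lines that did not match

# Named character vector: names are column names, values are format names.
col_to_format <- setNames(
  vapply(m, `[`, character(1), 3),
  vapply(m, `[`, character(1), 2)
)
```

Looking up col_to_format[["RACEETHNIC"]] then tells you which label vector belongs to that column.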
This should be a one-liner:
Use haven to read in the data, which also gives you some useful attributes, namely the variable labels:
You can easily write this into a function to use more generally:
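A minimal sketch of such a function, assuming the value labels sit in a "labels" attribute as a named vector (names are the label text, values are the codes), which is how haven's labelled vectors store them. The function name apply_labels and the toy vector are made up for the example, so it runs without the real data.

```r
# Hypothetical helper: convert a labelled vector into a factor.
apply_labels <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs)) return(x)   # no value labels: leave the column untouched
  vals <- unclass(x)
  attributes(vals) <- NULL       # strip the class and label attributes
  factor(vals, levels = unname(labs), labels = names(labs))
}

# Toy vector standing in for a column read by haven (labels are illustrative).
x <- structure(
  c(1, 2, 2, 4),
  labels = c(Caucasian = 1, "African-American" = 2, Hispanic = 3, Other = 4)
)
labelled_x <- apply_labels(x)
```

Codes that have no matching label would become NA in the resulting factor, so it is worth checking for unexpected NAs afterwards.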
And apply to the entire data set
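Something along these lines should do it; this is only a sketch on a toy data frame, with a compact hypothetical helper (to_factor) that again assumes the value labels live in a "labels" attribute.

```r
# Hypothetical helper: columns without a "labels" attribute pass through unchanged.
to_factor <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs)) return(x)
  factor(as.vector(x), levels = unname(labs), labels = names(labs))
}

# Toy data frame standing in for the imported survey data.
df <- data.frame(id = 1:3)
df$smoker <- structure(c(1, 2, 1), labels = c(Yes = 1, No = 2))

# df[] keeps the data.frame structure while replacing every column.
df[] <- lapply(df, to_factor)
table(df$smoker)
```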
Although sometimes this fails, like below. This looks like something wrong with the haven package:
So instead of this, you can parse the format.sas file with some simple regexes:
So the smoke type formats (one of them that failed above), for example, gets parsed like this:
And then you can use the function again to apply to the data
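As a hedged sketch of that last step: suppose the parsing produced a list of label vectors keyed by format name, plus a column-to-format lookup. All names below (formats, col_to_format, label_columns, FMT_A, q1) are hypothetical stand-ins, not taken from the NATS files.

```r
# Hypothetical parsed results: one named vector of codes per format,
# and a lookup from column name to format name.
formats <- list(FMT_A = c(Yes = 1, No = 2))
col_to_format <- c(q1 = "FMT_A")

# Hypothetical helper: label every column that has a known format.
label_columns <- function(df, formats, col_to_format) {
  for (col in intersect(names(col_to_format), names(df))) {
    labs <- formats[[col_to_format[[col]]]]
    if (is.null(labs)) next   # format not found in the parsed file: skip
    df[[col]] <- factor(df[[col]], levels = unname(labs), labels = names(labs))
  }
  df
}

survey <- data.frame(q1 = c(1, 2, 2), age = c(30, 40, 50))
labelled <- label_columns(survey, formats, col_to_format)
```

Columns without an entry in the lookup (age here) are left as they are.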