R: use LaF (reads fixed column width data FAST) wi

2019-05-23 05:57发布

I'm trying to read quickly into R a ASCII fixed column width dataset, based on a SAS import file (the file that declares the column widths, and etc).

I know I can use SAScii R package for translating the SAS import file (parse.SAScii) and actually importing (read.SAScii). It works but it is too slow, because read.SAScii uses read.fwf to do the data import, which is slow. I would like to change that for a fast import mathod, laf_open_fwf from the "LaF" package.

I'm almost there, using parse.SAScii() and laf_open_fwf(), but I'm able to correctly connect the output of parse.SAScii() to the arguments of laf_open_fwf().

Here is the code, the data is from PNAD, national household survey, 2013:

# Set working dir.
  setwd("C:/User/Desktop/folder")

# installing packages: 
    install.packages("SAScii")
    install.packages("LaF")
    library(SAScii)
    library(LaF)

# Donwload and unzip data and documentation files
  # Data
    file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dados.zip"
    download.file(file_url,"Dados.zip", mode="wb")
    unzip("Dados.zip")
  # Documentation files
    file_url <- "ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_anual/microdados/2013/Dicionarios_e_input_20150814.zip"
    download.file(file_url,"Dicionarios_e_input.zip", mode="wb")
    unzip("Dicionarios_e_input.zip")

# importing with read.SAScii(), based on read.fwf(): Works fine
    dom.pnad2013.teste1 <- read.SAScii("Dados/DOM2013.txt","Dicionarios_e_input/input DOM2013.txt")

# importing with parse.SAScii() and laf_open_fwf() : stuck here
    dic_dom2013 <- parse.SAScii("Dicionarios_e_input/input DOM2013.txt")
    head(dic_dom2013)
    data <- laf_open_fwf("Dados/DOM2013.txt", 
                      column_types=?????  ,
                      column_widths=dic_dom2013[,"width"],
                      column_names=dic_dom2013[,"Varname"])

I'm stuck on the last commmand, passing the importing arguments to laf_open_fwf().

标签: r import
3条回答
Ridiculous、
2楼-- · 2019-05-23 06:39

You can try the read.SAScii.sqlite, also by Anthony Damico. It's 4x faster and lead to no RAM issues (as the author himself describes). But it imports data to a SQLite self-contained database file (no SQL server needed) -- not to a data.frame. Then you can open it in R by using a dbConnection. Here it goes the GitHub adress for the code:

https://github.com/ajdamico/usgsd/blob/master/SQLite/read.SAScii.sqlite.R

In the R console, you can just run:

    source("https://raw.githubusercontent.com/ajdamico/usgsd/master/SQLite/read.SAScii.sqlite.R")

It's arguments are almost the same as those for the regular read.SAScii.

I know you are asking for a tip on how to use LaF. But I thought this could also be useful to you.

查看更多
看我几分像从前
3楼-- · 2019-05-23 06:42

I think that the best choice is to use fwf2csv() from desc package (C++ code). I will illustrate the procedure with PNAD 2013. Be aware that i'm considering that you already have the dictionary with 3 variables: beginning of the field, size of the field, variable name, AND the dara at Data/

library(bit64)
library(data.table)
library(descr)
library(reshape)
library(survey)
library(xlsx)

end_dom <- dic_dom2013$beggining + dicdom$size - 1

fwf2csv(fwffile='Dados/DOM2013.txt', csvfile='dadosdom.csv', names=dicdom$variable, begin=dicdom$beggining, end=end_dom)

dadosdom <- fread(input='dadosdom.csv', sep='auto', sep2='auto', integer64='double')
查看更多
何必那么认真
4楼-- · 2019-05-23 06:48

UPDATE: here are two solutions, using packages LaF and readr.

Solution using readr (8 seconds)

readr is based on LaF but surprisingly faster. More info on readr here

# Load Packages
  library(readr)
  library(data.table)

# Parse SAS file
  dic_pes2013 <- parse.SAScii("./Dicion rios e input/input PES2013.sas")

setDT(dic_pes2013) # convert to data.table

# read to data frame
  pesdata2 <- read_fwf("Dados/DOM2013.txt", 
                   fwf_widths(dput(dic_pes2013[,width]),
                              col_names=(dput(dic_pes2013[,varname]))),
                              progress = interactive()
                              )

Take way: readr seems to be the best option: it's faster, you don't need to worry about column types, shorter code and it shows a progress bar :)

Solution using LaF (20 seconds)

LaFis one of the (maybe THE) fastest ways to read fixed-width files in R, according to this benchmark. It tooke me 20 sec. to read the person level file (PES) into a data frame.

Here is the code:

# Parse SAS file
  dic_pes2013 <- parse.SAScii("./Dicion rios e input/input PES2013.sas")

# Read .txt file using LaF. This is virtually instantaneous
  pesdata <- laf_open_fwf("./Dados/PES2013.txt", 
                          column_types= rep("character", length(dic_pes2013[,"width"])), 
                          column_widths=dic_pes2013[,"width"],
                          column_names=dic_pes2013[,"varname"])

# convert to data frame. This tooke me 20 sec.
  system.time( pesdata <- pesdata[,] )

Note that that I've used character in column_types. I'm not quite sure why the command returns me an error if I try integer or numeric. This shouldn't be a problem, since you can convert all columns to numeric like this:

  # convert all columns to numeric
    varposition  <- grep("V", colnames(pesdata)) 
    pesdata[varposition] <- sapply(pesdata[],as.numeric)
    sapply(pesdata, class)
查看更多
登录 后发表回答