Parse multiple XBRL files stored in a zip file

Posted 2020-06-30 04:39

I have downloaded multiple zip files from a website. Each zip file contains a large number of files with .html and .xml extensions (roughly 100K per zip).

It is possible to manually extract the files and then parse them. However, I would like to be able to do this within R (if possible).

Example file (sorry, it is a bit big), using code from a previous question to download one zip file:

library(XML)

# Parse the index page and pull out the first Accounts_Monthly_Data link
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)

myfiles  <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]

# Create the download and cache directories, then fetch the zip
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
download.file(fileURLS, destfile = file.path("temp", myfiles), mode = "wb")
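
(As an aside, in case it helps: to grab every monthly zip rather than just the first one, something along these lines should work. This is only a sketch building on the snippet above; xpathSApply and xmlGetAttr are the XML package's standard tools rather than anything from the original code.)

# Sketch only: collect every Accounts_Monthly_Data link and download each zip
allfiles <- xpathSApply(doc, "//a[contains(text(),'Accounts_Monthly_Data')]",
                        xmlGetAttr, "href")
allURLS  <- file.path("http://download.companieshouse.gov.uk", allfiles)

for (i in seq_along(allURLS)) {
  download.file(allURLS[i],
                destfile = file.path("temp", basename(allfiles[i])),
                mode = "wb")   # binary mode is needed for zips on Windows
}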

I can parse the files using the XBRL package if I manually extract them. This can be done as follows:

library(XBRL)

# Parse a single, manually extracted filing
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out  <- xbrlDoAll(inst, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE)
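
(For what it is worth, xbrlDoAll returns a list of data frames -- facts, contexts, units, and so on -- so a quick way to inspect the result is something like the lines below; the element names shown are the package's usual output, so check names(out) on your own data.)

# Inspect the parsed result
names(out)
head(out$fact)     # the reported values
head(out$context)  # the periods/entities the facts refer to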

I am struggling with how to extract these files from the zip archive and parse each one, say in a loop, using R without manually extracting them. I made a start below, but don't know how to progress from here. Thanks for any advice.

# Get the names of the files inside the zip
lst <- unzip(file.path("temp", myfiles), list = TRUE)
dim(lst) # 118626 files

# Open a connection to the first file (this does not extract it to disk)
nms  <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename = nms)
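
(One way to progress from here, which the answer below also takes, is to let unzip() do the extraction into a temporary directory and then loop over the file paths it returns; the following is only a sketch of that idea.)

# Sketch: extract everything to a temporary directory and parse in a loop,
# rather than reading through the unz() connection above
tmp <- tempdir()
fls <- unzip(file.path("temp", myfiles), exdir = tmp)  # returns the extracted paths

out <- lapply(fls[1:10], function(f)  # first 10 files as a test
  xbrlDoAll(f, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE))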

I am using Windows 8.1

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Tags: r xbrl
1 Answer

家丑人穷心不美 · answered 2020-06-30 05:09

Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory, and then parsed each file. I used the snow package to speed things up.

  # Parse one zip file to start
  fls <- list.files("temp", pattern = "\\.zip$")[1]

  # Unzip into a temporary directory
  tmp <- tempdir()
  lst <- unzip(file.path("temp", fls), exdir = tmp)

  # Only parse the first 10 files as a test
  inst <- lst[1:10]

  # Start to parse - in parallel
  library(snow)
  cl <- makeCluster(parallel::detectCores())
  clusterCall(cl, function() library(XBRL))

  # Start
  st <- Sys.time()

  out <- parLapply(cl, inst, function(i)
                                  xbrlDoAll(i,
                                            cache.dir = "temp/hmrcCache",
                                            prefix.out = NULL, verbose = TRUE))

  stopCluster(cl)

  Sys.time() - st
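
If a single table of reported values is the goal, the individual results can then be stacked; this is a sketch assuming each element of out has the usual xbrlDoAll structure:

  # Stack the 'fact' data frames from each parsed filing into one table
  facts <- do.call(rbind, lapply(out, `[[`, "fact"))
  dim(facts)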

(I am not sure that I am using tempdir() correctly, as this seems to save large amounts of data to the Local\Temp directory; I would welcome comments if I have approached this incorrectly, thanks.)
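
One way to keep that directory under control (again only a sketch, not something from the original answer) is to delete the extracted files once each zip has been parsed:

  # 'lst' holds the paths returned by unzip() above; remove them after parsing
  unlink(lst)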
