How to get a sample of a given size from a large XML file in R?
Unlike sampling random lines from a plain-text file, which is simple, here the structure of the XML must be preserved so that R can read the result into a proper data.frame.
A possible solution is to read the whole file and then sample rows, but is it possible to read only necessary chunks?
A sample from the file:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<sku>967190</sku>
<productId>98611</productId>
...
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
...
The number of lines varies from one "product" to the next, and the total number of records is unknown before the file is opened.
Instead of reading the entire file in, it's possible to use event parsing with a closure that handles the nodes you're interested in. To get there, I'll start with a strategy for random sampling from a stream of records (reservoir sampling): process records one at a time; if the index i of the current record is less than or equal to the number n of records to keep, store it; otherwise, store it with probability n / i, replacing a randomly chosen earlier selection. This could be implemented as
i <- 0L; n <- 10L
select <- function() {
    i <<- i + 1L
    if (i <= n) {
        i
    } else if (runif(1) < n / i) {
        sample(n, 1)
    } else {
        0
    }
}
which behaves like this:
> i <- 0L; n <- 10L; replicate(20, select())
[1] 1 2 3 4 5 6 7 8 9 10 1 5 7 0 1 9 0 2 1 0
This tells us to keep the first 10 elements, then we replace element 1 with element 11, element 5 with element 12, element 7 with element 13, then drop the 14th element, etc. Replacements become less frequent as i becomes much larger than n.
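Why this yields a uniform sample may not be obvious. Here is a quick pure-R simulation (no XML involved, parameters chosen just for illustration) checking that each of N streamed records ends up in the sample with probability n / N:

n <- 3L; N <- 10L; reps <- 10000L
counts <- integer(N)                 # how often each record is selected
set.seed(1)
for (r in seq_len(reps)) {
    keep <- integer(n)               # the current reservoir of record indices
    for (i in seq_len(N)) {
        if (i <= n) {
            keep[i] <- i             # always keep the first n records
        } else if (runif(1) < n / i) {
            keep[sample(n, 1)] <- i  # replace a random earlier selection
        }
    }
    counts[keep] <- counts[keep] + 1L
}
round(counts / reps, 2)  # each entry should be close to n / N = 0.3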
We use this as part of a product handler, which pre-allocates space for the results we're interested in; each time a 'product' node is encountered, we test whether to select it and, if so, record its values at the appropriate location:
sku <- character(n)
product <- function(p) {
    i <- select()
    if (i)
        sku[[i]] <<- xmlValue(p[["sku"]])
    NULL
}
The 'select' and 'product' handlers are combined with a function (get) that lets us retrieve the current values, and all three are placed in a closure, giving a kind of factory pattern that encapsulates the variables n, i, and sku:
sampler <- function(n)
{
    force(n)  # otherwise lazy evaluation could lead to surprises
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) {
            i
        } else if (runif(1) < n / i) {
            sample(n, 1)
        } else {
            0
        }
    }
    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }
    list(product = product, get = function() list(sku = sku))
}
And then we're ready to go. Note that xmlTreeParse() would build the entire tree in memory; to actually stream the file, use xmlEventParse() with a branch handler, which hands complete 'product' nodes to our closure one at a time:
library(XML)
s <- sampler(1000)
xmlEventParse("foo.xml", handlers = list(), branches = s["product"])
as.data.frame(s$get())
Once the number of nodes processed i gets large relative to n, this will scale linearly with the size of the file, so you can get a sense for whether it performs well enough by starting with subsets of the original file.
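To try the pipeline end-to-end without a large file, here is a self-contained sketch that writes a small document to a temporary path and runs the sampler over it. It assumes the XML package's xmlEventParse() with branch handlers (which receive complete nodes); check the exact arguments against your installed version:

library(XML)

# same closure as above, compressed for brevity
sampler <- function(n) {
    force(n)
    i <- 0L
    sku <- character(n)
    select <- function() {
        i <<- i + 1L
        if (i <= n) i
        else if (runif(1) < n / i) sample(n, 1)
        else 0
    }
    product <- function(p) {
        j <- select()
        if (j)
            sku[[j]] <<- xmlValue(p[["sku"]])
        NULL
    }
    list(product = product, get = function() list(sku = sku))
}

# write a toy file with 100 product records
f <- tempfile(fileext = ".xml")
writeLines(c('<?xml version="1.0"?>', '<products>',
             sprintf('<product><sku>%d</sku></product>', 1:100),
             '</products>'), f)

s <- sampler(10L)
xmlEventParse(f, handlers = list(), branches = s["product"])
as.data.frame(s$get())  # a data.frame of 10 sampled sku values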
Here's an example based on the XML file you provided.
xml <- '<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<sku>967190</sku>
<productId>98611</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
<product>
<sku>967191</sku>
<productId>98612</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
<product>
<sku>967192</sku>
<productId>98613</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
</products>
'
library(XML)
# parse the document
p <- xmlParse(xml)
# extract all product nodes
nodes <- xpathApply(p, '//product')
# return a random sample of nodes
nodes[sample(seq_along(nodes), 2)]
Here's the result:
> nodes[sample(seq_along(nodes), 2)]
[[1]]
<product>
<sku>967191</sku>
<productId>98612</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
[[2]]
<product>
<sku>967190</sku>
<productId>98611</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
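If the goal is a data.frame rather than raw nodes, the XML package's xmlToDataFrame() can convert the sampled nodes directly. A sketch (the nodes= argument is my assumption about the interface; verify against your version of the package):

library(XML)
xml <- '<products>
  <product><sku>967190</sku><productId>98611</productId></product>
  <product><sku>967191</sku><productId>98612</productId></product>
  <product><sku>967192</sku><productId>98613</productId></product>
</products>'
p <- xmlParse(xml)
nodes <- xpathApply(p, '//product')
# convert a random sample of two nodes straight to a data.frame
xmlToDataFrame(nodes = nodes[sample(seq_along(nodes), 2)])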