Performance problem transforming JSON data

Published: 2020-06-29 02:28

Question:

I've got some data in JSON format that I want to do some visualization on. The data (approximately 10MB of JSON) loads pretty fast, but reshaping it into a usable form takes a couple of minutes for just under 100,000 rows. I have something that works, but I think it can be done much better.

It may be easiest to understand by starting with my sample data.

Assuming you run the following command in /tmp:

curl http://public.west.spy.net/so/time-series.json.gz \
    | gzip -dc - > time-series.json

You should be able to reproduce my desired output (it takes a while) with the following:

require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows


data <- do.call(rbind,
                lapply(trades,
                       function(row)
                           data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                      price=unlist(row$value)[1],
                                      volume=unlist(row$value)[2])))

someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
smoothScatter(data, colramp=someColors, xaxt="n")

days <- seq(min(data$date), max(data$date), by = 'month')
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)

Answer 1:

You can get a roughly 40x speedup by using plyr. Here is the code and the benchmark comparison. The conversion to date can be done once you have the data frame, so I removed it from the code to keep the comparison apples-to-apples. I am sure an even faster solution exists.

f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[,-c(1, 2)]
f_dustin  = function(n) do.call(rbind, lapply(trades[1:n], 
                function(row) data.frame(
                    date   = unlist(row$key)[2],
                    price  = unlist(row$value)[1],
                    volume = unlist(row$value)[2]))
                )
f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n], 
               function(x){
                   list(date=x$key[2], price=x$value[1], volume=x$value[2])})))

f_mbq = function(n) data.frame(
          t(sapply(trades[1:n],'[[','key')),    
          t(sapply(trades[1:n],'[[','value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
    replications = 50)

test            elapsed    relative
f_ramnath(100)    0.144    3.692308
f_dustin(100)     6.244  160.102564
f_mrflick(100)    0.039    1.000000
f_mbq(100)        0.074    1.897436

EDIT. MrFlick's solution leads to an additional 3.5x speedup. I have updated my tests.
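As noted above, the date conversion can be done in a single vectorized call after the data frame exists, rather than once per row. A minimal sketch (my illustration, assuming the second element of each row's `key` is the timestamp string, as in the sample data):

```r
# convert all timestamps at once instead of inside the per-row loop
dates <- strptime(sapply(trades, function(row) row$key[[2]]), "%FT%X")
data$date <- dates
```

One `strptime` call over a character vector is far cheaper than ~100,000 calls on single strings.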



Answer 2:

I received another transformation from MrFlick on IRC that was significantly faster and is worth mentioning here:

data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) {list(date=x$key[2],
                                                   price=x$value[1],
                                                   volume=x$value[2])})))

It seems to be significantly faster because it avoids building a small data frame for every row.
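One caveat (my observation, not part of the answer): because `rbind` is applied to lists here, the resulting data frame has list columns. They can be flattened afterwards, e.g.:

```r
# sketch: flatten the list columns produced by rbind-ing lists,
# then parse the dates once over the whole column
data[] <- lapply(data, unlist)
data$date <- strptime(data$date, "%FT%X")
```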



Answer 3:

You are applying vectorized operations to single elements, which is very inefficient. Price and volume can be extracted like this:

t(sapply(trades,'[[','value'))

And dates like this:

strptime(sapply(trades,'[[','key')[c(F,T)],'%FT%X')
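The `[c(F,T)]` index works because R recycles logical indices, so it selects every second element of the flattened keys (here, the timestamp half of each key). A small self-contained illustration:

```r
# a recycled logical index picks every second element
x <- c("id1", "2009-07-21T10:21:00", "id2", "2009-07-22T11:05:00")
x[c(FALSE, TRUE)]
# returns c("2009-07-21T10:21:00", "2009-07-22T11:05:00")
```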

Now add a bit of syntactic sugar, and the complete code looks like this:

data <- data.frame(
    strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X'),
    t(sapply(trades, '[[', 'value')))
names(data) <- c('date', 'price', 'volume')

On my notebook, the whole set gets converted in about 0.7 s, while the first 10k rows (10%) take roughly 8 s with the original algorithm.



Answer 4:

Is batching an option? Processing 1000 rows at a time might help, depending on how deep your JSON is. And do you really need to transform all of the data up front? I am not sure about R or exactly what you are dealing with, but I am thinking of a generic approach.

Also take a look at this: http://jackson.codehaus.org/ , a high-performance JSON processor.
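A minimal sketch of the batching idea in R (my illustration, not from the answer), using the fast value extraction from Answer 3 and a hypothetical chunk size of 1000:

```r
# process `trades` in chunks of 1000 rows, then bind the pieces once at the end
chunk_size <- 1000
chunks <- split(seq_along(trades), ceiling(seq_along(trades) / chunk_size))
pieces <- lapply(chunks, function(idx)
    data.frame(t(sapply(trades[idx], '[[', 'value'))))
data <- do.call(rbind, pieces)
names(data) <- c('price', 'volume')
```

Chunking mainly helps memory pressure and lets you report progress; the single `do.call(rbind, ...)` at the end avoids growing the data frame inside the loop.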



Tags: json, r