Question:
I've got some data in JSON format that I want to do some visualization on. The data (approximately 10MB of JSON) loads pretty fast, but reshaping it into a usable form takes a couple of minutes for just under 100,000 rows. I have something that works, but I think it can be done much better.
It may be easiest to understand by starting with my sample data.
Assuming you run the following command in /tmp:
curl http://public.west.spy.net/so/time-series.json.gz \
| gzip -dc - > time-series.json
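If you would rather stay inside R, the same fetch can be done there. A minimal sketch (it writes the decompressed file where the code below expects it):
url <- "http://public.west.spy.net/so/time-series.json.gz"
download.file(url, "/tmp/time-series.json.gz", mode="wb")
# gzfile() decompresses transparently while reading
con <- gzfile("/tmp/time-series.json.gz", "rt")
writeLines(readLines(con), "/tmp/time-series.json")
close(con)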
You should be able to see my desired output (after a while) here:
require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows

data <- do.call(rbind,
                lapply(trades,
                       function(row)
                         data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                    price=unlist(row$value)[1],
                                    volume=unlist(row$value)[2])))
someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
days <- seq(min(data$date), max(data$date), by='month')

smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)
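For anyone trying alternatives, the slow reshaping step can be timed in isolation with system.time. A minimal sketch on an arbitrary 1,000-row subset:
# Time only the conversion, not the JSON parse (subset size is arbitrary)
system.time(
    do.call(rbind,
            lapply(trades[1:1000],
                   function(row)
                     data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                price=unlist(row$value)[1],
                                volume=unlist(row$value)[2])))
)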
Answer 1:
You can get a 40x speedup by using plyr. Here is the code and the benchmarking comparison. The conversion to date can be done once you have the data frame, so I have removed it from the code to make the comparison apples-to-apples. I am sure a faster solution exists.
f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[, -c(1, 2)]

f_dustin = function(n) do.call(rbind, lapply(trades[1:n],
    function(row) data.frame(
        date   = unlist(row$key)[2],
        price  = unlist(row$value)[1],
        volume = unlist(row$value)[2]))
)

f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n],
    function(x) {
        list(date = x$key[2], price = x$value[1], volume = x$value[2])
    })))

f_mbq = function(n) data.frame(
    t(sapply(trades[1:n], '[[', 'key')),
    t(sapply(trades[1:n], '[[', 'value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
                      replications = 50)
          test elapsed   relative
f_ramnath(100)   0.144   3.692308
 f_dustin(100)   6.244 160.102564
f_mrflick(100)   0.039   1.000000
    f_mbq(100)   0.074   1.897436
EDIT: MrFlick's solution leads to an additional 3.5x speedup. I have updated my tests.
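As noted above, the date conversion was dropped from the benchmarks; once the frame exists it can be done in one vectorized call afterwards. A minimal sketch (the column layout left by f_ramnath is an assumption and may need checking):
df <- f_ramnath(length(trades))
names(df) <- c('date', 'price', 'volume')  # assumed layout after dropping cols 1-2
df$date <- as.POSIXct(strptime(df$date, "%FT%X"))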
Answer 2:
I received another transformation from MrFlick on IRC that is significantly faster and worth mentioning here:
data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) list(date=x$key[2],
                                                      price=x$value[1],
                                                      volume=x$value[2]))))
It appears to be significantly faster because it avoids building a separate inner data frame for each row.
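One caveat: rbind over lists produces list-columns, so the columns may still need unlisting before plotting or arithmetic. A sketch:
# Convert list-columns to atomic vectors (same column names as above)
data$date   <- as.POSIXct(strptime(unlist(data$date), "%FT%X"))
data$price  <- as.numeric(unlist(data$price))
data$volume <- as.numeric(unlist(data$volume))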
Answer 3:
You are doing vectorized operations on single elements, which is very inefficient. Price and volume can be extracted like this:
t(sapply(trades, '[[', 'value'))
And dates like this:
strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X')
Now all that remains is some sugar, and the complete code looks like this:
data <- data.frame(
    strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X'),
    t(sapply(trades, '[[', 'value')))
names(data) <- c('date', 'price', 'volume')
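The [c(FALSE, TRUE)] index relies on logical recycling: the FALSE, TRUE pattern repeats along the vector, keeping every second element (here, the date part of each key). A tiny illustration on a toy vector:
x <- c('a', '2010-01-01T00:00:00', 'b', '2010-01-02T00:00:00')
x[c(FALSE, TRUE)]   # recycled logical index -> every second element
# [1] "2010-01-01T00:00:00" "2010-01-02T00:00:00"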
On my notebook, the whole set gets converted in about 0.7s, while the first 10k rows (10%) take roughly 8s with the original algorithm.
Answer 4:
Is batching an option? Perhaps process 1,000 rows at a time, depending on how deep your JSON is. Do you really need to transform all the data? I am not sure about R or exactly what you are dealing with, but I am thinking of a generic approach.
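In R, the batching idea might look like this (a sketch; the chunk size is arbitrary, and it reuses the list-based row builder from the answers above):
chunk_size <- 1000
# Split row indices into consecutive chunks of chunk_size
idx <- split(seq_along(trades), ceiling(seq_along(trades) / chunk_size))
parts <- lapply(idx, function(i)
    do.call(rbind, lapply(trades[i],
        function(x) list(date=x$key[2], price=x$value[1], volume=x$value[2]))))
data <- as.data.frame(do.call(rbind, parts))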
Also take a look at Jackson (http://jackson.codehaus.org/), a high-performance JSON processor.