I have a script to scrape ~1000 webpages. I'm using Promise.all to fire them all off together, and it resolves when all pages are done:
Promise.all(urls.map(url => scrap(url)))
.then(results => console.log('all done!', results));
This is sweet and correct, except for one thing - the machine runs out of memory because of the concurrent requests. I use jsdom for scraping, and it quickly takes up a few GB of memory, which is understandable considering it instantiates hundreds of window objects.
I have an idea for a fix, but I don't like it: change the control flow to not use Promise.all and instead chain my promises:
let results = {};

urls.reduce((prev, cur) =>
    prev
        .then(() => scrap(cur))
        .then(result => results[cur] = result)
        // ^ not so nice.
    , Promise.resolve())
    .then(() => console.log('all done!', results));
This is not as good as Promise.all... it's not performant because the requests run one after another instead of in parallel, and the returned values have to be stashed in an outer variable for later processing.
Any suggestions? Should I improve the control flow, should I reduce memory usage in scrap(), or is there a way to let Node throttle memory allocation?
You are trying to run 1000 web scrapes in parallel. You will need to pick some number significantly less than 1000 and run only N at a time so you consume less memory while doing so. You can still use a promise to keep track of when they are all done.
Bluebird's Promise.map() can do that for you by just passing a concurrency value as an option. Or, you could write it yourself.
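For example, a minimal sketch using Bluebird (assuming scrap(url) returns a promise as in the question; the concurrency value of 10 is just an illustrative choice):

const Promise = require('bluebird');

// Map over all urls, keeping at most 10 scrapes in flight at any one time.
// The results come back in the same order as the input urls.
Promise.map(urls, url => scrap(url), { concurrency: 10 })
    .then(results => console.log('all done!', results));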
I have an idea for a fix, but I don't like it: change the control flow to not use Promise.all and instead chain my promises:
What you want is N operations in flight at the same time. Sequencing is a special case where N = 1, which would often be much slower than doing some of them in parallel (perhaps with N = 10).
This is not as good as Promise.all... it's not performant because the requests run one after another instead of in parallel, and the returned values have to be stashed in an outer variable for later processing.
If the stored values are part of your memory problem, you may have to store them out of memory somewhere anyway. You will have to analyze how much memory the stored results are using.
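As one possibility (this is an illustrative sketch, not something prescribed above; the file name and the scrapAndPersist helper are hypothetical), you could persist each result to disk as it arrives so the results never accumulate in process memory:

const fs = require('fs');

// Hypothetical helper: scrape a url and append the result to a
// newline-delimited JSON file instead of keeping it in memory.
const out = fs.createWriteStream('results.ndjson');

function scrapAndPersist(url) {
    return scrap(url).then(result => {
        out.write(JSON.stringify({ url, result }) + '\n');
        return url; // keep only the url in memory, not the full result
    });
}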
Any suggestions? Should I improve the control flow, should I reduce memory usage in scrap(), or is there a way to let Node throttle memory allocation?
Use Bluebird's Promise.map() or write something similar yourself. Writing something that runs up to N operations in parallel and keeps all the results in order is not rocket science, but it is a bit of work to get it right. I've presented it before in another answer, but can't seem to find it right now. I will keep looking.
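In the meantime, here is a minimal sketch of such a helper (the name mapConcurrent is just a placeholder): it starts up to limit operations at once, launches the next one whenever one finishes, and keeps the results in input order:

function mapConcurrent(items, limit, fn) {
    return new Promise((resolve, reject) => {
        const results = new Array(items.length);
        let nextIndex = 0;    // next item to start
        let inFlight = 0;     // operations currently running
        let failed = false;   // stop launching new work after a rejection

        function launchNext() {
            // start operations until we hit the limit or run out of items
            while (inFlight < limit && nextIndex < items.length) {
                const index = nextIndex++;
                inFlight++;
                Promise.resolve(fn(items[index])).then(result => {
                    results[index] = result;
                    inFlight--;
                    if (!failed) launchNext();
                }, err => {
                    failed = true;
                    reject(err);
                });
            }
            // everything has been started and has finished
            if (inFlight === 0 && nextIndex >= items.length) {
                resolve(results);
            }
        }
        launchNext();
    });
}

// usage with the question's scrap(), running at most 10 scrapes at a time
mapConcurrent(urls, 10, scrap)
    .then(results => console.log('all done!', results));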
Found my prior related answer here: Make several requests to an API that can only handle 20 request a minute