I'm a novice to this kind of javascript, so I'll give a brief explanation:
I have a web scraper built in Nodejs
that gathers (quite a bit of) data, processes it with Cheerio
(basically jQuery
for Node
) creates an object then uploads it to mongoDB.
It works just fine, except for on larger sites. What's appears to be happening is:
- I give the scraper an online store's URL to scrape
- Node goes to that URL and retrieves anywhere from 5,000 - 40,000 product urls to scrape
- For each of these new URLs, Node's
request
module gets the page source then loads up the data to Cheerio
.
- Using Cheerio I create a JS object which represents the product.
- I ship the object off to MongoDB where it's saved to my database.
As I say, this happens for thousands of URLs and once I get to, say, 10,000 urls loaded I get errors in node. The most common is:
Node: Fatal JS Error: Process out of memory
Ok, here's the actual question(s):
I think this is happening because Node's garbage cleanup isn't working properly. It's possible that, for example, the request
data scraped from all 40,000 urls is still in memory, or at the very least the 40,000 created javascript objects may be. Perhaps it's also because the MongoDB connection is made at the start of the session and is never closed (I just close the script manually once all the products are done). This is to avoid opening/closing the connection it every single time I log a new product.
To really ensure they're cleaned up properly (once the product goes to MongoDB I don't use it anymore and can be deleted from memory) can/should I just simply delete it from memory, simply using delete product
?
Moreso (I'm clearly not across how JS handles objects) if I delete one reference to the object is it totally wiped from memory, or do I have to delete all of them?
For instance:
var saveToDB = require ('./mongoDBFunction.js');
function getData(link){
request(link, function(data){
var $ = cheerio.load(data);
createProduct($)
})
}
function createProduct($)
var product = {
a: 'asadf',
b: 'asdfsd'
// there's about 50 lines of data in here in the real products but this is for brevity
}
product.name = $('.selector').dostuffwithitinjquery('etc');
saveToDB(product);
}
// In mongoDBFunction.js
exports.saveToDB(item){
db.products.save(item, function(err){
console.log("Item was successfully saved!");
delete item; // Will this completely delete the item from memory?
})
}
delete
in javascript is NOT used to delete variables or free memory. It is ONLY used to remove a property from an object. You may find this article on the delete
operator a good read.
You can remove a reference to the data held in a variable by setting the variable to something like null
. If there are no other references to that data, then that will make it eligible for garbage collection. If there are other references to that object, then it will not be cleared from memory until there are no more references to it (e.g. no way for your code to get to it).
As for what is causing the memory accumulation, there are a number of possibilities and we can't really see enough of your code to know what references could be held onto that would keep the GC from freeing up things.
If this is a single, long running process with no breaks in execution, you might also need to manually run the garbage collector to make sure it gets a chance to clean up things you have released.
Here's are a couple articles on tracking down your memory usage in node.js: http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/ and https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/.
JavaScript has a garbage collector that automatically track which variable is "reachable". If a variable is "reachable", then its value won't be released.
For example if you have a global variable var g_hugeArray and you assign it a huge array, you actually have two JavaScript object here: one is the huge block that holds the array data. Another is a property on the window object whose name is "g_hugeArray" that points to that data. So the reference chain is: window -> g_hugeArray -> the actual array.
In order to release the actual array, you make the actual array "unreachable". you can break either link the above chain to achieve this. If you set g_hugeArray to null, then you break the link between g_hugeArray and the actual array. This makes the array data unreachable thus it will be released when the garbage collector runs. Alternatively, you can use "delete window.g_hugeArray" to remove property "g_hugeArray" from the window object. This breaks the link between window and g_hugeArray and also makes the actual array unreachable.
The situation gets more complicated when you have "closures". A closure is created when you have a local function that reference a local variable. For example:
function a()
{
var x = 10;
var y = 20;
setTimeout(function()
{
alert(x);
}, 100);
}
In this case, local variable x is still reachable from the anonymous time out function even after function "a" has returned. If without the timeout function, then both local variable x and y will become unreachable as soon as function a returns. But the existence of the anonymous function change this. Depending on how the JavaScript engine is implemented, it may choose to keep both variable x and y (because it doesn't know whether the function will need y until the function actually runs, which occurs after function a returns). Or if it is smart enough, it can only keep x. Imagine that if both x and y points to big things, this can be a problem. So closure is very convenient but at times it is more likely to cause memory issues and can make it more difficult to track memory issues.
I faced same problem in my application with similar functionality. I've been looking for memory leaks or something like that. The size of consumed memory my process has reached to 1.4 GB and depends on the number of links that must be downloaded.
The first thing I noticed was that after manually running the Garbage Collector, almost all memory was freed. Each page that I downloaded took about 1 MB, was processed and stored in the database.
Then I install heapdump and looked at the snapshot of the application. More information about memory profiling you can found at Webstorm Blog.
My guess is that while the application is running, the GC does not start. To do this, I began to run application with the flag --expose-gc
, and began to run GC manually at the time of implementation of the program.
const runGCIfNeeded = (() => {
let i = 0;
return function runGCIfNeeded() {
if (i++ > 200) {
i = 0;
if (global.gc) {
global.gc();
} else {
logger.warn('Garbage collection unavailable. Pass --expose-gc when launching node to enable forced garbage collection.');
}
}
};
})();
// run GC check after each iteration
checkProduct(product._id)
.then(/* ... */)
.finally(runGCIfNeeded)