I am new to the asynchronous control flow of Node.js. My scraper works, but I can't help thinking that there must be a better (more elegant?) way of doing it. I am open to using other Node libraries. More specifically:
- I feel that the current control flow (with all the callbacks) is hard to read, but maybe that's just because this is new to me. There seem to be several control-flow libraries; should I be using one?
- Originally, my code made all the `request` calls first, parsed and saved everything in a `records = []` array, then proceeded to write everything to file. I changed the code here so that it does request - parse - append for each record in the `for` loop. I would like to confirm whether this approach is better with a large number of requests.
- Writing the records in JSON format caused some pain. Currently I have to call a `startStep` to append the `[` first, then use `(flag? function(){flag = false; return "";}() : ",")` to decide whether it's the first record (if not, append a comma first), then append all records, then append `]` at the end. Again, I'm curious whether there is a better way of doing this.
- To iterate, I am declaring the list in the global scope and using `list.shift()` to get the next item. It seems to be fine now, but I think this will cause side effects at a larger scale. My intuition is that I should pass the array as an argument. Again, I would like to get confirmation on this point.

    var fs = require('fs');
    var request = require("request");
    var cheerio = require("cheerio");

    function appendFile(_input, callback){
        fs.appendFile("./TED/alltalk3.json", _input, function(err){
            if(err){
                console.log("input is" + _input + "error is :" + err);
            }
            else{
                callback();
            }
        });
    }

    function startStep(){
        appendFile("[", function(){
            console.log("--start--");
            getOneDay(list.shift());
        })
    }

    function finalStep(){
        appendFile("]", function(){
            console.log("--end--");
            return;
        })
    }

    var flag = true; // first item no comma

    function getOneDay(itm){
        if(itm){
            request("http://www.ted.com/talks/view/id/" + itm, function(error, response, body) {
                var $ = cheerio.load(body)
                var record = {};
                record["title"] = $("#altHeadline").text();
                appendFile(
                    (flag? function(){flag = false; return "";}() : ",") + (JSON.stringify(record, null, 4)),
                    function(){
                        return getOneDay(list.shift());
                    }
                )
            });
        }
        else{
            return finalStep();
        }
    }

    var list = [];
    for(var i = 1; i < 5; i++){
        list.push(i);
    }
    startStep();

What you're trying to achieve with your code is a finite state machine (FSM), a common pattern in asynchronous programming. Some languages have built-in support for it. For example, C# 5.0 has `async/await`, which dramatically simplifies asynchronous programming by providing a familiar linear code flow.
There have already been some attempts to bring `async/await` to JavaScript. I believe full support for it in Node.js and all major web browsers is just a matter of time.
Until then, the most common pattern for asynchronous code flow in JavaScript is the Promise. It represents the result of an operation that will complete in the future, and lets you take an action upon its completion via a callback function. I suggest you stick to this pattern in your code.
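To make the Promise suggestion concrete, here is a minimal sketch of processing the ids one after another with a promise chain. `fetchRecord` is a hypothetical stand-in for the `request` plus `cheerio` step (it only fakes a delay), and `scrapeAll` is an illustrative name, not something from the original code. The list is passed in as an argument, and the records are serialized in one go, which assumes they all fit in memory.

```javascript
// Hypothetical stand-in for request + cheerio: resolves with a record after a
// short delay. A real scraper would fetch and parse the page here instead.
function fetchRecord(id) {
  return new Promise(function (resolve) {
    setTimeout(function () {
      resolve({ title: 'talk ' + id });
    }, 10);
  });
}

// Processes the ids strictly one after another. The list is an argument,
// so nothing depends on (or mutates) a global array.
function scrapeAll(ids) {
  var records = [];
  // Start from an already-resolved promise and chain one step per id.
  return ids.reduce(function (chain, id) {
    return chain
      .then(function () { return fetchRecord(id); })
      .then(function (record) { records.push(record); });
  }, Promise.resolve()).then(function () {
    return records;
  });
}

scrapeAll([1, 2, 3, 4]).then(function (records) {
  // Serializing the whole array at once removes the "[", "," and "]" bookkeeping.
  console.log(JSON.stringify(records, null, 4));
}).catch(function (err) {
  console.error('scrape failed: ' + err);
});
```

A single `.catch` at the end of the chain handles a failure in any step, which is one reason the chained form tends to read more linearly than nested callbacks.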
I highly recommend taking a look at https://github.com/caolan/async, especially its forEachSeries method. It looks like exactly what you need.
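As a rough illustration of what series iteration does, the sketch below uses a small hand-written `eachSeries` helper. It is a stand-in written for this example, not the async library's implementation; with the library installed, the corresponding call would be `async.eachSeries(ids, worker, done)`. The `worker` function and the `seen` array are hypothetical placeholders for the "request, parse, append" step.

```javascript
// Hand-written stand-in for series iteration: calls worker(item, next) for one
// item at a time and calls done(err) after the last item or on the first error.
function eachSeries(items, worker, done) {
  var index = 0;
  function next(err) {
    if (err || index >= items.length) {
      return done(err || null);
    }
    var item = items[index++];
    worker(item, next);
  }
  next();
}

// Hypothetical worker standing in for "request, parse, append".
var seen = [];
function worker(id, callback) {
  setTimeout(function () {
    seen.push('talk ' + id);
    callback(); // pass an Error here to stop the iteration early
  }, 10);
}

eachSeries([1, 2, 3, 4], worker, function (err) {
  if (err) {
    return console.error('failed: ' + err);
  }
  console.log(seen.join(', '));
});
```

Because the array is handed to the helper as an argument and never shifted, the global `list` and the `list.shift()` calls are no longer needed.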
I can also recommend using the synchronous fs methods in this particular case. Blocking methods are not recommended for services, but for shell-like scripts they are fine.
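A minimal sketch of the synchronous approach, assuming the records have already been collected into an array. The sample records and the temp-directory output path are made up for this example; the original script writes to `./TED/alltalk3.json`.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical records; in the scraper these would come from the parsed pages.
var records = [{ title: 'talk 1' }, { title: 'talk 2' }];

// Assumed output location in the temp directory, used only for this example.
var outFile = path.join(os.tmpdir(), 'alltalk-example.json');

// Blocking write: returns after the write call has completed, or throws on failure.
fs.writeFileSync(outFile, JSON.stringify(records, null, 4));

// Blocking read to confirm the file parses back into the same data.
var roundTrip = JSON.parse(fs.readFileSync(outFile, 'utf8'));
console.log(roundTrip.length + ' records written to ' + outFile);
```

Since `JSON.stringify` emits the brackets and commas itself, the output is valid JSON without the `flag` bookkeeping.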