I have an input file which may contain up to 1M records, and each record looks like this:
field1 field2 field3 \n
I want to read this input file and sort it based on field3
before writing it to another file.
Here is what I have so far:
var fs = require('fs'),
readline = require('readline'),
stream = require('stream');
var start = Date.now();
var outstream = new stream.Stream();
outstream.readable = true;
outstream.writable = true;
var rl = readline.createInterface({
input: fs.createReadStream('cross.txt'),
output: outstream,
terminal: false
});
rl.on('line', function(line) {
//var tmp = line.split("\t").reverse().join('\t') + '\n';
//fs.appendFileSync("op_rev.txt", tmp );
// this logic to reverse and then sort is too slow
});
rl.on('close', function() {
var closetime = Date.now();
console.log('Read entire file in', (closetime - start)/1000, 'secs');
});
I am basically stuck at this point. All I have is the ability to read from one file and write to another; is there a way to efficiently sort this data before writing it out?
A DB and sort-stream are fine solutions, but a DB might be overkill, and I think sort-stream eventually just sorts the entire file in an in-memory array (in the through end callback), so I expect performance to be roughly the same as the original solution.
(But I haven't run any benchmarks, so I might be wrong.)
So, just for the heck of it, I'll throw in another solution :)
EDIT:
I was curious to see how big a difference this would make, so I ran some benchmarks.
The results were surprising even to me: it turns out the sort -k3,3 solution is better by far, about 10x faster than the original solution (a simple in-memory array sort), while the nedb and sort-stream solutions are at least 18x slower than the original solution (i.e. at least 180x slower than sort -k3,3).
(See benchmark results below.)
If you're on a *nix machine (Unix, Linux, macOS, ...) you can simply use
sort -k3,3 yourInputFile > op_rev.txt
and let the OS do the sorting for you.
You'll probably get better performance, since the sorting is done natively.
Or, if you want to process the sorted output in Node:
var spawn = require('child_process').spawn,
sort = spawn('sort', ['-k3,3', './test.tsv']);
sort.stdout.on('data', function (data) {
// process data
data.toString()
.split('\n')
.map(line => line.split("\t"))
.forEach(record => console.info(`Record: ${record}`));
});
sort.on('exit', function (code) {
if (code) {
// handle error
}
console.log('Done');
});
// optional
sort.stderr.on('data', function (data) {
// handle error...
console.log('stderr: ' + data);
});
Hope this helps :)
EDIT: Adding some benchmark details.
Here are the results (running on a MacBook Pro):
- sort1 uses a straightforward approach, sorting the records in an in-memory array.
  Avg time: 35.6s (baseline)
- sort2 uses sort-stream, as suggested by Joe Krill.
  Avg time: 11.1m (about 18.7x slower)
  (I wonder why. I didn't dig in.)
- sort3 uses nedb, as suggested by Tamas Hegedus.
  Time: about 16m (about 27x slower)
- sort4 sorts only by executing sort -k3,3 input.txt > out4.txt in a terminal.
  Avg time: 1.2s (about 30x faster)
- sort5 uses sort -k3,3 and processes the output sent to stdout.
  Avg time: 3.65s (about 9.7x faster)
You can take advantage of streams for something like this. There are a few npm modules that will be helpful; first install them by running
npm install sort-stream csv-parse stream-transform
from the command line.
Then:
var fs = require('fs');
var sort = require('sort-stream');
var parse = require('csv-parse');
var transform = require('stream-transform');
// Create a readble stream from the input file.
fs.createReadStream('./cross.txt')
// Use `csv-parse` to parse the input using a tab character (\t) as the
// delimiter. This produces a record for each row which is an array of
// field values.
.pipe(parse({
delimiter: '\t'
}))
// Use `sort-stream` to sort the parsed records on the third field.
.pipe(sort(function (a, b) {
return a[2].localeCompare(b[2]);
}))
// Use `stream-transform` to transform each record (an array of fields) into
// a single tab-delimited string to be output to our destination text file.
.pipe(transform(function(row) {
return row.join('\t') + '\n';
}))
// And finally, output those strings to our destination file.
.pipe(fs.createWriteStream('./cross_sorted.txt'));
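One caveat about the comparator above: localeCompare gives lexicographic order, so if field 3 is numeric, '10' would sort before '2'. In that case, swap in a numeric comparator. This is a sketch with a hypothetical helper name and inline sample records:

```javascript
// Numeric comparator for the third field; assumes the values parse as numbers.
function compareByField3(a, b) {
  return Number(a[2]) - Number(b[2]);
}

var records = [
  ['x', 'y', '10'],
  ['p', 'q', '2']
];
records.sort(compareByField3);
console.log(records.map(function (r) { return r[2]; })); // [ '2', '10' ]
```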
You have two options, depending on how much data is being processed. (A count of 1M records with 3 columns doesn't say much about the actual amount of data.)
Load the data in memory, sort in place
var lines = [];
rl.on('line', function(line) {
lines.push(line.split("\t").reverse());
});
rl.on('close', function() {
lines.sort(function(a, b) { return compare(a[0], b[0]); });
// write however you want
fs.writeFileSync(
fileName,
lines.map(function(x) { return x.join("\t"); }).join("\n")
);
function compare(a, b) {
if (a < b) return -1;
if (a > b) return 1;
return 0;
}
});
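As a side note, you don't actually have to reverse each record just to sort on the third field; you can compare index 2 directly. A minimal sketch, with inline sample data standing in for the real input:

```javascript
var lines = [
  'a\tb\tzebra',
  'c\td\tapple'
];

// Sort on the third field (index 2) without reversing the records.
lines.sort(function (x, y) {
  var a = x.split('\t')[2];
  var b = y.split('\t')[2];
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
});

console.log(lines[0]); // the "apple" record sorts first
```

This keeps each output line identical to its input line, at the cost of splitting each record on every comparison; caching the split results would avoid that if it shows up in profiling.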
Load the data in a persistent database, read ordered
Use a database engine of your choice (for example NeDB, a pure JavaScript database for Node.js).
EDIT: It seems that NeDB keeps the whole database in memory; the file is only a persistent copy of the data. We'll have to search for another implementation. TingoDB looks promising.
// This code is only to give an idea, not tested in any way
var Datastore = require('nedb');
var db = new Datastore({
filename: 'path/to/temp/datafile',
autoload: true
});
rl.on('line', function(line) {
var tmp = line.split("\t").reverse();
db.insert({
field0: tmp[0],
field1: tmp[1],
field2: tmp[2]
});
});
rl.on('close', function() {
var cursor = db.find({})
.sort({ field0: 1 }); // sort by field0, ascending
var PAGE_SIZE = 1000;
paginate(0);
function paginate(i) {
cursor.skip(i).limit(PAGE_SIZE).exec(function(err, docs) {
// handle errors
var tmp = docs.map(function(o) {
return o.field0 + "\t" + o.field1 + "\t" + o.field2 + "\n";
});
fs.appendFileSync("op_rev.txt", tmp.join(""));
if (docs.length >= PAGE_SIZE) {
paginate(i + PAGE_SIZE);
} else {
// cleanup temp database
}
});
}
});
I had quite a similar issue and needed to perform an external sort.
After wasting some time on it, I figured out that I could load the data into a database and then query the desired data out of it.
It doesn't even matter if the inserts aren't ordered, as long as the query result can be.
Hope it works for you too.
To insert your data into a database, there are plenty of tools in Node to perform such a task. I have a pet project which does a similar job.
I'm also sure that if you search the subject, you'll find much more info.
Good luck.