I have a dump of a Firebase database representing our Users table stored in JSON. I want to run some data analysis on it but the issue is that it's too big to load into memory completely and manipulate with pure JavaScript (or _ and similar libraries).
Up until now I've been using the JSONStream package to deal with my data in bite-sized chunks (it calls a callback once for each user in the JSON dump).
I've now hit a roadblock though because I want to filter my user ids based on their value. The "questions" I'm trying to answer are of the form "Which users x" whereas previously I was just asking "How many users x" and didn't need to know who they were.
The data format is like this:
{
  users: {
    123: {
      foo: 4
    },
    567: {
      foo: 8
    }
  }
}
What I want to do is essentially get the user ID (123 or 567 in the above) based on the value of foo. Now, if this were a small list it would be trivial to use something like _.each to iterate over the keys and values and extract the keys I want.
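For reference, the small-list case could be sketched as follows in plain JavaScript. The `findUserIds` helper, the `dump` sample object, and the `foo > 5` predicate are all hypothetical names and values chosen for illustration; the approach assumes the whole `users` object is already in memory.

```javascript
// Hypothetical in-memory sample with the same shape as the dump.
var dump = { users: { 123: { foo: 4 }, 567: { foo: 8 } } };

// Return the IDs of all users whose record satisfies the predicate.
// Object.keys gives the user IDs as strings.
function findUserIds(users, predicate) {
  return Object.keys(users).filter(function (uid) {
    return predicate(users[uid]);
  });
}

// Example filter (an assumption): users whose foo is greater than 5.
var matches = findUserIds(dump.users, function (user) {
  return user.foo > 5;
});
console.log(matches); // [ '567' ]
```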
Unfortunately, since it doesn't fit into memory, that doesn't work. With JSONStream I can iterate over it by using var parser = JSONStream.parse('users.*'); and piping it into a function that deals with it like this:
var fs = require('fs');
var JSONStream = require('JSONStream');

var parser = JSONStream.parse('users.*');
var stream = fs.createReadStream('my.json');
stream.pipe(parser);
parser.on('data', function(user) {
  // user is equal to { foo: bar } here
  // so it is trivial to do my filter
  // but I don't know which user ID owns the data
});
But the problem is that I don't have access to the key representing the star wildcard that I passed into JSONStream.parse. In other words, I don't know if { foo: bar } represents user 123 or user 567.
The question is twofold:
- How can I get the current path from within my callback?
- Is there a better way to be dealing with this JSON data that is too big to fit into memory?
I went ahead and edited JSONStream to add this functionality.
If anyone runs across this and wants to patch it similarly, you can replace line 83 which was previously
with this:
In the code sample from the original question, rather than user being equal to { foo: bar } in the callback, it will now be { uid: { foo: bar } }.
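With that shape, answering a "Which users x" question could look roughly like the sketch below. It assumes each 'data' event carries exactly one key (the user ID), as described above; `makeCollector` and the `foo > 5` predicate are hypothetical names for illustration, and the events are simulated with a plain array rather than a real parser.

```javascript
// Build a 'data' handler that records the IDs of matching users.
// Assumes each record has the patched shape { uid: { foo: bar } },
// i.e. a single key holding that user's data.
function makeCollector(predicate) {
  var ids = [];
  return {
    ids: ids,
    onData: function (record) {
      var uid = Object.keys(record)[0];
      if (predicate(record[uid])) {
        ids.push(uid);
      }
    }
  };
}

var collector = makeCollector(function (user) {
  return user.foo > 5;
});

// With the patched parser this would be: parser.on('data', collector.onData);
// Here the events are simulated so the sketch runs on its own.
[{ 123: { foo: 4 } }, { 567: { foo: 8 } }].forEach(function (record) {
  collector.onData(record);
});
console.log(collector.ids); // [ '567' ]
```

Only the matching IDs are kept in memory, so this should stay small as long as the filter is selective.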
Since this is a breaking change I didn't submit a pull request back to the original project but I did leave it in the issues in case they want to add a flag or option for this in the future.