TL;DR version: I want to avoid adding duplicate JavaScript objects to an array of similar objects, some of which might be really big. What's the best approach?
I have an application where I'm loading large amounts of JSON data into a JavaScript data structure. While it's a bit more complex than this, assume that I'm loading JSON into an array of JavaScript objects from a server through a series of AJAX requests, something like:
```js
var myObjects = [];

function processObject(o) {
    myObjects.push(o);
}

for (var x = 0; x < 1000; x++) {
    $.getJSON('/new_object.json', processObject);
}
```
To complicate matters, the JSON:
- is in an unknown schema
- is of arbitrary length (probably not enormous, but could be in the 100-200 KB range)
- might contain duplicates across different requests
My initial thought is to have an additional object to store a hash of each object (via `JSON.stringify`?) and check against it on each load, like this:
```js
var myHashMap = {};

function processObject(o) {
    var hash = JSON.stringify(o);
    // is it in the hashmap?
    if (!(myHashMap[hash])) {
        myObjects.push(o);
        // set the hashmap key for future checks
        myHashMap[hash] = true;
    }
    // else ignore this object
}
```
but I'm worried about having property names in `myHashMap` that might be 200 KB in length. So my questions are:
- Is there a better approach for this problem than the hashmap idea?
- If not, is there a better way to make a hash function for a JSON object of arbitrary length and schema than `JSON.stringify`?
- What are the possible issues with super-long property names in an object?
I'd suggest you create an MD5 hash of the `JSON.stringify(o)` output and store that in your hashmap, with a reference to your stored object as the data for the hash. And to make sure that there are no object key order differences in the `JSON.stringify()` result, you have to create a copy of the object that orders the keys.

Then, when each new object comes in, you check it against the hash map. If you find a match in the hash map, you compare the incoming object with the actual object you've stored to see if they are truly duplicates (since there can be MD5 hash collisions). That way, you have a manageable hash table (with only MD5 hashes in it).
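A minimal sketch of that scheme, assuming a Node-style `crypto` module for MD5 (a browser would need a standalone MD5 library) and the `jsonStringifyCanonical()` helper shown below:

```js
var crypto = require('crypto');  // assumption: Node-style environment; use an MD5 library in a browser

var myObjects = [];
var myHashMap = {};  // MD5 hex digest -> array of stored objects with that digest

function processObject(o) {
    var canonical = jsonStringifyCanonical(o);  // key-order-independent string form
    var hash = crypto.createHash('md5').update(canonical).digest('hex');
    var bucket = myHashMap[hash] || (myHashMap[hash] = []);
    // MD5 collisions are possible, so compare against the actual stored objects
    for (var i = 0; i < bucket.length; i++) {
        if (jsonStringifyCanonical(bucket[i]) === canonical) {
            return;  // true duplicate, ignore it
        }
    }
    bucket.push(o);
    myObjects.push(o);
}
```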
Here's code to create a canonical string representation of an object (including nested objects and objects within arrays) that handles object keys that might be in a different order than if you just called `JSON.stringify()`. The algorithm: recursively rebuild each value with its object keys sorted, then stringify the result.
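A sketch along those lines (this particular implementation is an assumption):

```js
function jsonStringifyCanonical(obj) {
    // Rebuild the value with object keys in sorted order at every level,
    // so equal objects always produce identical strings.
    function sortKeys(value) {
        if (Array.isArray(value)) {
            return value.map(sortKeys);  // keep array order, recurse into elements
        }
        if (value !== null && typeof value === 'object') {
            var sorted = {};
            Object.keys(value).sort().forEach(function (key) {
                sorted[key] = sortKeys(value[key]);
            });
            return sorted;
        }
        return value;  // primitives pass through unchanged
    }
    return JSON.stringify(sortKeys(obj));
}
```

For example, `jsonStringifyCanonical({foo: 1, bar: 2})` and `jsonStringifyCanonical({bar: 2, foo: 1})` both produce `'{"bar":2,"foo":1}'`.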
Test case for `jsonStringifyCanonical()` is here: https://jsfiddle.net/jfriend00/zfrtpqcL/

Note that `JSON.stringify(o)` alone doesn't guarantee the same key ordering, because `{foo: 1, bar: 2}` and `{bar: 2, foo: 1}` are equal as objects, but not as strings.
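For example:

```js
JSON.stringify({foo: 1, bar: 2});  // '{"foo":1,"bar":2}'
JSON.stringify({bar: 2, foo: 1});  // '{"bar":2,"foo":1}' (same data, different string)
```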
One possible optimization: instead of using `getJSON`, use `$.get` and pass `"text"` as the `dataType` param. Then you can use the result as your hash and convert it to an object afterwards.
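A minimal sketch of that optimization, reusing the loop from the question:

```js
var myObjects = [];
var seen = {};  // raw response text used as the hash key

function processText(text) {
    if (!seen[text]) {
        seen[text] = true;
        myObjects.push(JSON.parse(text));  // only parse when the text is new
    }
}

for (var x = 0; x < 1000; x++) {
    // dataType "text" makes jQuery hand back the raw string instead of a parsed object
    $.get('/new_object.json', processText, 'text');
}
```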
Actually, by writing that last sentence I thought of another solution:

- collect all results with `$.get` (as text) into an array
- sort the array with `Array.sort`
- walk the sorted array with a `for` loop, skipping adjacent duplicates
Again, though, different JSON strings can produce the same JavaScript object.
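A sketch of that idea, assuming all responses have already been collected into an array of strings (the helper name is hypothetical):

```js
// responses: array of raw JSON strings collected by the $.get callbacks
function dedupeAndParse(responses) {
    responses.sort();  // identical strings end up adjacent
    var objects = [];
    for (var i = 0; i < responses.length; i++) {
        // keep only the first string of each identical run
        if (i === 0 || responses[i] !== responses[i - 1]) {
            objects.push(JSON.parse(responses[i]));
        }
    }
    return objects;
}
```

Per the caveat above, this only catches byte-identical strings; canonicalizing first (as with `jsonStringifyCanonical()`) would also catch key-order variants.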