We are in a situation where we have to periodically update large amounts of data (about 5 million records) in Firebase. At the moment we have a few JSON files that are around 1 GB in size.
Existing third-party solutions (here and here) have some reliability issues (importing object by object, or needing an open connection) and are quite disconnected from the Google Cloud Platform ecosystem. So I wonder if there is now an "official" way, e.g. using the new Google Cloud Functions, or a combination with App Engine / Google Cloud Storage / Google Cloud Datastore.
I would really like to avoid dealing with authentication, which is something Cloud Functions seems to handle well, but I assume such a function would time out (?)
With the new Firebase tooling available, how do I:
- Run long-running Cloud Functions to do the data fetching / inserts? (Does that make sense?)
- Get the JSON files into and out of somewhere inside Google Cloud Platform?
- Decide whether it makes sense to first put the large data into Google Cloud Datastore (e.g. because it is too expensive to store in Firebase), or whether the Firebase Realtime Database can be reliably treated as large data storage?
I am finally posting the answer, as it aligns with the new Google Cloud Platform tooling of 2017.
The newly introduced Google Cloud Functions have a limited run-time of approximately 9 minutes (540 seconds). However, Cloud Functions are able to create a Node.js read stream from Cloud Storage like so (@google-cloud/storage on npm):
var gcs = require('@google-cloud/storage')({
  // You don't need extra authentication when running the function
  // online in the same project
  projectId: 'grape-spaceship-123',
  keyFilename: '/path/to/keyfile.json'
});
// Reference an existing bucket.
var bucket = gcs.bucket('json-upload-bucket');
var remoteReadStream = bucket.file('superlarge.json').createReadStream();
Even though it is a remote stream, it is highly efficient. In tests I was able to parse JSON files larger than 3 GB in under 4 minutes while doing simple JSON transformations.
As we are working with Node.js streams now, a JSON streaming library (JSONStream on npm) can efficiently transform the data on the fly, and event streams (event-stream on npm) let us deal with the data asynchronously as if it were a large array:
var JSONStream = require('JSONStream')
var es = require('event-stream')

remoteReadStream
  .pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (data, callback) {
    console.error(data)
    // Insert data into Firebase here.
    callback(null, data) // ! Return data only if you want to make further transformations downstream.
  }))
In the callback at the end of the pipe, return only null (no data) to prevent processed objects from being buffered, which would otherwise cause a memory leak that blocks the whole function.
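For example, if the map stage is the last step and the writes happen inside it, a minimal sketch could look like the following. It assumes the firebase-admin SDK handles the writes and uses a placeholder /imports path; neither is prescribed by the original setup.

var admin = require('firebase-admin') // assumption: firebase-admin handles the writes

remoteReadStream
  .pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (record, callback) {
    admin.database().ref('/imports').push(record) // '/imports' is a placeholder path
      .then(function () { callback(null) }) // return null only: nothing is buffered downstream
      .catch(function (err) { callback(err) })
  }))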
If you do heavier transformations that require a longer run time, either use a "job db" in Firebase to track how far you are and only do, say, 100,000 transformations per invocation before calling the function again, or set up an additional function that listens on inserts into a "forimport db" and asynchronously transforms each raw JSON record into your target format for the production system. That way import and computation are split.
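A sketch of that second approach, assuming the 2017-era firebase-functions API, hypothetical /forimport and /production paths, and a placeholder transform() function:

var functions = require('firebase-functions')
var admin = require('firebase-admin')
admin.initializeApp(functions.config().firebase)

// Triggered whenever a raw record lands in the "forimport db";
// transforms it and writes the result into the production location.
exports.transformImport = functions.database.ref('/forimport/{pushId}')
  .onWrite(function (event) {
    var raw = event.data.val()
    if (!raw) return null // record was deleted, nothing to do
    return admin.database().ref('/production/' + event.params.pushId).set(transform(raw))
  })

function transform(raw) {
  // placeholder for the actual record transformation
  return raw
}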
Additionally, you can run Cloud Functions code in a Node.js App Engine app, but not necessarily the other way around.