Per a prior thread here:
Node async loop - how to make this code run in sequential order?
...I'm looking for broader advice on processing large data upload files.
Scenario:
A user uploads a very large CSV file with hundreds of thousands to millions of rows. It is streamed into an endpoint using multer:
const storage = multer.memoryStorage();
const upload = multer({ storage: storage });
router.post("/", upload.single("upload"), (req, res) => {
  //...
});
Each row is transformed into a JSON object. That object is then mapped into several smaller ones, which need to be inserted into several different tables that are spread across, and accessed by, various microservice containers.
async.forEachOfSeries(data, (line, key, callback) => {
  let model = splitData(line);
  // save model.record1, model.record2, etc. sequentially, then:
  callback();
});
It's obvious I'm going to run into memory limitations with this approach. What is the most efficient way to do this?
To avoid memory issues you need to process the file using streams - in plain words, incrementally.
You can do this with a combination of a CSV stream parser (to turn the binary contents into CSV rows) and through2 (a stream utility that lets you control the flow of the stream).
The Process
The process goes as follows:

- The upload is piped into the CSV parser, which turns the binary contents into individual rows.
- Each row is piped into through2, which pauses the stream while you process/save that row.
- Once the row has been saved, you call cb() to move on to the next item.

I'm not familiar with multer, but here's an example that uses a stream from a File.
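A minimal sketch of that, assuming the csv-parser and through2 npm packages and a hypothetical saveRow() helper standing in for the real inserts into your tables/services:

const fs = require("fs");
const csv = require("csv-parser");    // assumption: any streaming CSV parser with a row-object API works
const through2 = require("through2");

// Hypothetical async save - replace with the real writes for model.record1, model.record2, etc.
function saveRow(row) {
  return Promise.resolve(row);
}

fs.createReadStream("foo.csv")
  .pipe(csv())                          // emits one parsed row object at a time
  .pipe(through2.obj((row, enc, cb) => {
    // The stream is paused here until cb() is called,
    // so only one row is in flight at any moment.
    saveRow(row)
      .then(() => cb())                 // done - move on to the next row
      .catch(cb);                       // propagate the error to the stream
  }))
  .on("data", () => { /* nothing is pushed downstream; this just keeps the stream flowing */ })
  .on("end", () => console.log("All rows processed"))
  .on("error", (err) => console.error("Processing failed:", err));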
The example foo.csv CSV is this:
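(a hypothetical stand-in; any small CSV with a header row works)

name,age,city
Alice,34,Berlin
Bob,41,Lisbon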
Notes

- Once a row is processed it goes out of scope/becomes unreachable, hence it's eligible for Garbage Collection. This is what makes this approach so memory efficient. Read the Streams Handbook for more info on streams.
- You'll probably want to save more than one row at a time. In that case, collect the rows into an Array, process/save the entire Array and then call cb to move on to the next chunk - repeating the process (a sketch of this follows these notes).
- The end/error events are particularly useful for responding back whether the operation was a success or a failure.
- I haven't used or tested this with multer at all.
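A rough sketch of that batching variant, assuming a hypothetical saveBatch() bulk insert; it is meant as a drop-in replacement for the per-row through2 stage in the example above:

const through2 = require("through2");

// Hypothetical bulk insert - replace with the real writes to your tables/services.
function saveBatch(rows) {
  return Promise.resolve(rows.length);
}

const BATCH_SIZE = 500; // assumption: pick a size that suits your database
let batch = [];

const batcher = through2.obj(
  (row, enc, cb) => {
    batch.push(row);
    if (batch.length < BATCH_SIZE) return cb(); // keep collecting rows
    const rows = batch;
    batch = [];
    saveBatch(rows)
      .then(() => cb())   // chunk saved, resume the stream
      .catch(cb);
  },
  (cb) => {
    // flush: persist whatever is left over when the CSV ends
    if (batch.length === 0) return cb();
    saveBatch(batch).then(() => cb()).catch(cb);
  }
);

You would pipe the CSV parser into batcher in place of the per-row stage shown earlier.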