In short, I have a 20,000,000 line csv file that has different row lengths. This is due to archaic data loggers and proprietary formats. We get the end result as a csv file in the following format. My goal is to insert this file into a Postgres database. How can I do the following:
- Keep the first 8 columns and my last 2 columns, to have a consistent CSV file
- Add a new column to the csv file, either at the first or last position.
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50
Read a row with the csv module, then add your new field and write it out (also with csv).

Sorry, you will need to write some code with this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let the unhappy data into your database, you will never get all of it out.
Remember oddities about CSV: it's a mishmash of a bunch of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.
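For instance, you can let csv.Sniffer guess the dialect from a chunk of the file before committing to one set of options; a minimal sketch, where the file name is just a placeholder:

```python
import csv

# Peek at the first chunk of the file and let Sniffer guess the dialect.
with open("messy.csv", newline="") as f:
    sample = f.read(64 * 1024)          # a sample is enough for sniffing
    dialect = csv.Sniffer().sniff(sample)
    print("delimiter:", repr(dialect.delimiter), "quotechar:", repr(dialect.quotechar))
    print("looks like it has a header:", csv.Sniffer().has_header(sample))
```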
I recommend you get the file processing completely right before touching the database: your development time will be faster from not touching a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.
Your final code will probably look like this, but I can't be sure without knowing your data, especially how 'well behaved' it is:
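A sketch only: the file names, the value for the new column, and stopping on a short row are assumptions, not requirements from your data.

```python
import csv

IN_PATH = "messy.csv"       # placeholder -- your 20M-line file
OUT_PATH = "clean.csv"      # the consistent file you will load into Postgres
NEW_VALUE = "something"     # whatever your extra column should hold

with open(IN_PATH, newline="") as src, open(OUT_PATH, "w", newline="") as dst:
    reader = csv.reader(src, skipinitialspace=True)   # sample rows use ", " separators
    writer = csv.writer(dst)
    for lineno, row in enumerate(reader, start=1):
        if len(row) < 10:
            # fewer than 8 leading + 2 trailing fields: stop and look at it
            raise ValueError("line %d has only %d fields" % (lineno, len(row)))
        writer.writerow(row[:8] + row[-2:] + [NEW_VALUE])
```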
Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.
You can open the file as a text file and read the lines one at a time. Are there quoted or escaped commas that don't "split fields"? If not, you can do something like this:
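(Sketch only; the file names and the ['onemore'] placeholder are illustrations, not parts of your data.)

```python
# Plain line-at-a-time text processing -- no csv module needed if there is
# no quoting or escaping. File names are placeholders.
with open("messy.csv") as src, open("clean.csv", "w") as dst:
    for line in src:
        fields = [f.strip() for f in line.rstrip("\n").split(",")]
        fixed = fields[:8] + fields[-2:] + ["onemore"]
        dst.write(",".join(fixed) + "\n")
```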
I suspect that where I have + ['onemore'] you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be. Don't send each line separately with an insert to the DB -- 20 million inserts would take a long time. Rather, group the "made-consistent" lists, appending them to a temporary list -- each time that list's length hits, say, 1000, use an executemany to add all those entries.
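A rough sketch of that batching pattern, assuming psycopg2 as the Postgres driver; the connection string, table name, file name, and the 'onemore' placeholder are stand-ins, not guesses about your real setup:

```python
import psycopg2  # assumption: psycopg2 as the Postgres driver

BATCH_SIZE = 1000
# 8 kept + 2 kept + 1 added = 11 columns; the table name is a placeholder
SQL = "INSERT INTO readings VALUES ({})".format(", ".join(["%s"] * 11))

conn = psycopg2.connect("dbname=mydb user=me")    # placeholder connection
cur = conn.cursor()

batch = []
with open("messy.csv") as src:                    # the raw logger dump
    for line in src:
        # same line handling as the snippet above, now feeding batches
        fields = [f.strip() for f in line.rstrip("\n").split(",")]
        batch.append(fields[:8] + fields[-2:] + ["onemore"])
        if len(batch) >= BATCH_SIZE:
            cur.executemany(SQL, batch)
            batch.clear()

if batch:                                         # flush the final partial batch
    cur.executemany(SQL, batch)
conn.commit()
cur.close()
conn.close()
```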
Edit: to clarify, I don't recommend using csv to process a file you know is not in "proper" csv format: processing it directly gives you more direct control (especially as and when you discover other irregularities beyond the varying number of commas per line).

I would recommend using the csv module. Here's some code based off CSV processing that I've done elsewhere:
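A sketch of that approach; the file names, the "keep first 8 plus last 2" rule, and the added value are taken from the question rather than from my original code:

```python
import csv

def fix_rows(in_path, out_path, new_value):
    """Keep the first 8 and last 2 fields of each row and prepend new_value."""
    kept, skipped = 0, 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, skipinitialspace=True)   # sample data has ", " separators
        writer = csv.writer(dst)
        for row in reader:
            if len(row) < 10:       # not even 8 leading + 2 trailing fields
                skipped += 1
                continue
            writer.writerow([new_value] + row[:8] + row[-2:])
            kept += 1
    return kept, skipped

kept, skipped = fix_rows("messy.csv", "clean.csv", "run_01")   # placeholder names
print("wrote", kept, "rows, skipped", skipped, "short rows")
```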