Python - CSV: Large file with rows of different lengths

Posted 2019-02-15 11:46

In short, I have a 20,000,000-line CSV file with rows of different lengths. This is due to archaic data loggers and proprietary formats; we get the end result as a CSV file in the following format. My goal is to insert this file into a Postgres database. How can I do the following:

  • Keep the first 8 columns and the last 2 columns, so I end up with a consistent CSV file
  • Add a new column to the CSV file, either at the first or last position.

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50

4 Answers
看我几分像从前
#2 · 2019-02-15 12:18

Read a row with csv, then:

newrow = row[:8] + row[-2:]

then add your new field and write it out (also with csv).
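
For example, a minimal end-to-end sketch of that approach (the file names and the 'new_value' placeholder are just illustrations, not anything from your setup):

import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        newrow = row[:8] + row[-2:]   # keep the first 8 and last 2 columns
        newrow.append('new_value')    # the extra column, appended at the end
        writer.writerow(newrow)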

淡お忘
#3 · 2019-02-15 12:31

Sorry, you will need to write some code with this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let the unhappy data into your database, you will never get all of it out.

Remember oddities about CSV: it's a mishmash of a bunch of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.
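
For instance, a quick sketch that probes the dialect from a sample of the file before you commit to any options (the file name and sample size are just assumptions):

import csv

with open('thebigfile.csv', newline='') as f:
    sample = f.read(64 * 1024)              # sniff a chunk, not all 20M lines
    dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter, dialect.quotechar)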

I recommend you:

  • run a 'tail' command to look at the last few lines.
  • if it appears well behaved, run the whole file through the csv reader to see if it breaks. Make a quick histogram of "fields per line" (see the sketch after this list).
  • Think about "valid" ranges and character types and rigorously check them as you read. Especially watch for unusual unicode or characters outside of the printable range.
  • Seriously consider if you want to keep the extra, odd-ball values in a "rest of the line" text field.
  • Toss any unexpected lines into an exception file.
  • Fix up your code to handle the new pattern in the exceptions file. Rinse. Repeat.
  • Finally, run the whole thing again, actually dumping data into the database.
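
A minimal sketch of the fields-per-line histogram mentioned above (the file name is an assumption):

import csv
from collections import Counter

counts = Counter()
with open('thebigfile.csv', newline='') as f:
    for fields in csv.reader(f):
        counts[len(fields)] += 1          # how many lines have this many fields
for width, n in sorted(counts.items()):
    print(width, 'fields:', n, 'lines')

Anything outside the two or three widths you expect is a line worth quarantining.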

Your development time will be faster if you don't touch a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.
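
If you do try SQLite for the exploratory runs, a minimal sketch looks like this (the file name, table layout, and sample rows are only illustrations):

import sqlite3

conn = sqlite3.connect('scratch.db')
conn.execute("CREATE TABLE IF NOT EXISTS raw_table (line_num INTEGER, img TEXT, val INTEGER)")
rows = [(1, 'img_id.jpg', -50), (2, 'img_id.jpg', -50)]   # stand-in for your cleaned batches
conn.executemany("INSERT INTO raw_table VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()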

Your final code will probably look like this, but I can't be sure without knowing your data, especially how 'well behaved' it is:

import csv
import string

printable_chars = set(string.printable)
reader = csv.reader(open('thebigfile.csv', newline=''))  # file name is just an illustration
database = None            # leave unset until the checks all pass

done = False
while not done:
    out = []
    for chunk in range(1000):
        try:
            fields = next(reader)
        except StopIteration:
            done = True
            break
        except csv.Error:
            print(str(reader.line_num) + ", 'failed to parse'")
            continue
        try:
            assert len(fields) > 5 and len(fields) < 12
            assert int(fields[3]) > 0 and int(fields[3]) < 999999
            assert int(fields[4]) >= 1 and int(fields[4]) <= 12  # date
            assert fields[5] == fields[5].strip()    # no extra whitespace
            assert set(fields[5]) <= printable_chars  # no odd chars
            # ... more checks ...
        except AssertionError:
            print(str(reader.line_num) + ", 'failed checks'")
            continue
        new_rec = [reader.line_num]              # new first item
        new_rec.extend(fields[:8])               # first eight
        new_rec.extend(fields[-2:])              # last two
        new_rec.append(",".join(fields[8:-2]))   # and the rest
        out.append(new_rec)
    if database:  # set up the connection/cursor only once the checks pass
        cursor.executemany("INSERT INTO raw_table VALUES (%s, ...)", out)

Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.

做自己的国王
#4 · 2019-02-15 12:35

You can open the file as a text file and read the lines one at a time. Are there quoted or escaped commas that don't "split fields"? If not, you can do:

with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        # ... use the `consist` list as warranted ...

I suspect that where I have + ['onemore'] you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be.

Don't send each line separately with an insert to the DB -- 20 million inserts would take a long time. Rather, group the "made-consistent" lists, appending them to a temporary list -- each time that list's length hits, say, 1000, use an executemany to add all those entries.
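
Here is a minimal sketch of that batching pattern, assuming psycopg2, a hypothetical raw_table with eleven columns (8 + 2 + 1), and a placeholder connection string:

import psycopg2

conn = psycopg2.connect('dbname=mydb')      # connection details are placeholders
cur = conn.cursor()
sql = "INSERT INTO raw_table VALUES (" + ",".join(["%s"] * 11) + ")"

batch = []
with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        batch.append(consist)
        if len(batch) >= 1000:              # flush every 1000 rows
            cur.executemany(sql, batch)
            batch = []
if batch:                                   # don't forget the final partial batch
    cur.executemany(sql, batch)
conn.commit()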

Edit: to clarify, I don't recommend using the csv module to process a file you know is not in "proper" CSV format: reading the lines yourself gives you more direct control (especially as and when you discover other irregularities beyond the varying number of commas per line).

劫难
#5 · 2019-02-15 12:40

I would recommend using the csv module. Here's some code based on CSV processing that I've done elsewhere:

import csv
import sys

def process(reader, writer):
    for row in reader:
        data = row[:8] + row[-2:]   # keep the first 8 and last 2 columns
        writer.writerow(data)

def main(infilename, outfilename):
    with open(infilename, newline='') as infile:
        reader = csv.reader(infile)
        with open(outfilename, 'w', newline='') as outfile:
            writer = csv.writer(outfile)
            process(reader, writer)

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("syntax: python process.py filename outname")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])