Handling extra newlines (carriage returns) in csv

2019-01-22 21:05发布

问题:

I have a CSV file that has fields that contain newlines e.g.:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

(In this case third field in the second row is "tree\n"

I tried the following:

import csv
catalog = csv.reader(open('test.csv', 'rU'), delimiter=",", dialect=csv.excel_tab)
for row in catalog:
    print "Length: ", len(row), row

and the result I got was this:

Length:  6 ['A', ' B', ' C', ' D', ' E', ' F']
Length:  3 ['123', ' 456', ' tree']
Length:  4 ['   ', ' very', ' bla', ' indigo']

Does anyone have any idea how I can quickly remove extraneous newlines?

Thanks!

回答1:

Suppose you have this Excel spreadsheet:

Note:

  1. the multi-line cell in C2;
  2. embedded comma in C1 and D3;
  3. blank cells, and cell with a space in D4.

Saving that as CSV in Excel, you will get this csv file:

A1,B1,"C1,+comma",D1
,B2,"line 1
line 2",D2
,,C3,"D3,+comma"
,,,D4 space

Assumably, you will want to read that into Python with the blank cells still having meaning and the embedded comma treated correctly.

So, this:

with open("test.csv", 'rU') as csvIN:
    outCSV=(line for line in csv.reader(csvIN, dialect='excel'))

    for row in outCSV:
        print("Length: ", len(row), row) 

correctly produces the 4x4 List of List matrix represented in Excel:

Length:  4 ['A1', 'B1', 'C1,+comma', 'D1']
Length:  4 ['', 'B2', 'line 1\nline 2', 'D2']
Length:  4 ['', '', 'C3', 'D3,+comma']
Length:  4 ['', '', '', 'D4 space']

The example CSV file you posted lacks quotes around the field with an 'extra newline' rendering the meaning of that newline ambiguous. Is it a new row or a multi-line field?

Therefor, you can only interpret this csv file:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

as a one dimension list like so:

with open("test.csv", 'rU') as csvIN:
   outCSV=[field.strip() for row in csv.reader(csvIN, delimiter=',') 
              for field in row if field]

Which produces this one dimensional list:

['A', 'B', 'C', 'D', 'E', 'F', '123', '456', 'tree', 'very', 'bla', 'indigo']

This can then be interpreted and regrouped into any sub grouping as you wish.

The idiomatic regrouping method in python uses zip like so:

>>> zip(*[iter(outCSV)]*6)
[('A', 'B', 'C', 'D', 'E', 'F'), ('123', '456', 'tree', 'very', 'bla', 'indigo')]

Or, if you want a list of lists, this is also idiomatic:

>>> [outCSV[i:i+6] for i in range(0, len(outCSV),6)]
[['A', 'B', 'C', 'D', 'E', 'F'], ['123', '456', 'tree', 'very', 'bla', 'indigo']]

If you can change how your CSV file is created, it will be less ambiguous to interpret.



回答2:

This will work if you have non blanks cells

data = [['A', ' B', ' C', ' D', ' E', ' F'],
['123', ' 456', ' tree'],
['   ', ' very', ' bla', ' indigo']]

flat_list = chain.from_iterable(data)
flat_list = [cell for cell in flat_list if cell.strip() != ''] # remove blank cells

rows = [flat_list[i:i+6] for i in range(0, len(flat_list), 6)] # chunk into groups of 6 
print rows 

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree', ' very', ' bla', ' indigo']]

If you have blank cells in the input, this will work most of the time:

data = [['A', ' B', ' C', ' D', ' E', ' F'],
['123', ' 456', ' tree'],
['   ', ' very', ' bla', ' indigo']]

clean_rows = []
saved_row = []

for row in data:
    if len(saved_row):
        row_tail = saved_row.pop()
        row[0] = row_tail + row[0]  # reconstitute field broken by newline
        row = saved_row + row       # and reassemble the row (possibly only partially)
    if len(row) >= 6:
        clean_rows.append(row)
        saved_row = []
    else:
        saved_row = row


print clean_rows 

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree   ', ' very', ' bla', ' indigo']]

However even the second solution will fail with input such

A,B,C,D,E,F\nG
1,2,3,4,5,6

In this case the input is ambiguous and no algorithm will be able to guess if you meant:

A,B,C,D,E,F
G\n1,2,3,4,5,6 

(or the input give above)

If this could be the case for you, you'll have to go back to the person saving the data and make them save it in cleaner format (btw open office quotes newlines in CSV files far better then Excel).



回答3:

This should work. (Warning: Brain compiled code)

with open('test.csv', 'rU') as infile:
   data = []
   for line in infile:
       temp_data = line.split(',')
       try:
           while len(temp_data) < 6: #column length
               temp_data.extend(infile.next())
       except StopIteration: pass
       data.append(temp_data)


回答4:

If number of fields in each row is the same and fields can't be empty:

from itertools import izip_longest

nfields = 6
with open(filename) as f:
     fields = (field.strip() for line in f for field in line.split(',') if field)
     for row in izip_longest(*[iter(fields)]*nfields): # grouper recipe*
         print(row)

* grouper recipe

Output

('A', 'B', 'C', 'D', 'E', 'F')
('123', '456', 'tree', 'very', 'bla', 'indigo')


回答5:

This works with the CSV module and cleans blank fields and lines:

import csv
import StringIO

data="""A, B, C, D, E, F
123, 456, tree

,,
, very, bla, indigo"""

f=StringIO.StringIO(data)   #used just to simulate a file. Use your file here...
reader = csv.reader(f)
out=[]
for line in reader:
    line=[x.strip() for x in line if x]   # remove 'if x' if you want blank fields
    if len(line):
        out.append(line)

print out        

Prints:

[['A', ' B', ' C', ' D', ' E', ' F'], 
 ['123', '456', 'tree'], 
 ['very', 'bla', 'indigo']]

If you want that in 6 col chunks:

cols=6        
out=[i for sl in out for i in sl]                      # flatten out
out=[out[i:i+cols] for i in range(0, len(out), cols)]  # rechunk into 'cols' 

Prints:

[['A', 'B', 'C', 'D', 'E', 'F'],
 ['123', '456', 'tree', 'very', 'bla', 'indigo']]


回答6:

If you know the number of columns, the best way is to ignore end of lines and then split.

Something like this

with open(filename, 'rU') as fp:
    data = ''.join(fp.readlines())

data = data.split(',')
for n in range(0, len(data), 6)
    print(data[n:n+6])

You can convert it easily into a generator if you prefer:

def read_ugly_file(filename, delimiter=',', columns=6):
    with open(filename, 'rU') as fp:
        data = ''.join(fp.readlines())

    data = data.split(delimiter)
    for n in range(0, len(data), columns)
        yield data[n:n+columns]

for row in read_ugly_file('myfile.csv'):
    print(row)