I receive large CSV files, delimited with comma, |, or ^, with millions of records.
Some of the fields contain non-printable characters like CR|LF, which get interpreted as the end of a record. This is on Windows 10.
I need to write Python to go through the file and remove the CR|LF sequences inside fields. However, I can't remove all of them, because then separate records would be merged.
I have gone through several postings on here about removing non-printables. My thought was to load a pandas DataFrame, then check every field for CR|LF and remove it, but that seems a bit complicated. If you have quick code for how to do this, it would be a great help.
Thanks in advance.
Sample file:
record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF
Sample output file should be:
record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF
CR = Carriage Return = \x0d
LF = Line Feed = \x0a
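To make the requirement concrete, here is a minimal sketch of the idea (assuming a fixed record layout of 3 fields, i.e. 2 commas, as in the sample; sample.csv is a placeholder file name): physical lines are joined until the delimiter count says the record is complete.
# Minimal sketch: join physical lines until the logical record
# holds the expected number of delimiters (2 commas for 3 fields)
expected_delims = 2
buffer = ""
records = []
with open("sample.csv") as f:
    for line in f:
        buffer += line.rstrip("\r\n")   # drop the line ending, keep the text
        if buffer.count(",") >= expected_delims:
            records.append(buffer)      # record is complete
            buffer = ""
This naive count breaks as soon as a field can legitimately contain the delimiter itself.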
EDIT -- Wed Sep 18, 23.46 UTC+2
NOTE:
record1|111. texta|textb|111CR|LF
record2|111. teCR|LF
xta|text|111CR|LF
record3|111. texta|textb|111CR|LF
This is the file that we are going to analyze.
Since we have a CSV file, we can be fairly sure that the datatype will be consistent across rows for a given column.
Under this assumption, we can match the record terminator in scope (CR|LF) with a regexp (\|\d+CR\|LF).
If the regexp does not match, we can remove the carriage return, because it is not the end of the line.
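For example, the pattern tells a complete record apart from a broken one (a quick check; the two strings are taken from the sample above):
import re

pattern = re.compile(r"\|\d+CR\|LF")
# matches: the line really ends with `|<number>CR|LF`
print(bool(pattern.search("record1|111. texta|textb|111CR|LF")))   # True
# no match: the CR|LF sits in the middle of a field
print(bool(pattern.search("record2|111. teCR|LF")))                # False
The complete script then looks like this: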
import pandas as pd
from io import StringIO
import re

# Pattern for a real end of record: `|ARBITRARY NUMBER` followed by CR|LF
pattern = re.compile(r"\|\d+CR\|LF")

# Open the file and read the content
with open("a.txt") as f:
    data = f.readlines()

not_parsed = data.copy()
_max = len(data)
i = 0
parsed_data = []
# Iterate over the lines
while i < _max:
    # Remove the trailing newline
    line = data[i].strip()
    # If the pattern does not match, the CR|LF is inside a field:
    # strip it and glue the next physical line onto this one
    if not pattern.search(line) and i + 1 < _max:
        line = line.replace("CR|LF", "").strip()
        line = line + data[i + 1].strip()
        i += 1
    if line != "":
        parsed_data.append(line)
    i += 1

# Drop the now-redundant CR|LF end-of-record markers
data = [line.replace("CR|LF", "") for line in parsed_data]

# Load the csv using pandas
print("DATA BEFORE -> {}".format("".join(not_parsed)))
print("DATA NOW -> {}".format("\n".join(data)))
DATA = StringIO("\n".join(data))
df = pd.read_csv(DATA, delimiter="|")
How does this remove the unwanted CR|LF but leave the wanted ones?
The file itself is not modified; instead it is read into a list of strings, line by line.
Then the carriage return is replaced only when the regexp does not match, and the result is loaded as a dataframe.
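If the cleaned data should also land back on disk, the dataframe can be written out again (a sketch; cleaned.txt is a placeholder name):
# write the repaired records back to a pipe-delimited file
df.to_csv("cleaned.txt", sep="|", index=False)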
NOTE: tested on Linux, which uses \n as the newline.
Run this script (e.g. name it fix_csv.py) on your file to sanitize it:
#!/usr/bin/env python3

import sys
import os

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields
nf = 3
# set the delimiter
delim = ','

inpf = sys.argv[1]
outf = sys.argv[2]
newline = os.linesep

cache = []
# newline='' keeps Python from translating os.linesep again on Windows
with open(inpf, 'r') as inf, open(outf, 'w', newline='') as of:
    for line in inf:
        line = line.strip()
        ls = line.split(delim)
        # A short line (or a pending cache) means the record was broken
        # by an embedded CR|LF: collect pieces until all fields are there
        if len(ls) < nf or cache:
            if not cache:
                cache = cache + ls
            else:
                # glue the broken field back together, then add the rest
                cache[-1] += ls[0]
                cache = cache + ls[1:]
            if len(cache) == nf:
                of.write(delim.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)
Call it like
./fix_csv input.dat output.dat
Output:
record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb
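For the pipe-delimited sample from the EDIT above, only the two settings at the top of the script need to change (assuming 4 fields per record, as in that sample):
# settings for the pipe-delimited variant
nf = 4
delim = '|'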