I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. do some search/replace on it.
I tried to open the file and do the pre-processing in a generator that I then hand over to read_csv():
import re
import pandas as pd

def in_stream():
    with open("some.csv") as csvfile:
        for line in csvfile:
            # collapse the stray '","' sequences into plain commas
            yield re.sub(r'","', r',', line)

df = pd.read_csv(in_stream())
Sadly, this just throws a
ValueError: Invalid file path or buffer object type: <class 'generator'>
However, looking at pandas' source, I'd expect it to be able to work on iterators, and thus on generators.
I only found this article (Using a custom object in pandas.read_csv()), which outlines how to wrap a generator in a file-like object, but it seems to work only on files opened in binary mode.
So, in the end, I'm looking for a pattern to build a pipeline that opens a file, reads it line by line, allows pre-processing, and then feeds it into e.g. pandas.read_csv().
After further investigation of pandas' source, it became apparent that it doesn't simply require an iterable, but also wants it to be file-like, expressed by having a read method (see is_file_like() in pandas' inference.py).
So, I built a generator the old way.
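A minimal sketch of the idea (the class name and the line-per-call read() are illustrative choices, not the only way to do it):

import re

class CsvCleaner:
    # Iterator wrapped around the file, with the read() method that
    # pandas' is_file_like() checks for.
    def __init__(self, path):
        self.file = open(path)

    def __iter__(self):
        return self

    def __next__(self):
        line = self.read()
        if not line:  # read() returns '' once the file is exhausted
            self.file.close()
            raise StopIteration
        return line

    def read(self, size=-1):
        # Hand back one pre-processed line per call; a file-like object
        # signals EOF by returning an empty string.
        return re.sub(r'","', r',', self.file.readline())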
This works in pandas.read_csv().
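Using the illustrative wrapper from the sketch above:

import pandas as pd

df = pd.read_csv(CsvCleaner("some.csv"))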
To me this looks super complicated and I wonder if there is any better (→ more elegant) solution.
Here's a solution that will work for smaller CSV files. All lines are first read into memory, processed, and concatenated. This will probably perform badly for larger files.
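A sketch of that approach, reusing the search/replace from the question and an io.StringIO buffer (both are illustrative choices):

import io
import re

import pandas as pd

# Read and fix every line up front, then join everything into one string.
with open("some.csv") as csvfile:
    cleaned = "".join(re.sub(r'","', r',', line) for line in csvfile)

# StringIO makes the in-memory string look like a file to read_csv().
df = pd.read_csv(io.StringIO(cleaned))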