What I am essentially looking for is the `paste` command from bash, but in Python 2. Suppose I have a csv file:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
And another such:
e1,f1
e2,f2
e3,f3
I want to pull them together into this:
a1,b1,c1,d1,e1,f1
a2,b2,c2,d2,e2,f2
a3,b3,c3,d3,e3,f3
This is the simplest case, where the number of files is known and there are only two. What if I wanted to do this with an arbitrary number of files, without knowing in advance how many I have?
I am thinking along the lines of using zip with a list of csv.reader iterables. There will be some unpacking involved, but this much Python-fu seems to be above my IQ level ATM. Can someone suggest how to implement this idea, or something completely different?
I suspect this should be doable with a short snippet. Thanks.
Assuming the number of files is unknown, and that all the files are properly formatted csv with the same number of lines:
files = ['csv1', 'csv2', 'csv3']
fs = map(open, files)
done = False
while not done:
    chunks = []
    for f in fs:
        try:
            l = next(f).strip()
            chunks.append(l)
        except StopIteration:
            done = True
            break
    if not done:
        print ','.join(chunks)
for f in fs:
    f.close()
There seems to be no easy way to use context managers with a variable-length list of files, at least in Python 2 (see a comment in the accepted answer here), so the files have to be closed manually, as above.
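The zip-over-csv.reader idea from the question can be sketched roughly as follows. This is only a sketch under some assumptions: `paste_csv` is a hypothetical helper name, the files are plain comma-separated text, and a try/finally block stands in for a context manager so that every opened file is closed. It should run on both Python 2 and Python 3; on Python 2, `itertools.izip` would avoid building the whole list of rows in memory.

```python
import csv


def paste_csv(paths, out):
    """Write the rows of all files in `paths` side by side to `out`.

    `paste_csv` is a hypothetical helper, not part of any library.
    """
    handles = []
    try:
        for path in paths:
            handles.append(open(path))
        readers = [csv.reader(h) for h in handles]
        writer = csv.writer(out, lineterminator="\n")
        # zip(*readers) yields one tuple of rows per line number and
        # stops as soon as the shortest file is exhausted.
        for rows in zip(*readers):
            merged = []
            for row in rows:
                merged.extend(row)
            writer.writerow(merged)
    finally:
        # Runs even if an open() or a read fails part-way through.
        for h in handles:
            h.close()
```

Calling `paste_csv(['csv1', 'csv2', 'csv3'], sys.stdout)` (after `import sys`) would print the merged rows, and the list of paths can be of any length.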
file1 = open("file1.csv", "r")
file2 = open("file2.csv", "r")
for line in file1:
    # print already appends a newline, so no extra "\n" is added here
    print(line.strip().strip(",") + "," + file2.readline().strip())
file1.close()
file2.close()
This extends to as many files as you wish: just keep adding to the print statement. Instead of printing, you can also append to a list or do whatever else you need. You may have to worry about files of different lengths; I did not, since you did not specify.
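If the files might differ in length, one possible way to handle it is `zip_longest` from itertools (named `izip_longest` on Python 2). The sketch below assumes a hypothetical helper `paste_lines` and simply joins raw lines; a file that runs out contributes a single empty string, so columns will not stay aligned if the shorter file has more than one column.

```python
try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2


def paste_lines(paths, sep=","):
    """Return the merged lines of all files in `paths` as a list.

    `paste_lines` is a hypothetical helper. Files that are exhausted
    early contribute an empty string for the remaining lines.
    """
    handles = [open(p) for p in paths]
    try:
        merged = []
        # Iterating a file object yields its lines one at a time.
        for lines in zip_longest(*handles, fillvalue=""):
            merged.append(sep.join(line.strip() for line in lines))
        return merged
    finally:
        for h in handles:
            h.close()
```

For example, `paste_lines(["file1.csv", "file2.csv"])` returns one merged string per line of the longest file.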
You could try pandas.
In your case, the groups [a,b,c,d] and [e,f] can each be treated as a pandas DataFrame, and joining them is easy because pandas provides a function called concat.
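import pandas as pd
# define group [a-d] as df1 (header=None because the sample files have no header row)
df1 = pd.read_csv('1.csv', header=None)
# define group [e-f] as df2
df2 = pd.read_csv('2.csv', header=None)
# concat takes a list of DataFrames; axis=1 places them side by side
result = pd.concat([df1, df2], axis=1)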