My source data is in a TSV file, 6 columns and greater than 2 million rows.
Here's what I'm trying to accomplish:
- I need to read the data in 3 of the columns (3, 4, 5) in this source file
- The fifth column is an integer. I need to use this integer value to duplicate a row entry with using the data in the third and fourth columns (by the number of integer times).
- I want to write the output of #2 to an output file in CSV format.
Below is what I came up with.
My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.
First, I made a sample tab separate file to work with, and called it 'sample.txt'. It's basic and only has four rows:
Row1_Column1 Row1-Column2 Row1-Column3 Row1-Column4 2 Row1-Column6
Row2_Column1 Row2-Column2 Row2-Column3 Row2-Column4 3 Row2-Column6
Row3_Column1 Row3-Column2 Row3-Column3 Row3-Column4 1 Row3-Column6
Row4_Column1 Row4-Column2 Row4-Column3 Row4-Column4 2 Row4-Column6
then I have this code:
import csv
with open('sample.txt','r') as tsv:
AoA = [line.strip().split('\t') for line in tsv]
for a in AoA:
count = int(a[4])
while count > 0:
with open('sample_new.csv','ab') as csvfile:
csvwriter = csv.writer(csvfile, delimiter=',')
csvwriter.writerow([a[2], a[3]])
count = count - 1