I've been working on a problem where I have data from a large output .txt file and now have to parse and reorganize certain values into the form of a .csv.
I've already written a script that writes all the data into a .csv, in columns based on what kind of data it is (Flight ID, Latitude, Longitude, etc.), but it's not in the correct order. All values are meant to be grouped by Flight ID, ordered from the earliest time stamp to the latest. Fortunately, my .csv already has every value in the correct time order, just not grouped together appropriately by Flight ID.
To clear up my description, it looks like this right now ("Time x" is just to illustrate):
20110117559515, , , , , , , , ,2446,6720,370,42 (Time 0)
20110117559572, , , , , , , , ,2390,6274,410,54 (Time 0)
20110117559574, , , , , , , , ,2391,6284,390,54 (Time 0)
20110117559587, , , , , , , , ,2385,6273,390,54 (Time 0)
20110117559588, , , , , , , , ,2816,6847,250,32 (Time 0)
...
and it's supposed to be ordered like this:
20110117559515, , , , , , , , ,2446,6720,370,42 (Time 0)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42 (Time 1)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42 (Time 2)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42 (Time 3)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42 (Time N)
20110117559572, , , , , , , , ,2390,6274,410,54 (Time 0)
20110117559572, , , , , , , , ,23xx,62xx,4xx,54 (Time 1)
... and so on
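To put the goal another way: since the rows are already in time order, the grouping I'm after is equivalent to a stable sort on the Flight ID column. A minimal sketch of that idea (the output filename is just a placeholder, and I haven't tried this on the full file):

with open('newtrajectory.csv') as src:
    rows = src.readlines()

# list.sort is stable, so sorting on the Flight ID (everything before the first
# comma) groups rows by FID without disturbing the time order inside each group
rows.sort(key=lambda line: line.split(',', 1)[0])

with open('grouped_trajectory.csv', 'w') as dst:
    dst.writelines(rows)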
There are some 1.3 million rows in the .csv I output to make things easier. I'm 99% confident the logic in the next script I wrote to fix the ordering is correct, but my fear is that it's extremely inefficient. I ended up adding a progress bar just to see if it was making any progress, and unfortunately it's crawling.
Here's my code handling the crunching (skip down to the problem area if you like):
## a class I wrote to handle the huge .csv's ##
from BIGASSCSVParser import BIGASSCSVParser
import collections

x = open('newtrajectory.csv')  # file to be reordered
linetlist = []
tidict = {}

''' To save braincells I stored all the required values
of each line into a dictionary of tuples.
Index: Tuple '''
for line in x:
    y = line.replace(',', ' ')
    y = y.split()                       # empty fields drop out, leaving only the populated values
    tup = (y[0], y[1], y[2], y[3], y[4])
    linetlist.append(tup)
for k, v in enumerate(linetlist):
    tidict[k] = v
x.close()
trj = BIGASSCSVParser('newtrajectory.csv')
uniquelFIDs = []
z = trj.column(0)                # list of out-of-order Flight IDs, like in the example above
for i in z:
    if i in uniquelFIDs:
        continue
    else:
        uniquelFIDs.append(i)    # build a list of unique FIDs, in first-seen order, to refer to later
queue = []
p = collections.OrderedDict()    # row index -> Flight ID
for k, v in enumerate(trj.column(0)):
    p[k] = v
All good so far, but it's in this next segment that my computer either chokes or my code just sucks:
for k in uniquelFIDs:
    matches = [i for i, fid in p.items() if fid == k]  # every row index whose Flight ID equals k
    queue.extend(matches)
The idea was that for every unique Flight ID, in order, I iterate over the 1.3 million values and collect, in order, the index of each occurrence, appending those indexes to one big list. After that I was just going to read off that list of indexes and write the contents of each corresponding row into another .csv file. Ta-da! Probably hugely inefficient.
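For comparison, I suspect the same index list could be built in a single pass by appending each row's index to a per-Flight-ID bucket as it's read, instead of rescanning all 1.3 million entries once per unique FID. A sketch of what I mean (untested, reusing the trj object from above):

buckets = collections.OrderedDict()       # Flight ID -> row indexes, keyed in first-seen order
for idx, fid in enumerate(trj.column(0)):
    buckets.setdefault(fid, []).append(idx)

queue = []
for indexes in buckets.values():
    queue.extend(indexes)                 # same ordering as before, but one pass instead of one per unique FID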
What's wrong here? Is there a more efficient way to approach this problem? Is my code flawed, or am I just being cruel to my laptop?
Update:
I've found that with the amount of data I'm crunching, it'll take 9-10 hours; I had half of it correctly spat out in 4.5. I can get away with an overnight crunch for now, but I'll probably look into using a database or another language next time. I would have if I'd known what I was getting into ahead of time, lol.
After adjusting sleep settings for my SSD, it only took 3 hours to crunch.
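For the record, the database route I mentioned above would probably look something like this with sqlite3 (the table layout and file names are just placeholders; I haven't actually run this):

import sqlite3

conn = sqlite3.connect('trajectories.db')
conn.execute('CREATE TABLE IF NOT EXISTS flights (fid TEXT, line TEXT)')

# load every row once, keyed by its Flight ID (the text before the first comma)
with open('newtrajectory.csv') as src:
    conn.executemany('INSERT INTO flights VALUES (?, ?)',
                     ((line.split(',', 1)[0], line.rstrip('\n')) for line in src))
conn.commit()

# rowid preserves insertion order, i.e. the original time order within each group
with open('grouped_trajectory.csv', 'w') as dst:
    for (line,) in conn.execute('SELECT line FROM flights ORDER BY fid, rowid'):
        dst.write(line + '\n')
conn.close()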