I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I took a random sample and found that in any given 100K lines there are only about 10K unique lines... but the file isn't sorted.
Normally I would use, say:
pv myfile.data | sort | uniq > myfile.data.uniq
and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.
I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?
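Roughly what I have in mind, as a sketch (the chunk size and the chunk_ prefix are just placeholders):

split -l 500000 myfile.data chunk_
for f in chunk_*; do
    sort -u "$f" >> myfile.data.uniq    # dedupe within each chunk
    rm "$f"                             # free the chunk's space as soon as it's processed
done
sort -u myfile.data.uniq > myfile.data.uniq.final    # merge duplicates across chunks

but I don't think that actually solves the space problem, since split writes out all the chunks before anything gets removed.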
I thought I might be able to do something like
tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data
but I couldn't figure out a way to truncate the file properly.
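If truncate(1) with a relative size is the right tool, maybe something along these lines (just a sketch; I'm assuming the file ends with a newline so the byte count lines up, and that nothing else writes to the file while this runs):

bytes=$(tail -n 100000 myfile.data | wc -c)          # byte length of the last 100K lines
tail -n 100000 myfile.data | sort | uniq >> myfile.uniq
truncate -s -"$bytes" myfile.data                    # chop those lines off the end of the file

looped until myfile.data is empty, but I don't know whether that's safe or whether there's a better-established way to do it.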