I would like your help with trimming a file by removing the columns that have the same value on every line.
# the file I have (tab-delimited, millions of columns)
jack 1 5 9
john 3 5 0
lisa 4 5 7
# the file I want (remove the columns with the same value in all lines)
jack 1 9
john 3 0
lisa 4 7
Could you please give me any directions on this problem? I prefer a sed or awk solution, or maybe a perl solution.
Thanks in advance. Best,
The main problem here is that you said "millions of columns" but did not specify how many rows. To check each value in each row against its counterpart in every other row, you are looking at a great many checks.
Granted, you would be able to reduce the number of columns as you go, but you would still need to check each one down to the last row. So... much processing.
We can make a "seed" hash to start off with from the first two lines.
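A minimal sketch of that step, assuming a tab-delimited file called data.tsv (the filename is just a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # build a "seed" hash of column index => value from the first two lines
    open my $fh, '<', 'data.tsv' or die "Cannot open data.tsv: $!";

    chomp( my $first  = <$fh> );
    chomp( my $second = <$fh> );

    my @a = split /\t/, $first;
    my @b = split /\t/, $second;

    my %dup;
    for my $i ( 0 .. $#a ) {
        # keep only the columns where the first two lines agree
        $dup{$i} = $a[$i] if $a[$i] eq $b[$i];
    }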
Then, with this "seed" hash, you could read the rest of the lines and remove non-matching values from the hash.
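Continuing the sketch above (still reading from the same handle):

    # any candidate column whose value changes on a later line is dropped
    while ( my $line = <$fh> ) {
        chomp $line;
        my @cols = split /\t/, $line;
        for my $i ( keys %dup ) {
            delete $dup{$i} if $cols[$i] ne $dup{$i};
        }
        last unless %dup;    # nothing left to check, stop early
    }
    close $fh;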
One could imagine a solution where you stripped out of each line the columns that had already been proven to vary, but in order to do that you need to split the line into an array, or build a regex, and I am not sure that would not take just as long as simply passing over the whole string.
Then, after processing all the rows, you would be left with a hash keyed on the columns holding the duplicated values, so you could re-open the file and do your print.
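Continuing the same sketch, the second pass might look like this:

    # re-open the file and print only the columns that are not in %dup
    open $fh, '<', 'data.tsv' or die "Cannot open data.tsv: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @cols = split /\t/, $line;
        my @keep = grep { !exists $dup{$_} } 0 .. $#cols;
        print join( "\t", @cols[@keep] ), "\n";
    }
    close $fh;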
This is a rather heavy operation, and the code is untested; it should give you a hint towards a solution, but it will probably take a while to process the whole file. I suggest running some tests to see whether it works with your data, and tweaking it from there.
If you only have a few matching columns, it is much easier to simply extract them from the line, though I hesitate to use split on such long lines. Something like the sketch below. Note that we would have to sort the keys in descending numerical order, so that we trim values from the end; otherwise we screw up the indices of the subsequent array elements.
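One way that might look, reusing the %dup hash from above (a rough, untested sketch):

    # splice the duplicated columns out of each row, highest index first,
    # so the indices of the remaining elements stay valid
    my @remove = sort { $b <=> $a } keys %dup;

    open $fh, '<', 'data.tsv' or die "Cannot open data.tsv: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @cols = split /\t/, $line;
        splice @cols, $_, 1 for @remove;
        print join( "\t", @cols ), "\n";
    }
    close $fh;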
Anyway, it might be one way to go. It's a rather large operation, though. I'd keep backups. ;)
Well, that was assuming it was always the third column. If it has to be matched by value instead, see the sketch below.
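A rough one-liner for that reading, assuming the value to drop is known up front (here '5', as in the example; file.tsv is a placeholder). Note that it removes every field equal to that value, wherever it occurs:

    perl -F'\t' -lane 'print join "\t", grep { $_ ne "5" } @F' file.tsv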
With the question edit, the OP's requirements become clear. How about the following?
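A rough two-pass sketch (untested; it assumes tab-separated fields and takes the filename as its first argument):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift or die "usage: $0 file\n";

    # first pass: assume every column is constant, then discard any
    # column that changes on a later line
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp( my $head = <$fh> );
    my @first = split /\t/, $head;
    my %same  = map { $_ => $first[$_] } 0 .. $#first;

    while ( my $line = <$fh> ) {
        chomp $line;
        my @cols = split /\t/, $line;
        for my $i ( keys %same ) {
            delete $same{$i} if $cols[$i] ne $same{$i};
        }
    }
    close $fh;

    # second pass: print every line without the constant columns
    my @keep = grep { !exists $same{$_} } 0 .. $#first;

    open $fh, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        print join( "\t", ( split /\t/, $line )[@keep] ), "\n";
    }
    close $fh;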