I have a tab-delimited input.txt file like this
A B C
A B D
E F G
E F T
E F K
These are tab-delimited.
I want to remove duplicates, where rows count as duplicates whenever they have the same 1st and 2nd columns.
So even though the 1st and 2nd rows differ in the 3rd column, they have the same 1st and 2nd columns, so I want to remove "A B D", which appears later.
So output.txt will be like this.
A B C
E F G
If I were removing duplicates the usual way, I would just pass the list to "set", and I would be all set.
But now I am trying to remove duplicates using only "some" columns.
Using Excel, it's easy.
Data -> Remove Duplicates -> Select columns
Using MATLAB, it's easy, too.
import input.txt -> Use "unique" function with respect to 1st and 2nd columns -> Remove the rows numbered "1"
But using Python, I couldn't find how to do this, because all I knew about removing duplicates was using "set".
===========================
This is what I experimented following undefined_is_not_a_function's answer.
I am not sure how to overwrite the result to output.txt, and how to alter the code to let me specify the columns to use for duplicate-removing (like 3 and 5).
import sys
input = sys.argv[1]
seen = set()
data = []
for line in input.splitlines():
    key = tuple(line.split(None, 2)[0])
    if key not in seen:
        data.append(line)
        seen.add(key)
Assuming that you have already read your file and that you have an array named rows (tell me if you need help with that), the following code should work:
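A minimal sketch of that idea, assuming rows is a list of lists with one inner list per line. The function name dedupe_rows and the key_columns parameter are made up for illustration; key_columns holds 0-based column indexes, so columns 3 and 5 would be (2, 4).

```python
def dedupe_rows(rows, key_columns=(0, 1)):
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    unique_rows = []
    for row in rows:
        # Build the comparison key from the chosen columns only.
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


rows = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["E", "F", "G"],
    ["E", "F", "T"],
    ["E", "F", "K"],
]
print(dedupe_rows(rows))  # [['A', 'B', 'C'], ['E', 'F', 'G']]
```

Because the set only stores keys, the original row order is preserved and the first occurrence always wins.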
If you have access to a Unix system, sort is a nice utility that is made for this kind of problem.
I know this is a Python question, but sometimes Python is not the right tool for the task. And you can always embed a system call in your Python script.
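A hedged sketch of what embedding that call could look like, assuming a sort that accepts the usual -t, -k, -s, -u and -o options (GNU sort does). The function name dedupe_with_sort is invented here. Note that the output comes back sorted by the key columns, not in the original file order.

```python
import os
import subprocess


def dedupe_with_sort(in_path, out_path):
    """Run `sort` so that only one line per (column 1, column 2) pair is kept."""
    cmd = [
        "sort",
        "-t", "\t",      # fields are separated by a tab
        "-k", "1,2",     # compare columns 1 through 2 only
        "-s",            # stable: lines with equal keys keep their input order
        "-u",            # keep one line per distinct key
        "-o", out_path,  # write the result to this file
        in_path,
    ]
    # LC_ALL=C gives plain byte-wise ordering, independent of the user's locale.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(cmd, check=True, env=env)
```

GNU sort documents that -u keeps the first line of each run of equal keys. Other implementations only promise one line per key, so check which line survives if that matters to you.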
You can do it with the code below.
Sorry for the variable names.
Please note that I am not an expert, but I still have ideas that may help you.
There is a csv module that is useful for CSV files; you might find something interesting there.
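For what it's worth, here is a rough sketch of how the csv module could be used for this, reading input.txt and writing output.txt. The function name dedupe_file and its key_columns parameter (0-based indexes) are my own invention, and I am assuming every non-blank line has at least as many columns as the largest index.

```python
import csv


def dedupe_file(in_path, out_path, key_columns=(0, 1)):
    """Copy in_path to out_path, keeping the first row for each key."""
    seen = set()
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if not row:
                continue  # skip blank lines
            # Assumption: each row is long enough for every index in key_columns.
            key = tuple(row[i] for i in key_columns)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```

Calling dedupe_file("input.txt", "output.txt") handles the example above, and key_columns=(2, 4) would compare the 3rd and 5th columns instead. The two paths have to be different files, because opening the output for writing truncates it.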
First, how are you storing the data? In a list?
Something like
could be suitable (maybe not the best choice).
Second, is it possible to go through the whole list?
You can simply take a line and compare it to all the other lines.
I would do this, supposing list contains the letters:
This is not working code, but it gives you the idea. It is the simplest way to perform your task, and probably not the most suitable one (it will take a while, since it needs a quadratic number of operations). Edit: pop, not remove.
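To make the idea concrete, here is a small runnable sketch of the compare-against-every-later-line approach, assuming the data is a list of lists of letters. The names dedupe_quadratic and n_key are only illustrative.

```python
def dedupe_quadratic(lines, n_key=2):
    """Remove, in place, every line whose first n_key fields repeat an earlier line."""
    i = 0
    while i < len(lines):
        key = lines[i][:n_key]
        j = i + 1
        while j < len(lines):
            if lines[j][:n_key] == key:
                lines.pop(j)  # later duplicate: drop it and stay at index j
            else:
                j += 1
        i += 1
    return lines


data = [list("ABC"), list("ABD"), list("EFG"), list("EFT"), list("EFK")]
print(dedupe_quadratic(data))  # [['A', 'B', 'C'], ['E', 'F', 'G']]
```

Using pop(j) with an index removes exactly the line being inspected, whereas remove() searches by value. The index j only advances when nothing was popped, so no line is skipped.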
You should use itertools.groupby for this. Here I am grouping the data based on the first two columns and then using next() to get the first item from each group. Simply replace s.splitlines() with a file object if the input is coming from a file. Note that the above solution will work only if the data is sorted by the first two columns; if that's not the case, then you'll have to use a set here.
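A minimal sketch of the groupby approach described above, assuming the sample data is held in a string s and that columns are separated by tabs. The helper name first_two_columns is just for illustration.

```python
from itertools import groupby

s = "A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK"


def first_two_columns(line):
    """Return the grouping key: the first two tab-separated fields."""
    return tuple(line.split("\t")[:2])


# groupby only merges *adjacent* lines with the same key,
# so the input must already be grouped by the first two columns.
unique_lines = [
    next(group) for _, group in groupby(s.splitlines(), key=first_two_columns)
]
print("\n".join(unique_lines))
```

For the sample data this keeps the "A B C" and "E F G" lines. If equal keys are scattered through the file, groupby would treat each run as a separate group, which is why the set-based approach is needed for unsorted input.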