Python to remove duplicates using only some, not a

I have a tab-delimited input.txt file like this

A    B    C
A    B    D
E    F    G
E    F    T
E    F    K

These are tab-delimited.

I want to remove duplicates only when multiple rows have the same 1st and 2nd columns.

So, even though the 1st and 2nd rows differ in the 3rd column, they have the same 1st and 2nd columns, so I want to remove "A B D", which appears later.

So output.txt will be like this.

A    B    C
E    F    G

If I were removing duplicates the usual way, I would just pass the list to "set" and be done.

But now I am trying to remove duplicates using only "some" columns.

Using Excel, it's easy.

Data -> Remove Duplicates -> Select columns

Using MATLAB, it's easy, too.

import input.txt -> Use "unique" function with respect to 1st and 2nd columns -> Remove the rows numbered "1"

But using Python, I couldn't find out how to do this, because the only way I knew to remove duplicates was "set".

===========================

This is what I experimented following undefined_is_not_a_function's answer.

I am not sure how to overwrite the result to output.txt, and how to alter the code to let me specify the columns to use for duplicate-removing (like 3 and 5).

import sys
input = sys.argv[1]

seen = set()
data = []
for line in input.splitlines():
    key = tuple(line.split(None, 2)[0])
    if key not in seen:
        data.append(line)
        seen.add(key)

Tags: python file duplicate-removal tab-delimited-text

5 Answers

迷人小祖宗

#2 · 2019-09-02 14:39

Assuming that you have already read your data and have an array named rows (tell me if you need help with that), the following code should work:

entries = []  # a list keeps the rows in their original order
keys = set()
for row in rows:
    key = (row[0], row[1])  # Only the first two columns

    if key not in keys:
        keys.add(key)
        entries.append((row[0], row[1], row[2]))
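
The rows array assumed above could be built along these lines. This is only a sketch: parse_rows and first_per_key are hypothetical helper names, the fields are assumed to be separated by single tabs, and every non-blank line is assumed to have at least two fields.

```python
def parse_rows(lines):
    """Split each non-blank line into a list of tab-separated fields."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip blank lines such as a trailing empty line
            rows.append(line.split("\t"))
    return rows


def first_per_key(rows):
    """Keep only the first row seen for each (column 1, column 2) pair."""
    keys = set()
    kept = []
    for row in rows:
        key = (row[0], row[1])
        if key not in keys:
            keys.add(key)
            kept.append(row)
    return kept


# Sample data from the question, written out as tab-delimited lines.
sample = ["A\tB\tC\n", "A\tB\tD\n", "E\tF\tG\n", "E\tF\tT\n", "E\tF\tK\n"]
result = first_per_key(parse_rows(sample))
```

Because parse_rows accepts any iterable of lines, an open file object can be passed instead of the sample list, e.g. inside `with open("input.txt") as f:`.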

仙女界的扛把子

#3 · 2019-09-02 14:41

If you have access to a Unix system, sort is a nice utility that is made for your problem.

sort -u -t$'\t' --key=1,2 filein.txt

I know this is a Python question, but sometimes Python is not the tool for the task. And you can always embed a system call in your Python script.
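
Embedding that call could look roughly like the sketch below. It assumes a sort binary is on the PATH; sort_unique is a hypothetical wrapper name, and the shell's `$'\t'` quoting becomes a plain "\t" argument because no shell is involved. Note that the output comes back sorted by the key columns rather than in the original order, and which of the duplicate rows survives is up to the sort implementation.

```python
import subprocess


def sort_unique(path, first_col=1, last_col=2):
    """Run `sort -u` keyed on a range of tab-separated columns and return its output lines."""
    completed = subprocess.run(
        ["sort", "-u", "-t", "\t", "-k", f"{first_col},{last_col}", path],
        check=True,           # raise CalledProcessError if sort fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
    )
    return completed.stdout.splitlines()
```

A call such as `sort_unique("filein.txt")` then returns one line per distinct pair of first and second columns.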

Viruses.

#4 · 2019-09-02 14:41

You can do it with the code below.

lst = []
with open('yourfile.txt') as file_:
    for each_line in file_:
        li = each_line.split()
        if li:  # skip blank lines
            lst.append(li)
dic = {}
for l in lst:
    # keep only the first third-column value seen for each (col 1, col 2) pair
    if (l[0], l[1]) not in dic:
        dic[(l[0], l[1])] = l[2]

print(dic)

Sorry for the variable names.

霸刀☆藐视天下

#5 · 2019-09-02 14:50

Please note that I am not an expert, but I have some ideas that may help you.

There is a csv module for working with delimited files; it may be worth a look.
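
As a rough illustration of what the csv module offers here, the sketch below reads and writes tab-delimited rows with csv.reader and csv.writer. The function name dedupe_tsv and the key_columns parameter are made up for this example, and io.StringIO objects stand in for real files.

```python
import csv
import io


def dedupe_tsv(infile, outfile, key_columns=(0, 1)):
    """Copy tab-delimited rows from infile to outfile, skipping repeated keys."""
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    seen = set()
    for row in reader:
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)


# In-memory stand-ins for input.txt and output.txt.
source = io.StringIO("A\tB\tC\nA\tB\tD\nE\tF\tG\nE\tF\tT\nE\tF\tK\n")
target = io.StringIO()
dedupe_tsv(source, target)
```

With real files, the csv documentation recommends opening them with newline="", for example `open("input.txt", newline="")`.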

First, how are you storing the data? In a list?

Something like

[[A,B,C],
[A,B,D],
[E,F,G],...]

could be suitable (maybe not the best choice).

Second, is it possible to go through the whole list?

You can simply take one line and compare it to all the other lines.

I would do this, supposing rows is the list that contains the letters (it is named rows rather than list to avoid shadowing the built-in):

index_list = []
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):  # only compare with later rows, so i != j
        if rows[i][0] == rows[j][0] and rows[i][1] == rows[j][1] and j not in index_list:
            index_list.append(j)
for j in sorted(index_list, reverse=True):  # pop from the end so indexes stay valid
    rows.pop(j)

This is a rough sketch to give you the idea. It is the simplest way to perform your task, and probably not the most suitable one (it will take a while on large inputs, since it performs a quadratic number of comparisons). Edit: pop, not remove.

劳资没心，怎么记你

#6 · 2019-09-02 14:56

You should use itertools.groupby for this. Here I am grouping the data based on the first two columns and then using next() to get the first item from each group.

>>> from itertools import groupby
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x: x.split()[:2]):
...     print(next(g))
...
A    B    C
E    F    G

Simply replace s.splitlines() with the file object if the input is coming from a file.
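
When iterating over a file object, each line keeps its trailing newline, so it helps to strip it before printing or collecting. A small sketch, with io.StringIO standing in for the file and first_of_each_run as a hypothetical helper name:

```python
import io
from itertools import groupby


def first_of_each_run(lines):
    """Return the first line of every run of consecutive lines sharing columns 1 and 2."""
    stripped = (line.rstrip("\n") for line in lines)
    return [
        next(group)
        for _, group in groupby(stripped, key=lambda line: line.split()[:2])
    ]


grouped_input = io.StringIO("A B C\nA B D\nE F G\nE F T\n")
scattered_input = io.StringIO("A B C\nE F G\nA B D\n")

# Equal keys sit next to each other, so each key appears once.
grouped_result = first_of_each_run(grouped_input)      # ['A B C', 'E F G']
# groupby only merges adjacent lines, so the second "A B" row survives.
scattered_result = first_of_each_run(scattered_input)  # ['A B C', 'E F G', 'A B D']
```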

Note that the above solution works only if the data is sorted by the first two columns. If that's not the case, you'll have to use a set here.

>>> from operator import itemgetter
>>> ig = itemgetter(0, 1) #Pass any column number you want, note that indexing starts at 0
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K
A    B    F'''     
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
...         
>>> data
['A    B    C', 'E    F    G']
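
To cover the two follow-up points in the question (writing the result to output.txt and choosing which columns to compare), the set-based approach could be wrapped roughly as follows. This is a sketch under assumptions: unique_by_columns and dedupe_file are hypothetical names, fields are assumed to be separated by single tabs, and every non-blank line is assumed to contain all of the selected columns.

```python
from operator import itemgetter


def unique_by_columns(lines, columns=(0, 1)):
    """Yield each line whose selected tab-separated columns have not been seen yet."""
    pick = itemgetter(*columns)  # columns are 0-based indexes
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:  # skip blank lines
            continue
        key = pick(fields)
        if key not in seen:
            seen.add(key)
            yield line


def dedupe_file(in_path, out_path, columns=(0, 1)):
    """Write the de-duplicated lines of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(unique_by_columns(src, columns))
```

Since indexing starts at 0, comparing the 3rd and 5th columns would be `dedupe_file("input.txt", "output.txt", columns=(2, 4))`.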
