Sort and get unique lines of a file in Python

I always use this command line to sort a file and keep only its unique lines, and it works like a charm even with large files (over 500,000 lines):

sort filename.txt | uniq | sponge filename.txt

The shortest equivalent Python code would be:

with open("filename.txt", "r") as f:
    lines = [line for line in f]
lines = sorted(set(lines))  # list.sort() returns None, so dedupe with set() and then use sorted()

But of course this is not scalable because of memory constraints, and writing scalable code in Python would take time. So I wonder: what is the shortest equivalent code (or package) in Python?

Tags: python command-line unique

4 Answers

This account has been banned

#2 · 2019-02-13 11:26

There is a built-in that does what sort does: sorted, which takes any iterable and returns a sorted list. Let's write a generator that mimics uniq by yielding only the lines that differ from the previous line:

def uniq(iterator):
    previous = float("NaN")  # Not equal to anything
    for value in iterator:
        if previous != value:
            yield value
            previous = value

Now you can do the same thing, with:

with open('/path/to/filename') as f:
    for line in uniq(sorted(f)):
        print(line, end="")  # each line already ends with a newline

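But sorted (like the shell's sort) has to read the whole input before it can output anything (the last line in the file might have to come first), and sorted keeps all of it in memory. So uniq(sorted(f)) is worse than simply using set(f) (or sorted(set(f)) if the order matters), which drops the duplicates before anything is sorted.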
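For a file that fits in memory, the whole shell pipeline, including the write-back that sponge does, could be sketched roughly like this. The function name sort_unique_in_place is made up for illustration, and the sketch assumes every line (including the last one) ends with a newline. Python compares strings by code point, so the order matches the shell's sort only under a byte-wise locale such as LC_ALL=C.

```python
def sort_unique_in_place(path):
    """Rewrite the file at `path` with its unique lines in sorted order.

    Hypothetical helper; assumes every line ends with a newline.
    """
    # Read everything before opening the file for writing (this is the job
    # sponge does in the shell pipeline), so overwriting the input is safe.
    with open(path, "r") as f:
        lines = sorted(set(f))  # dedupe first, then sort the smaller collection
    with open(path, "w") as f:
        f.writelines(lines)
```

Calling sort_unique_in_place("filename.txt") would then replace the three-command pipeline, at the cost of holding all unique lines in memory at once.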


Luminary・发光体

#3 · 2019-02-13 11:29

You don't need to sort in Python just to remove duplicates, since a set takes care of uniqueness even without sorting.

with open("filename.txt", "r") as f:
    lines = set(f.readlines())

The shell sort command also has to read every line before it can output anything (although implementations such as GNU sort can spill to temporary files on disk), so it is not a purely streaming operation either. If you have really large files, or you are adamant about not using additional memory, you can try some crazy tricks like the one shown here: http://neopythonic.blogspot.in/2008/10/sorting-million-32-bit-integers-in-2mb.html
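One common way to bound memory for files that really don't fit is an external merge sort: sort and dedupe fixed-size chunks, spill each chunk to a temporary file, then merge the sorted chunks with heapq.merge. The following is only a sketch under a few assumptions: the function name sort_unique_external and the chunk_lines parameter are invented for illustration, every line is assumed to end with a newline, and the number of chunks is assumed to stay below the operating system's limit on open files.

```python
import heapq
import itertools
import os
import tempfile


def sort_unique_external(src_path, dst_path, chunk_lines=100_000):
    """Write the sorted unique lines of src_path to dst_path.

    Hypothetical helper: only about `chunk_lines` lines are held in memory
    at any time; the rest lives in temporary files.
    """
    chunk_paths = []
    try:
        # Phase 1: sort and dedupe one chunk at a time, spilling each to disk.
        with open(src_path, "r") as src:
            while True:
                chunk = sorted(set(itertools.islice(src, chunk_lines)))
                if not chunk:
                    break
                fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Phase 2: k-way merge of the sorted chunks. Equal lines come out
        # next to each other, so comparing with the previous line is enough.
        files = [open(p, "r") for p in chunk_paths]
        try:
            with open(dst_path, "w") as dst:
                previous = None
                for line in heapq.merge(*files):
                    if line != previous:
                        dst.write(line)
                        previous = line
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

The trade-off is extra disk I/O in exchange for memory use that depends on chunk_lines rather than on the size of the file.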


倾城　Initia

#4 · 2019-02-13 11:32

Use shell commands from Python:

import os
os.system("sort filename.txt | uniq | sponge filename.txt")
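If sponge is not installed, a variant that might work is to let sort do everything itself: -u drops duplicate lines, and -o names an output file that POSIX allows to be the same as the input. This sketch assumes a POSIX sort on the PATH; the function name is hypothetical, and forcing LC_ALL=C is my own assumption to get byte-wise ordering.

```python
import os
import subprocess


def sort_unique_with_system_sort(path):
    """Sort the file at `path` in place and drop duplicates via the system sort.

    Hypothetical helper; assumes a POSIX `sort` is available on the PATH.
    """
    # Assumption: byte-wise ordering is wanted, so force the C locale.
    env = dict(os.environ, LC_ALL="C")
    # -u keeps one copy of each line; -o writes the result to the named file,
    # which may be the input file itself, so neither uniq nor sponge is needed.
    subprocess.run(["sort", "-u", "-o", path, path], check=True, env=env)
```

Passing the arguments as a list avoids going through a shell, and check=True raises CalledProcessError if sort fails instead of silently ignoring the exit status.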


SAY GOODBYE

#5 · 2019-02-13 11:40

Here is a shorter example:

with open("filename.txt", 'r') as f:
    lines = set(f)

One thing worth noting: in this case the file is read one line at a time rather than all at once, although the set still ends up holding every unique line. The reason is that the above code is equivalent to:

lines = set()
with open("filename.txt", 'r') as f:
    for line in f:  # f is an iterator over lines, reading only one line at a time
        lines.add(line)

