How get unique lines from a very large file in lin

2019-07-09 02:02发布

站内文章 / Linux

26 0

老娘就宠你

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a very large data file (255G; 3,192,563,934 lines). Unfortunately I only have 204G of free space on the device (and no other devices I can use). I did a random sample and found that in a given, say, 100K lines, there are about 10K unique lines... but the file isn't sorted.

Normally I would use, say:

pv myfile.data | sort | uniq > myfile.data.uniq

and just let it run for a day or so. That won't work in this case because I don't have enough space left on the device for the temporary files.

I was thinking I could use split, perhaps, and do a streaming uniq on maybe 500K lines at a time into a new file. Is there a way to do something like that?

I thought I might be able to do something like

tail -100000 myfile.data | sort | uniq >> myfile.uniq && trunc --magicstuff myfile.data

but I couldn't figure out a way to truncate the file properly.

回答1:

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.

标签： linux large-files uniq

老娘就宠你

女 | 书童

私信

收藏的人(0)

Ta的文章更多文章

0条评论

还没有人评论过~

How get unique lines from a very large file in lin

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮