I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the set of unique characters in this file.

For instance, if my file were the following:

Entry
-----
Yabba
Dabba
Doo

Then the result would be

Unique characters: {abdoy}

Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

Update

I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

Update 2

By Fast, I mean fast to implement...not necessarily fast to run.

Tags: search parsing scripting

22 answers

手持菜刀，她持情操

#2 · 2020-06-01 01:48

A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.

Why the arbitrary limitation that you need a "script" that does it?

What exactly is a script anyway?

Would Python do?

If so, then this is one solution:

import sys

# Collect every distinct character from standard input, ignoring case.
s = set()
for line in sys.stdin:
    s.update(line.rstrip().lower())

print("".join(sorted(s)))


▲ chillily

#3 · 2020-06-01 01:48

In C++, I would first loop through the letters of the alphabet and run strchr() on each one, with the file contents as the string. This tells you whether that letter exists; if it does, add it to the list.
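The same idea (test each letter of the alphabet for presence in the text) can be sketched in Python, where the `in` operator plays the role of strchr(). This is only an illustration: the function name `letters_present` is an assumption made here, and only ASCII letters are checked.

```python
import string

def letters_present(text):
    """Return the ASCII letters that occur in text, ignoring case."""
    lowered = text.lower()
    # One substring search per candidate letter, like calling strchr() 26 times.
    return "".join(ch for ch in string.ascii_lowercase if ch in lowered)

print(letters_present("Yabba\nDabba\nDoo\n"))  # prints abdoy
```

Each lookup scans the text again, so this does up to 26 passes over the data rather than one.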


我想做一个坏孩纸

#4 · 2020-06-01 01:50

Python w/sets (quick and dirty)

with open("data.txt", "r") as f:
    s = f.read()
print("Unique Characters: {%s}" % ''.join(set(s)))

Python w/sets (with nicer output)

import re

with open("data.txt", "r") as f:
    text = f.read().lower()
unique = re.sub(r'\W', '', ''.join(set(text)))  # Drop non-word characters (letters, digits and underscore remain)

print("Unique Characters: {%s}" % unique)


Bombasti

#5 · 2020-06-01 01:50

An answer above mentioned using a dictionary.

If so, the code presented there can be streamlined a bit, since the Python documentation states:

It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten.

Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:

    if character not in letters:

And that should make it a little faster.
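A minimal sketch of that simplification, assuming the referenced code stores each character as a dictionary key. The names `letters` and `character` follow the line quoted above; the wrapper `unique_chars` and the lowercasing are illustrative assumptions, since the referenced code is not shown here.

```python
def unique_chars(lines):
    """Collect distinct characters as dictionary keys, ignoring case."""
    letters = {}
    for line in lines:
        for character in line.rstrip("\n").lower():
            # No "if character not in letters" guard is needed:
            # assigning to an existing key simply overwrites its value.
            letters[character] = True
    return "".join(letters)

print(unique_chars(["Yabba\n", "Dabba\n", "Doo\n"]))
```

Joining a dictionary iterates over its keys, so the result contains each character exactly once.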


SAY GOODBYE

#6 · 2020-06-01 01:51

Print unique characters (ASCII and Unicode UTF-8)

import io

# Runtime: O(1)
letters = set()

# Runtime: O(n), where n is the total number of characters in the file
with io.open('my_file_name', encoding='utf-8') as f:
    for line in f:
        for character in line:
            letters.add(character)

# Runtime: O(k), where k is the number of unique characters
letter_str = ''.join(letters)

print(letter_str)

Save as unique.py, and run as python unique.py.
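Because the question ignores case, a variant that lowercases each line before collecting may be closer to what was asked. This is a sketch under that assumption: `unique_chars_ignoring_case` is a hypothetical helper, and line endings are dropped so that the newline is not reported as a character.

```python
import io

def unique_chars_ignoring_case(path, encoding="utf-8"):
    """Return the set of distinct characters in a text file, lowercased."""
    letters = set()
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            # set.update adds every character of the string in one call.
            letters.update(line.rstrip("\r\n").lower())
    return letters
```

It could be called as `print(''.join(sorted(unique_chars_ignoring_case('my_file_name'))))`, with `my_file_name` being the same placeholder path used above.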


ら.Afraid

#7 · 2020-06-01 01:51

Quick and dirty C program that's blazingly fast:

#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;
  while((c = getchar()) != EOF)
    chars[c] = 1;
  for(c = 32; c < 127; c++)  // printable chars only
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}

Compile it, then run

cat file | ./a.out

to get a list of the unique printable characters in file.
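For comparison, the flag-table technique (one slot per possible byte value) can be sketched in Python as well, assuming the input is available as raw bytes. The function name `printable_bytes_seen` is hypothetical. Like the C program, it is case-sensitive and reports only printable ASCII.

```python
def printable_bytes_seen(data):
    """Return the printable ASCII characters present in a bytes object."""
    seen = bytearray(256)          # one flag per possible byte value
    for byte in bytearray(data):   # iterating a bytearray yields ints
        seen[byte] = 1
    # Same range as the C loop: 32 (space) through 126 (~).
    return "".join(chr(c) for c in range(32, 127) if seen[c])

print(printable_bytes_seen(b"Yabba\nDabba\nDoo\n"))  # prints DYabo
```

Lowercasing the data first (for example with `data.lower()`) would give the case-insensitive result the question asks for.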


