Find Unique Characters in a File

Posted 2020-06-01 01:33

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

For instance, if my file were the following:

Entry
-----
Yabba
Dabba
Doo

Then the result would be

Unique characters: {abdoy}

Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

Update

I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

Update 2

By fast, I mean fast to implement, not necessarily fast to run.

22 answers
手持菜刀,她持情操
#2 · 2020-06-01 01:48

A very fast solution would be to make a small C program that reads its standard input, does the aggregation and spits out the result.

Why the arbitrary limitation that you need a "script" that does it?

What exactly is a script anyway?

Would Python do?

If so, then this is one solution:

import sys

s = set()
# Read standard input line by line and collect every lowercased character
for line in sys.stdin:
    line = line.rstrip()
    for c in line.lower():
        s.add(c)

print("".join(sorted(s)))
▲ chillily
#3 · 2020-06-01 01:48

In C++ I would read the file into a string, then loop through the letters of the alphabet and call strchr() on the string for each one. That tells you whether the letter exists in the file; if it does, add it to the list.
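The same idea can be sketched in Python, where the `in` operator plays the role of strchr(). This is only an illustration of the approach, not the poster's code: the function name `letters_present` is made up here, and it assumes you only care about the 26 ASCII letters.

```python
import string

def letters_present(text):
    """Return the ASCII letters that occur in text, ignoring case."""
    lowered = text.lower()
    # Probe the whole text once per alphabet letter, like strchr() in C.
    return "".join(ch for ch in string.ascii_lowercase if ch in lowered)

# Sample input taken from the question.
print(letters_present("Yabba\nDabba\nDoo\n"))  # prints "abdoy"
```

Each probe scans the text, so this does up to 26 passes over the data. For 450,000 short rows that should still be quick, but digits, punctuation and non-ASCII letters are never reported.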

我想做一个坏孩纸
#4 · 2020-06-01 01:50

Python w/sets (quick and dirty)

s = open("data.txt", "r").read()
print("Unique Characters: {%s}" % ''.join(set(s)))

Python w/sets (with nicer output)

import re

text = open("data.txt", "r").read().lower()
# Drop non-word characters; \W keeps letters, digits and the underscore
unique = re.sub(r'\W', '', ''.join(set(text)))

print("Unique Characters: {%s}" % unique)
Bombasti
#5 · 2020-06-01 01:50

An answer above mentioned using a dictionary.

If so, the code presented there can be streamlined a bit, since the Python documentation states:

It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary).... If you store using a key that is already in use, the old value associated with that key is forgotten.

Therefore, this line of the code can be removed, since the dictionary keys will always be unique anyway:

    if character not in letters:

And that should make it a little faster.
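The answer being referred to is not reproduced here, so the following is a hypothetical reconstruction of what the streamlined loop might look like. The names `letters` and `character` come from the line quoted above; the function name and the `True` placeholder value are assumptions made for illustration.

```python
def unique_chars_with_dict(lines):
    """Collect the lowercased characters of all lines as dictionary keys."""
    letters = {}
    for line in lines:
        for character in line.strip().lower():
            # No "if character not in letters" guard is needed: assigning
            # to an existing key simply overwrites its value.
            letters[character] = True
    return letters

found = unique_chars_with_dict(["Yabba", "Dabba", "Doo"])
print("".join(sorted(found)))  # prints "abdoy"
```

If the stored values are never used, a set expresses the same thing more directly, which is what several of the other answers do.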

SAY GOODBYE
#6 · 2020-06-01 01:51

Print unique characters (ASCII and Unicode UTF-8)

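import codecs

# Decode the file as UTF-8 so non-ASCII characters are read correctly
file = codecs.open('my_file_name', encoding='utf-8')

# Creating an empty set is O(1)
letters = set()

# O(n) overall, where n is the total number of characters in the file;
# each set insertion is O(1) on average
for line in file:
  for character in line:
    letters.add(character)

# O(k), where k is the number of unique characters
letter_str = ''.join(letters)

print(letter_str)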

Save as unique.py, and run as python unique.py.

ら.Afraid
#7 · 2020-06-01 01:51

Quick and dirty C program that's blazingly fast:

#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;
  while((c = getchar()) != EOF)
    chars[c] = 1;
  for(c = 32; c < 127; c++)  // printable chars only
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}

Compile it, then run

cat file | ./a.out

This prints the unique printable characters in file. Note that, unlike the question's example, it treats upper and lower case as distinct.
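For comparison, the same table-of-flags technique can be sketched in Python. This is an illustrative port under the same assumptions as the C version (one flag per byte value, printable ASCII only), not a replacement for it; the function name `printable_chars` is made up here.

```python
def printable_chars(data):
    """Return the distinct printable ASCII characters found in a bytes object."""
    seen = bytearray(256)  # one flag per possible byte value
    for byte in data:      # iterating over bytes yields ints in Python 3
        seen[byte] = 1
    # Report only codes 32..126, the printable ASCII range.
    return "".join(chr(c) for c in range(32, 127) if seen[c])

# Sample input taken from the question; case is preserved, as in the C program.
print(printable_chars(b"Yabba\nDabba\nDoo\n"))  # prints "DYabo"
```

To run it on a real file, something like `printable_chars(open("file", "rb").read())` should work; opening in binary mode keeps every value in the 0..255 range that the table expects.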
