Find Unique Characters in a File

Posted 2020-06-01 01:33

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

For instance, if my file were the following:

Entry
-----
Yabba
Dabba
Doo

Then the result would be

Unique characters: {abdoy}

Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

Update

I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

Update 2

By Fast, I mean fast to implement...not necessarily fast to run.

22 answers
Juvenile、少年°
#2 · 2020-06-01 01:42

Algorithm: Slurp the file into memory.

Create an array of 256 unsigned ints (one per possible byte value), initialized to zero.

Iterate through the in-memory file, using each byte as a subscript into the array.
    Increment that array element.

Discard the in-memory file.

Iterate over the array of unsigned ints.
    If the count is not zero,
        display the character and its corresponding count.
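
The steps above could be sketched in Python roughly as follows. This is only an illustration, not the answerer's code: the function names `count_bytes` and `report` and the in-memory `sample` are made up here, and the input is lowercased first because the question ignores case.

```python
def count_bytes(data):
    """Return a 256-slot table where slot b holds how often byte b occurs."""
    counts = [0] * 256          # one counter per possible byte value
    for b in data:              # iterating over bytes yields ints 0..255
        counts[b] += 1          # the byte itself is the subscript
    return counts


def report(counts):
    """List (character, count) pairs for every byte that occurred."""
    return [(chr(b), n) for b, n in enumerate(counts) if n]


# Hypothetical sample standing in for the slurped file contents.
sample = b"Yabba\nDabba\nDoo\n"
for ch, n in report(count_bytes(sample.lower())):
    if not ch.isspace():        # skip the newline separators
        print(ch, n)
```

For a real file, the sample could be replaced by the result of reading the file in binary mode, assuming the file fits in memory.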
Explosion°爆炸
#3 · 2020-06-01 01:44

Alternative solution using bash:

sed "s/./\l\0\n/g" inputfile | sort -u | grep -vc ^$

EDIT: Sorry, I actually misread the question. The above code counts the unique characters. Simply omitting the c switch at the end lists them instead, but then this solution has no real advantage over saua's (especially since he now uses the same sed pattern instead of explicit captures).

(This account has been banned)
#4 · 2020-06-01 01:44

A C solution. Admittedly it is not the fastest to code solution in the world. But since it is already coded and can be cut and pasted, I think it counts as "fast to implement" for the poster :) I didn't actually see any C solutions so I wanted to post one for the pure sadistic pleasure :)

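#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define CHARSINSET 256
#define FILENAME "location.txt"

char buf[CHARSINSET + 1];

/* Build a string holding every character whose count is non-zero. */
char *getUniqueCharacters(int *charactersInFile) {
    int x;
    char *bufptr = buf;
    /* Start at 1: a NUL byte would terminate the string early. */
    for (x = 1; x < CHARSINSET; x++) {
        if (charactersInFile[x] > 0)
            *bufptr++ = (char)x;
    }
    *bufptr = '\0';  /* terminate the string, not reassign the pointer */
    return buf;
}

int main(void) {
    FILE *fp;
    int c;  /* int, not char: EOF stays distinct and the index is never negative */
    int *charactersInFile = calloc(CHARSINSET, sizeof(int));
    if (charactersInFile == NULL)
        return 1;
    if (NULL == (fp = fopen(FILENAME, "r"))) {
        printf("File not found.\n");
        free(charactersInFile);
        return 1;
    }
    while ((c = getc(fp)) != EOF) {
        if (c != '\n' && c != '\r')
            charactersInFile[tolower(c)]++;  /* fold case, as the question asks */
    }

    fclose(fp);
    printf("Unique characters: {%s}\n", getUniqueCharacters(charactersInFile));
    free(charactersInFile);
    return 0;
}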
做自己的国王
#5 · 2020-06-01 01:45

Python using a dictionary. I don't know why people are so tied to sets or lists to hold stuff. Granted, a set is probably more efficient than a dictionary, but both are supposed to give constant-time access to items, and both run circles around a list, where for each character you have to search the list to see whether the character is already in it. Also, lists and dictionaries are built-in Python data types that everyone should be using all the time. So even if a set doesn't come to mind, a dictionary should.

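letters = {}
with open('location.txt', 'r') as file:
    # Iterating a file never yields an empty string, so no end-of-file check is needed.
    for line in file:
        for character in line.strip().lower():  # lowercase: the question ignores case
            letters[character] = True  # dictionary keys are unique, so duplicates collapse
print("Unique Characters: {" + "".join(letters.keys()) + "}")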
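
For comparison, here is a minimal sketch of the set-based variant that the answer mentions but does not show. The function name `unique_chars` and the in-memory `sample` lines are assumptions made for illustration; the output is sorted only to make the result reproducible, since the question does not require an order.

```python
def unique_chars(lines):
    """Collect the distinct lowercase characters found in an iterable of lines."""
    seen = set()
    for line in lines:
        # update() adds each character of the string; repeats are ignored
        seen.update(line.strip().lower())
    return seen


# Hypothetical sample standing in for the lines of the file.
sample = ["Yabba\n", "Dabba\n", "Doo\n"]
print("Unique characters: {" + "".join(sorted(unique_chars(sample))) + "}")
```

Because iterating over an open text file yields its lines, a file object could be passed to `unique_chars` in place of the list.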
Melony?
#6 · 2020-06-01 01:45

Well my friend, I think this is what you had in mind....At least this is the python version!!!

f = open("location.txt", "r") # open file

ll = sorted(list(f.read().lower())) #Read file into memory, split into individual characters, sort list
ll = [val for idx, val in enumerate(ll) if (idx == 0 or val != ll[idx-1])] # eliminate duplicates
f.close()
print("Unique Characters: {%s}" % "".join(ll)) # print the characters; any newline in the file shows up as a line break in the output

It does not iterate through each character, it is relatively short as well. You wouldn't want to open a 500 MB file with it (depending upon your RAM) but for shorter files it is fun :)
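
If memory is a concern for larger files, one possible workaround is to read the input in fixed-size chunks instead of all at once. The sketch below is an assumption-laden illustration rather than part of the answer: `unique_chars_chunked`, its `chunk_size` parameter, and the `io.StringIO` demo stream are all hypothetical names and values.

```python
import io


def unique_chars_chunked(stream, chunk_size=64 * 1024):
    """Collect distinct lowercase characters while holding one chunk at a time."""
    seen = set()
    while True:
        chunk = stream.read(chunk_size)   # returns "" once the stream is exhausted
        if not chunk:
            break
        seen.update(chunk.lower())
    seen.difference_update("\r\n")        # drop the line separators
    return seen


# Hypothetical in-memory stream standing in for an open text file;
# the tiny chunk size only exercises the loop.
demo = io.StringIO("Yabba\nDabba\nDoo\n")
print("".join(sorted(unique_chars_chunked(demo, chunk_size=4))))
```

Memory use then depends on the chunk size and the number of distinct characters, not on the file size.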

I also have to add my final attack!!!! Admittedly I eliminated two lines by using standard input instead of a file, I also reduced the active code from 3 lines to 2. Basically if I replaced ll in the print line with the expression from the line above it, I could have had 1 line of active code and one line of imports.....Anyway now we are having fun :)

import itertools, sys

# read standard input into memory, split into characters, eliminate duplicates
ll = [key for key, _ in itertools.groupby(sorted(sys.stdin.read().lower()))]
print("Unique Characters: {%s}" % "".join(ll)) # print the characters; any newline in the input shows up as a line break in the output
Summer. ? 凉城
#7 · 2020-06-01 01:45

As requested, a pure shell-script "solution":

sed -e "s/./\0\n/g" inputfile | sort -u

It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.

For even more ridiculousness, I present the version that dumps the output on one line:

sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done