Find Unique Characters in a File

2020-06-01 01:33发布

I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

For instance, if my file were the following;

Entry
-----
Yabba
Dabba
Doo

Then the result would be

Unique characters: {abdoy}

Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

Update

I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

Update 2

By Fast, I mean fast to implement...not necessarily fast to run.

22条回答
Root(大扎)
2楼-- · 2020-06-01 01:52

Quick and dirty solution using grep (assuming the file name is "file"):

for char in a b c d e f g h i j k l m n o p q r s t u v w x y z; do 
    if [ ! -z "`grep -li $char file`" ]; then 
        echo -n $char; 
    fi; 
done; 
echo

I could have made it a one-liner but just want to make it easier to read.

(EDIT: forgot the -i switch to grep)

查看更多
兄弟一词,经得起流年.
3楼-- · 2020-06-01 01:55

Python without using a set.

file = open('location', 'r')

letters = []
for line in file:
    for character in line:
        if character not in letters:
            letters.append(character)

print(letters)
查看更多
我欲成王,谁敢阻挡
4楼-- · 2020-06-01 02:00
cat yourfile | 
 perl -e 'while(<>){chomp;$k{$_}++ for split(//, lc $_)}print keys %k,"\n";'
查看更多
Rolldiameter
5楼-- · 2020-06-01 02:00

BASH shell script version (no sed/awk):

while read -n 1 char; do echo "$char"; done < entry.txt | tr [A-Z] [a-z] |  sort -u

UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.

#include <iostream>
#include <set>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while ( std::cin.get(ch) ) {
        if (! isspace(ch) ) {
            seen_chars.insert(tolower(ch));
        }
    }

    for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
        std::cout << *iter << std::endl;
    }

    return 0;
}

Note that I'm ignoring whitespace and it's case insensitive as requested.

For a 450,000+ entry file (chars.txt), here's a sample run time:

[user@host]$ g++ -o unique_chars unique_chars.cpp 
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y

real    0m0.638s
user    0m0.612s
sys     0m0.017s
查看更多
迷人小祖宗
6楼-- · 2020-06-01 02:02

Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code

using System;
using System.IO;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication {
    class Program {
        static void Main(string[] args) {
            FileInfo fileInfo = new FileInfo(@"C:/data.txt");
            Console.WriteLine(fileInfo.Length);

            Stopwatch sw = new Stopwatch();
            sw.Start();

            Hashtable table = new Hashtable();

            StreamReader sr = new StreamReader(@"C:/data.txt");
            while (!sr.EndOfStream) {
                char c = Char.ToLower((char)sr.Read());
                if (!table.Contains(c)) {
                    table.Add(c, null);
                }
            }
            sr.Close();

            foreach (char c in table.Keys) {
                Console.Write(c);
            }
            Console.WriteLine();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}

produces output

4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .

The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.

查看更多
啃猪蹄的小仙女
7楼-- · 2020-06-01 02:04

While not an script this java program will do the work. It's easy to understand an fast ( to run )

import java.util.*;
import java.io.*;
public class  Unique {
    public static void main( String [] args ) throws IOException { 
        int c = 0;
        Set s = new TreeSet();
        while( ( c = System.in.read() ) > 0 ) {
            s.add( Character.toLowerCase((char)c));
        }
        System.out.println( "Unique characters:" + s );
    }
}

You'll invoke it like this:

type yourFile | java Unique

or

cat yourFile | java Unique

For instance, the unique characters in the HTML of this question are:

Unique characters:[ , , ,  , !, ", #, $, %, &, ', (, ), +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, @, [, \, ], ^, _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, {, |, }]
查看更多
登录 后发表回答