I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following:
Entry
-----
Yabba Dabba Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
Quick and dirty solution using grep (assuming the file name is "file"):
I could have made it a one-liner but just want to make it easier to read.
(EDIT: forgot the -i switch to grep)
Python without using a set.
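A minimal sketch of one way to do this without a set, not necessarily the code this answer originally used: a dict stands in for the set, with each lowercased character stored as a key. The function name `unique_chars_no_set` is hypothetical, and skipping whitespace is an assumption based on the example in the question.

```python
def unique_chars_no_set(lines):
    """Return the distinct characters found in an iterable of strings.

    Case is ignored and whitespace is skipped (an assumption). A dict is
    used instead of a set; its keys are the characters seen so far, kept
    in first-seen order.
    """
    seen = {}
    for line in lines:
        for ch in line.lower():
            if not ch.isspace():
                seen[ch] = True
    return "".join(seen)
```

Because a file object iterates line by line, something like `with open("file") as fh: print(unique_chars_no_set(fh))` should work for the file in the question.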
BASH shell script version (no sed/awk):
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
Note that I'm ignoring whitespace and it's case insensitive as requested.
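The same set-based idea can be sketched in Python. This is an illustration, not a translation of the C++ program, and it makes no claim about run time. The function name `unique_chars` is hypothetical.

```python
def unique_chars(lines):
    """Return the distinct non-whitespace characters in an iterable of
    strings, ignoring case, as a sorted string."""
    chars = set()
    for line in lines:
        # Lowercase each line, then add every non-whitespace character.
        chars.update(ch for ch in line.lower() if not ch.isspace())
    return "".join(sorted(chars))
```

Sorting the result is not required by the question; it only makes the output repeatable. Usage would be along the lines of `with open("chars.txt") as fh: print(unique_chars(fh))`.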
For a 450,000+ entry file (chars.txt), here's a sample run time:
Where C:/data.txt contains 454,863 rows of seven random alphabetic characters, the following code produces output
4093767
mytojevqlgbxsnidhzupkfawr
c
889
Press any key to continue . . .
The first line of output tells you the number of bytes in C:/data.txt (454,863 * (7 + 2) = 4,093,767 bytes). The next two lines of output are the unique characters in C:/data.txt (including a newline). The last line of output tells you the number of milliseconds the code took to execute on a 2.80 GHz Pentium 4.
While not a script, this Java program will do the work. It's easy to understand and fast (to run).
You'll invoke it like this:
or
For instance, the unique characters in the HTML of this question are: