I have an ASCII text file. I want to generate a list of all "words" from that file using one or more Ubuntu commands. A word is defined as an alphanumeric sequence between delimiters. Delimiters are whitespace by default, but I also want to experiment with other characters such as punctuation. In other words, I want to be able to specify a delimiter character set. How do I produce only a unique set of words? What if I also want to list only those words that are at least N characters long?
Here's my word-cloud-like chain:
cat myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr
If you have a TeX file, replace cat with detex:
detex myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr
This ought to work for you:
If you want the words that are at least five characters long, pipe the output of tr through grep ..... (five dots, a regex that matches any line with at least five characters). If you want case-insensitivity, stick tr A-Z a-z someplace in the pipeline before sort. Note that LC_ALL=C is necessary for sort to work correctly. I'd recommend reading the man pages for any commands you don't understand here.
You could use grep:
-E '\w+' searches for words
-o only prints the portion of the line that matches
% cat temp
Some examples use "The quick brown fox jumped over the lazy dog," rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit" for example text.
If you don't care whether words repeat:
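The command for this step does not appear in the post as it stands. A minimal sketch, assuming GNU grep (for \w) and using a hypothetical helper name all_words that is not from the original answer, might look like this:

```shell
# Hypothetical helper: print every word, one per line, repeats included.
# Reads the named files, or standard input if none are given.
all_words() {
    # -E: extended regular expressions
    # -o: print only the matching part, one match per line
    # \w+: a run of letters, digits or underscores (GNU grep extension)
    grep -o -E '\w+' "$@"
}

# Example on a fragment of the sample text.
printf 'The quick brown fox jumped over the lazy dog,\n' | all_words
```

With the sample file above, this would be called as all_words temp. Note that \w also accepts the underscore, so it is slightly broader than a strict alphanumeric definition of a word.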
If you want to only print each word once, disregarding case, you can use sort:
-u only prints each word once
-f tells sort to ignore case when comparing words
If you only want each word once:
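The command itself seems to be missing here as well. A sketch that combines the grep and sort flags described above, again with a hypothetical helper name (unique_words) and assuming GNU grep, could be:

```shell
# Hypothetical helper: print each word once, ignoring case.
# Reads the named files, or standard input if none are given.
unique_words() {
    # grep splits the input into one word per line;
    # sort -f folds case when comparing, -u keeps one line per equal group.
    grep -o -E '\w+' "$@" | sort -u -f
}

# "The", "the" and "THE" collapse into a single entry.
printf 'The quick the QUICK fox THE\n' | unique_words
```

Which spelling of a word survives (The or the) depends on the input order and the sort implementation, so lowercase the words first with tr A-Z a-z if that matters.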
You can also use the tr command.
The -c is for the complement of the specified characters; the -s squeezes out duplicates of the replacements; the 'a-zA-Z0-9' is the set of alphanumerics (if you add a character here, the input won't get delimited on that character); the '\n' is the replacement character (newline). When '-' is added to the list of non-delimiters, lazy-dog is printed as a single word; otherwise it is split into lazy and dog.
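The tr invocations being described are not shown in the post. A sketch consistent with the flags and character sets named above, wrapped in a hypothetical helper split_words so the word-character set can be varied, might be:

```shell
# Hypothetical helper: split standard input into one word per line.
# $1 is the set of characters that belong to a word; everything else delimits.
split_words() {
    # -c: operate on the complement of the set (i.e. on the delimiters)
    # -s: squeeze each run of replaced characters into a single newline
    tr -cs "$1" '\n'
}

# Only alphanumerics are word characters, so "-" splits lazy-dog.
echo 'the quick brown fox jumped over the lazy-dog' | split_words 'a-zA-Z0-9'

# With "-" appended to the set, lazy-dog stays together.
echo 'the quick brown fox jumped over the lazy-dog' | split_words 'a-zA-Z0-9-'
```

The hyphen goes at the end of the set so that tr reads it as a literal character rather than as part of a range. If the input starts with a delimiter, the output begins with one empty line, which a later filter can drop.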
Summary for tr: any character not in the set given with -c acts as a delimiter. I hope this solves your delimiter problem too.
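To tie this back to the three parts of the question (custom delimiters, unique words, minimum length N), here is one way the pieces could be combined. This is only a sketch: the function name list_words and its argument order are made up for illustration, and the length filter generalises the grep ..... idea from the earlier answer.

```shell
# Hypothetical helper: list_words [MIN_LEN] [WORD_CHARS] < input
#   MIN_LEN     minimum word length (default 1)
#   WORD_CHARS  tr-style set of characters that make up a word
#               (default a-zA-Z0-9); every other character is a delimiter
list_words() {
    min_len=${1:-1}
    word_chars=${2:-a-zA-Z0-9}
    # 1. Turn every run of delimiter characters into a single newline.
    # 2. Keep only lines with at least MIN_LEN characters
    #    (this also drops any empty line).
    # 3. Sort bytewise and keep each word once.
    tr -cs "$word_chars" '\n' |
        grep -E ".{$min_len,}" |
        LC_ALL=C sort -u
}

# Unique words of at least five characters, default delimiters.
printf 'The quick, brown fox; the lazy-dog\n' | list_words 5

# Same, but treating "-" as part of a word.
printf 'The quick, brown fox; the lazy-dog\n' | list_words 5 'a-zA-Z0-9-'
```

Because of LC_ALL=C, uppercase words sort before lowercase ones and The and the count as different words; add tr A-Z a-z before sort for a case-insensitive list.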