I have recently discovered the Vim Tip n° 1531 (Word frequency statistics for a file).
As suggested, I put the following code in my .vimrc:
function! WordFrequency() range
  let all = split(join(getline(a:firstline, a:lastline)), '\A\+')
  let frequencies = {}
  for word in all
    let frequencies[word] = get(frequencies, word, 0) + 1
  endfor
  new
  setlocal buftype=nofile bufhidden=hide noswapfile tabstop=20
  for [key,value] in items(frequencies)
    call append('$', key."\t".value)
  endfor
  sort i
endfunction
command! -range=% WordFrequency <line1>,<line2>call WordFrequency()
It works fine except for accented letters and other French specifics (Latin small ligatures such as æ or œ, etc.).
What should I add to this function to make it suit my needs?
Thanks in advance
For 8-bit characters you can try changing the split pattern from \A\+ to [^[:alpha:]]\+.
The pattern \A\+ matches one or more consecutive non-alphabetic characters (anything outside A-Z and a-z), which unfortunately includes multibyte characters like our beloved çàéô and friends.
That means that your text is split at spaces AND at multibyte characters.
With \A\+, the phrase
Rendez-vous après l'apéritif.
gives:
ap 1
apr 1
l 1
Rendez 1
ritif 1
s 1
vous 1
If you are sure your text doesn't include fancy spaces, you could replace this pattern with \s\+, which matches whitespace only, but it's probably too liberal.
With this pattern, \s\+, the same phrase gives:
après 1
l'apéritif. 1
Rendez-vous 1
which, I think, is closer to what you want.
Some customization may be necessary to exclude punctuation.
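To compare the two patterns side by side, you can run a quick check like the one below from a scratch script. This is only a minimal sketch: the variable name phrase is illustrative, and it assumes 'encoding' is set to utf-8.

```vim
" Illustrative check, assuming 'encoding' is utf-8.
let phrase = "Rendez-vous après l'apéritif."

" \A\+ splits on every run of characters outside A-Z and a-z,
" so the accented letters act as separators too.
echo split(phrase, '\A\+')
" ['Rendez', 'vous', 'apr', 's', 'l', 'ap', 'ritif']

" \s\+ splits on whitespace only, so accents and punctuation
" stay attached to the words.
echo split(phrase, '\s\+')
" ['Rendez-vous', 'après', 'l''apéritif.']
```

The two lists match the frequency tables shown above, before sorting.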
function! WordFrequency() range
  " Whitespace and all punctuation characters except dash and single quote
  let wordSeparators = '[[:blank:],.;:!?%#*+^@&/~_|=<>\[\](){}]\+'
  let all = split(join(getline(a:firstline, a:lastline)), wordSeparators)
  "...
endfunction
If all punctuation characters should be word separators, the expression shortens to:
let wordSeparators = '[[:blank:][:punct:]]\+'
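Putting the pieces together, one way to try different separator patterns without editing the function each time is to pass the pattern in as an argument. The sketch below is an assumption about how that could look, not part of the original tip: the names CountWords, WordFrequencyWith and WordFrequencyPunct are hypothetical, and the reporting steps are copied from the WordFrequency() function in the question.

```vim
" Hypothetical helper (not in the original tip): count the words in a
" list of lines, splitting on the given separator pattern.
function! CountWords(lines, separators) abort
  let frequencies = {}
  for word in split(join(a:lines), a:separators)
    let frequencies[word] = get(frequencies, word, 0) + 1
  endfor
  return frequencies
endfunction

" Same reporting steps as the original WordFrequency(), but the
" separator pattern is a parameter instead of being hard-coded.
function! WordFrequencyWith(separators) range abort
  let frequencies = CountWords(getline(a:firstline, a:lastline), a:separators)
  new
  setlocal buftype=nofile bufhidden=hide noswapfile tabstop=20
  for [key, value] in items(frequencies)
    call append('$', key . "\t" . value)
  endfor
  sort i
endfunction

" Example command using the short "all punctuation" pattern.
command! -range=% WordFrequencyPunct <line1>,<line2>call WordFrequencyWith('[[:blank:][:punct:]]\+')
```

With this in your .vimrc, :WordFrequencyPunct counts over the whole buffer, and a visual selection followed by :WordFrequencyPunct restricts the count to the selected lines. Swapping in the longer pattern that keeps dashes and single quotes only requires changing the argument.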