I recently discovered Vim Tip #1531 (Word frequency statistics for a file).
As suggested, I put the following code in my .vimrc:
    function! WordFrequency() range
      let all = split(join(getline(a:firstline, a:lastline)), '\A\+')
      let frequencies = {}
      for word in all
        let frequencies[word] = get(frequencies, word, 0) + 1
      endfor
      new
      setlocal buftype=nofile bufhidden=hide noswapfile tabstop=20
      for [key,value] in items(frequencies)
        call append('$', key."\t".value)
      endfor
      sort i
    endfunction
    command! -range=% WordFrequency <line1>,<line2>call WordFrequency()
It works fine except for accented letters and other French-specific characters (Latin small ligature a or o, etc.).
What should I change in this function to make it suit my needs?
Thanks in advance
For 8-bit characters, you can try changing the split pattern from `\A\+` to `[^[:alpha:]]\+`.

The pattern `\A\+` matches one or more consecutive non-alphabetic characters (`\A` is equivalent to `[^A-Za-z]`), which unfortunately includes multibyte characters like our beloved ç, à, é, ô and friends. That means that your text is split at spaces AND at multibyte characters.
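To make the effect visible, here is a minimal sketch that can be sourced or pasted into Vim. The sample phrase is an illustrative assumption, and the expected result assumes 'encoding' is utf-8. No result is shown for `[^[:alpha:]]\+` because, as far as I know, its handling of accented letters depends on the Vim version and encoding.

```vim
" Illustrative phrase (an assumption, chosen to contain accented letters).
let phrase = 'déjà vu à Noël'

" \A is [^A-Za-z], so every accented letter acts as a separator.
let ascii_split = split(phrase, '\A\+')
echo ascii_split
" → ['d', 'j', 'vu', 'No', 'l']

" The bracket-class variant; check the result on your own setup.
echo split(phrase, '[^[:alpha:]]\+')
```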
With `\A\+`, the phrase gives:
If you are sure your text doesn't include fancy spaces, you could replace this pattern with `\s\+`, which matches whitespace only, but it's probably too liberal. With this pattern, `\s\+`, the same phrase gives:
which, I think, is closer to what you want.
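As a rough illustration of the whitespace-only split, the counting loop from the question can be run on a single string. The phrase is a made-up example, not one from the thread.

```vim
" Illustrative phrase (an assumption); note the comma after 'Noël'.
let phrase = 'déjà vu à Noël, déjà'

" Same counting loop as in WordFrequency(), but splitting on whitespace only.
let frequencies = {}
for word in split(phrase, '\s\+')
  let frequencies[word] = get(frequencies, word, 0) + 1
endfor

" Accented words stay intact: frequencies['déjà'] is 2.
" Punctuation stays attached: the key is 'Noël,' with its comma.
echo frequencies
```

This keeps accented words whole, but it also shows why `\s\+` alone is too liberal: anything that is not whitespace, punctuation included, ends up inside a word.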
Some customization may be necessary to exclude punctuation.
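One possible customization, offered only as a sketch, is to treat whitespace and ASCII punctuation together as separators by using the `[:space:]` and `[:punct:]` bracket classes. The phrase below is a made-up example. Typographic characters outside ASCII (guillemets, no-break spaces, curly apostrophes) are probably not covered by these classes and would need to be added to the collection explicitly.

```vim
" Illustrative phrase (an assumption) with ASCII punctuation and accents.
let phrase = 'Ça va, très bien !'

" Split on runs of whitespace and/or ASCII punctuation.
let words = split(phrase, '[[:space:][:punct:]]\+')
echo words
" → ['Ça', 'va', 'très', 'bien']
```

To try it in the function from the question, the same pattern would replace `'\A\+'` in the `let all = split(...)` line.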
If all punctuation characters should be word separators, the expression shortens to