Vim, word frequency function and French accents

Posted 2019-07-02 14:51

Question:

I have recently discovered the Vim Tip n° 1531 (Word frequency statistics for a file).

As suggested, I put the following code in my .vimrc:

function! WordFrequency() range
  let all = split(join(getline(a:firstline, a:lastline)), '\A\+')
  let frequencies = {}
  for word in all
    let frequencies[word] = get(frequencies, word, 0) + 1
  endfor
  new
  setlocal buftype=nofile bufhidden=hide noswapfile tabstop=20
  for [key,value] in items(frequencies)
    call append('$', key."\t".value)
  endfor
  sort i
endfunction
command! -range=% WordFrequency <line1>,<line2>call WordFrequency()

It works fine except for accented letters and other French-specific characters (Latin small ligatures such as æ or œ, etc.).

What should I change in this function to make it handle these characters?

Thanks in advance

Answer 1:

For 8-bit characters you can try to change the split pattern from \A\+ to [^[:alpha:]]\+.
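A minimal sketch for trying this pattern outside the full function. The helper name SplitAlpha is hypothetical, and whether accented letters are treated as alphabetic may depend on your encoding and Vim version, so it is worth checking the result on your own setup:

```vim
" Hypothetical helper: split a string on runs of characters
" that are outside the [:alpha:] class.
function! SplitAlpha(text)
  return split(a:text, '[^[:alpha:]]\+')
endfunction

" Sample phrase for a quick check; inspect the list that is echoed.
echo SplitAlpha("Rendez-vous après l'apéritif.")
```

If the accented words come out in one piece, the same pattern can replace '\A\+' in the split() call of WordFrequency().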



Answer 2:

The pattern \A\+ matches one or more consecutive non-alphabetic characters, which unfortunately includes multibyte characters like our beloved çàéô and friends.

That means that your text is split not only at spaces and punctuation but also at every multibyte character.

With \A\+, the phrase

Rendez-vous après l'apéritif.

gives:

ap      1
apr     1
l       1
Rendez  1
ritif   1
s       1
vous    1

If you are sure your text doesn't include fancy spaces, you could replace this pattern with \s\+, which matches whitespace only, but it's probably too liberal.

With this pattern, \s\+, the same phrase gives:

après       1
l'apéritif. 1
Rendez-vous 1

which, I think, is closer to what you want.

Some customization may be necessary to exclude punctuation.
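One possible way to do that customization, as a sketch: split on whitespace, then trim punctuation from both ends of each word so that inner dashes and apostrophes survive. The helper name TrimPunctuation is hypothetical, and [:punct:] only covers ASCII punctuation, so typographic quotes or guillemets would need to be added to the pattern by hand.

```vim
" Hypothetical helper: strip leading and trailing ASCII punctuation
" from each word, and drop words that end up empty.
function! TrimPunctuation(words)
  let result = []
  for word in a:words
    let trimmed = substitute(word, '^[[:punct:]]\+\|[[:punct:]]\+$', '', 'g')
    if trimmed !=# ''
      call add(result, trimmed)
    endif
  endfor
  return result
endfunction

" Usage with the whitespace-only split:
echo TrimPunctuation(split("Rendez-vous après l'apéritif.", '\s\+'))
" → ['Rendez-vous', 'après', 'l''apéritif']
```

Inside WordFrequency(), the result of this call would take the place of the list assigned to all.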



Answer 3:

function! WordFrequency() range
  " Whitespace and all punctuation characters except dash and single quote
  let wordSeparators = '[[:blank:],.;:!?%#*+^@&/~_|=<>\[\](){}]\+'
  let all = split(join(getline(a:firstline, a:lastline)), wordSeparators)
  "...
endfunction

If all punctuation characters should be word separators, the expression shortens to

let wordSeparators = '[[:blank:][:punct:]]\+'
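For reference, here is a sketch of the whole function with the first separator pattern in place. It assumes the elided part of the body is unchanged from the code in the question; only the split pattern differs.

```vim
function! WordFrequency() range
  " Whitespace and ASCII punctuation, except dash and single quote
  let wordSeparators = '[[:blank:],.;:!?%#*+^@&/~_|=<>\[\](){}]\+'
  let all = split(join(getline(a:firstline, a:lastline)), wordSeparators)

  " Count occurrences of each word
  let frequencies = {}
  for word in all
    let frequencies[word] = get(frequencies, word, 0) + 1
  endfor

  " Show the counts in a scratch buffer, one "word<Tab>count" per line
  new
  setlocal buftype=nofile bufhidden=hide noswapfile tabstop=20
  for [key, value] in items(frequencies)
    call append('$', key . "\t" . value)
  endfor
  sort i
endfunction
command! -range=% WordFrequency <line1>,<line2>call WordFrequency()
```

Because the pattern lists separators explicitly instead of excluding non-letters, accented and other multibyte letters are never treated as split points.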