Russian input for word count

Posted 2019-08-29 02:35

Question:

OK, so this is what I have (special thanks to Tushar Gupta for fixing the code):

HTML

<input type='checkbox' value='2' name='v'>STS
<input type='checkbox' value='4' name='v'>NTV

JS

$(function () {
    var wordCounts = {};
    $("input[type='text']:not(:disabled)").keyup(function () {
        // Each word contributes two \b boundaries, hence the division by 2
        var matches = this.value.match(/\b/g);
        wordCounts[this.id] = matches ? matches.length / 2 : 0;
        var finalCount = 0;
        var x = 0;
        // Sum the values of the checked checkboxes to get the multiplier
        $('input:checkbox:checked').each(function () {
            x += parseInt(this.value, 10);
        });
        x = (x == 0) ? 1 : x;
        $.each(wordCounts, function (k, v) {
            finalCount += v * x;
        });
        $('#finalcount').val(finalCount);
    }).keyup();
    // Recount whenever a checkbox is toggled
    $('input:checkbox').change(function () {
        $('input[type="text"]:not(:disabled)').trigger('keyup');
    });
});

I want it to be able to count Russian words, e.g. "Привет как дела". So far it only works with English input.

Answer 1:

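The problem is in your regex: in JavaScript, \b is defined in terms of \w, which matches only ASCII letters, digits and the underscore. Cyrillic letters are not word characters as far as \b is concerned, so purely Russian text produces no word boundaries at all.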
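A quick way to see this is to count the boundaries directly. The sketch below is only an illustration; the helper name countBoundaries is made up here and is not part of the original code.

```javascript
// Hypothetical helper: counts the zero-width \b matches in a string.
// \b sits between a \w character ([A-Za-z0-9_]) and a non-\w character
// (or the start/end of the string).
function countBoundaries(text) {
  const matches = text.match(/\b/g);
  return matches ? matches.length : 0;
}

// Four English words give two boundaries each.
console.log(countBoundaries("hello how are you")); // 8
// No character here is in \w, so there is nothing for \b to match.
console.log(countBoundaries("Привет как дела"));   // 0
```

This is why dividing by 2 gives the right answer for English input and 0 for Russian input.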

Try changing this line:

    var matches = this.value.match(/\b/g);

To this:

    var matches = this.value.match(/[^\s\.\!\?]+/g);

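and see if that gives a result for Cyrillic input. This pattern matches each run of characters that are not whitespace, periods, exclamation marks or question marks, so every match is a whole word. If it works, you no longer need to divide by 2 to get the word count: use matches.length directly.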
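As a rough sketch of how that pattern behaves outside the page, here it is wrapped in a standalone function. The name countWords is an assumption made for this example. The regex is the one suggested above.

```javascript
// Hypothetical helper: counts runs of characters that are not whitespace
// and not one of . ! ?
function countWords(text) {
  const matches = text.match(/[^\s\.\!\?]+/g);
  // Each match is one whole word, so no division by 2 is needed.
  return matches ? matches.length : 0;
}

console.log(countWords("Привет как дела")); // 3
console.log(countWords("Hello, world!"));   // 2 ("Hello," and "world")
console.log(countWords(""));                // 0
```

Note that this approach treats anything that is not whitespace or . ! ? as part of a word, so a free-standing dash or other punctuation mark surrounded by spaces would also be counted as a word.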



Answer 2:

The \b notation is defined in terms of “word boundaries”, but with “word” meaning a sequence of ASCII letters, digits and underscores, so it cannot be used for Russian texts. A simple approach is to count sequences of Cyrillic letters, and the range from U+0400 to U+0481 covers the Cyrillic letters used in Russian. Replace the lines

var matches = this.value.match(/\b/g);
wordCounts[this.id] = matches ? matches.length / 2 : 0;

with the lines

var matches = this.value.match(/[\u0400-\u0481]+/g);
wordCounts[this.id] = matches ? matches.length : 0;

You should perhaps treat a hyphen as a letter (and therefore add \- inside the brackets), so that a hyphenated compound is counted as one word, but this is debatable: is “жили-были”, for example, two words or one?
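The effect of the hyphen choice can be sketched as follows. The function names and the hyphenJoins parameter are invented for this illustration. The second function is a variant not mentioned above: it uses the Unicode property escape \p{L} with the u flag (ES2018), which may be worth considering if the input mixes Russian and English, assuming the target browsers support it.

```javascript
// Hypothetical helper: counts runs of Cyrillic letters (U+0400 to U+0481).
// When hyphenJoins is true, a hyphen is treated like a letter, so a
// hyphenated compound counts as one word.
function countCyrillicWords(text, hyphenJoins) {
  const pattern = hyphenJoins ? /[\u0400-\u0481\-]+/g : /[\u0400-\u0481]+/g;
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}

// Variant (an assumption, not from the answer): counts runs of letters
// in any script using a Unicode property escape.
function countLetterWords(text) {
  const matches = text.match(/\p{L}+/gu);
  return matches ? matches.length : 0;
}

console.log(countCyrillicWords("Привет как дела", false)); // 3
console.log(countCyrillicWords("жили-были", false));       // 2
console.log(countCyrillicWords("жили-были", true));        // 1
console.log(countLetterWords("Hello Привет"));             // 2
```

With the hyphen inside the brackets, a hyphen standing on its own between spaces would also be counted as a word, which is one more reason the choice is a judgment call.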