Can regex match all the words outside quotation marks?

Posted 2019-02-15 18:55

Question:

I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a script that calculates that for you? I could, of course, do this the boring way by going through the whole text and ignoring the words inside quotation marks, but I have a feeling that there's a neater way using Regex and Array.count. As I know next to nothing about Regex, can someone help me, or tell me if it's impossible with Regex?

Tl;dr: use Regex to match all words (or spaces, doesn't matter) that are outside quotation marks from a text, and count the items in the resulting array.

Answer 1:

This is easy enough using PCRE (or Perl of course):

".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+

Use the g modifier, and s if you want to handle multiline quotes.


Here's the x version for readability:

  ".*?"              (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+

The first two alternatives match everything inside " or ' quotes and discard it ((*SKIP)(?!)). The last alternative matches all words (I've included ' as part of a word in this example). The ' character counts as a quote boundary only at the start or end of a word, so you can still use things like isn't.

Possible modifications:

  • To count the text isn't as two words, replace [\w']+ with \w+.
  • To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
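
To see how the choice of word pattern changes the count, here is a minimal Python sketch. It leaves out the quote-skipping alternatives, because Python's built-in re module does not support the (*SKIP) verb. The helper count_tokens and the sample sentence are made up for illustration.

```python
import re

def count_tokens(text: str, word_pattern: str) -> int:
    """Count non-overlapping matches of word_pattern in text."""
    return len(re.findall(word_pattern, text))

sample = "My mother-in-law isn't here"

# Apostrophes are part of a word, hyphens are not.
print(count_tokens(sample, r"[\w']+"))   # 6
# Neither apostrophes nor hyphens are part of a word.
print(count_tokens(sample, r"\w+"))      # 7
# Both apostrophes and hyphens are part of a word.
print(count_tokens(sample, r"[-\w']+"))  # 4
```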

You get the point ;)

And here's a full Perl script that uses this regex:

#!/usr/bin/env perl
use strict;
use warnings;

$_ = do { local $/; <> };
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";

Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.



Answer 2:

Depending on the requirements, you could use The Greatest Regex Trick Ever:

"[^"]*"|(\w+)

Then count only the matches in which the first capture group participated.

\w+ matches one or more word characters.

See test at regex101.com


To also skip single-quoted strings:

"[^"]*"|'[^']*'|(\w+)

test at regex101
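
As a rough illustration, the capture-group approach could look like this in Python. The function name count_unquoted_words and the sample text are assumptions, not part of the answer. Note that the single-quote variant treats an apostrophe (as in isn't) as an opening quote, so this sketch sticks to the double-quote pattern.

```python
import re

# Quoted spans are matched first and capture nothing;
# only words outside double quotes end up in group 1.
PATTERN = re.compile(r'"[^"]*"|(\w+)')

def count_unquoted_words(text: str) -> int:
    """Count the matches in which the first capture group participated."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

sample = 'He said "to be or not to be" and left.'
print(count_unquoted_words(sample))  # 4
```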



Answer 3:

A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.

On the other hand, you could maybe go paragraph-by-paragraph, and accumulate a non-quote word count for each paragraph. There would still be pathological cases that could break this (like a paragraph which includes a list of punctuation symbols, including a quotation mark), of course.

In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called @paragraphs, this might look like:

my $wordCount = 0;
foreach my $paragraph (@paragraphs) {
    $paragraph =~ s/".*?"//gs; # remove quoted passages that have a matching closing quotation mark
    $paragraph =~ s/".*$//s;   # remove a quotation that runs unclosed to the end of the paragraph
    $wordCount += getWordCount($paragraph);
}
print "There are $wordCount words outside of quotations, maybe!\n";
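
For comparison, here is a self-contained Python sketch of the same paragraph-by-paragraph idea. It assumes paragraphs are separated by blank lines and that a word is a run of word characters and apostrophes. The function name count_outside_quotes is hypothetical.

```python
import re

def count_outside_quotes(document: str) -> int:
    """Sum the words outside double quotes, one paragraph at a time."""
    total = 0
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", document):
        # Drop quoted passages that open and close within the paragraph.
        paragraph = re.sub(r'".*?"', "", paragraph, flags=re.S)
        # Drop a quotation that runs unclosed to the end of the paragraph.
        paragraph = re.sub(r'".*\Z', "", paragraph, flags=re.S)
        total += len(re.findall(r"\w[\w']*", paragraph))
    return total

# The first paragraph leaves its quote open; the second opens a new one.
doc = 'She said "one two\n\n"three four" and left.'
print(count_outside_quotes(doc))  # 4
```

Matching quotes across the whole document would pair the first paragraph's opening mark with the second paragraph's opening mark, which is the failure mode described above.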


Answer 4:

It would work better this way:

Total Number of characters - Sum(characters inside quotes)

You can use this regex to find all "Quoted" strings: \"[^"]*\"
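
A minimal Python sketch of the subtraction idea follows. The answer phrases it in characters; since the question asks about words, this version subtracts word counts instead, which is an assumption on my part. The name words_outside_quotes is also made up for the example.

```python
import re

QUOTED = re.compile(r'"[^"]*"')  # the pattern from the answer
WORD = re.compile(r"\w+")        # assumption: a word is a run of word characters

def words_outside_quotes(text: str) -> int:
    """Total word count minus the words found inside double-quoted strings."""
    total = len(WORD.findall(text))
    quoted = sum(len(WORD.findall(q)) for q in QUOTED.findall(text))
    return total - quoted

sample = 'He said "to be or not to be" and left.'
print(words_outside_quotes(sample))  # 4
```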