I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a script that calculates that for you? I could, of course, do this the boring way by going through the whole text and ignoring the words inside quotation marks, but I have a feeling that there's a neater way using Regex and Array.count. As I know next to nothing about Regex, can someone help me, or tell me that it's impossible with Regex?
Tl;dr: use Regex to match all words (or spaces, doesn't matter) in a text that are outside quotation marks, and count the items in the resulting array.
This is easy enough using PCRE (or Perl of course):
".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
Use the g modifier, and add the s modifier if you want to handle multiline quotes.
Here's the x version for readability:
".*?" (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+
The first two alternatives match everything inside " or ' quotes and discard it (that is what (*SKIP)(?!) does). The last alternative matches all words (I've included ' as being part of a word in this example). The ' character is counted as a quote boundary only at the start or end of a word, so you can still use things like isn't.
Possible modifications:
- To count the text isn't as two words, replace [\w']+ with \w+.
- To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
You get the point ;)
And here's a full Perl script that uses this regex:
#!/usr/bin/env perl
use strict;
use warnings;
$_ = do { local $/; <> };
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";
Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.
Depending on the requirements, you could use The Greatest Regex Trick Ever:
"[^"]*"|(\w+)
Then count the matches of the first capture group.
\w+ matches one or more word characters.
See test at regex101.com
To also skip single-quoted strings:
"[^"]*"|'[^']*'|(\w+)
test at regex101
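As a rough illustration of the capture-group trick, here is a minimal Python sketch using the standard re module. The function name count_unquoted_words and the sample sentence are made up for this example; only the pattern comes from the answer above.

```python
import re

# Quoted runs are tried first and consumed without setting group 1;
# only words outside quotes end up in the capture group.
PATTERN = re.compile(r'"[^"]*"|\'[^\']*\'|(\w+)')

def count_unquoted_words(text: str) -> int:
    """Count matches where the first capture group actually took part."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

sample = 'He said "to be or not to be" and then left.'
print(count_unquoted_words(sample))  # prints 5
```

Note that the single-quote alternative treats any pair of ' characters as a quoted string, so two contractions in the same text (such as isn't ... don't) would swallow the words between their apostrophes.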
A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.
On the other hand, you could maybe go paragraph-by-paragraph, and accumulate a non-quote word count for each paragraph. There would still be pathological cases that could break this (like a paragraph which includes a list of punctuation symbols, including a quotation mark), of course.
In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called @paragraphs, this might look like:
my $wordCount = 0;
foreach my $paragraph (@paragraphs) {
    $paragraph =~ s/".*?"//gs; # remove every quotation that is closed within the paragraph (/s lets it span line breaks)
    $paragraph =~ s/".*$//s;   # remove an unclosed quotation that runs to the end of the paragraph
    $wordCount += getWordCount($paragraph);
}
print "There are $wordCount words outside of quotations, maybe!";
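For comparison, the same paragraph-by-paragraph idea can be sketched in Python. This is only a sketch under assumptions: paragraphs are taken to be separated by blank lines, get_word_count is a hypothetical stand-in for the getWordCount sub, and only double quotes are handled.

```python
import re

def get_word_count(text: str) -> int:
    # Hypothetical helper: a "word" is a run of word characters or apostrophes.
    return len(re.findall(r"[\w']+", text))

def count_outside_quotes(document: str) -> int:
    total = 0
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", document):
        # Drop quotations that open and close inside this paragraph.
        paragraph = re.sub(r'".*?"', "", paragraph, flags=re.DOTALL)
        # Drop an unclosed quotation that runs to the end of the paragraph.
        paragraph = re.sub(r'".*\Z', "", paragraph, flags=re.DOTALL)
        total += get_word_count(paragraph)
    return total
```

Resetting the quote state at every paragraph break is what keeps a multi-paragraph quotation from flipping the in-quote/out-of-quote state for the rest of the document.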
It would work better this way:
Total Number of characters - Sum(characters inside quotes)
You can use this regex to find all "Quoted" strings: \"[^"]*\"
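Since the question asks about words rather than characters, the same subtraction could be applied to word counts instead. A minimal Python sketch, assuming a word is a run of \w characters and that only double quotes delimit quotations (the function name is invented for this example):

```python
import re

QUOTED = re.compile(r'"[^"]*"')  # the quoted-string regex from above
WORD = re.compile(r"\w+")        # assumption: a word is a run of word characters

def words_outside_quotes(text: str) -> int:
    """Total word count minus the words found inside quoted strings."""
    total = len(WORD.findall(text))
    quoted = sum(len(WORD.findall(q)) for q in QUOTED.findall(text))
    return total - quoted

print(words_outside_quotes('One "two three" four'))  # prints 2
```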