
Regex: Find groups of lowercase letters between HT

Posted 2019-04-11 23:30

Question:

I'm attempting to develop a regular expression that can be run in Sigil, the ePub 2 editor.

Small-caps are a well-known problem within the current ePub reader ecosystem. Many readers, such as Adobe Digital Editions, do not support "font-variant: small-caps". After trying several different workarounds, I've settled on creating fake small caps by transforming the text to uppercase and setting the previously lowercase letters to "font-size: 0.75em".

This process is extremely tedious, especially when working with books that have lots of endnotes with citations of other books.

Say that I have a bunch of phrases in an HTML page tagged with an "SC" class. I've created a test phrase:

<span class="SC">Hello World! Testing: one tWo thrEE &amp; W.T.F.</span>
<span class="foo">Don't touch me!</span>

The goal is to write a regex that matches any lowercase letters within the "SC" span tag only, and replace them with:

<span class="FSC">LETTERS</span>

I can manage to match and replace the letters in the first word "Hello", but everything breaks down after that.

Here's what I've got so far:

Find:

(<span class="SC">.*?)([a-z]+)(.*</span>)

Replace:

\1<span class="FSC">\U\2\E</span>\3

The tricky part is then continuing to find the rest of the lowercase letters within that tag, now that a new "FSC" (Fake Small Caps) span tag has been introduced. Trying the same regex again results in "span" and then "class" getting the FSC treatment. Ideally, I'd like to be able to just keep hitting the "Replace All" button until no more matches are found.
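The behaviour described above can be reproduced outside Sigil. The following is a minimal Python sketch, not what Sigil does internally: Python's `re` module has no `\U...\E` case operator, so the uppercasing is done in a callback (the names `NAIVE` and `wrap` are made up for this illustration). It suggests why each pass wraps only one run of letters (the greedy `.*` in the third group swallows the rest of the line) and why the second pass latches onto the `span` of the newly inserted tag.

```python
import re

# The find pattern from the question, unchanged.
NAIVE = re.compile(r'(<span class="SC">.*?)([a-z]+)(.*</span>)')

def wrap(m):
    # Stand-in for the replacement \1<span class="FSC">\U\2\E</span>\3.
    return (m.group(1) + '<span class="FSC">' + m.group(2).upper()
            + '</span>' + m.group(3))

html = '<span class="SC">Hello World! Testing: one tWo thrEE &amp; W.T.F.</span>'

first_pass = NAIVE.sub(wrap, html)         # like one "Replace All" click
second_pass = NAIVE.sub(wrap, first_pass)  # like a second click

print(first_pass)   # only "ello" has been wrapped
print(second_pass)  # the "span" of the new FSC tag has been wrapped
```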

The above example would look like this when finished:

<span class="SC">H<span class="FSC">ELLO</span> W<span class="FSC">ORLD</span>! T<span class="FSC">ESTING</span>: <span class="FSC">ONE</span> <span class="FSC">T</span>W<span class="FSC">O</span> <span class="FSC">THR</span>EE &amp; W.T.F.</span>
<span class="foo">Don't touch me!</span>

It's not pretty, but it works on every ePub reader that I've tested it on.

If you google "epub small caps regex", you'll come across a MobileRead wiki article that I edited to include this regex, which I've decided is not satisfactory:

(<span class="[a-zA-Z0-9\- ]*?(?<!F)SC[a-zA-Z0-9\-]*?">(?:.+?<span class="FSC">.+?</span>)*[\.|,|:|;|-|–|—|!|\?]? ?(?:&amp;)? ?[A-Z]+)([a-z'’\. ]+)(.*?</span>)

This ends up miniaturizing a bunch of punctuation and sometimes stops in the middle of a phrase. I started over, thinking there was probably a better solution that doesn't attempt to plan for every single possibility up front.

If someone comes up with a better solution to this, you'll be the hero of the entire ePub publishing industry.

Update

I've added the accepted (and only) answer to the MobileRead wiki. Please note that this regex has been altered specifically for use in Sigil; YMMV in other environments.

Answer 1:

This is a perfect use case for the technique described in "Collapse and Capture a Repeating Pattern in a Single Regex Expression".

Here it is, modified for your case:

(<span class="SC">(?:(?!<\/span>)(?:[^a-z&]|&[^;]+;))*|(?!^)\G(?:(?!<\/span>)(?:[^a-z&]|&[^;]+;))*)([a-z]+)

Replace with: \1<span class="FSC">\U\2\E</span>

And here's the RegEx explained: http://regex101.com/r/jU6bA5

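This works with "Replace All" because it relies on the regex global modifier (/g): the \G anchor lets each match continue from where the previous one ended, so every run of lowercase letters inside the span is picked up in a single pass.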
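For anyone who wants the same result outside Sigil, here is a rough Python sketch. It is a swapped-in technique, not a port of the regex above: the standard `re` module supports neither `\G` nor `\U...\E`, so this version uses two passes with callbacks instead. The names `SC_SPAN`, `TOKEN`, `wrap_token` and `fake_small_caps` are hypothetical. It assumes the "SC" span contains no nested tags and that the text has not been processed before; running it twice would wrap the letters of the inserted tags.

```python
import re

# Outer pass: an "SC" span up to its first closing tag (assumes no nested tags).
SC_SPAN = re.compile(r'(<span class="SC">)(.*?)(</span>)', re.DOTALL)

# Inner pass: an entity such as &amp; is tried first so that its letters
# are consumed and left alone; otherwise match a run of lowercase letters.
TOKEN = re.compile(r'&[^;]+;|[a-z]+')

def wrap_token(m):
    text = m.group(0)
    if text.startswith('&'):
        return text  # leave entities untouched
    return '<span class="FSC">' + text.upper() + '</span>'

def fake_small_caps(html):
    def fix_span(m):
        return m.group(1) + TOKEN.sub(wrap_token, m.group(2)) + m.group(3)
    return SC_SPAN.sub(fix_span, html)

sample = ('<span class="SC">Hello World! Testing: one tWo thrEE &amp; W.T.F.</span>\n'
          '<span class="foo">Don\'t touch me!</span>')
result = fake_small_caps(sample)
print(result)
```

Splitting the work into an outer match for the span and an inner match for the letters plays the role that `\G` plays in the single-regex version: it keeps the inner replacement confined to the text of one "SC" span.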



Tags: html regex epub