I'm using the code below to separate all text within a paragraph tag into sentences. It is working okay with a few exceptions. However, tags within paragraphs are chewed up and spit out. Example:
<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>
So, how can I ignore tags so that I can parse just the sentences, wrap each one in a span tag, and keep <a> and other inline tags in place? Or is it smarter to walk the DOM and do it that way?
// Split text on page into clickable sentences
$('p').each(function() {
    var sentences = $(this)
        .text()
        .replace(/(((?![.!?]['"]?\s).)*[.!?]['"]?)(\s|$)/g,
                 '<span class="sentence">$1</span>$3');
    $(this).html(sentences);
});
I am using this in a Chrome extension content script, which means the JavaScript is injected into every page it comes in contact with and parses the <p> tags on the fly. Therefore, it needs to be JavaScript.
Soapbox
We could craft a regex to match your specific case, but since this is HTML parsing and your use case suggests that any number of tags could appear in there, you'd be best off using the DOM or a product like HTML Agility Pack (free).
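As a rough sketch of the DOM route in JavaScript (since the question is about a content script): the snippet below visits each text node inside a paragraph and wraps its sentence-like pieces in spans, so existing child tags are never re-serialized. The names splitSentences and wrapSentencesInParagraph are made up for illustration, and the sentence rule is deliberately naive (it will split on abbreviations). One known limitation: a sentence that runs across a tag boundary ends up as several spans, one per text node.

```javascript
// Hypothetical helper: split plain text into pieces that each end after
// sentence punctuation (plus an optional closing quote and whitespace).
// A trailing piece without punctuation is kept as its own piece.
function splitSentences(text) {
  return text.match(/[^.!?]*[.!?]+['"]?\s*|[^.!?]+$/g) || [];
}

// Hypothetical helper: wrap the sentence pieces of every text node under
// one paragraph element, leaving child elements such as links in place.
function wrapSentencesInParagraph(p) {
  const walker = document.createTreeWalker(p, NodeFilter.SHOW_TEXT);
  const textNodes = [];
  // Collect first, because replacing nodes while walking confuses the walker.
  while (walker.nextNode()) {
    textNodes.push(walker.currentNode);
  }
  for (const node of textNodes) {
    const fragment = document.createDocumentFragment();
    for (const piece of splitSentences(node.nodeValue)) {
      if (piece.trim() === '') {
        // Keep pure whitespace as plain text rather than an empty span.
        fragment.appendChild(document.createTextNode(piece));
        continue;
      }
      const span = document.createElement('span');
      span.className = 'sentence';
      span.textContent = piece;
      fragment.appendChild(span);
    }
    node.parentNode.replaceChild(fragment, node);
  }
}

// Only touch the page when a DOM is actually available.
if (typeof document !== 'undefined') {
  document.querySelectorAll('p').forEach(wrapSentencesInParagraph);
}
```

Working on text nodes instead of calling .text() and .html() on the paragraph is what keeps the link from the example intact.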
However
If you're just looking to pull out the inner text and aren't interested in retaining any of the tag data, you could use this regex and replace every match with an empty string:
(<[^>]*>)
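In JavaScript, that tag-stripping pattern could be applied roughly like this. The helper name stripTags is made up for illustration, and the pattern assumes no literal ">" appears inside an attribute value.

```javascript
// stripTags is a hypothetical helper name; the pattern is the one shown above.
// The "g" flag makes replace() remove every tag, not just the first.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

const sample = '<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>';
console.log(stripTags(sample));
// prints: This is a sample of a link getting chewed up.
```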
Retain sentence as is including subtags
((?:<p(?:\s[^>]*)?>).*?</p>)
- retain the paragraph tags and entire sentence, but not any data outside the paragraph
(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)
- retain just the paragraph innertext including all subtags, and store sentence into group 1
(<p(?:\s[^>]*)?>)(.*?)(</p>)
- capture open and close paragraph tags and the innertext including any sub tags
Granted these are PowerShell examples, the regex and replace function should be similar
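For comparison, here is a minimal JavaScript sketch of the three-group pattern. The helper name extractParagraphs and the sample markup are assumptions for illustration. In a JavaScript regex literal the "/" in the closing tag has to be escaped, and the "s" flag is added so that "." can also cross line breaks.

```javascript
// Three-group pattern from the list above, written as a JavaScript literal.
// "g" finds every paragraph; "s" lets "." match newlines inside a paragraph.
const paragraphPattern = /(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/gs;

// extractParagraphs is a hypothetical helper name.
// It returns the opening tag, inner HTML, and closing tag of each match.
function extractParagraphs(html) {
  return Array.from(html.matchAll(paragraphPattern), (m) => ({
    open: m[1],
    inner: m[2],
    close: m[3],
  }));
}

const page =
  '<div><p class="intro">A <a href="#">link</a> here.</p><p>Second one.</p></div>';
for (const paragraph of extractParagraphs(page)) {
  console.log(paragraph.inner);
}
```

The inner group keeps sub tags such as the link untouched, so it can be processed separately and placed back between the captured open and close tags.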
Yields