Splitting HTML Content Into Sentences, But Keeping

Posted 2019-07-16 12:04

I'm using the code below to separate all text within a paragraph tag into sentences. It is working okay with a few exceptions. However, tags within paragraphs are chewed up and spit out. Example:

<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>

So, how can I ignore tags so that I can parse just the sentences, wrap them in span tags, and keep inline tags (such as the link above) in place? Or is it smarter to walk the DOM and do it that way?

// Split text on page into clickable sentences
$('p').each(function() {
    var sentences = $(this)
        .text()
        .replace(/(((?![.!?]['"]?\s).)*[.!?]['"]?)(\s|$)/g, 
                 '<span class="sentence">$1</span>$3');
    $(this).html(sentences);
});

I am using this in a Chrome extension content script, which means the JavaScript is injected into every page it runs on and parses the <p> tags on the fly. Therefore, the solution needs to be JavaScript.

1 answer
We Are One
#2 · 2019-07-16 12:33

Soapbox

We could craft a regex to match your specific case, but since this is HTML parsing and your use case hints that any number of tags could appear in there, you'd be best off using the DOM or a library like HTML Agility Pack (free).
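For the DOM route in a content script, here is a minimal sketch rather than a definitive implementation. It finds sentence boundaries on the paragraph's plain text, then maps those offsets back onto text nodes and wraps each sentence with a Range, so inline tags stay intact. The names `sentenceRanges`, `locate`, and `wrapSentences` are hypothetical, the sentence regex is a simplification of the one in the question, and the DOM part assumes a browser environment.

```javascript
// Pure helper: [start, end) offsets of each sentence in a plain-text string.
// A sentence ends at . ! or ? (optionally followed by a quote) before
// whitespace or the end of the string. Trailing text with no terminator
// is not returned.
function sentenceRanges(text) {
  const re = /\S[^]*?[.!?]['"]?(?=\s|$)/g;
  return Array.from(text.matchAll(re), m => [m.index, m.index + m[0].length]);
}

// Map a character offset in root.textContent to a text node and local offset.
// For an end offset (atEnd = true) the position just past a node's last
// character still belongs to that node.
function locate(root, offset, atEnd) {
  const walker = root.ownerDocument.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  let node;
  let seen = 0;
  while ((node = walker.nextNode())) {
    const len = node.data.length;
    if (offset < seen + len || (atEnd && offset === seen + len)) {
      return { node, offset: offset - seen };
    }
    seen += len;
  }
  return null;
}

// Wrap every sentence of one paragraph in <span class="sentence">.
// Wrapping never changes textContent, so offsets stay valid between passes.
function wrapSentences(p) {
  for (const [start, end] of sentenceRanges(p.textContent)) {
    const a = locate(p, start, false);
    const b = locate(p, end, true);
    if (!a || !b) continue;
    const range = p.ownerDocument.createRange();
    range.setStart(a.node, a.offset);
    range.setEnd(b.node, b.offset);
    const span = p.ownerDocument.createElement('span');
    span.className = 'sentence';
    // extractContents clones partially selected elements (e.g. a link that
    // straddles a sentence boundary), so the markup stays well-formed.
    span.appendChild(range.extractContents());
    range.insertNode(span);
  }
}

// Only runs where a DOM exists (e.g. inside the content script).
if (typeof document !== 'undefined') {
  document.querySelectorAll('p').forEach(p => wrapSentences(p));
}
```

Keeping the sentence detection separate from the DOM code means the regex can be swapped or tuned without touching the wrapping logic.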

However

If you're just looking to pull out the inner text and don't need to retain any of the tag data, you could use this regex and replace every match with an empty string:

(<[^>]*>)
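In JavaScript, that replacement might look like the following sketch. The function name `stripTags` is made up for illustration, and the sample string is the one from the question.

```javascript
// Remove every tag and keep only the text between tags.
// The g flag makes replace() act on all matches, not just the first.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

const sample = '<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>';
console.log(stripTags(sample));
// This is a sample of a link getting chewed up.
```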


Retain sentence as is including subtags

  • ((?:<p(?:\s[^>]*)?>).*?</p>) - retains the paragraph tags and everything between them, but nothing outside the paragraph

  • (?:<p(?:\s[^>]*)?>)(.*?)(?:</p>) - retains just the paragraph's inner text, including all subtags, and stores it in group 1

  • (<p(?:\s[^>]*)?>)(.*?)(</p>) - captures the opening and closing paragraph tags and the inner text, including any subtags, as groups 1 to 3
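To see what each of the three patterns captures, here is a small JavaScript sketch. The sample string and variable names are made up for illustration. Note that `.` does not match newlines by default, so these patterns assume the paragraph sits on a single line.

```javascript
const html = '<div>outside</div><p class="x">One <a href="#">link</a> here.</p>';

// Pattern 1: group 1 is the whole paragraph, tags included.
const whole = html.match(/((?:<p(?:\s[^>]*)?>).*?<\/p>)/);

// Pattern 2: group 1 is only what sits between <p ...> and </p>.
const inner = html.match(/(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/);

// Pattern 3: groups 1..3 are the opening tag, the inner part, the closing tag.
const parts = html.match(/(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/);

console.log(whole[1]);       // <p class="x">One <a href="#">link</a> here.</p>
console.log(inner[1]);       // One <a href="#">link</a> here.
console.log(parts.slice(1)); // opening tag, inner part, closing tag
```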

These are PowerShell examples, but the regex and the replacement syntax should be similar in JavaScript.

$string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'

Write-Host "replace p tags with a new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'

Write-Host
Write-Host "insert p tag's inner text into a span new span tag and return the entire thing including the p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'

Yields

replace p tags with a new span tag
<img> not this stuff either</img><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span><a> other stuff</a>

insert p tag's inner text into a span new span tag and return the entire thing including the p tags
<img> not this stuff either</img><p class=SuperCoolStuff><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>
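Since the question asks for JavaScript, here is a sketch of the same two replacements using `String.prototype.replace`. The variable names are illustrative. The `g` flag is added because PowerShell's -replace operator replaces every match, while a JavaScript regex without `g` replaces only the first.

```javascript
const input = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>';

// Replace the p tags with a new span tag; $1 is the paragraph's inner part.
const replaced = input.replace(
  /(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/g,
  '<span class=sentence>$1</span>'
);

// Keep the p tags ($1 and $3) and wrap the inner part ($2) in a span.
const wrapped = input.replace(
  /(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/g,
  '$1<span class=sentence>$2</span>$3'
);

console.log(replaced);
console.log(wrapped);
```

Both calls wrap the whole paragraph in one span, as the PowerShell version does. Splitting the paragraph into individual sentences would still need a sentence-boundary step on top of this.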