I'm using the code below to separate all text within a paragraph tag into sentences. It is working okay with a few exceptions. However, tags within paragraphs are chewed up and spit out. Example:

<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>

So, how can I ignore tags so that I can just parse sentences, place span tags around them, and keep inline tags such as <a> in place? Or is it smarter to somehow walk the DOM and do it that way?

// Split text on page into clickable sentences
$('p').each(function() {
    var sentences = $(this)
        .text()
        .replace(/(((?![.!?]['"]?\s).)*[.!?]['"]?)(\s|$)/g, 
                 '<span class="sentence">$1</span>$3');
    $(this).html(sentences);
});

I am using this in a Chrome extension content script, which means the JavaScript is injected into any page it comes in contact with and parses the <p> tags on the fly. Therefore, it needs to be JavaScript.

Tags: javascript regex parsing nlp text-segmentation

1 Answer

We Are One

#2 · 2019-07-16 12:33

Soapbox

We could craft a regex to match your specific case, but since this is HTML parsing and your use case hints that any number of tags could appear inside the paragraph, you'd be best off using the DOM or a library like Html Agility Pack (free).
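To make the "don't let the regex see the tags" idea concrete for a content script, here is a minimal JavaScript sketch. It is not a true DOM walk: it splits a paragraph's inner HTML into tag tokens and text tokens, looks for sentence ends only in text that sits outside any open inline element, and copies every tag through verbatim. The function name wrapSentences, the short list of void tags, and the sentence-boundary rule are all assumptions made for illustration, and well-formed input is assumed.

```javascript
// Hypothetical helper: wrap each sentence of a paragraph's inner HTML in a span
// while copying tags through untouched. Assumes well-formed inner HTML.
function wrapSentences(innerHtml) {
  // Assumed (incomplete) list of tags that never have a closing tag.
  var VOID_TAG = /^<(?:br|hr|img|input|wbr)\b/i;
  // The capture group makes split() keep the tags as their own tokens.
  var tokens = innerHtml.split(/(<[^>]*>)/);
  var out = '';
  var current = ''; // HTML of the sentence being accumulated
  var depth = 0;    // number of inline elements currently open

  function flush() {
    // Keep leading whitespace outside the span, as the original regex did.
    var lead = current.match(/^\s*/)[0];
    var body = current.slice(lead.length);
    out += lead + (body ? '<span class="sentence">' + body + '</span>' : '');
    current = '';
  }

  tokens.forEach(function (token) {
    if (token.charAt(0) === '<') {
      if (token.charAt(1) === '/') {
        depth = Math.max(0, depth - 1);
      } else if (!VOID_TAG.test(token) && !/\/>$/.test(token)) {
        depth += 1;
      }
      current += token; // tags are never rewritten
      return;
    }
    if (depth > 0) {
      // Inside an inline element: never cut here, so spans stay well nested.
      current += token;
      return;
    }
    // Terminal punctuation, optional closing quote, then whitespace or end of text.
    var boundary = /[.!?]['"]?(?=\s|$)/g;
    var last = 0;
    var match;
    while ((match = boundary.exec(token)) !== null) {
      var end = match.index + match[0].length;
      current += token.slice(last, end);
      flush();
      last = end;
    }
    current += token.slice(last);
  });
  flush();
  return out;
}

console.log(wrapSentences('This is a sample of a <a href="#">link</a> getting chewed up. And a second one!'));
```

In the content script this could presumably replace the .text() call, for example $(this).html(wrapSentences($(this).html())). A sentence that ends inside a link or bold run is deliberately not split, which under-segments rather than producing misnested spans.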

However

If you're just looking to pull out the inner text and aren't interested in retaining any of the tag data, you could use this regex and replace all matches with an empty string:

(<[^>]*>)
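In JavaScript, that tag-stripping replace might look like the following sketch. The helper name stripTags is hypothetical; the global flag is needed so that every tag is removed, not just the first.

```javascript
// Hypothetical helper: drop every tag and keep only the text between them.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

var sample = '<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>';
console.log(stripTags(sample));
// prints: This is a sample of a link getting chewed up.
```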

Retain sentence as is including subtags

((?:<p(?:\s[^>]*)?>).*?</p>) - retain the paragraph tags and entire sentence, but not any data outside the paragraph
(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>) - retain just the paragraph innertext including all subtags, and store sentence into group 1
(<p(?:\s[^>]*)?>)(.*?)(</p>) - capture open and close paragraph tags and the innertext including any sub tags

These are PowerShell examples, but the regex and the replace call should be similar in JavaScript.

$string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'

Write-Host "replace p tags with a new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'

Write-Host
Write-Host "insert p tag's inner text into a span new span tag and return the entire thing including the p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'

Yields

replace p tags with a new span tag
<img> not this stuff either</img><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span><a> other stuff</a>

insert p tag's inner text into a span new span tag and return the entire thing including the p tags
<img> not this stuff either</img><p class=SuperCoolStuff><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>
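For comparison, the same two replacements can be sketched in JavaScript with String.prototype.replace. The variable names are illustrative. The g flag is added because PowerShell's -replace substitutes every match, and the forward slash in </p> has to be escaped inside a regex literal. Note that . does not match newlines in JavaScript, so a paragraph spanning several lines would need [\s\S]*? instead of .*?.

```javascript
var input = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>';

// Replace the p tags with a new span tag.
var replaced = input.replace(
  /(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/g,
  '<span class=sentence>$1</span>'
);

// Keep the p tags and wrap their inner HTML, sub tags included, in a span.
var wrapped = input.replace(
  /(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/g,
  '$1<span class=sentence>$2</span>$3'
);

console.log(replaced);
console.log(wrapped);
```

Both calls treat the whole paragraph as one unit, exactly as the PowerShell version does; they do not split it into individual sentences.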
