SQL - Remove all HTML tags in a string

2019-02-08 16:52发布

站内文章 / SQL-Server

7 0

狗以群分

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:

<html><head></head><body>My text.</body></html>

I could attempt to solve the problem by doing the following:

REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>')

However, this is not a strict rule as some of entries break W3C Standards and do not include <head> tags for example. Even worse, there could be missing closing tags. So I would need to include the REPLACE function for each opening and closing tag that could exist.

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    Table.HtmlData,
    '<html>', ''),
    '</html>', ''),
    '<head>', ''),
    '</head>', ''),
    '<body>', ''),
    '</body>', ''),
    '<p>', ''),
    '</p>', '')

I was wondering if there was a better way to accomplish this than using multiple nested REPLACE functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).

回答1:

DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'

SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)

Update - For strings with unclosed tags:

DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'

SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r

回答2:

If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).

Here's an example to output the text from all tags:

declare @htmlData nvarchar(100) = '<html>
<head>
</head>
<body>
   <p>My text.</p>
   <p>My other text.</p>
</body>
</html>';

select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');

select cast(@htmlData as XML).value('.', 'nvarchar(max)');

Note that there's a difference in the output of whitespace between cast and convert.

To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)

For example:

select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');

select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');

Result : My text.

Of course, this still assumes a valid XML.
If for example, a closing tag is missing then this would raise an XML parsing error.

If the HTML isn't well formed as an XML, then one could use PATINDEX & SUBSTRING to get the first p tag. And then cast that to an XML type to get the value.

select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');

or via a funky recursive way:

declare @xmlData nvarchar(100);
WITH Lines(n, x, y) AS (
  SELECT 1, 1, CHARINDEX(char(13), @htmlData)
  UNION ALL
  SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines
  WHERE y > 0
)
SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))
FROM Lines
where PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0
order by n;

select 
@xmlData as xmlData, 
convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;

回答3:

Firstly create a user defined function that strips the HTML out like so:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
     BEGIN
         DECLARE @Start INT;
         DECLARE @End INT;
         DECLARE @Length INT;
         SET @Start = CHARINDEX('<', @HTMLText);
         SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
         SET @Length = (@End - @Start) + 1;
         WHILE @Start > 0
               AND @End > 0
               AND @Length > 0
             BEGIN
                 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');
                 SET @Start = CHARINDEX('<', @HTMLText);
                 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
                 SET @Length = (@End - @Start) + 1;
             END;
         RETURN LTRIM(RTRIM(@HTMLText));
     END;
GO

When you're trying to select it:

SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE

This should lead to you avoiding to have to use several nested replace statements.

Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

回答4:

One more solution, just to demonstrate a trick to replace many values of a table (easy to maintain!!!) in one single statement:

--add any replace templates here:

CREATE TABLE ReplaceTags (HTML VARCHAR(100));
INSERT INTO ReplaceTags VALUES
 ('<html>'),('<head>'),('<body>'),('<p>'),('<br>')
,('</html>'),('</head>'),('</body>'),('</p>'),('</br>');
GO

--This function will perform the "trick"

CREATE FUNCTION dbo.DoReplace(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    SELECT @Content=REPLACE(@Content,HTML,'')
    FROM ReplaceTags;

    RETURN @Content;
END
GO

--All examples I found in your question and in comments

DECLARE @content TABLE(Content VARCHAR(MAX));
INSERT INTO @content VALUES
 ('<html><head></head><body><p>My text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.</p></html>');

--this is the actual query

SELECT dbo.DoReplace(Content) FROM @content;
GO

--Clean-Up

DROP FUNCTION dbo.DoReplace;
DROP TABLE ReplaceTags;

UPDATE

If you add a replace-value to the template-table you might even use different values as replacements like replace a   with an actual line break...

回答5:

This is just an example. You can use this in script to rmeove any html tags:

 DECLARE @VALUE VARCHAR(MAX),@start INT,@end int,@remove varchar(max)
SET @VALUE='<html itemscope itemtype="http://schema.org/QAPage">
<head>

<title>sql - Converting INT to DATE then using GETDATE on conversion? - Stack Overflow</title>
<html>
</html>
'

set @start=charindex('<',@value)
while @start>0
begin
set @end=charindex('>',@VALUE)

set @remove=substring(@VALUE,@start,@end)
set @value=replace(@value,@remove,'')
set @start=charindex('<',@value)
end
print @value

回答6:

This is the simplest way.

DECLARE @str VARCHAR(299)

SELECT @str = '<html><head></head><body><p>My text.</p></body></html>'

SELECT cast(@str AS XML).query('.').value('.', 'varchar(200)')

回答7:

You mention the XML is not always valid, but does it always contain the and tags?

In that case the following would work:

SUBSTRING(Table.HtmlData, 
    CHARINDEX('<p>', Table.HtmlData) + 1, 
    CHARINDEX('</p>', Table.HtmlData) - CHARINDEX('<p>', Table.HtmlData) + 1)

For finding all positions of a within a HTML, there's already a good post here: https://dba.stackexchange.com/questions/41961/how-to-find-all-positions-of-a-string-within-another-string

Alternatively I suggest using Visual Basic, as you mentioned that is also an option.