Regular Expression to Extract HTML Body Content

Posted 2019-01-14 06:23

I am looking for a regex that will let me extract just the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly what content the HTML files I am going to work with will contain, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <p>
      &nbsp;
    </p>
    <p>
      <br />
      &nbsp;
    </p>
    <h1>Header 1</h1>
  </body>
</html>

Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:

((.|\n)*<body (.)*>)|((</body>(*|\n)*)

...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.

6 answers
Ridiculous、
#2 · 2019-01-14 06:55

Match the opening body tag: <\s*body.*?>

Match the closing body tag: <\s*/\s*body.*?>

(note: the \s* parts tolerate stray whitespace inside the tags. Strictly speaking, XML only allows whitespace before the closing >, so the extra \s* is purely defensive.)

Combine them like this and you will match everything in between, including the body tags themselves: <\s*body.*?>.*?<\s*/\s*body.*?>. Make sure you use Singleline mode, which lets . match line breaks as well.

This works in VB.NET, and hopefully others too!
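
As a rough illustration in Java (the language used in another answer here), the sketch below applies the combined pattern with a capture group added so that only the inner content is returned. The class and method names are made up for this example; Pattern.DOTALL is Java's counterpart of the Singleline option.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyRegexSketch {
    // Combined pattern from the answer, plus a capture group around the inner content.
    // DOTALL makes '.' match line breaks too, like Singleline in .NET.
    private static final Pattern BODY = Pattern.compile(
            "<\\s*body.*?>(.*?)<\\s*/\\s*body.*?>", Pattern.DOTALL);

    // Returns the text between the body tags, or null when no body element is found.
    static String innerBody(String xhtml) {
        Matcher m = BODY.matcher(xhtml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String doc = "<html>\n<head><title></title></head>\n"
                + "<body contenteditable=\"true\">\n<p>Example</p>\n</body>\n</html>";
        System.out.println(innerBody(doc));
    }
}
```

Without the DOTALL flag the same pattern finds nothing in a multi-line document, because . would stop at the first line break.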

够拽才男人
#3 · 2019-01-14 07:04
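import java.util.regex.Matcher;
import java.util.regex.Pattern;

String toMatch = "aaaaaaaaaaabcxx sldjfkvnlkfd <body>i m avinash</body>";
// DOTALL lets '.' match line breaks, which a real multi-line document needs
Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?", Pattern.DOTALL);
Matcher matcher = pattern.matcher(toMatch);
// matches() requires the whole input to match; group(1) is the body content
if (matcher.matches()) {
    System.out.println(matcher.group(1));
}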
乱世女痞
#4 · 2019-01-14 07:05

Why can't you just split it by

</{0,1}body[^>]*> 

and take the second string? I believe it will be much faster than looking for a huge regexp.
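
A minimal sketch of this split approach in Java, with hypothetical class and method names. Because the pattern uses [^>] rather than ., it needs no Singleline/DOTALL flag.

```java
import java.util.regex.Pattern;

class BodySplitSketch {
    // Matches either the opening or the closing body tag.
    private static final Pattern BODY_TAG = Pattern.compile("</{0,1}body[^>]*>");

    // Splits on the body tags; the second piece is the text between them.
    // Returns null when the input does not yield at least two pieces.
    static String innerBody(String xhtml) {
        String[] parts = BODY_TAG.split(xhtml);
        return parts.length >= 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        String doc = "<html>\n<body contenteditable=\"true\">\n<p>Example</p>\n</body>\n</html>";
        System.out.println(innerBody(doc));
    }
}
```

This relies on the document containing exactly one body element with some text before the opening tag, which holds for the sample in the question.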

我欲成王,谁敢阻挡
#5 · 2019-01-14 07:08

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed xhtml document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
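
To try the simplified pattern, here is a small Java sketch (the question uses C#'s Regex.Split(); Java's Pattern.split() is used here as a stand-in, and the class and method names are invented). It assumes DOTALL so that . spans line breaks, and that some text follows the closing body tag, since the second alternative ends in .+.

```java
import java.util.regex.Pattern;

class BodyOutsideSplitSketch {
    // Matches everything up to and including <body ...>, or </body> and everything after it.
    private static final Pattern OUTSIDE = Pattern.compile(
            "(.*<\\s*body[^>]*>)|(<\\s*/\\s*body\\s*\\>.+)", Pattern.DOTALL);

    // Splitting on the "outside" text leaves an empty leading piece (the first match
    // starts at index 0) followed by the inner body content.
    static String innerBody(String xhtml) {
        String[] parts = OUTSIDE.split(xhtml);
        return parts.length >= 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        String doc = "<html>\n<body class=\"c\">\n<p>Example</p>\n</body>\n</html>";
        System.out.println(innerBody(doc));
    }
}
```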
爷的心禁止访问
#6 · 2019-01-14 07:13

XHTML is more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser can quickly navigate to the body node and give you back its content without the tag-matching problems that the regex is giving you.

EDIT: In response to a comment here that an XML parser is too slow:

There are two kinds of XML parser. A DOM parser is big and heavy but easy and friendly: it builds a tree out of the whole document before you can do anything. A SAX parser is fast and light but more work: it reads the file sequentially. You will want SAX to find the body tag.

The DOM approach is good for repeated queries, such as pulling out tags and finding which node is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because both simply walk across the file and pattern match. The difference is that the regex won't stop looking after it has found a body tag, because regex has no built-in knowledge of XML. In fact, your SAX parser probably uses small pieces of regex to find each tag.
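
For a sense of what the SAX route looks like, here is a minimal Java sketch using the JDK's built-in SAX parser. The class and method names are invented for illustration. It only collects the character data inside body (not the markup), and the sample input deliberately omits the DOCTYPE and the &nbsp; entities from the question, since handling those (loading the DTD or supplying an entity resolver) is outside this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodySaxSketch {
    // Returns the concatenated character data found inside the body element.
    static String bodyText(String xhtml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inBody = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName holds the tag name.
                if ("body".equals(qName)) inBody = true;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inBody) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("body".equals(qName)) inBody = false;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
                + "<body contenteditable=\"true\"><p>Example paragraph content</p>"
                + "<h1>Header 1</h1></body></html>";
        System.out.println(bodyText(doc));
    }
}
```

Getting the inner markup back, rather than just the text, would mean re-serializing each element from the start and end events, which is part of the "more work" mentioned above.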

爷、活的狠高调
#7 · 2019-01-14 07:17
/.*<body[^>]*>(.*)<\/body>.*/s

replace with

\1
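
A sketch of the same substitution in Java, with made-up class and method names. It assumes the pattern has .* on both sides so that the match spans the whole document; otherwise the replacement would leave the text outside the body tags in place. Java needs no / delimiters, takes the s flag as Pattern.DOTALL, and writes the back-reference as $1 in the replacement string.

```java
import java.util.regex.Pattern;

class BodyReplaceSketch {
    // The leading and trailing ".*" make the match cover the entire input,
    // so replacing the match with group 1 leaves only the body content.
    private static final Pattern WHOLE_DOC = Pattern.compile(
            ".*<body[^>]*>(.*)</body>.*", Pattern.DOTALL);

    // Returns the inner body content, or the input unchanged if nothing matches.
    static String innerBody(String xhtml) {
        return WHOLE_DOC.matcher(xhtml).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String doc = "<html>\n<body contenteditable=\"true\">\n<p>Example</p>\n</body>\n</html>";
        System.out.println(innerBody(doc));
    }
}
```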