Python BeautifulSoup extract text between element

2019-01-13 02:58发布

问题:

I try to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But I get all the text between all nested Tags plus the comment.

Can anyone help me to just get "THIS IS MY TEXT" out of this?

回答1:

Learn more about how to navigate through the parse tree in BeautifulSoup. Parse tree has got tags and NavigableStrings (as THIS IS A TEXT). An example

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree you have contents and string.

  • contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the above, that is to say you can get

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

For several children nodes, you can have for instance

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

so here you may play with contents and get contents at the index you want.

You also can iterate over a Tag, this is a shortcut. For instance,

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>


回答2:

You can use .contents:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
... 
THIS IS MY TEXT


回答3:

Use .children instead:

from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children 
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT


回答4:

with your own soup object:

soup.p.next_sibling.strip()
  1. you grab the <p> directly with soup.p *(this hinges on it being the first <p> in the parse tree)
  2. then use next_sibling on the tag object that soup.p returns since the desired text is nested at the same level of the parse tree as the <p>
  3. .strip() is just a Python str method to remove leading and trailing whitespace

*otherwise just find the element using your choice of filter(s)

in the interpreter this looks something like:

In [4]: soup.p
Out[4]: <p>something</p>

In [5]: type(soup.p)
Out[5]: bs4.element.Tag

In [6]: soup.p.next_sibling
Out[6]: u'\n      THIS IS MY TEXT\n      '

In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString

In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'

In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode


回答5:

Short answer: soup.findAll('p')[0].next

Real answer: You need an invariant reference point from which you can get to your target.

You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.

For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') will find paragraph elements, soup.find('p')[0] will be the first paragraph element.

You could in this case use soup.find('p') but soup.findAll('p')[n] is more general since maybe your actual scenario needs the 5th paragraph or something like that.

The next field attribute will be the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next will return your target in the HTML provided.



回答6:

soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
  hit = hit.text.strip()
  print hit

This will print: THIS IS MY TEXT Try this..



回答7:

The BeautifulSoup documentation provides an example about removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:

Removing Elements

Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document:

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                    <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>