可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I try to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a hef="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
I tried it this way:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
print hit.text
But I get all the text between all nested Tags plus the comment.
Can anyone help me to just get "THIS IS MY TEXT" out of this?
回答1:
Learn more about how to navigate through the parse tree in BeautifulSoup
. Parse tree has got tags
and NavigableStrings
(as THIS IS A TEXT). An example
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
To move down the parse tree you have contents
and string
.
contents is an ordered list of the Tag and NavigableString objects
contained within a page element
if a tag has only one child node, and that child node is a string,
the child node is made available as tag.string, as well as
tag.contents[0]
For the above, that is to say you can get
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
For several children nodes, you can have for instance
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
so here you may play with contents
and get contents at the index you want.
You also can iterate over a Tag, this is a shortcut. For instance,
for i in soup.body:
print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
回答2:
You can use .contents
:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
... print hit.contents[6].strip()
...
THIS IS MY TEXT
回答3:
Use .children
instead:
from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children
if isinstance(child, NavigableString) and not isinstance(child, Comment))
Yes, this is a bit of a dance.
Output:
>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
... print ''.join(unicode(child) for child in hit.children
... if isinstance(child, NavigableString) and not isinstance(child, Comment))
...
THIS IS MY TEXT
回答4:
with your own soup object:
soup.p.next_sibling.strip()
- you grab the <p> directly with
soup.p
*(this hinges on it being the first <p> in the parse tree)
- then use
next_sibling
on the tag object that soup.p
returns since the desired text is nested at the same level of the parse tree as the <p>
.strip()
is just a Python str method to remove leading and trailing whitespace
*otherwise just find the element using your choice of filter(s)
in the interpreter this looks something like:
In [4]: soup.p
Out[4]: <p>something</p>
In [5]: type(soup.p)
Out[5]: bs4.element.Tag
In [6]: soup.p.next_sibling
Out[6]: u'\n THIS IS MY TEXT\n '
In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString
In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'
In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
回答5:
Short answer: soup.findAll('p')[0].next
Real answer: You need an invariant reference point from which you can get to your target.
You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.
For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p')
will find paragraph elements, soup.find('p')[0]
will be the first paragraph element.
You could in this case use soup.find('p')
but soup.findAll('p')[n]
is more general since maybe your actual scenario needs the 5th paragraph or something like that.
The next
field attribute will be the next parsed element in the tree, including children. So soup.findAll('p')[0].next
contains the text of the paragraph, and soup.findAll('p')[0].next.next
will return your target in the HTML provided.
回答6:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
hit = hit.text.strip()
print hit
This will print: THIS IS MY TEXT
Try this..
回答7:
The BeautifulSoup documentation provides an example about removing objects from a document using the extract method. In the following example the aim is to remove all comments from the document:
Removing Elements
Once you have a reference to an element, you can rip it out of the
tree with the extract method. This code removes all the comments
from a document:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>