Get data between two tags in Python

<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>

Using Python, I want to get the text inside the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set

I tried using lxml:

from io import StringIO
from lxml import etree

# html holds the markup shown above
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)

and I get the following output:

['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']

Tags: python lxml scrape

2 answers

beautiful°

#2 · 2019-04-02 03:18

You could use the text_content method:

import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)
for elt in root.xpath('//a'):
    print(elt.text_content())

yields

Granular computing based
data
mining
in the views of rough set and fuzzy set

or, to remove whitespace, you could use

print(' '.join(elt.text_content().split()))

to obtain

Granular computing based data mining in the views of rough set and fuzzy set

Here is another option which you might find useful:

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))

yields

Granular computing based data  mining in the views of rough set and fuzzy set

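(Note, however, that it leaves an extra space between data and mining: the whitespace-only text node between the two spans strips to an empty string, which still gets joined.)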

'//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()". It selects every text node inside the a element, whether it is a direct child or nested at any depth (children, grandchildren, and so on).
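To avoid the doubled space, the empty strings can be dropped before joining. Here is a minimal sketch of that idea. The text_nodes list is hard-coded to mirror what root.xpath('//a/descendant-or-self::text()') returns for the HTML above (an assumption, made so the snippet runs without lxml installed), and join_text_nodes is a hypothetical helper name, not part of lxml.

```python
def join_text_nodes(text_nodes):
    """Strip each text node, drop the ones that end up empty, join with single spaces."""
    stripped = (node.strip() for node in text_nodes)
    return ' '.join(piece for piece in stripped if piece)


# Assumed stand-in for root.xpath('//a/descendant-or-self::text()')
text_nodes = [
    '\nGranular computing based\n',
    'data',
    '\n',  # whitespace-only node between the two spans
    'mining',
    '\nin the views of rough set and fuzzy set\n',
]

result = join_text_nodes(text_nodes)
print(result)
```

With lxml available, the same helper should work directly on the list returned by the XPath query.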


你好瞎i

#3 · 2019-04-02 03:20

With BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> html = (the html you posted above)
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(" ".join(soup.h3.text.split()))
Granular computing based data mining in the views of rough set and fuzzy set

Explanation:

BeautifulSoup parses the HTML, making it easily accessible. soup.h3 accesses the h3 tag in the HTML.

.text returns all the text inside the h3 tag, including text nested in child tags such as the spans, with the tags themselves stripped out.

split() gets rid of the excess whitespace and newlines. Since split() returns a list of words, " ".join() then puts them back together into a single string separated by single spaces.
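The whitespace step works on any string, independently of BeautifulSoup. Here is a small sketch, where raw is an assumed stand-in for the kind of text that soup.h3.text gives back for the HTML in the question:

```python
# Assumed stand-in for soup.h3.text: the tags are gone, the layout whitespace remains.
raw = "\n\nGranular computing based\ndata\nmining\nin the views of rough set and fuzzy set\n\n"

# split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines) and ignores leading and trailing whitespace.
words = raw.split()

# " ".join() glues the list back together with exactly one space between words.
cleaned = " ".join(words)
print(cleaned)
```

Calling split() without arguments matters here: split(" ") would split on single spaces only and leave the newlines in place.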

