<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>
Using Python, I want to get the text from the anchor tag, which should be:
Granular computing based data mining in the views of rough set and fuzzy set
I tried using lxml:
from io import StringIO
from lxml import etree

# html holds the markup shown above
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)
and got the following output:
['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']
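You could use the text_content method, which returns the text of an element together with the text of all its descendants. The result still contains the line breaks and tabs from the markup. To remove that whitespace, you could, for example, split the text on whitespace and rejoin the pieces with single spaces, to obtain
Granular computing based data mining in the views of rough set and fuzzy set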
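As a minimal sketch of the text_content approach: the html variable below simply reproduces the markup from the question, the variable names are illustrative, and the split/join step is one assumed way of normalizing the whitespace.

```python
import lxml.html

# Markup from the question, reproduced so the example is self-contained.
html = """<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>"""

doc = lxml.html.fromstring(html)
anchor = doc.xpath("//h3/a")[0]

# text_content() returns the element's own text plus the text of all
# its descendants, with the original line breaks preserved.
raw_text = anchor.text_content()

# split() with no arguments splits on any run of whitespace, so joining
# the pieces with single spaces normalizes the spacing.
clean_text = " ".join(raw_text.split())
print(clean_text)
```

Splitting and rejoining handles tabs, carriage returns, and newlines in one step, so it should behave the same on the tab-indented markup from the original page.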
Here is another option which you might find useful: select every text node under the anchor with the XPath expression '//a/descendant-or-self::text()' and join the pieces with spaces. (Note that this leaves an extra space between data and mining, however.)
'//a/descendant-or-self::text()' is a more generalized version of "//a/child::text() | //a/span/child::text()". It iterates through all children, grandchildren, and so on.
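A sketch of the descendant-or-self approach, under the same assumptions as before (the html variable reproduces the question's markup, and the names are illustrative). Dropping the empty pieces before joining is an addition here, intended to avoid the doubled space caused by the whitespace-only text node between the two span elements.

```python
import lxml.html

# Markup from the question, reproduced so the example is self-contained.
html = """<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>"""

doc = lxml.html.fromstring(html)

# descendant-or-self::text() selects every text node under the anchor,
# at any depth, so nested tags need no special handling.
text_nodes = doc.xpath("//h3/a/descendant-or-self::text()")

# Strip each node, then skip the ones that are empty after stripping
# so that joining does not leave two spaces in a row.
pieces = [node.strip() for node in text_nodes]
result = " ".join(piece for piece in pieces if piece)
print(result)
```

Because the expression does not name the span elements, it keeps working if the anchor later contains other inline tags.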
BeautifulSoup:
Explanation: BeautifulSoup parses the HTML, making it easily accessible. soup.h3 accesses the h3 tag in the HTML. .text simply gets all the text inside the h3 tag, leaving out the markup of nested tags such as the spans. I use split() here to get rid of the excess whitespace and newlines, then " ".join() because split() returns a list.
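The steps in the explanation can be sketched roughly as follows. The html variable reproduces the question's markup, the variable names are illustrative, and "html.parser" (Python's built-in parser) is an assumed choice; any parser BeautifulSoup supports should work.

```python
from bs4 import BeautifulSoup

# Markup from the question, reproduced so the example is self-contained.
html = """<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>"""

soup = BeautifulSoup(html, "html.parser")

# soup.h3 is the first h3 tag; .text concatenates all text inside it,
# ignoring the markup of the nested a and span tags.
raw_text = soup.h3.text

# split() drops the newlines and extra whitespace and returns a list,
# which " ".join() turns back into a single string.
title = " ".join(raw_text.split())
print(title)
```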