I'm using Beautiful Soup. There is a tag like this:
<li><a href="example"> s.r.o., <small>small</small></a></li>
I want to get only the text directly inside the anchor <a> tag, without any text from the <small> tag in the output; i.e. " s.r.o., "
I tried find('li').text[0] but it does not work.
Is there a command in BS4 which can do that?
One option would be to get the first element from the contents of the a element:
>>> from bs4 import BeautifulSoup
>>> data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print(soup.find('a').contents[0])
s.r.o.,
Another one would be to find the small tag and get its previous sibling:
>>> print(soup.find('small').previous_sibling)
s.r.o.,
There are all sorts of alternative (and slightly crazy) options as well:
>>> print(next(soup.find('a').descendants))
s.r.o.,
>>> print(next(iter(soup.find('a'))))
s.r.o.,
Use .children, which is an iterator over the direct children of the tag:
print(next(soup.find('a').children))
s.r.o.,
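The approaches above all rely on the wanted text being the first child of the anchor (or the node right before <small>). As a rough sketch for the more general case, one could keep only the direct text children of the tag and skip nested tags entirely. The helper name own_text is made up for illustration and is not part of bs4; NavigableString is the bs4 class used for text nodes.

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper (not a bs4 API): nested tags such as <small>
    are skipped, so their text never reaches the result.
    """
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    )


data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(data, "html.parser")

# repr() makes the surrounding whitespace visible
print(repr(own_text(soup.find("a"))))
```

Unlike contents[0], this should also pick up text that comes after the nested tag; call .strip() on the result if the surrounding whitespace is not wanted.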
If you would like to loop over every anchor tag in an HTML string and print its leading text, this works (for a live web page, fetch the HTML first, e.g. with urlopen from urllib.request):
from bs4 import BeautifulSoup
data = '<li><a href="example">s.r.o., <small>small</small></a></li> <li><a href="example">2nd</a></li> <li><a href="example">3rd</a></li>'
soup = BeautifulSoup(data,'html.parser')
a_tag = soup('a')  # shorthand for soup.find_all('a')
for tag in a_tag:
    print(tag.contents[0])  # .contents is the list of children; [0] is the leading text node
Output:
s.r.o.,
2nd
3rd
a_tag holds all the anchor tags (a ResultSet, which behaves like a list); collecting them this way lets you process them as a group when more than one <a> tag is present.
>>> print(a_tag)
[<a href="example">s.r.o., <small>small</small></a>, <a href="example">2nd</a>, <a href="example">3rd</a>]
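One caveat with the loop above: tag.contents[0] raises IndexError for an empty anchor, and returns a Tag rather than text when the anchor starts with a nested element. A minimal sketch of a more defensive loop follows, assuming you want the first direct text node of each anchor, stripped of whitespace. The helper first_text and the two extra sample anchors are invented for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def first_text(tag):
    """Return the first direct text child of `tag`, stripped, or None.

    Hypothetical helper (not a bs4 API).
    """
    for child in tag.children:
        if isinstance(child, NavigableString):
            return child.strip()
    return None  # empty anchor, or one containing only nested tags


# Sample markup: the last two anchors are made-up edge cases
data = (
    '<li><a href="example">s.r.o., <small>small</small></a></li>'
    '<li><a href="example"><small>small</small> after</a></li>'
    '<li><a href="example"></a></li>'
)
soup = BeautifulSoup(data, "html.parser")

results = [first_text(a) for a in soup("a")]
print(results)
```

With this sample input the list should come out as ['s.r.o.,', 'after', None], so the caller can decide how to treat anchors without any text of their own.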