I need to use Beautiful Soup to accomplish the following.
Example HTML:
<div id = "div1">
Text1
<div id="div2">
Text2
<div id="div3">
Text3
</div>
</div>
</div>
I need to search this and get back a list in which each text is a separate item:
Text1
Text2
Text3
I tried findAll('div'), but it repeated the same text multiple times, i.e. it returned:
Text1 Text2 Text3
Text2 Text3
Text3
Well, your problem is that .text also includes the text of all descendant nodes. You'll have to manually pick out only those text nodes that are immediate children of a node. Also, a given node might contain more than one text node, for example:
<div>
Hello
<div>
foobar
</div>
world!
</div>
How do you want them to be concatenated? Here is a function that joins them with a space:
def extract_text(node):
    return ' '.join(t.strip() for t in node(text=True, recursive=False))
With my example:
In [27]: t = """
<div>
Hello
<div>
foobar
</div>
world!
</div>"""
In [28]: soup = BeautifulSoup(t)
In [29]: map(extract_text, soup('div'))
Out[29]: [u'Hello world!', u'foobar']
And your example:
In [32]: t = """
<div id = "div1">
Text1
<div id="div2">
Text2
<div id="div3">
Text3
</div>
</div>
</div>"""
In [33]: soup = BeautifulSoup(t)
In [34]: map(extract_text, soup('div'))
Out[34]: [u'Text1 ', u'Text2 ', u'Text3']
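The trailing spaces in Out[34] come from the whitespace-only text node that follows each nested div: it strips down to an empty string, and the join still puts a space in front of it. If you are on Python 3 with the bs4 package, something like the sketch below might work. It is not the function above but a variation on the same idea: it walks .children and keeps only NavigableString items instead of calling the node with recursive=False, and it skips empty fragments. The name extract_own_text and the choice of the "html.parser" backend are my own assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_own_text(node):
    """Join the text nodes that are direct children of node (hypothetical helper)."""
    # .children yields only immediate children; keep the plain text nodes
    # and ignore nested tags, so text inside child divs is not repeated.
    pieces = (
        child.strip()
        for child in node.children
        if isinstance(child, NavigableString)
    )
    # Drop whitespace-only fragments so the join adds no stray spaces.
    return " ".join(p for p in pieces if p)


html = """
<div id="div1">
Text1
<div id="div2">
Text2
<div id="div3">
Text3
</div>
</div>
</div>"""

# Assumption: the parser bundled with the standard library is good enough here.
soup = BeautifulSoup(html, "html.parser")
own_texts = [extract_own_text(div) for div in soup.find_all("div")]
print(own_texts)
```

On the nested example this should give each text exactly once, in document order, without the trailing spaces.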