I want to grab different content (classes) from a locally saved website (the Python documentation) using BeautifulSoup4, so I use the following code (index.html is a saved copy of https://docs.python.org/3/library/stdtypes.html):
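from bs4 import BeautifulSoup

# Parse the locally saved page; naming the parser avoids the "no parser specified" warning.
with open("index.html") as page:
    soup = BeautifulSoup(page, "html.parser")

classes = soup.find_all('dl', attrs={'class': ['class', 'method', 'function', 'describe', 'attribute', 'data', 'classmethod', 'staticmethod']})

# Mode 'w' already truncates the file, so no explicit truncate() call is needed.
with open('test.html', 'w') as f:
    print(classes, file=f)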
The file handle is only used for writing the results and has no effect on the problem itself.
My problem is that the results are nested. For example, the method "__eq__(exporter)" is found twice: once inside its class and once as a standalone method.
So I want to remove all results that sit inside other results, so that every result is on the same hierarchical level. How can I do this? Or is it even possible to "ignore" that content in the first step? I hope you understand what I mean.
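You cannot tell find_all() to ignore nested dl elements; all you can do is skip any match that appears in the .descendants of a match you have already collected. That leaves you with only the outermost elements.
If you want the nested elements and no parents, do the reverse: whenever a new match appears in the .descendants of an earlier match, drop the earlier one.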
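Both filters can be sketched roughly as follows. This is only an illustration, not a definitive implementation: SAMPLE_HTML is a made-up stand-in for the saved documentation page, and WANTED, contains, outermost and innermost are hypothetical names. It relies on find_all() returning matches in document order, so a parent is always seen before its children. The helper compares by identity because Tag equality compares content, so two different but identical-looking elements would otherwise be confused.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the saved documentation page.
SAMPLE_HTML = """
<dl class="class"><dt>memoryview</dt>
  <dd>
    <dl class="method"><dt>__eq__</dt><dd>compare</dd></dl>
    <dl class="method"><dt>tobytes</dt><dd>bytes</dd></dl>
  </dd>
</dl>
<dl class="function"><dt>len</dt><dd>length</dd></dl>
"""

WANTED = ["class", "method", "function", "describe", "attribute",
          "data", "classmethod", "staticmethod"]


def contains(ancestor, tag):
    """Return True if tag is one of ancestor's descendants (identity check)."""
    return any(d is tag for d in ancestor.descendants)


def outermost(soup):
    """Keep only matches that are not nested inside an earlier match."""
    matches = []
    for dl in soup.find_all("dl", attrs={"class": WANTED}):
        if any(contains(m, dl) for m in matches):
            continue  # child of an element we already collected
        matches.append(dl)
    return matches


def innermost(soup):
    """Keep only matches that do not contain another match."""
    matches = []
    for dl in soup.find_all("dl", attrs={"class": WANTED}):
        # Drop any earlier match that turns out to be a parent of this one.
        matches = [m for m in matches if not contains(m, dl)]
        matches.append(dl)
    return matches


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print([m.dt.get_text() for m in outermost(soup)])  # ['memoryview', 'len']
print([m.dt.get_text() for m in innermost(soup)])  # ['__eq__', 'tobytes', 'len']
```

Neither function changes the tree: the outermost matches still contain their nested dl elements when printed.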
If you want to pull the tree apart and actually remove elements from it, you can call extract() on each match, but you may want to adjust your text extraction instead.
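Pulling the tree apart could look something like the sketch below. It assumes extract() is the intended tool, and SAMPLE_HTML and WANTED are again made-up illustration names. Each match is detached from its parent in document order, so every result ends up on the same level and no result contains another.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the saved documentation page.
SAMPLE_HTML = """
<dl class="class"><dt>memoryview</dt>
  <dd>
    <dl class="method"><dt>__eq__</dt><dd>compare</dd></dl>
  </dd>
</dl>
<dl class="function"><dt>len</dt><dd>length</dd></dl>
"""

WANTED = ["class", "method", "function", "describe", "attribute",
          "data", "classmethod", "staticmethod"]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() builds its result list up front, so detaching elements
# while looping over that list is safe.
matches = soup.find_all("dl", attrs={"class": WANTED})
for element in matches:
    element.extract()  # remove the element from wherever it currently sits

for element in matches:
    # Each result is now standalone; nested dl elements are gone from parents.
    print(element.dt.get_text(), "| nested dl left:", element.find("dl") is not None)
```

Note that this modifies soup itself: after the loop, the matched dl elements are no longer part of the parsed document.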