BeautifulSoup counting tags without parsing deep i

Posted 2019-02-20 06:15

Question:

I thought about the following while writing an answer to this question.

Suppose I have a deeply nested XML file like this (but much more nested and much longer):

<section name="1">
    <subsection name="foo">
        <subsubsection name="bar">
            <deeper name="hey">
                <much_deeper name="yo">
                    <li>Some content</li>
                </much_deeper>
            </deeper>
        </subsubsection>
    </subsection>
</section>
<section name="2">
    ... and so forth
</section>

The problem with len(soup.find_all("section")) is that find_all("section") keeps descending into every section it has already matched, even though I know that tag won't contain another section tag.

So, two questions:

  1. Is there a way to make BS not search recursively into an already found tag?
  2. If the answer to 1 is yes, will it be more efficient or is it the same internal process?

Answer 1:

BeautifulSoup cannot return just a count of the tags it found; find_all always builds the list of matches.

What you can improve, though, is to stop BeautifulSoup from searching for sections inside other sections by passing recursive=False, which restricts the search to the direct children of the object you call it on:

len(soup.find_all("section", recursive=False))
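To see the difference, here is a minimal sketch. The sample markup is made up for illustration (it adds a nested section that the question's file would not have, just so the two counts differ), and it assumes the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two top-level sections, the first one containing
# a nested <section> so that the two searches return different counts.
markup = """
<section name="1">
    <subsection name="foo">
        <section name="nested">
            <li>Some content</li>
        </section>
    </subsection>
</section>
<section name="2">
    <li>More content</li>
</section>
"""

soup = BeautifulSoup(markup, "html.parser")

# Default behaviour (recursive=True): every descendant is visited.
all_sections = soup.find_all("section")

# recursive=False: only the direct children of `soup` are checked.
top_level_sections = soup.find_all("section", recursive=False)

print(len(all_sections))        # 3
print(len(top_level_sections))  # 2
```

Keep in mind that the whole document has already been parsed into a tree by the time find_all runs, so recursive=False only shortens the search, not the parsing.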

Aside from that improvement, lxml would do the job faster:

tree.xpath('count(//section)')
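A self-contained version of that call might look like the sketch below. The sample data and the root wrapper element are assumptions added for illustration, since an XML document needs a single root element. Note that //section still matches at any depth; the second expression, which counts only top-level sections, is my own addition and not part of the original answer.

```python
from lxml import etree

# Hypothetical sample wrapped in a <root> element (an assumption: the
# snippet in the question has several top-level tags, which is not
# well-formed XML on its own).
xml = b"""
<root>
    <section name="1">
        <subsection name="foo">
            <section name="nested">
                <li>Some content</li>
            </section>
        </subsection>
    </section>
    <section name="2">
        <li>More content</li>
    </section>
</root>
"""

tree = etree.fromstring(xml)

# count(//section) counts <section> elements at any depth.
total = tree.xpath("count(//section)")

# count(/root/section) counts only the direct children of <root>.
top_level = tree.xpath("count(/root/section)")

# XPath's count() yields a number, which lxml returns as a float.
print(int(total), int(top_level))  # 3 2
```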