Find all the span styles with font size larger than the most common one

Posted 2019-03-04 17:01

I understand how to obtain the text from a specific div or span style from this question: How to find the most common span styles

Now the difficulty is finding all the span styles with font sizes larger than the most common one.

I suspect I should use regular expressions, but do I first need to extract the most common font size?

Also, how do you determine "larger than" when the value being compared is a string?

2 answers
叼着烟拽天下
#2 · 2019-03-04 17:52

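This may help you:

    from bs4 import BeautifulSoup
    from collections import Counter
    import re

    # `html` is assumed to hold the page source as a string
    soup = BeautifulSoup(html, "html.parser")

    usedFontSize = []  # list of every font size (in px) that is used

    # Find all the spans that carry a style attribute
    spans = soup.find_all('span', style=True)
    for span in spans:
        styleTag = span['style']
        # Allow optional whitespace after the colon, e.g. "font-size: 12px"
        fontSize = re.findall(r"font-size:\s*(\d+)px", styleTag)
        if fontSize:  # skip styles that set no pixel font size
            usedFontSize.append(int(fontSize[0]))

    # Count how often each font size is used
    count = Counter(usedFontSize)
    # Print every font size with its number of occurrences, most common first
    print(count.most_common())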
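To get from these counts to the styles that are actually larger than the most common size, one option is to compare the sizes as integers instead of strings. The following is only a minimal sketch under some assumptions: the `styles` list is made-up sample data standing in for the `span['style']` values, the helper `font_size_px` is a hypothetical name, and only sizes given in px are handled.

```python
import re
from collections import Counter

# Hypothetical style strings, as they would come from span['style']
styles = [
    "font-family: ArialMT; font-size:12px",
    "font-family: ArialMT; font-size:14px",
    "font-family: ArialMT; font-size:1px",
    "font-family: Arial; font-size:12px",
    "font-family: ArialMT; font-size:18px",
]

def font_size_px(style):
    """Return the pixel font size found in a style string as an int, or None."""
    match = re.search(r"font-size:\s*(\d+)px", style)
    return int(match.group(1)) if match else None

# Collect the numeric sizes, ignoring styles without a pixel font size
sizes = [size for size in map(font_size_px, styles) if size is not None]
most_common_size = Counter(sizes).most_common(1)[0][0]

# Integer comparison, so 14 > 12 and 1 < 12 behave as expected
larger_styles = [s for s in styles
                 if (font_size_px(s) or 0) > most_common_size]

print(most_common_size)
print(larger_styles)
```

Because the sizes are converted with `int()` before comparing, the "larger than" test no longer depends on string ordering.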
Melony?
#3 · 2019-03-04 18:06

To find all the span styles with font sizes larger than the most common span style using BeautifulSoup, you need to parse each CSS style that has been returned.

Parsing CSS is better done using a library such as cssutils. This would then let you access the fontSize attribute directly.

This would have a value such as 12px, which does not sort correctly as a plain string (for example, "2px" would sort after "12px"). To get around this, you could use a library such as natsort.
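The sorting problem can be seen with the standard library alone. This small sketch uses made-up sizes and a hypothetical `numeric_key` helper as a stand-in for what a natural sort does; it assumes every value starts with an integer.

```python
import re

sizes = ["12px", "14px", "1px", "18px", "2px"]

# A plain string sort compares character by character, so "18px" lands before "1px"
string_sorted = sorted(sizes)

def numeric_key(size):
    """Return the leading integer of a size such as '12px'."""
    return int(re.match(r"\d+", size).group())

# Sorting on the numeric part gives the order a reader would expect
numerically_sorted = sorted(sizes, key=numeric_key)

print(string_sorted)
print(numerically_sorted)
```

A natural-sort library such as natsort should produce the second ordering without needing a hand-written key.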

So, first parse each of the styles into CSS objects. At the same time, keep a list that pairs each span's soup element with the parsed CSS for its style.

Now use the fontSize attribute as the key for sorting with natsort. This gives a list of styles correctly sorted by font size, largest first (by using reverse=True). takewhile() then collects every entry up to the point where the size matches the most common one, which leaves only the entries larger than the most common one.

from bs4 import BeautifulSoup
from collections import Counter
from itertools import takewhile    
import cssutils
import natsort

html = """
    <span style="font-family: ArialMT; font-size:12px">1</span>
    <span style="font-family: ArialMT; font-size:14px">2</span>
    <span style="font-family: ArialMT; font-size:1px">3</span>
    <span style="font-family: Arial; font-size:12px">4</span>
    <span style="font-family: ArialMT; font-size:18px">5</span>
    <span style="font-family: ArialMT; font-size:15px">6</span>
    <span style="font-family: ArialMT; font-size:12px">7</span>
    """

soup = BeautifulSoup(html, "html.parser")    
style_counts = Counter()
parsed_css_style = []       # Holds list of tuples (css_style, span)

for span in soup.find_all('span', style=True):
    style_counts[span['style']] += 1
    parsed_css_style.append((cssutils.parseStyle(span['style']), span))

most_common_style = style_counts.most_common(1)[0][0]
most_common_css_style = cssutils.parseStyle(most_common_style)
css_styles = natsort.natsorted(parsed_css_style, key=lambda x: x[0].fontSize, reverse=True)

print("Styles larger than most common font size of {} are:".format(most_common_css_style.fontSize))

# The list is sorted largest first, so stop at the first entry matching the most common size
for css_style, span in takewhile(lambda x: x[0].fontSize != most_common_css_style.fontSize, css_styles):
    print("  Font size: {:5}  Text: {}".format(css_style.fontSize, span.text))

In the example shown, the most commonly used font size is 12px, so there are 3 other entries larger than this as follows:

Styles larger than most common font size of 12px are:
  Font size: 18px   Text: 5
  Font size: 15px   Text: 6
  Font size: 14px   Text: 2

To install you will probably need:

pip install natsort
pip install cssutils    

Note that this assumes the font sizes on your website are specified consistently. It compares only the numerical value and cannot compare sizes expressed in different units (for example px against em).
