I understand how to obtain the text from a specific div or span style from this question: How to find the most common span styles
Now the difficulty is finding all the span styles with font sizes larger than the most common one.
I suspect I should use regular expressions, but do I first need to extract the most common font size?
Also, how do you determine "larger than" when the value being compared is a string?
This may help you:
from collections import Counter
import re

from bs4 import BeautifulSoup

# `html` is assumed to already hold the page markup as a string.
soup = BeautifulSoup(html, "html.parser")

used_font_sizes = []  # every font size found, as an integer number of pixels

# Find all the spans that carry a style attribute
for span in soup.find_all('span', style=True):
    match = re.search(r"font-size:\s*(\d+)px", span['style'])
    if match:  # skip spans whose style has no pixel font size
        used_font_sizes.append(int(match.group(1)))

# Count how often each font size is used
count = Counter(used_font_sizes)

# Print every font size with its number of occurrences, most common first
print(count.most_common())
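Building on those counts, the "larger than" comparison is straightforward once the sizes are integers rather than strings. A minimal sketch, assuming every size is given in px; the `styles` list and the `font_size_px` helper are hypothetical stand-ins for the span['style'] values collected above:

```python
import re
from collections import Counter

def font_size_px(style):
    """Return the pixel font size in a style string as an int, or None."""
    match = re.search(r"font-size:\s*(\d+)px", style)
    return int(match.group(1)) if match else None

# Hypothetical sample data standing in for the span['style'] values
styles = [
    "font-family: ArialMT; font-size:12px",
    "font-family: ArialMT; font-size:14px",
    "font-family: ArialMT; font-size:1px",
    "font-family: ArialMT; font-size:12px",
    "font-family: ArialMT; font-size:18px",
]

sizes = [size for size in map(font_size_px, styles) if size is not None]
most_common_size = Counter(sizes).most_common(1)[0][0]

# Compare as integers; comparing the raw strings would be lexicographic
larger = [s for s in styles if (font_size_px(s) or 0) > most_common_size]
print(most_common_size, larger)
```

Styles without a pixel font size are treated as size 0 here, so they never count as larger.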
To find all the span styles with font sizes larger than the most common span style using BeautifulSoup, you need to parse each CSS style that has been returned.
Parsing CSS is better done using a library such as cssutils. This would then let you access the fontSize attribute directly.
This would have a value such as 12px, which does not naturally sort correctly. To get around this, you could use a library such as natsort.
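The sorting problem is easy to reproduce with the standard library alone. In this sketch the sample values are made up, and the `px_value` key is a hypothetical stand-in for the numeric ordering that natsort provides, not its implementation:

```python
sizes = ["12px", "14px", "1px", "18px", "15px"]

# A plain string sort compares character by character, so "1px" lands last
string_sorted = sorted(sizes)
print(string_sorted)

def px_value(size):
    """Numeric part of a value such as "12px" (assumes a px suffix)."""
    return int(size[:-2])

# Sorting on the numeric part gives the intended order
numeric_sorted = sorted(sizes, key=px_value)
print(numeric_sorted)
```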
So, first parse each of the styles into CSS objects. At the same time, keep a list of the soup for each span along with the parsed CSS for its style.
Now use the fontSize attribute as the key for sorting with natsort. This would give you a correctly sorted list of styles according to their font size, largest first (by using reverse=True). takewhile() is then used to take entries from the sorted list up to the point where the size matches the most common one, resulting in a list of entries larger than the most common one.
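The takewhile() step can be seen in isolation with plain integers. A minimal sketch, where `sizes_desc` and `most_common` are hypothetical values standing in for the sorted styles and the most common font size:

```python
from itertools import takewhile

# Hypothetical font sizes in px, already sorted largest first
sizes_desc = [18, 15, 14, 12, 12, 12, 1]
most_common = 12

# takewhile() stops at the first element that fails the test,
# so only the sizes ahead of the first 12 are kept
larger = list(takewhile(lambda size: size != most_common, sizes_desc))
print(larger)
```

This relies on the list being sorted in descending order; on an unsorted list, takewhile() would stop at the first match and could miss larger entries that come after it.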
from bs4 import BeautifulSoup
from collections import Counter
from itertools import takewhile
import cssutils
import natsort

html = """
<span style="font-family: ArialMT; font-size:12px">1</span>
<span style="font-family: ArialMT; font-size:14px">2</span>
<span style="font-family: ArialMT; font-size:1px">3</span>
<span style="font-family: Arial; font-size:12px">4</span>
<span style="font-family: ArialMT; font-size:18px">5</span>
<span style="font-family: ArialMT; font-size:15px">6</span>
<span style="font-family: ArialMT; font-size:12px">7</span>
"""

soup = BeautifulSoup(html, "html.parser")
style_counts = Counter()
parsed_css_style = []    # Holds list of tuples (css_style, span)

# Count each raw style string and keep its parsed CSS next to the span
for span in soup.find_all('span', style=True):
    style_counts[span['style']] += 1
    parsed_css_style.append((cssutils.parseStyle(span['style']), span))

most_common_style = style_counts.most_common(1)[0][0]
most_common_css_style = cssutils.parseStyle(most_common_style)

# Sort by font size, largest first
css_styles = natsort.natsorted(parsed_css_style, key=lambda x: x[0].fontSize, reverse=True)

print("Styles larger than most common font size of {} are:".format(most_common_css_style.fontSize))

# Take entries until the most common font size is reached
for css_style, span in takewhile(lambda x: x[0].fontSize != most_common_css_style.fontSize, css_styles):
    print(" Font size: {:5} Text: {}".format(css_style.fontSize, span.text))
In the example shown, the most commonly used font size is 12px, so there are 3 other entries larger than this, as follows:
Styles larger than most common font size of 12px are:
Font size: 18px Text: 5
Font size: 15px Text: 6
Font size: 14px Text: 2
To install you will probably need:
pip install natsort
pip install cssutils
Note that this assumes the font sizes on your website all use the same unit. It compares only the numerical values and cannot compare different font metrics.
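One way to make that limitation explicit is to split each value into a number and a unit, and to refuse to compare across units. A minimal sketch; `split_size` and `is_larger` are hypothetical helpers, not part of cssutils or natsort:

```python
import re

def split_size(value):
    """Split a CSS length such as "12px" or "1.5em" into (number, unit)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z%]+)\s*", value)
    if match is None:
        raise ValueError("unrecognised size: {!r}".format(value))
    return float(match.group(1)), match.group(2)

def is_larger(a, b):
    """True if a is larger than b; only sizes sharing a unit are comparable."""
    (a_num, a_unit), (b_num, b_unit) = split_size(a), split_size(b)
    if a_unit != b_unit:
        raise ValueError("cannot compare {} with {}".format(a_unit, b_unit))
    return a_num > b_num
```

With this, is_larger("14px", "12px") is True, while comparing "2em" with "12px" raises a ValueError instead of silently comparing 2 with 12.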