I have this HTML:
<td class="0">
<b>Bold Text</b>
<a href=""></a>
</td>
<td class="0">
Regular Text
<a href=""></a>
</td>
Which, when queried with XPath:
new_html = tree.xpath('//td[@class="0"]/text() | //td[@class="0"]/b/text()')
Prints:
['Bold Text', '', 'Regular Text']
As you can see, the non-breaking space character hasn't been ignored and is actually read as an extra text entry in the td. How can I get cleaner output?
Instead, I'd iterate over all the desired td elements and get the .text_content() of each one.
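A minimal sketch of that approach, assuming lxml.html is the parser in use and that the stray character is an &nbsp; sitting after the <b> element. The sample markup and variable names are illustrative, not taken from the original page:

```python
import lxml.html

# Assumed sample markup: the question's cells, with an &nbsp; after the <b> tag.
html = """
<table><tr>
<td class="0">
<b>Bold Text</b>&nbsp;
<a href=""></a>
</td>
<td class="0">
Regular Text
<a href=""></a>
</td>
</tr></table>
"""

tree = lxml.html.fromstring(html)

# text_content() concatenates all descendant text of each cell;
# str.strip() then drops surrounding whitespace, including U+00A0.
texts = [td.text_content().strip() for td in tree.xpath('//td[@class="0"]')]
print(texts)  # ['Bold Text', 'Regular Text']
```

This yields one string per td, so the non-breaking space never shows up as a separate list entry.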
Note: I'm posting this not so much as an answer, but as an interesting thing (that I did not know) about XPath's normalize-space(). This might help other users.
It looks like normalize-space(), which I would have suggested here, does not remove 'NO-BREAK SPACE' (U+00A0).
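A short sketch of that behaviour, assuming lxml is available. The sample element and its text are made up for illustration:

```python
from lxml import etree

# Hypothetical sample: text padded with NO-BREAK SPACE (U+00A0) and regular spaces.
root = etree.fromstring('<td>\u00a0  Bold   Text \u00a0</td>')

# XPath 1.0 only treats space, tab, CR and LF as whitespace,
# so the U+00A0 characters at both ends survive normalize-space().
xpath_result = root.xpath('normalize-space(.)')

# Python's str.split() with no arguments splits on Unicode whitespace,
# which includes U+00A0.
python_result = ' '.join(root.text.split())

print(repr(xpath_result))   # '\xa0 Bold Text \xa0'
print(repr(python_result))  # 'Bold Text'
```

So if non-breaking spaces are in play, the cleanup is easier to do on the Python side than inside the XPath expression.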
Edit:
So I continued looking into whitespace characters and whether they are stripped by Python's strip() or XPath's normalize-space(). The following is a bit longer than I first wanted, but here's the whole script to test Unicode whitespace codepoints:
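A condensed sketch of such a test, using only the standard library. Note the assumptions: normalize-space() is emulated here from the XPath 1.0 definition of whitespace (space, tab, line feed, carriage return) rather than called through an XPath engine, and the codepoint list and helper names are hypothetical picks, not an exhaustive survey:

```python
import re
import unicodedata

# Whitespace as defined by XML / XPath 1.0: space, tab, line feed, carriage return.
XML_WS = re.compile(r'[ \t\n\r]+')


def xpath_normalize_space(s):
    """Emulate XPath 1.0 normalize-space() (illustrative approximation)."""
    return XML_WS.sub(' ', s).strip(' \t\n\r')


# Hypothetical selection of codepoints to probe.
CODEPOINTS = [0x0009, 0x000A, 0x000D, 0x0020, 0x00A0,
              0x2003, 0x2009, 0x200B, 0x202F, 0x3000]


def survey(codepoints=CODEPOINTS):
    """Return (code, name, stripped by strip(), stripped by normalize-space())."""
    rows = []
    for cp in codepoints:
        ch = chr(cp)
        sample = ch + 'x' + ch
        rows.append((
            'U+%04X' % cp,
            unicodedata.name(ch, '<control>'),  # control chars have no name
            sample.strip() == 'x',
            xpath_normalize_space(sample) == 'x',
        ))
    return rows


for code, name, py_strips, xpath_strips in survey():
    print(f'{code}  {name:<26} strip(): {py_strips!s:<5}  '
          f'normalize-space(): {xpath_strips}')
```

In this sketch, strip() removes every codepoint that str.isspace() reports as whitespace (so U+00A0 goes, but ZERO WIDTH SPACE, U+200B, stays), while the emulated normalize-space() only removes the four XML whitespace characters.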
Do strip() and normalize-space() strip these whitespace characters?
Whitespace chars: