How to split the tags from an HTML tree

Published 2020-08-02 07:07

Question:

This is my html tree

 <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>

From this HTML I need to extract the lines before each <br /> tag:

line1 : Get the IndianOil Citibank Card. Apply Now!

line2 : Get 10X Rewards On Shopping - Save Over 5% On Fuel

How should this be done in Python?

Answer 1:

I think you just asked for the line before each <br/>.

The following code will do it for the sample you've provided, by stripping out the <b> and <a> tags and printing the .tail of each element whose following sibling is a <br/>.

from lxml import etree

doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")

etree.strip_tags(doc,'a','b')

for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))

Yields:

'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n   Save Over 5% On Fuel'
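
If you also want the newline and indentation inside the second result collapsed so it reads exactly like line2 in the question, a small whitespace normalization on each .tail does it. This is just a minimal follow-up sketch of that cleanup step, not part of the original answer:

for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    # collapse newlines and runs of spaces into single spaces
    print(' '.join(element.tail.split()))

which prints the two lines asked for in the question.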


Answer 2:

As with all parsing of HTML, you need to make some assumptions about its format. If we can assume that the previous line is everything before the <br> tag up to a block-level tag or another <br>, then we can do the following...

from BeautifulSoup import BeautifulSoup

doc = """
   <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
    </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
    <br />
    <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
    <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
    <br />
    <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""

soup = BeautifulSoup(doc)

Now that we have parsed the HTML, we define the list of tags we don't want to treat as part of the line. There are other block-level tags, of course, but this list will do for this HTML.

block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]

We cycle through each <br> tag, stepping back through its siblings until we either run out of them or hit a block-level tag. On each iteration we add the node to the front of our line. NavigableStrings don't have name attributes, but we want to include them, hence the two-part test in the while loop.

for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    print line
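
On Python 3 the same sibling walk can be written against BeautifulSoup 4 (bs4) roughly as follows. This is only a sketch assuming bs4 is installed; it uses an isinstance check on Tag instead of hasattr to tell tags apart from NavigableStrings, but otherwise mirrors the loop above:

from bs4 import BeautifulSoup, Tag

block_tags = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"}

# `doc` is the same HTML string defined above
soup = BeautifulSoup(doc, "html.parser")

for node in soup.find_all("br"):
    line = ""
    sibling = node.previous_sibling
    # step back through the siblings until none are left or a block-level tag is hit
    while sibling is not None and not (isinstance(sibling, Tag) and sibling.name in block_tags):
        line = str(sibling) + line
        sibling = sibling.previous_sibling
    print(line)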


Answer 3:

A solution without relying on <br> tags:

import lxml.html

html = "..."
tree = lxml.html.fromstring(html)
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
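
This works because lxml.html.fromstring returns the <li> element itself for a single-element fragment, so the relative b/text() step in the first expression resolves against that <li>. Assuming html holds the snippet from the question, the two variables come out as the lines asked for:

print(line1)  # Get the IndianOil Citibank Card. Apply Now!
print(line2)  # Get 10X Rewards On Shopping - Save Over 5% On Fuel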


Answer 4:

I don't know whether you want to use lxml or BeautifulSoup, but for lxml using XPath here is an example:

import lxml
from lxml import etree
import urllib2

response = urllib2.urlopen('your url here')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('/html/body/li/a/text()')  # XPath for the "line 2" data (use Firebug)

The XPath I used is for your given HTML snippet; it may change in the original context.

You can also give cssselect in lxml a try.

import lxml.html
import urllib
data = urllib.urlopen('your url').read() 
doc = lxml.html.fromstring(data)
elements = doc.cssselect('your css path here')  # CSS path (using the Firebug extension)
for element in elements:
    print(element.text_content())
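
For the snippet in the question, a selector such as li.taf a would pull out the link texts. This is only an illustrative guess at a suitable CSS path, not something from the original answer, and doc below is assumed to be built from that snippet; note that lxml's cssselect method also needs the cssselect package installed:

import lxml.html

# `html` is assumed to hold the <li class="taf"> snippet from the question
doc = lxml.html.fromstring(html)
for element in doc.cssselect('li.taf a'):  # every link inside the li, including the title link
    print(element.text_content())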