How to split the tags from an HTML tree

Published 2020-08-02 07:07

Question:

This is my html tree

 <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>

From this HTML I need to extract the lines before each <br /> tag:

line1 : Get the IndianOil Citibank Card. Apply Now!

line2 : Get 10X Rewards On Shopping - Save Over 5% On Fuel

How should this be done in Python?

Answer 1:

I think you just asked for the line before each <br/>.

The following code will do it for the sample you've provided, by stripping out the <b> and <a> tags and printing the .tail of each element whose following sibling is a <br/>.

from lxml import etree

doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")

etree.strip_tags(doc,'a','b')

for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))

Yields:

'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n   Save Over 5% On Fuel'
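
If you also want the newline and indentation inside the second result collapsed so it reads exactly like line2 in the question, a small whitespace normalization on each .tail does it. This is just a minimal follow-up sketch of that cleanup step, not part of the original answer:

for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    # collapse newlines and runs of spaces into single spaces
    print(' '.join(element.tail.split()))

which prints the two lines asked for in the question.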


Answer 2:

As with all parsing of HTML, you need to make some assumptions about its format. If we can assume that the previous line is everything before the <br> tag up to a block-level tag or another <br>, then we can do the following...

from BeautifulSoup import BeautifulSoup

doc = """
   <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
    </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
    <br />
    <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
    <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
    <br />
    <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""

soup = BeautifulSoup(doc)

Now that we have parsed the HTML, we define the list of tags we don't want to treat as part of the line. There are other block-level tags, of course, but this list will do for this HTML.

block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]

We cycle through each <br> tag, stepping back through its siblings until we either run out of them or hit a block-level tag. On each iteration we add the node to the front of our line. NavigableStrings don't have name attributes, but we want to include them, hence the two-part test in the while loop.

for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    print line
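
On Python 3 the same sibling walk can be written against BeautifulSoup 4 (bs4) roughly as follows. This is only a sketch assuming bs4 is installed; it uses an isinstance check on Tag instead of hasattr to tell tags apart from NavigableStrings, but otherwise mirrors the loop above:

from bs4 import BeautifulSoup, Tag

block_tags = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"}

# `doc` is the same HTML string defined above
soup = BeautifulSoup(doc, "html.parser")

for node in soup.find_all("br"):
    line = ""
    sibling = node.previous_sibling
    # step back through the siblings until none are left or a block-level tag is hit
    while sibling is not None and not (isinstance(sibling, Tag) and sibling.name in block_tags):
        line = str(sibling) + line
        sibling = sibling.previous_sibling
    print(line)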


Answer 3:

A solution without relying on <br> tags:

import lxml.html

html = "..."
tree = lxml.html.fromstring(html)
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
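
This works because lxml.html.fromstring returns the <li> element itself for a single-element fragment, so the relative b/text() step in the first expression resolves against that <li>. Assuming html holds the snippet from the question, the two variables come out as the lines asked for:

print(line1)  # Get the IndianOil Citibank Card. Apply Now!
print(line2)  # Get 10X Rewards On Shopping - Save Over 5% On Fuel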


Answer 4:

I don't know whether you want to use lxml or BeautifulSoup, but for lxml using XPath here is an example:

import lxml
from lxml import etree
import urllib2

response = urllib2.urlopen('your url here')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('/html/body/li/a/text()')  # XPath for the "line 2" data (use Firebug)

The XPath I used is for your given HTML snippet; it may change in the original context.

You can also give cssselect in lxml a try.

import lxml.html
import urllib
data = urllib.urlopen('your url').read() 
doc = lxml.html.fromstring(data)
elements = doc.cssselect('your css path here')  # CSS path (using the Firebug extension)
for element in elements:
    print(element.text_content())
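
For the snippet in the question, a selector such as li.taf a would pull out the link texts. This is only an illustrative guess at a suitable CSS path, not something from the original answer, and doc below is assumed to be built from that snippet; note that lxml's cssselect method also needs the cssselect package installed:

import lxml.html

# `html` is assumed to hold the <li class="taf"> snippet from the question
doc = lxml.html.fromstring(html)
for element in doc.cssselect('li.taf a'):  # every link inside the li, including the title link
    print(element.text_content())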