Unable to get the full content using selector

Posted 2019-08-16 04:55

Question:

I've written some CSS selectors in Python to get some items and their values. I'm using the selectors to scrape the items, not to style them. However, when I run my script, it only gets the item labels and can't reach the values, which are separated from the labels by a "br" tag. How can I grab them? I do not wish to use XPath in this particular case. Thanks in advance.

Here are the elements:

html = '''
<div class="elems"><br>
    <ul>
    <li><b>Item Name:</b><br>
            titan
                </li>
        <li><b>Item No:</b><br>
                23003400
                    </li>
        <li><b>Item Sl:</b><br>
            2760400
                </li>
        </ul>
    </div>
'''

Here is my script with css selectors in it:

from lxml import html as e

root = e.fromstring(html)
for items in root.cssselect(".elems li"):
    item = items.cssselect("b")[0].text_content()
    print(item)

Upon execution, this is the result I get:

Item Name:
Item No:
Item Sl:

The result I'm after:

Item Name: titan
Item No: 23003400
Item Sl: 2760400

Answer 1:

I generally use the .itertext() method to extract text:

from lxml.html import fromstring

def extract_text(el, sep=' '):
    # Join every non-empty text chunk under el, including text that follows <br>
    return sep.join(s.strip() for s in el.itertext() if s.strip())

tree = fromstring(html)
for li in tree.cssselect('.elems li'):
    print(extract_text(li))


Answer 2:

The easiest solution: the values are inside the "li" tag, not the "b" tag, so read the text of each "li" instead.

from lxml import html as e

root = e.fromstring(html)
for li in root.cssselect(".elems li"):
    # text_content() keeps the source whitespace and newlines,
    # so collapse them into single spaces before printing
    print(' '.join(li.text_content().split()))
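
As a side note on why selecting "b" alone misses the values: in lxml's tree model, the text that comes after a "br" is stored as the .tail of that "br" element, not as part of "b". A rough sketch that reads the label and the value separately is below. SAMPLE is a tidied copy of the question's markup and label_value_pairs is a hypothetical helper name. It uses iter() and find() in place of cssselect only to keep the example free of the extra cssselect dependency.

```python
from lxml import html as e

# Tidied copy of the markup from the question (assumed equivalent structure).
SAMPLE = '''
<div class="elems"><br>
  <ul>
    <li><b>Item Name:</b><br>
        titan
    </li>
    <li><b>Item No:</b><br>
        23003400
    </li>
    <li><b>Item Sl:</b><br>
        2760400
    </li>
  </ul>
</div>
'''


def label_value_pairs(markup):
    """Return (label, value) tuples, one per <li> that has both <b> and <br>."""
    root = e.fromstring(markup)
    pairs = []
    for li in root.iter("li"):
        label = li.find("b")   # first <b> child of this <li>
        br = li.find("br")     # first <br> child of this <li>
        if label is None or br is None:
            continue           # skip items that do not follow the pattern
        # The value is the text that follows <br>, i.e. its .tail
        pairs.append((label.text_content().strip(), (br.tail or "").strip()))
    return pairs


for label, value in label_value_pairs(SAMPLE):
    print(label, value)
# Item Name: titan
# Item No: 23003400
# Item Sl: 2760400
```

Keeping label and value apart like this is handy if you want to build a dict or write CSV rows instead of printing joined strings.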