Unable to write extracted items properly in an exc

Posted 2019-02-19 06:27

Question:

I've written a Python script to parse titles and links from a webpage. It first collects the links from the left sidebar, then follows each link and scrapes the entries from that page. This part works. I want to save the entries from all the pages in a single Excel file. The script creates one sheet per page, taking the sheet name from part of the heading variable. The problem is that when the data is saved, each sheet contains only the last record of its page instead of all the records. Here is the script I tried:

import requests
from lxml import html
from pyexcel_ods3 import save_data

web_link = "http://www.wiseowl.co.uk/videos/"
main_url = "http://www.wiseowl.co.uk"

def get_links(page):

    response = requests.Session().get(page)
    tree = html.fromstring(response.text)
    data = {}
    titles = tree.xpath("//ul[@class='woMenuList']//li[@class='woMenuItem']/a/@href")
    for title in titles:
        if "author" not in title and "year" not in title:
            get_docs(data, main_url + title)

def get_docs(data, url):

    response = requests.Session().get(url)
    tree = html.fromstring(response.text)

    heading = tree.findtext('.//h1[@class="gamma"]')

    for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = item.findtext('.//a')
        link = item.xpath('.//a/@href')[0]
        # print(title, link)
        data.update({heading.split(" ")[-4]: [[(title)]]})
    save_data("mth.ods", data)

if __name__ == '__main__':
    get_links(web_link)

Answer 1:

When you call data.update with a key that already exists in the data dict, the previous value is replaced, so each loop iteration overwrites the row stored by the one before it.

To fix this, replace this line:

data.update({heading.split(" ")[-4]: [[(title)]]})

With this (a bit ugly, but it works):

data[heading.split(" ")[-4]] = data.get(heading.split(" ")[-4], []) + [[(title)]]
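
The difference between overwriting and accumulating can be seen in isolation with a minimal sketch. The sheet name and titles below are made-up sample values, not data from the site, and dict.setdefault is used here as one possible alternative to the data.get(...) + [...] form above:

```python
# Hypothetical sample values, for illustration only.
titles = ["Video A", "Video B", "Video C"]
sheetname = "Excel"

# dict.update replaces the whole value for the key on every iteration,
# so only the row from the last iteration survives.
overwritten = {}
for title in titles:
    overwritten.update({sheetname: [[title]]})

# setdefault creates the list of rows once, then each row is appended to it.
accumulated = {}
for title in titles:
    accumulated.setdefault(sheetname, []).append([title])

print(overwritten)
print(accumulated)
```

The first dict ends up with a single row, while the second keeps one row per title.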


Answer 2:

Alternatively, a more readable version:

def get_docs(data, url):

    response = requests.Session().get(url)
    tree = html.fromstring(response.text)

    heading = tree.findtext('.//h1[@class="gamma"]')

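    for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = item.findtext('.//a')
        # The sheet name is taken from the page heading.
        sheetname = heading.split(" ")[-4]
        # Append a new row if the sheet already exists; otherwise create it.
        if sheetname in data:
            data[sheetname].append([title])
        else:
            data[sheetname] = [[title]]
    # Write every sheet collected so far to the file.
    save_data("mth.ods", data)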

Edit: To put the link in the next column, add it to the row list. This assumes link is extracted inside the loop as in the question's script (link = item.xpath('.//a/@href')[0]):

if sheetname in data:  
    data[sheetname].append([title, str(link)])  
else:  
    data[sheetname] = [[title, str(link)]]

Edit 2: To put everything on the same sheet, append all rows to the same key. In the dict passed to save_data, each key is a sheet name and each value is that sheet's list of rows, where each row is a list of cell values. Like this:

sheetname = 'You are welcome'
for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
    title = item.findtext('.//a')
    if sheetname in data:
        data[sheetname].append([title])
    else:
        data[sheetname] = [[title]]
save_data("mth.ods", data)
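
The two layouts discussed above (one sheet per page versus a single shared sheet) can be sketched without any network access. The helper build_sheets, its single_sheet parameter, and the sample records are hypothetical names and values introduced only for this illustration; they are not part of the original script or of pyexcel_ods3:

```python
from collections import defaultdict


def build_sheets(records, single_sheet=None):
    """Group (sheetname, title, link) records into {sheet name: [rows]}.

    If single_sheet is given, every row goes under that one key instead.
    """
    data = defaultdict(list)
    for sheetname, title, link in records:
        key = sheetname if single_sheet is None else single_sheet
        # Each row is a list of cell values: title in column 1, link in column 2.
        data[key].append([title, str(link)])
    return dict(data)


# Hypothetical records standing in for the scraped results.
records = [
    ("Excel", "Video A", "/videos/a.htm"),
    ("Excel", "Video B", "/videos/b.htm"),
    ("SQL", "Video C", "/videos/c.htm"),
]

per_page = build_sheets(records)
combined = build_sheets(records, single_sheet="All videos")

print(per_page)
print(combined)
```

Either resulting dict has the shape that save_data("mth.ods", data) is given in the answers above, so it could presumably be written once after all pages have been scraped rather than once per page.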