Downloading articles from multiple urls with newsp

Posted 2019-05-30 18:26

Question:

I've been trying to extract multiple articles from a website (Zeit Online, a German newspaper). I already have a list of the URLs I want to download articles from, so I do not need to crawl the site for URLs.

The newspaper package for Python does an awesome job of parsing the content of a single page. What I need is a way to change the URL automatically until all the articles are downloaded. Unfortunately I have limited coding knowledge and haven't found a way to do that. I'd be very grateful if anyone could help me.

One of the things I tried was the following:

import newspaper
from newspaper import Article

lista = ['url','url']


for list in lista:

 first_article = Article(url="%s", language='de') % list

 first_article.download()

 first_article.parse()

 print(first_article.text)

It returned the following error: unsupported operand type(s) for %: 'Article' and 'str'

This seems to do the job, although I'd expect there to be an easier way involving fewer apples and bananas.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import newspaper
from newspaper import Article

lista = ['http://www.zeit.de/1946/01/unsere-aufgabe', 'http://www.zeit.de/1946/04/amerika-baut-auf', 'http://www.zeit.de/1946/04/bedingung', 'http://www.zeit.de/1946/04/bodenrecht']

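apple = 0

# apple is the current index into lista; stop once every url has been used
while apple < len(lista):

    # look up the url inside the loop so the index never runs past the end
    banana = lista[apple]

    first_article = Article(url=banana, language='de')

    first_article.download()

    first_article.parse()

    # replace characters that a cp850 console cannot display
    print(first_article.text.encode('cp850', errors='replace').decode('cp850'))

    apple += 1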

Answer 1:

You get the exception

unsupported operand type(s) for %: 'Article' and 'str'

because the % operator is applied to the object that Article(...) returns and not to the "%s" string. The formatting has to happen inside the parentheses, so on line 9 you should have:

first_article = Article(url="%s" % list, language='de')

and here's the full code:

import newspaper
from newspaper import Article

lista = ['url','url']


for list in lista:

   first_article = Article(url="%s" % list, language='de')

   first_article.download()

   first_article.parse()

   print(first_article.text)
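
Since each entry in the list already is a complete url string, the "%s" formatting step is not strictly needed and the loop variable can be passed to Article directly. Below is a minimal sketch of that idea, not a definitive implementation. The helper names fetch_text and download_all are made up for this example; the only newspaper calls used are the ones from the question (Article, download, parse, text). It also assumes that a url which fails to download should be skipped rather than stop the whole run.

```python
def fetch_text(url, language="de"):
    """Download and parse a single article, returning its plain text."""
    # Imported here so the loop below can be tried out without newspaper installed.
    from newspaper import Article

    article = Article(url=url, language=language)
    article.download()
    article.parse()
    return article.text


def download_all(urls, fetch=fetch_text):
    """Return a dict mapping each url to its text, or to None if it failed."""
    texts = {}
    for url in urls:  # iterate over the urls directly, no index counter needed
        try:
            texts[url] = fetch(url)
        except Exception as exc:  # assumption: skip a failing url and carry on
            print("skipping %s: %s" % (url, exc))
            texts[url] = None
    return texts
```

With the list from the question this would be called as texts = download_all(lista), after which texts[lista[0]] holds the text of the first article. The fetch parameter is only there so that the loop can be tested with a stand-in function instead of a real download.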