I've been trying to extract multiple articles from a webpage (Zeit Online, a German newspaper). I already have a list of the URLs I want to download articles from, so I do not need to crawl the page for them.
The newspaper package for Python does an awesome job of parsing the content of a single page. What I would need is to automatically move on to the next URL until all the articles are downloaded. I unfortunately have limited coding knowledge and haven't found a way to do that. I'd be very grateful if anyone could help me.
One of the things I tried was the following:
import newspaper
from newspaper import Article

lista = ['url', 'url']
for list in lista:
    first_article = Article(url="%s", language='de') % list
    first_article.download()
    first_article.parse()
    print(first_article.text)
It returned the following error: TypeError: unsupported operand type(s) for %: 'Article' and 'str'
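Looking at it now, the % is applied to the Article object returned by Article(...) rather than to the "%s" string, which is why Python complains about the operand types. Formatting the string before handing it to Article, or skipping the formatting and passing the loop variable directly, should avoid the error, e.g.:

first_article = Article(url="%s" % list, language='de')  # apply % to the string, not to the Article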
This seems to do the job, although I'd expect there to be an easier way involving fewer apples and bananas (a simpler sketch follows the code below).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import newspaper
from newspaper import Article

lista = ['http://www.zeit.de/1946/01/unsere-aufgabe',
         'http://www.zeit.de/1946/04/amerika-baut-auf',
         'http://www.zeit.de/1946/04/bedingung',
         'http://www.zeit.de/1946/04/bodenrecht']

apple = 0
while apple < len(lista):  # stop after the last index instead of hard-coding 4
    banana = lista[apple]  # look up the URL inside the loop, so the last pass cannot overrun the list
    first_article = Article(url=banana, language='de')
    first_article.download()
    first_article.parse()
    # encode for a cp850 console, replacing characters it cannot display
    print(first_article.text.encode('cp850', errors='replace'))
    apple += 1
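For anyone after the easier way: a plain for loop walks the list directly, so the counter, the bounds check, and the apple/banana bookkeeping all disappear. A minimal sketch of the same download-parse-print cycle (the variable names urls and article are my own):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from newspaper import Article

urls = ['http://www.zeit.de/1946/01/unsere-aufgabe',
        'http://www.zeit.de/1946/04/amerika-baut-auf',
        'http://www.zeit.de/1946/04/bedingung',
        'http://www.zeit.de/1946/04/bodenrecht']

for url in urls:
    article = Article(url=url, language='de')  # the loop hands each URL over in turn
    article.download()
    article.parse()
    print(article.text.encode('cp850', errors='replace'))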