I'm using the Newspaper module for python found here.
The tutorials describe how you can pool the building of different newspapers so that they are generated at the same time (see "Multi-threading article downloads" in the link above).
Is there any way to do this for pulling articles straight from a list of URLs? That is, is there any way I can feed multiple URLs into the following set-up and have it download and parse them concurrently?
from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])
I was able to do this by creating a Source for each article URL. (Disclaimer: not a Python developer.)
I know this question is really old, but it's one of the first links that showed up when I googled how to multithread newspaper. While Kyle's answer is very helpful, it is not complete and I think it has some typos...
I changed the Stubsource to Singlesource and one of the urls to articleURL. Of course, this just downloads the web pages; you still need to parse them to get the text.
In my sample of 100 URLs, this took half the time compared to working through each URL in sequence. (Edit: After increasing the sample size to 2000, there is a reduction of about a quarter.)
(Edit: Got the whole thing working with multithreading!) I used this very good explanation for my implementation. With a sample size of 100 URLs, using 4 threads takes comparable time to the code above, but increasing the thread count to 10 gives a further reduction of about a half. A larger sample size needs more threads to give a comparable difference.
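For anyone who wants a starting point, here is a minimal sketch of downloading and parsing a list of URLs on a thread pool. It is not necessarily the implementation described above: it uses the standard library's concurrent.futures rather than newspaper's own news_pool, and the names fetch_article_text and process_urls are made up for illustration. The only newspaper calls are the ones from the question (Article, download, parse, text).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_article_text(url):
    """Download and parse one article with newspaper and return its text."""
    # Imported here so process_urls() below can be used without newspaper installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def process_urls(urls, worker=fetch_article_text, max_workers=10):
    """Run worker(url) for every URL on a thread pool.

    Returns (results, errors): two dicts keyed by URL. Hypothetical helper,
    not part of newspaper.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not abort the batch
                errors[url] = exc
    return results, errors
```

Calling process_urls(list_of_urls, max_workers=10) should return the parsed text per URL plus any failures. The right thread count depends on the sample size and the sites being fetched, so it is worth timing a few values.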
I'm not familiar with the Newspaper module but the following code uses a list of URLs and should be equivalent to the one provided in the linked page: