I have an input file which will contain a long list of URLs. Let's assume it is in mylines.txt:
https://yahoo.com
https://google.com
https://facebook.com
https://twitter.com
What I need to do is:
1. Read a line from the input file mylines.txt.
2. Execute the myFunc function, which will perform some tasks and produce one line of output. It is more complex in my real code, but something like this in concept (see the sequential sketch after this list).
3. Write the output to the results.txt file.
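For illustration, a sequential version of the idea would be something like this (the body of myFunc here is just a stand-in for my real work):

import requests

def myFunc(url):
    # stand-in for the real work: fetch the URL and describe the response
    response = requests.get(url)
    return "url is: " + url + ", response is: " + response.url

with open("mylines.txt", "r") as infile, open("results.txt", "w") as outfile:
    for url in infile:
        outfile.write(myFunc(url.strip()) + "\n")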
Since I have a large input, I need to leverage Python multithreading. I looked at this good post here, but unfortunately it assumes the input is in a simple list and does not assume I want to write the output of the function to a file.
I need to ensure that each input's output is written on a single line (i.e. avoid the danger of multiple threads writing to the same line and producing corrupted data).
I tried to mess around with it, but with no success. I have not used Python multithreading before, but it is time to learn, as it is inevitable in my case: I have a very long list which cannot finish in a reasonable time without multithreading. My real function will not do this simple task, but more operations that are not necessary for the concept.
Here is my attempt. Please correct me (in the code itself):
import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import Queue

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    results = open("myresults", "a")  # "a" to append results
    results.write("url is:", url, ", response is:", response.url)
    results.close()

worker_data = open("mylines.txt", "r")  # open my input file

# load up a queue with your data, this will handle locking
q = Queue.Queue()
for url in worker_data:
    q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

# close the pool and wait for the work to finish
pool.close()
pool.join()
Q: How can I fix the above code (please be concise and help me in the code itself) so that it reads a line from the input file, executes the function, and writes the result associated with that input on its own line, using Python multithreading to execute the requests concurrently and finish my list in a reasonable time?
UPDATE:
Based on the answer, the code becomes:
import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import queue
from multiprocessing import Queue

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

worker_data = open("mylines.txt", "r")  # open my input file

# load up a queue with your data, this will handle locking
q = queue.Queue(4)
with open("mylines.txt", "r") as f:  # open my input file
    for url in f:
        q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

with open("myresults", "w") as f:
    for line in results:
        f.write(line + '\n')
The mylines.txt contains:
https://yahoo.com
https://www.google.com
https://facebook.com
https://twitter.com
Note that at first I was using:
import Queue
And: q = Queue.Queue(4)
But I got an error saying:
Traceback (most recent call last):
File "test3.py", line 4, in <module>
import Queue
ModuleNotFoundError: No module named 'Queue'
Based on some searching, I changed it to:
import queue
And the line in question to: q = queue.Queue(4)
I also added:
from multiprocessing import Queue
But nothing works. Can any expert in Python multithreading help?
Rather than have the worker pool threads print the result out, which is not guaranteed to buffer the output correctly, create one more thread which reads results from a second Queue and prints them. I've modified your solution so that it builds its own thread pool of workers. There's little point giving the input queue infinite length, since the main thread blocks when the queue reaches maximum size: you only need it long enough to make sure there's always work waiting for the worker threads; the main thread will block and unblock as the queue size rises and falls.
It also identifies the thread responsible for each item on the output queue, which should give you some confidence that the multithreading approach is working, and prints the response code from the server. I found I had to strip the newlines from the URLs.
Since only one thread now writes to the file, there is no chance of writes interfering with each other, and each result lands on its own line.
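A minimal sketch of that arrangement, assuming the filenames from the question (the worker count, the queue bound, and the sentinel-based shutdown are illustrative choices):

import queue
import threading
import requests

def worker(in_q, out_q):
    # Each worker pulls URLs until it sees the None sentinel, then exits.
    while True:
        url = in_q.get()
        if url is None:
            break
        url = url.strip()  # strip the trailing newline read from the file
        try:
            response = requests.get(url, verify=False, timeout=(2, 5))
            out_q.put("%s: url is: %s, response code is: %s" % (
                threading.current_thread().name, url, response.status_code))
        except requests.RequestException as e:
            out_q.put("%s: url is: %s, error: %s" % (
                threading.current_thread().name, url, e))

def writer(out_q):
    # The only thread that touches the output file, so lines never interleave.
    with open("results.txt", "w") as f:
        while True:
            line = out_q.get()
            if line is None:
                break
            f.write(line + "\n")

num_workers = 4
in_q = queue.Queue(num_workers * 2)  # bounded, so the main thread blocks when it is full
out_q = queue.Queue()

workers = [threading.Thread(target=worker, args=(in_q, out_q))
           for _ in range(num_workers)]
writer_thread = threading.Thread(target=writer, args=(out_q,))
for t in workers:
    t.start()
writer_thread.start()

with open("mylines.txt", "r") as f:
    for url in f:
        in_q.put(url)
for _ in workers:
    in_q.put(None)  # one sentinel per worker so they all shut down
for t in workers:
    t.join()
out_q.put(None)  # then tell the writer thread it can finish
writer_thread.join()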
Using the data given in mylines.txt I see one result line per URL, each tagged with the worker thread that produced it.

You should change your function to return a string:
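def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url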
and write these strings to the file later:
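with open("myresults", "w") as f:
    for line in results:
        f.write(line + '\n')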
This keeps the multithreading working for the requests.get calls, but serialises writing the results to the output file.

Update:
And you should also make use of with for reading the input file:
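with open("mylines.txt", "r") as f:
    for url in f:
        q.put(url)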