How to change position of progress bar – multiproc

Posted 2020-04-16 20:03

Question:

First off, I am new to Python. It's irrelevant to the question, but I have to mention it.

I am creating a crawler as my first project to understand how things work in Python, but so far this is my major issue: understanding how to get multiple progress bars in the terminal while using requests and pathos.multiprocessing.

I managed to get everything else working; I just want prettier output, so I decided to add progress bars. I am using tqdm because I like the look and it seems the easiest to implement.

Here's my method, whose purpose is to download the file.

def download_lesson(self, lesson_data):
    if 'file' not in lesson_data:
        return print('=> Skipping... File {file_name} already exists.'.format(file_name=lesson_data['title']))

    response = requests.get(lesson_data['video_source'], stream=True)
    chunk_size = 1024

    with open(lesson_data['file'], 'wb') as file:
        progress = tqdm(
            total=int(response.headers['Content-Length']),
            unit='B',
            unit_scale=True
        )

        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                progress.update(len(chunk))
                file.write(chunk)

        progress.close()
        print('=> Success... File "{file_name}" has been downloaded.'.format(file_name=lesson_data['title']))

I run that method through a process pool:

# c = instance of my crawling class
# cs = returns the `lesson_data` for `download_lesson` method

p = Pool(1)
p.map(c.download_lesson, cs)

Everything works great as long as I use processes=1 in the Pool. But when I run multiple processes, let's say processes=3, things start to get weird and I get multiple progress bars printed one inside another.

I found in the tqdm documentation that there is a position parameter, which clearly describes what I need in this case.

position : int, optional
    Specify the line offset to print this bar (starting from 0). Automatic if unspecified. Useful to manage multiple bars at once (eg, from threads).

However, I have no clue how to set that position. I tried some weird stuff, such as adding a variable that's supposed to increment itself by one, but whenever the method download_lesson is run, it doesn't seem to do any incrementing. It is always 0, so position is always 0.

So it seems like I don't understand much in this case... Any tips, hints or complete solutions are welcome. Thank you!
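
One plausible explanation for the counter staying at 0, sketched under the assumption that the pool serializes the bound method (and the instance behind it) for every task it hands to a worker: each call then runs on its own copy of the object, so an increment never reaches the original. The Crawler class and next_position attribute below are hypothetical stand-ins, and copy.deepcopy only imitates the serialization round trip.

```python
import copy


class Crawler:
    """Hypothetical stand-in for the crawling class in the question."""

    def __init__(self):
        self.next_position = 0

    def download_lesson(self, lesson):
        position = self.next_position
        self.next_position += 1   # only changes the copy this call runs on
        return position


crawler = Crawler()

# A process pool has to serialize the bound method (and its instance) to send
# work to a worker; deepcopy imitates that round trip here.
positions = [copy.deepcopy(crawler).download_lesson(lesson)
             for lesson in ['a', 'b', 'c']]

print(positions)               # [0, 0, 0]
print(crawler.next_position)   # 0, the original never sees the increments
```

If this is what happens, the position has to travel with the task itself (as part of the argument) rather than live on the instance.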


UPDATE #1:

I found out that I can pass another argument to map as well, so I am passing a range over the number of processes that was set (e.g. processes=2).

p = Pool(config['threads'])
p.map(c.download_lesson, cs, range(config['threads']))

So, in my method I tried to print out that argument and indeed I do get 0 and 1, as I am running 2 processes in the example.

But this does not seem to do anything at all...

progress = tqdm(
    total=int(response.headers['Content-Length']),
    unit='B',
    unit_scale=True,
    position=progress_position
)

I still get the same issue of overlapping progress bars. When I manually set position to a fixed value (e.g. 10), the bar jumps in the terminal, so position does move, but the bars still overlap, of course, because now both are set to 10. When I set it dynamically, it does not seem to work either. I don't understand what my issue is here... It's as if, when map runs this method two times, it still gives the latest set position to both progress bars. What the heck am I doing wrong?
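
For reference, a minimal sketch of one way to give each task its own line with a process pool. It uses the standard library's multiprocessing.Pool instead of pathos, and follows the pattern from the tqdm documentation of sharing one write lock between workers through tqdm.set_lock. The names with_positions, fake_download and run_pool are made up for illustration, and the "download" is simulated with sleep. Each item is paired with its position in a single iterable, so nothing depends on how the pool's map treats a second, shorter iterable.

```python
from multiprocessing import Pool, RLock
from time import sleep

from tqdm import tqdm


def with_positions(items):
    """Pair every item with its own line offset: [(0, a), (1, b), ...]."""
    return list(enumerate(items))


def fake_download(job):
    """Stand-in for download_lesson: 'receives' `size` one-byte chunks."""
    position, size = job
    with tqdm(total=size, unit='B', unit_scale=True,
              position=position, desc='file {}'.format(position)) as progress:
        for _ in range(size):
            sleep(0.001)             # pretend to wait for the next chunk
            progress.update(1)
    return position


def run_pool(sizes, processes=3):
    """Map fake_download over the sizes with one shared tqdm write lock."""
    tqdm.set_lock(RLock())           # workers take turns writing to the terminal
    with Pool(processes=processes, initializer=tqdm.set_lock,
              initargs=(tqdm.get_lock(),)) as pool:
        return pool.map(fake_download, with_positions(sizes))
```

Call run_pool([300, 200, 400]) from under an if __name__ == '__main__': guard. Giving every task a unique position only suits a short task list; with many files the line offsets would have to be reused.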

Answer 1:

OK, first off I'd like to thank @MikeMcKerns for his comment... There are lots of changes to my script because I wanted a different approach, but in the end it comes down to these important changes.

My init.py now looks that much cleaner...

from scraper.Crawl import Crawl

if __name__ == '__main__':
    Crawl()

My download_lesson method inside the scraper.Crawl class now looks like this...

def download_lesson(self, lesson):
    # Stream the response so the file is fetched chunk by chunk.
    response = requests.get(lesson['link'], stream=True)
    chunk_size = 1024

    # One bar per download, sized from the Content-Length header.
    progress = tqdm(
        total=int(response.headers['Content-Length']),
        unit='B',
        unit_scale=True
    )

    with open(lesson['file'], 'wb') as file:
        for chunk in response.iter_content(chunk_size=chunk_size):
            progress.update(len(chunk))
            file.write(chunk)

    progress.close()

And finally, I have a method dedicated to multiprocessing, which looks like this:

def begin_processing(self):
    pool = ThreadPool(nodes=Helper.config('threads'))

    for course in self.course_data:
        pool.map(self.download_lesson, course['lessons'])
        print(
            'Course "{course_title}" has been downloaded, with total of {lessons_amount} lessons.'.format(
                course_title=course['title'],
                lessons_amount=len(course['lessons'])
            )
        )

So as you can tell, I made some major changes to my class, but most importantly I had to add this bit to my init.py

if __name__ == '__main__':

And secondly, I had to use what @MikeMcKerns suggested I take a look at:

from pathos.threading import ThreadPool

So with those changes, I finally got everything working as I needed. Here's a quick screenshot.

Even though I still have no clue why pathos.multiprocessing makes the tqdm progress bars so buggy, I managed to solve my problem thanks to Mike's suggestion. Thank you!
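
For anyone who also wants explicit control over which line each bar uses with threads, here is a rough sketch. It swaps in the standard library's concurrent.futures.ThreadPoolExecutor for pathos.threading.ThreadPool, and the download_all function, the (title, chunks) lesson format and the out parameter are all hypothetical. Each worker claims a free line number from a queue, draws its bar there, and hands the line back when it finishes.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

from tqdm import tqdm


def download_all(lessons, workers=3, out=None):
    """Run simulated downloads in threads, one progress-bar line per slot.

    `lessons` is a list of (title, chunks) pairs, where `chunks` is a list
    of bytes objects standing in for response.iter_content().
    """
    slots = queue.Queue()
    for slot in range(workers):
        slots.put(slot)              # free screen lines: 0 .. workers - 1

    def download(lesson):
        title, chunks = lesson
        slot = slots.get()           # claim a free line
        try:
            received = 0
            with tqdm(total=sum(len(c) for c in chunks), desc=title,
                      unit='B', unit_scale=True, position=slot,
                      leave=False, file=out) as progress:
                for chunk in chunks:
                    received += len(chunk)
                    progress.update(len(chunk))
            return title, received
        finally:
            slots.put(slot)          # hand the line back for the next lesson

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download, lessons))
```

Because at most `workers` downloads run at once and there are exactly `workers` slots, slots.get() never waits long, and two active bars should not share a line.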