Python crawler does not work properly

Posted 2019-08-21 07:30

I had just written a Python crawler to download MIDI files from freemidi.org. Looking at the request headers in Chrome, I found that to download the file properly from the download URL https://freemidi.org/getter-20225 (referred to as "getter-20225" below), the "Referer" header had to be https://freemidi.org/download-20225 (referred to as "download-20225" below). I did the same in Python, setting the headers like this:

headers = {
    'Referer': 'https://freemidi.org/download-20225',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

which was exactly the same as the request header I had viewed in Chrome, and I tried to download the file using this line of code.

midi = requests.get(url, headers=headers).content

However, it did not work. Instead of the MIDI file, it downloaded the HTML of the "download-20225" page. I later found that if I try to access "getter-20225" directly, it takes me to "download-20225" as well. I suspect this means my headers were wrong, so the site sent me to the other page instead of starting the download.

I'm quite new to writing Python crawlers, so could someone help me find what went wrong with the program?

1 answer
倾城 Initia
#2 · 2019-08-21 07:46

It looks like the problem is that the page serving the MIDI file (e.g. "getter-20225") redirects you back to the song page (e.g. "download-20225") after sending the song. By default, requests follows redirects and returns only the content of the final page in the redirect chain, which here is the HTML page.

You can set the allow_redirects parameter to False to have requests return the content from the "getter" page (i.e. the midi file):

midi = requests.get(url, headers=headers, allow_redirects=False)
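To check whether this is really what the server does, it may help to look at the unfollowed response before trusting its body. The following is only a minimal sketch: the helper names `fetch_getter` and `summarize_response` are made up for illustration, the shortened User-Agent is a placeholder, and whether freemidi.org actually sends the file alongside the redirect is an assumption taken from the explanation above.

```python
import requests


def fetch_getter(url, referer):
    """Request the 'getter' URL once, without following any redirect."""
    headers = {
        "Referer": referer,
        # Placeholder value; the full Chrome string from the question also works.
        "User-Agent": "Mozilla/5.0",
    }
    return requests.get(url, headers=headers, allow_redirects=False, timeout=30)


def summarize_response(response):
    """Collect the facts needed to tell a redirect apart from a real download."""
    return {
        "status": response.status_code,
        # Only set when the server answered with a redirect status and a Location header.
        "redirects_to": response.headers.get("Location") if response.is_redirect else None,
        "body_bytes": len(response.content),
    }
```

Calling `summarize_response(fetch_getter(url, 'https://freemidi.org/download-20225'))` should then show the status code, where the server wants to send you, and whether the body is non-empty.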

Note that to write the MIDI file to disk, you need to open the target file in binary mode, since the response content is bytes:

with open('example.mid', 'wb') as ex:
    ex.write(midi.content)
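
Since the original symptom was an HTML page being saved in place of the song, a small guard before writing can catch that case. This is a sketch under one assumption: the site serves Standard MIDI Files, which begin with the four bytes `MThd`. The names `looks_like_midi` and `save_midi` are hypothetical helpers, not part of requests.

```python
from pathlib import Path

# Standard MIDI Files start with the header chunk ID "MThd".
MIDI_MAGIC = b"MThd"


def looks_like_midi(data: bytes) -> bool:
    """Return True if the bytes start with the Standard MIDI File header."""
    return data.startswith(MIDI_MAGIC)


def save_midi(data: bytes, path: str) -> Path:
    """Write the bytes to disk, refusing anything that is not a MIDI file."""
    if not looks_like_midi(data):
        raise ValueError("body is not a MIDI file; the server may have returned an HTML page")
    target = Path(path)
    target.write_bytes(data)  # write_bytes opens the file in binary mode
    return target
```

With the response from above, `save_midi(midi.content, 'example.mid')` either writes the file or raises instead of silently saving a web page.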