Python crawler does not work properly

I've just written a Python crawler to download MIDI files from freemidi.org. Looking at the request headers in Chrome, I found that when the download URL is https://freemidi.org/getter-20225 (referred to as "getter-20225" below), the "Referer" header has to be https://freemidi.org/download-20225 (referred to as "download-20225" below) for the MIDI file to download properly. I set the headers in Python like this:

headers = {
    'Referer': 'https://freemidi.org/download-20225',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

These match the request headers I saw in Chrome. I then tried to download the file with this line of code:

midi = requests.get(url, headers=headers).content

However, it did not work properly. Instead of downloading the MIDI file, it downloaded the HTML of the "download-20225" page. I later found that if I open "getter-20225" directly, it takes me to "download-20225" as well. I think this probably means the headers were wrong, so the site sent me to the other page instead of starting the download.

I'm quite new to writing Python crawlers, so could someone help me find what went wrong with the program?

It looks like the problem here is that the page serving the MIDI file (e.g. "getter-20225") redirects you back to the song page (e.g. "download-20225") after sending the file. By default, requests follows redirects and returns only the content of the final page in the chain, which is why you end up with HTML.
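If you want to confirm that a redirect is what's happening, here is a minimal sketch. It relies on the fact that requests stores the intermediate responses in response.history; the helper name redirect_chain is just something made up for illustration.

```python
def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response.

    `response` is expected to be a requests.Response (or anything exposing
    .history, .status_code and .url in the same way).
    """
    hops = list(response.history) + [response]
    return [(r.status_code, r.url) for r in hops]


# Possible usage (needs the third-party requests package and network access):
#   import requests
#   r = requests.get(url, headers=headers)
#   for status, hop_url in redirect_chain(r):
#       print(status, hop_url)
```

If the list has more than one entry, the request was redirected, and the first entry is the response from the "getter" page.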

You can set the allow_redirects parameter to False to have requests return the content from the "getter" page (i.e. the midi file):

midi = requests.get(url, headers=headers, allow_redirects=False)

Note that if you want to write the MIDI file to disk, you will need to open your target file in binary mode, since midi.content is a bytes object rather than text.

with open('example.mid', 'wb') as ex:
    ex.write(midi.content)
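Putting the pieces together, here is a rough sketch rather than a definitive implementation. The function names (looks_like_midi, download_midi) and the optional session parameter are my own inventions for illustration. The only extra thing it relies on is that standard MIDI files start with the four bytes MThd, which gives a cheap way to tell a real MIDI file from an HTML page before writing anything to disk.

```python
def looks_like_midi(data):
    """Standard MIDI files begin with the 4-byte header chunk ID b'MThd'."""
    return data[:4] == b"MThd"


def download_midi(getter_url, referer_url, out_path, session=None):
    """Fetch getter_url without following redirects and save it if it is MIDI.

    Returns True when a file was written, False when the body was not MIDI
    (for example, when the server answered with an HTML page instead).
    """
    if session is None:
        # Third-party dependency; imported here so a stand-in session
        # object can be passed in instead (e.g. for offline testing).
        import requests
        session = requests.Session()

    headers = {
        "Referer": referer_url,
        # Shortened placeholder; use the full browser string if the site needs it.
        "User-Agent": "Mozilla/5.0",
    }
    # allow_redirects=False keeps the body of the "getter" response
    # instead of the page it redirects to.
    response = session.get(getter_url, headers=headers, allow_redirects=False)

    if not looks_like_midi(response.content):
        return False

    # Binary mode, because response.content is bytes.
    with open(out_path, "wb") as f:
        f.write(response.content)
    return True


# Possible usage (needs network access):
#   ok = download_midi("https://freemidi.org/getter-20225",
#                      "https://freemidi.org/download-20225",
#                      "example.mid")
```

Checking the first bytes before saving means a wrong header or an unexpected redirect shows up as a False return value instead of a .mid file full of HTML.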
