Script that uses parameters and reads results

Posted 2019-08-14 13:19

I am trying to write a script that takes in a URL with certain parameters, reads a list of new URLs from the resulting web page, and downloads them locally. I am very new to programming and have never used Python 3, so I am a little lost.

Here is example code to explain further:

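import urllib.request
from urllib.parse import urlencode

# Placeholders: the real parameter values are not given here
param1 = ""
param2 = ""
param3 = ""

# Build the query string from the variables rather than their literal names
query = urlencode({"target": param1, "query": param2, "other": param3})
requestURL = "http://examplewebpage.com/live2/?" + query

html_content = urllib.request.urlopen(requestURL).read()

# I don't know where to go from here.
# I need something that finds each URL on the page and appends it to a list,
# then downloads everything in that list.

# This can download something from a link:
# urllib.request.urlretrieve(url, newfilelocation)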

The output from the request URL is a very long page, in either XML or JSON, that contains a lot of information I don't need, so some form of searching is required to find the URLs to download from. The URLs found on the page lead directly to the needed files (they end in .jpg, .cat, etc.).

Please let me know if you need any other information! My apologies if this is confusing.

Also, ideally I would have the downloaded files all go to a new folder (sub-dir) created for them and named with the current date and time, but I think I can figure this part out myself.

2 answers
成全新的幸福
#2 · 2019-08-14 13:55

I would recommend checking out BeautifulSoup for parsing the returned page. With it, you can loop through the links, extract each link address fairly easily, and append it to a list.
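
As a rough sketch of that loop, here is a version that uses only the standard library's html.parser in place of BeautifulSoup (a deliberate swap so that it runs without installing anything; BeautifulSoup's own API is not shown). The LinkCollector class, the collect_links function, and the sample markup are hypothetical names made up for illustration, and it assumes the links sit in href or src attributes of an HTML page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href/src attribute values seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def collect_links(html_text):
    """Return every href/src value found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Hypothetical sample page, only for illustration
sample = (
    '<a href="http://examplewebpage.com/files/a.jpg">a</a>'
    '<img src="http://examplewebpage.com/files/b.cat">'
)
links = collect_links(sample)
print(links)
```

If the page turns out to be XML or JSON rather than HTML, this attribute-based approach will probably not find much, and a parser for that format is the better fit.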

冷血范
#3 · 2019-08-14 13:56

It looks like you are trying to build something similar to a web crawler, unless you want to render the content. Exploring the scrapy source code will help you understand how others have written similar logic. I would suggest using the requests library instead of urllib since it is easier to use. Python's standard library has built-in HTML, JSON, and XML parsers.
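
To illustrate the built-in parsers, here is a minimal sketch that pulls URL-like strings out of a JSON or XML body using the json and xml.etree.ElementTree modules. It assumes the wanted values are strings starting with http:// or https://; the helper names and the two sample documents are invented for the example and are not taken from the real page.

```python
import json
import xml.etree.ElementTree as ET


def looks_like_url(value):
    """True for strings that start with an http(s) scheme."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


def urls_from_json(text):
    """Walk a decoded JSON document and collect every URL-like string value."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)
        elif looks_like_url(node):
            found.append(node)

    walk(json.loads(text))
    return found


def urls_from_xml(text):
    """Collect URL-like element text and attribute values (tail text is ignored)."""
    found = []
    for element in ET.fromstring(text).iter():
        candidates = [element.text] + list(element.attrib.values())
        for value in candidates:
            if value is not None and looks_like_url(value.strip()):
                found.append(value.strip())
    return found


# Hypothetical sample bodies, only for illustration
json_sample = '{"items": [{"name": "a", "file": "http://examplewebpage.com/files/a.jpg"}], "next": null}'
xml_sample = (
    "<items><item>http://examplewebpage.com/files/b.cat</item>"
    '<item url="http://examplewebpage.com/files/c.jpg"/></items>'
)
json_urls = urls_from_json(json_sample)
xml_urls = urls_from_xml(xml_sample)
```

Each function returns a plain list, so the results can be filtered by extension (for example with str.endswith) before anything is downloaded.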

If the page type is unknown, inspect the Content-Type header to find out what kind of content you are downloading. There are alternative strategies as well; scrapy should give you more ideas.
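
One possible way to act on the header, sketched with urllib.request from the standard library rather than requests or scrapy. The function names and the mapping from media types to format names are assumptions made for illustration; a real server may send other values, and fetch performs a network request, so it is defined here but not called.

```python
import urllib.request


def classify_content_type(header_value):
    """Map a Content-Type header value to a rough format name."""
    # Drop parameters such as "; charset=utf-8" and normalise case
    media_type = header_value.split(";")[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return "json"
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return "xml"
    if media_type == "text/html":
        return "html"
    return "unknown"


def fetch(url):
    """Return (format_name, body_bytes) for url. Performs a network request."""
    with urllib.request.urlopen(url) as response:
        header = response.headers.get("Content-Type", "")
        return classify_content_type(header), response.read()
```

The returned format name can then be used to choose which parser to hand the body to.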

Hope this helps.
