Python to list HTTP-files and directories

2019-02-24 20:19发布

How can I list files and folders if I only have an IP-address?

With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?

I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).

5条回答
狗以群分
2楼-- · 2019-02-24 20:31

Use requests to get page content and BeautifulSoup to parse the result.
For example if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:

from bs4 import BeautifulSoup
import requests

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'

def listFD(url, ext=''):
    page = requests.get(url).text
    print page
    soup = BeautifulSoup(page, 'html.parser')
    return [url + '/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]

for file in listFD(url, ext):
    print file
查看更多
在下西门庆
3楼-- · 2019-02-24 20:32

You can use the following script to get names of all files in sub-directories and directories in a HTTP Server. A file writer can be used to download them.

from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
def read_url(url):
    url = url.replace(" ","%20")
    req = Request(url)
    a = urlopen(req).read()
    soup = BeautifulSoup(a, 'html.parser')
    x = (soup.find_all('a'))
    for i in x:
        file_name = i.extract().get_text()
        url_new = url + file_name
        url_new = url_new.replace(" ","%20")
        if(file_name[-1]=='/' and file_name[0]!='.'):
            read_url(url_new)
        print(url_new)

read_url("www.example.com")
查看更多
Ridiculous、
4楼-- · 2019-02-24 20:38

Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:

import requests
from bs4 import BeautifulSoup

def get_url_paths(url, ext='', params={}):
    response = requests.get(url, params=params)
    if response.ok:
        response_text = response.text
    else:
        return response.raise_for_status()
    soup = BeautifulSoup(response_text, 'html.parser')
    parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
    return parent

url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)
查看更多
看我几分像从前
5楼-- · 2019-02-24 20:39

You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.

For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.

Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "ip address" you want to access would do the job), or set up an HTTP server on that machine that provides for each path (http://192.168.2.100/directory) a list of files in it (in whatever format) and parse that through Python.

If the server provides an "index of /bla/bla" kind of page (like Apache server do, directory listings), you could parse the HTML output to find out the names of files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck :(, you can't do it.

查看更多
冷血范
6楼-- · 2019-02-24 20:44

HTTP does not work with "files" and "directories". Pick a different protocol.

查看更多
登录 后发表回答