How can I find the source code for a website using

Posted 2019-09-21 04:09

Question:

Ok, so here's the web page: https://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/

What I want to do is download the source of that web page (the equivalent of right-clicking in a browser and selecting View Source), but I need to do it in a batch file without outside tools like wget. I know how to download files using bitsadmin in a batch file, but I'm running into trouble because I don't know the actual URL of the web page. I've tried adding index.html, index.htm, and all sorts of other page names to the end, and none of them are valid. So how can I find the ACTUAL page name to download?

More info for those who care: the purpose is to parse the page source to determine the ever-changing filenames of the GeoTIFF files on the page, then download those files automatically (rather than manually right-clicking each file and choosing Save As about 55 times).

Answer 1:

You could use the Microsoft.XMLHTTP COM object in Windows Script Host (VBScript or JScript). Here's a hybrid Batch + JScript example (it should be saved with a .bat extension):

@if (@CodeSection == @Batch) @then
@echo off & setlocal

set "url=https://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/"

cscript /nologo /e:JScript "%~f0" "%url%"

goto :EOF
@end // end Batch / begin JScript

var xhr = WSH.CreateObject('Microsoft.XMLHTTP');

xhr.open('GET', WSH.Arguments(0), true);
xhr.setRequestHeader('User-Agent','XMLHTTP/1.0');
xhr.send('');
while (xhr.readyState != 4) WSH.Sleep(50);

WSH.Echo(xhr.responseText);

Example usage would be something like scriptname.bat > saved.html. Or since you're going this far, you might as well let JScript turn that raw HTML data into something useful. Here's an example that scrapes all the tables on that page using DOM methods, builds an object of the table data, then serializes it into JSON for easier parsing or deserialization by other tools:

@if (@CodeSection == @Batch) @then
@echo off & setlocal

set "url=https://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/"

cscript /nologo /e:JScript "%~f0" "%url%"

goto :EOF
@end // end Batch / begin JScript

var xhr = WSH.CreateObject('Microsoft.XMLHTTP'),
    DOM = WSH.CreateObject('htmlfile'),
    JSON, obj = {};

xhr.open('GET', WSH.Arguments(0), true);
xhr.setRequestHeader('User-Agent','XMLHTTP/1.0');
xhr.send('');
while (xhr.readyState != 4) WSH.Sleep(50);

DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />'
    + xhr.responseText);

JSON = DOM.parentWindow.JSON;

var tables = DOM.getElementsByTagName('table');

for (var i=0; i<tables.length; i++) {
    var cols = [],
        rows = tables[i].rows,
        caption = tables[i].caption ? tables[i].caption.innerText : i;

    for (var j=0; j<rows.length; j++) {
        if (!cols.length) {
            for (var k=0; k < rows[j].cells.length; k++) {
                var cell = rows[j].cells[k].innerText;
                cols.push(cell);
            }
            obj[caption] = {};
        } else {
            var row = rows[j].cells[0].innerText;
            obj[caption][row] = {};
            for (var k=1; k < rows[j].cells.length; k++) {
                var a = rows[j].cells[k].getElementsByTagName('a'),
                    links = new DOM.parentWindow.Array();
                if (a && a.length) {
                    for (var l=0; l<a.length; l++) links.push(a[l].href);
                    obj[caption][row][cols[k]] = links;
                } else {
                    obj[caption][row][cols[k]] = rows[j].cells[k].innerText;
                }
            }
        }
    }
}

WSH.Echo(JSON.stringify(obj, null, '    '));
DOM.close();

That lets you do neat stuff like query the data as a hierarchical structure, as in this PowerShell script (saved with a .ps1 extension, and assuming the batch script above was saved as test.bat in the same directory):

add-type -as System.Web.Extensions
$JSON = New-Object Web.Script.Serialization.JavaScriptSerializer
$data = cmd /c test.bat
$obj = $JSON.DeserializeObject($data)
$obj['Helicopter Route Charts']['Boston']['Current Edition No. and Date']

This all works with functionality built into Windows without requiring any 3rd party applications or downloads beyond the web request to faa.gov.
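
If the goal is only the list of downloadable chart files, a regular-expression pass over xhr.responseText may be enough, without building a DOM. The following is a minimal sketch, not a tested scraper for the FAA page: the function names extractLinks, toAbsolute and fileNameOf are made up for illustration, and the extensions you pass in are an assumption you would need to check against the page's real links. It sticks to ES3 syntax so that it should run under cscript's JScript engine as well as in a modern JavaScript runtime.

```javascript
// Hypothetical helpers (not part of the answer's script); plain ES3 only.

// Return the unique href values whose path ends with one of the extensions.
function extractLinks(html, extensions) {
    var re = /href\s*=\s*["']([^"']+)["']/gi,
        links = [], seen = {}, m;
    while ((m = re.exec(html)) !== null) {
        var href = m[1],
            cut = href.search(/[?#]/),              // ignore query string / fragment
            path = (cut === -1 ? href : href.substring(0, cut)).toLowerCase();
        for (var i = 0; i < extensions.length; i++) {
            var ext = extensions[i].toLowerCase();
            if (path.substring(path.length - ext.length) === ext) {
                if (!seen.hasOwnProperty(href)) {   // skip duplicates
                    seen[href] = true;
                    links.push(href);
                }
                break;
            }
        }
    }
    return links;
}

// Turn a root-relative link into a full URL. Page-relative links are not
// handled in this sketch and yield null.
function toAbsolute(origin, href) {
    if (/^[a-z][a-z0-9+.\-]*:/i.test(href)) return href;  // already absolute
    if (href.charAt(0) === '/' && href.charAt(1) !== '/') return origin + href;
    return null;
}

// Last path segment of a URL, usable as the local file name.
function fileNameOf(url) {
    var cut = url.search(/[?#]/),
        path = cut === -1 ? url : url.substring(0, cut);
    return path.substring(path.lastIndexOf('/') + 1);
}
```

In the first hybrid script, the final WSH.Echo(xhr.responseText) line could be replaced by a loop over extractLinks(xhr.responseText, ['.zip', '.tif']) that echoes toAbsolute('https://www.faa.gov', link) for each result. The batch side could then read those lines with for /f and hand each URL, with fileNameOf giving the destination name, to bitsadmin.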



Answer 2:

You can use curl, which ships with Windows 10 (version 1803 and later) as curl.exe.

When you run curl followed by a URL, it writes the source of the page to standard output, which you can redirect to a file:

curl http://yourAddress.com > tmp.txt

The result will be stored in a tmp.txt file.