I am having a bit of trouble coding a process or script that would do the following:
I need to get data from this URL:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
But the file URLs change (the days and model runs change), so the script has to assume this base structure, with these variables:
Y - Year
M - Month
D - Day
C - Model Forecast/Initialization Hour
F - Model Frame Hour
Like so:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
This script would run and then fill in the current date (as YYYYMMDD, plus the CC cycle hour) for those variables.
So while the mission today is to get
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
the variables should always resolve to the current date, in the format of:
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
Can you please advise how to go about getting the URLs for the latest date in this format? Whether it's a script or something with wget, I'm all ears. Thank you in advance.
The easiest solution would be just to mirror the parent directory:
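For instance (a one-line sketch; --mirror turns on recursion with timestamping, and --no-parent keeps wget from climbing above the gfs_hd tree):

```sh
wget --mirror --no-parent http://nomads.ncep.noaa.gov/dods/gfs_hd/
```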
However, if you just want the latest date, you can use Mojo::UserAgent, as demonstrated on Mojocast Episode 5.
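A minimal sketch in that spirit (untested against the live server; it assumes the index's daily links match gfs_hdYYYYMMDD, so the lexically greatest is the newest):

```perl
use Mojo::Base -strict;
use Mojo::UserAgent;

my $url = 'http://nomads.ncep.noaa.gov/dods/gfs_hd';

# Fetch the index page and keep the hrefs that look like daily
# directories; sorting them puts the newest date last.
my $latest = Mojo::UserAgent->new->get($url)->res->dom
  ->find('a[href]')
  ->map(attr => 'href')
  ->grep(qr/gfs_hd\d{8}/)
  ->sort->last;

say $latest;
```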
On May 23rd, 2014, this output the URL of the latest date's directory.
In Python, the requests library can be used to get at the URLs.

You can generate the URL from the base URL string plus a timestamp, using the datetime module's datetime class together with timedelta and strftime to produce the date in the required format. That is, start by getting the current time with datetime.datetime.now(), then in a loop subtract an hour (or whichever time step you think they're using) via timedelta, and keep checking each candidate URL with requests. The first one that exists is the latest one, and you can then do whatever further processing you need with it.
If you need to scrape the contents of the page, scrapy works well for that.

I'd try scraping the index one level up at http://nomads.ncep.noaa.gov/dods/gfs_hd; the last link of the expected form there should take you to the daily downloads pages, where you could do something similar.
Here's an outline of scraping the daily downloads page:
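This is a rough sketch of that outline, using requests and a regular expression in place of a full scrapy spider (the latest_cycle helper is my own name, and it assumes the page lists its gfs_hd_CCz cycle links so that the lexically greatest is the newest):

```python
import re

import requests

def latest_cycle(day_url):
    """Return the URL of the newest gfs_hd_CCz cycle on a daily page."""
    html = requests.get(day_url).text
    # The daily page links to each model cycle as gfs_hd_00z, gfs_hd_06z, ...
    cycles = sorted(set(re.findall(r"gfs_hd_\d{2}z", html)))
    if not cycles:
        return None
    return day_url.rstrip("/") + "/" + cycles[-1]

# e.g. the April 30th, 2014 page from the question:
print(latest_cycle("http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430"))
```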
and scraping the index page that lists the last thirty daily directories would, of course, be very similar.