How can I extract td from html in bash?

Posted 2019-07-20 18:01

I am querying London postcode data from geonames:

http://www.geonames.org/postalcode-search.html?q=london&country=GB

I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?

3 answers
Ridiculous、
#2 · 2019-07-20 18:34

I see the site offers (but not for free) web services with XML or JSON data... It would be the best way, since the HTML page is not meant to be parsed (easily).

That said, it can be done. Using strictly bash built-ins would be very hard, if not impossible; usually several other common tools are piped together to get the result. Sometimes, though, it turns out to be more convenient to stick to a single tool such as Perl instead of combining cat, grep, awk, sed and whatever else.

Something like

sed -e 's/>/>\n/g' region.html |
   grep -E -i "^\s*[A-Z]+[0-9]+</td>" |
   sed -e 's|</td>||g'

worked for me and extracted 200 lines, assuming the codes follow a specific format (letters followed by digits).
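To try the pipeline above without downloading the page, here is a minimal sketch that runs it on a small inline sample. The sample rows and the function name `extract_codes` are made up for illustration and are not the real page markup; the sketch also assumes GNU sed, since `\n` in the replacement is a GNU extension.

```shell
#!/bin/sh
# Hypothetical sample imitating two rows of a results table
# (not copied from the real page).
sample='<tr><td>1</td><td>London Bridge</td><td>SE1</td><td>United Kingdom</td></tr>
<tr><td>2</td><td>Kilburn</td><td>NW6</td><td>United Kingdom</td></tr>'

# Read HTML on stdin and print anything that looks like LETTERS+DIGITS
# directly before a closing </td>.
extract_codes() {
  sed -e 's/>/>\n/g' |                               # one tag per line (GNU sed)
    grep -E -i '^[[:space:]]*[A-Z]+[0-9]+</td>' |    # keep cells shaped like SE1
    sed -e 's|</td>||g'                              # drop the closing tag
}

printf '%s\n' "$sample" | extract_codes
```

Note that this yields the codes rather than the place names, and it depends entirely on every code being letters followed by digits.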

Addendum

If there's no limit to the software you can use to parse the data, then you could use a line like

wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
     sgrep '"<table class=\"restable\"" .. "</table>"' | 
     sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;;&nbsp;.*$| |g' |
     grep -v "^\s*$" |
     tail -n+2 | cut -d";" -f2,3

which extracts places and postal codes separated by a ; as in a CSV file. Something similar can be done with awk (here $html is assumed to hold the URL above):

wget -q "$html" -O - | 
     w3m -dump -T 'text/html' |
     awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'

which is based on the answer by Peter.O and extracts the same data. Other variations are possible too. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. perl.
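The tag-to-semicolon idea from the sgrep pipeline above can be tried offline with standard tools only. This is just a sketch under assumptions: the sample rows are invented, each table row is assumed to sit on its own line (so no sgrep and no row splitting is needed), and `to_csv` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical sample: a header row plus two data rows, one row per line.
rows='<tr><th></th><th>Place</th><th>Code</th><th>Country</th></tr>
<tr><td>1</td><td>London Bridge</td><td>SE1</td><td>United Kingdom</td></tr>
<tr><td>2</td><td>Kilburn</td><td>NW6</td><td>United Kingdom</td></tr>'

# Turn cell boundaries into ';', strip the remaining tags,
# skip the header row, and keep the place and code columns.
to_csv() {
  sed -E -e 's#</t[dh]>[[:space:]]*<t[dh][^>]*>#;#g' \
         -e 's#<[^>]+>##g' |
    grep -v '^[[:space:]]*$' |   # drop blank lines
    tail -n +2 |                 # drop the header row
    cut -d';' -f2,3              # place;code
}

printf '%s\n' "$rows" | to_csv
```

On the real page the rows are not guaranteed to be one per line, which is why the original pipeline first isolates the table and inserts a newline after every `/tr>`.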

放荡不羁爱自由
#3 · 2019-07-20 18:44

If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:

mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'

The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.

你好瞎i
#4 · 2019-07-20 18:52

I'm not sure whether you want a newline-delimited list like this one, or a comma-delimited list in brackets:

html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
  w3m -dump -T 'text/html'|
    sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'

w3m is a: "WWW browsable pager with excellent tables/frames support"

output (first 10 lines)

London Bridge   
Kilburn         
Ealing          
Wandsworth      
Pimlico         
Kensington      
Leyton          
Leytonstone     
Plaistow        
Poplar          
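The sed expression above can be checked without w3m or network access by feeding it text shaped like a table dump. In this sketch the sample lines are invented (they only imitate the column layout the regex expects), `extract_names` is a hypothetical name, and the capture group is tightened to `(.*[^ ])` so the trailing padding visible in the output above is trimmed.

```shell
#!/bin/sh
# Hypothetical lines imitating a text dump of the results table.
dump='     1 London Bridge    SE1     United Kingdom  England
     2 Kilburn          NW6     United Kingdom  England
       51.505/-0.087'

# Print the place name: the text between the row number and the
# LETTERS+DIGITS code that precedes "United Kingdom".
extract_names() {
  sed -nE 's/^ +[0-9]+ +(.*[^ ]) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'
}

printf '%s\n' "$dump" | extract_names
```

Lines that do not match the whole pattern, such as the coordinate line in the sample, are dropped because of `-n` together with the `p` flag.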