How can I extract td from html in bash?

I am querying London postcode data from geonames:

http://www.geonames.org/postalcode-search.html?q=london&country=GB

I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?

标签： html regex bash shell screen-scraping

3条回答

Ridiculous、

2楼-- · 2019-07-20 18:34

I see the site offers (but not for free) web services with XML or JSON data... It would be the best way, since the HTML page is not meant to be parsed (easily).

Anyway, nothing is impossible, nonetheless using strictly only bash commands would be a lot hard, if not impossible; often several other common tools are piped in order to achieve the result. But then, sometimes it turns to be more conveniente to stick to a single tool like e.g. Perl, instead of combining cat, grep, awk, sed and whatever else.

Something like

sed -e 's/>/>\n/g' region.html |
   egrep -i "^\s*[A-Z]+[0-9]+</td>" |
   sed -e 's|</td>||g'

worked extracting 200 lines, assuming a specific format for the code.

ADD

If there's no limit to the software you can use to parse the data, then you could use a line like

wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
     sgrep '"<table class=\"restable\"" .. "</table>"' | 
     sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;;&nbsp;.*$| |g' |
     grep -v "^\s*$" |
     tail -n+2 | cut -d";" -f2,3

which extracts places and postal codes seperated by a ; like in a CSV, as well as awk:

wget -q "$html" -O - | 
     w3m -dump -T 'text/html' |
     awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'

which is based on the answer by Peter.O and extracts the same data... and so on. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. perl.

0人赞添加讨论(0) 举报

放荡不羁爱自由

3楼-- · 2019-07-20 18:44

If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:

mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'

The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.

0人赞添加讨论(0) 举报

你好瞎i

4楼-- · 2019-07-20 18:52

I'm not sure if you mean this \n delimited list (or one in brackets and comma delimited)

html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
  w3m -dump -T 'text/html'|
    sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'

w3m is a: "WWW browsable pager with excellent tables/frames support"

output (first 10 lines)

London Bridge   
Kilburn         
Ealing          
Wandsworth      
Pimlico         
Kensington      
Leyton          
Leytonstone     
Plaistow        
Poplar

0人赞添加讨论(0) 举报

How can I extract td from html in bash?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间