xpath html combine columns

Posted 2019-09-23 09:44

I'm trying to extract the IP and port columns from the table on socks-proxy.net.

I'm using these commands on Linux to get the IP and the port. How can I combine them?

wget -q -O - "https://socks-proxy.net" | xmllint --html --xpath "//table[@id=\"proxylisttable\"]//tr//td[1]//text()" - 2>/dev/null

Output:

103.254.12.3393.12.55.94192:12:44:11 

The IPs are all run together, which is no good.

The command above gets all the IPs from the table on the website.

wget -q -O - "https://socks-proxy.net" | xmllint --html --xpath "//table[@id=\"proxylisttable\"]//tr//td[2]//text()" - 2>/dev/null

This one gets all the ports.

Output:

108025951082

The ports are run together too, which is no good.

Question: how can I combine them to get output like this:

103.254.12.33:1080
93.12.55.94:2595
192:12:44:11:1082

and so on...

2 answers
Rolldiameter
#2 · 2019-09-23 10:22

A bit late, but seeing that you're using 4(!) different tools to accomplish something so simple, I just had to jump in and show you another amazing XML parser, Xidel, which can do it all by itself:

xidel -s https://pastebin.com/raw/F14VRNBc -e '//table[@id="proxylisttable"]/tbody/tr/concat("my",td[5],"://",td[1],":",td[2])'
mySocks4://103.254.126.130:1080
mySocks5://192.228.194.87:25950
mySocks5://173.162.95.122:62168
mySocks4://183.166.22.194:1080
mySocks5://70.44.216.252:40656
[...]
骚年 ilove
#3 · 2019-09-23 10:28

Complex solution:

wget -q -O - "https://socks-proxy.net" \
| xmllint --html --xpath "//table[@id='proxylisttable']//tr//td[position() < 3]" - 2>/dev/null \
| tidy -cq -omit -f /dev/null | xmllint --html --xpath "//td/text()" - | paste -d':' - -

The output:

103.254.126.130:1080
192.228.194.87:25950
173.162.95.122:62168
183.166.22.194:1080
70.44.216.252:40656
66.83.161.74:34036
37.191.146.151:10200
101.100.171.69:52769
120.92.164.154:62080
216.37.80.226:61226
75.180.14.170:17694
74.221.106.14:10200
208.180.142.167:14846
...
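
The last stage of that pipeline is what actually pairs each IP with its port. As a minimal sketch of just that step, assuming the earlier stages emit one value per line in IP, port, IP, port order (the helper name pair_lines is made up for illustration):

```shell
# pair_lines (hypothetical helper): join every two consecutive
# input lines into one line, separated by a colon.
pair_lines() {
  # Each "-" reads the next line from stdin, so two of them
  # consume lines alternately and place them side by side.
  paste -d':' - -
}

# Sample values taken from the output above.
printf '%s\n' 103.254.126.130 1080 192.228.194.87 25950 | pair_lines
```

This relies on the input alternating strictly between IP and port; a missing or empty cell would shift every pair after it.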

Extended approach to cover additional fields:

wget -q -O - "https://socks-proxy.net" \
| xmllint --html --xpath "//table[@id='proxylisttable']//tr//td[position() < 6]" - 2>/dev/null \
| tidy -cq -omit -f /dev/null | xmllint --html --xpath "//td/text()" - \
| awk -F'\n' -v RS= '{ for(i=1;i<=NF;i+=5) printf "my%s://%s:%s\n",$(i+4),$i,$(i+1) }'

The output:

mySocks4://103.254.126.130:1080
mySocks5://192.228.194.87:25950
mySocks5://173.162.95.122:62168
mySocks4://183.166.22.194:1080
mySocks5://70.44.216.252:40656
mySocks5://66.83.161.74:34036
mySocks5://37.191.146.151:10200
mySocks5://101.100.171.69:52769
mySocks5://120.92.164.154:62080
....
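
The awk stage walks the values five at a time and prints the fifth, first and second of each group. A minimal sketch of that grouping on its own, assuming one cell value per line and exactly five cells per row (the helper name format_rows and the "XX" / "Example Country" cells are placeholders, not data from the site):

```shell
# format_rows (hypothetical helper): read cell values one per line,
# five per table row, and print "my<version>://<ip>:<port>".
format_rows() {
  # RS= puts awk in paragraph mode and FS is a newline, so every
  # input line becomes one field of a single record.
  awk -F'\n' -v RS= '{
    # Stop at the last complete group of five fields.
    for (i = 1; i + 4 <= NF; i += 5)
      printf "my%s://%s:%s\n", $(i+4), $i, $(i+1)
  }'
}

# Two sample rows; the third and fourth cells are placeholders.
printf '%s\n' \
  103.254.126.130 1080 XX 'Example Country' Socks4 \
  192.228.194.87 25950 XX 'Example Country' Socks5 | format_rows
```

Unlike the one-liner, the loop here skips a trailing incomplete group instead of printing a half-filled line. An empty cell in the table would still throw off the grouping, since it produces no line at all.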

P.S. Tested on the input file you provided at https://pastebin.com/F14VRNBc.
