I'm trying to extract the IP and port of each proxy from the table on socks-proxy.net.
I'm using these commands on Linux to get the IPs and the ports separately. How can I combine them?
wget -q -O - "https://socks-proxy.net" | xmllint --html --xpath "//table[@id=\"proxylisttable\"]//tr//td[1]//text()" - 2>/dev/null
Output:
103.254.12.3393.12.55.94192:12:44:11
The IPs are all run together on one line, which is no good.
That command gets all the IPs from the table on the website.
wget -q -O - "https://socks-proxy.net" | xmllint --html --xpath "//table[@id=\"proxylisttable\"]//tr//td[2]//text()" - 2>/dev/null
That one gets all the ports.
Output:
108025951082
The ports are run together as well, which is no good.
Question: how can I combine the two so that the output looks like this:
103.254.12.33:1080
93.12.55.94:2595
192:12:44:11:1082
and so on...
A bit late, but seeing you're using 4(!) different tools to accomplish something so simple I just had to jump in to show you another amazing XML parser, called Xidel, which can do it all by itself:
xidel -s https://pastebin.com/raw/F14VRNBc -e '//table[@id="proxylisttable"]/tbody/tr/concat("my",td[5],"://",td[1],":",td[2])'
mySocks4://103.254.126.130:1080
mySocks5://192.228.194.87:25950
mySocks5://173.162.95.122:62168
mySocks4://183.166.22.194:1080
mySocks5://70.44.216.252:40656
[...]
Complex solution:
wget -q -O - "https://socks-proxy.net" \
| xmllint --html --xpath "//table[@id='proxylisttable']//tr//td[position() < 3]" - 2>/dev/null \
| tidy -cq -omit -f /dev/null | xmllint --html --xpath "//td/text()" - | paste -d':' - -
The output:
103.254.126.130:1080
192.228.194.87:25950
173.162.95.122:62168
183.166.22.194:1080
70.44.216.252:40656
66.83.161.74:34036
37.191.146.151:10200
101.100.171.69:52769
120.92.164.154:62080
216.37.80.226:61226
75.180.14.170:17694
74.221.106.14:10200
208.180.142.167:14846
...
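The last `paste` stage is what joins each IP with its port. Here is a minimal sketch of that step on its own, assuming the earlier stages deliver one value per line, alternating IP and port. The helper name `pair_lines` is hypothetical, and the sample values are taken from the output above.

```shell
# Hypothetical helper: join every two consecutive input lines with ':'.
pair_lines() {
    # Each '-' makes paste read one line from stdin,
    # so two of them consume the input in pairs.
    paste -d':' - -
}

# Sample input: IP and port on alternating lines.
printf '%s\n' 103.254.126.130 1080 192.228.194.87 25950 | pair_lines
```

If the input has an odd number of lines, the last IP is printed with a trailing `:` and no port, so the pairing only holds when every row supplies both cells.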
Extended approach to cover additional fields:
wget -q -O - "https://socks-proxy.net" \
| xmllint --html --xpath "//table[@id='proxylisttable']//tr//td[position() < 6]" - 2>/dev/null \
| tidy -cq -omit -f /dev/null | xmllint --html --xpath "//td/text()" - \
| awk -F'\n' -v RS= '{ for(i=1;i<=NF;i+=5) printf "my%s://%s:%s\n",$(i+4),$i,$(i+1) }'
The output:
mySocks4://103.254.126.130:1080
mySocks5://192.228.194.87:25950
mySocks5://173.162.95.122:62168
mySocks4://183.166.22.194:1080
mySocks5://70.44.216.252:40656
mySocks5://66.83.161.74:34036
mySocks5://37.191.146.151:10200
mySocks5://101.100.171.69:52769
mySocks5://120.92.164.154:62080
....
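The `awk` stage can be tried in isolation. The sketch below assumes the earlier stages emit five lines per table row (IP, port, and three more cells, with the Socks version fifth) and no blank lines in between. The helper name `format_proxies` is hypothetical, and `CC` and `Country` are placeholders for the two cells that are not used, not real table data.

```shell
# Hypothetical helper: turn groups of five lines into my<version>://<ip>:<port>.
format_proxies() {
    # RS= switches awk to paragraph mode; with -F'\n' each line is one field,
    # so the whole blank-line-free input is read as a single record.
    # The "i + 4 <= NF" guard skips a trailing incomplete group.
    awk -F'\n' -v RS= '{
        for (i = 1; i + 4 <= NF; i += 5)
            printf "my%s://%s:%s\n", $(i+4), $i, $(i+1)
    }'
}

# Two sample rows; CC and Country are placeholder values.
printf '%s\n' 103.254.126.130 1080 CC Country Socks4 \
              192.228.194.87 25950 CC Country Socks5 | format_proxies
```

Reading the input as one record keeps the field arithmetic simple, but it does mean a single missing cell shifts every later row, so the five-cells-per-row assumption matters.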
P.S. Tested on the input file you provided at https://pastebin.com/F14VRNBc.