Parse IP addresses from txt

Posted 2019-03-01 06:06

Question:

I'm trying to download a txt file which you can find here. Downloading the file is not a problem:

    import urllib  # Python 2 API

    testfile = urllib.URLopener()
    testfile.retrieve(_proxy_list_download_, "proxies.txt")

But once it is downloaded, the file behaves strangely. When I open it in a text editor I can see the content and the IP addresses, but when I print the content to the console I get this:

212.3.183.210:8080; 0; 0; anonymous proxy; Italy; ; a;  in); an Jose); ree download proxy IP

And when I try to extract the IP addresses from it, the result contains no addresses.

    import re

    with open('proxies.txt') as f:
        content = f.read()
        ip = re.findall(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", content)

I've already tried another regex:

    r'([0-9]+)(?:\.[0-9]+){3}'

This regex returned only 3-digit numbers.

Do you have any idea how to parse those IPs?

EDIT: Here is the text copied and pasted from a text editor, although in the editor everything is on one line:

 # http://proxy-ip-list.com/ provides you this fresh txt proxy list to free download proxy IP
# Date: Sat, 27 Jun 2015 12:53:02 +0000

39.166.95.9:8123; 0; 0; high-anonymous; China; 
178.189.92.118:3129; 16.83; 405; high-anonymous; Austria; 
198.2.202.33:8090; 8.05; 884; anonymous; United States (CA, San Jose); 
171.96.152.89:8080; 0; 0; anonymous; Thailand; 
153.149.104.76:80; 0; 0; anonymous; Japan (Tokyo); 
106.187.52.191:80; 0; 0; anonymous proxy; Japan; 
194.187.214.204:80; 0.91; 6374; anonymous proxy; Finland; 
59.78.160.247:8080; 0; 0; anonymous; China (Shanghai); 
61.156.3.166:80; 1.12; 1449; anonymous proxy; China (Jinan); 
221.238.140.164:8080; 1.39; 257; anonymous; China (Tianjin); 
117.178.157.107:8123; 8.44; 847; high-anonymous; China; 
39.166.205.95:8123; 0; 0; high-anonymous; China; 
117.163.216.8:8123; 4.21; 1577; high-anonymous; China; 
189.31.143.250:3128; 0; 0; high-anonymous; Brazil; 
183.89.84.82:8080; 0; 0; anonymous proxy; Thailand; 
183.88.41.42:8080; 0; 0; anonymous; Thailand; 
212.3.183.210:8080; 0; 0; anonymous proxy; Italy; 

Answer 1:

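You need to remove the anchors ^ and $, since no line consists of a single IP address and nothing else (and without re.MULTILINE they match only at the very start and end of the whole string). Use word boundaries instead: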

    ip = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", content)
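
To see the word-boundary pattern at work, here is a minimal sketch run against a few lines copied from the list in the question. The names sample and IP_PATTERN are only illustrative, and the pattern does not check that each number is at most 255.

```python
import re

# A few lines copied from the proxy list in the question (illustrative sample).
sample = (
    "# Date: Sat, 27 Jun 2015 12:53:02 +0000\n"
    "39.166.95.9:8123; 0; 0; high-anonymous; China; \n"
    "178.189.92.118:3129; 16.83; 405; high-anonymous; Austria; \n"
    "198.2.202.33:8090; 8.05; 884; anonymous; United States (CA, San Jose); \n"
)

# No ^ or $ anchors: the address may appear anywhere in the text.
# \b keeps the match from starting or ending in the middle of a number.
IP_PATTERN = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

ips = IP_PATTERN.findall(sample)
print(ips)  # ['39.166.95.9', '178.189.92.118', '198.2.202.33']
```

Values such as 16.83 or the time 12:53:02 are skipped because they do not have four dot-separated numbers.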

Your second regex

    r'([0-9]+)(?:\.[0-9]+){3}'

returns only the first number of each address because that is the only part inside a capturing group. When a pattern contains capturing groups, re.findall returns what the groups captured instead of the whole matches; it returns the whole matches only when the pattern has no capturing groups. Turning the capturing group into a non-capturing one gives the desired output:

    r'\b[0-9]+(?:\.[0-9]+){3}\b'
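
The difference between the two patterns can be checked on a single line from the list. This is only a small sketch; the variable names line, captured and whole are made up for the example.

```python
import re

# One line copied from the proxy list in the question.
line = "212.3.183.210:8080; 0; 0; anonymous proxy; Italy; "

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"([0-9]+)(?:\.[0-9]+){3}", line)

# With no capturing group, findall returns the whole match.
whole = re.findall(r"\b[0-9]+(?:\.[0-9]+){3}\b", line)

print(captured)  # ['212']
print(whole)     # ['212.3.183.210']
```

If the group is needed for some other reason, re.finditer with match.group(0) is another way to get the full match.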