Extract all URLs in a string with Python 3

I am trying to find a clean way to extract all urls in a text string.

After an extensive search, I have found many posts that suggest using regular expressions for the task and give regexes that are supposed to do it. Each regex has some advantages and some shortcomings, and editing them to change their behaviour is not straightforward. At this point I am happy with any regex that detects the URLs in this text correctly:

Input:

Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org

Tags: python regex python-3.x url

5 answers

冷血范

Reply #2 · 2019-09-20 02:34

Using an existing library is probably the best solution.

But it was too much for my tiny script, and, inspired by @piotr-wasilewicz's answer, I came up with:

from string import ascii_letters
# `line` holds the text to search; join the non-letter characters into one string for strip()
links = [x for x in line.split() if x.strip(''.join(set(x) - set(ascii_letters))).startswith(('http', 'https', 'www'))]

For each word in the line,
strip (from the beginning and the end) the non-letter characters found in the word itself,
and keep the words that then start with one of https, http, www.

A bit too dense for my taste and I have no clue how fast it is, but it should detect most "sane" urls in a string.


老娘就宠你

Reply #3 · 2019-09-20 02:44

If you want a regex, you can use this:

import re

string = "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org"
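The regex itself does not appear after the sample string above. As a minimal sketch of the regex approach, and not necessarily the pattern this answer used, the following assumes that a URL starts with http:// or https://, runs to the next whitespace, and does not end in sentence punctuation. The names URL_RE and find_http_urls are made up for this illustration.

```python
import re

# Assumed pattern: "http://" or "https://", then non-whitespace characters,
# ending on a character that is not common sentence punctuation.
URL_RE = re.compile(r"https?://\S*[^\s.,;:!?]")


def find_http_urls(text):
    """Return every http(s) URL-like substring found in text."""
    return URL_RE.findall(text)


sample = ("Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, "
          "nusquam tincidunt ex per. Ei eum maiestatis quaerendum https://www.lorem.org")
print(find_http_urls(sample))
# ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org']
```

The final character class is what drops the trailing comma after q=suas. A URL that legitimately ends in one of those punctuation characters would be cut short by one character, so treat the set as a starting point.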

劳资没心，怎么记你

Reply #4 · 2019-09-20 02:48
                                                                          
Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract. 

Apparently it tries to find any occurrence of a TLD in the given text. If a TLD is found, it expands the boundaries from that position in both directions, searching for a "stop character" (usually whitespace, a comma, or a single or double quote).

You have a couple of examples here.

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
print(urls) # prints: ['youfellasleepwhilewritingyourtitle.com']


It seems that this module also has an update() method which lets you update the TLD list cache file.

However, if that doesn't fit your specific requirements, you can manually do some checks after you've processed the URLs using the above module (or any other way of parsing the URLs). For example, say you get a list of the URLs:

result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk'] 


You can then build other lists that hold the allowed protocols, TLDs, domains, etc.:

allowed_protocols = ['protocol_1', 'protocol_2']
allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
allowed_domains = ['domain_1']

# `result` is the list of extracted URLs from above
for each_url in result:
    # here, check each url against your rules
    pass
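One way the check inside that loop could look is sketched below, using urllib.parse from the standard library. The function name filter_urls and the sample allow-lists are assumptions made for illustration. The sketch also treats the last dot-separated label of the host as the TLD, so a suffix such as co.uk is seen only as uk.

```python
from urllib.parse import urlparse


def filter_urls(urls, allowed_schemes, allowed_tlds):
    """Keep the URLs whose scheme and last host label are both allowed."""
    kept = []
    for url in urls:
        parts = urlparse(url)
        host = parts.hostname or ""
        # Simplification: the TLD is taken to be the last dot-separated label.
        tld = host.rsplit(".", 1)[-1]
        if parts.scheme in allowed_schemes and tld in allowed_tlds:
            kept.append(url)
    return kept


result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']
print(filter_urls(result, {"https"}, {"com", "org"}))
# ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org']
```

A check against allowed domains could be added the same way by comparing host with the entries of a third collection.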

    
                                                                    
                                                        
            
              
Deceive 欺骗

Reply #5 · 2019-09-20 02:51
                                                                          
import re
import string

text = """
Lorem ipsum dolor sit amet https://www.lore-m.com/ipsum.php?q=suas,
nusquam tincidunt ex per, ftp://link.com ius modus integre no, quando utroque placerat qui no.
Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex.
Elit ftp://link.work.in pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org
"""
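The extraction step for this sample text is not shown. Because the sample mixes https and ftp links, one possible continuation is a pattern built from a configurable list of schemes. This is a sketch under that assumption, not a reconstruction of the original code, and extract_urls and its schemes parameter are hypothetical names.

```python
import re


def extract_urls(text, schemes=("http", "https", "ftp")):
    """Find URL-like substrings that start with one of the given schemes."""
    scheme_group = "|".join(re.escape(s) for s in schemes)
    # scheme, "://", then non-whitespace, ending on a non-punctuation character
    pattern = re.compile(r"\b(?:" + scheme_group + r")://\S*[^\s.,;:!?]")
    return pattern.findall(text)


sample = ("Lorem https://www.lore-m.com/ipsum.php?q=suas, nusquam ftp://link.com ius. "
          "Elit ftp://link.work.in pertinacia https://www.lorem.org")
print(extract_urls(sample))
# ['https://www.lore-m.com/ipsum.php?q=suas', 'ftp://link.com', 'ftp://link.work.in', 'https://www.lorem.org']
print(extract_urls(sample, schemes=("ftp",)))
# ['ftp://link.com', 'ftp://link.work.in']
```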
                                                        
            

              
贪生不怕死

Reply #6 · 2019-09-20 02:52
                                                                          
output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
print(output)


Your example: http://ideone.com/wys57x

Afterwards, you can also trim the last character of each element in the list if it is not a letter.

EDIT:

output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
newOutput = []
for link in output:
    copy = link
    while not copy[-1].isalpha():
        copy = copy[:-1]
    newOutput.append(copy)
print(newOutput)


Your example: http://ideone.com/gHRQ8w
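The while loop above removes every trailing non-letter, so a link ending in a digit or a slash loses those characters too. A variant that only trims sentence punctuation could look like the sketch below. The name split_urls and the punctuation set ".,;:!?" are assumptions, not part of the original answer.

```python
def split_urls(text, prefixes=("http://", "https://", "ftp://")):
    """Split on whitespace and keep words that start with a known prefix."""
    urls = []
    for word in text.split():
        if word.startswith(prefixes):
            # Trim only trailing sentence punctuation; digits and slashes stay.
            urls.append(word.rstrip(".,;:!?"))
    return urls


print(split_urls("see https://example.com/page1, and ftp://link.com."))
# ['https://example.com/page1', 'ftp://link.com']
```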
    
                                                                    
                                                        
            
              