Question:
I want to remove all URLs inside a string (replace them with "").
I searched around but couldn't really find what I want.
Example:
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/
I want the result to be:
text1
text2
text3
text4
text5
text6
Answer 1:
Python script:
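import re
text = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6\nhttp://url.com/bla3/blah3/"
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
print(text)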
Output:
text1
text2
text3
text4
text5
text6
Answer 2:
The shortest way:
re.sub(r'http\S+', '', stringliteral)
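As a minimal sketch of how this one-liner behaves on the question's sample, assuming the input is held in a variable named `sample` (a name chosen here for illustration): the pattern removes each URL but leaves the now-empty line behind, so a second pass that drops blank lines is needed to get exactly the requested output.

```python
import re

# Sample input from the question; `sample` is an illustrative name.
sample = (
    "text1\ntext2\nhttp://url.com/bla1/blah1/\n"
    "text3\ntext4\nhttp://url.com/bla2/blah2/\n"
    "text5\ntext6\nhttp://url.com/bla3/blah3/"
)

# Remove every run of non-whitespace characters that starts with "http".
without_urls = re.sub(r'http\S+', '', sample)

# The URLs are gone, but their lines remain as empty strings; drop them.
cleaned = "\n".join(line for line in without_urls.splitlines() if line.strip())
print(cleaned)
```

Note that `http\S+` also removes ordinary words that happen to begin with "http", which may or may not matter for your data.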
Answer 3:
This worked for me:
import re
thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
URLless_string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', thestring)
print(URLless_string)
Result:
text1
text2
text3
text4
text5
text6
Answer 4:
It should be simple using regular expressions. You can use them via the re module in Python.
For which regular expression can best detect a valid url, check these SO questions:
What is the best regular expression to check if a string is a valid URL?
What's the cleanest way to extract URLs from a string using Python?
How to match URIs in text?
There are quite a few highly voted answers in these, so that should give you some direction.
Answer 5:
This solution caters for http, https, and the other special characters that normally appear in URLs:
import re
def remove_urls(vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return vTEXT
print( remove_urls("this is a test https://sdfs.sdfsdf.com/sdfsdf/sdfsdf/sd/sdfsdfs?bob=%20tree&jef=man lets see this too https://sdfsdf.fdf.com/sdf/f end"))
Answer 6:
Removal of HTTP links/URLs mixed up in any text:
import re
re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)
Answer 7:
You could also look at it from the other way around...
from urllib.parse import urlparse  # Python 3; in Python 2 this was "from urlparse import urlparse"
[el for el in ['text1', 'FTP://somewhere.com', 'text2', 'http://blah.com:8080/foo/bar#header'] if not urlparse(el).scheme]
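A sketch of the same idea applied to a multi-line string, assuming Python 3 (where `urlparse` lives in `urllib.parse`) and a hypothetical helper named `drop_url_lines`: split the text into lines and keep only those for which `urlparse` reports no scheme.

```python
from urllib.parse import urlparse

def drop_url_lines(text):
    """Keep only the lines for which urlparse finds no scheme (hypothetical helper)."""
    kept = [line for line in text.splitlines() if not urlparse(line.strip()).scheme]
    return "\n".join(kept)

# Illustrative input: the question's layout plus the FTP example from this answer.
sample = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\nFTP://somewhere.com\ntext4"
print(drop_url_lines(sample))
```

This only works when each URL stands alone as a whole line (or token); a URL embedded in the middle of a sentence would not be detected this way.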
Answer 8:
I wasn't able to find any that handled my particular situation, which was removing URLs in the middle of tweets where the URLs themselves contain whitespace, so I made my own:
(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*
Here's an explanation:
(https?:\/\/)
matches http:// or https://
(\s)*
optional whitespaces
(www\.)?
optionally matches www.
(\s)*
optionally matches whitespaces
((\w|\s)+\.)*
matches zero or more groups of one or more word characters (or whitespace) followed by a period
([\w\-\s]+\/)*
matches zero or more groups of one or more word characters (or dashes or spaces) followed by '/'
([\w\-]+)
matches the final path segment at the end of the URL, which may be followed by an optional ending
((\?)?[\w\s]*=\s*[\w\%&]*)*
matches ending query params (even with white spaces,etc)
Test this out here: https://regex101.com/r/NmVGOo/8
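To use the pattern from Python, it can be passed to the `re` module as a raw string. A minimal sketch, assuming a made-up tweet with stray spaces inside its URL; the names `BROKEN_URL` and `tweet` are illustrative and not from the answer:

```python
import re

# The pattern from the answer, unchanged, split over two raw strings for readability.
BROKEN_URL = re.compile(
    r'(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)'
    r'((\?)?[\w\s]*=\s*[\w\%&]*)*'
)

# Hypothetical tweet whose URL was broken up by stray spaces.
tweet = "check https://www. example.com/some path/page?x=1 now"
cleaned = BROKEN_URL.sub('', tweet)
print(cleaned)
```

Because the pattern tolerates whitespace, it can also swallow ordinary words after a URL when they look like path or query fragments (for example, text that is later followed by a `/`), so it is worth testing against your own data.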
Answer 9:
The following regular expression in Python works well for detecting URL(s) in the text:
source_text = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6 '''
import re
url_reg = r'[a-z]*[:.]+\S+'
result = re.sub(url_reg, '', source_text)
print(result)
Output:
text1
text2
text3
text4
text5
text6
Answer 10:
First, find a pattern in your text file that identifies the URLs. Once you have found it, you can use regular expressions.
You could do the same job without them, but regular expressions make it much easier and are well worth learning.
Answer 11:
I know this has already been answered and it's stupid late, but I think this should be here. This is a regex that matches any kind of URL.
[^ ]+\.[^ ]+
It can be used like
re.sub(r'[^ ]+\.[^ ]+', '', sentence)
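One caveat worth illustrating: `[^ ]` excludes only the space character, so it also matches newlines. The sketch below uses a newline-separated sample in the style of the question (held in an illustrative variable `sample`) and compares the pattern as given with a variant that uses `\S` instead. The `\S` variant is a substitution made for this sketch, not part of the answer.

```python
import re

# Newline-separated input with no spaces, as in the question.
sample = (
    "text1\ntext2\nhttp://url.com/bla1/blah1/\n"
    "text3\ntext4\nhttp://url.com/bla2/blah2/\n"
    "text5\ntext6"
)

# Pattern from the answer: "[^ ]" also matches "\n", so one match can span lines.
as_given = re.sub(r'[^ ]+\.[^ ]+', '', sample)

# Variant (an assumption of this sketch): "\S" stops at any whitespace.
per_token = re.sub(r'\S+\.\S+', '', sample)

print(repr(as_given))
print(repr(per_token))
```

With either pattern, any token that contains a dot (for example `3.14` or `file.txt`) is removed as well, so this approach suits text where dots appear only inside URLs.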
Answer 12:
What you really want to do is remove any string that starts with either http:// or https:// plus any combination of non-whitespace characters. Here is how I would solve it. My solution is very similar to that of @tolgayilmaz.
# Define the text from which you want to remove the URL (replace it with "").
text = '''The link to this post is https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python'''
import re
re.sub(r'http://\S+|https://\S+', '', text)
And the result of running the above code is:
>>> 'The link to this post is '