Extract string with Python re.match

2019-01-09 11:33发布

问题:

import re
str="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

str2=re.match("[a-zA-Z]*//([a-zA-Z]*)",str)
print str2.group()

current result=> error
expected => wwwqqqzzz

I want to extract the string wwwqqqzzz. How I do that?

Maybe there are a lot of dots, such as:

"whatever..s#$@.d.:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid"

In this case, I basically want the stuff bounded by // and /. How do I achieve that?

One additional question:

import re
str="xxx.yyy.xxx:80"

m = re.search(r"([^:]*)", str)
str2=m.group(0)
print str2
str2=m.group(1)
print str2

Seems that m.group(0) and m.group(1) are the same.

回答1:

match tries to match the entire string. Use search instead. The following pattern would then match your requirements:

m = re.search(r"//([^/]*)", str)
print m.group(1)

Basically, we are looking for /, then consume as many non-slash characters as possible. And those non-slash characters will be captured in group number 1.

In fact, there is a slightly more advanced technique that does the same, but does not require capturing (which is generally time-consuming). It uses a so-called lookbehind:

m = re.search(r"(?<=//)[^/]*", str)
print m.group()

Lookarounds are not included in the actual match, hence the desired result.

This (or any other reasonable regex solution) will not remove the .s immediately. But this can easily be done in a second step:

m = re.search(r"(?<=//)[^/]*", str)
host = m.group()
cleanedHost = host.replace(".", "")

That does not even require regular expressions.

Of course, if you want to remove everything except for letters and digits (e.g. to turn www.regular-expressions.info into wwwregularexpressionsinfo) then you are better off using the regex version of replace:

cleanedHost = re.sub(r"[^a-zA-Z0-9]+", "", host)


回答2:

print re.sub(r"[.]","",re.search(r"(?<=//).*?(?=/)",str).group(0))

See this demo.



回答3:

output=re.findall("(?<=//)\w+.*(?=/)",str)

final=re.sub(r"[^a-zA-Z0-9]+", "", output [0])

print final


回答4:

import re
str="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"
re.findall('//([a-z.]*)', str)