How to extract the substring between two markers?

Posted 2018-12-31 23:24

Question:

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before and after the part I am interested in: AAA comes right before 1234, and ZZZ right after it.

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How to do the same thing in Python?

Answer 1:

Using regular expressions (see the re module documentation for further reference):

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = ''  # apply your error handling

# found: 1234
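
If the markers are not fixed strings, the same idea can be wrapped in a small helper. The following is only a sketch: the name between and its default parameter are made up for illustration, and re.escape is used on the assumption that the markers should be matched literally, even when they contain regex metacharacters such as parentheses.

```python
import re

def between(text, start, end, default=None):
    """Return the text between the first `start` and the next `end`, or `default`."""
    # re.escape makes markers such as "(" or "." match literally.
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else default

print(between("gfgfdAAA1234ZZZuijjk", "AAA", "ZZZ"))          # 1234
print(between("US president (Barack Obama) met", "(", ")"))   # Barack Obama
print(between("no markers here", "AAA", "ZZZ", default="?"))  # ?
```

Like the pattern above, this requires at least one character between the markers and does not match across line breaks unless re.DOTALL is added.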


Answer 2:

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'

You can also use regexps with the re module if you want, but that's not necessary in your case.
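
One caveat with the find approach: str.find returns -1 when the marker is missing, so s.find('AAA') + 3 quietly becomes 2 and the slice returns unrelated text instead of failing. A guarded variant might look like the sketch below; the function name between_find and the choice to return None are assumptions, not part of the answer above.

```python
def between_find(s, start_marker, end_marker):
    """Slice out the text between two markers, or return None if either is missing."""
    i = s.find(start_marker)
    if i == -1:
        return None          # start marker not present
    i += len(start_marker)   # move past the start marker
    j = s.find(end_marker, i)
    if j == -1:
        return None          # end marker not present after the start marker
    return s[i:j]

print(between_find('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between_find('gfgfd1234ZZZuijjk', 'AAA', 'ZZZ'))     # None
```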



Answer 3:

regular expression

import re

re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above as-is will fail with an AttributeError if "AAA" and "ZZZ" are not both present in your_text.

string methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above returns an empty string if "AAA" does not exist in your_text. If "AAA" is present but "ZZZ" is not, it returns everything after "AAA".

PS Python Challenge?
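
str.partition always returns a 3-tuple whose middle element is the separator it found, or an empty string when the separator is absent. Checking that element makes it possible to tell "marker missing" apart from "nothing between the markers". The sketch below illustrates this; between_partition is a hypothetical name and returning None is an assumed convention.

```python
def between_partition(text, start, end):
    """Return the text between `start` and `end`, or None if either marker is missing."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker absent
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None  # end marker absent
    return middle

print(between_partition("gfgfdAAA1234ZZZuijjk", "AAA", "ZZZ"))  # 1234
print(between_partition("gfgfdAAA1234", "AAA", "ZZZ"))          # None

# The unchecked one-liner on the same input keeps everything after "AAA":
print("gfgfdAAA1234".partition("AAA")[2].partition("ZZZ")[0])   # 1234
```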



Answer 4:

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))


Answer 5:

You can use the re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)


Answer 6:

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with the re.sub function, using the same regex.

>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, capturing groups are written as \(..\), but in Python they are written as (..).
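
As with sed, re.sub leaves the input unchanged when the pattern does not match, so a failed extraction looks just like a successful one. A rough way to detect this is re.subn, which also returns the number of substitutions made. The second input string below is made up for illustration.

```python
import re

pattern = r'.*AAA(.*)ZZZ.*'

# Match: the whole string is replaced by the captured group.
print(re.sub(pattern, r'\1', 'gfgfdAAA1234ZZZuijjk'))  # 1234

# No match: the input comes back untouched.
print(re.sub(pattern, r'\1', 'no markers here'))       # no markers here

# re.subn returns (new_string, number_of_substitutions); a count of 0 means no match.
result, count = re.subn(pattern, r'\1', 'no markers here')
print(count)                                           # 0
```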



Answer 7:

You can find the first occurrence of a substring with this function (by character index). You can also get what comes after the substring.

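def FindSubString(strText, strSubString, Offset=None):
    # Offset=None: return everything after the substring
    # Offset=0:    return the start index of the substring
    # Offset=n:    return the n characters that follow the substring
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1  # Not found
        if Offset is None:
            return strText[Start + len(strSubString):]
        if Offset == 0:
            return Start
        AfterSubString = Start + len(strSubString)
        return strText[AfterSubString:AfterSubString + int(Offset)]
    except (AttributeError, TypeError, ValueError):
        # Non-string arguments or an Offset that is not a number
        return -1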

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" % (subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforeText2 = FindSubString(Text, subText2, 0)

print("\nYour answer:\n%s" % (Text[AfterText1:BeforeText2]))


Answer 8:

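You can do it with just one line of code:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

The result is a list. Note that this pattern matches runs of one to five digits anywhere in the string; it does not look at the AAA and ZZZ markers.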



Answer 9:

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

I.e., you need to escape the parentheses with a backslash (\). This is more a question about regular expressions than about Python, though.

Also, in some cases you may see an r prefix before the regex string. It marks a raw string literal; without the r prefix, you need to double the backslashes, as you would in C.
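
To make the point about the r prefix concrete, the sketch below compares a raw and a non-raw spelling of the same pattern and uses re.findall to collect every parenthesised group in a line. The sample line is made up for illustration.

```python
import re

# Made-up sample line with two parenthesised groups (not from the original post).
line = 'alpha (first) beta (second) gamma'

raw_pattern = r'\((.*?)\)'      # raw string: backslashes reach the regex engine as written
plain_pattern = '\\((.*?)\\)'   # non-raw string: every backslash has to be doubled

print(raw_pattern == plain_pattern)   # True
print(re.findall(raw_pattern, line))  # ['first', 'second']
```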



Answer 10:

In Python, extracting a substring from a string can be done using the findall function in the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
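
When the markers occur more than once, the greedy (.+) and the non-greedy (.+?) give different results with findall. The input string below is made up to show the difference.

```python
import re

# Made-up input containing two AAA...ZZZ pairs.
s = 'xAAA12ZZZyAAA34ZZZz'

greedy = re.findall('AAA(.+)ZZZ', s)   # .+ runs to the last ZZZ
lazy = re.findall('AAA(.+?)ZZZ', s)    # .+? stops at the nearest ZZZ

print(greedy)  # ['12ZZZyAAA34']
print(lazy)    # ['12', '34']
```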


Answer 11:

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'


Answer 12:

One-liners that return a fallback string if there is no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next((m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk")] if m), "not-found")

My other method is less optimal because it falls back to a second regex search when the first one fails; I still haven't found a shorter way:

import re
res = (re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()", "")).group(1)


Answer 13:

Simple is better than complex.

Also, you can extract the digits from any string if your target is a number (an integer).

>>> ''.join([n for n in "gfgfdAAA1234ZZZuijjk" if n.isdigit()])
'1234'

This way, you don't need to use the re module.
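
One thing to keep in mind with the digit filter: it collects every digit in the string, not only those between the markers. The comparison below uses a made-up input with extra digits outside AAA...ZZZ.

```python
import re

# Made-up input with digits outside the AAA...ZZZ markers.
s = 'v2gfgfdAAA1234ZZZ9'

all_digits = ''.join(ch for ch in s if ch.isdigit())

m = re.search('AAA(.+?)ZZZ', s)
between_markers = m.group(1) if m else None

print(all_digits)       # 212349
print(between_markers)  # 1234
```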