Question:

I have a problem using regular expressions in Python 3, so I would be grateful if someone could help me. I have a text file like the one below:

Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end

What I would like is a list of the text blocks between the headers, with each block including its own header. I am using this regular expression:

 re.findall(r'(?=(Header.*?Header|Header.*?end))',data, re.DOTALL)

The result is:

['Header A\ntext text\n text text\n Header', 'Header B\ntext text\n text text\n Header', 'Header C\n text text here is the end']

The problem is that every item in the list ends with the next header. As you can see, each section ends where the next header begins, but the last section has no such delimiter after it.

Is there a way to get a list (not a tuple) of every header together with its own text, as substrings, using regular expressions?

Answer 1:

Header [^\n]*[\s\S]*?(?=Header|$)

Try this. See the demo:

https://regex101.com/r/iS6jF6/21

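import re

# Match each header line plus the text after it, stopping before the next "Header" or at the end of the string.
p = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')
test_str = "Header A\ntext text\ntext text\nHeader B\ntext text\ntext text\nHeader C\ntext text\nhere is the end"

print(p.findall(test_str))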
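
With this pattern, every item except the last ends with a newline, because the lazy match runs right up to the next Header. A minimal sketch, assuming that trailing newline is unwanted; the names `pattern`, `sample`, and `sections` are only illustrative:

```python
import re

pattern = re.compile(r'Header [^\n]*[\s\S]*?(?=Header|$)')
sample = ("Header A\ntext text\ntext text\n"
          "Header B\ntext text\ntext text\n"
          "Header C\ntext text\nhere is the end")

# The pattern has no capturing group, so findall returns plain strings.
raw = pattern.findall(sample)

# Strip the newline that precedes the next header from each item.
sections = [item.rstrip('\n') for item in raw]
print(sections)
```

Stripping after the match keeps the pattern itself unchanged; moving the newline into the lookahead is another option.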

Answer 2:

How about:

re.findall(r'(?=(Header.*?)(?=Header|end))',data, re.DOTALL)
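
One thing to watch with this pattern: because `end` sits inside the lookahead, the last item stops just before the word `end` instead of including it. The sketch below compares it with a variant that stops at the next header or at the end of the string (`\Z`). The variant is an assumption about what is wanted, not the answerer's code, and the sample `data` string is reconstructed from the question.

```python
import re

data = ("Header A\ntext text\ntext text\n"
        "Header B\ntext text\ntext text\n"
        "Header C\ntext text\nhere is the end")

# Pattern from the answer: the inner lookahead excludes "end" from the last item.
original = re.findall(r'(?=(Header.*?)(?=Header|end))', data, re.DOTALL)

# Variant (assumption): consume up to the next "Header" or the end of the string.
variant = re.findall(r'Header.*?(?=Header|\Z)', data, re.DOTALL)

print(original[-1])  # last item without the trailing "end"
print(variant[-1])   # last item including "end"
```

The variant also drops the outer lookahead, which is not needed here since the sections do not overlap.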

Answer 3:

You actually need to use a positive lookahead assertion.

>>> import re
>>> s = '''Header A
text text
text text
Header B
text text
text text
Header C
text text
here is the end'''
>>> re.findall(r'Header.*?(?=Header)|Header.*?end',s, re.DOTALL)
['Header A\ntext text\ntext text\n', 'Header B\ntext text\ntext text\n', 'Header C\ntext text\nhere is the end']

Include \n inside the positive lookahead so that each item does not end with a trailing \n character.

>>> re.findall(r'Header.*?(?=\nHeader)|Header.*?end',s, re.DOTALL)
['Header A\ntext text\ntext text', 'Header B\ntext text\ntext text', 'Header C\ntext text\nhere is the end']

Alternatively, split your input on the newline that immediately precedes the string Header.

>>> re.split(r'\n(?=Header\b)', s)
['Header A\ntext text\ntext text', 'Header B\ntext text\ntext text', 'Header C\ntext text\nhere is the end']
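
The patterns above treat any occurrence of Header as a boundary, including one in the middle of a text line. A minimal sketch of a line-anchored variant, assuming headers always start a line; the helper name `split_sections` and its `header` parameter are hypothetical, not part of the answers:

```python
import re

def split_sections(text, header="Header"):
    """Return sections that each begin with a line starting with `header`."""
    h = re.escape(header)
    # ^ matches at line starts (re.MULTILINE); . also matches newlines (re.DOTALL).
    # Each section runs until the next line-initial header or the end of the string.
    pattern = re.compile(r'^' + h + r'\b.*?(?=^' + h + r'\b|\Z)',
                         re.DOTALL | re.MULTILINE)
    # Drop the newline that separates a section from the next header.
    return [section.rstrip('\n') for section in pattern.findall(text)]

s = ("Header A\ntext text\ntext text\n"
     "Header B\ntext text\ntext text\n"
     "Header C\ntext text\nhere is the end")
print(split_sections(s))
```

Any text before the first header is ignored by this sketch, and a word such as Header appearing mid-line stays inside its section.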
