Python: extract sentences containing a word

Posted 2019-08-31 07:21

I am trying to extract all the sentences in a text that contain a specified word.

import re

txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help?

Answer 1:

In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[3]: ['I like to eat apple.', " Let's go buy some apples."]


Answer 2:

No regex needed:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]


Answer 3:

In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

Note, however, that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

The speed difference is smaller, but still significant, for larger strings:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
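
Outside IPython, the same comparison can be approximated with the standard timeit module. This is only a sketch: the helper names with_regex and with_split are made up for illustration, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

# Same sample text as above, repeated to get a larger string
txt = ".I like to eat apple. Me too. Let's go buy some apples." * 10000

def with_regex():
    # Regex approach: runs of non-dot characters around "apple"
    return re.findall(r"[^.]*apple[^.]*", txt)

def with_split():
    # Split approach: split on dots, keep fragments containing "apple"
    return [s + "." for s in txt.split(".") if "apple" in s]

# Run each approach 10 times and report the total elapsed time
regex_time = timeit.timeit(with_regex, number=10)
split_time = timeit.timeit(with_split, number=10)
print(f"regex: {regex_time:.4f}s, split: {split_time:.4f}s")
```

Both functions find the same number of sentences; only the returned strings differ (the split version re-appends the trailing dot).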


Answer 4:

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]
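
Note that splitting on '. ' leaves the final period on the last sentence only. One possible variation, sketched here under the assumption that sentences end in ., ! or ? followed by whitespace, is to split with re.split and a lookbehind so every sentence keeps its punctuation. The function names split_sentences and sentences_containing are hypothetical, and abbreviations such as "e.g." would still be split incorrectly.

```python
import re

def split_sentences(text):
    # Split at whitespace that follows ., ! or ? (the lookbehind keeps the punctuation)
    return re.split(r"(?<=[.!?])\s+", text.strip())

def sentences_containing(text, needle):
    # Plain substring test, so "apple" also matches "apples"
    return [s for s in split_sentences(text) if needle in s]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
# ['I like to eat apple.', "Let's go buy some apples."]
```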


Answer 5:

r"\."+".+"+"apple"+".+"+"\."

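That line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

In any case, the problem with your regular expression is its greediness. By default, x+ matches x as many times as it possibly can. So your .+ matches as many characters (any characters) as possible, including dots and further occurrences of apple.

What you want instead is a non-greedy expression; you can usually get one by adding a ? at the end: .+?

That gives you the following result: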

['.I like to eat apple. Me too.']

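As you can see, you no longer get both apple sentences in one match, but you still get the Me too. sentence. That is because the .+? after apple must consume at least one character, and that character is the . that ends the sentence, so the match cannot stop before it has captured the following sentence as well.

A working regular expression would be: r'\.[^.]*?apple[^.]*?\.'

Here, you don't match any character, only characters that are not dots themselves. We also allow matching no characters at all (because after the apple in the first sentence there are no non-dot characters). Using that expression gives: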

['.I like to eat apple.', ". Let's go buy some apples."]
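
The three behaviours discussed in this answer can be reproduced side by side. The snippet below is a small sketch using the question's text with the leading dot already prepended; the variable names greedy, lazy and bounded are just illustrative labels.

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: .+ runs to the end of the string and backtracks only as far as needed
greedy = re.findall(r"\..+apple.+\.", txt)

# Lazy: .+? stops early, but must still consume at least one character after "apple"
lazy = re.findall(r"\..+?apple.+?\.", txt)

# Negated class: [^.] can never cross a sentence-ending dot
bounded = re.findall(r"\.[^.]*?apple[^.]*?\.", txt)

print(greedy)   # the whole string as a single match
print(lazy)     # ['.I like to eat apple. Me too.']
print(bounded)  # ['.I like to eat apple.', ". Let's go buy some apples."]
```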


Answer 6:

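Clearly, the sample in the question is about extract sentence containing substring rather than extract sentence containing word. Here is how to solve the extract sentence containing word problem in Python:

A word can be at the beginning, in the middle, or at the end of a sentence. Not limiting myself to the example in the question, here is a general function for searching for a word in a sentence: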

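import re

def searchWordinSentence(word, sentence):
    # Match the word in the middle, at the start, or at the end of the sentence
    w = re.escape(word)
    pattern = re.compile(' ' + w + ' |^' + w + ' | ' + w + '$')
    return re.search(pattern, sentence) is not None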

For the example in the question, we can solve it like this:

txt = "I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is:

['I like to eat apple']
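
As an alternative to spelling out the start, middle and end cases with spaces, the regex word-boundary anchor \b can express the same whole-word idea. The following is a minimal sketch, not the answer author's method; sentences_with_word is a hypothetical name, and it assumes the word begins and ends with a letter, digit or underscore, since \b is defined relative to those characters.

```python
import re

def sentences_with_word(text, word):
    # \b matches at word boundaries, so "apple" does not match inside "apples"
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    # Split on periods, trim whitespace, and drop empty fragments
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if pattern.search(s)]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_with_word(txt, "apple"))   # ['I like to eat apple']
print(sentences_with_word(txt, "apples"))  # ["Let's go buy some apples"]
```

Unlike the space-based pattern, this also matches a word that is followed by punctuation such as a comma.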


Source: Python extract sentence containing word