I want to extract a certain number of words surrou

Posted 2019-07-25 07:30

Question:

I am trying to extract a selected number of words surrounding a given word. Here is an example to make it clear:

string = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."

1) The selected word is development and I need to get the 6 words surrounding it: [to, the, full, of, the, human]


2) But if the selected word is at the beginning or in second position, I still need to get 6 words, e.g.:

The selected word is shall; I should get: [Education, be, directed, to, the, full]

I have to use the 're' module. What I have managed to find so far is:

import re

def search(text, n):
    '''Searches for text, and retrieves n words either side of the text, which are returned separately'''
    word = r"\W*([\w]+)"
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]

but it only handles the first case. Can someone help me out with this? I would be really grateful. Thank you in advance!

Answer 1:

This will extract all occurrences of the target word in your text, with context:

import re

text = ("Education shall be directed to the full development of the human personality "
        "and to the strengthening of respect for human rights and fundamental freedoms.")

def search(target, text, context=6):
    # It's easier to use re.findall to split the string, 
    # as we get rid of the punctuation
    words = re.findall(r'\w+', text)

    matches = (i for (i,w) in enumerate(words) if w.lower() == target)
    for index in matches:
        if index < context //2:
            yield words[0:context+1]
        elif index > len(words) - context//2 - 1:
            yield words[-(context+1):]
        else:
            yield words[index - context//2:index + context//2 + 1]

print(list(search('the', text)))
# [['be', 'directed', 'to', 'the', 'full', 'development', 'of'], 
#  ['full', 'development', 'of', 'the', 'human', 'personality', 'and'], 
#  ['personality', 'and', 'to', 'the', 'strengthening', 'of', 'respect']]

print(list(search('shall', text)))
# [['Education', 'shall', 'be', 'directed', 'to', 'the', 'full']]

print(list(search('freedoms', text)))
# [['respect', 'for', 'human', 'rights', 'and', 'fundamental', 'freedoms']]
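
The function above keeps the target word inside each window, while the question's examples leave it out. A minimal sketch of a variation that drops the target, assuming the same `\w+` tokenization; the name `surrounding` and the clamping of the window start are illustrative choices, not part of the original answer:

```python
import re

def surrounding(target, text, n=6):
    """Yield the n words around each occurrence of target, excluding target itself."""
    words = re.findall(r'\w+', text)
    half = n // 2
    for i, w in enumerate(words):
        if w.lower() != target.lower():
            continue
        # Clamp the start so a window of n + 1 words stays inside the list.
        start = max(0, min(i - half, len(words) - (n + 1)))
        window = words[start:start + n + 1]
        # Drop the target itself from the window.
        yield window[:i - start] + window[i - start + 1:]

text = ("Education shall be directed to the full development of the human personality "
        "and to the strengthening of respect for human rights and fundamental freedoms.")

print(list(surrounding('development', text)))
# [['to', 'the', 'full', 'of', 'the', 'human']]
print(list(surrounding('shall', text)))
# [['Education', 'be', 'directed', 'to', 'the', 'full']]
```

Clamping the start index replaces the three-way if/elif/else: near either edge the window simply slides inward until it fits.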


Answer 2:

This is tricky, with potential for off-by-one errors, but I think it meets your spec. I have left out punctuation removal; it is probably best to strip punctuation before passing the string in for analysis. I assumed case was not important.

test_str = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."

def get_surrounding_words(search_word, s, n_words):
    words = s.lower().split(' ')
    try:
        i = words.index(search_word)
    except ValueError:
        return []
    # Word is near start
    if i < n_words/2:
        words.pop(i)
        return words[:n_words]
    # Word is near end
    elif i >= len(words) - n_words/2:
        words.pop(i)
        return words[-n_words:]
    # Word is in middle
    else:
        words.pop(i)
        # Integer division: slice indices must be ints in Python 3
        return words[i - n_words // 2:i + n_words // 2]

def test(word):
    print('{}: {}'.format(word, get_surrounding_words(word, test_str, 6)))

test('notfound')
test('development')
test('shall')
test('education')
test('fundamental')
test('for')
test('freedoms')
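
As noted above, punctuation is not removed, so the last word is stored as 'freedoms.' and test('freedoms') finds nothing. One possible way to strip punctuation before splitting is sketched below; the helper name `strip_punctuation` is hypothetical, and it assumes only ASCII punctuation matters:

```python
import string

def strip_punctuation(s):
    # Delete every ASCII punctuation character, then lowercase and split on whitespace.
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table).lower().split()

words = strip_punctuation("human rights and fundamental freedoms.")
print(words)
# ['human', 'rights', 'and', 'fundamental', 'freedoms']
```

Note that this also deletes hyphens and apostrophes inside words, which may or may not be what you want.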


Answer 3:

import sys

args = sys.argv[1:]
if len(args) != 2:
   # sys.exit prints the message to stderr and exits with a non-zero status
   sys.exit("Use with <string> <query>")
text = args[0]
query = args[1]
words = text.split()
op = []
left = 3
right = 3
try:
    index = words.index(query)
    if index <= left:
        start = 0
    else:
        start = index - left

    if start + left + right + 1 > len(words):
        start = len(words) - left - right - 1
        if start < 0:
            start = 0

    while len(op) < left + right and start < len(words):
        if start != index:
            op.append(words[start])
        start += 1
except ValueError:
    pass
print(op)
  • How does this work?
    1. Find the word in the string.
    2. See if we can take left+right words around that index; if not, shift the start.
    3. Take left+right words and save them in op.
    4. Print op.
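
The steps above can also be wrapped in a function instead of a command-line script. This is only a sketch: the name `context_words` and the single clamping expression are assumptions made for illustration, not the answer's own code.

```python
def context_words(text, query, left=3, right=3):
    words = text.split()
    try:
        index = words.index(query)          # step 1: find the word
    except ValueError:
        return []
    total = left + right
    # step 2: choose a start so total + 1 words fit, shifting inward near the edges
    start = max(0, min(index - left, len(words) - total - 1))
    # step 3: collect the window, skipping the query word itself
    window = words[start:start + total + 1]
    return [w for j, w in enumerate(window, start) if j != index]

print(context_words("a b c d e f g h i j", "e"))
# ['b', 'c', 'd', 'f', 'g', 'h']
print(context_words("a b c d e f g h i j", "a"))
# ['b', 'c', 'd', 'e', 'f', 'g']
```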


Answer 4:

A simple approach to your problem. It first splits the sentence into words and then selects words from the left and right of the given word.

def custom_search(sentence, word, n):     
    given_string = sentence
    given_word = word
    total_required = n
    word_list = given_string.strip().split(" ")
    length_of_words = len(word_list)

    output_list = []
    given_word_position = word_list.index(given_word)
    word_from_left = 0
    word_from_right = 0

    # Integer division (//) keeps the counts usable in range() under Python 3
    if given_word_position + 1 > total_required // 2:
        word_from_left = total_required // 2
        if given_word_position + 1 + (total_required // 2) <= length_of_words:
            word_from_right = total_required // 2
        else:
            word_from_right = length_of_words - (given_word_position + 1)
            remaining_words = (total_required // 2) - word_from_right
            word_from_left += remaining_words

    else:
        word_from_right = total_required // 2
        word_from_left = given_word_position
        if word_from_left + word_from_right < total_required:
            remaining_words = (total_required // 2) - word_from_left
            word_from_right += remaining_words

    required_words = []
    for i in range(given_word_position - word_from_left, word_from_right + 
    given_word_position + 1):
        if i != given_word_position:
            required_words.append(word_list[i])
    return required_words


sentence = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."
custom_search(sentence, "shall", 6)

>>[Education, be, directed, to , the , full] 


custom_search(sentence, "development", 6)

>>['to', 'the', 'full', 'of', 'the', 'human'] 


Answer 5:

I don't think regular expressions are necessary here. Assuming the text is well-constructed, just split it into an array of words, and write a couple of if-else statements to make sure it retrieves the required number of surrounding words:

def search(text, word, n):
    # text is the string you are searching
    # word is the word you are looking for
    # n is the TOTAL number of words you want surrounding the word

    words    = text.split(" ")  # Create an array of words from the string
    position = words.index(word)   # Find the position of the desired word

    distance_from_end = len(words) - position - 1  # How many words are after the word in the text

    if position < n // 2 + n % 2:  # If there aren't enough words before...
        return words[:position], words[position + 1:n + 1]

    elif distance_from_end < n // 2 + n % 2:  # If there aren't enough words after...
        return words[position - n + distance_from_end:position], words[position + 1:]

    else:  # Otherwise, split n between both sides (the left side gets the extra word if n is odd)
        return words[position - n // 2 - n % 2:position], words[position + 1:position + 1 + n//2]

string = "Education shall be directed to the full development of the human personality and to the \
strengthening of respect for human rights and fundamental freedoms."

print(search(string, "shall", 6))
# >> (['Education'], ['be', 'directed', 'to', 'the', 'full'])

print(search(string, "human", 5))
# >> (['development', 'of', 'the'], ['personality', 'and'])

In your example you didn't have the target word included in the output, so I kept it out as well. If you'd like the target word included simply combine the two arrays the function returns (join them at position).
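
A small sketch of that combination, using the two lists shown for "shall"; the helper name `with_target` is made up for illustration:

```python
def with_target(before, word, after):
    # Put the target word back between the words before and after it.
    return before + [word] + after

before = ['Education']
after = ['be', 'directed', 'to', 'the', 'full']
combined = with_target(before, 'shall', after)
print(combined)
# ['Education', 'shall', 'be', 'directed', 'to', 'the', 'full']
```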

Hope this helped!