Use Python to search lines of file for list entries

Posted 2019-08-04 09:24

I have a text file with tens of thousands of lines of ASCII text. I have a list of a few hundred keywords that I want to search for, considering each line individually. Initially, I want to return (print to screen or a file) the line if there are any matches, but eventually I'd like to rank or order the returned lines based on how many matches they contain.

So, my list is something like this...

keywords = ['one', 'two', 'three']

My train of thought was something like:

myfile = open('file.txt')
for line in myfile:
    if keywords in line:
        print line

But taking this from pseudocode to working code is not happening.

I've also thought of using RegEx:

print re.findall(keywords, myfile.read())

But that leads me down a path of different errors and problems.

If anyone can offer some guidance, syntax or code snippets I would be grateful.

3 answers
smile是对你的礼貌
#2 · 2019-08-04 09:54

Counter from the collections module seems like a great fit for the problem. I would do something like this.

from collections import Counter

keywords = ['one', 'two', 'three']
lines = ['without any keywords', 'with one', 'with one and two']

matches = []
for line in lines: 
    # Takes all the words in the line and gets the number of times 
    # they appear as a dictionary-like Counter object.
    words = Counter(line.split())

    line_matches = 0
    for kw in keywords:
        # Get the number of times it popped up in the line
        occurrences = words.get(kw, 0)
        line_matches += occurrences

    matches.append((line, line_matches))

# Sort by the number of occurrences per line, descending.
print(sorted(matches, key=lambda x: x[1], reverse=True))

This outputs:

[('with one and two', 2), ('with one', 1), ('without any keywords', 0)]
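
The inner loop can also be folded into a single sum(). This is only a sketch of the same idea, using the keywords and lines from above; the helper name count_keyword_hits is made up for illustration.

```python
from collections import Counter

def count_keyword_hits(line, keywords):
    """Total number of keyword occurrences among the whitespace-split words."""
    words = Counter(line.split())
    # Indexing a Counter with a missing key gives 0 instead of raising KeyError.
    return sum(words[kw] for kw in keywords)

keywords = ['one', 'two', 'three']
lines = ['without any keywords', 'with one', 'with one and two']

ranked = sorted(((line, count_keyword_hits(line, keywords)) for line in lines),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
# [('with one and two', 2), ('with one', 1), ('without any keywords', 0)]
```

Note that line.split() only splits on whitespace, so a word followed by punctuation (for example 'one,') would not count as a match here.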
Juvenile、少年°
#3 · 2019-08-04 09:58

You don't specify it in your question, but in my view a keyword that appears several times in a line should count only once toward the score (this favors lines that contain more distinct keywords):

def getmatching(lines, keywords):
    result = []
    keywords = set(keywords)
    for line in lines:
        matches = len(keywords & set(line.split()))
        if matches:
            result.append((matches, line))
    return (line for matches, line in sorted(result, reverse=True))

Example

lines = ['no keywords here', 'one keyword here',
         'two keywords in this one line', 'three minus two equals one',
         'one counts only one time because it is only one keyword']

keywords = ['one', 'two', 'three']

for line in getmatching(lines, keywords):
    print(line)

Output

three minus two equals one
two keywords in this one line
one keyword here
one counts only one time because it is only one keyword
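
To see how this scoring differs from counting every occurrence, here is a small comparison on the last example line. The two function names are only illustrative and are not part of the code above.

```python
def distinct_matches(line, keywords):
    """Number of different keywords present in the line."""
    return len(set(keywords) & set(line.split()))

def total_matches(line, keywords):
    """Number of words in the line that are keywords, repeats included."""
    wanted = set(keywords)
    return sum(1 for word in line.split() if word in wanted)

keywords = ['one', 'two', 'three']
line = 'one counts only one time because it is only one keyword'

print(distinct_matches(line, keywords))  # 1
print(total_matches(line, keywords))     # 3
```

Which of the two is the better score depends on what "more matches" should mean for your data.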
叛逆
#4 · 2019-08-04 10:04

You can't test whether a list is in a string. What you can do is test whether one string is in another string.

lines = ['this is a line without any keywords', 
         'this is a line with one', 
         'this is a line with one and two',
         'this is a line with three']
keywords = ['one', 'two', 'three']

for line in lines:
    for word in keywords:
        if word in line:
            print(line)
            break

The break is necessary to break out of the "word" loop when the first word is matched. Otherwise it will print the line for each word it matches.
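
The same loop-with-break logic can be written with any(), which also stops at the first keyword it finds. A rough sketch follows; the third sample line is invented here to show that `word in line` is a substring test, so 'one' also matches inside 'money'.

```python
lines = ['this is a line without any keywords',
         'this is a line with one',
         'money and network']
keywords = ['one', 'two', 'three']

# Substring test: any() short-circuits like the loop with break.
substring_hits = [line for line in lines
                  if any(word in line for word in keywords)]

# Testing against the split words only matches whole words.
whole_word_hits = [line for line in lines
                   if any(word in line.split() for word in keywords)]

print(substring_hits)   # ['this is a line with one', 'money and network']
print(whole_word_hits)  # ['this is a line with one']
```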


The regex solution has the same problem. You can either use the same solution as I did above and add an additional loop over the words, or you can construct a regex that will automatically match any of the words. See the Python regex syntax documentation.

import re

for line in lines:
    matches = re.findall('one|two|three', line)
    if matches:
        print(line, len(matches))

Note that re.findall returns an empty list if there are no matches and a list of all the matches if there are matches. So we can directly test the result in the if condition, as empty lists evaluate to False.
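
A quick way to convince yourself of this, using two made-up sample strings:

```python
import re

no_hits = re.findall('one|two|three', 'no keywords here')
hits = re.findall('one|two|three', 'one and two and one')

print(no_hits, bool(no_hits))  # [] False
print(hits, bool(hits))        # ['one', 'two', 'one'] True
```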

You can also easily generate the regex pattern for these simple cases:

pattern = '|'.join(keywords)
print(pattern)
# 'one|two|three'
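
If the keywords are not that simple, two refinements may help. This is a sketch under assumptions: the keywords 'c++' and 'a.b' are invented to show escaping, and the word-boundary variant assumes the keywords consist of ordinary word characters.

```python
import re

# re.escape neutralises characters such as '+' and '.' that are special in a regex.
special_keywords = ['one', 'c++', 'a.b']
escaped_pattern = '|'.join(re.escape(kw) for kw in special_keywords)
print(re.findall(escaped_pattern, 'one c++ axb a.b'))  # ['one', 'c++', 'a.b']

# \b anchors the match at word boundaries, so 'one' no longer matches inside 'money'.
keywords = ['one', 'two', 'three']
word_pattern = r'\b(?:' + '|'.join(re.escape(kw) for kw in keywords) + r')\b'
print(re.findall(word_pattern, 'money and one'))       # ['one']
```

With a few hundred keywords you could also compile the pattern once with re.compile and reuse the compiled object for every line.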

To sort them, you can simply put them in a list of tuples and use the key argument of sorted.

results = []
for line in lines:
    matches = re.findall('one|two|three', line)
    if matches:
        results.append((line, len(matches)))

results = sorted(results, key=lambda x: x[1], reverse=True)

You can read the documentation for sorted, but the key argument provides a function to use for sorting. In this case, we extract the second element of each tuple, which is where we stored the number of matches in that line, and sort the list with that.
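
As a small illustration of the key argument (the sample tuples are made up), note that sorted is stable, so lines with the same count keep their original relative order. If you want ties broken differently, the key can return a tuple.

```python
results = [('with one', 1), ('with one and two', 2), ('three', 1)]

# Sort by the match count only; ties keep their original order.
by_count = sorted(results, key=lambda x: x[1], reverse=True)
print(by_count)
# [('with one and two', 2), ('with one', 1), ('three', 1)]

# Sort by count descending, then alphabetically by line for ties.
by_count_then_line = sorted(results, key=lambda x: (-x[1], x[0]))
print(by_count_then_line)
# [('with one and two', 2), ('three', 1), ('with one', 1)]
```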


This is how you might apply this to an actual file and save the results.

import re

keywords = ['one', 'two', 'three']
pattern = '|'.join(keywords)

results = []
with open('myfile.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            # Strip the trailing newline so the output has no blank lines.
            results.append((line.rstrip('\n'), len(matches)))

results = sorted(results, key=lambda x: x[1], reverse=True)

with open('results.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{}  {}\n'.format(num_matches, line))

You can read up on the with context manager, but in this situation it essentially ensures that you close the file once you're done with it.
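
A tiny demonstration of that behaviour, writing to a temporary directory so it does not touch your real files (the file name example.txt is arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(path, 'w') as f:
    f.write('one line\n')
    print(f.closed)  # False: still open inside the block

print(f.closed)      # True: closed automatically on leaving the block
```

The file is closed even if an exception is raised inside the block, which is the main advantage over calling f.close() by hand.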
