I have a text file with tens of thousands of lines of ASCII text. I have a list of a few hundred keywords that I want to search for, considering each line individually. Initially, I want to return (print to screen or a file) the line if there are any matches, but eventually I'd like to rank or order the returned lines based on how many matches each one has.
So, my list is something like this...
keywords = ['one', 'two', 'three']
My train of thought was something like:
myfile = open('file.txt')
for line in myfile:
    if keywords in line:
        print line
But taking this from pseudocode to working code is not happening.
I've also thought of using RegEx:
print re.findall(keywords, myfile.read())
But that leads me down a path of different errors and problems.
If anyone can offer some guidance, syntax or code snippets I would be grateful.
Counter from the collections module seems like a great fit for the problem. I would do something like this.
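A minimal sketch of what a Counter-based approach could look like, written in Python 3 syntax. The helper name count_keywords and the sample lines are hypothetical, and matching on whitespace-separated words is an assumption rather than something the question specifies:

```python
from collections import Counter

keywords = ['one', 'two', 'three']

def count_keywords(line, keywords):
    """Return a Counter of how often each keyword occurs as a word in line."""
    wanted = set(keywords)
    # Assumption: a match means a whole whitespace-separated word.
    # Punctuation attached to a word (e.g. "one,") would prevent a match.
    return Counter(word for word in line.split() if word in wanted)

# Hypothetical sample lines standing in for the contents of file.txt.
lines = [
    "one fish two fish",
    "nothing to see here",
    "three two one two",
]

for line in lines:
    counts = count_keywords(line, keywords)
    if counts:  # an empty Counter is falsy, so lines without matches are skipped
        # Print the total number of keyword occurrences, then the line.
        print(sum(counts.values()), line)
```

Summing the Counter values gives a per-line score that can later be used for ranking.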
This outputs:
You don't specify it in your question, but in my view a keyword that appears several times in a line should count only once toward the score (this favors lines containing more distinct keywords):
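One way this scoring rule might be sketched in Python 3. The function name distinct_score and the sample lines are hypothetical, and the substring test is an assumption carried over from the question's pseudocode:

```python
keywords = ['one', 'two', 'three']

def distinct_score(line, keywords):
    """Number of different keywords found in line; repeats count once."""
    # Substring test: note that 'one' would also match inside 'money'.
    return sum(1 for word in keywords if word in line)

# Hypothetical sample lines standing in for the contents of file.txt.
lines = [
    "two two two two",
    "one and two",
    "no match",
    "one two three",
]

# Highest score first; sorted() is stable, so ties keep their file order.
ranked = sorted(lines, key=lambda line: distinct_score(line, keywords), reverse=True)

for line in ranked:
    score = distinct_score(line, keywords)
    if score:  # skip lines that contain no keyword at all
        print(score, line)
```

With this rule, "two two two two" scores 1 while "one and two" scores 2, so variety beats repetition.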
Example
Output
You can't test whether a list is in a string. What you can do is test whether one string is in another string.
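A minimal sketch of that idea in Python 3: loop over the keywords and test each one against the line. The generator name matching_lines and the sample data are hypothetical; in the question's setting the open file object would be passed in place of the sample list:

```python
keywords = ['one', 'two', 'three']

def matching_lines(lines, keywords):
    """Yield each line that contains at least one keyword as a substring."""
    for line in lines:
        for word in keywords:
            if word in line:
                yield line
                break  # stop at the first hit so the line is reported only once

# Hypothetical sample data; lines read from a file keep their trailing newline.
sample = ["one and two\n", "nothing here\n", "three\n"]

for line in matching_lines(sample, keywords):
    print(line, end="")  # the line already ends with a newline
```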
The break statement is necessary to exit the "word" loop as soon as the first word is matched. Otherwise it will print the line once for each word it matches.
The regex solution has the same problem. You can either use the same solution as I did above and add an additional loop over the words, or you can construct a regex that will automatically match any of the words. See the Python regex syntax documentation.
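As a sketch of the second option, a pattern using alternation (the | operator) matches any one of the words. This is Python 3 syntax; the function name lines_with_keywords and the sample lines are hypothetical:

```python
import re

# Alternation: the pattern matches wherever any one of the three words occurs.
pattern = re.compile(r'one|two|three')

def lines_with_keywords(lines):
    """Return the lines in which the pattern matches at least once."""
    hits = []
    for line in lines:
        if pattern.findall(line):  # an empty list (no matches) is falsy
            hits.append(line)
    return hits

# Hypothetical sample lines standing in for the contents of file.txt.
sample = ["one and two", "nothing here", "three"]
print(lines_with_keywords(sample))
```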
Note that re.findall returns an empty list if there are no matches and a list of all the matches if there are any. So we can directly test the result in the if condition, as empty lists evaluate to False.
You can also easily generate the regex pattern for these simple cases:
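One possible way to build the pattern from the keyword list, in Python 3 syntax. The helper name build_pattern and its whole_words option are hypothetical additions; re.escape is used on the assumption that keywords should be matched literally even if they contain regex metacharacters:

```python
import re

def build_pattern(keywords, whole_words=False):
    """Compile a regex that matches any of the given keywords."""
    # Escape each keyword so characters such as '.' or '+' match literally.
    alternation = '|'.join(re.escape(word) for word in keywords)
    if whole_words:
        # Optional: \b anchors prevent 'one' from matching inside 'none'.
        alternation = r'\b(?:' + alternation + r')\b'
    return re.compile(alternation)

keywords = ['one', 'two', 'three']
pattern = build_pattern(keywords)
print(pattern.pattern)
```

For a list of a few hundred keywords this still produces a single compiled pattern, so each line is scanned with one findall call.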
To sort them, you can simply put them in a list of tuples and use the key argument of sorted.
You can read the documentation for sorted, but the key argument provides a function to use for sorting. In this case, we extract the second element of each tuple, which is where we stored the number of matches in that line, and sort the list with that.
This is how you might apply this to an actual file and save the results.
You can read up on the with context manager, but in this situation it essentially ensures that you close the file once you're done with it.