My code processes lines read from a text file (see "Text Processing Details" at end). I need to amend my code so that it carries out the same task, but only with words in between certain points.
Code should not bother about this text. Skip it.
*****This is the marker to say where to start working with text. Don't do anything until after these last three asterisks.>***
Work with all of the code in this section
*****Stop working with the text when the first three asterisks are seen*****
Code should not bother about this text. Skip it.
The markers for all situations are three asterisks. Markers only count when they appear at the beginning and the end of the line.
What should I use to make my code only work in between the second and third set of asterisks?
Text Processing Details
My code reads a text file, makes all the words lowercase, and splits the words, putting them into a list:
infile = open(filename, 'r', encoding="utf-8")
text = infile.read().lower().split()
It then strips surrounding punctuation from each word in that list:
list_of_words = [word.strip('\n"-:\';,.') for word in text]
Finally, for each word in that list, if it only contains alphabetic symbols, it gets appended to a new list. That list is then returned:
list_2 = []
for word in list_of_words:
    if word.isalpha():
        list_2.append(word)
return list_2
You can get only the text between your asterisks with regex:
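For example (a sketch, assuming the whole file has been read into one string and that a marker line is any line that begins and ends with three asterisks; wanted is just a throwaway name):

import re

# A line counts as a marker if it starts and ends with three asterisks.
MARKER = re.compile(r'^\*\*\*.*\*\*\*$', re.MULTILINE)

with open(filename, 'r', encoding='utf-8') as infile:
    text = infile.read()

parts = MARKER.split(text)
# parts[0] is everything before the first marker line;
# parts[1] is the text between the first and second marker lines.
wanted = parts[1] if len(parts) > 2 else ''

You can then feed wanted.lower().split() through the rest of your processing.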
What appears to be one task, "count the words between two marker lines", is actually several. Separate the different tasks and decisions into separate functions and generators, and it will be vastly easier.
Step 1: Separate the file I/O from the word counting. Why should the word-counting code care where the words came from?
Step 2: Separate selecting the lines to process from the file handling and the word counting. Why should the word-counting code be given words it's not supposed to count? This is still far too big a job for one function, so it will be broken down further. (This is the part you're asking about.)
Step 3: Process the text. You've already done that, more or less. (I'll assume your text-processing code ends up in a function called words.)

1. Separate file I/O
Reading text from a file is really two steps: first, open and read the file, then strip the newline off each line. These are two jobs.
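A sketch of those two jobs, each in its own generator (raw_lines_from_file is a name I've made up here; lines_from_file is the one referred to below):

def raw_lines_from_file(fname):
    # Job one: open the file and hand back its lines, untouched.
    with open(fname, 'r', encoding='utf-8') as infile:
        for line in infile:
            yield line

def lines_from_file(fname):
    # Job two: strip the trailing newline off each line.
    for line in raw_lines_from_file(fname):
        yield line.rstrip('\n')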
Not a hint of your text processing here. The lines_from_file generator just yields whatever strings were found in the file... after stripping their trailing newline. (Note that a plain strip() would also remove leading and trailing whitespace, which you have to preserve to identify marker lines.)

2. Select only the lines between markers.
This is really more than one step. First, you have to know what is and isn't a marker line. That's just one function.
Then, you have to advance past the first marker (while throwing away any lines encountered), and finally advance to the second marker (while keeping any lines encountered). Anything after that second marker won't even be read, let alone processed.
Python's generators can almost solve the rest of Step 2 for you. The only sticking point is that closing marker... details below.
2a. What is and is not a marker line?
Identifying a marker line is a yes-or-no question, obviously the job of a Boolean function:
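A sketch of that predicate, assuming (as your sample text suggests) that the marker is three asterisks at both the start and the end of the line:

def is_marker_line(line, marker='***'):
    # A marker line begins and ends with the marker, and is long
    # enough to contain both copies of it.
    return (len(line) >= 2 * len(marker)
            and line.startswith(marker)
            and line.endswith(marker))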
Note that a marker line need not (from my reading of your requirements) contain any text between the start and end markers --- six asterisks ('******') is a valid marker line.

2b. Advance past the first marker line.
This step is now easy: just throw away every line until we find a marker line (and junk it, too). This function doesn't need to worry about the second marker line, or what if there are no marker lines, or anything else.
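For example (a sketch, reusing is_marker_line from above):

def advance_past_next_marker(lines):
    # Consume lines from the iterator until a marker line is found;
    # the marker itself is thrown away, too.  Return True if a marker
    # was found, False if the lines ran out first.
    for line in lines:
        if is_marker_line(line):
            return True
    return False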
2c. Advance past the second marker line, saving content lines.
A generator could easily yield every line after the "start" marker, but if it discovers there is no "end" marker, there's no way to go back and un-yield those lines. So, now that you've finally encountered lines you (might) actually care about, you'll have to save them all in a list until you know whether they're valid or not.
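A sketch of that idea:

def lines_before_next_marker(lines):
    # Save lines until the next marker line appears; only then are
    # the saved lines yielded.  If the lines run out before a marker
    # is found, nothing is yielded at all.
    saved = []
    for line in lines:
        if is_marker_line(line):
            for saved_line in saved:
                yield saved_line
            break
        saved.append(line)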
2d. Gluing Step 2 together.

Advance past the first marker, then yield everything until the second marker:
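A sketch that glues the two helpers together, plus a quick test with a plain list of strings (the sample lines are made up):

def lines_between_markers(lines):
    # iter() gives one shared iterator, so both helpers advance
    # through the same sequence of lines.
    it = iter(lines)
    if not advance_past_next_marker(it):
        return  # no opening marker at all
    for line in lines_before_next_marker(it):
        yield line

test_lines = [
    'Ignore this.',
    '*** start ***',
    'Keep this line.',
    'And this one.',
    '*** stop ***',
    'Ignore this, too.',
]
assert list(lines_between_markers(test_lines)) == ['Keep this line.', 'And this one.']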
Testing functions like this with a bunch of input files is annoying. Testing them with lists of strings is easy, but lists are not generators or iterators, they're iterables. The one extra it = iter(...) line was worth it.

3. Process the selected lines.
Again, I'm assuming your text processing code is safely wrapped up in a function called words. The only change is that, instead of opening a file and reading it to produce a list of lines, you're given the lines --- and words should probably be a generator, too:
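A sketch, reusing the cleanup steps from your own code (the exact processing is up to you):

def words(lines):
    # Same processing as before, but applied to lines we're given
    # instead of a file we open ourselves.
    for line in lines:
        for word in line.lower().split():
            word = word.strip('\n"-:\';,.')
            if word.isalpha():
                yield word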
Now, calling words is easy. To get the words_from_file(fname), you yield the words found in the lines_between_markers, selected from the lines_from_file... Not quite English, but close:
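A sketch of that composition:

def words_from_file(fname):
    # Glue it all together: read the lines, keep only those between
    # the markers, and turn them into cleaned-up words.
    for word in words(lines_between_markers(lines_from_file(fname))):
        yield word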
4. Call words_from_file from your program.

Wherever you already have filename defined --- presumably inside main somewhere --- call words_from_file to get one word at a time:
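For instance (print is just a stand-in for whatever your program does with each word):

for word in words_from_file(filename):
    print(word)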
Or, if you really need those words in a list:
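Wrapping the generator in list() does it (reusing your list_of_words name):

list_of_words = list(words_from_file(filename))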
Conclusion

All this would have been much harder if you'd tried to squeeze it into one or two functions. It wasn't just one task or decision, but many. The key was breaking it into tiny jobs, each of which was easy to understand and test.
The generators got rid of a lot of boilerplate code. Without generators, almost every function would have required a for loop just to some_list.append(next_item), like in lines_before_next_marker.

If you have Python 3.3+, the yield from ... construct erases even more boilerplate. Every generator containing a loop like this:
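for line in lines_before_next_marker(it):
    yield line

could be re-written as:

yield from lines_before_next_marker(it)

(That example is the pass-through loop from lines_between_markers above; the same rewrite applies to the other pass-through generators.)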
I counted four of them.
For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's "Loop Like a Native", available as a 30-minute video from PyCon US 2013.
I recommend using regular expressions.