I'm starting to learn Python and I'm trying to write a program that would import a text file, count the total number of words, count the number of words in a specific paragraph (said by each participant, described by 'P1', 'P2' etc.), exclude these words (i.e. 'P1' etc.) from my word count, and print paragraphs separately.
Thanks to @James Hurford I got this code:
words = None
with open('data.txt') as f:
words = f.read().split()
total_words = len(words)
print 'Total words:', total_words
in_para = False
para_type = None
paragraph = list()
for word in words:
if ('P1' in word or
'P2' in word or
'P3' in word ):
if in_para == False:
in_para = True
para_type = word
else:
print 'Words in paragraph', para_type, ':', len(paragraph)
print ' '.join(paragraph)
del paragraph[:]
para_type = word
else:
paragraph.append(word)
else:
if in_para == True:
print 'Words in last paragraph', para_type, ':', len(paragraph)
print ' '.join(paragraph)
else:
print 'No words'
My text file looks like this:
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.
The next part I need to do is summing up the words for each participant. I can only print them, but I don't know how to return/reuse them.
I would need a new variable with word count for each participant that I could manipulate later on, in addition to summing up all the words said by each participant, e.g.
P1all = sum of words in paragraph
Is there a way to count "you're" or "it's" etc. as two words?
Any ideas how to solve it?
Congrats on beginning your adventure with Python! Not everything in this post might make sense right now but bookmark it and comeback to it if it seems helpful later. Eventually you should try to move from scripting to software engineering, and here are a few ideas for you!
With great power comes great responsibility, and as a Python developer you need to be more disciplined than other languages which don't hold your hand and enforce "good" design.
I find it helps to start with a top-down design.
BOOM! You just wrote the whole program -- now you just need to back and fill in the blanks! When you do it like this, it seems less intimidating. Personally, I don't consider myself smart enough to solve very big problems, but I'm a pro at solving small problems. So lets tackle one thing at a time. I'm going to start with 'process_text'.
I'm not really sure what those things mean yet, but I know that text problems tend to follow a pattern called "map/reduce" which means you perform and operation on something and then you clean it up and combine, so I put in some placeholder functions. I might go back and add more if necessary.
Now let's write 'process_catalogue'. I could've written "process_dict" but that sounded lame to me.
Cool. Not too bad. You might approach this different than me, but I thought it would make sense to aggregate the items, the count the words per paragraph, and then count all the words.
So, at this point I'd probably make one or two little 'lib' (library) modules to back-fill the remaining functions. For the sake you being able to run this without worrying about imports, I'm going to stick it all in one .py file, but eventually you'll learn how to break these up so it looks nicer. So let's do this.
aaaand a little helper function
So it may not be obvious but this library is designed as a data processing pipeline. There several ways to process data, one is pipeline processing and another is batch processing. Let's take a look at batch processing.
All these nested dictionaries are getting a little complicated. In actual production code I would replace these with some more readable classes (along with adding tests and docstrings!!), but I don't want to make this more confusing than it already is! Alright, for your convenience below is the whole thing put together.
So this script is almost certainly overkill for this application, but the point is to see what (questionably) readable, maintainable, modular Python code might look like.
Pretty sure output looks something like:
You can do this with two variables. One to keep track of what person is speaking, the other to keep the paragraphs for the persons speaking. For storing the paragraphs and associating who it is that the paragraph belongs to use a dict with the person as the key and a list of paragraphs that person said associated with this key.
From here you can sum up the number of words spoken thus
No, you would need a
Counter
(Python 2.7+, else use adefaultdict(int)
) mapping persons to word counts.Now
words_per_person['P1']
contains the number of words ofP1
, assumingtext.split()
is a good enough tokenizer for your purposes. (Linguists disagree about the definition of word, so you're always going to get an approximation.)