Reading only the words of a specific speaker and a

Posted 2019-09-09 10:58

I have a transcript, and to analyze each speaker I need to add only that speaker's words to a string. The problem is that not every line starts with the speaker's name. Here's a snippet of my text file:

BOB: blah blah blah blah
blah hello goodbye etc.

JERRY:.............................................
...............

BOB:blah blah blah
blah blah blah
blah.

I want to collect only the words that the chosen speaker (in this case Bob) said and add them to a string, excluding words from Jerry and other speakers. Any ideas for this?

Edit: There are line breaks between paragraphs and before each new speaker starts.

2 Answers
ら.Afraid
#2 · 2019-09-09 11:53

Whenever a line starts with a speaker's name, store that name as current_speaker. For each line that follows, decide what to do based on current_speaker, and keep reading until the speaker changes.
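
A minimal sketch of that idea, without a regex. It assumes a speaker label is a single all-caps word before the first colon on a line (as in the snippet from the question); the function name collect_speaker_text and the sample lines are made up for illustration.

```python
def collect_speaker_text(lines, speaker):
    """Return everything `speaker` said, joined into one string."""
    current_speaker = None
    collected = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines only separate paragraphs
        head, sep, tail = line.partition(":")
        # Assumption: a speaker label is one all-caps word before the first colon.
        if sep and head.isalpha() and head.isupper():
            current_speaker = head
            line = tail.strip()
        # Lines without a label belong to whoever spoke last.
        if current_speaker == speaker and line:
            collected.append(line)
    return " ".join(collected)


# Hypothetical sample data modelled on the question's snippet.
sample = [
    "BOB: blah blah blah blah",
    "blah hello goodbye etc.",
    "",
    "JERRY: some other words",
    "more from jerry",
    "",
    "BOB:blah blah blah",
    "blah.",
]
bob_text = collect_speaker_text(sample, "BOB")
print(bob_text)
```

To run it on a file, pass the open file object instead of the list, since iterating over a file yields its lines. If a label can contain spaces or lowercase letters, the check on head would need to be loosened.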

别忘想泡老子
#3 · 2019-09-09 11:58

A regex works well here. Since the same pattern is matched against every line, you can save a bit of processing by compiling it once up front.

import re

speaker_words = {}
# A speaker label is a run of word characters at the start of a line, followed by a colon.
speaker_pattern = re.compile(r'^(\w+?):(.*)$')

current_speaker = None
with open("transcript.txt", "r") as f:
    for line in f:
        line = line.strip()
        match = speaker_pattern.match(line)
        if match is not None:
            # A new speaker starts here; the rest of the line is their first words.
            current_speaker = match.group(1)
            line = match.group(2).strip()
        if current_speaker:
            # you may want to do some sort of punctuation filtering too
            # split() with no argument also discards empty strings
            speaker_words.setdefault(current_speaker, []).extend(line.split())

print(speaker_words)

For the sample transcript above, this outputs the following (reformatted here for readability):

{
    "BOB": ['blah', 'blah', 'blah', 'blah', 'blah', 'hello', 'goodbye', 'etc.', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah.'],
    "JERRY": ['.............................................', '...............']
}
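
Since the question asks for a string rather than a list, the word list for one speaker can be joined afterwards. The sketch below also shows one possible form of the punctuation filtering mentioned in the code comment; the helper name words_as_string and the sample list are illustrative, not part of the answer above.

```python
import string


def words_as_string(words, strip_punctuation=False):
    """Join a list of words into one space-separated string."""
    if strip_punctuation:
        # Remove leading/trailing punctuation, then drop words that become empty.
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    return " ".join(words)


# Hypothetical word list in the same shape as speaker_words["BOB"].
sample_words = ["blah", "hello", "goodbye", "etc.", "blah."]
plain = words_as_string(sample_words)
cleaned = words_as_string(sample_words, strip_punctuation=True)
print(plain)
print(cleaned)
```

With the dictionary built above, the call would look like words_as_string(speaker_words.get("BOB", [])), where .get avoids a KeyError if the speaker never appears.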