How to get the Wikipedia corpus text with punctuation

Posted 2019-07-21 15:38

Question:

I'm trying to get the text with its punctuation, because punctuation matters for my doc2vec model. However, WikiCorpus retrieves only the words, without punctuation. After searching the web I found these pages:

  1. A page from the gensim GitHub issues section. Someone asked the same question, and the answer (by Piskvorky) was to subclass WikiCorpus. Luckily, the same page included code implementing the suggested subclass, provided by Rhazegh. (link)
  2. A Stack Overflow page titled "Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus". However, it gave no clear answer and was discussed in the context of spaCy. (link)

I decided to use the code provided in page 1. My current code (mywikicorpus.py):

import sys
import os
sys.path.append('C:\\Users\\Ghaliamus\\Anaconda2\\envs\\wiki\\Lib\\site-packages\\gensim\\corpora\\')

from wikicorpus import *

def tokenize(content):
    # override original method in wikicorpus.py
    return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
        if len(token) <= 15 and not token.startswith('_')]

def process_article(args):
    # override original method in wikicorpus.py
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    if lemmatize:
        result = utils.lemmatize(text)
    else:
        result = tokenize(text)
    return result, title, pageid


class MyWikiCorpus(WikiCorpus):
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        WikiCorpus.__init__(self, fname, processes, lemmatize, dictionary, filter_namespaces)

    def get_texts(self):
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for tokens, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(tokens)
                # skip short articles and pages in ignored namespaces
                if len(tokens) < ARTICLE_MIN_WORDS or any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                articles += 1
                positions += len(tokens)
                if self.metadata:
                    yield (tokens, (pageid, title))
                else:
                    yield tokens
        pool.terminate()

        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length

Then I used another script by Pan Yang (link). It instantiates a WikiCorpus object and retrieves the text. The only change in my version is that it instantiates MyWikiCorpus instead of WikiCorpus. The code (process_wiki.py):

from __future__ import print_function
import logging
import os.path
import six
import sys
import mywikicorpus as myModule



if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki-20180601-pages-articles.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = myModule.MyWikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

I ran process_wiki.py from the command line. It produced the corpus text, and the last line in the command prompt was:

(2018-06-05 09:18:16,480: INFO: Finished Saved 4526191 articles)

When I read the file in Python, I checked the first article, and it had no punctuation. Example:

(anarchism is a political philosophy that advocates self governed societies based on voluntary institutions these are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical or free associations anarchism holds the state to be undesirable unnecessary and harmful while opposition to the state is central anarchism specifically entails opposing authority or hierarchical)

My two related questions, which I hope you can help me with:

  1. Is there anything wrong with the pipeline described above?
  2. Regardless of that pipeline, if I opened the gensim WikiCorpus source (wikicorpus.py) and wanted to edit it, which line should I add, remove, or update (and with what, if possible) to get the same results but with punctuation?

Many thanks for your time reading this long post.

Best wishes,

Ghaliamus

Answer 1:

The problem lies in the tokenize function you defined:

def tokenize(content):
    return [token.encode('utf8')
            for token in utils.tokenize(content, lower=True, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]

The function utils.tokenize(content, lower=True, errors='ignore') simply splits the article into a sequence of tokens. However, its implementation in .../site-packages/gensim/utils.py drops the punctuation.

For example, list(utils.tokenize("I love eating banana, apple")) returns ["I", "love", "eating", "banana", "apple"]; the comma is gone.
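
To see why the comma disappears, here is a minimal sketch that reproduces the behavior with the standard library alone. It assumes gensim's alphabetic pattern is the regular expression shown below (copied from memory of utils.py, so check your installed version); the helper name alphabetic_tokens is made up for this illustration.

```python
import re

# Assumed copy of gensim's PAT_ALPHABETIC; verify against your utils.py.
PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)

def alphabetic_tokens(text):
    """Hypothetical helper: keep only runs of non-digit word characters."""
    return [match.group() for match in PAT_ALPHABETIC.finditer(text)]

tokens = alphabetic_tokens("I love eating banana, apple")
print(tokens)
```

Because the pattern only ever matches word characters, anything else (commas, periods, quotes) is skipped between matches rather than removed in a separate step.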

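Instead, you can define your own tokenize function as follows to retain punctuation. Note that content.split() splits on whitespace only, so punctuation stays attached to the neighboring word (for example "banana,") and the text is no longer lowercased.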

def tokenize(content):
    #override original method in wikicorpus.py
    return [token.encode('utf8') for token in content.split() 
           if len(token) <= 15 and not token.startswith('_')]
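
If you would rather have punctuation marks as separate tokens than glued to words, a regex-based tokenizer is one option. The following is only a sketch: the name tokenize_keep_punct and its parameters are hypothetical, it is not part of gensim, and it returns text strings instead of UTF-8 encoded bytes.

```python
import re

# A run of word characters, or one non-space, non-word character.
TOKEN_OR_PUNCT = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize_keep_punct(content, lower=True, max_len=15):
    """Hypothetical tokenizer that emits punctuation as its own tokens."""
    if lower:
        content = content.lower()
    # Same filters as the question's tokenize(): length and leading '_'.
    return [tok for tok in TOKEN_OR_PUNCT.findall(content)
            if len(tok) <= max_len and not tok.startswith('_')]

example = tokenize_keep_punct("I love eating banana, apple.")
print(example)
```

With this variant a hyphenated word such as "self-governed" becomes three tokens ("self", "-", "governed"), which may or may not suit your doc2vec setup.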


Answer 2:

In gensim/utils.py you will find the function

def save_as_line_sentence(corpus, filename):
    with smart_open(filename, mode='wb', encoding='utf8') as fout:
        for sentence in corpus:
            line = any2unicode(' '.join(sentence) + '\n')
            fout.write(line)

that you can use to write the corpus to a text file. You can override it, or take it as an example and write your own version (perhaps breaking the lines at each punctuation mark), like

def save_sentence_each_line(corpus, filename):
    with utils.smart_open(filename, mode='wb', encoding='utf8') as fout:
        for sentence in corpus:
            line = utils.any2unicode(' '.join(sentence) + '\n')
            line = line.replace('. ', '\n').replace('!', '\n').replace('?', '\n') # <- !!
            ...

You can call it like this:

save_sentence_each_line(wiki.get_texts(), out_f)

You also need to override PAT_ALPHABETIC from utils, because that's where the punctuation gets deleted:

PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)
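
As a quick check of what such a pattern keeps, the sketch below runs a pattern like the one above (written here with the character class `[\w.!?]`) on a sample sentence using only the re module. The name PAT_WITH_PUNCT is made up for the example, and assigning it to utils.PAT_ALPHABETIC, as in the commented line, is an assumption about how the override would be wired in.

```python
import re

# Word characters plus '.', '!' and '?'; tokens still cannot contain digits.
PAT_WITH_PUNCT = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)

# Assumed wiring (not run here, needs gensim):
#   from gensim import utils
#   utils.PAT_ALPHABETIC = PAT_WITH_PUNCT

sample = "Is it 2018? Yes, it is. Great!"
tokens = [match.group() for match in PAT_WITH_PUNCT.finditer(sample)]
print(tokens)
```

Two things are worth noticing: the sentence-ending marks stay attached to the preceding word ("is.", "Great!"), and characters outside the class, such as the comma, are still dropped.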

You may then need to override utils.tokenize and utils.simple_tokenize in case you want to make further changes to the code.
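
For completeness, here is one way the sentence-per-line writer from this answer might be finished. It is a sketch under assumptions: it uses io.open from the standard library instead of smart_open, expects each document to be a list of text strings (not bytes), and keeps the punctuation mark at the end of each line instead of replacing it. The name write_one_sentence_per_line is hypothetical.

```python
import io
import re

# Split on whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def write_one_sentence_per_line(corpus, filename):
    """Hypothetical writer: one sentence per line, punctuation preserved."""
    with io.open(filename, mode='w', encoding='utf8') as fout:
        for tokens in corpus:
            text = u' '.join(tokens)
            for sentence in SENTENCE_END.split(text):
                if sentence:
                    fout.write(sentence + u'\n')
```

It could then be called as write_one_sentence_per_line(wiki.get_texts(), out_f), provided the tokenizer in use actually keeps the sentence-ending marks.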