
Spark and Python trying to parse wikipedia using gensim

Posted 2019-08-11 06:10

Question:

Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think that I should be able to parse basically any input with sc.textFile() and then process it with custom functions, either my own or from some library.

Now I am specifically trying to parse a Wikipedia dump using the gensim framework. I have already installed gensim on my master node and all my worker nodes, and now I would like to use gensim's built-in function for parsing Wikipedia pages, inspired by this question: List (or iterator) of tuples returned by MAP (PySpark).

My code is the following:

import sys
import gensim
from pyspark import SparkContext


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)

    sc = SparkContext(appName="Process wiki - distributed RDD")

    distData = sc.textFile(sys.argv[1])
    # take only 10 to see what the output would look like
    processed_data = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)

    print processed_data
    sc.stop()

The source code of extract_pages can be found at https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py and, based on my reading of it, it seems that it should work with Spark.

But unfortunately, when I run the code I get the following error log:

14/10/05 13:21:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, <ip address>.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/root/spark/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/root/spark/python/pyspark/rdd.py", line 1148, in takeUpToNumLeft
yield next(iterator)
File "/usr/lib64/python2.6/site-packages/gensim/corpora/wikicorpus.py", line 190, in extract_pages
elems = (elem for _, elem in iterparse(f, events=("end",)))
File "<string>", line 52, in __init__
IOError: [Errno 2] No such file or directory: u'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/ http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9" xml:lang="en">'
    org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
    org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

followed by what is probably a Spark log:

14/10/05 13:21:12 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
14/10/05 13:21:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/10/05 13:21:12 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/10/05 13:21:12 INFO scheduler.DAGScheduler: Failed to run runJob at PythonRDD.scala:296

and

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I've tried this without Spark successfully, so the problem should lie somewhere in the combination of Spark and gensim, but I don't fully understand the error I'm getting. I don't see any file reading on line 190 of gensim's wikicorpus.py.

EDIT:

I have added some more Spark logs above.

EDIT2:

gensim uses from xml.etree.cElementTree import iterparse (documentation here), which might be the cause of the problem. It expects a file name or a file object containing the XML data. Can an RDD be treated as a file containing the XML data?
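To illustrate that hypothesis: iterparse accepts either a file name or a file-like object. Each element produced by sc.textFile() is a plain string of XML, and passing a string makes iterparse treat it as a path, which matches the IOError above where the "missing file" is literally the <mediawiki ...> line. A minimal sketch, using xml.etree.ElementTree (same iterparse API as the cElementTree module gensim imports):

```python
from io import BytesIO
from xml.etree.ElementTree import iterparse

xml_data = b'<mediawiki><page><title>Foo</title></page></mediawiki>'

# Passing a string makes iterparse treat it as a file *name*:
try:
    list(iterparse(xml_data.decode(), events=("end",)))
except OSError as e:  # raised as IOError on Python 2
    print("treated as a path:", e)

# Wrapping the same bytes in a file-like object parses fine:
titles = [elem.text
          for _, elem in iterparse(BytesIO(xml_data), events=("end",))
          if elem.tag == "title"]
print(titles)  # ['Foo']
```

So the RDD itself cannot stand in for the file; each worker would need a file-like object over a complete, well-formed XML document.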

Answer 1:

I usually work with Spark in Scala. Nevertheless, here are my thoughts:

When you load a file via sc.textFile, you get a sort of line iterator that is distributed across your Spark workers. Given the XML format of the Wikipedia dump, one line does not necessarily correspond to a parsable XML item, and that is why you are getting this problem.

For example:

 Line 1 :  <item>
 Line 2 :  <title> blabla </title> <subitem>
 Line 3 : </subitem>
 Line 4 : </item>

If you try to parse each line on its own, it will throw exceptions like the ones you got.

I often have to mess around with Wikipedia dumps, so the first thing I do is transform the dump into a "readable version" that is easily digested by Spark, i.e. one line per article entry. Once you have it in that form, you can easily feed it into Spark and do all kinds of processing. Transforming it doesn't take many resources.

Take a look at ReadableWiki: https://github.com/idio/wiki2vec
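Once the dump is flattened to one article per line, the per-line mapping becomes trivial. A toy sketch, assuming a hypothetical title<TAB>wikitext line format; the markup stripping below is only a stand-in for gensim's filter_wiki:

```python
import re

def parse_article_line(line):
    """Parse one 'title<TAB>wikitext' line into (title, plain_text).

    This is the shape of function you could pass to rdd.map() once the
    dump is one article per line; the stripping here is a toy
    approximation of real wiki-markup cleaning.
    """
    title, _, wikitext = line.partition("\t")
    # Strip [[target|label]] / [[target]] links, keeping the visible text.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    # Strip '''bold''' and ''italic'' quote markup.
    text = text.replace("'''", "").replace("''", "")
    return title, text

sample = "Anarchism\t'''Anarchism''' is a [[political philosophy|philosophy]]."
print(parse_article_line(sample))
# ('Anarchism', 'Anarchism is a philosophy.')
```

With a real flattened dump you would then do something like sc.textFile(path).map(parse_article_line) and continue from there.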