I want to parse websites, use BeautifulSoup to filter out the parts I need, and write the result to a CSV file in HDFS.
Right now I am at the step of filtering the page source with BeautifulSoup.
I want to run it as a MapReduce job with Hadoop Streaming:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.0.2.jar \
    -mapper /pytemp/filter.py \
    -input /user/root/py/input/ \
    -output /user/root/py/output40/
Each line of the input file is one key-value pair: (key, value) = (url, content)
By content, I mean the page's HTML:
<html><head><title>...</title></head><body>...</body></html>
filter.py file:
#!/usr/bin/env python
#!/usr/bin/python
#coding:utf-8
from bs4 import BeautifulSoup
import sys
for line in sys.stdin:
    line = line.strip()
    key, content = line.split(",")
    # if the following two lines do not exist, the program will execute successfully
    soup = BeautifulSoup(content)
    output = soup.find()
print("Start-----------------")
print("End------------------")
By the way, I do not think I need a reduce.py for this job.
However, I get this error message:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
I found a reply saying this is a memory issue, but my input file is only 3 MB: http://grokbase.com/t/gg/rhadoop/13924fs4as/972-getting-error-pipemapred-waitoutputthreads-while-running-mapreduce-program-for-40mb-of-sizedataset
I have no idea what is causing the problem. I have searched a lot, but nothing has worked so far.
My environment is:
- CentOS6
- Python2.7
- Cloudera CDH5
I would appreciate any help with this.
EDIT on 2016/06/24
First of all, I checked the error log and found that the problem was "too many values to unpack" (thanks also to @kynan's answer).
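Since the streaming error only reports that the subprocess failed with code 1, the real Python exception has to be read from the task's stderr log. Below is a minimal sketch of a mapper loop that writes the traceback to a log callback instead of exiting on the first bad record. The names process and run are hypothetical, and process only mirrors the split used in filter.py:

```python
import traceback


def process(line):
    # Hypothetical per-record step mirroring filter.py: expects exactly one comma.
    key, content = line.strip().split(",")
    return "%s\t%d" % (key, len(content))


def run(lines, log):
    # Yield one output record per good line; report bad lines through `log`.
    for line in lines:
        try:
            yield process(line)
        except Exception:
            log("failed on line starting %r\n%s" % (line[:60], traceback.format_exc()))


# In a mapper this could be driven as:
#     import sys
#     for record in run(sys.stdin, sys.stderr.write):
#         print(record)
```

Whether to skip bad records like this or let the task fail is a design choice; the sketch skips them so that the log shows every failing line in one run.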
Here is an example of why it happened:
<font color="#0000FF">
SomeText1
<font color="#0000FF">
SomeText2
</font>
</font>
If part of the content looks like the above and I call soup.find("font", color="#0000FF") and assign the result to output, the match covers two font elements (the outer one and the one nested inside it) for a single output. That is why I got the "too many values to unpack" error.
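For context, Python raises this error whenever the right-hand side of an assignment yields more items than there are names on the left. Here is a minimal sketch that does not depend on BeautifulSoup. The helper split_record is hypothetical, and that the page content may contain commas is an assumption about the input, in which case the line.split(",") call in filter.py could raise the same error:

```python
def split_record(line):
    # Split on the first comma only, so commas inside the HTML stay in `content`.
    key, content = line.rstrip("\n").split(",", 1)
    return key, content


def unpack_message(items):
    # Return the error text raised by unpacking `items` into two names, or None.
    try:
        first, second = items
    except ValueError as exc:
        return str(exc)
    return None


record = "http://a.example,<p>x, y</p>\n"
message = unpack_message(record.split(","))   # three items for two names
pair = split_record(record)                   # always two items when a comma is present
```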
Solution
Just change output = soup.find()
to (Var1, Var2, ...) = soup.find_all("font", color="#0000FF", limit=AmountOfVar)
and it works well :)
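The change above can be sketched as follows, using the nested font markup from the example. This is only an illustration: the "html.parser" argument and the variable names are my own choices, and the count is checked before unpacking so that a page with fewer matches does not raise again:

```python
from bs4 import BeautifulSoup

html = ('<font color="#0000FF">SomeText1'
        '<font color="#0000FF">SomeText2</font></font>')
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match (here the outer element), or None.
first = soup.find("font", color="#0000FF")

# find_all() returns a list of at most `limit` matches, in document order.
matches = soup.find_all("font", color="#0000FF", limit=2)

# Unpack only when the expected number of elements was actually found.
if len(matches) == 2:
    outer, inner = matches
    texts = [outer.get_text(), inner.get_text()]
else:
    texts = [tag.get_text() for tag in matches]
```

Because the inner element is nested in the outer one, the outer element's text also contains the inner text.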