I'm trying to use subprocess in a Python script which I call within an Oozie shell action. Subprocess is supposed to read a file which is stored in Hadoop's HDFS.
I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.
Here is the Python script, named connected_subprocess.py:
#!/usr/bin/python
import subprocess
import networkx as nx
liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
G=nx.DiGraph()
f=open("/home/rlk/liste_strongly_connected.txt","wb")
for item in liste:
    try:
        app1,app2=item.split('\t')
        G.add_edge(app1,app2)
    except:
        pass
liste_connected=nx.strongly_connected_components(G)
for item in liste_connected:
    if len(item)>1:
        f.write('{}\n'.format('\t'.join(item)))
f.close()
The corresponding shell action in Oozie's workflow.xml is the following:
<action name="final">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <exec>connected_subprocess.py</exec>
        <file>connected_subprocess.py</file>
    </shell>
    <ok to="end" />
    <error to="kill" />
</action>
When I run the Oozie job, the tasktracker log shows these errors:
Error: Could not find or load main class org.apache.hadoop.fs.FsShell
Traceback (most recent call last):
  File "./connected_subprocess.py", line 6, in <module>
    liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
  File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
It seems that I cannot run a shell command from my Python script when the script is executed by an Oozie shell action, even though the same script works fine when I run it from my interactive shell.
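To compare the two environments, something like the following could be run both interactively and as the script of the shell action. This is only a diagnostic sketch: the function name describe_environment and the list of variables are my own assumptions about what the hadoop launcher might depend on, not something taken from the Hadoop or Oozie documentation.

```python
#!/usr/bin/python
# Diagnostic sketch (hypothetical helper, not part of the original script):
# report environment variables that the `hadoop` command may rely on.
from __future__ import print_function
import os
import sys

# Assumed list of relevant variables; adjust it for the local installation.
ENV_KEYS = ("PATH", "HADOOP_HOME", "HADOOP_CONF_DIR",
            "HADOOP_CLASSPATH", "CLASSPATH", "JAVA_HOME")


def describe_environment(environ=None, keys=ENV_KEYS):
    """Return 'KEY=value' lines, marking variables that are not set."""
    if environ is None:
        environ = os.environ
    return ["{0}={1}".format(key, environ.get(key, "<unset>")) for key in keys]


if __name__ == "__main__":
    # Write to stderr, which is where the traceback above was reported.
    for line in describe_environment():
        print(line, file=sys.stderr)
```

Any variable that differs between the interactive run and the Oozie run would be a candidate explanation for the FsShell class not being found.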
Is there any way I can work around this limitation?