I'm trying to use subprocess in a Python script that I call from an Oozie shell action. subprocess is supposed to read a file stored in Hadoop's HDFS. I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.
Here is the Python script, named connected_subprocess.py:
#!/usr/bin/python
import subprocess
import networkx as nx

# Read the edge list out of HDFS via the hadoop CLI
liste = subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt", shell=True).split('\n')

G = nx.DiGraph()
f = open("/home/rlk/liste_strongly_connected.txt", "wb")
for item in liste:
    try:
        app1, app2 = item.split('\t')
        G.add_edge(app1, app2)
    except ValueError:
        # Skip blank or malformed lines that don't contain exactly one tab
        pass

liste_connected = nx.strongly_connected_components(G)
for item in liste_connected:
    if len(item) > 1:
        f.write('{}\n'.format('\t'.join(item)))
f.close()
The corresponding shell action in Oozie's workflow.xml is the following:
<action name="final">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <exec>connected_subprocess.py</exec>
        <file>connected_subprocess.py</file>
    </shell>
    <ok to="end" />
    <error to="kill" />
</action>
When I run the Oozie job, the TaskTracker log shows these errors:
Error: Could not find or load main class org.apache.hadoop.fs.FsShell
Traceback (most recent call last):
File "./connected_subprocess.py", line 6, in <module>
liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
It seems that I cannot run a shell command from within my Python script when the script is embedded in an Oozie action, since everything works fine when I run the script from my interactive shell.
Is there any way to bypass this limitation?
I wonder if your script simply doesn't have access to your PATH environment variable when executed through Oozie, and is therefore having trouble locating the hadoop command. (The first error line, "Could not find or load main class org.apache.hadoop.fs.FsShell", also suggests a broken Hadoop environment in the launcher, not a problem with your script itself.) Can you try modifying your subprocess.check_output call to use the full path to the hadoop binary?
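For example, something along these lines. Note that /usr/local/hadoop/bin/hadoop is just an assumed install location; substitute whatever `which hadoop` prints in the interactive shell where the script already works. The helper name hdfs_cat_command is likewise only illustrative:

```python
import subprocess

# Assumed install prefix -- replace with the output of `which hadoop`
# from the shell where the script runs successfully.
HADOOP_BIN = "/usr/local/hadoop/bin/hadoop"

def hdfs_cat_command(path, hadoop_bin=HADOOP_BIN):
    """Build a `hadoop fs -cat` command using an absolute binary path,
    so the Oozie launcher does not need `hadoop` on its PATH."""
    return "{0} fs -cat {1}".format(hadoop_bin, path)

# In connected_subprocess.py the original call would then become:
# liste = subprocess.check_output(
#     hdfs_cat_command("/user/root/output-data/calcul-proba/final.txt"),
#     shell=True).split('\n')
```

If that works, the cleaner long-term fix is to make sure the Hadoop environment (PATH, HADOOP_HOME, etc.) is set up for the user that the Oozie launcher task runs as.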