Python subprocess with Oozie

Posted 2019-05-28 13:37

Question:

I'm trying to use subprocess in a Python script that I call from an Oozie shell action. The subprocess is supposed to read a file stored in Hadoop's HDFS.

I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.

Here is the Python script, named connected_subprocess.py:

#!/usr/bin/python

import subprocess
import networkx as nx

liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
G=nx.DiGraph()
f=open("/home/rlk/liste_strongly_connected.txt","wb")
for item in liste:
    try:
        app1,app2=item.split('\t')
        G.add_edge(app1,app2)
    except ValueError:
        # Skip blank or malformed lines that do not split into exactly two fields.
        pass
liste_connected=nx.strongly_connected_components(G)
for item in liste_connected:
    if len(item)>1:
        f.write('{}\n'.format('\t'.join(item)))
f.close()

The corresponding shell action in Oozie's workflow.xml is the following:

<action name="final">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <exec>connected_subprocess.py</exec>
        <file>connected_subprocess.py</file>
    </shell>
    <ok to="end" />
    <error to="kill" />
</action>

When I run the Oozie job, the tasktracker log shows these errors:

Error: Could not find or load main class org.apache.hadoop.fs.FsShell
Traceback (most recent call last):
  File "./connected_subprocess.py", line 6, in <module>
    liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
  File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]

It seems that I cannot run a shell command from my Python script when the script runs inside an Oozie action: everything works fine when I run the same script from my interactive shell.

Is there any way I can bypass this limitation?
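Since the script behaves differently in the two settings, one way to narrow things down is to log, from inside the script, the environment variables that the hadoop launcher typically depends on, and compare the output with the interactive shell. The following is only a minimal sketch: the list of variable names is an assumption about what is worth comparing, and describe_environment is a hypothetical helper, not part of Oozie or Hadoop.

```python
import os
import sys

# Assumption: these are the variables worth comparing between the
# interactive shell and the Oozie launcher; adjust the list as needed.
VARIABLES_TO_CHECK = ("PATH", "HADOOP_HOME", "HADOOP_CLASSPATH", "CLASSPATH", "JAVA_HOME")


def describe_environment(environ=None, names=VARIABLES_TO_CHECK):
    """Return one 'NAME=value' line per variable, marking unset ones."""
    if environ is None:
        environ = os.environ
    return ["{0}={1}".format(name, environ.get(name, "<unset>")) for name in names]


if __name__ == "__main__":
    # The traceback in the log shows that stderr reaches the tasktracker log,
    # so the report is written there.
    for line in describe_environment():
        sys.stderr.write(line + "\n")
```

Running the same lines once from the interactive shell and once at the top of connected_subprocess.py under Oozie gives two reports that can be compared side by side.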

Answer 1:

I wonder if your script just doesn't have access to your PATH environment variable when it is executed through Oozie, and so has trouble locating the "hadoop" command. Can you try modifying the subprocess.check_output call in your Python script to use the full path to the hadoop command?
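As a rough sketch of that suggestion, the call could be made with an absolute path and an argument list instead of a shell string. The path in HADOOP_BIN is a hypothetical install location, not taken from the question; it would need to be replaced with whatever `which hadoop` prints in the shell where the script works. The helper names build_cat_command and read_hdfs_lines are also made up for illustration.

```python
import subprocess

# Assumption: hypothetical install location. Replace it with the output of
# `which hadoop` from the interactive shell where the script works.
HADOOP_BIN = "/usr/local/hadoop/bin/hadoop"


def build_cat_command(hdfs_path, hadoop_bin=HADOOP_BIN):
    """Build the argument list for `hadoop fs -cat <hdfs_path>`."""
    return [hadoop_bin, "fs", "-cat", hdfs_path]


def read_hdfs_lines(hdfs_path, hadoop_bin=HADOOP_BIN, run=subprocess.check_output):
    """Return the file's lines, calling hadoop by absolute path without a shell.

    `run` is injectable only so the function can be exercised without Hadoop.
    """
    # universal_newlines=True makes check_output return text rather than bytes.
    output = run(build_cat_command(hdfs_path, hadoop_bin), universal_newlines=True)
    return output.split("\n")
```

In the script from the question, the first assignment would then become `liste = read_hdfs_lines("/user/root/output-data/calcul-proba/final.txt")`. Note that the "Could not find or load main class" message is printed by the JVM, so if it still appears with the absolute path, the launcher is being found and the difference probably lies elsewhere in the environment rather than in PATH.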