Configuring Pig's relation with Hadoop


Question:

I'm having trouble understanding the relationship between Hadoop and Pig. I understand that Pig's purpose is to hide the MapReduce pattern behind a scripting language, Pig Latin.

What I don't understand is how Hadoop and Pig are linked. So far, the only installation procedures I have found seem to assume that Pig runs on the same machine as the main Hadoop node. Indeed, it uses the Hadoop configuration files.

Is this because Pig only translates the scripts into MapReduce code and sends it to Hadoop?

If that's the case, how could I configure Pig to make it send the scripts to a remote server?

If not, does that mean we always need to have Hadoop running alongside Pig?

Answer 1:

Pig can run in two modes:

  1. Local mode. In this mode the Hadoop cluster is not used at all: all processes run in a single JVM and files are read from the local filesystem. To run Pig in local mode, use the command:

    pig -x local 
    
  2. MapReduce mode. In this mode Pig compiles scripts into MapReduce jobs and runs them on a Hadoop cluster. This is the default mode; a small script example covering both modes appears after this list.

    The cluster can be local or remote. Pig uses the HADOOP_MAPRED_HOME environment variable to find the Hadoop installation on the local machine (see Installing Pig).

    If you want to connect to a remote cluster, you should specify the cluster parameters in the pig.properties file. Example for MRv1:

    fs.default.name=hdfs://namenode_address:8020/
    mapred.job.tracker=jobtracker_address:8021
    

    You can also specify the remote cluster address on the command line:

    pig -fs namenode_address:8020 -jt jobtracker_address:8021
    
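As a minimal sketch of both modes (the script name, input path, and output path below are placeholders, not from the original answer), the same Pig Latin script runs unchanged whether it is executed locally or compiled into MapReduce jobs:

    # Write a tiny word-count script; 'input.txt' and 'wordcount_out' are placeholders.
    cat > wordcount.pig <<'EOF'
    lines  = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO 'wordcount_out';
    EOF

    # Local mode: everything runs in one JVM against the local filesystem.
    pig -x local wordcount.pig

    # MapReduce mode (the default): the same script is compiled into MapReduce
    # jobs and submitted to the cluster found via your Hadoop configuration.
    pig wordcount.pig

Note that in local mode the paths refer to the local filesystem, while in MapReduce mode they refer to HDFS.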

Hence, you can install Pig on any machine and connect to a remote cluster. Pig includes a Hadoop client, so you don't have to install Hadoop to use Pig.
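Putting this together, here is a hedged sketch of pointing a standalone Pig installation at a remote MRv1 cluster by appending the properties shown above to Pig's own properties file. The $PIG_HOME/conf/pig.properties location and the hostnames are assumptions to adapt to your setup:

    # Assumed location of Pig's properties file; hostnames are placeholders.
    cat >> "$PIG_HOME/conf/pig.properties" <<'EOF'
    fs.default.name=hdfs://namenode_address:8020/
    mapred.job.tracker=jobtracker_address:8021
    EOF

    # Subsequent runs submit MapReduce jobs to that remote cluster by default.
    pig wordcount.pig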