I have lots of Hive tables stored in HDFS on a test cluster with 5 nodes. The data should be around 70 GB * 3 (replication). Now I want to transfer the whole setup to a different environment with many more nodes. A network connection between the two clusters is not possible.
The thing is that I don't have much time with the new cluster, and I also have no way to test the transfer in another test environment. Therefore I need a solid plan. :)
What options do I have?
How can I transfer the Hive setup with a minimum of configuration effort on the new cluster?
Is it possible to just copy the HDFS directories of the 5 nodes to 5 nodes of the new cluster, then add the rest of the nodes to the new cluster and start the balancer?
Without a network connection, it will be tricky!
I would:
- Copy the files out of HDFS onto some kind of removable storage (USB stick, external HDD, etc.)
- Move the storage to the new cluster
- Copy the files back into HDFS
Note that this won't preserve metadata like file creation/last access time, and, more importantly, ownership and permissions.
Small-scale testing of this process should be pretty simple.
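The three steps above can be sketched as a pair of shell functions. This is only a rough outline under stated assumptions: the HDFS path /user/hive/warehouse and the mount point /mnt/usb are hypothetical placeholders, and the run helper with its DRY_RUN switch is just an illustration aid for rehearsing the commands, not part of Hadoop.

```shell
#!/bin/sh
# Sketch: move HDFS data between clusters via removable storage.
# Both defaults below are assumptions; override them for your setup.
HIVE_DIR="${HIVE_DIR:-/user/hive/warehouse}"  # assumed HDFS directory holding the tables
MEDIA="${MEDIA:-/mnt/usb}"                    # assumed mount point of the removable disk

# Hypothetical helper: with DRY_RUN=1 the command is printed instead of executed,
# which makes a small-scale rehearsal possible without touching any data.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 (old cluster): copy the directory out of HDFS onto the disk.
export_from_hdfs() {
    run hadoop fs -get "$HIVE_DIR" "$MEDIA/"
}

# Step 3 (new cluster): copy the directory from the disk back into HDFS.
# Assumes the parent directory exists in HDFS and the target does not yet.
import_into_hdfs() {
    run hadoop fs -put "$MEDIA/$(basename "$HIVE_DIR")" "$(dirname "$HIVE_DIR")"
}
```

Call export_from_hdfs on the old cluster, move the disk, then call import_into_hdfs on the new one. Running both with DRY_RUN=1 first shows exactly which commands would be issued.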
If you can get (even temporarily) network connectivity between the two clusters, then distcp would be the way to go. It uses MapReduce to parallelise the transfers, potentially resulting in massive time savings.
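As a hedged illustration of such a distcp run, the snippet below only builds and prints the command so it can be reviewed first. The NameNode host names and the data path are hypothetical placeholders, port 8020 is an assumed NameNode RPC port, and the cap of 20 map tasks is an arbitrary value to tune.

```shell
#!/bin/sh
# Hypothetical NameNode hosts and data path; replace with your own.
SRC_NN="${SRC_NN:-source-namenode}"
DST_NN="${DST_NN:-destination-namenode}"
DATA_PATH="${DATA_PATH:-/user/hive/warehouse}"

# Print the distcp command for review before running it.
distcp_cmd() {
    # -pugp preserves user, group and permissions on the copied files;
    # -m limits the number of simultaneous map tasks (20 is an arbitrary choice).
    echo "hadoop distcp -pugp -m 20 hdfs://$SRC_NN:8020$DATA_PATH hdfs://$DST_NN:8020$DATA_PATH"
}
```

Calling distcp_cmd prints the full command line, which can then be pasted into a shell on a node that can reach both clusters. The -p option is worth noting here because, unlike a plain copy through external storage, it keeps ownership and permissions intact.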
You can copy directories and files from one cluster to another using the hadoop distcp command.
Here is a small example that describes its usage:
http://souravgulati.webs.com/apps/forums/topics/show/8534378-hadoop-copy-files-from-one-hadoop-cluster-to-other-hadoop-cluster
You can copy the data using this command:
sudo -u hdfs hadoop --config {PathtotheVpcCluster}/vpcCluster distcp hdfs://SourceIP:8020/user/hdfs/WholeData hdfs://DestinationIP:8020/user/hdfs/WholeData