Is it possible to tell HDFS where to store particular files?
Use case
I've just loaded batch #1 of files into HDFS and want to run a job/application on this data. However, I also have batch #2 that is still to be loaded. It would be nice if I could run the job/application on the first batch on, say, nodes 1 to 10, and load the new data onto, say, nodes 11 to 20, completely in parallel.
Initially I thought that NameNode federation (Hadoop 2.x) did exactly that, but it looks like federation only splits the namespace, while DataNodes still store blocks for all connected NameNodes.
So, is there a way to control the distribution of data in HDFS? And does it make sense at all?
Technically, you can, but I wouldn't.
If you want full control over where the data goes, you can extend BlockPlacementPolicy (see "How does HDFS choose a datanode to store"). This won't be easy to do, and I don't recommend it.

You can probably take steps to minimize the amount of traffic between your two sets of nodes with some clever setup to use rack-awareness to your advantage.
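To make the rack-awareness idea a bit more concrete, here is a minimal sketch of a topology script, assuming hypothetical hostnames node01 to node20 and made-up rack names /group-a and /group-b. Hadoop runs the script configured under net.topology.script.file.name (core-site.xml), passing one or more hosts as arguments, and expects one rack path per argument on standard output. Hadoop may pass IP addresses rather than hostnames, so a real script would need a mapping for those as well. Keep in mind that with a replication factor above 1 the default placement policy deliberately puts replicas on more than one rack, so this shapes traffic rather than isolating the two groups.

```python
#!/usr/bin/env python3
"""Sketch of an HDFS topology script: maps hosts to one of two 'racks'."""
import re
import sys

# Rack that Hadoop itself uses for hosts with no known location.
DEFAULT_RACK = "/default-rack"


def rack_for(host):
    """Return a rack path for a hostname such as 'node07' (hypothetical naming)."""
    match = re.fullmatch(r"node(\d+)(\..*)?", host)
    if match is None:
        # Unknown naming scheme (for example a raw IP address).
        return DEFAULT_RACK
    number = int(match.group(1))
    if 1 <= number <= 10:
        return "/group-a"  # nodes that run the job on batch #1
    if 11 <= number <= 20:
        return "/group-b"  # nodes that receive batch #2
    return DEFAULT_RACK


def main(hosts):
    # One rack path per argument, in the same order, separated by spaces.
    print(" ".join(rack_for(h) for h in hosts))


if __name__ == "__main__":
    main(sys.argv[1:])
```

The script has to be executable and present on the NameNode, and the NameNode needs a restart to pick up a changed topology configuration.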