I have been using Pig with my Cassandra data to do all kinds of amazing feats of grouping that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop & Cassandra, and I have to say it is quite impressive. Hats off to those guys!!
I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table with ~53M rows (about 350 bytes each), and I notice that the map phase takes a very long time to grind through those 53M rows. I started poking around the logs and I can see that the map is spilling repeatedly (I counted 177 spills from the mapper), and I think this is part of the problem.
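As a sanity check on that spill count, here is my back-of-envelope math. It assumes the stock `io.sort.mb` of 100 MB and that roughly the whole row passes through the map output; both are assumptions on my part:

```python
# Back-of-envelope: is 177 spills consistent with a 100 MB map-side sort buffer?
rows = 53_000_000
row_bytes = 350            # rough average row size (my estimate)
sort_buffer_mb = 100       # Hadoop's io.sort.mb default, assumed unchanged

total_mb = rows * row_bytes / (1024 * 1024)
print(round(total_mb))                       # total map output in MB
print(round(total_mb / sort_buffer_mb))      # expected number of spills
```

That works out to roughly 17.7 GB of map output and ~177 spills, which lines up suspiciously well with what I see in the logs.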
The combination of CassandraInputFormat and JobConfig creates only a single input split, so this one mapper has to read 100% of the rows from the table. I call this anti-parallel :)
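For scale, here is what I would naively expect, assuming the default `cassandra.input.split.size` of 64k rows (which is what I believe ConfigHelper falls back to when nothing is set; that's an assumption):

```python
import math

rows = 53_000_000
split_size = 64 * 1024   # cassandra.input.split.size default, if I read ConfigHelper right

# Number of input splits (and hence mappers) I'd naively expect
print(math.ceil(rows / split_size))
```

So somewhere around 800 mappers would seem reasonable, not one.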
Now, there are a lot of gears at work in this picture, including:
- 2 physical nodes
- The Hadoop node is in an "Analytics" DC (default config), but physically in the same rack.
- I can see the job using LOCAL_QUORUM
Can anybody point me in the direction of how to get Pig to create more input splits so I can run more mappers? I have 23 map slots; it seems a pity to use only one all the time.
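For concreteness, the only knobs I have found so far look like this. This is just a sketch: I am not certain Pig actually forwards the property to the Cassandra input format, the `split_size` URL parameter is taken from the CqlStorage docs, and the keyspace/table names are placeholders:

```pig
-- Sketch only: property name from org.apache.cassandra.hadoop.ConfigHelper;
-- split_size parameter as documented for CqlStorage. my_keyspace/my_table
-- are hypothetical names. Unverified on my cluster.
SET cassandra.input.split.size 65536;
rows = LOAD 'cql://my_keyspace/my_table?split_size=65536' USING CqlStorage();
```

If either of these is the right lever (or if I'm holding it wrong), I'd love to hear it.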
Or, am I completely mad and don't understand the problem? I welcome both kinds of answers!