Pig & Cassandra & DataStax Splits Control

Posted 2020-03-24 06:28

Question:

I have been using Pig with my Cassandra data to do all kinds of amazing feats of grouping that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop & Cassandra, and I have to say it is quite impressive. Hats off to those guys!!

I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table with ~53M rows (about 350 bytes each), and I notice that the Mapper takes a very long time to grind through those 53M rows. Poking around the logs, I can see that the map is spilling repeatedly (I saw 177 spills from the mapper), and I think this is part of the problem.

The combination of CassandraInputFormat and JobConfig creates only a single mapper, so that one mapper has to read 100% of the rows in the table. I call this anti-parallel :)

Now, there are a lot of gears at work in this picture, including:

  • 2 physical nodes
  • The Hadoop node is in an "Analytics" DC (default config), but is physically in the same rack.
  • I can see the job using LOCAL_QUORUM

Can anybody point me in the direction of how to get Pig to create more input splits so I can run more mappers? I have 23 slots; it seems a pity to use only one all the time.

Or, am I completely mad and don't understand the problem? I welcome both kinds of answers!

Answer 1:

You should set pig.noSplitCombination = true. You can do this in one of three places.

When invoking the script:

dse pig -Dpig.noSplitCombination=true /path/to/script.pig

In the Pig script itself:

SET pig.noSplitCombination true
table = LOAD 'cql://ks/cf' USING CqlStorage();

Or permanently in /etc/dse/pig/pig.properties. Uncomment:

pig.noSplitCombination=true

Otherwise, Pig may combine all splits into one, and the job log will report "Total input paths (combined) to process : 1".
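
To put the script-level option in context, here is a minimal sketch of a complete script; the keyspace ks, the table cf, and the aggregation itself are placeholders, not taken from the original question:

-- Disable Pig's split combination so each Cassandra input split gets its own mapper
SET pig.noSplitCombination true

-- 'ks' and 'cf' are placeholder keyspace/table names
rows = LOAD 'cql://ks/cf' USING CqlStorage();

-- A trivial aggregation, just to exercise the mappers
grouped = GROUP rows ALL;
counted = FOREACH grouped GENERATE COUNT(rows);
DUMP counted;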



Answer 2:

You can set cassandra.input.split.size to something less than 65536 (64k rows, the default split size) to get more splits. How many rows per node does the CQL table have? Can you post your table schema?

Add split_size to the URL parameters.

For CassandraStorage, use the following URL format:

cassandra://[username:password@]<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>[&reversed=true][&limit=1][&allow_deletes=true][&widerows=true][&use_secondary=true][&comparator=<comparator>][&split_size=<size>][&partitioner=<partitioner>][&init_address=<host>][&rpc_port=<port>]]

For CqlStorage, use the following URL format:

cql://[username:password@]<keyspace>/<columnfamily>[?[page_size=<size>][&columns=<col1,col2>][&output_query=<prepared_statement_query>][&where_clause=<clause>][&split_size=<size>][&partitioner=<partitioner>][&use_secondary=true|false][&init_address=<host>][&rpc_port=<port>]]
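
For example, a minimal sketch that lowers the split size to 8192 rows (ks and cf are again placeholder names, and the value itself is illustrative):

-- split_size is measured in rows, not bytes; 8192 is an illustrative value
rows = LOAD 'cql://ks/cf?split_size=8192' USING CqlStorage();

Since split size counts rows, ~53M rows at 8192 rows per split would yield on the order of 6,500 splits, so choose a value that matches the number of map tasks you actually want.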



Answer 3:

Setting pig.noSplitCombination = true takes me to the other extreme: with this flag I ended up with 769 map tasks.
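
If 769 tasks is more parallelism than the cluster can use, a middle ground may be to keep split combination off but raise the split size. Assuming cassandra.input.split.size can be passed on the command line the same way as the Pig property above (an assumption, not confirmed by the answers here), something like:

# the 1048576-row split size is illustrative, not a recommendation
dse pig -Dpig.noSplitCombination=true -Dcassandra.input.split.size=1048576 /path/to/script.pig

For what it's worth, 769 tasks is roughly what the default 65536-row split size predicts for ~53M rows (53M / 65536 ≈ 800), so the flag appears to be working as intended; tuning the split size is how you land between the two extremes.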