Querying Large Datasets in Cassandra

Posted 2019-08-09 22:35

Question:

I am an RDBMS programmer by experience, working on a scientific research problem involving genomic data. I was assigned to explore Cassandra because we needed a scalable, cheap (free) Big Data solution. Setting Cassandra up and loading it with data was seductively trivial, much like my experience with traditional databases such as Oracle and MySQL. My problem is finding a simple strategy for querying the data, since that is a fundamental requirement for any data repository. The data I am working with is mutation datasets, which contain positional information as well as calculated numerical measures about each variant. I set up an initial static column family that looks like this:

CREATE TABLE variant (
chrom text,
pos int,
ref text,
alt text,
aa text,
ac int,
af float,
afr_af text,
amr_af text,
an int,
asn_af text,
avgpost text,
erate text,
eur_af text,
ldaf text,
mutation_id text,
patient_id int,
rsq text,
snpsource text,
theta text,
vt text,
PRIMARY KEY (chrom, pos, ref, alt)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX af_variant_idx ON variant (af);

As you can see, there is a natural primary key of positional data (chrom, pos, ref, and alt), but that key is not meaningful from a querying point of view. Much more interesting to my clients right now is extracting the rows whose 'AF' value lies below a given threshold. I am using Java RESTful services to interact with this database via the CQL JDBC driver. It quickly became apparent that directly querying this table by AF would not work, since it seems the SELECT statement must identify the row keys you want to look at. I found some confusing discussions on this point, but since there are fewer than 100 distinct AF values, what I decided to do was build a lookup table keyed by AF.
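To make the failure concrete, this is roughly what I ran into (the exact error text varies by Cassandra version):

-- Works: a secondary index supports equality lookups
SELECT * FROM variant WHERE af = 0.05;

-- Rejected: a range predicate on the indexed column (newer releases
-- will only run this with ALLOW FILTERING, which scans every node)
SELECT * FROM variant WHERE af < 0.05;

The lookup table I built looks like this: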

CREATE TABLE af_lookup (
  af_id float,
  column1 text,
  column2 text,
  value text,
  PRIMARY KEY (af_id, column1, column2)
 ) WITH COMPACT STORAGE AND
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'SnappyCompressor'};

This was meant to be a dynamic table with very wide rows, and I populated it from the data stored in my static column family. The 'AF' value is the row key, and the compound key from the other table, concatenated with '-' (e.g. 1-129-T-G), is stored as a string in a dynamic column name. This worked OK, but I still do not understand how all of these pieces fit together. Dynamic column families seem to only work as advertised through CQL 2, but I really need operators like >, <, >= and <=. It seems theoretically possible, but in the last 4 weeks of trying a number of different tools (Astyanax as well as the JDBC driver) I have not found a solution.
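For example, the query my clients actually need is a straightforward range scan on AF, which CQL rejects because af_id is the partition key (under the default RandomPartitioner, a partition key only supports equality or token() comparisons):

-- What I need: every variant whose AF is at or below a threshold
SELECT column1 FROM af_lookup WHERE af_id <= 0.01;

-- The only range CQL allows on a partition key compares tokens,
-- which are ordered by hash rather than by AF value, so it does
-- not express 'AF below a threshold'
SELECT column1 FROM af_lookup WHERE token(af_id) > token(0.01);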

I have two primary problems. The first is the RPC timeout limitation when querying these data, which can produce tens of thousands to millions of records. The second is how to present these data in HTML by fetching only what has not been shown already (previous/next links), similar to the way OpsCenter displays column family records. That does not seem possible given the functional limitation of not being able to use >, <, >= and <=. Based on my experience, this is probably a gap in my understanding of how the product really works rather than a lack of capability in the product (databases would not be very useful if they were only capable of handling writes well).
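To make the paging problem concrete: within a single AF partition I can resume after the last row shown by ranging on the clustering column, but I see no way to step across AF values (the page size of 20 and the AF value here are hypothetical):

-- Valid: the next 20 rows inside one AF partition, resuming after
-- the last variant key already rendered
SELECT column1 FROM af_lookup
WHERE af_id = 0.0042 AND column1 > '1-129-T-G'
LIMIT 20;

-- Invalid: stepping on to 'all partitions with a smaller AF' needs
-- a range on the partition key, which CQL rejects under the
-- random partitioner
SELECT column1 FROM af_lookup WHERE af_id < 0.0042 LIMIT 20;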

Has anyone out there encountered this issue and solved it before? I would really appreciate an example of how to implement a C* solution using Java web services to display a large number of results that have to be paginated through.

Answer 1:

You may want to explore PlayOrm for Cassandra, as it can resolve your timeout and pagination problems. PlayOrm returns a cursor when you query: your first page reads in the first 20 results and displays them, and the next page can reuse the same cursor from your session, picking up right where it left off without rescanning the first 20 rows.

Visit http://buffalosw.com/wiki/An-example-to-begin-with-PlayOrm/ for a cursor example and http://buffalosw.com/products/playorm/ for all features and more details about PlayOrm.
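Independent of PlayOrm, the underlying cursor idea can also be expressed in plain CQL if you remodel AF as a clustering column instead of the partition key. This is only a sketch: variant_by_af and bucket are made-up names, and the tuple predicate on clustering columns assumes Cassandra 2.0 or later.

-- AF as a clustering column under a synthetic bucket partition, so
-- it supports range predicates and ordered paging
CREATE TABLE variant_by_af (
    bucket int,          -- a fixed value (or small hash) to spread rows
    af float,
    variant_key text,    -- e.g. '1-129-T-G'
    PRIMARY KEY (bucket, af, variant_key)
);

-- Page 1: the first 20 variants below the threshold
SELECT af, variant_key FROM variant_by_af
WHERE bucket = 0 AND af < 0.01
LIMIT 20;

-- Page 2: resume strictly after the last row of the previous page;
-- the client stops paging once af reaches the threshold
SELECT af, variant_key FROM variant_by_af
WHERE bucket = 0 AND (af, variant_key) > (0.0042, '1-129-T-G')
LIMIT 20;

A single bucket value puts every row in one partition, so a real dataset would likely need several buckets queried in turn or in parallel. Because each request reads at most one page of rows, this pattern also keeps every query well under the RPC timeout.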