Get distinct partition keys from C* table

2019-03-28 01:44发布

问题:

cqlsh doesn't allow nested queries so I cant export selected data to csv.. I'm trying to export the selected data (about 200,000 rows with a single column) from cassandra using:

echo "SELECT distinct imei FROM listener.snapshots;" > select.cql bin/cqlsh -f select.cql > output.txt

and it just stuck forever without any error, and the file isn't growing.

if I use strace on the last line I got many rows like:

select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)

and --debug give me only:

cqlsh --debug -f select.cql > output.txt

Using CQL driver: <module 'cassandra' from '/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.5.1.zip/cassandra-driver-2.5.1/cassandra/__init__.py'>

what is wrong? Is there better way to get distinct partition keys from large C* table?

回答1:

I used capture:

cqlsh> CAPTURE 'temp.csv'                                              
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;
---MORE---
---MORE---
---MORE---
---MORE---
.
.
.
cqlsh> 
cqlsh>

And press enter until it finished.

Even faster option is to use paging:

cqlsh> PAGING off
Disabled Query paging.
cqlsh> CAPTURE 'temp.csv'                                              
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;

It would immediately extract the data to the file (if you get a OperationTimedOut you should edit the timeout settings in cassandra.yaml).

I cant believe that it is the fasts way there is... I know I can export data using spark by using CassandraSQLContext but its not so fast when I need to create the rdd querying C* for distinct column out of very large table(2B rows~), and print them to file:

    val conf = new SparkConf().setAppName("ExtractDistinctImeis")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val connector = CassandraConnector(conf)
    val cc = new CassandraSQLContext(sc)

    val snapshots_imeis = cc.sql("select distinct imei from listener.snapshots").map(row => row(0).toString)

    val imeis = snapshots_imeis.collect 

    def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
        val p = new java.io.PrintWriter(f)
        try { op(p) } finally { p.close() }
    }

    printToFile(new File("/path/to/file.txt")) { p => imeis.foreach(p.println) }

It took 3.5 hours with spark! With capture I manage to get my file after 3 min/3 sec.



回答2:

Usually you would use the cqlsh "COPY ... TO ..." command to export data from a table to a csv file.

See the documentation here.

I'm not sure why what you're trying is getting stuck. To debug I'd suggest not using DISTINCT and adding a LIMIT clause. It may be your table is large and so cqlsh is paging the results, but there is no one there to press return to get to the next page of results, so it waits forever.