cqlsh doesn't allow nested queries so I cant export selected data to csv..
I'm trying to export the selected data (about 200,000 rows with a single column) from cassandra using:
echo "SELECT distinct imei FROM listener.snapshots;" > select.cql
bin/cqlsh -f select.cql > output.txt
and it just stuck forever without any error, and the file isn't growing.
if I use strace on the last line I got many rows like:
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
and --debug give me only:
cqlsh --debug -f select.cql > output.txt
Using CQL driver: <module 'cassandra' from '/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.5.1.zip/cassandra-driver-2.5.1/cassandra/__init__.py'>
what is wrong?
Is there better way to get distinct partition keys from large C* table?
I used capture:
cqlsh> CAPTURE 'temp.csv'
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;
---MORE---
---MORE---
---MORE---
---MORE---
.
.
.
cqlsh>
cqlsh>
And press enter until it finished.
Even faster option is to use paging:
cqlsh> PAGING off
Disabled Query paging.
cqlsh> CAPTURE 'temp.csv'
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;
It would immediately extract the data to the file (if you get a OperationTimedOut you should edit the timeout settings in cassandra.yaml).
I cant believe that it is the fasts way there is... I know I can export data using spark by using CassandraSQLContext but its not so fast when I need to create the rdd querying C* for distinct column out of very large table(2B rows~), and print them to file:
val conf = new SparkConf().setAppName("ExtractDistinctImeis")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val connector = CassandraConnector(conf)
val cc = new CassandraSQLContext(sc)
val snapshots_imeis = cc.sql("select distinct imei from listener.snapshots").map(row => row(0).toString)
val imeis = snapshots_imeis.collect
def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
val p = new java.io.PrintWriter(f)
try { op(p) } finally { p.close() }
}
printToFile(new File("/path/to/file.txt")) { p => imeis.foreach(p.println) }
It took 3.5 hours with spark! With capture I manage to get my file after 3 min/3 sec.
Usually you would use the cqlsh "COPY ... TO ..." command to export data from a table to a csv file.
See the documentation here.
I'm not sure why what you're trying is getting stuck. To debug I'd suggest not using DISTINCT and adding a LIMIT clause. It may be your table is large and so cqlsh is paging the results, but there is no one there to press return to get to the next page of results, so it waits forever.