The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I have it working fine with S3, but I suspect I'm missing something very simple for Google Cloud Storage.
Does anyone have an example Storage Plugin configuration for Google Cloud Storage?
Thanks
M
This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone who wants to do this without Dataproc, here's what worked for me:
Start Apache Drill.
Add a custom storage plugin to Drill.
You're good to go.
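For the "add a custom storage" step, here is a rough sketch of what the plugin definition could look like, registered through Drill's REST API (`POST /storage/{name}.json`) instead of the web console. The plugin name `gcs`, the bucket `my-bucket`, and the helper names are placeholders I made up for illustration, and the sketch assumes the GCS Hadoop connector jar is already on Drill's classpath with credentials configured. It is modeled on the usual file-type plugin layout (the same shape as the S3 one), not copied from a tested setup.

```python
import json
import urllib.request


def build_gcs_plugin(name, bucket):
    """Build the request body for a file-type storage plugin pointing at a GCS bucket.

    `name` and `bucket` are hypothetical values; substitute your own.
    """
    return {
        "name": name,
        "config": {
            "type": "file",
            "enabled": True,
            "connection": f"gs://{bucket}",
            "workspaces": {
                "root": {"location": "/", "writable": False, "defaultInputFormat": None}
            },
            "formats": {"parquet": {"type": "parquet"}},
        },
    }


def register_plugin(drill_url, plugin):
    """POST the plugin definition to a running Drillbit (web port is 8047 by default)."""
    request = urllib.request.Request(
        f"{drill_url}/storage/{plugin['name']}.json",
        data=json.dumps(plugin).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With Drill running locally, something like `register_plugin("http://localhost:8047", build_gcs_plugin("gcs", "my-bucket"))` should create the plugin. The `config` part of the dictionary, dumped as JSON, is also what you would paste into the web console's storage plugin editor.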
The solution is from here, where I detail some more about what we do around data exploration with Apache Drill.
I managed to query parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. In order to set that up, I took the following steps:
Install Drill and make the GCS connector accessible (this can be used as an init script for Dataproc; note that it wasn't really tested and relies on a local ZooKeeper instance):
Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note I copied most of it from the S3 config and made minor changes):
Query using the following syntax (note the backticks):
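As an illustration of the backtick syntax, here is a minimal sketch that builds such a query and submits it through Drill's REST API (`POST /query.json`). The plugin name `gcs`, the workspace `root`, the file path, and the helper names are assumptions for the example, not values taken from a real cluster.

```python
import json
import urllib.request


def build_query(plugin, workspace, path, limit=10):
    """Compose a Drill query; the workspace and the file path are each wrapped in backticks."""
    return f"SELECT * FROM {plugin}.`{workspace}`.`{path}` LIMIT {limit}"


def run_query(drill_url, sql):
    """Submit the SQL to a running Drillbit and return the parsed JSON response."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    request = urllib.request.Request(
        f"{drill_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `build_query("gcs", "root", "data/events.parquet")` gives ``SELECT * FROM gcs.`root`.`data/events.parquet` LIMIT 10``, which can also be pasted directly into the Drill console or sqlline. The backticks matter because the path contains slashes and dots that Drill would otherwise try to parse as identifiers.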