I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I have observed some very strange behavior. Let me explain:
Introduction
I've launched a Hadoop cluster on Google Cloud (one-click deployment), set up to use a bucket.
When I SSH into the master and add a file with the hdfs command, I can see it immediately in my bucket:
$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------ 3 hadoop hadoop 40 2014-11-27 13:45 /test.txt
But when I try to add and then read a file from my own computer, it seems to use some other HDFS. Here I added a file called jp.txt, and the listing doesn't show my previous file test.txt:
$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r-- 3 jp supergroup 0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt
That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/
When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt and not jp.txt.
I read Hadoop cannot connect to Google Cloud Storage and configured my Hadoop client accordingly (pretty hard stuff), and now I can see items in my bucket. But for that, I need to use a gs:// URI:
$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------ 3 jp jp 40 2014-11-27 14:45 gs://my-bucket/test.txt
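For reference, what I ended up putting in my client's core-site.xml is roughly the following. This is only a sketch: the property names come from the GCS connector docs, and the exact authentication properties depend on the connector version you install.
<!-- core-site.xml (Hadoop client): sketch of the GCS connector settings -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-project-id</value>
</property>
<!-- plus credentials (e.g. a service account key); those property
     names vary between connector versions -->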
Observation / Intermediate conclusion
So it seems there are two different storage engines in the same cluster: "traditional HDFS" (paths starting with hdfs://) and a Google Cloud Storage bucket (paths starting with gs://).
Users and rights are different, depending on where you are listing files from.
Question(s)
The main question is: what is the minimal setup needed to write to HDFS/GCS on Google Cloud Storage with Flume?
Related questions
- Do I need to run a Hadoop cluster on Google Cloud to achieve my goal, or not?
- Is it possible to write directly to a Google Cloud Storage bucket? If so, how can I configure Flume (adding JARs, redefining the classpath...)? See the sketch after this list.
- How come there are two storage engines in the same cluster (classical HDFS / GCS bucket)?
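For the JAR/classpath part, what I have in mind is something like this in flume-env.sh (only a sketch: the jar name and paths are assumptions based on the connector I downloaded, and /etc/hadoop/conf is where my core-site.xml lives):
# flume-env.sh: put the GCS connector and the Hadoop config on Flume's classpath
# (jar name and paths are assumptions; adjust to your installation)
FLUME_CLASSPATH="/usr/local/lib/gcs-connector-1.3.0-hadoop2.jar:/etc/hadoop/conf"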
My Flume configuration
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Does the line a1.sinks.hdfs_sink.hdfs.path accept a gs:// path?
What setup would it need in that case (additional JARs, classpath)?
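In other words, would a change like the following be enough (just a sketch, assuming the connector JAR and the core-site.xml above are on Flume's classpath)?
# sketch: same HDFS sink, pointed at the bucket instead of hdfs://
a1.sinks.hdfs_sink.hdfs.path = gs://my-bucket/%{env}/%{tenant}/%{type}/%y-%m-%d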
Thanks