I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I observed a very strange behavior. Let me explain:
Introduction
I've launched a hadoop cluster on google cloud (one click) set up to use a bucket.
When I ssh on the master and add a file with hdfs
command, I can see it immediately in my bucket
$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------ 3 hadoop hadoop 40 2014-11-27 13:45 /test.txt
But when I try to add then read from my computer, it seems to use some other HDFS. Here I added a file called jp.txt
, and it doesn't show my previous file test.txt
$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r-- 3 jp supergroup 0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt
That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/
When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt
and not jp.txt
.
I read Hadoop cannot connect to Google Cloud Storage and I configured my hadoop client accordingly (pretty hard stuff) and now I can see items in my bucket. But for that, I need to use a gs://
URI
$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------ 3 jp jp 40 2014-11-27 14:45 gs://my-bucket/test.txt
Observation / Intermediate conclusion
So it seems here there are 2 different storages engine in the same cluster: "traditional HDFS" (starting with hdfs://
) and a Google storage bucket (starting with gs://
).
Users and rights are different, depending on where you are listing files from.
Question(s)
The main question is: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume ?
Related questions
- Do I need to launch a Hadoop cluster on Google Cloud or not to achieve my goal?
- Is it possible to write directly to a Google Cloud Storage Bucket ? If yes, how can I configure flume? (adding jars, redefining classpath...)
- How come there are 2 storage engine in the same cluster (classical HDFS / GS bucket)
My flume configuration
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Does the line a1.sinks.hdfs_sink.hdfs.path accept a gs://
path ?
What setup would it need in that case (additional jars, classpath) ?
Thanks
As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the
scheme://
of the URI you use withhadoop fs
. The cluster you deployed on Google Compute Engine also has both filesystems available, it just happens to have the "default" set togs://your-configbucket
.The reason you had to include the
gs://configbucket/file
instead of just plain/file
on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop'score-site.xml
, settingfs.default.name
to begs://configbucket/
. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out/home/hadoop/hadoop-install/core-site.xml
for a reference of what you might carry over to your local setup.To explain the internals of Hadoop a bit, the reason
hdfs://
paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop'score-site.xml
file, which by default sets:Similarly, you may have noticed that to get
gs://
to work on your local cluster, you providedfs.gs.impl
. This is because DistribtedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interfaceFileSystem
, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally usehdfs://
you should be able to usegs://
.So, to answer your questions:
hdfs://
paths withgs://
paths instead, and/or settingfs.default.name
to be your rootgs://configbucket
path.appends
to existing files or symlinks.