Flume - Can an entire file be considered an event

Posted 2019-03-22 06:27

Question:

I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple directory spooling in Flume, where I specified the source, sink and channel, and it works fine. The disadvantage is that I would have to maintain a separate spooling directory for each file type so that each type lands in its own HDFS folder and I keep control over file sizes and other parameters, which makes the configuration repetitive, though easy.

As an alternative, I was advised to use regex interceptors, where multiple file types would reside in a single directory and each file would be routed to a specific directory in HDFS based on a string in the file. The kind of files I am expecting are CSV files where the first line is the header and the subsequent lines are comma-separated values.

With this in mind, I have a few questions.

  1. How do interceptors handle files?
  2. Given that one CSV's header line would be ID, Name, followed by lines of IDs and names, while another file in the same directory would have a header of Name, Address, followed by lines of names and addresses, what would the interceptor and channel configuration look like to route them to different HDFS directories?
  3. How does an interceptor handle the subsequent lines that clearly do not match the regex?
  4. Would an entire file even constitute one event or is it possible that one file can actually be multiple events?

Please let me know. Thanks!

Answer 1:

For starters, Flume doesn't work on files as such, but on things called events. An event consists of a map of string headers plus a byte-array body that can contain anything: usually a single line, but in your case it could be an entire file.

An interceptor gives you the ability to extract information from an event and add it to that event's headers, which you can then use to configure a target directory structure.

In your specific case, you would want to write a parser that analyses the content of your event and sets a header value, for instance subpath:

// Decode the event body and set a routing header based on its content
String line = new String(event.getBody(), StandardCharsets.UTF_8);
if (line.contains("Address")) {
    event.getHeaders().put("subpath", "address");
} else if (line.contains("ID")) {
    event.getHeaders().put("subpath", "id");
}
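
That snippet would live inside a custom interceptor. Below is a minimal sketch of what the full class might look like, built against the Flume Interceptor API; the package and class names (com.example.flume.SubpathInterceptor) are illustrative:

package com.example.flume;

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class SubpathInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // no setup needed
    }

    @Override
    public Event intercept(Event event) {
        // Decode the body and derive the routing header from its content
        String line = new String(event.getBody(), StandardCharsets.UTF_8);
        if (line.contains("Address")) {
            event.getHeaders().put("subpath", "address");
        } else if (line.contains("ID")) {
            event.getHeaders().put("subpath", "id");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // no resources to release
    }

    // Flume instantiates interceptors through a nested Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new SubpathInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no configuration parameters in this sketch
        }
    }
}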

You can then reference that header in your hdfs-sink configuration as follows:

hdfs-a1.sinks.hdfs-sink.hdfs.path = hdfs://cluster/path/%{subpath}
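
To wire the interceptor into an agent, the configuration might look roughly like this; the agent, source and channel names are illustrative, and the interceptor type points at the Builder from the sketch above:

# Hypothetical agent: spooling directory source -> memory channel -> HDFS sink
hdfs-a1.sources = spool-src
hdfs-a1.channels = mem-ch
hdfs-a1.sinks = hdfs-sink

hdfs-a1.sources.spool-src.type = spooldir
hdfs-a1.sources.spool-src.spoolDir = /var/flume/incoming
hdfs-a1.sources.spool-src.channels = mem-ch
hdfs-a1.sources.spool-src.interceptors = i1
hdfs-a1.sources.spool-src.interceptors.i1.type = com.example.flume.SubpathInterceptor$Builder

hdfs-a1.channels.mem-ch.type = memory

hdfs-a1.sinks.hdfs-sink.type = hdfs
hdfs-a1.sinks.hdfs-sink.channel = mem-ch
hdfs-a1.sinks.hdfs-sink.hdfs.path = hdfs://cluster/path/%{subpath}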

As to your question whether an entire file can constitute a single event: yes, that's possible, but not with the spooling directory source, whose default deserializer turns each line into its own event. You would have to implement a client class that speaks to a configured Avro source, read each file into a single event, and send it off. You could then also set the headers there instead of using an interceptor.
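
Here is a minimal sketch of such a client using the RpcClient from the Flume SDK; the host, port, file path and header value are assumptions:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FileToAvroClient {
    public static void main(String[] args) throws Exception {
        // Connect to the agent's Avro source (hypothetical host and port)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // Read the whole file into one event body
            byte[] body = Files.readAllBytes(Paths.get("/data/incoming/ids.csv"));

            // Set the routing header directly, no interceptor needed
            Map<String, String> headers = new HashMap<>();
            headers.put("subpath", "id");

            // Send the entire file as a single event
            Event event = EventBuilder.withBody(body, headers);
            client.append(event);
        } finally {
            client.close();
        }
    }
}

Note that sending a whole file as one event means the file must fit comfortably in memory, and the channel's transaction capacity must accommodate it.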