I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple directory spooling in Flume, where I specified the source, sink, and channel, and it works fine. The disadvantage is that I would have to maintain a separate spooling directory for each file type that goes into a distinct HDFS folder in order to get greater control over file sizes and other parameters, which makes the configuration repetitive, though easy.
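For context, the POC configuration was roughly the sketch below, repeated once per file type (the agent, directory, and path names here are placeholders, not my real values):

```
# One spooldir source feeding an HDFS sink, repeated per file type
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Watch a local directory for new files
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /data/incoming/type1
agent1.sources.src1.channels = ch1

# File channel for durability
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

# Write to a type-specific HDFS folder, rolling by size only
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/ingest/type1
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.rollSize = 134217728
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.rollInterval = 0
```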
As an alternative, I was advised to use regex interceptors, where multiple file types would reside in a single directory and, based on a string in the file, each would be routed to a specific directory in HDFS. The files I am expecting are CSV files where the first line is the header and the subsequent lines are comma-separated values.
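Based on that advice, my rough (untested) attempt looks something like the sketch below, using a `regex_extractor` interceptor to pull the first header field into an event header and a multiplexing channel selector to fan out to per-type channels and sinks (all names and paths are placeholders):

```
# Single spooling source, two channels, two HDFS sinks
agent1.sources = src1
agent1.channels = chId chName
agent1.sinks = sinkId sinkName

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /data/incoming
agent1.sources.src1.channels = chId chName

# Extract the first field of a header line into an event header "ftype"
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = regex_extractor
agent1.sources.src1.interceptors.i1.regex = ^(ID|Name),
agent1.sources.src1.interceptors.i1.serializers = s1
agent1.sources.src1.interceptors.i1.serializers.s1.name = ftype

# Route events by the extracted header value
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = ftype
agent1.sources.src1.selector.mapping.ID = chId
agent1.sources.src1.selector.mapping.Name = chName
# Lines that do not match the regex carry no "ftype" header;
# presumably they fall through to the default channel, which is
# exactly what I am unsure about (see my questions below)
agent1.sources.src1.selector.default = chId

agent1.channels.chId.type = memory
agent1.channels.chName.type = memory

agent1.sinks.sinkId.type = hdfs
agent1.sinks.sinkId.channel = chId
agent1.sinks.sinkId.hdfs.path = hdfs://namenode:8020/ingest/ids
agent1.sinks.sinkName.type = hdfs
agent1.sinks.sinkName.channel = chName
agent1.sinks.sinkName.hdfs.path = hdfs://namenode:8020/ingest/names
```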
With this in mind, I have a few questions.
- How do interceptors handle files?
- Given that the header line in one CSV would be `ID, Name`, followed on subsequent lines by IDs and names, while another file in the same directory would have the header `Name, Address`, followed on subsequent lines by names and addresses, what would the interceptor and channel configuration look like to route each file to a different HDFS directory?
- How does an interceptor handle the subsequent lines that clearly do not match the regex expression?
- Would an entire file constitute one event, or is it possible that one file actually becomes multiple events?
Please let me know. Thanks!