In Apache Flink I have a stream of tuples. Let's assume a really simple Tuple1<String>
. The tuple can have an arbitrary value in it's value field (e.g. 'P1', 'P2', etc.). The set of possible values is finite but I don't know the full set beforehand (so there could be a 'P362'). I want to write that tuple to a certain output location depending on the value inside of the tuple. So e.g. I would like to have the following file structure:
/output/P1
/output/P2
In the documentation I only found possibilities to write to locations that I know beforehand (e.g. stream.writeCsv("/output/somewhere")
), but no way of letting the contents of the data decide where the data is actually ending up.
I read about output splitting in the documentation but this doesn't seem to provide a way to redirect the output to different destinations the way I would like to have it (or I just don't understand how this would work).
Can this be done with the Flink API, if so, how? If not, is there maybe a third party library that can do it or would I have to build such a thing on my own?
Update
Following Matthias' suggestion I came up with a sifting sink function which determines the output path and then writes the tuple to the respective file after serializing it. I put it here for reference, maybe it is useful for someone else:
public class SiftingSinkFunction<IT> extends RichSinkFunction<IT> {
private final OutputSelector<IT> outputSelector;
private final MapFunction<IT, String> serializationFunction;
private final String basePath;
Map<String, TextOutputFormat<String>> formats = new HashMap<>();
/**
* @param outputSelector the selector which determines into which output(s) a record is written.
* @param serializationFunction a function which serializes the record to a string.
* @param basePath the base path for writing the records. It will be appended with the output selector.
*/
public SiftingSinkFunction(OutputSelector<IT> outputSelector, MapFunction<IT, String> serializationFunction, String basePath) {
this.outputSelector = outputSelector;
this.serializationFunction = serializationFunction;
this.basePath = basePath;
}
@Override
public void invoke(IT value) throws Exception {
// find out where to write.
Iterable<String> selection = outputSelector.select(value);
for (String s : selection) {
// ensure we have a format for this.
TextOutputFormat<String> destination = ensureDestinationExists(s);
// then serialize and write.
destination.writeRecord(serializationFunction.map(value));
}
}
private TextOutputFormat<String> ensureDestinationExists(String selection) throws IOException {
// if we know the destination, we just return the format.
if (formats.containsKey(selection)) {
return formats.get(selection);
}
// create a new output format and initialize it from the context.
TextOutputFormat<String> format = new TextOutputFormat<>(new Path(basePath, selection));
StreamingRuntimeContext context = (StreamingRuntimeContext) getRuntimeContext();
format.configure(context.getTaskStubParameters());
format.open(context.getIndexOfThisSubtask(), context.getNumberOfParallelSubtasks());
// put it into our map.
formats.put(selection, format);
return format;
}
@Override
public void close() throws IOException {
Exception lastException = null;
try {
for (TextOutputFormat<String> format : formats.values()) {
try {
format.close();
} catch (Exception e) {
lastException = e;
format.tryCleanupOnError();
}
}
} finally {
formats.clear();
}
if (lastException != null) {
throw new IOException("Close failed.", lastException);
}
}
}