I have tried the example code of SortValues transform using DirectRunner
on local machine (Windows)
PCollection<KV<String, KV<String, Integer>>> input = ...
PCollection<KV<String, Iterable<KV<String, Integer>>>> grouped =
input.apply(GroupByKey.<String, KV<String, Integer>>create());
PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
grouped.apply(SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
but I got the error PipelineExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable
. Does this mean this transform function only works in Hadoop environment?
As of today, if you use Beam with release version below 2.0.0, you will have to add two hadoop dependencies in your maven pom file for this SortValues module to work.
hadoop-common
version 2.7.3 or laterhadoop-mapreduce-client-core
version 2.7.3 or later.Otherwise, you will just need to use Beam with release version >= 2.0.0.