How to manage conflicting Guava, Protobuf, and gRPC versions on Dataproc

Posted 2019-05-03 05:18

Question:

I am working on a Scala Spark job that needs to use a Java library (youtube/vitess) which depends on newer versions of gRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than those currently provided on the Dataproc 1.1 image.

When I run the project locally, building with Maven, the correct versions of these dependencies are loaded and the job runs without issue. When I submit the job to Dataproc, the Dataproc-provided versions of these libraries take precedence, and the job then references class methods that cannot be resolved.

What is the recommended way to ensure that the right versions of a library's transitive dependencies are loaded when submitting a Spark job on Dataproc? I am not in a position to rewrite components of this library to use the older versions of these packages that Dataproc provides.

Answer 1:

The recommended approach is to include all of your job's dependencies in an uber jar (created with the Maven Shade plugin, for example) and to relocate the dependency classes inside that uber jar so they do not conflict with the classes in the libraries provided by Dataproc, as sketched below.
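For illustration, here is a minimal sketch of such a Shade configuration relocating the three conflicting packages from the question. The `repackaged` prefix and the plugin version are arbitrary choices for this example, not anything mandated by Dataproc:

```xml
<!-- In pom.xml: build an uber jar and relocate the conflicting packages.
     The "repackaged" prefix is an arbitrary choice; any unique prefix works. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>repackaged.com.google.common</shadedPattern>
              </relocation>
              <relocation>
                <pattern>com.google.protobuf</pattern>
                <shadedPattern>repackaged.com.google.protobuf</shadedPattern>
              </relocation>
              <relocation>
                <pattern>io.grpc</pattern>
                <shadedPattern>repackaged.io.grpc</shadedPattern>
              </relocation>
            </relocations>
            <transformers>
              <!-- gRPC discovers providers via META-INF/services; merge those
                   files so the relocated service entries stay consistent. -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

The Shade plugin rewrites both the relocated classes themselves and the bytecode of your own classes that reference them, so at runtime your job calls `repackaged.com.google.common.*` and friends, while Spark and Hadoop keep using the versions shipped on the image.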

For reference, you can take a look at how this is done in the Cloud Storage connector, which is part of the Dataproc distribution.
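Once the shaded jar is built, it can be submitted to Dataproc as usual; a hypothetical invocation, where the cluster name, main class, and jar path are placeholders:

```sh
# Submit the shaded uber jar; Dataproc's own Guava/Protobuf/gRPC no longer clash
# with the relocated copies bundled inside it.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --class=com.example.MyJob \
  --jars=target/my-job-1.0-shaded.jar
```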