Avro message for Google Cloud Pub-Sub?

Posted 2019-07-29 17:36

Question:

What is the best data format for publishing to and consuming from Pub/Sub? I am looking at the Avro message format because it is binary. The use case is real-time microservice applications publishing Avro messages to Pub/Sub. Given that Avro is best suited to batching up messages (with a schema attached to the binary data) and then publishing them, is it a suitable format for this microservice use case?

Answer 1:

The Google Cloud documentation contains some JSON examples, but when it comes to efficiency the main suggestion is to use the available client libraries. The exceptions are when the client libraries don't meet your needs or when you are running on the Google App Engine standard environment, in which case the documentation suggests using one of the two service APIs (REST or gRPC) directly.

In fact, the most important factor for efficiency is using the gRPC API instead of the REST API, which is what the client libraries do by default. As mentioned here:

There are two major factors at work here: more efficient data encoding and HTTP/2. gRPC keeps data in binary both in client memory and on the wire by building on HTTP/2 and Protocol Buffers. This eliminates processing and space required for string encoding schemes such as Base64 or JSON. In addition, HTTP/2 itself makes things go faster with multiplexed requests over a single connection and header compression.
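To make the Base64 point concrete, here is a minimal sketch using only the Python standard library. The 300-byte payload is a made-up stand-in for an Avro- or protobuf-encoded record; the JSON body mirrors the shape of a REST publish request, where message data has to travel as a Base64 string.

```python
import base64
import json

# Hypothetical 300-byte binary payload (for example an encoded Avro record).
payload = bytes(i % 256 for i in range(300))

# Over REST, the message data must be Base64 text inside a JSON body.
b64 = base64.b64encode(payload).decode("ascii")
rest_body = json.dumps({"messages": [{"data": b64}]})


def base64_overhead(raw: bytes) -> float:
    """Return the size of the Base64 text divided by the size of the raw bytes."""
    return len(base64.b64encode(raw)) / len(raw)


# Raw size, Base64 size, and size of the whole JSON body.
print(len(payload), len(b64), len(rest_body))
print(base64_overhead(payload))
```

Base64 turns every 3 bytes into 4 characters, so the data grows by about a third before any JSON framing is added. With gRPC the same payload stays as raw bytes on the wire.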

I did not find any explicit mention of a data format anywhere. I suggest building the message in your preferred language, for example Python. The client library description is here and sample code is here.

Based on this StackOverflow post, you can make your Pub/Sub system more efficient by:

  1. Making sure you are using gRPC
  2. Batching where possible, to reduce the number of calls and the latency overhead of each call.
  3. Only compressing when needed and after benchmarking (implies extra logic in your application)
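The third tip is easy to check before committing to it. The sketch below uses `zlib` from the Python standard library as a stand-in for whatever compressor you might pick; both payloads are invented for illustration.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; above 1.0 means the data grew."""
    return len(zlib.compress(data)) / len(data)


# A tiny message, typical of a single microservice event (hypothetical).
small = b'{"id":1}'

# A large, repetitive payload, closer to a batch of similar records (hypothetical).
large = b'{"id":1,"status":"ok"}' * 500

print(compression_ratio(small))
print(compression_ratio(large))
```

For very small messages the compressor's header and checksum outweigh any savings, so the ratio ends up above 1.0, while the repetitive payload shrinks dramatically. That is why it is worth measuring on your real messages first.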

Finally, if you intend to deploy a robust Pub/Sub system, have a look at this post by Anusha Ramesh. She is now a Project Manager at Google, and she elaborates on three tips:

  1. Don't underestimate the importance of capacity planning.
  2. Make sure your pub/sub system is fault-tolerant.
  3. NSM: Never Stop Monitoring.


Answer 2:

There isn't going to be one correct answer for the best format to use for the messages for all use cases. Avro is certainly a popular choice. Protocol buffers would be another possibility, as would Thrift. For Pub/Sub, the data is all just bytes and it is up to the publisher and the subscriber to determine the interpretation of this data. People have run comparisons on the different data formats, so you may want to make the decision based on your needs in terms of performance and message sizes.
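As a rough illustration of "the data is all just bytes", the sketch below encodes the same record two ways using only the standard library. The record layout and field names are hypothetical, and `struct` stands in for a real schema-based format such as Avro or Protocol Buffers, which would need their own libraries.

```python
import json
import struct

# Hypothetical layout agreed between publisher and subscriber:
# sensor id (unsigned 32-bit), timestamp in ms (unsigned 64-bit), reading (64-bit float).
# "<" means little-endian with no padding, so the record is always 20 bytes.
RECORD = struct.Struct("<IQd")


def encode_binary(sensor_id: int, ts_ms: int, reading: float) -> bytes:
    """Pack the record into a fixed-size binary payload."""
    return RECORD.pack(sensor_id, ts_ms, reading)


def decode_binary(data: bytes) -> tuple:
    """Unpack a payload produced by encode_binary."""
    return RECORD.unpack(data)


def encode_json(sensor_id: int, ts_ms: int, reading: float) -> bytes:
    """Encode the same record as UTF-8 JSON, field names included."""
    record = {"sensor_id": sensor_id, "ts_ms": ts_ms, "reading": reading}
    return json.dumps(record).encode("utf-8")


binary_msg = encode_binary(42, 1564421760000, 21.5)
json_msg = encode_json(42, 1564421760000, 21.5)
print(len(binary_msg), len(json_msg))
print(decode_binary(binary_msg))
```

Either byte string could be handed to a publisher as the message data. The binary form is smaller but only makes sense to a subscriber that knows the layout, which is the trade-off a schema-based format manages for you.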

Pub/Sub itself uses Protocol buffers for defining its data types. With regard to batching, the Cloud Pub/Sub client libraries do batching themselves for publish, so you don't necessarily have to worry about that on your own. You can control the batch settings to optimize throughput and latency based on your use case by calling, for example, setBatchSettings in the Publisher.Builder for Java (other languages have an equivalent as well). You may decide to do your own batching if you want to associate some metadata with a set of messages instead of with each individual message or you have very specific needs in terms of how messages are batched together. Otherwise, depending on the client library to do the batching is probably the correct decision.
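To show what those batch settings control, here is a small standard-library simulation rather than the real client. The `Batcher` class and its `flush` callback are hypothetical; the three thresholds are named after the `max_messages`, `max_bytes` and `max_latency` fields of `BatchSettings` in the Python client, and the default values here are only illustrative.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Batcher:
    """Toy publisher-side batcher: flushes when any threshold is reached."""
    flush: Callable[[List[bytes]], None]  # hypothetical "send one batch" callback
    max_messages: int = 100
    max_bytes: int = 1_000_000
    max_latency: float = 0.01  # seconds
    _pending: List[bytes] = field(default_factory=list)
    _pending_bytes: int = 0
    _first_at: float = 0.0

    def publish(self, data: bytes) -> None:
        if not self._pending:
            self._first_at = time.monotonic()
        self._pending.append(data)
        self._pending_bytes += len(data)
        # A real client would use a background timer for latency; this sketch
        # only checks the elapsed time when the next message arrives.
        if (len(self._pending) >= self.max_messages
                or self._pending_bytes >= self.max_bytes
                or time.monotonic() - self._first_at >= self.max_latency):
            self.close()

    def close(self) -> None:
        """Send whatever is still pending as a final batch."""
        if self._pending:
            self.flush(self._pending)
            self._pending = []
            self._pending_bytes = 0


batches: List[List[bytes]] = []
# Large max_latency so only the message-count threshold triggers in this demo.
batcher = Batcher(flush=batches.append, max_messages=3, max_latency=3600)
for i in range(7):
    batcher.publish(f"event-{i}".encode("utf-8"))
batcher.close()
print([len(b) for b in batches])
```

Raising the thresholds means fewer, larger calls (better throughput), while lowering them means messages leave sooner (better latency). In practice the client library makes this trade-off for you once the settings are configured.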