Strange behaviour in protobuf with long strings

Posted 2019-09-05 02:52

Question:

I'm trying to send data from a client to a server. Both applications are written in Java, but they use a TLS layer implemented in C++ through SWIG wrappers. The TLS layer expects a string from the client, transmits it to the server side, and notifies the Java server application, passing the string along. This string should contain serialized data, and I'm struggling to use protobuf to serialize it. I would like to use a Java protobuf class named ToDoListMessage. The protobuf definition looks like this:

message ToDoListMessage{  
    optional string user = 1;  
    optional string token = 2;
}

But the generated Java class fails to parse the data that it serialized itself:

com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.

I'm currently not sending the data to the server, just testing the serialize-and-parse step on the client side:

ToDoListMessageProto msg = ToDoListMessageProto.newBuilder().setUser("test").setToken("38632735722755").build();        

byte b [] = msg.toByteArray();  
String sMsg = Arrays.toString(b);   
System.out.println("send message = " + sMsg);
ToDoListMessageProto outputmessage;         
outputmessage = ToDoListMessageProto.parseFrom(sMsg.getBytes());

The message looks like:

[10, 4, 116, 101, 115, 116, 18, 14, 51, 56, 54, 51, 50, 55, 51, 53, 55, 50, 50, 55, 53, 53]

What I tried:

1) All solutions I found so far say this problem might be solved by using a CodedOutputStream. But the TLS layer expects a string, not a stream. I nevertheless tried the following:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
CodedOutputStream cos = CodedOutputStream.newInstance(bos);
msg.writeTo(cos);   
cos.flush();
byte b [] = msg.toByteArray();              
String sMsg = Arrays.toString(b);   

But I get the same error as above for this parsing:

CodedInputStream cis = CodedInputStream.newInstance(sMsg.getBytes());
ToDoListMessageProto message = ToDoListMessageProto.parseFrom(cis);

2) I also tried to use a UTF-8 encoded string instead of the array-like one:

String sMsg = new String(b);

In this case the app behaves even more strangely. For short tokens (e.g. less than 129 bit) parsing works, but it fails for long tokens:

com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either than the input has been truncated or that an embedded message misreported its own length.

I really can't tell why. Currently the token contains only digits.

Does anyone know how I can get a serialized string from protobuf that can be parsed correctly?

Again: no TLS transmission is involved in this test. Currently everything is done on the client side.

Update:

Because I fetch the byte array directly from the protobuf message, it is not possible to pass an encoding. I found there is also a toByteString method on the message, but calling toStringUtf8 on this ByteString doesn't seem to work either:

String sMsg = msg.toByteString().toStringUtf8();
System.out.println("send message = " + sMsg);
ToDoListMessageProto outputmessage;         
outputmessage = ToDoListMessageProto.parseFrom(sMsg.getBytes());

I get the same error messages (which differ depending on whether I use a long or a short token, see above).

Answer 1:

Converting a Java String into a byte array and back always needs an indication of which encoding to use. If you omit it, the platform default charset is used, so only 7-bit characters (encoding "US-ASCII", since Java 7: StandardCharsets.US_ASCII) can be relied on to convert correctly. If you want to serialize UTF-8 strings:

        String inputStr = "öäü";
        byte[] serialized = inputStr.getBytes( StandardCharsets.UTF_8);
        System.out.println( "Number of bytes: " + serialized.length);

        StringBuilder sb = new StringBuilder();
        for (byte b : serialized)
        {
            sb.append(String.format("%02X ", b));
        }
        System.out.println( "Bytes: " + sb.toString());
        String back = new String( serialized, StandardCharsets.UTF_8);
        System.out.println( "Back: " + back);

gives the output:

Number of bytes: 6
Bytes: C3 B6 C3 A4 C3 BC 
Back: öäü
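
The example above round-trips text. A serialized protobuf message is arbitrary binary data, though, and not every byte sequence is valid UTF-8. The following is a minimal sketch of what is probably happening with the long tokens in the question, assuming the default charset is UTF-8. The class name BinaryStringRoundTrip, the helper roundTrip and the three sample bytes are made up for illustration: 18 is the tag of field 2 (as in the byte dump in the question), and 0x80 0x01 is how protobuf encodes a length of 128 as a varint.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryStringRoundTrip {

    // Decode the bytes into a String with the given charset, then encode them back.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: tag of field 2, followed by the varint for length 128.
        byte[] data = {18, (byte) 0x80, 0x01};

        // 0x80 is not a valid UTF-8 start byte, so decoding replaces it with U+FFFD,
        // which encodes back as three different bytes.
        byte[] viaUtf8 = roundTrip(data, StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte value to exactly one char, so nothing is lost.
        byte[] viaLatin1 = roundTrip(data, StandardCharsets.ISO_8859_1);

        System.out.println("original:   " + Arrays.toString(data));
        System.out.println("UTF-8:      " + Arrays.toString(viaUtf8));
        System.out.println("ISO-8859-1: " + Arrays.toString(viaLatin1));
    }
}
```

With short tokens every byte of the message stays below 0x80, so the UTF-8 round trip happens to work. Once a length prefix or any other byte reaches 0x80 or above, the bytes change and the parser sees a corrupted message. Whether an ISO-8859-1 string survives the SWIG/C++ layer unchanged is a separate question that this sketch does not answer.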


Answer 2:

I was not able to resolve the original problem. What I finally ended up doing was generating Java protobuf classes and using them to convert the data to byte[]. Afterwards I passed the byte[] to C++. On the server side I send the byte[] from the C++ TLS layer to the Java server application through JNI. The Java server application itself uses the Java protobuf classes again to parse the byte[] into an object. No String is involved in my Java source code. This works, but I'm still curious whether there is a way to solve the original issue.
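
If the transport really must take a String, one common option is to Base64-encode the serialized bytes, which yields an ASCII-only string that decodes back to exactly the same bytes. The sketch below uses only java.util.Base64 from the JDK. The class name Base64Transport and its two helpers are hypothetical names chosen for this example, and the sample bytes are the ones printed in the question.

```java
import java.util.Arrays;
import java.util.Base64;

final class Base64Transport {

    // Turn arbitrary bytes into an ASCII-only string that is safe to hand to a String API.
    static String toTransportString(byte[] serialized) {
        return Base64.getEncoder().encodeToString(serialized);
    }

    // Recover the exact original bytes from the transport string.
    static byte[] fromTransportString(String transport) {
        return Base64.getDecoder().decode(transport);
    }

    public static void main(String[] args) {
        // Bytes from the question, i.e. what msg.toByteArray() returned there.
        byte[] serialized = {10, 4, 116, 101, 115, 116, 18, 14,
                51, 56, 54, 51, 50, 55, 51, 53, 55, 50, 50, 55, 53, 53};

        String transport = toTransportString(serialized);
        byte[] restored = fromTransportString(transport);

        System.out.println("transport string = " + transport);
        System.out.println("bytes preserved  = " + Arrays.equals(serialized, restored));
    }
}
```

In the setup from the question, the client would pass toTransportString(msg.toByteArray()) to the TLS layer and the server would call ToDoListMessageProto.parseFrom(fromTransportString(received)). The cost is roughly one third more data on the wire.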



Answer 3:

You can use com.google.protobuf.TextFormat, for example:

ToDoListMessageProto msg = ToDoListMessageProto.newBuilder().setUser("test").setToken("38632735722755").build();        

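// toString() renders the message in protobuf text format, which TextFormat can parse back.
// (The output of Arrays.toString(byte[]) is not text format and cannot be parsed this way.)
String sMsg = msg.toString();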
System.out.println("send message = " + sMsg);

ToDoListMessageProto.Builder msgBuilder = ToDoListMessageProto.newBuilder();
TextFormat.getParser().merge(sMsg, msgBuilder);
ToDoListMessageProto outputmessage = msgBuilder.build();
System.out.println("received message = " + outputmessage.toString());