Based on the answer to this question I'm thinking that I've provided my .pb file with a "faulty decoder".
This is the data I'm trying to decode.
Based on the ListPeople.java example provided in the Java tutorial documentation, I tried to write something similar to start picking apart that data. I wrote this:
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;
public class ListDocument
{
    // Iterates through all sentences in the Document and prints their tokens.
    static void Print(Document document)
    {
        for ( Sentence sentence: document.getSentencesList() )
        {
            for(int i=0; i < sentence.getTokensCount(); i++)
            {
                System.out.println(" getTokens(" + i + "): " + sentence.getTokens(i) );
            }
        }
    }
    // Main function: Reads the entire document from a file and prints all
    // the tokens inside.
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: ListDocument DOCUMENT_FILE");
            System.exit(-1);
        }
        // Read the existing document.
        Document document =
            Document.parseFrom(new FileInputStream(args[0]));
        Print(document);
    }
}
But when I run that I get this error:
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
at ListDocument.main(ListDocument.java:40)
So, as I said above, I think this has to do with me not properly defining the decoder. Is there some way to look at the .proto file I'm trying to use and figure out how to just read off all that data, or to see what I'm doing wrong?
These are the first few lines of the file I want to read:
Ü
&/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"Ö
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb„€€€øÿÿÿÿƒ€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON"Adep:[NMOD]->|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Sstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON*ŒThe occasion was suitably exceptional : a reunion of the 1970s-era Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums ."¬
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1849689.xml.pb†€€€øÿÿÿÿ…€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON"Cdep:[NMOD]->|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Vstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*ÊTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .â
&/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"Ù
EDIT
I was told that another researcher used the following file to parse these files. Is it possible that I could use it?
package edu.stanford.nlp.kbp.slotfilling.multir;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
import edu.stanford.nlp.kbp.slotfilling.common.Log;
import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.util.ErasureUtils;
import edu.stanford.nlp.util.HashIndex;
import edu.stanford.nlp.util.Index;
/**
* Converts Hoffmann's data in protobuf format to our MultiLabelDataset
* @author Mihai
*
*/
public class ProtobufToMultiLabelDataset {
static class RelationAndMentions {
String arg1;
String arg2;
Set<String> posLabels;
Set<String> negLabels;
List<Mention> mentions;
public RelationAndMentions(String types, String a1, String a2) {
arg1 = a1;
arg2 = a2;
String [] rels = types.split(",");
posLabels = new HashSet<String>();
for(String r: rels){
if(! r.equals("NA")) posLabels.add(r.trim());
}
negLabels = new HashSet<String>(); // will be populated later
mentions = new ArrayList<Mention>();
}
};
static class Mention {
List<String> features;
public Mention(List<String> feats) {
features = feats;
}
}
public static void main(String[] args) throws Exception {
String input = args[0];
InputStream is = new GZIPInputStream(
new BufferedInputStream
(new FileInputStream(input)));
toMultiLabelDataset(is);
is.close();
}
public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
List<RelationAndMentions> relations = toRelations(is, true);
MultiLabelDataset<String, String> dataset = toDataset(relations);
return dataset;
}
public static void toDatums(InputStream is,
List<List<Collection<String>>> relationFeatures,
List<Set<String>> labels) throws IOException {
List<RelationAndMentions> relations = toRelations(is, false);
toDatums(relations, relationFeatures, labels);
}
private static void toDatums(List<RelationAndMentions> relations,
List<List<Collection<String>>> relationFeatures,
List<Set<String>> labels) {
for(RelationAndMentions rel: relations) {
labels.add(rel.posLabels);
List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
for(int i = 0; i < rel.mentions.size(); i ++){
mentionFeatures.add(rel.mentions.get(i).features);
}
relationFeatures.add(mentionFeatures);
}
assert(labels.size() == relationFeatures.size());
}
public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
//
// Parse the protobuf
//
// all relations are stored here
List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
// all known relations (without NIL)
Set<String> relTypes = new HashSet<String>();
Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
new HashMap<String, Map<String,Set<String>>>();
Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
Relation r = null;
while ((r = Relation.parseDelimitedFrom(is)) != null) {
RelationAndMentions relation = new RelationAndMentions(
r.getRelType(), r.getSourceGuid(), r.getDestGuid());
labelCountHisto.incrementCount(relation.posLabels.size());
relTypes.addAll(relation.posLabels);
relations.add(relation);
for(int i = 0; i < r.getMentionCount(); i ++) {
DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
// String s = mention.getSentence();
relation.mentions.add(new Mention(mention.getFeatureList()));
}
for(String l: relation.posLabels) {
addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
}
}
Log.severe("Loaded " + relations.size() + " relations.");
Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
Log.severe("Label count histogram: " + labelCountHisto);
Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
for(String e: knownRelationsPerEntity.keySet()) {
slotCountHisto.incrementCount(knownRelationsPerEntity.get(e).size());
}
Log.severe("Slot count histogram: " + slotCountHisto);
int negativesWithKnownPositivesCount = 0, totalNegatives = 0;
for(RelationAndMentions rel: relations) {
if(rel.posLabels.size() == 0) {
if(knownRelationsPerEntity.get(rel.arg1) != null &&
knownRelationsPerEntity.get(rel.arg1).size() > 0) {
negativesWithKnownPositivesCount ++;
}
totalNegatives ++;
}
}
Log.severe("Found " + negativesWithKnownPositivesCount + "/" + totalNegatives +
" negative examples with at least one known relation for arg1.");
Counter<Integer> mentionCountHisto = new ClassicCounter<Integer>();
for(RelationAndMentions rel: relations) {
mentionCountHisto.incrementCount(rel.mentions.size());
if(rel.mentions.size() > 100)
Log.fine("Large relation: " + rel.mentions.size() + "\t" + rel.posLabels);
}
Log.severe("Mention count histogram: " + mentionCountHisto);
//
// Detect the known negatives for each source entity
//
if(generateNegativeLabels) {
for(RelationAndMentions rel: relations) {
Set<String> negatives = new HashSet<String>(relTypes);
negatives.removeAll(rel.posLabels);
rel.negLabels = negatives;
}
}
return relations;
}
private static MultiLabelDataset<String, String> toDataset(List<RelationAndMentions> relations) {
int [][][] data = new int[relations.size()][][];
Index<String> featureIndex = new HashIndex<String>();
Index<String> labelIndex = new HashIndex<String>();
Set<Integer> [] posLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
Set<Integer> [] negLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
int offset = 0, posCount = 0;
for(RelationAndMentions rel: relations) {
Set<Integer> pos = new HashSet<Integer>();
Set<Integer> neg = new HashSet<Integer>();
for(String l: rel.posLabels) {
pos.add(labelIndex.indexOf(l, true));
}
for(String l: rel.negLabels) {
neg.add(labelIndex.indexOf(l, true));
}
posLabels[offset] = pos;
negLabels[offset] = neg;
int [][] group = new int[rel.mentions.size()][];
for(int i = 0; i < rel.mentions.size(); i ++){
List<String> sfeats = rel.mentions.get(i).features;
int [] features = new int[sfeats.size()];
for(int j = 0; j < sfeats.size(); j ++) {
features[j] = featureIndex.indexOf(sfeats.get(j), true);
}
group[i] = features;
}
data[offset] = group;
posCount += posLabels[offset].size();
offset ++;
}
Log.severe("Creating a dataset with " + data.length + " datums, out of which " + posCount + " are positive.");
MultiLabelDataset<String, String> dataset = new MultiLabelDataset<String, String>(
data, featureIndex, labelIndex, posLabels, negLabels);
return dataset;
}
private static void addKnownRelation(String arg1, String arg2, String label,
Map<String, Map<String, Set<String>>> knownRelationsPerEntity) {
Map<String, Set<String>> myRels = knownRelationsPerEntity.get(arg1);
if(myRels == null) {
myRels = new HashMap<String, Set<String>>();
knownRelationsPerEntity.put(arg1, myRels);
}
Set<String> mySlots = myRels.get(label);
if(mySlots == null) {
mySlots = new HashSet<String>();
myRels.put(label, mySlots);
}
mySlots.add(arg2);
}
}
I know it's been over two years, but here I provide a general way to read these delimited protocol buffers in Python. The function you mention, parseDelimitedFrom, is not available in the Python implementation of protocol buffers. But here is a small workaround for whoever might need it. The code is an adaptation of that found at https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/ together with a usage example using one of the OP's files.
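For readers staying on the Java side, the framing that parseDelimitedFrom relies on can be sketched with the standard library alone: each record is a base-128 varint length followed by that many payload bytes. The sketch below only splits a stream into raw records and does not decode them; the class name DelimitedFrames and its methods are hypothetical, not part of the protobuf library.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a varint-delimited stream into raw records.
public class DelimitedFrames {

    // Reads one base-128 varint (up to 5 bytes); returns -1 on a clean end of stream.
    static int readVarint32(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b == -1) {
                if (shift == 0) {
                    return -1; // no more records
                }
                throw new EOFException("stream ended inside a varint");
            }
            result |= (b & 0x7F) << shift;   // low 7 bits carry data
            if ((b & 0x80) == 0) {           // high bit clear: last byte
                return result;
            }
        }
        throw new IOException("varint longer than 5 bytes");
    }

    // Reads records until the stream ends; each record is length + payload.
    static List<byte[]> readFrames(InputStream in) throws IOException {
        List<byte[]> frames = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        int length;
        while ((length = readVarint32(data)) != -1) {
            if (length < 0) {
                throw new IOException("negative record length");
            }
            byte[] payload = new byte[length];
            data.readFully(payload);         // throws EOFException if truncated
            frames.add(payload);
        }
        return frames;
    }

    public static void main(String[] args) throws IOException {
        // Two made-up records: "abc" (length 3) and "z" (length 1).
        byte[] sample = {0x03, 'a', 'b', 'c', 0x01, 'z'};
        for (byte[] frame : readFrames(new ByteArrayInputStream(sample))) {
            System.out.println(new String(frame, StandardCharsets.US_ASCII));
        }
        // prints "abc" then "z"
    }
}
```

With the generated classes on the classpath, each payload could then be handed to the message's own parseFrom(byte[]), which is roughly what parseDelimitedFrom does in one step.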
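Updated; the confusion here is two points: the message type stored in the file is Relation, not Document (in fact, only Relation and RelationMentionRef are even used), and the file holds a sequence of length-delimited messages rather than a single message. As such, Relation.parseDelimitedFrom should work. Processing it manually, I get:
Old; outdated; exploratory: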
I extracted your 4 documents and ran them through a little test rig:
where ProcessFile first dumps the first 10 bytes as hex, and then tries to process it via a ProtoReader. Here's the results:
Yep; agreed; DC is wire-type 4 (end-group), field 27; your document does not define field 27, and even if it did: it is meaningless to start with an end-group.
Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.
Same as above.
CF 75 is a two-byte varint with wire-type 7 (which is not defined in the specification).
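The tag readings above can be checked by hand: a tag is a varint whose low three bits are the wire type and whose remaining bits are the field number. The sketch below decodes CF 75, and also DC 01, on the assumption that the first dump began with those two bytes (that is how a tag for field 27 with wire type 4 is encoded). The class name TagDecoder is hypothetical.

```java
// Hypothetical helper for decoding a protobuf tag from its raw bytes.
public class TagDecoder {

    // Decodes a base-128 varint from the start of the array.
    static long readVarint(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift; // low 7 bits carry data
            if ((b & 0x80) == 0) {               // high bit clear: last byte
                break;
            }
            shift += 7;
        }
        return value;
    }

    // Field number is everything above the low three bits of the tag.
    static long fieldNumber(long tag) {
        return tag >>> 3;
    }

    // Wire type is the low three bits of the tag.
    static int wireType(long tag) {
        return (int) (tag & 7);
    }

    public static void main(String[] args) {
        long first = readVarint(new byte[] {(byte) 0xDC, 0x01});
        System.out.println(first + " -> field " + fieldNumber(first)
                + ", wire type " + wireType(first));
        // prints "220 -> field 27, wire type 4"

        long second = readVarint(new byte[] {(byte) 0xCF, 0x75});
        System.out.println(second + " -> field " + fieldNumber(second)
                + ", wire type " + wireType(second));
        // prints "15055 -> field 1881, wire type 7"
    }
}
```

Read as a length prefix instead of a tag, the bytes DC 01 would mean a 220-byte record, which would fit the delimited layout described in the update.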
Your data is well and truly garbage. Sorry.
And with the bonus round of test-multiple.pb from comments (after gz decompression):
This starts identically to testNegative.pb, and hence fails for exactly the same reason.