Text associated to PDF paragraph in document conte

Posted 2019-07-21 05:05

Question:

I'm trying to get the text associated with a paragraph by navigating the content tree of a PDF file. I am using PDFBox and cannot find the link between a paragraph and the text it contains (see the code below):

public class ReadPdf  {
public static void main( String[] args ) throws IOException{

    MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
            "C:/Users/wip.txt")));
    RandomAccessFile raf = new RandomAccessFile(new File(
            "C:/Users/mypdf.pdf"), "r");
    PDFParser parser = new PDFParser(raf);
    parser.parse();

    COSDocument cosDoc = parser.getDocument();
    out.write(cosDoc.getXrefTable().toString());
    out.write(cosDoc.getObjects().toString());
    PDDocument document = parser.getPDDocument();
    document.getClass();
    COSParser cosParser = new COSParser(raf);

    PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();

    for (Object kid : treeRoot.getKids()){


        for (Object kid2 :((PDStructureElement)kid).getKids()){
            PDStructureElement kid2c = (PDStructureElement)kid2;

            // Compare string contents with equals(); == only compares references
            if ("P".equals(kid2c.getStandardStructureType())){
                for (Object kid3 : kid2c.getKids()){
                    if (kid3 instanceof PDStructureElement){
                        PDStructureElement kid3c = (PDStructureElement)kid3;
                    }

                    else{

                        for (Entry<COSName, COSBase>entry : kid2c.getCOSObject().entrySet()){


                            // Print all the Keys in the paragraph COSDictionary
                            System.out.println(entry.getKey().toString());
                            System.out.println(entry.getValue().toString());
                        }
                    }
                }
            }
        }
    }
}
}

When I print the contents I get the following Keys:

  • /P : Reference to Parent
  • /A : Format of the paragraph
  • /K : Position of the paragraph in the section
  • /C : Name of the paragraph (!= text)
  • /Pg : Reference to the page

Example output:

COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}

None of these points to the actual text inside the paragraph. I know the relation can be recovered, because Acrobat shows it when I open the document (see pic below):

Answer 1:

I found a way to do this by parsing the content stream of a page. According to section 10.6.3 of the PDF specification, each piece of text in the content stream sits in a marked-content sequence tagged /P and numbered with an /MCID, and that number matches an attribute of the tag (a PDStructureElement in PDFBox) that can be read from its COSObject.

1) To get the text and the MCID:

PDPage pdPage; // the page whose content you want to read
StringBuilder tokenString = new StringBuilder();
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
    try {
        PDFStreamParser parser2 = new PDFStreamParser(inputStream.next());
        parser2.parse();
        List<Object> tokens = parser2.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            tokenString.append(tokens.get(j).toString());
        }
        // Here comes the parsing of the string. Chapter 5 of the specification
        // describes the text operators (Tj shows the actual text; Tm, BT, ET);
        // BDC and EMC delimit the marked-content sequence that carries the MCID.
    } catch (IOException e) {
        e.printStackTrace();
    }
}
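The string parsing that step 1 leaves as a comment could be sketched along these lines. This is a minimal illustration, not PDFBox code: it works on the content stream as plain text, the class and method names (McidTextSketch, tokenize, textByMcid) are made up for this example, and it assumes literal strings in parentheses only (no hex strings, no font encodings, simplified escape handling).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups the text shown by Tj/TJ/'/" by the MCID of the
// enclosing BDC ... EMC marked-content sequence.
public class McidTextSketch {

    /** Splits content-stream text into tokens; literal strings keep their parentheses. */
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '(') {                       // literal string, may nest parentheses
                int depth = 0;
                do {
                    char d = s.charAt(i);
                    if (d == '\\') i++;           // skip the escaped character
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (depth > 0 && i < s.length());
            } else if (s.startsWith("<<", i) || s.startsWith(">>", i)) {
                i += 2;                           // dictionary delimiters
            } else if (c == '[' || c == ']') {
                i++;                              // array delimiters
            } else {                              // name, number or operator
                i++;
                while (i < s.length() && !Character.isWhitespace(s.charAt(i))
                        && "()[]<>/".indexOf(s.charAt(i)) < 0) {
                    i++;
                }
            }
            tokens.add(s.substring(start, Math.min(i, s.length())));
        }
        return tokens;
    }

    /** Operands are strings, names, numbers and delimiters; anything else is an operator. */
    static boolean isOperand(String t) {
        char c = t.charAt(0);
        return "(/[]<>+-.".indexOf(c) >= 0 || Character.isDigit(c);
    }

    /** Strips the outer parentheses and resolves backslash escapes in a simplified way. */
    static String unescape(String t) {
        String body = t.length() >= 2 && t.endsWith(")")
                ? t.substring(1, t.length() - 1) : t.substring(1);
        return body.replaceAll("\\\\(.)", "$1");
    }

    /** Returns the concatenated text of each marked-content sequence, keyed by MCID. */
    static Map<Integer, String> textByMcid(String content) {
        Map<Integer, String> text = new LinkedHashMap<>();
        Deque<Integer> open = new ArrayDeque<>();  // MCID in effect per open sequence, -1 if none
        List<String> operands = new ArrayList<>();
        for (String t : tokenize(content)) {
            if (isOperand(t)) { operands.add(t); continue; }
            if (t.equals("BDC") || t.equals("BMC")) {
                int at = operands.indexOf("/MCID");
                int inherited = open.isEmpty() ? -1 : open.peek();
                open.push(at >= 0 && at + 1 < operands.size()
                        ? Integer.parseInt(operands.get(at + 1)) : inherited);
            } else if (t.equals("EMC")) {
                if (!open.isEmpty()) open.pop();
            } else if ((t.equals("Tj") || t.equals("TJ") || t.equals("'") || t.equals("\""))
                    && !open.isEmpty() && open.peek() >= 0) {
                StringBuilder sb = new StringBuilder(text.getOrDefault(open.peek(), ""));
                for (String o : operands) {
                    if (o.startsWith("(")) sb.append(unescape(o));
                }
                text.put(open.peek(), sb.toString());
            }
            operands.clear();
        }
        return text;
    }

    public static void main(String[] args) {
        String content = "/P <</MCID 0>> BDC BT /F1 12 Tf 72 700 Td (Hello ) Tj (world) Tj ET EMC "
                + "/P <</MCID 1>> BDC BT [(Second) -250 ( paragraph)] TJ ET EMC";
        System.out.println(textByMcid(content)); // {0=Hello world, 1=Second paragraph}
    }
}
```

With real documents, the tokens returned by PDFStreamParser would replace the hand-written tokenizer, and the string bytes would still need decoding through the font in use before they read as text.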
2) Then, to get the tag whose attribute matches the MCID:

    PDStructureElement pDStructureElement;
    // /K holds the MCID when it is a plain integer, as in the example output (COSInt{2})
    int mcid = pDStructureElement.getCOSObject().getInt(COSName.K);
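Reading /K with getInt only works when /K is a single integer. The PDF specification also allows /K to be a dictionary (for example a marked-content reference with /Type /MCR and its own /MCID) or an array mixing these forms. The sketch below shows one way to collect the MCIDs in those cases. It is a model only: plain Java types (Integer, List, Map) stand in for the COS objects, and the class and method names are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a structure element's /K entry using plain Java types:
// Integer = MCID, Map = dictionary, List = array.
public class KEntrySketch {

    /** Collects the MCIDs that one structure element's /K value refers to directly. */
    static List<Integer> directMcids(Object k) {
        List<Integer> out = new ArrayList<>();
        collect(k, out);
        return out;
    }

    private static void collect(Object k, List<Integer> out) {
        if (k instanceof Integer) {
            out.add((Integer) k);                       // integer marked-content identifier
        } else if (k instanceof List) {
            for (Object item : (List<?>) k) {           // array of kids
                collect(item, out);
            }
        } else if (k instanceof Map) {
            Map<?, ?> dict = (Map<?, ?>) k;
            if ("MCR".equals(dict.get("Type")) && dict.get("MCID") instanceof Integer) {
                out.add((Integer) dict.get("MCID"));    // marked-content reference
            }
            // Other dictionaries (child structure elements, object references)
            // carry no MCID of their own and are skipped here.
        }
    }

    public static void main(String[] args) {
        Object single = 2;                              // like COSInt{2} in the example output
        Map<String, Object> mcr = new HashMap<>();
        mcr.put("Type", "MCR");
        mcr.put("MCID", 5);
        Object mixed = Arrays.asList(0, mcr, Collections.singletonMap("Type", "OBJR"));

        System.out.println(directMcids(single));        // [2]
        System.out.println(directMcids(mixed));         // [0, 5]
    }
}
```

In PDFBox the same branching would be done on the COSBase returned for COSName.K, and child structure elements would be visited recursively to reach their own MCIDs.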

That should do it. In documents without tags (where document.getDocumentCatalog().getStructureTreeRoot() has no children), this match cannot be performed, but the text can still be read using step 1.



Tags: java pdf pdfbox