I’m currently using PDFBox to read the text of a set of pdfs that I’ve inherited.

I’m only interested in reading the text, not making any changes to the file.

The code that works for most of the files is:

   File pdfFile = myPath.toFile();
   PDDocument document = PDDocument.load(pdfFile);
   Writer sw = new StringWriter();
   PDFTextStripper stripper = new PDFTextStripper();
   stripper.setStartPage(1);
   stripper.writeText(document, sw);
   String documentText = sw.toString();

For most files, I wind up with the text in the documentText field.

But for 3 of the 24 files, the documentText content is “\r\n” for the first file, “\r\n\r\n” for the second, and “\r\n\r\n\r\n” for the third. The three files are not consecutive; several good files sit between each of them.
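To flag the affected files automatically, a check along these lines could be run on each extracted string. It is only a minimal sketch: the class and method names are hypothetical and not part of PDFBox, and it assumes the symptom is always whitespace-only output. If the number of line breaks turns out to match the page count of the file, that would be consistent with pages that contain no extractable text (for example scanned images), but that is a guess that still needs to be verified against the actual files.

```java
import java.util.Arrays;
import java.util.List;

public class BlankExtractionCheck {

    /** Returns true when the extracted text holds nothing but whitespace and line breaks. */
    static boolean hasNoVisibleText(String extracted) {
        return extracted == null || extracted.trim().isEmpty();
    }

    /** Counts the CR LF line breaks in the extracted text. */
    static int countLineBreaks(String extracted) {
        int count = 0;
        int from = 0;
        while ((from = extracted.indexOf("\r\n", from)) != -1) {
            count++;
            from += 2; // skip past the line break just found
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample values; in practice these would come from sw.toString().
        List<String> samples = Arrays.asList("Some real text\r\n", "\r\n", "\r\n\r\n\r\n");
        for (String text : samples) {
            System.out.println("blank=" + hasNoVisibleText(text)
                    + ", lineBreaks=" + countLineBreaks(text));
        }
    }
}
```

Comparing the line-break count with document.getNumberOfPages() for each flagged file would show whether the two numbers really line up.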

The File is derived from a java.nio.Path. The WindowsFileAttribute that is part of the Path has a size of 279K, so the file is not empty on disk.

I can open the file and view the data, and it looks like the other files that my code reads.

I’m using Java 8.0.121 and PDFBox 2.0.4 (the latest version, I believe).

Any suggestions? Is there a better way to read the text? (I’m not interested in the formatting, or fonts used, just the text.)

Thanks.

Tags: java pdfbox

1 answer

劫难

#2 · 2019-09-14 14:24

Reading multiple PDF docs using pdfbox in java

package readwordfile;

import java.io.BufferedReader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/**
 * This is an example on how to extract words from PDF document
 *
 * @author saravanan
 */
public class GetWordsFromPDF extends PDFTextStripper {

    static List<String> words = new ArrayList<String>();

    public GetWordsFromPDF() throws IOException {
    }

    /**
     * @param args
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        String files;
//        FileWriter fs = new FileWriter("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");  
//        FileInputStream fstream1 = new FileInputStream("C:\\Users\\saravanan\\Desktop\\New Text Document (2).txt");
//        DataInputStream in1 = new DataInputStream(fstream1);
//        BufferedReader br1 = new BufferedReader(new InputStreamReader(in1));
        String path = "C:\\Users\\saravanan\\Desktop\\New folder\\";  //local folder path name
        File folder = new File(path);

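        File[] listOfFiles = folder.listFiles();
        if (listOfFiles == null) { // listFiles() returns null when the path is not a readable directory
            throw new IOException("Cannot list directory: " + path);
        }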

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                files = listOfFiles[i].getName();
                if (files.endsWith(".pdf") || files.endsWith(".PDF")) {

                    String fileName1 = path + files; // reuse the folder path declared above
                    System.out.print("\n\n" + files+"\n");
                    PDDocument document = null;
                    try {
                        document = PDDocument.load(new File(fileName1));
                        PDFTextStripper stripper = new GetWordsFromPDF();
                        stripper.setSortByPosition(true);
                        stripper.setStartPage(1); // PDFTextStripper page numbers are 1-based
                        stripper.setEndPage(document.getNumberOfPages());

                        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
                        stripper.writeText(document, dummy);
                        int x = 0;

                        System.out.println("");
                        for (String word : words) {
                            if (word.startsWith("xxxxxx")) { //here you can give your pdf doc starting word 
                                x = 1;
                            }
                            if (x == 1) {
                                if (!(word.endsWith("YYYYYY"))) { //here you can give your pdf doc ending word 
                                    System.out.print(word + " ");
                                    // fs.write(word);                                   
                                } else {
                                    x = 0;
                                    break;
                                }
                            }
                        }
                    } finally {
                        if (document != null) {
                            document.close();
                            words.clear();
                        }
                    }
                }
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     *
     * @param str
     * @param textPositions
     * @throws java.io.IOException
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        String[] wordsInStream = str.split(getWordSeparator());
        if (wordsInStream != null) {
            for (String word : wordsInStream) {
                words.add(word);    //store the pdf content into the List
            }
        }
    }
}
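
The start-word/end-word filtering done inside main can also be pulled out into a small helper that is easier to test without any PDF at hand. This is only a sketch of the same loop: the class name WordWindow and the method between are hypothetical, and the marker strings are placeholders just like "xxxxxx" and "YYYYYY" above. As in the original loop, the word matching the start marker is kept and the word matching the end marker is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordWindow {

    /**
     * Collects the words from the first one that starts with startPrefix
     * up to, but not including, the first later word that ends with endSuffix.
     */
    static List<String> between(List<String> words, String startPrefix, String endSuffix) {
        List<String> result = new ArrayList<>();
        boolean inside = false;
        for (String word : words) {
            if (!inside && word.startsWith(startPrefix)) {
                inside = true; // start marker found, begin collecting
            }
            if (inside) {
                if (word.endsWith(endSuffix)) {
                    break; // end marker found, stop without keeping it
                }
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical word list standing in for the words gathered by writeString().
        List<String> words = Arrays.asList("header", "BEGIN", "first", "second", "END", "footer");
        System.out.println(String.join(" ", between(words, "BEGIN", "END")));
    }
}
```

In the answer's program, the static words list filled by writeString() would be passed in place of the sample list.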
