I would like to extract text from a given PDF file with Apache PDFBox.
I wrote this code:
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(filepath);
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
However, I got the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)
I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.
Edit
I added System.out.println("program starts"); to the beginning of the program.
When I ran it, I got the same error as above, and "program starts" did not appear in the console.
So I think I have a problem with the class path or something.
Thank you.
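A note on the trace above: its only frame is AFMParser.main, and "program starts" was never printed, which suggests the JVM was started with FontBox's AFMParser as the main class instead of the program shown. A minimal sketch for checking this, assuming nothing beyond the JDK: it prints the class path the JVM actually sees and reports whether the PDFBox and FontBox classes can be loaded. The class name ClasspathCheck and the helper isOnClasspath are made up for illustration.

```java
import java.io.File;

class ClasspathCheck {
    /** Returns true if the named class can be loaded from the current class path. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' = do not run static initializers, we only test for presence
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // If this line does not show up, a different main class was launched.
        System.out.println("program starts");
        for (String entry : System.getProperty("java.class.path").split(File.pathSeparator)) {
            System.out.println("class path entry: " + entry);
        }
        System.out.println("PDFBox found:  " + isOnClasspath("org.apache.pdfbox.pdmodel.PDDocument"));
        System.out.println("FontBox found: " + isOnClasspath("org.apache.fontbox.afm.AFMParser"));
    }
}
```

If "program starts" is still missing when you run this, check the run configuration (or the java command line) for the main class it starts.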
I ran your code and it worked properly. Maybe your problem is related to the file path you pass to File.
I put my PDF on the C: drive and hard-coded the file path. Here is my code:
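// This uses the PDFBox 1.8.x API from the question.
// PDFBox 2.0.8 requires an org.apache.pdfbox.io.RandomAccessRead for the parser instead:
// import org.apache.pdfbox.io.RandomAccessFile;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // org.apache.pdfbox.text.PDFTextStripper in 2.0.x

public class PDFReader {
    public static void main(String[] args) {
        File file = new File("C:/my.pdf");
        PDDocument pdDoc = null;
        try {
            // PDFBox 2.0.8 requires org.apache.pdfbox.io.RandomAccessRead:
            // RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
            // PDFParser parser = new PDFParser(randomAccessFile);
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // Page numbers are 1-based; the end page is inclusive
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the document's resources even if extraction failed
            if (pdDoc != null) {
                try {
                    pdDoc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}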
Using PDFBox 2.0.7, this is how I get the text of a PDF:
static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}
Call it like this:
try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}
Since user oivemaria asked in the comments:
You can use PDFBox in your application by adding it to your dependencies in build.gradle:
dependencies {
    compile group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}
Here's more on dependency management using Gradle.
If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.
PDFBox 2.0.3 has a command-line tool as well.
- Download jar file
java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]
Options:
  -password <password>        : Password to decrypt document
  -encoding <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
  -console                    : Send text to console instead of file
  -html                       : Output in HTML format instead of raw text
  -sort                       : Sort the text before writing
  -ignoreBeads                : Disables the separation by beads
  -debug                      : Enables debug output about the time consumption of every stage
  -startPage <number>         : The first page to start extraction (1-based)
  -endPage <number>           : The last page to extract (inclusive)
  <inputfile>                 : The PDF document to use
  [output-text-file]          : The file to write the text to
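If you want to call that tool from a Java program, one option is to assemble the same command line and hand it to ProcessBuilder. The sketch below only builds the argument list from options listed above (-console, -startPage, -endPage); the class name ExtractTextCommand, the helper build, and the file names are placeholders, and it assumes java is on the PATH and the jar sits in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractTextCommand {
    /** Builds the ExtractText command line for a page range, printing to the console. */
    static List<String> build(String appJar, String inputPdf, int startPage, int endPage) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-console");                    // send text to stdout instead of a file
        cmd.add("-startPage");
        cmd.add(String.valueOf(startPage));     // 1-based
        cmd.add("-endPage");
        cmd.add(String.valueOf(endPage));       // inclusive
        cmd.add(inputPdf);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", "input.pdf", 1, 5);
        System.out.println(String.join(" ", cmd));
        // To actually run it (not done here):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids quoting problems when the PDF path contains spaces.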
Maven dependency:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.9</version>
</dependency>
Then the function to get the PDF text as a String:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
    // try-with-resources closes the document when the method returns
    try (PDDocument document = PDDocument.load(pdf)) {
        if (document.isEncrypted()) {
            return null; // encrypted documents are not handled here
        }
        PDFTextStripper tStripper = new PDFTextStripper();
        String pdfFileInText = tStripper.getText(document);
        // split into lines (on \r\n or \n), not on whitespace
        String[] lines = pdfFileInText.split("\\r?\\n");
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            System.out.println(line);
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
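To see what the split("\\r?\\n") call in that function does without needing a PDF, here is a small JDK-only sketch. The class name LineSplitDemo and the sample string are made up for illustration. It shows that the regex breaks the text at Windows and Unix line endings and leaves spaces inside a line alone.

```java
class LineSplitDemo {
    /** Splits text at \r\n or \n; whitespace inside a line is kept. */
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        String text = "first line\r\nsecond  line\nthird";
        String[] lines = splitLines(text);
        System.out.println(lines.length); // prints 3
        for (String line : lines) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Note that String.split drops trailing empty strings, so blank lines at the very end of the extracted text are lost.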
This works fine for extracting data from a PDF file that has text content, using PDFBox 2.0.6:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextExtractor {
    public static void main(String[] args) throws IOException {
        // Arguments: file path, page number, start text, end text
        System.out.println(readParaFromPDF("C:\\sample1.pdf", 3, "Enter Start Text Here", "Enter Ending Text Here"));
    }

    public static String readParaFromPDF(String pdfPath, int pageNo, String strStartIdentifier, String strEndIdentifier) {
        final String notFound = "No ParaGraph Found";
        // try-with-resources closes the document when done
        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            if (document.isEncrypted()) {
                return ""; // encrypted documents are not handled here
            }
            PDFTextStripper tStripper = new PDFTextStripper();
            // Restrict extraction to the single requested page (1-based)
            tStripper.setStartPage(pageNo);
            tStripper.setEndPage(pageNo);
            String pdfFileInText = tStripper.getText(document);
            int startIndex = pdfFileInText.indexOf(strStartIdentifier);
            if (startIndex < 0) {
                return notFound;
            }
            // Look for the end marker only after the start marker
            int endIndex = pdfFileInText.indexOf(strEndIdentifier, startIndex + strStartIdentifier.length());
            if (endIndex < 0) {
                return notFound;
            }
            return pdfFileInText.substring(startIndex, endIndex) + strEndIdentifier;
        } catch (IOException e) {
            return notFound;
        }
    }
}
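The start/end marker lookup can also be tried on a plain string, without PDFBox. This is a sketch under my own naming (TextBetween and between are not part of any library): it returns an empty Optional when a marker is missing instead of relying on an exception, and it searches for the end marker only after the start marker.

```java
import java.util.Optional;

class TextBetween {
    /** Returns the text from startMarker through endMarker (both included), if both occur in that order. */
    static Optional<String> between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // Search for the end marker only after the start marker
        int end = text.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end + endMarker.length()));
    }

    public static void main(String[] args) {
        String page = "Header\nStart of paragraph. Some body text. End of paragraph.\nFooter";
        System.out.println(between(page, "Start of", "End of paragraph.").orElse("No paragraph found"));
        System.out.println(between(page, "Missing", "Footer").orElse("No paragraph found"));
    }
}
```

You would pass the String returned by PDFTextStripper.getText as the first argument.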