PDF Parsing with Swift

Posted 2019-04-06 06:49

I want to parse a PDF that contains no images, only text, and find specific pieces of text in it. For example, I want to search for the string "Name:" and read the characters that follow the ":".

I can already open a PDF, get the number of pages, and loop over them. The problem comes when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they take pointers. I have not found much information online for Swift programmers.

Maybe the easiest way would be to extract all the content into an NSString. Then I could do the rest.

Here is my code:

// Get a reference to the existing PDF
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf)
// PDF page numbers start at 1
for index in 1...pageCount {
    let myPage = CGPDFDocumentGetPage(pdf, index)
    // Somehow search for the string "Name:" to get what's written next
}

2 Answers
We Are One
#2 · 2019-04-06 07:01

This is a pretty intensive task. There are libraries like PDFKitten, but they are no longer maintained. Here is a port of PDFKitten to Swift that I wrote, with some modifications to the way string searching and content indexing are done, as well as support for TrueType fonts.

https://github.com/SimpleApp/PDFParser

[Disclaimer: I am the author of this library.]

[Second disclaimer: this library is 100% open source under the MIT license. It has nothing to do with the company; it's not an ad or even a product. I'm posting this comment to help people, and maybe to grow a community around it, because this is a very common requirement and nothing free works well enough.]

EDIT: The reason it's a pretty intensive task (not to mention all the character encoding issues) is that the PDF format has no notion of a "line of text" or even a "word". All it has is character-printing instructions. This means that if you want to find a "word", you have to recompute the frame of every block of characters using font information, and then find the blocks that can be coalesced into a single word.

That's why you won't find many libraries offering this kind of feature, and why even some big projects sometimes fail to provide correct copy/paste or text search.
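To make the coalescing step described above more concrete, here is a minimal sketch in plain Swift. It is not how PDFParser or PDFKitten work internally; the `Glyph` type, the `coalesceWords` function, and the gap and baseline thresholds are all hypothetical names and values chosen for illustration. It also assumes the glyphs already arrive in reading order with their positions computed, which a real extractor cannot take for granted.

```swift
import Foundation

/// A single positioned character, as a hypothetical extraction pass might
/// produce it after applying font metrics. Not a type from any real library.
struct Glyph {
    let character: Character
    let x: Double        // left edge, in page units
    let width: Double    // advance width derived from font metrics
    let baseline: Double // y position of the text baseline
}

/// Groups glyphs into words. A new word starts when the baseline changes by
/// more than `baselineTolerance`, when the horizontal gap to the previous
/// glyph exceeds `maxGap`, or when an explicit whitespace glyph is seen.
/// Assumption: `glyphs` is already in reading order.
func coalesceWords(_ glyphs: [Glyph],
                   maxGap: Double = 1.0,
                   baselineTolerance: Double = 0.5) -> [String] {
    var words: [String] = []
    var current = ""
    var previous: Glyph?

    for glyph in glyphs {
        var startsNewWord = glyph.character.isWhitespace
        if let prev = previous {
            let gap = glyph.x - (prev.x + prev.width)
            let sameLine = abs(glyph.baseline - prev.baseline) <= baselineTolerance
            if !sameLine || gap > maxGap {
                startsNewWord = true
            }
        }
        if startsNewWord && !current.isEmpty {
            words.append(current)
            current = ""
        }
        if !glyph.character.isWhitespace {
            current.append(glyph.character)
        }
        previous = glyph
    }
    if !current.isEmpty {
        words.append(current)
    }
    return words
}
```

Even this toy version depends on two tuning values (`maxGap` and `baselineTolerance`), which hints at why results vary so much between documents and fonts.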

Summer. ? 凉城
#3 · 2019-04-06 07:13

You can use PDFKit to do this. On macOS it is part of the Quartz framework, and on iOS (11 and later) it is available as the PDFKit framework. It is also pretty fast: I was able to search through a PDF with over 15000 characters in just 0.07s.

Here is an example:

import Quartz

let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))

guard let contents = pdf?.string else {
    print("could not get string from pdf: \(String(describing: pdf))")
    exit(1)
}

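// Split on the marker; index 1 holds all the text after the first foot note
let parts = contents.components(separatedBy: "FOOT NOTE: ")
guard parts.count > 1 else {
    print("marker not found in pdf")
    exit(1)
}
let footNote = parts[1]

print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text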

// Output: "The operating system being written in C resulted in a more portable software."

You can still access most (if not all) of the properties you had before, such as pdf.pageCount for the number of pages and pdf.page(at: <Int>) to get a specific page.
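Applied to the original question (reading what follows "Name:"), the same idea could be sketched as below. The helper names `value(after:in:)` and `firstValue(after:inPDFAt:)` are made up for this example. The sketch assumes the value sits on the same line as the label in the extracted text, which depends on how PDFKit reconstructs the text of a given document.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Returns the text that follows `label` up to the end of that line, trimmed
/// of surrounding whitespace, or nil if `label` does not occur in `text`.
func value(after label: String, in text: String) -> String? {
    guard let labelRange = text.range(of: label) else { return nil }
    let rest = text[labelRange.upperBound...]
    let line = rest.prefix(while: { !$0.isNewline })
    return line.trimmingCharacters(in: .whitespaces)
}

#if canImport(PDFKit)
/// Searches the document page by page and returns the first match together
/// with its zero-based page index (PDFKit page indexes start at 0).
func firstValue(after label: String, inPDFAt url: URL) -> (pageIndex: Int, value: String)? {
    guard let document = PDFDocument(url: url) else { return nil }
    for index in 0..<document.pageCount {
        guard let text = document.page(at: index)?.string,
              let found = value(after: label, in: text) else { continue }
        return (index, found)
    }
    return nil
}
#endif
```

Keeping the string logic in its own function means it can be checked against plain strings without needing a PDF file at all.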
