How to extract plain text from PDF in golang

2019-07-12 18:50发布

问题:

I want to extract text from pdf file using GO. I tried using ledongthuc/pdf Go package that implement the method GetPlainText() to get plain text content without format. But I don't get the plain text. I have as a result:

 W
 S
 D
 V
 Y R
 O
 R
 Q
 W
 D
 L
 U
 H
 P
 H
 Q
 W
......

Go code

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf")
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
    return
}

func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        if p.V.IsNull() {
            continue
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}

回答1:

You can have a message such as "Exemple of a pdf document." instead of

Ex
a
m
pl
e

of

a

pd
f

doc
u
m
e
nt
.

What you need to do is change the textBuilder.WriteString(p.GetPlainText("\n")) to

textBuilder.WriteString(p.GetPlainText(""))

I hope this helps.