How to extract plain text from PDF in golang

Posted 2019-07-12 18:31

I want to extract text from a PDF file using Go. I tried the ledongthuc/pdf package, which provides a GetPlainText() method that returns the text content without formatting. However, I don't get readable plain text. Instead, the result is:

 W
 S
 D
 V
 Y R
 O
 R
 Q
 W
 D
 L
 U
 H
 P
 H
 Q
 W
......

Go code

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf")
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

// readPdf opens the PDF at path and returns the text of all pages concatenated.
func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    // Pages are numbered starting at 1.
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        // Skip pages whose underlying value is null.
        if p.V.IsNull() {
            continue
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}

Tags: pdf go text extract

1 answer

兄弟一词,经得起流年.

#2 · 2019-07-12 18:43

You can get output such as "Example of a pdf document." instead of

Ex
a
m
pl
e

of

a

pd
f

doc
u
m
e
nt
.

Judging by this output, the "\n" argument puts every text fragment on its own line. Pass an empty string instead, that is, change textBuilder.WriteString(p.GetPlainText("\n")) to

textBuilder.WriteString(p.GetPlainText(""))
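
To see why the separator matters, here is a minimal standalone sketch that does not use the pdf package at all. It assumes the extractor simply places the separator between consecutive text fragments; joinFragments and the fragment list are hypothetical, made up to mirror the output above, and are not part of ledongthuc/pdf.

```go
package main

import (
	"fmt"
	"strings"
)

// joinFragments is a hypothetical stand-in for how an extractor might use
// its separator argument: sep is placed between consecutive fragments.
func joinFragments(fragments []string, sep string) string {
	return strings.Join(fragments, sep)
}

func main() {
	// Made-up fragments, split the way the broken output above is split.
	fragments := []string{
		"Ex", "a", "m", "pl", "e",
		" of", " a", " pd", "f",
		" doc", "u", "m", "e", "nt", ".",
	}

	// With "\n", every fragment lands on its own line.
	fmt.Printf("%q\n", joinFragments(fragments, "\n"))

	// With "", the fragments run together into readable text.
	fmt.Printf("%q\n", joinFragments(fragments, ""))
}
```

If the real package behaves the same way, an empty separator also removes line breaks that were genuinely present between lines of the page, so the result may need some post-processing.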

I hope this helps.
