Reading PDF per Line

Posted 2020-03-22 07:39

Question:

How can I read a PDF file line by line using iText5 for .NET? I have searched the internet, but I only found ways to read a PDF file's content page by page.

Please see the code below.

public string ReadPdfFile(object Filename)
{

    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader((string)Filename);

        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();

            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);

            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText = strText + s;

        }
        reader.Close();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
    return strText;
}

Answer 1:

Try using the LocationTextExtractionStrategy instead of the SimpleTextExtractionStrategy; it adds newline characters to the text it returns. You can then call strText.Split('\n') to split the text into a string[] and consume it line by line.
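
As a minimal sketch of that suggestion, assuming iTextSharp 5.x is referenced: the class name PdfLineReader and the methods ReadPdfLines and SplitIntoLines are hypothetical names chosen for illustration, not part of iText. The sketch also leaves out the Encoding round trip from the question, on the assumption that GetTextFromPage already returns an ordinary .NET string.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, named for illustration only.
public static class PdfLineReader
{
    // Splits the text of one page on the newline characters that the
    // extraction strategy inserted.
    public static string[] SplitIntoLines(string pageText)
    {
        return pageText.Split('\n');
    }

    // Returns every line of every page, in page order.
    public static List<string> ReadPdfLines(string path)
    {
        var lines = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
                lines.AddRange(SplitIntoLines(pageText));
            }
        }
        finally
        {
            // Close the reader even if extraction throws.
            reader.Close();
        }
        return lines;
    }
}
```

A caller could then loop over the result, for example foreach (string line in PdfLineReader.ReadPdfLines(fileName)) { ... }. How well the lines match what you see on the page depends on how the PDF was produced.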



Answer 2:

You can look at PDF2Text Pilot, open-source software released under the BSD license.

Although it is written in C++, it may serve as a good starting point for solving your problem.

I'm not proficient in C#, but I think there might be some hope on the interoperability side.



Answer 3:

I worked for an e-book reader company that dealt with PDFs. We spent a lot of time and effort trying to recover the reading order of the text, since the reader could read aloud to you (bouncing dot and all). PDFs do not have to store text in line-by-line sequence. Books also have many elements that fall outside the reading order, including page numbers, references, captions, examples, and multi-column layouts. It's a hard problem. PDF is basically a print format at heart.
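
To illustrate why line order has to be reconstructed, here is a rough sketch of grouping positioned text fragments into lines by their vertical position. The types TextFragment and LineGrouper and the tolerance value are hypothetical and are not part of iText. The sketch handles only single-column text and makes no attempt at the harder cases listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one positioned piece of text as it might be found on a PDF page.
public sealed class TextFragment
{
    public string Text { get; }
    public double X { get; }   // left edge of the fragment
    public double Y { get; }   // baseline; in PDF coordinates y grows upward
    public TextFragment(string text, double x, double y) { Text = text; X = x; Y = y; }
}

public static class LineGrouper
{
    // Treats fragments whose baselines lie within `tolerance` of the first
    // fragment of a line as belonging to that line. Lines come out top to
    // bottom, and fragments within a line left to right.
    public static List<string> GroupIntoLines(IEnumerable<TextFragment> fragments, double tolerance = 2.0)
    {
        var lines = new List<string>();
        var current = new List<TextFragment>();
        double currentY = 0;

        foreach (var f in fragments.OrderByDescending(f => f.Y).ThenBy(f => f.X))
        {
            // A large jump in y means a new line starts here.
            if (current.Count > 0 && Math.Abs(f.Y - currentY) > tolerance)
            {
                lines.Add(Join(current));
                current.Clear();
            }
            if (current.Count == 0) currentY = f.Y;
            current.Add(f);
        }
        if (current.Count > 0) lines.Add(Join(current));
        return lines;
    }

    // Re-sorts by x because fragments on one line may differ slightly in y.
    private static string Join(List<TextFragment> line)
    {
        return string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text));
    }
}
```

Even in this simplified form, the fragments can arrive in any order and must be sorted before they read as lines. With two columns, sorting by vertical position alone would interleave the columns, which is one reason the general problem is hard.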



Answer 4:

If you are making an e-book reader for PDFs, either display the PDF as it is, with the same look as in other PDF readers, or extract the text and reformat it yourself.

I prefer the second approach: format the text however looks nice. When I use an e-book reader, I only care about the content, not about what it was supposed to look like.



Tags: c# pdf itext