How can I read a PDF file line by line
using iText5 for .NET?
I have searched the internet, but I only found ways to read a PDF file's content page by page.
Please see the code below.
public string ReadPdfFile(object Filename)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader((string)Filename);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText = strText + s;
        }
        reader.Close();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
    return strText;
}
Try using LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. It adds newline characters to the returned text, so you can then call strText.Split('\n') to split the text into a string[] and consume it line by line.
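Here is a minimal sketch of that idea, assuming iTextSharp 5.x is referenced in the project. The class name PdfLineReader and the helpers SplitLines and ReadPdfLines are hypothetical names chosen for illustration; only PdfReader, PdfTextExtractor.GetTextFromPage and LocationTextExtractionStrategy come from the library. Dropping empty lines and trimming a trailing '\r' are my own choices, not something the library does for you.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, not part of iTextSharp.
public static class PdfLineReader
{
    // Splits extracted text on '\n', trims a trailing '\r',
    // and skips empty lines (an assumption: blank lines are not wanted).
    public static List<string> SplitLines(string text)
    {
        var lines = new List<string>();
        foreach (string raw in text.Split('\n'))
        {
            string line = raw.TrimEnd('\r');
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    // Reads every page of the PDF at 'path' and returns its text as a list of lines.
    public static List<string> ReadPdfLines(string path)
    {
        var lines = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
                lines.AddRange(SplitLines(pageText));
            }
        }
        finally
        {
            reader.Close();
        }
        return lines;
    }
}
```

You would then loop over the result with something like foreach (string line in PdfLineReader.ReadPdfLines(path)). Keep in mind that the "lines" are whatever the extraction strategy reconstructs from text positions, so they may not match the visual lines in every PDF.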
You can find PDF2Text Pilot here; it is open-source software licensed under the BSD license.
Although it's written in C++, it may serve as a good starting point toward solving your problem.
I'm not proficient in C#, but I think there might be some hope on the interoperability side.
I worked for an eBook reader company that dealt with PDFs. We spent a lot of time and effort trying to get the reading order of text right, since the reader could read aloud to you (with a bouncing dot). PDFs do not have to store text in line-by-line sequence. Books also have lots of elements that are not in reading order, including page numbers, references, captions, examples, multi-column layouts, etc. It's a hard problem. PDF is basically a print format at heart.
If you are making an eBook reader for PDF, either display the PDF as it is, with the same look other PDF readers give it, or extract the text and reformat it yourself.
I prefer the second method: just format the text however looks nice, since when I use an eBook reader I only care about the content and never about what it was supposed to look like.