How to search for Particular Line Contents in PDF

Posted 2019-09-23 04:22

Question:

I have a PDF file that I need to read and validate. If a line contains wrong data, that line should be marked in red. So far I can read and validate the contents of the PDF by extracting them into a string, but I don't know how to color a line, for example marking it red when it contains wrong data. So my question is: "How do I search for particular line contents in a PDF and mark that line in color?" Here is my code in C#:

                // Requires: using System.IO; using System.Text; using iTextSharp.text.pdf.parser;
                // pdfReader (a PdfReader) and page (a 1-based page number) are defined elsewhere.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

                if (currentText.Contains("1 . 1 To Airtel Mobile") && currentText.Contains("Total"))
                {
                    int startPosition = currentText.IndexOf("1 . 1 To Airtel Mobile");
                    int endPosition = currentText.IndexOf("Total");

                    string result = currentText.Substring(startPosition, endPosition - startPosition);
                    // result contains everything from the start marker up to, but not including, "Total"

                    using (StringReader reader = new StringReader(result))
                    {
                        // Loop over the lines in the string.
                        string line;
                        while ((line = reader.ReadLine()) != null)
                        {
                            // Split the line into space-separated fields for validation.
                            string[] split = line.Split(new Char[] { ' ' });
                        }
                    }
                }

If a line's contents validate correctly, nothing needs to happen; otherwise that line should be marked in red in the PDF file.

Answer 1:

Please read the documentation before posting semi-duplicate questions, such as:

  • Edit an existing PDF file using iTextSharp
  • How to Read and Mark(Highlight) a pdf file using C#

You have received some very good feedback, such as the answer from Nenotlep that was initially deleted (I asked the moderators to have it restored). Especially the comment by mkl should have been very useful to you. It refers to Retrieve the respective coordinates of all words on the page with itextsharp and that's exactly what you're asking now, making your question a duplicate (a possible reason to have it removed from StackOverflow).

In his answer, mkl explains that you're taking your assignment too lightly. Instead of extracting pure text, you should extract TextRenderInfo objects. These objects contain information about the content (the actual text) as well as the position on the page. See for instance the ParsingHelloWorld example from chapter 15 of my book.

The method you're using returns the content of the PDF as a string, similar to result1.txt, which is the output of said example:

Hello World

In the same example, we parse a different PDF that has the exact same content when looked at by the human eye. However, when you parse the document, the content looks like this (see result2.txt):

ld Wor llo He

The reason for this difference is inherent to the nature of PDF: the concept of lines doesn't really exist. You can add characters to a page in any order you want. You don't even need to add complete words!

When you use the GetTextFromPage() method, you tell iText you don't want to get any info about the position of the text. mkl has tried explaining this to you, but I'll try explaining it once more. In the example from my book, I have extended the RenderListener in a class named MyTextRenderListener. Now the output looks like this (see result3.txt):

<>
<<ld><Wor><llo><He>>
<<Hello People>>

This is the output of the same PDF we parsed when getting result2.txt. As you can see, we missed the words Hello People in the previous attempt.

The example is really simple: it just shows you how text snippets are stored in the PDF. We get all the TextRenderInfo objects and we use the GetText() method to get the text. The order in which we get the text is the order that is used in the PDF's content stream.
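To make that concrete, here is a minimal sketch of a listener that keeps the position of each snippet next to its text. It assumes iTextSharp 5.x; PositionedTextListener and Snippet are hypothetical names made up for this illustration, not classes from the book or from iText, and the PDF path is taken from the command line.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: records every text snippet with the start of its baseline.
public class PositionedTextListener : IRenderListener
{
    public class Snippet
    {
        public string Text;
        public float X; // x of the baseline start, in PDF user space units
        public float Y; // y of the baseline start
    }

    public readonly List<Snippet> Snippets = new List<Snippet>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Called once per snippet, in content stream order (not reading order).
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Snippets.Add(new Snippet
        {
            Text = renderInfo.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        var reader = new PdfReader(args[0]); // path to the PDF to inspect
        try
        {
            var parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                var listener = parser.ProcessContent(page, new PositionedTextListener());
                foreach (var s in listener.Snippets)
                    Console.WriteLine("page {0}: <{1}> at ({2}, {3})", page, s.Text, s.X, s.Y);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The snippets come back in content stream order, so a document like the second one above would list ld before He; ordering them is a separate step.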

When using a specific strategy, such as the LocationTextExtractionStrategy, iText retrieves all these objects and it uses the GetBaseline() method to sort all the text snippets.

<<ld><Wor><llo><He>>

results in:

<<He><llo><Wor><ld>>

Then iText looks at the distance between the different snippets. In this case, iText adds a space between the <llo> and <Wor> snippets.
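The sort-then-space idea can be sketched without iText at all. The code below is a simplified illustration, not the actual LocationTextExtractionStrategy implementation: Chunk, ChunkSorter, the coordinates, the widths and the 2-unit space threshold are all assumptions chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a text snippet and its position on the page.
public class Chunk
{
    public string Text;
    public float X;     // x of the baseline start
    public float Y;     // y of the baseline
    public float Width; // horizontal extent of the snippet

    public Chunk(string text, float x, float y, float width)
    {
        Text = text; X = x; Y = y; Width = width;
    }
}

public static class ChunkSorter
{
    // Snippets whose baselines round to the same integer are treated as one line.
    private static int LineKey(Chunk c)
    {
        return (int)Math.Round(c.Y);
    }

    public static string Compose(IEnumerable<Chunk> chunks, float spaceThreshold)
    {
        // Top of the page first (larger y), then left to right within a line.
        var ordered = chunks.OrderByDescending(c => LineKey(c)).ThenBy(c => c.X).ToList();
        var sb = new StringBuilder();
        Chunk previous = null;
        foreach (var c in ordered)
        {
            if (previous != null)
            {
                if (LineKey(c) != LineKey(previous))
                    sb.Append('\n'); // new baseline, new line
                else if (c.X - (previous.X + previous.Width) > spaceThreshold)
                    sb.Append(' ');  // visible gap between snippets, insert a space
            }
            sb.Append(c.Text);
            previous = c;
        }
        return sb.ToString();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Same order as the content stream above: ld, Wor, llo, He.
        var streamOrder = new List<Chunk>
        {
            new Chunk("ld", 50f, 700f, 10f),
            new Chunk("Wor", 30f, 700f, 20f),
            new Chunk("llo", 12f, 700f, 14f),
            new Chunk("He", 0f, 700f, 12f)
        };
        Console.WriteLine(ChunkSorter.Compose(streamOrder, 2f)); // prints "Hello World"
    }
}
```

With these made-up coordinates, llo ends at x = 26 and Wor starts at x = 30, so the gap of 4 exceeds the threshold and a space is inserted; the other snippets touch and are joined directly.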

You are now looking to do the same thing: you are going to write a system that is going to retrieve all the text snippets, that is going to order them, examine them, and based on the composed content, you are going to add a background at those locations.
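For the last step, adding a background, one possible approach is to draw a filled rectangle underneath the existing content with PdfStamper. This is only a sketch assuming iTextSharp 5.x: LineMarker and MarkRegion are hypothetical names, and the rectangle has to come from your own analysis, for instance from the start point of GetDescentLine() on the first snippet of the line and the end point of GetAscentLine() on the last one.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical helper: paints a red rectangle behind a region of one page.
public static class LineMarker
{
    public static void MarkRegion(string src, string dest, int page, Rectangle region)
    {
        var reader = new PdfReader(src);
        try
        {
            using (var output = new FileStream(dest, FileMode.Create))
            {
                var stamper = new PdfStamper(reader, output);
                // The under content is drawn before the page's own content,
                // so the text stays readable on top of the red background.
                PdfContentByte under = stamper.GetUnderContent(page);
                under.SaveState();
                under.SetColorFill(BaseColor.RED);
                under.Rectangle(region.Left, region.Bottom, region.Width, region.Height);
                under.Fill();
                under.RestoreState();
                stamper.Close(); // writes the modified PDF to dest
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call could look like LineMarker.MarkRegion("input.pdf", "marked.pdf", 1, new Rectangle(36f, 700f, 559f, 714f)), where the file names and coordinates are placeholders. If the page already paints an opaque background of its own, the rectangle would be hidden beneath it, and drawing on GetOverContent() with transparency may be a better fit.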