I'm trying to remove text from a particular section of a PDF. If I know the X,Y co-ordinates of the area, I'm able to remove the text. But I'm unable to get the co-ordinates of the selected area from PDF. Kindly help me.
Question:
Answer 1:
This question is a follow-up of your previous question: Remove text occurrences contained in a specified area with iText
In that question, you asked about removing content from a specific area. Now you are asking how to determine this specific area, but your question is incomplete: you don't tell us any of the criteria for selecting the area.
It seems that you are trying to do something that is called redaction. This is explained in the StackOverflow question: How to create and apply redactions?
In the answer to that question, I explain how to create redaction annotations programmatically. However, redaction is usually done manually, using Adobe Acrobat.
The functionality you need is: Tools > Protection > Mark for Redaction
If you only need the coordinates and no redaction annotation, you could introduce another annotation that allows you to mark a rectangle manually and then use iText to extract the coordinates. For instance: if the rectangle is a form field, then it's really easy to get the coordinates. If the content you want to remove is a value of the form field, it's even easier to remove that content: you just remove the field.
If there is no way to retrieve these coordinates manually, then you may be facing something that is impossible. For instance, if you don't know anything about the content of the area you want to remove, how on earth are you going to teach a program what it needs to remove?
If you do know what content you're looking for, you have to parse for that content. That question has been asked and answered before: Get the exact Stringposition in PDF
Update:
In the comments, you explain that you convert the PDF page to an image, that you render the image in a Java Swing application so that a user can select a rectangle. This rectangle is stored as a java.awt.Image.
This can lead to the following problems, because the coordinate system in Java is different from the coordinate system in PDF.
- The Y-axis is different: In PDF, the size of the page is described in rectangles that we call page boundaries. The most important page boundaries are the MediaBox (mandatory) and the CropBox (optional). The MediaBox contains the coordinates of the lower-left corner and the upper-right corner of the rectangle that defines your page. In the coordinate system, the Y-axis points upwards. The Y coordinate of the lower-left corner is lower than the Y coordinate of the upper-right corner. In Java, it's the other way around: the Y coordinate at the top of an object is 0 and the Y-axis points downwards: the higher the Y value, the lower the object at this Y value.
- There may be an offset: In most cases, the lower-left corner of the MediaBox has the coordinate X = 0, Y = 0. This isn't always the case. It may be necessary to take into account an offset.
- The resolution can be different: The default user unit corresponds with a point. For instance: an A4 page measures 595 by 842 user units. There are 72 points in every inch. When you create an image, you don't necessarily measure in points. Maybe you measure in pixels. Maybe you create an image with 300 pixels per inch (300 dpi).
All of these differences can cause the rectangle you get from your Swing app to be different from the coordinates you need to use in PDF. You need to take all of this into account; otherwise, you'll keep on facing your "it doesn't work" problem. This is not an iText problem, this is a math problem.
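The three adjustments listed above (Y-axis flip, MediaBox offset, resolution) can be sketched in plain Java. This is only a minimal illustration, not iText code: the class name `PdfCoordinates` and the method `toPdfRectangle` are hypothetical, and the sketch assumes that the image covers exactly the MediaBox, that the page is not rotated, that there is no CropBox, and that the user unit is the default 1/72 inch.

```java
/**
 * Hypothetical helper: converts a rectangle selected on a rendered page image
 * (pixels, origin at the top-left, Y pointing down) into PDF user space
 * (points, Y pointing up). Assumes the image covers exactly the MediaBox,
 * no page rotation, no CropBox, and the default user unit of 1/72 inch.
 */
final class PdfCoordinates {

    private static final double POINTS_PER_INCH = 72.0;

    private PdfCoordinates() {
    }

    /**
     * @param x, y          top-left corner of the selection, in image pixels
     * @param width, height size of the selection, in image pixels
     * @param mediaLlx      X of the lower-left corner of the MediaBox (often 0)
     * @param mediaUry      Y of the upper-right corner of the MediaBox
     * @param dpi           resolution at which the page image was rendered
     * @return {llx, lly, urx, ury} in PDF user units
     */
    static double[] toPdfRectangle(double x, double y, double width, double height,
                                   double mediaLlx, double mediaUry, double dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        // Resolution: convert pixels to points (72 points per inch).
        double left = x * POINTS_PER_INCH / dpi;
        double right = (x + width) * POINTS_PER_INCH / dpi;
        double top = y * POINTS_PER_INCH / dpi;
        double bottom = (y + height) * POINTS_PER_INCH / dpi;

        // Offset: X distances are measured from the left edge of the MediaBox.
        double llx = mediaLlx + left;
        double urx = mediaLlx + right;
        // Y-axis flip: image distances from the top become distances below
        // the top edge of the MediaBox.
        double ury = mediaUry - top;
        double lly = mediaUry - bottom;

        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // A4 page (MediaBox 0 0 595 842) rendered at 300 dpi.
        double[] r = toPdfRectangle(300, 300, 600, 150, 0, 842, 300);
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
    }
}
```

For an A4 page with MediaBox [0 0 595 842] rendered at 300 dpi, a selection at pixel (300, 300) measuring 600 by 150 pixels corresponds to the PDF rectangle with lower-left corner (72, 734) and upper-right corner (216, 770). If your page has a CropBox, a rotation, or a non-default user unit, the conversion needs additional steps that this sketch leaves out.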