The reason why the PDFVisibleTextStripper from this answer the OP referenced does not work is that the calculation of the end of a character baseline end
in the overwritten processTextPosition
does not take page rotation into account. If you change that method, though, to only test the start of each character baseline and ignore the end, it works fairly good for the document at hand:
@Override
protected void processTextPosition(TextPosition text) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
PDGraphicsState gs = getGraphicsState();
Area area = gs.getCurrentClippingPath();
if (area == null || area.contains(start.getX(), start.getY()))
super.processTextPosition(text);
}
With this processTextPosition
override the result of the text extraction (with SortByPosition
set to true
) is:
Profit & Loss 12 Month Recap
Property: 8151 W. 183rd Street
Monthly recap 05/01/16 - 04/30/17 (cash basis)
MAY 16 JUN 16 JUL 16 AUG 16 SEP 16 OCT 16 NOV 16 DEC 16 JAN 17 FEB 17 MAR 17 APR 17 TOTAL
INCOME
4000 RENTAL INCOME
4001 Base Rent 343,002.59 38,045.11 38,045.11 38,045.11 66,081.36 122,153.86 66,081.36 38,045.11 0.00 76,090.22 38,598.49 66,634.74 930,823.06
4004 Prepaid Rent Inco -165,742.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 38,045.11 -38,045.11 0.00 0.00 -165,742.50
4000 Total RENTAL INC 177,260.09 38,045.11 38,045.11 38,045.11 66,081.36 122,153.86 66,081.36 38,045.11 38,045.11 38,045.11 38,598.49 66,634.74 765,080.56
4200 INCOME CHARGEB
4205 Property Tax Reco 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 4,250.00 3,696.62 4,250.00 50,446.62
4210 CAM Recoveries 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 4,750.00 57,000.00
4200 Total INCOME CH 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 9,000.00 8,446.62 9,000.00 107,446.62
4600 OTHER INCOME
4610 Late / NSF Fees 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,394.72 3,828.61 0.00 0.00 0.00 5,223.33
4600 Total OTHER INC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,394.72 3,828.61 0.00 0.00 0.00 5,223.33
TOTAL INCOME 186,260.09 47,045.11 47,045.11 47,045.11 75,081.36 131,153.86 75,081.36 48,439.83 50,873.72 47,045.11 47,045.11 75,634.74 877,750.51
EXPENSE
6000 PROFESSIONAL FE
6010 Professional Fees 0.00 0.00 0.00 2,500.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2,500.00
6020 Legal Fees 0.00 0.00 0.00 4,592.71 0.00 1,466.33 1,703.35 2,006.00 0.00 685.96 4,368.50 0.00 14,822.85
6000 Total PROFESSIO 0.00 0.00 0.00 7,092.71 0.00 1,466.33 1,703.35 2,006.00 0.00 685.96 4,368.50 0.00 17,322.85
6100 UTILITIES
6105 Water & Sewer 0.00 0.00 0.00 21.21 0.00 0.00 25.81 0.00 0.00 31.91 0.00 0.00 78.93
6110 Electricity 1,000.91 358.23 390.43 350.71 353.69 0.00 666.39 381.97 486.85 449.62 480.21 486.81 5,405.82
6125 Trash Removal 229.54 231.34 232.56 232.78 231.66 240.94 240.94 241.40 241.40 518.97 259.18 0.00 2,900.71
6100 Total UTILITIES 1,230.45 589.57 622.99 604.70 585.35 240.94 933.14 623.37 728.25 1,000.50 739.39 486.81 8,385.46
6200 REPAIR & MAINTEN
6210 Field & Grounds - 3,094.00 0.00 0.00 2,313.84 1,009.50 0.00 1,439.58 1,302.75 600.00 0.00 0.00 1,909.73 11,669.40
6211 Irrigation / Sprinkle 0.00 0.00 0.00 0.00 0.00 1,121.08 350.00 0.00 0.00 0.00 0.00 0.00 1,471.08
6215 Landscape / Lawn 565.71 565.71 565.71 565.71 565.71 565.71 1,165.71 0.00 0.00 0.00 0.00 495.00 5,054.97
6220 Sanitary Sewers 0.00 0.00 0.00 950.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 950.00
6221 Storm Drains 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2,500.00 0.00 2,500.00
6223 Snow Removal 1,365.00 3,440.00 0.00 0.00 0.00 0.00 0.00 1,350.00 4,440.00 4,106.00 790.00 2,340.00 17,831.00
6228 Ceiling Tiles 0.00 0.00 0.00 0.00 53.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 53.30
6231 Building - General 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 634.65 634.65
6233 Roof / Flashing 1,840.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 764.00 0.00 2,604.00
6234 Electrical Repairs 0.00 0.00 0.00 395.00 0.00 0.00 960.00 90.00 0.00 0.00 0.00 0.00 1,445.00
6236 Plumbing Repairs 0.00 0.00 3,316.59 0.00 2,315.95 0.00 930.00 812.17 0.00 0.00 0.00 0.00 7,374.71
6237 Fire & Life Safety 0.00 0.00 0.00 0.00 0.00 150.00 0.00 0.00 660.00 0.00 0.00 1,550.00 2,360.00
6238 Lighting Supplies 0.00 0.00 0.00 0.00 0.00 0.00 875.00 193.05 0.00 0.00 0.00 0.00 1,068.05
Profit & Loss 12 Month Recap 05/02/17 11:13 AM Page 1 of rentmanager.com - property management systems rev.12.180
MAY 16 JUN 16 JUL 16 AUG 16 SEP 16 OCT 16 NOV 16 DEC 16 JAN 17 FEB 17 MAR 17 APR 17 TOTAL
6240 Lock & Key 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.59 0.00 0.00 0.00 14.59
6242 HVAC Expense 4,375.00 0.00 1,370.00 2,043.25 0.00 0.00 0.00 415.00 1,326.00 1,835.00 0.00 0.00 11,364.25
6251 Pest Control 0.00 71.07 0.00 71.07 0.00 0.00 71.07 71.07 0.00 71.07 0.00 71.07 426.42
6200 Total REPAIR & M 11,239.71 4,076.78 5,252.30 6,338.87 3,944.46 1,836.79 5,791.36 4,234.04 7,040.59 6,012.07 4,054.00 7,000.45 66,821.42
6300 JANITORIAL
6310 Janitorial Services 1,935.00 1,935.00 1,935.00 1,935.00 1,935.00 0.00 3,870.00 1,935.00 1,935.00 1,935.00 1,995.00 1,995.00 23,340.00
6320 Janitorial Supplies 79.74 260.01 79.74 90.84 113.14 0.00 170.58 0.00 365.61 90.84 0.00 153.01 1,403.51
6300 Total JANITORIAL 2,014.74 2,195.01 2,014.74 2,025.84 2,048.14 0.00 4,040.58 1,935.00 2,300.61 2,025.84 1,995.00 2,148.01 24,743.51
6400 PAYROLL
6410 P/R Salaries - Offi 2,167.72 2,190.43 2,213.14 2,213.14 1,512.40 2,342.28 2,224.93 2,107.58 2,107.58 2,107.58 2,190.78 2,344.16 25,721.72
6412 P/R Taxes - Office 179.87 167.56 169.30 169.30 115.70 179.18 170.21 161.23 238.16 231.10 199.89 196.42 2,177.92
6420 Employee Insuran 76.06 76.14 76.22 199.23 104.30 161.06 152.29 137.91 139.14 139.14 143.91 175.02 1,580.42
6421 Employee Benefit 3.54 2.40 87.37 141.59 35.59 114.13 111.50 110.15 89.47 107.81 114.80 49.60 967.95
6423 Workers Compens 42.50 42.94 37.74 32.10 21.93 33.96 32.26 30.56 30.56 30.56 31.76 33.98 400.85
6400 Total PAYROLL 2,469.69 2,479.47 2,583.77 2,755.36 1,789.92 2,830.61 2,691.19 2,547.43 2,604.91 2,616.19 2,681.14 2,799.18 30,848.86
6500 TAXES INSURANCE
6510 Real Estate Tax E 69,570.07 0.00 0.00 0.00 0.00 69,570.07 0.00 0.00 0.00 0.00 0.00 0.00 139,140.14
6520 Insurance Expens 2,078.00 2,704.50 0.00 2,704.50 0.00 0.00 2,704.50 0.00 0.00 2,704.50 0.00 0.00 12,896.00
6500 Total TAXES INSU 71,648.07 2,704.50 0.00 2,704.50 0.00 69,570.07 2,704.50 0.00 0.00 2,704.50 0.00 0.00 152,036.14
6600 Property Manageme 9,575.44 8,381.70 2,117.03 2,117.03 2,117.03 3,378.66 5,901.92 3,378.66 2,179.79 2,000.00 3,829.06 2,117.03 47,093.35
6650 Receiver Fees 6,625.00 6,125.00 0.00 0.00 6,875.00 0.00 7,062.50 8,375.00 0.00 0.00 8,875.00 0.00 43,937.50
6700 GENERAL & ADMIN
6710 PM / Work Order S 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 95.00 1,140.00
6720 Postage / Messen 63.58 0.00 7.59 9.64 20.63 5.98 6.99 0.00 17.38 7.21 14.36 10.98 164.34
6725 Office Supplies 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 148.88 148.88
6735 Office Equipment 0.00 0.00 0.00 0.00 0.00 0.00 0.00 218.40 0.00 0.00 0.00 0.00 218.40
6740 Telephone 21.33 0.00 11.54 15.00 21.12 8.76 9.77 0.00 13.19 11.96 3.14 7.88 123.69
6760 Auto Mileage & Ex 100.44 0.00 68.75 140.24 104.14 61.29 142.59 29.00 56.04 0.00 23.14 0.00 725.63
6770 Leasing & Maint. O 0.00 0.00 0.00 0.00 0.00 0.00 75.00 0.00 0.00 0.00 0.00 0.00 75.00
6780 Bank Fees 129.45 0.00 0.00 105.91 87.62 0.00 53.61 0.00 120.92 56.46 77.49 79.74 711.20
6700 Total GENERAL & 409.80 95.00 182.88 365.79 328.51 171.03 382.96 342.40 302.53 170.63 213.13 342.48 3,307.14
TOTAL EXPENSE 105,212.90 26,647.03 12,773.71 24,004.80 17,688.41 79,494.43 31,211.50 23,441.90 15,156.68 17,215.69 26,755.22 14,893.96 394,496.23
NOI 81,047.19 20,398.08 34,271.40 23,040.31 57,392.95 51,659.43 43,869.86 24,997.93 35,717.04 29,829.42 20,289.89 60,740.78 483,254.28
N/O EXPENSE
7100 NON-OPERATING E
7110 Lease Commissio 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 0.00 0.00 0.00 33,203.00
7130 Professional Fees 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1,276.00 0.00 0.00 1,276.00
7100 Total NON-OPER 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 1,276.00 0.00 0.00 34,479.00
TOTAL N/O EXPENSE 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 33,203.00 1,276.00 0.00 0.00 34,479.00
NET INCOME 81,047.19 20,398.08 34,271.40 23,040.31 57,392.95 51,659.43 43,869.86 24,997.93 2,514.04 28,553.42 20,289.89 60,740.78 448,775.28
Profit & Loss 12 Month Recap 05/02/17 11:13 AM Page 2 of rentmanager.com - property management systems rev.12.180
At first glance the only visible text missing is the total number of pages in the footers of both pages.
As said by the OP in a comment
It seems same thing should be applied in deleteCharsInPath()
Indeed, deleteCharsInPath
should also be changed to:
void deleteCharsInPath() {
for (List<TextPosition> list : charactersByArticle) {
List<TextPosition> toRemove = new ArrayList<>();
for (TextPosition text : list) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
if (linePath.contains(start.getX(), start.getY())) {
toRemove.add(text);
}
}
if (toRemove.size() != 0) {
System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
list.removeAll(toRemove);
}
}
}
The OP presented another document in which even the PDFVisibleTextStripper
as corrected above failed to properly recognize the visible characters.
The cause is another normalization by PDFBox text stripping moving the origin into the lower left corner of the crop box.
Patching the PDFVisibleTextStripper
methods to add the lower left crop box coordinate values again results in a decent extraction of visible text.
Overriding processPage
allows us to read the lower left crop box coordinates:
float lowerLeftX = 0;
float lowerLeftY = 0;
@Override
public void processPage(PDPage page) throws IOException {
PDRectangle pageSize = page.getCropBox();
lowerLeftX = pageSize.getLowerLeftX();
lowerLeftY = pageSize.getLowerLeftY();
super.processPage(page);
}
processTextPosition
and deleteCharsInPath
need to take these values into account:
@Override
protected void processTextPosition(TextPosition text) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
PDGraphicsState gs = getGraphicsState();
Area area = gs.getCurrentClippingPath();
if (area == null || area.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY()))
super.processTextPosition(text);
}
[...]
void deleteCharsInPath() {
for (List<TextPosition> list : charactersByArticle) {
List<TextPosition> toRemove = new ArrayList<>();
for (TextPosition text : list) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
if (linePath.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY())) {
toRemove.add(text);
}
}
if (toRemove.size() != 0) {
System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
list.removeAll(toRemove);
}
}
}
Now the extraction result is ok for the new file, too. ;)