Trim() doesn't work with tables

Posted 2019-08-14 15:17

Question:

I had to solve a little problem today: trimming the trailing whitespace that a PDF converter had added to each and every cell of an MS Word document. I quickly found out that this isn't possible using the standard Word interface, so I wrote a small VBA script:

Sub TrimCellSpaces()
    Dim itable As Table
    Dim C As Cell
    For Each itable In ThisDocument.Tables
        For Each C In itable.Range.Cells
            C.Range.Text = Trim(C.Range.Text)
        Next
    Next
End Sub

I was surprised that not only did this fail to remove the trailing spaces, it even added paragraph markers at the end of each cell. So I tried a regex approach:

Sub TrimCellSpaces()
    Dim myRE As New RegExp
    Dim itable As Table
    Dim C As Cell
    myRE.Pattern = "\s+$"
    For Each itable In ThisDocument.Tables
        For Each C In itable.Range.Cells
            With myRE
                C.Range.Text = .Replace(C.Range.Text, "")
            End With
        Next
    Next
End Sub

Same result. I added a breakpoint, copied the value of C.Range.Text (before replacement) into a hex editor, and found that it ended in the hex sequence 0D 0D 07. (07 is the ASCII Bell character!)

I changed the regex to \s+(?!.*\w), and the script worked flawlessly. After the replace operation, the value of C.Range.Text ended only in 0D 07 (one 0D fewer).

I also tried this with a newly created table, not one generated by Word's PDF importer - same results.

What's going on here? Is Word using 0D 0D 07 as an "end of cell" marker? Or is it 0D 07? Why did \s+ remove only one 0D?

Answer 1:

Every table cell in Word ends in ANSI 13 + ANSI 7 (hex 0D 07). This is the "end of cell" marker (shown as a little "sunshine" symbol if you have the display of non-printing characters turned on in the UI). Word uses it for structuring the table and storing cell-related information.
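
To see which characters a particular installation actually returns, the tail of a cell's text can be dumped as character codes. The following is only a minimal sketch: TailCodes and ShowCellEndCodes are hypothetical helper names, and the Sub assumes the active document contains at least one table with a cell at row 1, column 1.

```vba
Option Explicit

' Hypothetical helper: returns the character codes of the last n
' characters of s as a space-separated string.
Function TailCodes(ByVal s As String, ByVal n As Long) As String
    Dim i As Long
    Dim first As Long
    Dim result As String

    first = Len(s) - n + 1
    If first < 1 Then first = 1

    For i = first To Len(s)
        If Len(result) > 0 Then result = result & " "
        result = result & CStr(AscW(Mid$(s, i, 1)))
    Next i

    TailCodes = result
End Function

' Prints the codes of the last three characters of the first cell.
' Assumption: ActiveDocument has a table with a cell at (1, 1).
Sub ShowCellEndCodes()
    Debug.Print TailCodes(ActiveDocument.Tables(1).Cell(1, 1).Range.Text, 3)
End Sub
```

The output appears in the Immediate window of the VBA editor. A 13 followed by a 7 at the end is the marker described above.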

It's not possible to remove this character combination from a table cell - Word requires it. If you could remove it, the table would break, so Word simply prevents you from deleting it.

If you need the content of a table cell as a text string, check the character codes of the last two characters and remove the marker before you use the string. You need to check both characters because Microsoft changed the way text is returned from a cell a few versions back: sometimes you get only one of the characters, sometimes both, depending on how you pick up the information and which version of Word is involved.
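
One way to apply this advice to the original trimming problem is sketched below. It is not a definitive implementation: StripCellMarker is a hypothetical helper name, ActiveDocument is used on the assumption that the macro should act on the document currently being edited, and RTrim$ removes trailing spaces only (not tabs or extra paragraph marks).

```vba
Option Explicit

' Hypothetical helper: returns s without a trailing end-of-cell marker.
' Handles Chr(13) & Chr(7) as well as either character on its own.
Function StripCellMarker(ByVal s As String) As String
    ' Drop a trailing Chr(7) if present.
    If Len(s) > 0 Then
        If Right$(s, 1) = Chr$(7) Then s = Left$(s, Len(s) - 1)
    End If
    ' Drop one trailing Chr(13) if present.
    If Len(s) > 0 Then
        If Right$(s, 1) = vbCr Then s = Left$(s, Len(s) - 1)
    End If
    StripCellMarker = s
End Function

' Trims trailing spaces from every cell of every top-level table.
' Assumption: operates on ActiveDocument; nested tables are not visited.
Sub TrimCellSpacesWithoutMarker()
    Dim tbl As Table
    Dim c As Cell
    Dim original As String
    Dim trimmed As String

    For Each tbl In ActiveDocument.Tables
        For Each c In tbl.Range.Cells
            original = StripCellMarker(c.Range.Text)
            trimmed = RTrim$(original)
            ' Write back only when something changed, so that
            ' untouched cells keep their content as it is.
            If trimmed <> original Then c.Range.Text = trimmed
        Next c
    Next tbl
End Sub
```

Because the marker is stripped before the text is written back, the assigned string no longer contains a Chr(13) for Word to turn into an extra paragraph. Note that assigning Range.Text replaces the cell content as plain text, so mixed character formatting inside a cell may be lost. Try it on a copy of the document first.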