How to fix orphaned punctuation in iText

Posted 2020-05-03 10:15

In "How to fix iText's text wrapping for chinese characters", another user reported a problem similar to the one we're facing. A response by https://stackoverflow.com/users/1622493/bruno-lowagie indicated that DefaultSplitCharacter has taken Chinese characters into account since iText 5. We're using iText 5.5.6 but still see the problem.

As near as I can tell, DefaultSplitCharacter is working correctly, but the problem appears to be that the ColumnText class allows lines to begin with these punctuation marks.

Here's a screenshot of the PdfChunks in the BidiLine class being used to render the text.

However, in the written result the 3rd and 5th lines both begin with punctuation characters, as shown in this image of the PDF output.

I could simply add line breaks in the proper places to make it look correct, but if the text is ever re-translated internally, that fix may no longer work. Does anyone know how to ensure that iText won't begin a line with these punctuation characters?

Tags: itext
2 answers
神经病院院长
Reply #2 · 2020-05-03 11:02

I'm using iTextSharp. I wrote an ISplitCharacter implementation based on k.f.'s sample.

using System;
using System.Globalization;
using iTextSharp.text;
using iTextSharp.text.pdf;

public class CJKSplitCharacter : ISplitCharacter
{
    public static ISplitCharacter Default = new CJKSplitCharacter();
    private static ISplitCharacter defaultSplit = new DefaultSplitCharacter();

    public bool IsSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck)
    {
        char charCurrent = GetChar(current, cc, ck);
        int next = current + 1;
        if (next < cc.Length)
        {
            char charNext = GetChar(next, cc, ck);
            // if next char is close char, do not break here
            if (IsCloseChar(charNext))
            {
                return false;
            }
            // otherwise, if current char is close char, mark as breakable
            else if (IsCloseChar(charCurrent))
            {
                return true;
            }
        }
        // if current char is open char, do not break here
        if (IsOpenChar(charCurrent))
        {
            return false;
        }

        // default:
        // split every CJK character

        if (Char.GetUnicodeCategory(charCurrent) == UnicodeCategory.OtherLetter) // CJK Letters
        {
            return true;
        }
        else
        {
            return defaultSplit.IsSplitCharacter(start, current, end, cc, ck);
        }
    }
    private char GetChar(int position, char[] cc, PdfChunk[] ck)
    {
        char c;
        if (ck == null || ck[Math.Min(position, ck.Length - 1)] == null)
        {
            c = cc[position];
        }
        else
        {
            c = (char)ck[Math.Min(position, ck.Length - 1)].GetUnicodeEquivalent(cc[position]);
        }
        return c;
    }

    private bool IsCloseChar(char c)
    {
        UnicodeCategory cat = Char.GetUnicodeCategory(c);
        return (cat == UnicodeCategory.ClosePunctuation         //right bracket/brace, eg: )]
            || cat == UnicodeCategory.FinalQuotePunctuation     //right quote, eg: ”
            || cat == UnicodeCategory.OtherPunctuation          //other punctuation, eg: ,。
            );
    }
    private bool IsOpenChar(char c)
    {
        UnicodeCategory cat = Char.GetUnicodeCategory(c);
        return (cat == UnicodeCategory.OpenPunctuation          //left bracket/brace, eg: ([
            || cat == UnicodeCategory.InitialQuotePunctuation   //left quote, eg: “
            );
    }
}
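
For readers on the Java side of iText, the same category tests can be sketched with java.lang.Character.getType. This is only an illustration of the classification step: the class name PunctuationCategories and its method names are hypothetical, and the code does not touch iText at all.

```java
// Hypothetical helper, not part of iText: mirrors the category checks of the
// C# class above using java.lang.Character.getType.
final class PunctuationCategories {

    private PunctuationCategories() { }

    /** True for closing brackets, closing quotes and other punctuation (Pe, Pf, Po). */
    static boolean isCloseChar(char c) {
        int type = Character.getType(c);
        return type == Character.END_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION
                || type == Character.OTHER_PUNCTUATION;
    }

    /** True for opening brackets and opening quotes (Ps, Pi). */
    static boolean isOpenChar(char c) {
        int type = Character.getType(c);
        return type == Character.START_PUNCTUATION
                || type == Character.INITIAL_QUOTE_PUNCTUATION;
    }

    /** True for caseless letters (Lo), which covers CJK ideographs and most kana. */
    static boolean isOtherLetter(char c) {
        return Character.getType(c) == Character.OTHER_LETTER;
    }

    public static void main(String[] args) {
        // Ideographic full stop, right/left corner bracket, left double quote, a Han character, 'a'
        char[] samples = {'\u3002', '\u300d', '\u300c', '\u201c', '\u4e2d', 'a'};
        for (char c : samples) {
            System.out.printf("U+%04X close=%b open=%b otherLetter=%b%n",
                    (int) c, isCloseChar(c), isOpenChar(c), isOtherLetter(c));
        }
    }
}
```

One thing to keep in mind with this approach: the "other punctuation" category is broad and also contains ASCII characters such as /, # and &, so a category-based rule treats those as closing characters too.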
神经病院院长
Reply #3 · 2020-05-03 11:05

To break lines correctly in Asian languages, you need to write your own implementation of SplitCharacter. A good reference for line breaking is Unicode® Standard Annex #14, "Unicode Line Breaking Algorithm". Another one is https://msdn.microsoft.com/en-us/library/cc194864.aspx.
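
To get a feel for where break opportunities fall before writing a custom rule set, one option is to print what the JDK's own line-break iterator reports for a sample string. This is a minimal sketch using java.text.BreakIterator; the class name LineBreakOpportunities is hypothetical, and the JDK's rules are not guaranteed to match Annex #14 or iText's behavior exactly, so treat the output as a comparison point rather than a specification.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for inspecting break opportunities; independent of iText.
final class LineBreakOpportunities {

    private LineBreakOpportunities() { }

    /** Returns every boundary offset reported by the JDK's line-break iterator. */
    static List<Integer> breakOffsets(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        // first() is always 0; the last boundary is always text.length().
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Japanese sample with corner brackets and an ideographic full stop
        String text = "\u3053\u308c\u306f\u300c\u30c6\u30b9\u30c8\u300d\u3067\u3059\u3002";
        System.out.println(breakOffsets(text, Locale.JAPANESE));
    }
}
```

An offset n in the returned list means a line may end just before the character at index n.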

Having suffered through implementing this for Japanese, I'm posting the example code I wrote for Japanese text mixed with English text. It could be adapted for Chinese fairly easily using the references above.

Here is a snippet showing JapaneseSplitCharacter in use:

  Chunk chunk = new Chunk(<asian text>,<asian font>);
  chunk.setSplitCharacter(JapaneseSplitCharacter.SplitCharacter);
  Paragraph paragraph = new Paragraph(chunk);  

Here is the code for JapaneseSplitCharacter:

import com.itextpdf.text.SplitCharacter;
import com.itextpdf.text.pdf.DefaultSplitCharacter;
import com.itextpdf.text.pdf.PdfChunk;

/**
 * <p/>
 * For basic latin characters spaces, periods, commas, etc. are split characters. For Japanese characters lines can break
 * anywhere, unless prohibited. This class uses logic for Japanese, non-starting and non-ending characters based on the
 * kinsoku rule and uses the DefaultSplitCharacter class for basic latin characters while writing free flowing text to a PDF.
 * <p/>
 */

public class JapaneseSplitCharacter implements SplitCharacter {

  // line of text cannot start or end with this character
  static final char u2060 = '\u2060';   //       - WORD JOINER (zero-width, prohibits a break on either side)

  // a line of text cannot start with any following characters in NOT_BEGIN_CHARACTERS[]
  static final char u30fb = '\u30fb';   //  ・   - KATAKANA MIDDLE DOT
  static final char u2022 = '\u2022';   //  •    - BLACK SMALL CIRCLE (BULLET)
  static final char uff65 = '\uff65';   //  ･    - HALFWIDTH KATAKANA MIDDLE DOT
  static final char u300d = '\u300d';   //  」   - RIGHT CORNER BRACKET
  static final char uff09 = '\uff09';   //  ）   - FULLWIDTH RIGHT PARENTHESIS
  static final char u0021 = '\u0021';   //  !    - EXCLAMATION MARK
  static final char u0025 = '\u0025';   //  %    - PERCENT SIGN
  static final char u0029 = '\u0029';   //  )    - RIGHT PARENTHESIS
  static final char u002c = '\u002c';   //  ,    - COMMA
  static final char u002e = '\u002e';   //  .    - FULL STOP
  static final char u003f = '\u003f';   //  ?    - QUESTION MARK
  static final char u005d = '\u005d';   //  ]    - RIGHT SQUARE BRACKET
  static final char u007d = '\u007d';   //  }    - RIGHT CURLY BRACKET
  static final char uff61 = '\uff61';   //  ｡    - HALFWIDTH IDEOGRAPHIC FULL STOP
  static final char uff63 = '\uff63';   //  ｣    - HALFWIDTH RIGHT CORNER BRACKET
  static final char uff64 = '\uff64';   //  ､    - HALFWIDTH IDEOGRAPHIC COMMA
  static final char uff67 = '\uff67';   //  ｧ    - HALFWIDTH KATAKANA LETTER SMALL A
  static final char uff68 = '\uff68';   //  ｨ    - HALFWIDTH KATAKANA LETTER SMALL I
  static final char uff69 = '\uff69';   //  ｩ    - HALFWIDTH KATAKANA LETTER SMALL U
  static final char uff6a = '\uff6a';   //  ｪ    - HALFWIDTH KATAKANA LETTER SMALL E
  static final char uff6b = '\uff6b';   //  ｫ    - HALFWIDTH KATAKANA LETTER SMALL O
  static final char uff6c = '\uff6c';   //  ｬ    - HALFWIDTH KATAKANA LETTER SMALL YA
  static final char uff6d = '\uff6d';   //  ｭ    - HALFWIDTH KATAKANA LETTER SMALL YU
  static final char uff6e = '\uff6e';   //  ｮ    - HALFWIDTH KATAKANA LETTER SMALL YO
  static final char uff6f = '\uff6f';   //  ｯ    - HALFWIDTH KATAKANA LETTER SMALL TU
  static final char uff70 = '\uff70';   //  ｰ    - HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
  static final char uff9e = '\uff9e';   //  ﾞ    - HALFWIDTH KATAKANA VOICED SOUND MARK
  static final char uff9f = '\uff9f';   //  ﾟ    - HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
  static final char u3001 = '\u3001';   //  、    - IDEOGRAPHIC COMMA
  static final char u3002 = '\u3002';   //  。    - IDEOGRAPHIC FULL STOP
  static final char uff0c = '\uff0c';   //  ，    - FULLWIDTH COMMA
  static final char uff0e = '\uff0e';   //  ．    - FULLWIDTH FULL STOP
  static final char uff1a = '\uff1a';   //  ：    - FULLWIDTH COLON
  static final char uff1b = '\uff1b';   //  ；    - FULLWIDTH SEMICOLON
  static final char uff1f = '\uff1f';   //  ？    - FULLWIDTH QUESTION MARK
  static final char uff01 = '\uff01';   //  ！    - FULLWIDTH EXCLAMATION MARK
  static final char u309b = '\u309b';   //  ゛    - KATAKANA-HIRAGANA VOICED SOUND MARK
  static final char u309c = '\u309c';   //  ゜    - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
  static final char u30fd = '\u30fd';   //  ヽ    - KATAKANA ITERATION MARK
  static final char u30fe = '\u30fe';   //  ヾ    - KATAKANA VOICED ITERATION MARK
  static final char u309d = '\u309d';   //  ゝ    - HIRAGANA ITERATION MARK
  static final char u309e = '\u309e';   //  ゞ    - HIRAGANA VOICED ITERATION MARK
  static final char u3005 = '\u3005';   //  々    - IDEOGRAPHIC ITERATION MARK
  static final char u30fc = '\u30fc';   //  ー    - KATAKANA-HIRAGANA PROLONGED SOUND MARK
  static final char u2019 = '\u2019';   //  ’    - RIGHT SINGLE QUOTATION MARK
  static final char u201d = '\u201d';   //  ”    - RIGHT DOUBLE QUOTATION MARK
  static final char u3015 = '\u3015';   //  〕    - RIGHT TORTOISE SHELL BRACKET
  static final char uff3d = '\uff3d';   //  ］    - FULLWIDTH RIGHT SQUARE BRACKET
  static final char uff5d = '\uff5d';   //  ｝    - FULLWIDTH RIGHT CURLY BRACKET
  static final char u3009 = '\u3009';   //  〉    - RIGHT ANGLE BRACKET
  static final char u300b = '\u300b';   //  》    - RIGHT DOUBLE ANGLE BRACKET
  static final char u300f = '\u300f';   //  』    - RIGHT WHITE CORNER BRACKET
  static final char u3011 = '\u3011';   //  】    - RIGHT BLACK LENTICULAR BRACKET
  static final char u00b0 = '\u00b0';   //  °    - DEGREE SIGN
  static final char u2032 = '\u2032';   //  ′    - PRIME
  static final char u2033 = '\u2033';   //  ″    - DOUBLE PRIME
  static final char u2103 = '\u2103';   //  ℃    - DEGREE CELSIUS
  static final char u00a2 = '\u00a2';   //  ¢    - CENT SIGN
  static final char uff05 = '\uff05';   //  ％    - FULLWIDTH PERCENT SIGN
  static final char u2030 = '\u2030';   //  ‰    - PER MILLE SIGN
  static final char u3041 = '\u3041';   //  ぁ    - HIRAGANA LETTER SMALL A
  static final char u3043 = '\u3043';   //  ぃ    - HIRAGANA LETTER SMALL I
  static final char u3045 = '\u3045';   //  ぅ    - HIRAGANA LETTER SMALL U
  static final char u3047 = '\u3047';   //  ぇ    - HIRAGANA LETTER SMALL E
  static final char u3049 = '\u3049';   //  ぉ    - HIRAGANA LETTER SMALL O
  static final char u3063 = '\u3063';   //  っ    - HIRAGANA LETTER SMALL TU
  static final char u3083 = '\u3083';   //  ゃ    - HIRAGANA LETTER SMALL YA
  static final char u3085 = '\u3085';   //  ゅ    - HIRAGANA LETTER SMALL YU
  static final char u3087 = '\u3087';   //  ょ    - HIRAGANA LETTER SMALL YO
  static final char u308e = '\u308e';   //  ゎ    - HIRAGANA LETTER SMALL WA
  static final char u30a1 = '\u30a1';   //  ァ    - KATAKANA LETTER SMALL A
  static final char u30a3 = '\u30a3';   //  ィ    - KATAKANA LETTER SMALL I
  static final char u30a5 = '\u30a5';   //  ゥ    - KATAKANA LETTER SMALL U
  static final char u30a7 = '\u30a7';   //  ェ    - KATAKANA LETTER SMALL E
  static final char u30a9 = '\u30a9';   //  ォ    - KATAKANA LETTER SMALL O
  static final char u30c3 = '\u30c3';   //  ッ    - KATAKANA LETTER SMALL TU
  static final char u30e3 = '\u30e3';   //  ャ    - KATAKANA LETTER SMALL YA
  static final char u30e5 = '\u30e5';   //  ュ    - KATAKANA LETTER SMALL YU
  static final char u30e7 = '\u30e7';   //  ョ    - KATAKANA LETTER SMALL YO
  static final char u30ee = '\u30ee';   //  ヮ    - KATAKANA LETTER SMALL WA
  static final char u30f5 = '\u30f5';   //  ヵ    - KATAKANA LETTER SMALL KA
  static final char u30f6 = '\u30f6';   //  ヶ    - KATAKANA LETTER SMALL KE

  static final char[] NOT_BEGIN_CHARACTERS = new char[]{u30fb, u2022, uff65, u300d, uff09, u0021, u0025, u0029, u002c,
          u002e, u003f, u005d, u007d, uff61, uff63, uff64, uff67, uff68, uff69, uff6a, uff6b, uff6c, uff6d, uff6e,
          uff6f, uff70, uff9e, uff9f, u3001, u3002, uff0c, uff0e, uff1a, uff1b, uff1f, uff01, u309b, u309c, u30fd,
          u30fe, u309d, u309e, u3005, u30fc, u2019, u201d, u3015, uff3d, uff5d, u3009, u300b, u300f, u3011, u00b0,
          u2032, u2033, u2103, u00a2, uff05, u2030, u3041, u3043, u3045, u3047, u3049, u3063, u3083, u3085, u3087,
          u308e, u30a1, u30a3, u30a5, u30a7, u30a9, u30c3, u30e3, u30e5, u30e7, u30ee, u30f5, u30f6, u2060};

  // a line of text cannot end with any following characters in NOT_ENDING_CHARACTERS[]
  static final char u0024 = '\u0024';   //  $   - DOLLAR SIGN
  static final char u0028 = '\u0028';   //  (   - LEFT PARENTHESIS
  static final char u005b = '\u005b';   //  [   - LEFT SQUARE BRACKET
  static final char u007b = '\u007b';   //  {   - LEFT CURLY BRACKET
  static final char u00a3 = '\u00a3';   //  £   - POUND SIGN
  static final char u00a5 = '\u00a5';   //  ¥   - YEN SIGN
  static final char u201c = '\u201c';   //  “   - LEFT DOUBLE QUOTATION MARK
  static final char u2018 = '\u2018';   //   ‘  - LEFT SINGLE QUOTATION MARK
  static final char u300a = '\u300a';   //  《  - LEFT DOUBLE ANGLE BRACKET
  static final char u3008 = '\u3008';   //  〈  - LEFT ANGLE BRACKET
  static final char u300c = '\u300c';   //  「  - LEFT CORNER BRACKET
  static final char u300e = '\u300e';   //  『  - LEFT WHITE CORNER BRACKET
  static final char u3010 = '\u3010';   //  【  - LEFT BLACK LENTICULAR BRACKET
  static final char u3014 = '\u3014';   //  〔  - LEFT TORTOISE SHELL BRACKET
  static final char uff62 = '\uff62';   //  ｢   - HALFWIDTH LEFT CORNER BRACKET
  static final char uff08 = '\uff08';   //  （  - FULLWIDTH LEFT PARENTHESIS
  static final char uff3b = '\uff3b';   //  ［  - FULLWIDTH LEFT SQUARE BRACKET
  static final char uff5b = '\uff5b';   //  ｛  - FULLWIDTH LEFT CURLY BRACKET
  static final char uffe5 = '\uffe5';   //  ￥  - FULLWIDTH YEN SIGN
  static final char uff04 = '\uff04';   //  ＄  - FULLWIDTH DOLLAR SIGN

  static final char[] NOT_ENDING_CHARACTERS = new char[]{u0024, u0028, u005b, u007b, u00a3, u00a5, u201c, u2018, u3008,
          u300a, u300c, u300e, u3010, u3014, uff62, uff08, uff3b, uff5b, uffe5, uff04, u2060};

  /**
   * A shared instance of JapaneseSplitCharacter.
   */
  public static final JapaneseSplitCharacter SplitCharacter = new JapaneseSplitCharacter();

  /**
   * An instance of DefaultSplitCharacter used for Basic Latin characters.
   */
  private static final SplitCharacter defaultSplitCharacter = new DefaultSplitCharacter();

  public JapaneseSplitCharacter() { }

  /**
   * Custom SplitCharacter method that handles Japanese characters.
   * Returns <CODE>true</CODE> if the character can split a line. The splitting implementation
   * is free to look ahead or look behind characters to make a decision.
   *
   * @param start   the lower limit of <CODE>cc</CODE> inclusive
   * @param current the pointer to the character in <CODE>cc</CODE>
   * @param end     the upper limit of <CODE>cc</CODE> exclusive
   * @param cc      an array of characters at least <CODE>end</CODE> sized
   * @param ck      an array of <CODE>PdfChunk</CODE>. The main use is to be able to call
   *                {@link PdfChunk#getUnicodeEquivalent(int)}. It may be <CODE>null</CODE>
   *                or shorter than <CODE>end</CODE>. If <CODE>null</CODE> no conversion takes place.
   *                If shorter than <CODE>end</CODE> the last element is used
   * @return <CODE>true</CODE> if the character(s) can split a line
   */
  public boolean isSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck) {

    // Note: without this try/catch, an exception thrown inside isSplitCharacter() makes iText fail silently,
    // and you have no idea there was a problem.
    try {

      char charCurrent = getCharacter(current, cc, ck);

      int next = current + 1;
      if (next < cc.length) {
        char charNext = getCharacter(next, cc, ck);
        for (char not_begin_character : NOT_BEGIN_CHARACTERS) {
          if (charNext == not_begin_character) {
            return false;
          }
        }
      }

      for (char not_ending_character : NOT_ENDING_CHARACTERS) {
        if (charCurrent == not_ending_character) {
          return false;
        }
      }

      boolean isBasicLatin = Character.UnicodeBlock.of(charCurrent) == Character.UnicodeBlock.BASIC_LATIN;
      if (isBasicLatin) {
        return defaultSplitCharacter.isSplitCharacter(start, current, end, cc, ck);
      }

      return true;

    } catch (Exception ex) {
      ex.printStackTrace();
    }

    return true;
  }

  /**
   * Returns a character in the array. (Note: modified from the iText default version by adding the null
   * check '|| ck[Math.min(position, ck.length - 1)] == null'.)
   *
   * @param position position in the array
   * @param ck       chunk array
   * @param cc       the character array that has to be checked
   * @return the character
   */
  protected char getCharacter(int position, char[] cc, PdfChunk[] ck) {
    if (ck == null || ck[Math.min(position, ck.length - 1)] == null) {
      return cc[position];
    }
    return (char) ck[Math.min(position, ck.length - 1)].getUnicodeEquivalent(cc[position]);
  }

}
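
To see the effect of the two prohibition tables without setting up iText, the look-ahead rule can be exercised on its own with a toy greedy wrapper. This is a minimal sketch under stated assumptions: KinsokuWrapDemo, canBreakAfter and wrap are hypothetical names, the tables are cut down to a handful of characters, and line width is counted in characters rather than measured with font metrics as iText does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, iText-free demo of the "not begin / not end" rule.
final class KinsokuWrapDemo {

    // Deliberately tiny subsets of the tables above, for illustration only.
    private static final String NOT_BEGIN = "\u3001\u3002\u300d\uff09"; // ideographic comma, full stop, right corner bracket, fullwidth right parenthesis
    private static final String NOT_END = "\u300c\uff08";               // left corner bracket, fullwidth left parenthesis

    private KinsokuWrapDemo() { }

    /** True if a line may end right after text.charAt(i). */
    static boolean canBreakAfter(String text, int i) {
        if (i + 1 < text.length() && NOT_BEGIN.indexOf(text.charAt(i + 1)) >= 0) {
            return false; // the next character must not start a line
        }
        return NOT_END.indexOf(text.charAt(i)) < 0; // the current character must not end a line
    }

    /** Greedy wrap: at most maxChars characters per line, breaking only where allowed. */
    static List<String> wrap(String text, int maxChars) {
        List<String> lines = new ArrayList<>();
        int lineStart = 0;
        while (lineStart < text.length()) {
            int limit = Math.min(lineStart + maxChars, text.length());
            int breakAt = limit; // exclusive end index of this line
            if (limit < text.length()) {
                // Walk back from the width limit to the nearest legal break.
                int candidate = limit;
                while (candidate > lineStart && !canBreakAfter(text, candidate - 1)) {
                    candidate--;
                }
                if (candidate > lineStart) {
                    breakAt = candidate;
                } // otherwise no legal break exists, so fall back to a hard break at the limit
            }
            lines.add(text.substring(lineStart, breakAt));
            lineStart = breakAt;
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "\u3053\u308c\u306f\u300c\u30c6\u30b9\u30c8\u300d\u3067\u3059\u3002";
        for (String line : wrap(text, 4)) {
            System.out.println(line);
        }
    }
}
```

With the sample string and a width of 4, the first line stops before the opening corner bracket instead of ending on it, and no later line starts with the closing bracket or the full stop. That is the behavior the full tables are meant to give inside iText.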

Hope this helps.
