I saw in "How to fix iText's text wrapping for chinese characters" that another user had a problem similar to the one we're facing. A response by Bruno Lowagie (https://stackoverflow.com/users/1622493/bruno-lowagie) indicated that DefaultSplitCharacter has taken Chinese characters into account since iText 5. We're using iText 5.5.6 but still see the problem.
As near as I can tell, DefaultSplitCharacter is working correctly, but the problem appears to be that the ColumnText class allows lines to begin with these punctuation marks.
Here's a screenshot of the PdfChunks in the BidiLine class being used to render the text.
However, the output is written so that the 3rd and 5th lines both begin with punctuation characters, as shown in this image of the PDF output.
I could simply insert line breaks in the proper places to make it look correct, but if the text is ever re-translated internally, my fix may no longer work. Does anyone know how to ensure that iText won't begin a line with these punctuation characters?
For breaking lines in Asian languages you need to write your own implementation of SplitCharacter. A good reference for line breaking is Unicode® Standard Annex #14, "Unicode Line Breaking Algorithm". Another one is https://msdn.microsoft.com/en-us/library/cc194864.aspx.
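To make the rule concrete before the full class, here is a minimal sketch of the two prohibitions a custom split character has to enforce: no break before a character that may not start a line, and no break after a character that may not end one. It uses only the JDK, so it can be tried without iText. The class name KinsokuSketch, its method names, and the two short character lists are illustrative assumptions, not part of iText, and the lists are far from complete. The sketch also assumes purely CJK text where every other position is breakable; Latin text needs separate handling.

```java
class KinsokuSketch {
    // Illustrative subsets only (assumption): ideographic comma and full stop,
    // fullwidth comma, full stop, exclamation mark, question mark,
    // fullwidth right parenthesis, and three closing brackets.
    static final String NO_LINE_START =
            "\u3001\u3002\uff0c\uff0e\uff01\uff1f\uff09\u300d\u300f\u3011";
    // Fullwidth left parenthesis and three opening brackets.
    static final String NO_LINE_END = "\uff08\u300c\u300e\u3010";

    /** Returns true if a line may be broken between text[i] and text[i + 1]. */
    static boolean canBreakAfter(String text, int i) {
        if (i < 0 || i + 1 >= text.length()) {
            return false; // no following character, so there is nothing to decide
        }
        char current = text.charAt(i);
        char next = text.charAt(i + 1);
        if (NO_LINE_START.indexOf(next) >= 0) {
            return false; // the next line would start with a prohibited character
        }
        if (NO_LINE_END.indexOf(current) >= 0) {
            return false; // this line would end with a prohibited character
        }
        return true;
    }

    /** Inserts '|' at every permitted break so the result can be checked by eye. */
    static String markBreaks(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append(text.charAt(i));
            if (canBreakAfter(text, i)) {
                sb.append('|');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Left corner bracket, three kanji, right corner bracket, two hiragana, full stop.
        System.out.println(markBreaks("\u300c\u65e5\u672c\u8a9e\u300d\u3067\u3059\u3002"));
    }
}
```

For the sample string 「日本語」です。 the marked result is 「日|本|語」|で|す。, so no break is offered right after the opening bracket or right before the closing bracket and the full stop.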
Having suffered through implementing this for Japanese, I'm posting the example code I wrote for Japanese text mixed with English text. This code could be modified for Chinese fairly easily using the references above.
Here is a snippet showing JapaneseSplitCharacter in use:
Chunk chunk = new Chunk(<asian text>,<asian font>);
chunk.setSplitCharacter(JapaneseSplitCharacter.SplitCharacter);
Paragraph paragraph = new Paragraph(chunk);
Here is the code for JapaneseSplitCharacter:
import com.itextpdf.text.SplitCharacter;
import com.itextpdf.text.pdf.DefaultSplitCharacter;
import com.itextpdf.text.pdf.PdfChunk;
/**
* For Basic Latin text, spaces, periods, commas, etc. are split characters. For Japanese text, lines can break
* anywhere unless prohibited. This class applies the kinsoku rules for Japanese non-starting and non-ending
* characters and delegates to DefaultSplitCharacter for Basic Latin characters when writing free-flowing text to a PDF.
*/
public class JapaneseSplitCharacter implements SplitCharacter {
// a line of text cannot start or end with this character
static final char u2060 = '\u2060'; // WORD JOINER (zero-width, prohibits a break on either side)
// a line of text cannot start with any of the following characters (collected in NOT_BEGIN_CHARACTERS)
static final char u30fb = '\u30fb'; // ・ - KATAKANA MIDDLE DOT
static final char u2022 = '\u2022'; // • - BLACK SMALL CIRCLE (BULLET)
static final char uff65 = '\uff65'; // ・ - HALFWIDTH KATAKANA MIDDLE DOT
static final char u300d = '\u300d'; // 」 - RIGHT CORNER BRACKET
static final char uff09 = '\uff09'; // ) - FULLWIDTH RIGHT PARENTHESIS
static final char u0021 = '\u0021'; // ! - EXCLAMATION MARK
static final char u0025 = '\u0025'; // % - PERCENT SIGN
static final char u0029 = '\u0029'; // ) - RIGHT PARENTHESIS
static final char u002c = '\u002c'; // , - COMMA
static final char u002e = '\u002e'; // . - FULL STOP
static final char u003f = '\u003f'; // ? - QUESTION MARK
static final char u005d = '\u005d'; // ] - RIGHT SQUARE BRACKET
static final char u007d = '\u007d'; // } - RIGHT CURLY BRACKET
static final char uff61 = '\uff61'; // 。 - HALFWIDTH IDEOGRAPHIC FULL STOP
static final char uff63 = '\uff63'; // 」 - HALFWIDTH RIGHT CORNER BRACKET
static final char uff64 = '\uff64'; // 、 - HALFWIDTH IDEOGRAPHIC COMMA
static final char uff67 = '\uff67'; // ァ - HALFWIDTH KATAKANA LETTER SMALL A
static final char uff68 = '\uff68'; // ィ - HALFWIDTH KATAKANA LETTER SMALL I
static final char uff69 = '\uff69'; // ゥ - HALFWIDTH KATAKANA LETTER SMALL U
static final char uff6a = '\uff6a'; // ェ - HALFWIDTH KATAKANA LETTER SMALL E
static final char uff6b = '\uff6b'; // ォ - HALFWIDTH KATAKANA LETTER SMALL O
static final char uff6c = '\uff6c'; // ャ - HALFWIDTH KATAKANA LETTER SMALL YA
static final char uff6d = '\uff6d'; // ュ - HALFWIDTH KATAKANA LETTER SMALL YU
static final char uff6e = '\uff6e'; // ョ - HALFWIDTH KATAKANA LETTER SMALL YO
static final char uff6f = '\uff6f'; // ッ - HALFWIDTH KATAKANA LETTER SMALL TU
static final char uff70 = '\uff70'; // ー - HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
static final char uff9e = '\uff9e'; // ゙ - HALFWIDTH KATAKANA VOICED SOUND MARK
static final char uff9f = '\uff9f'; // ゚ - HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
static final char u3001 = '\u3001'; // 、 - IDEOGRAPHIC COMMA
static final char u3002 = '\u3002'; // 。 - IDEOGRAPHIC FULL STOP
static final char uff0c = '\uff0c'; // , - FULLWIDTH COMMA
static final char uff0e = '\uff0e'; // . - FULLWIDTH FULL STOP
static final char uff1a = '\uff1a'; // : - FULLWIDTH COLON
static final char uff1b = '\uff1b'; // ; - FULLWIDTH SEMICOLON
static final char uff1f = '\uff1f'; // ? - FULLWIDTH QUESTION MARK
static final char uff01 = '\uff01'; // ! - FULLWIDTH EXCLAMATION MARK
static final char u309b = '\u309b'; // ゛ - KATAKANA-HIRAGANA VOICED SOUND MARK
static final char u309c = '\u309c'; // ゜ - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
static final char u30fd = '\u30fd'; // ヽ - KATAKANA ITERATION MARK
static final char u30fe = '\u30fe'; // ヾ - KATAKANA VOICED ITERATION MARK
static final char u309d = '\u309d'; // ゝ - HIRAGANA ITERATION MARK
static final char u309e = '\u309e'; // ゞ - HIRAGANA VOICED ITERATION MARK
static final char u3005 = '\u3005'; // 々 - IDEOGRAPHIC ITERATION MARK
static final char u30fc = '\u30fc'; // ー - KATAKANA-HIRAGANA PROLONGED SOUND MARK
static final char u2019 = '\u2019'; // ’ - RIGHT SINGLE QUOTATION MARK
static final char u201d = '\u201d'; // ” - RIGHT DOUBLE QUOTATION MARK
static final char u3015 = '\u3015'; // 〕 - RIGHT TORTOISE SHELL BRACKET
static final char uff3d = '\uff3d'; // ] - FULLWIDTH RIGHT SQUARE BRACKET
static final char uff5d = '\uff5d'; // } - FULLWIDTH RIGHT CURLY BRACKET
static final char u3009 = '\u3009'; // 〉 - RIGHT ANGLE BRACKET
static final char u300b = '\u300b'; // 》 - RIGHT DOUBLE ANGLE BRACKET
static final char u300f = '\u300f'; // 』 - RIGHT WHITE CORNER BRACKET
static final char u3011 = '\u3011'; // 】 - RIGHT BLACK LENTICULAR BRACKET
static final char u00b0 = '\u00b0'; // ° - DEGREE SIGN
static final char u2032 = '\u2032'; // ′ - PRIME
static final char u2033 = '\u2033'; // ″ - DOUBLE PRIME
static final char u2103 = '\u2103'; // ℃ - DEGREE CELSIUS
static final char u00a2 = '\u00a2'; // ¢ - CENT SIGN
static final char uff05 = '\uff05'; // % - FULLWIDTH PERCENT SIGN
static final char u2030 = '\u2030'; // ‰ - PER MILLE SIGN
static final char u3041 = '\u3041'; // ぁ - HIRAGANA LETTER SMALL A
static final char u3043 = '\u3043'; // ぃ - HIRAGANA LETTER SMALL I
static final char u3045 = '\u3045'; // ぅ - HIRAGANA LETTER SMALL U
static final char u3047 = '\u3047'; // ぇ - HIRAGANA LETTER SMALL E
static final char u3049 = '\u3049'; // ぉ - HIRAGANA LETTER SMALL O
static final char u3063 = '\u3063'; // っ - HIRAGANA LETTER SMALL TU
static final char u3083 = '\u3083'; // ゃ - HIRAGANA LETTER SMALL YA
static final char u3085 = '\u3085'; // ゅ - HIRAGANA LETTER SMALL YU
static final char u3087 = '\u3087'; // ょ - HIRAGANA LETTER SMALL YO
static final char u308e = '\u308e'; // ゎ - HIRAGANA LETTER SMALL WA
static final char u30a1 = '\u30a1'; // ァ - KATAKANA LETTER SMALL A
static final char u30a3 = '\u30a3'; // ィ - KATAKANA LETTER SMALL I
static final char u30a5 = '\u30a5'; // ゥ - KATAKANA LETTER SMALL U
static final char u30a7 = '\u30a7'; // ェ - KATAKANA LETTER SMALL E
static final char u30a9 = '\u30a9'; // ォ - KATAKANA LETTER SMALL O
static final char u30c3 = '\u30c3'; // ッ - KATAKANA LETTER SMALL TU
static final char u30e3 = '\u30e3'; // ャ - KATAKANA LETTER SMALL YA
static final char u30e5 = '\u30e5'; // ュ - KATAKANA LETTER SMALL YU
static final char u30e7 = '\u30e7'; // ョ - KATAKANA LETTER SMALL YO
static final char u30ee = '\u30ee'; // ヮ - KATAKANA LETTER SMALL WA
static final char u30f5 = '\u30f5'; // ヵ - KATAKANA LETTER SMALL KA
static final char u30f6 = '\u30f6'; // ヶ - KATAKANA LETTER SMALL KE
static final char[] NOT_BEGIN_CHARACTERS = new char[]{u30fb, u2022, uff65, u300d, uff09, u0021, u0025, u0029, u002c,
u002e, u003f, u005d, u007d, uff61, uff63, uff64, uff67, uff68, uff69, uff6a, uff6b, uff6c, uff6d, uff6e,
uff6f, uff70, uff9e, uff9f, u3001, u3002, uff0c, uff0e, uff1a, uff1b, uff1f, uff01, u309b, u309c, u30fd,
u30fe, u309d, u309e, u3005, u30fc, u2019, u201d, u3015, uff3d, uff5d, u3009, u300b, u300f, u3011, u00b0,
u2032, u2033, u2103, u00a2, uff05, u2030, u3041, u3043, u3045, u3047, u3049, u3063, u3083, u3085, u3087,
u308e, u30a1, u30a3, u30a5, u30a7, u30a9, u30c3, u30e3, u30e5, u30e7, u30ee, u30f5, u30f6, u2060};
// a line of text cannot end with any of the following characters (collected in NOT_ENDING_CHARACTERS)
static final char u0024 = '\u0024'; // $ - DOLLAR SIGN
static final char u0028 = '\u0028'; // ( - LEFT PARENTHESIS
static final char u005b = '\u005b'; // [ - LEFT SQUARE BRACKET
static final char u007b = '\u007b'; // { - LEFT CURLY BRACKET
static final char u00a3 = '\u00a3'; // £ - POUND SIGN
static final char u00a5 = '\u00a5'; // ¥ - YEN SIGN
static final char u201c = '\u201c'; // “ - LEFT DOUBLE QUOTATION MARK
static final char u2018 = '\u2018'; // ‘ - LEFT SINGLE QUOTATION MARK
static final char u300a = '\u300a'; // 《 - LEFT DOUBLE ANGLE BRACKET
static final char u3008 = '\u3008'; // 〈 - LEFT ANGLE BRACKET
static final char u300c = '\u300c'; // 「 - LEFT CORNER BRACKET
static final char u300e = '\u300e'; // 『 - LEFT WHITE CORNER BRACKET
static final char u3010 = '\u3010'; // 【 - LEFT BLACK LENTICULAR BRACKET
static final char u3014 = '\u3014'; // 〔 - LEFT TORTOISE SHELL BRACKET
static final char uff62 = '\uff62'; // 「 - HALFWIDTH LEFT CORNER BRACKET
static final char uff08 = '\uff08'; // ( - FULLWIDTH LEFT PARENTHESIS
static final char uff3b = '\uff3b'; // [ - FULLWIDTH LEFT SQUARE BRACKET
static final char uff5b = '\uff5b'; // { - FULLWIDTH LEFT CURLY BRACKET
static final char uffe5 = '\uffe5'; // ¥ - FULLWIDTH YEN SIGN
static final char uff04 = '\uff04'; // $ - FULLWIDTH DOLLAR SIGN
static final char[] NOT_ENDING_CHARACTERS = new char[]{u0024, u0028, u005b, u007b, u00a3, u00a5, u201c, u2018, u3008,
u300a, u300c, u300e, u3010, u3014, uff62, uff08, uff3b, uff5b, uffe5, uff04, u2060};
/**
* A shared instance of JapaneseSplitCharacter.
*/
public static final JapaneseSplitCharacter SplitCharacter = new JapaneseSplitCharacter();
/**
* An instance of DefaultSplitCharacter used for Basic Latin characters.
*/
private static final SplitCharacter defaultSplitCharacter = new DefaultSplitCharacter();
public JapaneseSplitCharacter() { }
/**
* Custom SplitCharacter method that handles Japanese characters.
* Returns <CODE>true</CODE> if the character can split a line. The splitting implementation
* is free to look ahead or look behind characters to make a decision.
*
* @param start the lower limit of <CODE>cc</CODE> inclusive
* @param current the pointer to the character in <CODE>cc</CODE>
* @param end the upper limit of <CODE>cc</CODE> exclusive
* @param cc an array of characters at least <CODE>end</CODE> sized
* @param ck an array of <CODE>PdfChunk</CODE>. The main use is to be able to call
* {@link PdfChunk#getUnicodeEquivalent(int)}. It may be <CODE>null</CODE>
* or shorter than <CODE>end</CODE>. If <CODE>null</CODE> no conversion takes place.
* If shorter than <CODE>end</CODE> the last element is used
* @return <CODE>true</CODE> if the character(s) can split a line
*/
public boolean isSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck) {
// Note: without this try/catch, an exception thrown inside isSplitCharacter() fails silently in iText and
// you have no idea there was a problem.
try {
char charCurrent = getCharacter(current, cc, ck);
int next = current + 1;
// 'end' is the exclusive upper limit of the valid text in cc; cc itself may be longer
if (next < end) {
char charNext = getCharacter(next, cc, ck);
for (char not_begin_character : NOT_BEGIN_CHARACTERS) {
if (charNext == not_begin_character) {
return false;
}
}
}
for (char not_ending_character : NOT_ENDING_CHARACTERS) {
if (charCurrent == not_ending_character) {
return false;
}
}
boolean isBasicLatin = Character.UnicodeBlock.of(charCurrent) == Character.UnicodeBlock.BASIC_LATIN;
if (isBasicLatin)
return defaultSplitCharacter.isSplitCharacter(start, current, end, cc, ck);
return true;
} catch (Exception ex) {
ex.printStackTrace();
}
return true;
}
/**
* Returns a character in the array. Note: modified from the iText default version by adding the null
* check '|| ck[Math.min(position, ck.length - 1)] == null'.
*
* @param position position in the array
* @param ck chunk array
* @param cc the character array that has to be checked
* @return the character
*/
protected char getCharacter(int position, char[] cc, PdfChunk[] ck) {
if (ck == null || ck[Math.min(position, ck.length - 1)] == null) {
return cc[position];
}
return (char) ck[Math.min(position, ck.length - 1)].getUnicodeEquivalent(cc[position]);
}
}
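The class above scans both arrays linearly on every call. One possible variation, offered as a sketch and not as part of the code I actually used, is to build a Set<Character> from each array once and test membership with contains(). The class name KinsokuTables and the methods toSet and mayBreakBetween are hypothetical, and the two arrays are shortened stand-ins for NOT_BEGIN_CHARACTERS and NOT_ENDING_CHARACTERS.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class KinsokuTables {
    // Shortened stand-ins (assumption) for the full arrays in JapaneseSplitCharacter.
    static final char[] NOT_BEGIN = {'\u3001', '\u3002', '\u300d', '\uff09', '\u2060'};
    static final char[] NOT_END = {'\u300c', '\uff08', '\u2060'};

    // Built once when the class is loaded; lookups no longer scan the arrays.
    static final Set<Character> NOT_BEGIN_SET = toSet(NOT_BEGIN);
    static final Set<Character> NOT_END_SET = toSet(NOT_END);

    /** Copies a char array into an unmodifiable set, dropping duplicates. */
    static Set<Character> toSet(char[] chars) {
        Set<Character> set = new HashSet<>();
        for (char c : chars) {
            set.add(c);
        }
        return Collections.unmodifiableSet(set);
    }

    /** Same two checks as the loops in isSplitCharacter(), expressed as set lookups. */
    static boolean mayBreakBetween(char current, char next) {
        return !NOT_END_SET.contains(current) && !NOT_BEGIN_SET.contains(next);
    }

    public static void main(String[] args) {
        System.out.println(mayBreakBetween('\u65e5', '\u672c')); // kanji followed by kanji
        System.out.println(mayBreakBetween('\u65e5', '\u3002')); // kanji followed by full stop
    }
}
```

With tables of this size the speed difference is probably negligible; the main benefit is that each check becomes a single readable call.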
Hope this helps.
I'm using iTextSharp.
I wrote an ISplitCharacter implementation following k.f.'s sample:
using System;
using System.Globalization;
using iTextSharp.text;
using iTextSharp.text.pdf;
public class CJKSplitCharacter : ISplitCharacter
{
public static ISplitCharacter Default = new CJKSplitCharacter();
private static ISplitCharacter defaultSplit = new DefaultSplitCharacter();
public bool IsSplitCharacter(int start, int current, int end, char[] cc, PdfChunk[] ck)
{
char charCurrent = GetChar(current, cc, ck);
int next = current + 1;
if (next < end) // 'end' is the exclusive upper limit of the valid text in cc
{
char charNext = GetChar(next, cc, ck);
// if next char is close char, do not break here
if (IsCloseChar(charNext))
{
return false;
}
// otherwise, if current char is close char, mark as breakable
else if (IsCloseChar(charCurrent))
{
return true;
}
}
// if current char is open char, do not break here
if (IsOpenChar(charCurrent))
{
return false;
}
// default:
// split every CJK character
if (Char.GetUnicodeCategory(charCurrent) == UnicodeCategory.OtherLetter) // CJK Letters
{
return true;
}
else
{
return defaultSplit.IsSplitCharacter(start, current, end, cc, ck);
}
}
private char GetChar(int position, char[] cc, PdfChunk[] ck)
{
char c;
if (ck == null || ck[Math.Min(position, ck.Length - 1)] == null)
{
c = cc[position];
}
else
{
c = (char)ck[Math.Min(position, ck.Length - 1)].GetUnicodeEquivalent(cc[position]);
}
return c;
}
private bool IsCloseChar(char c)
{
UnicodeCategory cat = Char.GetUnicodeCategory(c);
return (cat == UnicodeCategory.ClosePunctuation //right bracket/brace, eg: )]
|| cat == UnicodeCategory.FinalQuotePunctuation //right quote, eg: ”
|| cat == UnicodeCategory.OtherPunctuation //other punctuation, eg: ,。
);
}
private bool IsOpenChar(char c)
{
UnicodeCategory cat = Char.GetUnicodeCategory(c);
return (cat == UnicodeCategory.OpenPunctuation //left bracket/brace, eg: ([
|| cat == UnicodeCategory.InitialQuotePunctuation //left quote, eg: “
);
}
}
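For anyone on the Java side of iText, the same category-based classification can be sketched with java.lang.Character.getType, which exposes the equivalent Unicode general categories. This is only an illustration of the classification step, not a drop-in SplitCharacter. The class name CategorySketch is made up for this example; the method names mirror the C# helpers above.

```java
class CategorySketch {
    /** Closing brackets, final quotes, and other punctuation: do not start a line with these. */
    static boolean isCloseChar(char c) {
        int type = Character.getType(c);
        return type == Character.END_PUNCTUATION          // Pe, eg: ) ]
            || type == Character.FINAL_QUOTE_PUNCTUATION   // Pf, eg: right double quote
            || type == Character.OTHER_PUNCTUATION;        // Po, eg: comma, full stop
    }

    /** Opening brackets and initial quotes: do not end a line with these. */
    static boolean isOpenChar(char c) {
        int type = Character.getType(c);
        return type == Character.START_PUNCTUATION           // Ps, eg: ( [
            || type == Character.INITIAL_QUOTE_PUNCTUATION;  // Pi, eg: left double quote
    }

    /** Category Lo, which covers CJK ideographs and most kana. */
    static boolean isCjkLetter(char c) {
        return Character.getType(c) == Character.OTHER_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(isCloseChar('\u3002')); // ideographic full stop
        System.out.println(isOpenChar('\u300c'));  // left corner bracket
        System.out.println(isCjkLetter('\u4e2d')); // a CJK ideograph
    }
}
```

One caveat with this shortcut, in either language: the "other punctuation" category is broad and also contains characters such as # and @, so they are treated as non-starting characters too. If that matters for your text, an explicit list like the one in k.f.'s answer gives finer control.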