可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm going to make a word wrap algorithm in PHP. I want to split small chunks of text (short phrases) in n lines of maximum m characters (n is not given, so there will be as much lines as needed). The peculiarity is that lines length (in characters) has to be much balanced as possible across lines.
Example of input text:
How to do things
Wrong output (this is the normal word-wrap behavior), m=6:
How to
do
things
Desired output, always m=6:
How
to do
things
Does anyone have suggestions or guidelines on how to implement this function? Basically, I'm searching something for pretty print short phrases on two or three (as much as possible) equal length lines.
Update: It seems I'm searching exactly for a Minimum raggedness word wrap algorithm. But I can't find any implementation in a real programming language (anyone, then I can convert it in PHP).
Update 2: I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searching procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm.
回答1:
Quick and dirty, in c++
#include <sstream>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <memory.h>
using namespace std;
int cac[1000][1000];
string res[1000][1000];
vector<string> words;
int M;
int go(int a, int b){
if(cac[a][b]>= 0) return cac[a][b];
if(a == b) return 0;
int csum = -1;
for(int i=a; i<b; ++i){
csum += words[i].size() + 1;
}
if(csum <= M || a == b-1){
string sep = "";
for(int i=a; i<b; ++i){
res[a][b].append(sep);
res[a][b].append(words[i]);
sep = " ";
}
return cac[a][b] = (M-csum)*(M-csum);
}
int ret = 1000000000;
int best_sp = -1;
for(int sp=a+1; sp<b; ++sp){
int cur = go(a, sp) + go(sp,b);
if(cur <= ret){
ret = cur;
best_sp = sp;
}
}
res[a][b] = res[a][best_sp] + "\n" + res[best_sp][b];
return cac[a][b] = ret;
}
int main(int argc, char ** argv){
memset(cac, -1, sizeof(cac));
M = atoi(argv[1]);
string word;
while(cin >> word) words.push_back(word);
go(0, words.size());
cout << res[0][words.size()] << endl;
}
Test:
$ echo "The quick brown fox jumps over a lazy dog" |./a.out 10
The quick
brown fox
jumps over
a lazy dog
EDIT: just looked at the wikipedia page for minimum raggedness word wrap. Changed algorithm to the given one (with squared penalties)
回答2:
I've implemented on the same lines of Alex, coding the Wikipedia algorithm, but directly in PHP (an interesting exercise to me). Understanding how to use the optimal cost function f(j), i.e. the 'recurrence' part, is not very easy. Thanks to Alex for the well commented code.
/**
* minimumRaggedness
*
* @param string $input paragraph. Each word separed by 1 space.
* @param int $LineWidth the max chars per line.
* @param string $lineBreak wrapped lines separator.
*
* @return string $output the paragraph wrapped.
*/
function minimumRaggedness($input, $LineWidth, $lineBreak = "\n")
{
$words = explode(" ", $input);
$wsnum = count($words);
$wslen = array_map("strlen", $words);
$inf = 1000000; //PHP_INT_MAX;
// keep Costs
$C = array();
for ($i = 0; $i < $wsnum; ++$i)
{
$C[] = array();
for ($j = $i; $j < $wsnum; ++$j)
{
$l = 0;
for ($k = $i; $k <= $j; ++$k)
$l += $wslen[$k];
$c = $LineWidth - ($j - $i) - $l;
if ($c < 0)
$c = $inf;
else
$c = $c * $c;
$C[$i][$j] = $c;
}
}
// apply recurrence
$F = array();
$W = array();
for ($j = 0; $j < $wsnum; ++$j)
{
$F[$j] = $C[0][$j];
$W[$j] = 0;
if ($F[$j] == $inf)
{
for ($k = 0; $k < $j; ++$k)
{
$t = $F[$k] + $C[$k + 1][$j];
if ($t < $F[$j])
{
$F[$j] = $t;
$W[$j] = $k + 1;
}
}
}
}
// rebuild wrapped paragraph
$output = "";
if ($F[$wsnum - 1] < $inf)
{
$S = array();
$j = $wsnum - 1;
for ( ; ; )
{
$S[] = $j;
$S[] = $W[$j];
if ($W[$j] == 0)
break;
$j = $W[$j] - 1;
}
$pS = count($S) - 1;
do
{
$i = $S[$pS--];
$j = $S[$pS--];
for ($k = $i; $k < $j; $k++)
$output .= $words[$k] . " ";
$output .= $words[$k] . $lineBreak;
}
while ($j < $wsnum - 1);
}
else
$output = $input;
return $output;
}
?>
回答3:
A C version:
// This is a direct implementation of the minimum raggedness word wrapping
// algorithm from http://en.wikipedia.org/wiki/Word_wrap#Minimum_raggedness
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>
#include <limits.h>
const char* pText = "How to do things";
int LineWidth = 6;
int WordCnt;
const char** pWords;
int* pWordLengths;
int* pC;
int* pF;
int* pW;
int* pS;
int CountWords(const char* p)
{
int cnt = 0;
while (*p != '\0')
{
while (*p != '\0' && isspace(*p)) p++;
if (*p != '\0')
{
cnt++;
while (*p != '\0' && !isspace(*p)) p++;
}
}
return cnt;
}
void FindWords(const char* p, int cnt, const char** pWords, int* pWordLengths)
{
while (*p != '\0')
{
while (*p != '\0' && isspace(*p)) p++;
if (*p != '\0')
{
*pWords++ = p;
while (*p != '\0' && !isspace(*p)) p++;
*pWordLengths++ = p - pWords[-1];
}
}
}
void PrintWord(const char* p, int l)
{
int i;
for (i = 0; i < l; i++)
printf("%c", p[i]);
}
// 1st program's argument is the text
// 2nd program's argument is the line width
int main(int argc, char* argv[])
{
int i, j;
if (argc >= 3)
{
pText = argv[1];
LineWidth = atoi(argv[2]);
}
WordCnt = CountWords(pText);
pWords = malloc(WordCnt * sizeof(*pWords));
pWordLengths = malloc(WordCnt * sizeof(*pWordLengths));
FindWords(pText, WordCnt, pWords, pWordLengths);
printf("Input Text: \"%s\"\n", pText);
printf("Line Width: %d\n", LineWidth);
printf("Words : %d\n", WordCnt);
#if 0
for (i = 0; i < WordCnt; i++)
{
printf("\"");
PrintWord(pWords[i], pWordLengths[i]);
printf("\"\n");
}
#endif
// Build c(i,j) in pC[]
pC = malloc(WordCnt * WordCnt * sizeof(int));
for (i = 0; i < WordCnt; i++)
{
for (j = 0; j < WordCnt; j++)
if (j >= i)
{
int k;
int c = LineWidth - (j - i);
for (k = i; k <= j; k++) c -= pWordLengths[k];
c = (c >= 0) ? c * c : INT_MAX;
pC[j * WordCnt + i] = c;
}
else
pC[j * WordCnt + i] = INT_MAX;
}
// Build f(j) in pF[] and store the wrap points in pW[]
pF = malloc(WordCnt * sizeof(int));
pW = malloc(WordCnt * sizeof(int));
for (j = 0; j < WordCnt; j++)
{
pW[j] = 0;
if ((pF[j] = pC[j * WordCnt]) == INT_MAX)
{
int k;
for (k = 0; k < j; k++)
{
int s;
if (pF[k] == INT_MAX || pC[j * WordCnt + k + 1] == INT_MAX)
s = INT_MAX;
else
s = pF[k] + pC[j * WordCnt + k + 1];
if (pF[j] > s)
{
pF[j] = s;
pW[j] = k + 1;
}
}
}
}
// Print the optimal solution cost
printf("f : %d\n", pF[WordCnt - 1]);
// Print the optimal solution, if any
pS = malloc(2 * WordCnt * sizeof(int));
if (pF[WordCnt - 1] != INT_MAX)
{
// Work out the solution's words by back tracking the
// wrap points from pW[] and store them on the pS[] stack
j = WordCnt - 1;
for (;;)
{
*pS++ = j;
*pS++ = pW[j];
if (!pW[j]) break;
j = pW[j] - 1;
}
// Print the solution line by line, word by word
// in direct order
do
{
int k;
i = *--pS;
j = *--pS;
for (k = i; k <= j; k++)
{
PrintWord(pWords[k], pWordLengths[k]);
printf(" ");
}
printf("\n");
} while (j < WordCnt - 1);
}
return 0;
}
Output 1:
ww.exe
Input Text: "How to do things"
Line Width: 6
Words : 4
f : 10
How
to do
things
Output 2:
ww.exe "aaa bb cc ddddd" 6
Input Text: "aaa bb cc ddddd"
Line Width: 6
Words : 4
f : 11
aaa
bb cc
ddddd
Output 3:
ww.exe "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm." 60
Input Text: "I started a bounty for this. Is it possible that do not exist any public implementation of Minimum raggedness algorithm in any procedural language? I need something written in a way that can be translated into procedural instructions. All I can find now is just a bounch of (generic) equation that however need a optimal searhing procedure. I will be grateful also for an implementation that can only approximate that optimal searching algorithm."
Line Width: 60
Words : 73
f : 241
I started a bounty for this. Is it possible that do not
exist any public implementation of Minimum raggedness
algorithm in any procedural language? I need something
written in a way that can be translated into procedural
instructions. All I can find now is just a bounch of
(generic) equation that however need a optimal searhing
procedure. I will be grateful also for an implementation
that can only approximate that optimal searching algorithm.
回答4:
I think the simplest way to look at it - is with iteration between limits
E.g.
/**
* balancedWordWrap
*
* @param string $input
* @param int $maxWidth the max chars per line
*/
function balancedWordWrap($input, $maxWidth = null) {
$length = strlen($input);
if (!$maxWidth) {
$maxWidth = min(ceil($length / 2), 75);
}
$minWidth = min(ceil($length / 2), $maxWidth / 2);
$permutations = array();
$scores = array();
$lowestScore = 999;
$lowest = $minWidth;
foreach(range($minWidth, $maxWidth) as $width) {
$permutations[$width] = wordwrap($input, $width);
$lines = explode("\n", $permutations[$width]);
$max = 0;
foreach($lines as $line) {
$lineLength = strlen($line);
if ($lineLength > $max) {
$max = $lineLength;
}
}
$score = 0;
foreach($lines as $line) {
$lineLength = strlen($line);
$score += pow($max - $lineLength, 2);
}
$scores[$width] = $score;
if ($score < $lowestScore) {
$lowestScore = $score;
$lowest = $width;
}
}
return $permutations[$lowest];
}
Given the input "how to do things"
it outputs
How
to do
things
Given the input "Mary had a little lamb"
it outputs
Mary had a
little lamb
Given the input "This extra-long paragraph was writtin to demonstrate how the fmt(1)
program handles longer inputs. When testing inputs, you don\'t want them to be too short, nor too long, because the quality of the program can only be determined upon inspection of complex content. The quick brown fox jumps over the lazy dog. Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.", and limited to 75 chars max width, it outputs:
This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
be too short, nor too long, because the quality of the program can only be
determined upon inspection of complex content. The quick brown fox jumps
over the lazy dog. Congress shall make no law respecting an establishment
of religion, or prohibiting the free exercise thereof; or abridging the
freedom of speech, or of the press; or the right of the people peaceably
to assemble, and to petition the Government for a redress of grievances.
回答5:
Justin's link to Knuth's Breaking Paragraphs Into Lines is the historically best answer. (Newer systems also apply microtypography techniques such as fiddling with character widths, kerning, and so on, but if you're simply looking for monospaced plain-text, these extra approaches won't help.)
If you just want to solve the problem, the fmt(1)
utility supplied on many Linux systems by the Free Software Foundation implements a variant of Knuth's algorithm that also attempts to avoid line breaks at the end of sentences. I wrote your inputs and a larger example, and ran them through fmt -w 20
to force 20-character lines:
$ fmt -w 20 input
Lorem ipsum dolor
sit amet
Supercalifragilisticexpialidocious
and some other
small words
One long
extra-long-word
This extra-long
paragraph
was writtin to
demonstrate how the
`fmt(1)` program
handles longer
inputs. When
testing inputs,
you don't want them
to be too short,
nor too long,
because the quality
of the program can
only be determined
upon inspection
of complex
content. The quick
brown fox jumps
over the lazy
dog. Congress
shall make no
law respecting
an establishment
of religion, or
prohibiting the
free exercise
thereof; or
abridging the
freedom of speech,
or of the press;
or the right of the
people peaceably
to assemble,
and to petition
the Government
for a redress of
grievances.
The output looks much better if you allow it the default 75 characters width for non-trivial input:
$ fmt input
Lorem ipsum dolor sit amet
Supercalifragilisticexpialidocious and some other small words
One long extra-long-word
This extra-long paragraph was writtin to demonstrate how the `fmt(1)`
program handles longer inputs. When testing inputs, you don't want them
to be too short, nor too long, because the quality of the program can
only be determined upon inspection of complex content. The quick brown
fox jumps over the lazy dog. Congress shall make no law respecting an
establishment of religion, or prohibiting the free exercise thereof;
or abridging the freedom of speech, or of the press; or the right of
the people peaceably to assemble, and to petition the Government for a
redress of grievances.
回答6:
Here is a bash version:
#! /bin/sh
if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
echo "Usage: balance <width> [ <string> ]"
echo " "
echo " if string is not passed as parameter it will be read from STDIN\n"
exit 2
elif [ $# -le 1 ] ; then
LINE=`cat`
else
LINE="$2"
fi
LINES=`echo "$LINE" | fold -s -w $1 | wc -l`
MAX=$1
MIN=0
while [ $MAX -gt $(($MIN+1)) ]
do
TRY=$(( $MAX + $MIN >> 1 ))
NUM=`echo "$LINE" | fold -s -w $TRY | wc -l`
if [ $NUM -le $LINES ] ; then
MAX=$TRY
else
MIN=$TRY
fi
done
echo "$LINE" | fold -s -w $MAX
example:
$ balance 50 "Now is the time for all good men to come to the aid of the party."
Now is the time for all good men
to come to the aid of the party.
Requires 'fold' and 'wc' which are usually available where bash is installed.