Python, 236 206 chars

s="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
### ------------------------------------------------------------
import re
c=o=127
l={}
i=len(s)/2
while i>1:
    r=re.search('(.{%d}).*\\1'%i,s)
    if r:f=r.group(1);c+=1;l[c-o]=f;s=s.replace(f,chr(c))
    else:i-=1
for i in l:s=re.sub(chr(i+o),'<TAG%d>%s</TAG%d>'%(i,l[i],i),s)
### ------------------------------------------------------------
print s

Running this on the example input picks the following fragments ('LoremIpsum', 'dummytext', 'industry', 'print', 'types', 'oft', 'ing', 'and', 'the', 'ook', 'ss', 'im', 'he', 'tt', 'en', 'er', 'le', 'pe'), and the result is:

<TAG1>LoremIpsum</TAG1>i<TAG11>ss</TAG11><TAG12>im</TAG12>ply<TAG2>dummytext</TAG2><TAG6>oft</TAG6><TAG13>he</TAG13><TAG4>print</TAG4><TAG7>ing</TAG7><TAG8>and</TAG8><TAG5>types</TAG5>e<TAG14>tt</TAG14><TAG7>ing</TAG7><TAG3>industry</TAG3>.<TAG1>LoremIpsum</TAG1>hasbe<TAG15>en</TAG15><TAG9>the</TAG9><TAG3>industry</TAG3>'<TAG11>ss</TAG11>t<TAG8>and</TAG8>ard<TAG2>dummytext</TAG2>ev<TAG16>er</TAG16>since<TAG9>the</TAG9>1500s,w<TAG13>he</TAG13>nanunknown<TAG4>print</TAG4><TAG16>er</TAG16>t<TAG10>ook</TAG10>agal<TAG17>le</TAG17>y<TAG6>oft</TAG6>y<TAG18>pe</TAG18><TAG8>and</TAG8>scramb<TAG17>le</TAG17>di<TAG14>tt</TAG14>omakea<TAG5>types</TAG5><TAG18>pe</TAG18>c<TAG12>im</TAG12><TAG15>en</TAG15>b<TAG10>ook</TAG10>.

On this wiki it is more readable when highlighted like this:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

PS. Somebody complained, so I added input and output statements. To the confused, I apologize; it seemed obvious to me. Apparently it was not, so I added the prefix/trailer statements, which are not required by the problem spec and should not count toward the code length.
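For readers who find the golfed version hard to follow, here is an ungolfed Python 3 sketch of the same regex idea. It is not the answer's code: the function name `tag_repeats`, the use of private-use code points (from U+E000) as placeholders, and the reverse expansion order are my assumptions, chosen so that a fragment containing an earlier placeholder still expands fully.

```python
import re


def tag_repeats(text, first_placeholder=0xE000):
    """Tag repeated fragments, longest first, using a regex backreference.

    Assumption: `text` contains no newlines and no characters at or above
    `first_placeholder` (hypothetical choice, not part of the original answer).
    """
    fragments = []             # fragments[k] is the text behind placeholder k
    size = len(text) // 2      # a repeated fragment cannot exceed half the text
    while size > 1:
        # (.{size}) captures a fragment; \1 demands the same fragment again later.
        match = re.search(r'(.{%d}).*\1' % size, text)
        if match:
            fragment = match.group(1)
            placeholder = chr(first_placeholder + len(fragments))
            fragments.append(fragment)
            text = text.replace(fragment, placeholder)
        else:
            size -= 1
    # Expand the newest placeholders first, so that any older placeholder
    # embedded in a newer fragment is still expanded afterwards.
    for k in reversed(range(len(fragments))):
        tag = k + 1
        text = text.replace(chr(first_placeholder + k),
                            '<TAG%d>%s</TAG%d>' % (tag, fragments[k], tag))
    return text


print(tag_repeats("abcXabcY"))
```

The loop only shrinks `size` when nothing of that length repeats, which is what gives the longest fragments the lowest tag numbers.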

Answer 3:

Haskell: 343/344 403 420 445 485 characters

Character count is 343 while using an exponential algorithm, whereas it is 344 when using a quadratic algorithm.

The code posted is the quadratic one. For the exponential algorithm, change the one occurrence of inits=<<tails to subsequences in the code.

import Data.List
l=length
e=map$either id id
(&)=stripPrefix
y@(_:w)!x=l x>1&&maybe(w!x)(isInfixOf x)(x&y)
_!_=0<0
t@(x,i)?s@(y:z)=maybe(y:t?z)(((map Right$'<':v++e x++"</"++v)++).(t?))$x&s where v="TAG"++i++">"
_?_=[]
r s=e$foldr(?)s$zip(sortBy(\a b->compare(l a)$l b)$filter(s!)$inits=<<tails s)$map show[1..]
main=getLine>>=putStr.r.map Left
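To see why swapping `inits=<<tails` for `subsequences` changes the cost class, it may help to count the candidates each one generates. The sketch below is Python rather than Haskell, and the helper names `substrings` and `subsequences` are mine; it only illustrates the candidate counts and is not a translation of the solution.

```python
from itertools import combinations


def substrings(s):
    """Non-empty contiguous substrings (what inits =<< tails yields, minus empties)."""
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]


def subsequences(s):
    """Non-empty subsequences: characters kept in order, gaps allowed."""
    return [''.join(chars)
            for r in range(1, len(s) + 1)
            for chars in combinations(s, r)]


word = "abcd"
print(len(substrings(word)), len(subsequences(word)))
```

A string of length n has n(n+1)/2 non-empty substrings but 2^n - 1 non-empty subsequences, so for "abcd" the counts are 10 and 15, and the gap widens quickly. Every substring is also a subsequence, which is presumably why the shorter name still gives correct results once the infix filter is applied.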

Input 1:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

Output 1:

<TAG338>LoremIpsum</TAG338>i<TAG72>ss</TAG72><TAG122>im</TAG122>ply<TAG336>dummytext</TAG336><TAG188>oft</TAG188><TAG91>he</TAG91><TAG275>print</TAG275><TAG153>ing</TAG153><TAG191>and</TAG191><TAG276>types</TAG276><TAG88>et</TAG88><TAG214>ting</TAG214><TAG328>industry</TAG328>.<TAG338>LoremIpsum</TAG338>hasbe<TAG123>en</TAG123><TAG183>the</TAG183><TAG328>industry</TAG328>'s<TAG73>st</TAG73><TAG191>and</TAG191>ard<TAG336>dummytext</TAG336>ev<TAG99>er</TAG99>s<TAG96>in</TAG96>ce<TAG183>the</TAG183>1500s,wh<TAG123>en</TAG123><TAG111>an</TAG111>unknown<TAG275>print</TAG275><TAG99>er</TAG99>t<TAG195>ook</TAG195>a<TAG103>ga</TAG103>l<TAG113>le</TAG113>y<TAG105>of</TAG105><TAG241>type</TAG241><TAG191>and</TAG191>scramb<TAG113>le</TAG113>dit<TAG115>to</TAG115>mak<TAG116>ea</TAG116><TAG276>types</TAG276><TAG121>pe</TAG121>c<TAG122>im</TAG122><TAG123>en</TAG123>b<TAG195>ook</TAG195>.

Input 2:

hello!TAG!</hello.TAG.</

Output 2:

<TAG28>hello</TAG28>!<TAG22>TAG</TAG22>!<TAG14></</TAG14><TAG28>hello</TAG28>.<TAG22>TAG</TAG22>.<TAG14></</TAG14>

Answer 4:

I think you can use back references to do this. See this post: "Regular Expression to detect repetition within a string". After many attempts, I currently have this expression: #([a-zA-Z ]+).*\1#, but I think it finds the first repeated string, not the largest. (That was before I knew you didn't care about words.) What you should do is:

find the largest sequence of characters repeated in the text
remove it from the text where it appears
iterate until you find nothing repeated
use tomit's method to colorize the strings you have memorized

The first step is described on this page: http://en.wikipedia.org/wiki/Longest_common_substring_problem And here is a PHP implementation: http://www.davidtavarez.com/archives/longer-common-substring-problem-php-implementation/ (you'll have to fix it: it contains HTML entities, and the comment says it returns an integer, but it is not clear what that integer represents). If it still does not work, you can try implementing Wikipedia's pseudo-code.
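As a rough illustration of the first step only, here is a Python sketch that adapts the dynamic-programming table from the longest-common-substring problem to a single string compared against itself. The function name `longest_repeated_substring` and the rule that the two occurrences must not overlap are my assumptions; the linked pages do not prescribe them.

```python
def longest_repeated_substring(s):
    """Return the longest substring that occurs twice without overlapping.

    Returns '' when nothing repeats. This is an O(n^2) sketch, not tuned code.
    """
    n = len(s)
    # table[i][j] (i < j) is the length of the common suffix of s[:i] and s[:j],
    # extended only while the two occurrences stay disjoint.
    table = [[0] * (n + 1) for _ in range(n + 1)]
    best_len, best_end = 0, 0
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            # The check against j - i stops the first occurrence from
            # running into the start of the second one.
            if s[i - 1] == s[j - 1] and table[i - 1][j - 1] < j - i:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best_len:
                    best_len, best_end = table[i][j], i
    return s[best_end - best_len:best_end]


print(longest_repeated_substring("banana"))
```

The remaining steps would call this repeatedly, replacing each fragment found before searching again, and stop once the result is shorter than two characters.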

Answer 5:

Python, 236 281 chars, no REGEX :)

Makes a set M of all strings of 2 or more characters, then iterates through them to assign tags in greedy-length order (longest first).

s="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
#s="abcd1TAGabcd2TAG"

### ----
L,C,R=len,chr,range
M,l,T,t=set(),L(s),[],0
[[M.add(s[A:B])for B in R(A+2,l)]for A in R(l)]
while 1:
 m,t=sorted([(L(m),m)if s.count(m)>1 else(0,"")for m in M])[-1][1],t+1
 if m=="":break
 T+=[(t,m)]
 s=s.replace(m,C(t))
for(t,m)in T:
 s=s.replace(C(t),"<TAG%d>%s</TAG%d>"%(t,m,t))
### ----

print s

Outputs, as expected:

<TAG1>LoremIpsum</TAG1>i<TAG11>ss</TAG11><TAG15>im</TAG15>ply<TAG2>dummytext</TAG2><TAG13>of</TAG13><TAG6>the</TAG6><TAG5>print</TAG5><TAG8>ing</TAG8><TAG9>and</TAG9><TAG4>types</TAG4>e<TAG10>tt</TAG10><TAG8>ing</TAG8><TAG3>industry</TAG3>.<TAG1>LoremIpsum</TAG1>hasbe<TAG17>en</TAG17><TAG6>the</TAG6><TAG3>industry</TAG3>'<TAG11>ss</TAG11>t<TAG9>and</TAG9>ard<TAG2>dummytext</TAG2>ev<TAG16>er</TAG16>since<TAG6>the</TAG6>1500s,wh<TAG17>en</TAG17>anunknown<TAG5>print</TAG5><TAG16>er</TAG16>t<TAG7>ook</TAG7>agal<TAG14>le</TAG14>y<TAG13>of</TAG13>ty<TAG12>pe</TAG12><TAG9>and</TAG9>scramb<TAG14>le</TAG14>di<TAG10>tt</TAG10>omakea<TAG4>types</TAG4><TAG12>pe</TAG12>c<TAG15>im</TAG15><TAG17>en</TAG17>b<TAG7>ook</TAG7>.
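The set-based approach of this answer can also be written out long-hand. The following Python 3 sketch is my reading of it, not the golfed code itself: the name `tag_repeats_by_substring_set` and the private-use placeholder characters (from U+E000) are assumptions, and ties between equally long fragments are broken by taking the lexicographically largest one, which mirrors taking the last element of the sorted (length, string) pairs.

```python
def tag_repeats_by_substring_set(text, first_placeholder=0xE000):
    """Tag repeated fragments, longest first, without regular expressions.

    Assumption: `text` has no characters at or above `first_placeholder`.
    """
    n = len(text)
    # Every substring of the original text that is at least 2 characters long.
    candidates = {text[a:b] for a in range(n) for b in range(a + 2, n + 1)}
    fragments = []
    while True:
        # Keep only candidates that still occur at least twice in the
        # partially replaced text.
        repeated = [c for c in candidates if text.count(c) > 1]
        if not repeated:
            break
        best = max(repeated, key=lambda c: (len(c), c))
        placeholder = chr(first_placeholder + len(fragments))
        fragments.append(best)
        text = text.replace(best, placeholder)
    # Candidates come from the original text, so fragments never contain
    # placeholders and the expansion order does not matter.
    for k, fragment in enumerate(fragments):
        tag = k + 1
        text = text.replace(chr(first_placeholder + k),
                            '<TAG%d>%s</TAG%d>' % (tag, fragment, tag))
    return text


print(tag_repeats_by_substring_set("abcd1TAGabcd2TAG"))
```

The test string is the commented-out one from the answer; it checks that literal "TAG" text in the input is handled like any other fragment.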

Answer 6:

Mathematica - 262 Chars

Not pure functional / Not short / Not nice / Lots of side effects /

b = "LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.\
     LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,\
     whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimen\
     book."

i = 0
a = c = "@"
v = StringFreeQ@## &
w = StringReplace@## &
t = x__ ~~ y__ ~~ __ ~~ x__ ~~ y__ /; v[x <> y, c]
NestWhile[
  w[#, (a = SortBy[StringCases[#, t -> x <> y,Overlaps -> True], -StringLength@# &][[1]]) -> c] &,
  b,
  (z = k@++i; b = w[b, a -> "<TAG" <> z <> ">" <> a <> "</TAG" <> z <> ">"] /. k -> IntegerString; True) && ! v[#, t] &]

Answer 7:

Many thanks to Dennis Williamson who helped me arrive at this approach by answering a few related questions I had on shell scripting - here and here.

Known issues with the below:

It works only for ASCII files, not binary ones
It works only if there are no newlines in the file
It takes exponentially longer as the file gets longer
It breaks easily on files more than a few KB long (it runs out of tmp disk space)

As you can see, it's a huge brute-force method, not a smart algorithm at all. I've recorded the time taken for a few sample files.

bytes  time (s)
204    1.281
407    24.916
610    269.302
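To put a rough number on how steeply the time grows, one can compute the slope of log(time) against log(size) between consecutive rows of the table. The short Python sketch below does that; the helper name `local_exponents` is mine, and three data points are far too few to settle whether the growth is polynomial or truly exponential.

```python
import math

# (bytes, seconds) pairs copied from the table above
timings = [(204, 1.281), (407, 24.916), (610, 269.302)]


def local_exponents(samples):
    """Slope of log(time) versus log(size) between consecutive samples."""
    return [math.log(t2 / t1) / math.log(n2 / n1)
            for (n1, t1), (n2, t2) in zip(samples, samples[1:])]


exponents = local_exponents(timings)
print([round(e, 2) for e in exponents])
```

The slope comes out a little above 4 for the first step and close to 6 for the second, so the effective exponent itself rises with file size. That fits the "huge brute-force" description, whatever the exact growth law turns out to be.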

The point of doing this the way I did below was more of a "thought challenge" for me: to do it in a shell environment, in as "complete" a manner as possible. Nothing more. Of course, as the results show, it's grossly inefficient, so it's completely unsuited to a real-world application.

filesize=`stat -c %s $1`
while [ $filesize -gt 1 ]
do
        filesize=`expr $filesize - 1`
        array=( "${array[@]}" $(cat $1 | sed -n ":a;/^.\{$filesize\}$/{p;b};s/.\{$filesize\}/&\n/;P;s/.//;s/\n//;ba" | sort | uniq -c | grep -v '      1' | cut -c9-) )
done

sample=$(<$1)
tag=0;
for entry in ${array[@]};
        do
        test="<[^>/]*>[^>]*$entry[^<]*</";
        if [[ ! $sample =~ $test ]];
                then ((tag++));
                sample=${sample//${entry}/<T$tag>$entry</T$tag>};
        fi;
        done;
echo $sample

Usage (the script relies on bash features such as arrays and [[ ]], so run it with bash):

bash tagwords4 sample2.txt

Code golf: “Color highlighting” of repeated text
