(Thanks to greg0ire below for helping with key concepts)
The challenge: Build a program that finds all repeated substrings and "tags" them with color attributes (effectively highlighting them in XML).
The rules:
- This should only be done for substrings of length 2 or more.
- Substrings are just strings of consecutive characters, which may include non-alphabetic characters. Note that spaces and other punctuation do not delimit substrings.
- Character casing cannot be ignored.
- The "highlight" should be done by tagging the substring in XML. Your tagging should be of the form <TAG#>theSubstring</TAG#>, where # is a positive number unique to that substring and identical substrings.
- The priority of the algorithm is to find the longest substring, not how many times it matches within the text.
Note: The order of the tagging shown in the example below is not important. It's just used by the OP for clarity.
An example input:
LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.
A partially correct output (the OP may NOT have tagged every occurrence perfectly in this example):
<TAG1>LoremIpsum</TAG1>issimply<TAG2>dummytext</TAG2>of<TAG5>the</TAG5><TAG3>print</TAG3>ingand<TAG4>type</TAG4>setting<TAG6>industry</TAG6>.<TAG1>LoremIpsum</TAG1>hasbeen<TAG5>the</TAG5><TAG6>industry</TAG6>'sstandard<TAG2>dummytext</TAG2>eversince<TAG5>the</TAG5>1500s,whenanunknown<TAG3>print</TAG3>ertookagalleyof<TAG4>type</TAG4>andscrambledittomakea<TAG4>type</TAG4>specimenbook.
Your code should be able to handle edge cases, such as the following:
Example Input 2:
hello!TAG!</hello.TAG.</
Example Output 2:
<TAG1>hello</TAG1>!<TAG2>TAG</TAG2>!<TAG3></</TAG3><TAG1>hello</TAG1>.<TAG2>TAG</TAG2>.<TAG3></</TAG3>
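A quick way to sanity-check a candidate output against the rules above is sketched here in Python. It is only an illustration, not part of the challenge: the name `check_tagging` is hypothetical, and the sketch assumes tags are never nested and that the input itself contains no literal `<TAGn>` markers.

```python
import re

# Matches one tagged span; assumes tags are not nested (an assumption of this sketch).
TAG_RE = re.compile(r"<TAG(\d+)>(.*?)</TAG\1>", re.DOTALL)


def check_tagging(original, tagged):
    """Return True if `tagged` is a plausible tagging of `original`."""
    # Removing every opening and closing tag must give back the input unchanged.
    if re.sub(r"</?TAG\d+>", "", tagged) != original:
        return False
    number_to_text = {}
    text_to_number = {}
    for number, text in TAG_RE.findall(tagged):
        # Only substrings of length 2 or more may be tagged.
        if len(text) < 2:
            return False
        # A tag number must always wrap the same substring...
        if number_to_text.setdefault(number, text) != text:
            return False
        # ...and identical substrings must always share one number.
        if text_to_number.setdefault(text, number) != number:
            return False
    return True


example_in = "hello!TAG!</hello.TAG.</"
example_out = ("<TAG1>hello</TAG1>!<TAG2>TAG</TAG2>!<TAG3></</TAG3>"
               "<TAG1>hello</TAG1>.<TAG2>TAG</TAG2>.<TAG3></</TAG3>")
print(check_tagging(example_in, example_out))
```

The check covers only the formal rules (length, numbering, round-tripping); it does not verify that the longest repeats were preferred.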
The winner:
- Most elegant solution wins (judged by others' comments and upvotes)
- Bonus points/consideration for solutions utilizing shell scripting
Minor clarifications:
- Input can be hard coded or read from a file
- The criterion remains "elegance", which admittedly IS slightly vague, but it also encapsulates simple character/line counts. Comments by others and/or upvotes are also indicative of how the SO community views the challenge
Haskell: 343/344
Previous counts: 403, 420, 445, 485 characters. The character count is 343 when using an exponential algorithm, whereas it is 344 when using a quadratic algorithm.
The code posted is the quadratic one. For the exponential algorithm, change the one occurrence of inits=<<tails to subsequences in the code.
Input 1:
Output 1:
Input 2:
Output 2:
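To see why swapping inits=<<tails for subsequences changes the complexity class, here is a Python analogue (an illustration only, not the Haskell answer itself). In Haskell, inits=<<tails enumerates every prefix of every suffix, i.e. the contiguous substrings, while subsequences enumerates every selection of characters in order. The helper names below are made up for this sketch, and empty strings are left out for simplicity.

```python
from itertools import combinations


def contiguous_substrings(s):
    # Python analogue of Haskell's `inits =<< tails`: about n*(n+1)/2 non-empty results.
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]


def all_subsequences(s):
    # Python analogue of Haskell's `subsequences`: 2**n - 1 non-empty results.
    return ["".join(chars)
            for k in range(1, len(s) + 1)
            for chars in combinations(s, k)]


sample = "abcd"
subs = contiguous_substrings(sample)
seqs = all_subsequences(sample)
print(len(subs), len(seqs))
```

Every contiguous substring is also a subsequence, so the shorter spelling still covers all candidates; it just enumerates exponentially many extra ones (such as "ad" here) that never occur contiguously in the text.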
Mathematica - 262 Chars
Not pure functional / Not short / Not nice / Lots of side effects /
I think you can use back references to do this. See this post: Regular Expression to detect repetition within a string. After many attempts, the expression I have for the moment is #([a-zA-Z ]+).*\1#, but I think it finds the first repeated string, not the largest...
This was before I knew you didn't care about words... What you should do is: the step is described on this page: http://en.wikipedia.org/wiki/Longest_common_substring_problem
And here is a PHP implementation: http://www.davidtavarez.com/archives/longer-common-substring-problem-php-implementation/ (you'll have to fix it: it contains HTML entities, and the comment says it returns an integer, but we don't know what it represents...). If it still does not work, you can try to implement Wikipedia's pseudo-code.
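The suspicion about the back-reference expression can be checked with a small experiment. The sketch below uses Python's `re` module rather than PHP (so the `#` delimiters are dropped), and `longest_repeat` is a hypothetical brute-force helper written just for comparison, not the Wikipedia algorithm.

```python
import re

text = "abXabcdefYcdef"

# Same expression as above, minus the PHP delimiters.
pattern = re.compile(r"([a-zA-Z ]+).*\1")
first_found = pattern.search(text).group(1)


def longest_repeat(s):
    """Brute force: longest substring (length >= 2) that occurs again without overlap."""
    for length in range(len(s) // 2, 1, -1):
        for i in range(len(s) - length + 1):
            sub = s[i:i + length]
            if s.find(sub, i + length) != -1:
                return sub
    return None


print(first_found)           # what the regex captures
print(longest_repeat(text))  # the longest repeat in the text
```

On this input the regex captures "ab", because the search settles on the leftmost starting position that admits any repeat, while the longest repeated substring is "cdef".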
Perl
206, 189, 188, 199, 157 chars, excluding original string and last print.
Many thanks to Dennis Williamson, who helped me arrive at this approach by answering a few related questions I had on shell scripting - here and here.
Known issues with the below:
As you can see, it's a huge brute-force method - not a smart algorithm at all. I've recorded the time taken for a few sample files.
The point of doing this as I did below was more of a "thought-challenge" for me - to do this in a shell environment and in a manner as "complete" as possible. Nothing more. Of course, as the results show, it's grossly inefficient, so it's completely unsuited for a real-world application.
Usage would be as:
Python, 236, 281 chars, no REGEX :)
Makes a set M of all 2+ character strings, then iterates through them to assign them in greedy-length order.
Outputs, as expected:
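For readers who want something runnable, here is an ungolfed sketch of the approach described: build the set of all 2+ character substrings, then hand out tags longest-first. It is not the answer's code. The names `tag_repeats`, `owner` and `starts` are made up for this illustration, and the tie-breaking rule (alphabetical among equal lengths) and the requirement that occurrences must not overlap are assumptions.

```python
def tag_repeats(text):
    n = len(text)
    # The set "M": every substring of length 2 or more.
    candidates = {text[i:j] for i in range(n) for j in range(i + 2, n + 1)}

    owner = [0] * n   # owner[i] = tag number covering position i (0 = untagged)
    starts = {}       # start index -> (tag number, substring length)
    next_tag = 1

    # Greedy-length order: longest first, ties broken alphabetically (an assumption).
    for sub in sorted(candidates, key=lambda s: (-len(s), s)):
        hits = []
        pos = text.find(sub)
        while pos != -1:
            end = pos + len(sub)
            if not any(owner[pos:end]):
                hits.append(pos)
                pos = text.find(sub, end)      # skip ahead so hits never overlap
            else:
                pos = text.find(sub, pos + 1)  # already claimed by a longer match
        if len(hits) >= 2:                     # only repeated substrings get a tag
            for pos in hits:
                owner[pos:pos + len(sub)] = [next_tag] * len(sub)
                starts[pos] = (next_tag, len(sub))
            next_tag += 1

    # Emit the text with tags wrapped around the claimed spans.
    out = []
    i = 0
    while i < n:
        if i in starts:
            tag, length = starts[i]
            out.append(f"<TAG{tag}>{text[i:i + length]}</TAG{tag}>")
            i += length
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(tag_repeats("hello!TAG!</hello.TAG.</"))
```

On Example Input 2 this reproduces Example Output 2. Like the other brute-force answers, it examines every substring and is only practical for short inputs.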