The challenge:
Build an ASCII chart of the most commonly used words in a given text.
The rules:
- Only accept `a-z` and `A-Z` (alphabetic characters) as part of a word.
- Ignore casing (`She` == `she` for our purpose).
- Ignore the following words (quite arbitrary, I know): `the, and, of, to, a, i, it, in, or, is`
Clarification: considering `don't`: this would be taken as 2 different 'words' in the ranges `a-z` and `A-Z` (`don` and `t`).

Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).
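Under these rules a word is simply a maximal run of letters, so a tokenizer is tiny. A non-golfed illustrative sketch in Python (the name `tokenize` is my own):

```python
import re

def tokenize(text):
    # Lowercase first, then keep only maximal runs of a-z;
    # apostrophes, digits and punctuation all act as separators.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Don't"))        # ['don', 't']
print(tokenize("She == she"))   # ['she', 'she']
```

Note how `Don't` indeed falls apart into the two 'words' `don` and `t`.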
Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

- Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
- The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
- Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).
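For reference, the whole spec can be sketched in a few lines of non-golfed Python. The reading of the fitting rule here (both `|` characters plus the trailing space count toward the 80) is my own interpretation, and the names `chart`, `IGNORE`, `rows` and `width` are mine:

```python
import re
from collections import Counter

IGNORE = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def chart(text, rows=22, width=80):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in IGNORE]
    top = Counter(words).most_common(rows)
    if not top:
        return ""
    # Each output line is '|' + underscores + '| ' + word + trailing space,
    # so a word w with frequency f may use at most width - len(w) - 4
    # underscores. Pick the largest scale factor k that fits every row.
    k = min((width - len(w) - 4) / f for w, f in top)
    lines = [" " + "_" * int(top[0][1] * k)]          # roof of the first bar
    for w, f in top:
        lines.append("|" + "_" * int(f * k) + "| " + w)
    return "\n".join(lines)
```

To match the command-line behaviour you would feed it `sys.stdin.read()` or the contents of the file named in `sys.argv`. Taking the minimum of the per-word limits is what handles the "long second word" trap: one long, popular word shrinks every bar.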
An example:
The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).
This specific text would yield the following chart:
```
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
```
For your information: these are the frequencies the above chart is built upon:
[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t', 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]
A second example (to check if you implemented the complete spec):
Replace every occurrence of `you` in the linked Alice in Wonderland file with `superlongstringstring`:

```
 ________________________________________________________________
|________________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|______________________________| as
|___________________________| her
|_________________________| with
|_________________________| at
|________________________| s
|________________________| t
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|_________________| be
|_________________| not
|________________| they
|________________| so
```
The winner:
Shortest solution (by character count, per language). Have fun!
Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):
```
Language            Relaxed  Strict
=========           =======  ======
GolfScript            130      143
Perl                           185
Windows PowerShell    148      199
Mathematica                    199
Ruby                  185      205
Unix Toolchain        194      228
Python                183      243
Clojure               282
Scala                 311
Haskell               333
Awk                   336
R                     298
Javascript            304      354
Groovy                321
Matlab                         404
C#                             422
Smalltalk             386
PHP                            450
F#                             452
TSQL                  483      507
```
The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws `|____|` bars, closes the first bar on top with a `____` line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use the traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).
C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe PowerShell could do better.

422 chars with `lendivisor` inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):
Java - 886 865 756 744 742 744 752 742 714 680 chars

Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.

Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (`\s` in the regex replaced by a space, and `ArrayList` replaced by `Vector`). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text into stdin and hit `Ctrl+Z` to get the result.

Update 752 > 742 chars: I removed `public` and a space, made the classname 1 char instead of 2, and it's now ignoring one-letter words.

Update 742 > 714 chars: Updated as per comments of Carl: removed a redundant assignment (742 > 730), replaced `m.containsKey(k)` by `m.get(k)!=null` (730 > 728), introduced substringing of the line (728 > 714).

Update 714 > 680 chars: Updated as per comments of Rotsor: improved the bar size calculation to remove unnecessary casting and improved `split()` to remove an unnecessary `replaceAll()`.

More readable version:

Output:

It pretty much sucks that Java doesn't have `String#join()` and closures (yet).

Edit by Rotsor:
I have made several changes to your solution:

The condensed code is 688 711 684 characters long:

The fast version (720 693 characters):

More readable version:

The version without behaviour improvements is 615 characters:
Perl: 203 202 201 198 195 208 203 / 231 chars

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):
The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)
PS: Reddit says "Hi" ;)
Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.
Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.
Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block - swapped out both regexes to no longer use the /i option, as it is no longer needed. Swapped the explicit conditional x?y:z construct for the traditional perlgolf || implicit conditional construct - /^...$/i?1:$x{$_}++ became /^...$/||$x{$_}++. Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.
Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob
Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars... lol. Updated the example to match current output.
Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.
Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)
Examples:
Alternative implementation in pathological case example:
Gawk -- 336 (originally 507) characters
(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops, this one broke the formatting); tweaking some more; taking up Matt's challenge and desperately tweaking some more; found another place to save a few, but gave two back to fix the bar length bug)
Heh heh! I am momentarily ahead of [Matt's JavaScript][1] solution counter-challenge! ;) and [AKX's python][2].
The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.
It is all terribly inefficient, and with all the golfifications I've made, it has gotten to be pretty awful as well.
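The repeated scan-and-delete selection described above looks roughly like this in Python - a sketch of the technique only, not a translation of the awk (the name `by_repeated_scan` is my own):

```python
def by_repeated_scan(freq, rows=22):
    """Emit (word, count) pairs in descending frequency by rescanning the
    map for the current maximum each time, as awk must do when hash
    iteration order cannot be controlled. Mutates freq via pop()."""
    out = []
    for _ in range(min(rows, len(freq))):
        best = max(freq, key=freq.get)      # full O(n) scan per row
        out.append((best, freq.pop(best)))  # 'print and delete' step
    return out

print(by_repeated_scan({"she": 553, "you": 481, "said": 462}))
# [('she', 553), ('you', 481), ('said', 462)]
```

At 22 rows this is O(22n), which is fine for golf but explains the "terribly inefficient" remark: a single sort would do the same job.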
Minified:
line breaks for clarity only: they are not necessary and should not be counted.
Output:
Readable; 633 characters (originally 949):
Haskell - 366 351 344 337 333 characters

(One line break in `main` added for readability, and no line break needed at the end of the last line.)

How it works is best seen by reading the argument to `interact` backwards:

- `map f` lowercases alphabetics and replaces everything else with spaces.
- `words` produces a list of words, dropping the separating whitespace.
- ``filter (`notElem` words "the and of to a i it in or is")`` discards all entries with forbidden words.
- `group . sort` sorts the words, and groups identical ones into lists.
- `map h` maps each list of identical words to a tuple of the form `(-frequency, word)`.
- `take 22 . sort` sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
- `b` maps tuples to bars (see below).
- `a` prepends the first line of underscores, to complete the topmost bar.
- `unlines` joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so `||` would be a bar of zero length. The function `b` maps `c x` over `x`, where `x` is the list of histograms. The entire list is passed to `c`, so that each invocation of `c` can compute the scale factor for itself by calling `u`. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using `-frequency`. This removes the need to `reverse` the `sort`, since sorting (ascending) by `-frequency` places the words with the largest frequency first. Later, in the function `u`, two `-frequency` values are multiplied, which cancels the negation out.

PHP CLI version (450 chars)
This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!
Usage:
php.exe <this.php> <file.txt>
Minified:
Human readable:
Output:
When there is a long word, the bars are adjusted properly: