I've already worked out a solution to this in PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and JavaScript, but I'd also like to see how quickly this could be done in any other major language (mostly C#, Java, etc.).
- Return only words with an occurrence greater than X
- Return only words with a length greater than Y
- Ignore common terms like "and", "is", "the", etc.
- Feel free to strip punctuation prior to processing (i.e., "John's" becomes "John")
- Return results in a collection/array
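For reference, the requirements above could be sketched in Python roughly like this. It is only a minimal sketch: `frequent_words`, `min_count` (X), `min_length` (Y), and the tiny `STOP_WORDS` set are names and values made up for illustration, and a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be far longer.
STOP_WORDS = {"and", "is", "the", "a", "an", "of", "to", "in", "it"}

def frequent_words(text, min_count=1, min_length=2, stop_words=STOP_WORDS):
    """Return (word, count) pairs where count > min_count and len(word) > min_length."""
    # Drop possessive suffixes so "John's" becomes "John", then lowercase.
    cleaned = re.sub(r"['’]s\b", "", text).lower()
    # Keep runs of letters only; every other character acts as a separator.
    words = re.findall(r"[a-z]+", cleaned)
    counts = Counter(
        w for w in words if len(w) > min_length and w not in stop_words
    )
    # most_common() sorts by count, highest first.
    return [(w, c) for w, c in counts.most_common() if c > min_count]
```

Calling `frequent_words(text, min_count=X, min_length=Y)` returns a list of `(word, count)` tuples, which covers the "collection/array" requirement.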
Extra Credit
- Keep quoted statements together (e.g., "They were 'too good to be true' apparently")
Here, 'too good to be true' would be the statement to keep together.
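One possible way to pull out quoted statements before counting words is a regular expression that only treats a single quote as a delimiter when it is not attached to a word on the outside. This is a rough sketch, not a full solution: `extract_quoted` is a hypothetical helper, it only handles straight single quotes, and plural possessives such as "dogs'" could still confuse it.

```python
import re

# An opening quote must not follow a word character, and a closing quote
# must not be followed by one, so the apostrophe in "John's" is skipped.
QUOTED = re.compile(r"(?<!\w)'(.+?)'(?!\w)")

def extract_quoted(text):
    """Return (phrases, remainder): quoted phrases and the text without them."""
    phrases = QUOTED.findall(text)
    # Replace each quoted phrase with a space so neighbouring words stay apart.
    remainder = QUOTED.sub(" ", text)
    return phrases, remainder
```

The phrases can then be counted as single units, while the remainder goes through the normal word-frequency pass.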
Extra-Extra Credit
- Can your script determine which words should be kept together based on how often they are found together, without knowing the words beforehand? Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the phrase here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
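A simple way to approach this is to count adjacent word pairs (bigrams) and keep the ones that repeat, ignoring pairs that contain a common term. The sketch below assumes that approach; `frequent_pairs`, `min_count`, and the small `STOP_WORDS` set are illustrative names and values, and it does not respect sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "on", "to", "of",
              "it", "has", "be", "but"}

def frequent_pairs(text, min_count=2, stop_words=STOP_WORDS):
    """Return (phrase, count) for adjacent word pairs seen at least min_count times."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Pair each token with its successor, skipping pairs with a stop word.
    pairs = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a not in stop_words and b not in stop_words
    )
    return [(f"{a} {b}", n) for (a, b), n in pairs.most_common() if n >= min_count]
```

On the example paragraph above, this returns `[("fruit fly", 3)]`. On longer texts a raw count will favour pairs of individually common words, so a statistical score such as pointwise mutual information would probably be a better ranking.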
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
- It would be great to see the output of your code, along with how long the operation took.
Python (258 chars as is, including 66 chars for the first line and 30 chars for punctuation removal):
Output: