Code Golf: Quickly Build List of Keywords from Tex

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).

Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
Return results in a collection/array

Extra Credit

Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")
Where 'too good to be true' would be the actual statement

Extra-Extra Credit

Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

It would be great to see the results of your code, output, in addition to how long the operation lasted.

标签： code-golf text-parsing language-agnostic rosetta-stone

13条回答

一夜七次

2楼-- · 2019-02-02 13:11

Perl in only 43 characters.

perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'

Here is an example of it's use:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

If you need to list only the lowercase versions, it requires two more characters.

perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'

For it to work on the specified text requires 58 characters.

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'

real    0m0.679s
user    0m0.304s
sys     0m0.084s

Here is the last example expanded a bit.

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}

0人赞添加讨论(0) 举报

我想做一个坏孩纸

3楼-- · 2019-02-02 13:16

C# 3.0 (with LINQ)

Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

This is however far from the most efficient method, being O(n^2) with the number of words, rather than O(n), which is optimal in this case I believe. I'll see if I can creater a slightly longer method that is more efficient.

Here are the results of the function run on the sample text (min occurences: 3, min length: 2).

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

And my test program:

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}

0人赞添加讨论(0) 举报

唯我独甜

4楼-- · 2019-02-02 13:19

In C#:

Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.

0人赞添加讨论(0) 举报

爷的心禁止访问

5楼-- · 2019-02-02 13:23

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Results:

  7 be
  6 to
[...]
  1 2.
  1 -

With occurence greater than X:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

Return only words with a length greater than Y (put Y+1 dots in second grep):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c

Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):

sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c

Return results in a collection/array: it is already like an array for shell: first column is count, second is word.

0人赞添加讨论(0) 举报

男人必须洒脱

6楼-- · 2019-02-02 13:28

F#: 304 chars

let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)

0人赞添加讨论(0) 举报

▲ chillily

7楼-- · 2019-02-02 13:28

Another Python solution, at 247 chars. The actual code is a single line of highly dense Python line of 134 chars that computes the whole thing in a single expression.

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

A much longer version with plenty of comments for you reading pleasure:

# High and low count boundaries.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A special function that groups similar strings in a list into a 
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace 
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper)) 
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurences and common words using yet another 
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)

print result

The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.

Results:

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}

0人赞添加讨论(0) 举报

1 2 3 下一页