How should I find repeated word sequences

2019-03-30 08:42发布

站内文章 / Java

40 0

姐就是有狂的资本

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I need to detect the presence of multiple blocks of columnar data given only their headings. Nothing else is known about the data except the heading words, which are different for every set of data.

Importantly, it is not known before hand how many words are in each block nor, therefore, how many blocks there are.

Equally important, the word list is always relatively short - less than 20.

So, given a list or array of heading words such as:

Opt
Object
Type
Opt
Object
Type
Opt
Object
Type

what's the most processing-efficient way to determine that it consists entirely of the repeating sequence:

Opt
Object
Type

It must be an exact match, so my first thought is to search [1+] looking for matches to [0], calling them index n,m,... Then if they are equidistant check [1] == [n+1] == [m+1], and [2] == [n+2] == [m+2] etc.

EDIT: It must work for word sets where some of the words are themselves repeated within a block, so

Opt
Opt
Object
Opt
Opt
Object

is a set of 2

Opt
Opt
Object

回答1:

If the list is made of x repeating groups such that each group contains n elements...

We know there is at least 1 group so we will see if there 2 repeating groups, test by comparing the first half of the list and the second half.

1) If the above is true we know that that the solution is a factor of 2

2) If the above is false we move to the next largest prime number which is divisible by the total number of words...

At each step we check for equality among the lists, if we find it then know we have a solution with that factor in it.

We want to return a list of words for which we have the greatest factor of the first prime number for which we find equality among sub lists.

So we apply the above formula on the sub list knowing all sub lists are equal... therefore the solution is best solved recursively. That is we only need to consider the current sub list in isolation.

The solution will be extremely efficient if loaded with a short table of primes... after this it will be necessary to compute them but the list would have to be non trivial if even a list of only a few dozen primes are taken into account.

回答2:

Can the unit sequence contain repetitions of its own? Do you know the length of the unit sequence?

e.g.

ABCABCABCDEFABCABCABCDEFABCABCABCDEF

where the unit sequence is ABCABCABCDEF

If the answer is yes, you've got a difficult problem, I think, unless you know the length of the unit sequence (in which case the solution is trivial, you just make a state machine that first stores the unit sequence, then verifies that the each element rest of the sequence corresponds to each element of the unit sequence).

If the answer is no, use this variant Floyd's cycle-finding algorithm to identify the unit sequence:

Initialize pointers P1 and P2 to the beginning of the sequence.
For each new element, increment pointer P1 every time, and increment pointer P2 every other time (keep a counter around to do this).
If P1 points to an identical elements of P2, you've found a unit sequence.
Now repeat through the rest of the sequence to verify that it consists of duplicates.

UPDATE: you've clarified your problem to state that the unit sequence may contain repetitions of its own. In this case, use the cycle-finding algorithm, but it's only guaranteed to find potential cycles. Keep it running throughout the length of the sequence, and use the following state machine, starting in state 1:

State 1: no cycle found that works; keep looking. When the cycle-finding algorithm finds a potential cycle, verify that you've gotten 2 copies of a preliminary unit sequence from P, and go to state 2. If you reach the end of the input, go to state 4.

State 2: preliminary unit sequence found. Run through the input as long as the cycle repeats identically. If you reach the end of the input, go to state 3. If you find an input element that is different from the corresponding element of the unit sequence, go back to state 1.

State 3: The input is a repetition of a unit sequence if the end of the input consists of complete repetitions of the unit sequence. (If it's midway through a unit sequence, e.g. ABCABCABCABCAB then a unit sequence found, but it does not consist of complete repetitions.)

State 4: No unit sequence found.

In my example (repeating ABCABCABCDEF) the algorithm starts by finding ABCABC, which would put it in state 2, and it would stay there until it hit the first DEF, which would put it back in state 1, then probably jump back and forth between states 1 and 2, until it reached the 2nd ABCABCABCDEF, at which point it would re-enter state 2, and at the end of the input it would be in state 3.

回答3:

A better answer than my other one: a Java implementation which works, should be straightforward to understand, and is generic:

package com.example.algorithms;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

interface Processor<T> {
    public void process(T element);
}

public class RepeatingListFinder<T> implements Processor<T> {

    private List<T> unit_sequence = new ArrayList<T>();
    private int repeat_count = 0;
    private int partial_matches = 0;
    private Iterator<T> iterator = null;

    /* Class invariant:
     * 
     * The sequence of elements passed through process()
     * can be expressed as the concatenation of
     *        the unit_sequence repeated "repeat_count" times,
     *   plus the first "element_matches" of the unit_sequence.
     * 
     * The iterator points to the remaining elements of the unit_sequence,
     * or null if there have not been any elements processed yet.
     */

    public void process(T element) {
        if (unit_sequence.isEmpty() || !iterator.next().equals(element))
        {
            revise_unit_sequence(element);
            iterator = unit_sequence.iterator();
            repeat_count = 1;
            partial_matches = 0;
        }
        else if (!iterator.hasNext())
        {
            iterator = unit_sequence.iterator();
            ++repeat_count;
            partial_matches = 0;
        }
        else
        {
            ++partial_matches;
        }
    }

    /* Unit sequence has changed. 
     * Restructure and add the new non-matching element. 
     */
    private void revise_unit_sequence(T element) {
        if (repeat_count > 1 || partial_matches > 0)
        {
            List<T> new_sequence = new ArrayList<T>();
            for (int i = 0; i < repeat_count; ++i)
                new_sequence.addAll(unit_sequence);
            new_sequence.addAll(
                    unit_sequence.subList(0, partial_matches));

            unit_sequence = new_sequence;
        }
        unit_sequence.add(element);
    }

    public List<T> getUnitSequence() { 
        return Collections.unmodifiableList(unit_sequence);
    }
    public int getRepeatCount() { return repeat_count; }
    public int getPartialMatchCount() { return partial_matches; }
    public String toString()
    {
        return "("+getRepeatCount()
        +(getPartialMatchCount() > 0 
            ? (" "+getPartialMatchCount()
                +"/"+unit_sequence.size())
            : "")
        +") x "+unit_sequence;
    }

    /********** static methods below for testing **********/

    static public List<Character> stringToCharList(String s)
    {
        List<Character> result = new ArrayList<Character>();
        for (char c : s.toCharArray())
            result.add(c);
        return result;
    }

    static public <T> void test(List<T> list)
    {
        RepeatingListFinder<T> listFinder 
            = new RepeatingListFinder<T>();
        for (T element : list)
            listFinder.process(element);
        System.out.println(listFinder);
    }

    static public void test(String testCase)
    {
        test(stringToCharList(testCase));
    }

    static public void main(String[] args)
    {
        test("ABCABCABCABC");
        test("ABCDFTBAT");
        test("ABABA");
        test("ABACABADABACABAEABACABADABACABAEABACABADABAC");
        test("ABCABCABCDEFABCABCABCDEFABCABCABCDEF");
        test("ABABCABABCABABDABABDABABC");      
    }
}

This is a stream-oriented approach (with O(N) execution time and O(N) worst-case space requirements); if the List<T> to be processed already exists in memory, it should be possible to rewrite this class to process the List<T> without any additional space requirements, just keeping track of the repeat count and partial match count, using List.subList() to create a unit sequence that is a view of the first K elements of the input list.

回答4:

My solution, which works as desired, is perhaps naive. It does have the advantage of being simple.

String[]                            wta;                                    // word text array
...
INTERVAL:
for(int xa=1,max=(wta.length/2); xa<=max; xa++) {
    if((wta.length%xa)!=0) { continue; }                                    // ignore intervals which don't divide evenly into the words
    for(int xb=0; xb<xa; xb++) {                                            // iterate the words within the current interval
        for(int xc=xb+xa; xc<wta.length; xc+=xa) {                          // iterate the corresponding words in each section
            if(!wta[xb].equalsIgnoreCase(wta[xc])) { continue INTERVAL; }   // not a cycle
            }
        }
    ivl=xa;
    break;
    }