Text packing algorithm

Posted 2019-01-19 02:49

I bet somebody has solved this before, but my searches have come up empty.

I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.

Example: doll dollhouse house

These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.

What I've come up with so far is:

  1. Sort the words longest to shortest: (dollhouse, house, doll)
  2. Scan the buffer to see if the string already exists as a substring; if so, note the location.
  3. If it doesn't already exist, add it to the end of the buffer.

Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
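The three steps above can be sketched as follows. This is only an illustration of the approach as described; the function name `pack_naive` and the `(start, length)` index format are my own assumptions, not something from an existing library.

```python
def pack_naive(words):
    """Pack words into one buffer using the longest-first heuristic.

    Returns (buffer, index), where index maps each word to (start, length).
    The name and return shape are illustrative assumptions.
    """
    buffer = ""
    index = {}
    # Step 1: longest first, so shorter words can be found inside longer ones.
    for word in sorted(set(words), key=len, reverse=True):
        # Step 2: reuse an existing occurrence if the word is already present.
        pos = buffer.find(word)
        if pos == -1:
            # Step 3: otherwise append the word to the end of the buffer.
            pos = len(buffer)
            buffer += word
        index[word] = (pos, len(word))
    return buffer, index


if __name__ == "__main__":
    buffer, index = pack_naive(["doll", "dollhouse", "house", "ragdoll"])
    print(buffer, index)
```

With ragdoll in the list, this yields the 16-character dollhouseragdoll described above, because words are only ever appended whole and never overlapped at their ends.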

This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.

As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm

8 answers
劳资没心,怎么记你
#2 · 2019-01-19 02:58

I think you can use a radix tree. It costs some memory because of the pointers to leaves and parents, but matching strings is easy: O(k), where k is the length of the longest string.

Deceive 欺骗
#3 · 2019-01-19 03:07

It's not clear what you want to do.

Do you want a data structure that stores the strings in a memory-conscious way while still allowing operations like search in a reasonable amount of time?

Do you just want an array of words, compressed?

In the first case, you can go for a Patricia trie or a String B-Tree.

For the second case, you can adopt an index compression technique such as the following.

If you have something like:

aaa 
aaab
aasd
abaco
abad

You can compress it like this:

0aaa
3b
2sd
1baco
3d

The number is the length of the longest common prefix with the preceding string. You can tweak that scheme, for example by forcing a "restart" of the common prefix (storing a full word) every K words, so that reconstruction is fast.
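A minimal sketch of this prefix scheme (often called front coding) might look like the following. It assumes the input is already sorted, and the names `front_encode` and `front_decode` are hypothetical. The periodic "restart" is left out to keep the sketch short.

```python
def front_encode(sorted_words):
    """Encode each word as (shared_prefix_length, remaining_suffix).

    Assumes sorted_words is already in lexicographic order.
    """
    encoded = []
    prev = ""
    for word in sorted_words:
        # Count how many leading characters are shared with the previous word.
        k = 0
        limit = min(len(prev), len(word))
        while k < limit and prev[k] == word[k]:
            k += 1
        encoded.append((k, word[k:]))
        prev = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list from (shared_prefix_length, suffix) pairs."""
    words = []
    prev = ""
    for k, suffix in encoded:
        word = prev[:k] + suffix
        words.append(word)
        prev = word
    return words


if __name__ == "__main__":
    words = ["aaa", "aaab", "aasd", "abaco", "abad"]
    packed = front_encode(words)
    print(packed)
    print(front_decode(packed) == words)
```

Decoding has to walk the list from the start, since each entry depends on the one before it. That is the cost the "restart" every K words is meant to bound.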

何必那么认真
#4 · 2019-01-19 03:11

This looks similar to the knapsack problem, which is NP-complete, so there is no known efficient algorithm that always gives the optimal answer.

一纸荒年 Trace。
#5 · 2019-01-19 03:14

This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
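The two steps (drop contained strings, then merge greedily by longest overlap) can be sketched like this. It is a brute-force illustration under my own assumptions: the helper names `overlap` and `greedy_superstring` are hypothetical, and overlaps are found by direct comparison rather than with the suffix tree or radix trees mentioned above.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(words):
    """Greedy heuristic for the shortest superstring problem.

    Returns (buffer, index), where index maps each word to (start, length).
    """
    unique = sorted(set(words))
    # Step 1: drop words that are substrings of some other word.
    parts = [w for w in unique if not any(w != o and w in o for o in unique)]
    # Step 2: repeatedly merge the ordered pair with the longest overlap.
    while len(parts) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(parts):
            for j, b in enumerate(parts):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = parts[best_i] + parts[best_j][best_k:]
        parts = [p for n, p in enumerate(parts) if n not in (best_i, best_j)]
        parts.append(merged)
    buffer = parts[0] if parts else ""
    # Every input word is a substring of the buffer, so find() always succeeds.
    index = {w: (buffer.find(w), len(w)) for w in unique}
    return buffer, index


if __name__ == "__main__":
    print(greedy_superstring(["doll", "dollhouse", "house", "ragdoll"]))
```

On the four example words this produces ragdollhouse. As written, every merge rescans all pairs, so the cost is roughly cubic in the number of words. For tens of thousands of words the overlap lookup would need a faster structure, such as the trees suggested above.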

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

淡お忘
#6 · 2019-01-19 03:19

Refine step 3.

  • Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some minimum length, longer than 1, for example.)
  • If yes, prepend the non-overlapping prefix of the current word to the existing word, and adjust all existing references accordingly (slow!).
  • If no, add the word to the end of the list as in the current step 3.

This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).

男人必须洒脱
#7 · 2019-01-19 03:20

I did a lab back in college where we were tasked with implementing a simple compression program.

What we did was sequentially apply these techniques to text:

  • BWT (Burrows-Wheeler transform): reorders the letters so that identical letters tend to end up in runs (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations).
  • MTF (move-to-front transform): rewrites the sequence of letters as a sequence of indices into a dynamic list.
  • Huffman encoding: a form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes to infrequently encountered symbols.

Here, I found the assignment page.

To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
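To make the first two stages concrete, here is a small sketch of BWT and MTF with their inverses. It is a naive version that does build and sort the rotations, not the shortcut hinted at above. The function names and the use of "\0" as an end marker are my own assumptions, and the Huffman stage is left out.

```python
def bwt(text, end="\0"):
    """Naive Burrows-Wheeler transform; assumes `end` does not occur in text."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last, end="\0"):
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


def mtf_encode(text, alphabet):
    """Replace each symbol by its current list position, then move it to the front."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        i = symbols.index(ch)
        out.append(i)
        symbols.insert(0, symbols.pop(i))
    return out


def mtf_decode(indices, alphabet):
    """Inverse of mtf_encode, starting from the same initial alphabet order."""
    symbols = list(alphabet)
    out = []
    for i in indices:
        ch = symbols.pop(i)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("banana")
    alphabet = sorted(set(transformed))
    codes = mtf_encode(transformed, alphabet)
    print(codes)
    print(inverse_bwt(mtf_decode(codes, alphabet)))
```

After BWT, runs of identical letters turn into runs of zeros under MTF, which is what lets the Huffman stage assign very short codes to the most common symbols.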
