Efficiently Working with (and generating) Large Text Files

Posted 2019-01-16 10:15

As part of my work, I am dealing with very large text files and, in part, analyzing them for word and phrase frequency. I am running into difficulties with computing time, memory restrictions, and extracting relevant information.

For this program, I am taking a large text file (say 50MB) which has already been cleaned up, turned into lower case. But otherwise it is just unstructured text. I am trying to generate lists of 'bigrams,' 'trigrams,' 'quadgrams,' and 'fivegrams' - respectively, combinations of recurring two, three, four, and five word phrases (i.e. "i am" is a bigram, "i am free" is a trigram, "i am free always" is a quadgram).
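As a small illustration of how such phrases are generated (assuming the text has already been split into a list of words), Partition with an offset of 1 produces the overlapping n-grams:

```mathematica
(* a minimal sketch: overlapping n-grams from a word list via Partition *)
wordList = {"i", "am", "free", "always"};
Partition[wordList, 2, 1]  (* {{"i","am"},{"am","free"},{"free","always"}} *)
Partition[wordList, 3, 1]  (* {{"i","am","free"},{"am","free","always"}} *)
```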

What am I currently doing?

Here is my current code, where inputlower is an all lowercase String (scraped web data w/ Mathematica).

inputlower=Import["/directory/allTextLowered.txt"];
bigrams = 
  Sort[Tally[Partition[inputlower, 2, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/bigrams.txt", bigrams];    
Clear[bigrams];
trigrams = 
  Sort[Tally[Partition[inputlower, 3, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/trigrams.txt", trigrams];
Clear[trigrams];    
quadgrams = 
  Sort[Tally[Partition[inputlower, 4, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/quadgrams.txt", quadgrams];
Clear[quadgrams];
fivegrams = 
  Sort[Tally[Partition[inputlower, 5, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/fivegrams.txt", fivegrams];

In a way, it works well: I do get information generated, and on smaller scales I find that this code works fast enough that I can have something approximating a workable Manipulate[] program. But when we're dealing with big inputs...

What is wrong with it when I use big files?

Most importantly, my output files are too big to be usable. Is there a way to specify a breaking point in the code: for example, I do not want any 'bigrams' that only appear once? If that proves to still leave too much information, would there be a way to specify that I do not want any 'bigrams' in the file unless they appear more than say 10 times? i.e. if "my cheese" appears 20 times, I want to know about it, but if "i pad" only appears once, maybe losing it would make the file more manageable?
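For example, such a cutoff could be applied to the Tally output before exporting (a sketch; the threshold of 10 is illustrative):

```mathematica
(* keep only n-grams occurring more than 'threshold' times *)
threshold = 10;
frequentBigrams = Select[bigrams, #[[2]] > threshold &];
(* equivalently: Cases[bigrams, {_, c_} /; c > threshold] *)
Export["/directory/bigrams.txt", frequentBigrams];
```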

Secondly, these processes take a long time: it took well over two or three hours to generate the bigram output alone. Am I approaching this problem in an efficient manner?

Thirdly, if I did have a large bigram file (~650MB+) that contained ALL of the information, is there a way for Mathematica to access information without loading it all into memory - i.e. taking a file named bigrams.txt, learning that it contains {{"i","am"},55} without bogging the system down?

Edit

[As of 7 Dec 11, I've removed the example file that I've put up - thanks to all again]

5 Answers
骚年 ilove
Answer 2 · 2019-01-16 10:56

These are my suggestions:

  1. I suggest using ReadList[file, Word]. Usually it is much faster than Import. This will also break it into words.

  2. You can consider working with gzip-compressed files too. Import/Export support these seamlessly, but ReadList does not. For disk-limited operations this will actually be faster than reading/writing uncompressed data.
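A minimal sketch of the compressed round trip, assuming the .gz extension is used so that the format is inferred:

```mathematica
(* Export/Import handle gzip transparently when the file name ends in .gz *)
Export["/directory/bigrams.txt.gz", bigrams];
bigramsBack = Import["/directory/bigrams.txt.gz"];
```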

  3. Your Sorts may be slow (I haven't tested your operations with large files, so I'm not sure). See yesterday's question on how to do this fast.

You can't break out from Tally before it finishes, but you can always use Select, Cases or DeleteCases to trim the bigram list before exporting.

Finally, as an answer to your last question: I am afraid Mathematica is only efficient/convenient to work with if you load all the data into memory. The system seems conceived to work well with in-memory data only. This is from personal experience.


EDIT Working with your 50 MB text file is slow, but still bearable on my (rather old and slow) machine. Just make sure you use SortBy:

In[1]:= $HistoryLength = 0; (* save memory *)

In[2]:= Timing[
 data = ReadList["~/Downloads/lowered-text-50.txt", Word, 
    WordSeparators -> {" ", "\t", ".", ","}];]

Out[2]= {6.10038, Null}

In[3]:= Timing[counts = Tally@Partition[data, 2, 1];]

Out[3]= {87.3695, Null}

In[4]:= Timing[counts = SortBy[counts, Last];]

Out[4]= {28.7538, Null}

In[5]:= Timing[counts = DeleteCases[counts, {_, 1}];]

Out[5]= {3.11619, Null}

I couldn't get ReadList to handle UTF-8 properly, so you may need to stick to Import there.

Rolldiameter
Answer 3 · 2019-01-16 11:02

Introduction

What I will propose is different from most of the suggestions given before, and is based on a combination of indexing, hash tables, packed arrays, Compress, .mx files, DumpSave, and a few other things. The basic idea is to preprocess the file in a smart way and save the preprocessed definitions in an .mx file for fast loading. Rather than base most of the work on reading from disk, I propose to shift the emphasis and do most of the work in memory, while finding ways to load data from disk, store it in RAM, work with it, and save it back to disk in a time- and memory-efficient manner. In trying to achieve this goal, I will use most of the efficient Mathematica constructs I am aware of, both for in-memory work and for interaction with the file system.

Code

Here is the code:

Clear[words];
words[text_String] :=  ToLowerCase[StringCases[text, WordCharacter ..]];

(* Rules to replace words with integer indices, and back *)
Clear[makeWordIndexRules];
makeWordIndexRules[sym_Symbol, words : {__String}] :=
   With[{distinctWords = DeleteDuplicates[words]},
    sym["Direct"] = Dispatch[Thread[distinctWords -> Range[Length[distinctWords]]]];
    sym["Inverse"] = Dispatch[Thread[ Range[Length[distinctWords]] -> distinctWords]];
    sym["keys"] = distinctWords;
];

(* Make a symbol with DownValues / OwnValues self-uncompressing *)
ClearAll[defineCompressed];
SetAttributes[defineCompressed, HoldFirst];
defineCompressed[sym_Symbol, valueType_: DownValues] :=
  With[{newVals = 
     valueType[sym] /.
       Verbatim[RuleDelayed][
         hpt : Verbatim[HoldPattern][HoldPattern[pt_]], rhs_] :>
            With[{eval = Compress@rhs}, hpt :> (pt = Uncompress@ eval)]
        },
        ClearAll[sym];
        sym := (ClearAll[sym]; valueType[sym] = newVals; sym)
];


(* Get a list of indices corresponding to a full list of words in a text *)
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
    Developer`ToPackedArray[words /. sym["Direct"]];

(* Compute the combinations and their frequencies *)
Clear[getSortedNgramsAndFreqs];
getSortedNgramsAndFreqs[input_List, n_Integer] :=
   Reverse[#[[Ordering[#[[All, 2]]]]]] &@ Tally[Partition[input, n, 1]];


(* 
 ** Produce n-grams and store them in a hash-table. We split combinations from
 ** their frequencies, and assume indices for input, to utilize packed arrays 
 *)
Clear[produceIndexedNgrams];
produceIndexedNgrams[sym_Symbol, input_List, range : {__Integer}] :=
 Do[
   With[{ngramsAndFreqs = getSortedNgramsAndFreqs[input, i]},
     sym["NGrams", i] = Developer`ToPackedArray[ngramsAndFreqs[[All, 1]]];
     sym["Frequencies", i] =  Developer`ToPackedArray[ngramsAndFreqs[[All, 2]]]
   ],
   {i, range}];


(* Higher-level function to preprocess the text and populate the hash tables *)
ClearAll[preprocess];
SetAttributes[preprocess, HoldRest];
preprocess[text_String, inputWordList_Symbol, wordIndexRuleSym_Symbol,
    ngramsSym_Symbol, nrange_] /; MatchQ[nrange, {__Integer}] :=
  Module[{},
    Clear[inputWordList, wordIndexRuleSym, ngramsSym];
    inputWordList = words@text;
    makeWordIndexRules[wordIndexRuleSym, inputWordList];
    produceIndexedNgrams[ngramsSym,
    getWordsIndices[wordIndexRuleSym, inputWordList], nrange]
  ];

(* Higher-level function to make the definitions auto-uncompressing and save them *)
ClearAll[saveCompressed];
SetAttributes[saveCompressed, HoldRest];
saveCompressed[filename_String, inputWordList_Symbol, wordIndexRuleSym_Symbol, 
   ngramsSym_Symbol] :=
Module[{},
   defineCompressed /@ {wordIndexRuleSym, ngramsSym};
   defineCompressed[inputWordList, OwnValues];
   DumpSave[filename, {inputWordList, wordIndexRuleSym, ngramsSym}];
];

The above functionality is very memory-hungry: processing @Ian's file took nearly 5 GB of RAM at some point. However, this is worth it, and one can also test the above with smaller files if there is not enough RAM. Generally, large files can be split into several parts to work around this problem.

Preprocessing and saving in optimized format

We now start. Preprocessing takes about a minute on my machine:

test = Import["C:\\Temp\\lowered-text-50.txt", "Text"];

In[64]:= preprocess[test,inputlower,wordIndexRules,ngrams,{2,3}];//Timing
Out[64]= {55.895,Null}

The symbols inputlower, wordIndexRules, ngrams here are some symbols I chose to use for the list of words in a file and for hash-tables. Here are some additional inputs illustrating how these symbols are used and what they mean:

In[65]:= ByteCount[inputlower]
Out[65]= 459617456

In[69]:= inputlower[[1000;;1010]]
Out[69]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[67]:= toNumbers = inputlower[[1000;;1010]]/.wordIndexRules["Direct"]
Out[67]= {58,220,28,58,392,393,25,1,216,25,1}

In[68]:= toWords =toNumbers/. wordIndexRules["Inverse"]
Out[68]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[70]:= {ngrams["NGrams",2],ngrams["Frequencies",2]}//Short
Out[70]//Short= {{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}

The main idea here is that we use integer indices instead of words (strings), which allows us to utilize packed arrays for n-grams.

Compressing and saving takes another half a minute:

In[71]:= saveCompressed["C:\\Temp\\largeTextInfo.mx", inputlower, 
      wordIndexRules, ngrams] // Timing
Out[71]= {30.405, Null}

The resulting .mx file is about 63MB, which is about the size of the original file. Note that since a part of what we save is a (self-compressed) variable inputlower, which contains all the input words in the original order, we are not losing any information as compared to the original file. In principle, one can from now on start working with the new .mx file only.

Working with the optimized file

We now start a new session by quitting the kernel. It takes almost no time to load the file (.mx format is extremely efficient):

In[1]:= Get["C:\\Temp\\largeTextInfo.mx"] // Timing
Out[1]= {0.016, Null}

Loading the list of words takes some time (self-uncompressing):

In[2]:= inputlower//Short//Timing
Out[2]= {6.52,{la,présente,collection,numérisée,<<8000557>>,quicktime,3,0}}

but we don't use it for anything - it is stored just in case. Loading the 2-grams and their frequencies:

In[3]:= Timing[Short[ngrams2 = {ngrams["NGrams",2],ngrams["Frequencies",2]}]]
Out[3]= {0.639,{{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}}

Note that most of the time here was spent on self-uncompressing, which is selective (so, e.g., ngrams["NGrams",3] is still compressed). Loading the 3-grams and their frequencies:

In[4]:= Timing[Short[ngrams3 = {ngrams["NGrams",3],ngrams["Frequencies",3]}]]
Out[4]= {1.357,{{{11333,793,11334},{793,11334,11356},<<4642628>>,{18,21,22},{20,18,21}},{<<1>>}}}

The timings are decent, given the size of the lists. Note that neither DumpSave/Get nor Compress/Uncompress unpacks packed arrays, so our data is stored in Mathematica's memory pretty efficiently:

In[5]:= Developer`PackedArrayQ/@ngrams3
Out[5]= {True,True}

Here we uncompress the rules relating indices to words:

In[6]:= Timing[Short[wordIndexRules["Inverse"]]]
Out[6]= {0.905,Dispatch[{1->la,2->présente,<<160350>>,160353->7631,160354->jomac},-<<14>>-]}

This is enough to start working with the data, but in the next section I outline some hints on how to make this work yet more efficient.

Efficient work with uncompressed data

If we try to find, for example, all positions of 3-grams with frequency 1, the naive way is this:

In[8]:= Position[ngrams["Frequencies",3],1,{1}]//Short//Timing
Out[8]= {1.404,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

However, we can leverage the fact that we work with integer indices (instead of words) stored in a packed array. Here is one version of the custom position function (due to Norbert Pozar):

extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]; 
positionExtr[x_List, n_] := 
    extractPositionFromSparseArray[SparseArray[Unitize[x - n], Automatic, 1]]

Using this, we get it 10 times faster (one can use a function compiled to C, which is yet twice as fast):

In[9]:= positionExtr[ngrams["Frequencies",3],1]//Short//Timing
Out[9]= {0.156,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

Here are some more convenience functions:

Clear[getNGramsWithFrequency];
getNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=
  Extract[ngramSym["NGrams", n], positionExtr[ngramSym["Frequencies", n], freq]];

Clear[deleteNGramsWithFrequency];
deleteNGramsWithFrequency[{ngrams_List, freqs_List}, freq_Integer] :=  
    Delete[#, positionExtr[freqs, freq]] & /@ {ngrams, freqs};

deleteNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=  
   deleteNGramsWithFrequency[{ngramSym["NGrams", n], ngramSym["Frequencies", n]}, freq];

Using these, we can get many things pretty efficiently. For example, delete the 2-grams with frequency 1:

In[15]:= deleteNGramsWithFrequency[ngrams,2,1]//Short//Timing
Out[15]= {0.218,{{{793,791},{25,1},{4704,791},<<696333>>,{29,66},{36,37},{18,21}},{<<1>>}}}

Or, remove 2-grams with frequencies of 100 or less (a sub-optimal way to do this, but it is still pretty fast):

In[17]:= (twogramsLarger100 = 
  Fold[deleteNGramsWithFrequency,deleteNGramsWithFrequency[ngrams,2,1],Range[2,100]])
  //Short//Timing
Out[17]= {0.344,{{{793,791},{25,1},{4704,791},{25,10},<<6909>>,
  {31,623},{402,13},{234,25}},{<<1>>}}}

The main idea is that integer indices play the role of "pointers" to words, and most things can be done with them. When needed, we can return to the normal words:

In[18]:= twogramsLarger100/.wordIndexRules["Inverse"]//Short//Timing
Out[18]= {0.063,{{{of,the},{de,la},<<6912>>,{société,du},{processus,de}},{<<1>>}}}

Concluding remarks

The speed-up achieved here seems substantial. One can control how much RAM is occupied by the data by loading it in fine-grained chunks. Memory usage itself has been hugely optimized by utilizing packed arrays. The savings on disk come from the combination of Compress and DumpSave. Hash tables, Dispatch-ed rules, and self-uncompressing are the techniques used to make this more convenient.

There is plenty of room for further refinements. One could split the data into smaller chunks and compress/save those separately, to avoid high memory usage in intermediate steps. One could also split the data according to frequency ranges and save those parts into separate files, to speed up the load/self-uncompress stage. For many files one would need to generalize this, since global symbols were used for the hashes here; this seems a good place to apply some OOP techniques. Generally, this is just a starting point, but my message is that this approach IMO has good potential for efficient work with such files.

ら.Afraid
Answer 4 · 2019-01-16 11:05

To extend a comment I made, Read is a useful alternative to ReadList or Import. Like ReadList, you specify a type, and if you specify String it reads in the entire line. So, you can process the entire file one (or more) lines at a time. The only difficulty is that you have to watch for EndOfFile by hand. For instance,

strm = OpenRead[file];
While[ (line = Read[strm, String]) =!= EndOfFile,
 (* Do something with the line *)
];
Close[ strm ];

To extend this to multiple lines at once, replace String above with a list containing String repeated once for each line you wish to process at a time; for more than a few lines, this is best built with ConstantArray[String, n]. Of course, Word can be used instead to process the file on a word-by-word basis.
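An equivalent chunked loop can also be written with ReadList[strm, String, n], which reads at most n lines from the stream and returns {} at end of file. A sketch (processLines is a hypothetical handler for a chunk of lines):

```mathematica
(* read up to n lines per iteration from an open stream *)
n = 100;
strm = OpenRead[file];
While[(lines = ReadList[strm, String, n]) =!= {},
  processLines[lines]  (* the final chunk may be shorter than n *)
];
Close[strm];
```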

Processing files line by line has a downside: if you need to Abort the process, strm will be left open. So, I suggest wrapping the code in CheckAbort, or using the functionality described here.

地球回转人心会变
Answer 5 · 2019-01-16 11:12

You might look at "String Patterns", which are Mathematica's version of regular expressions. Perhaps something like StringCases[data, RegularExpression["\\w+\\W+\\w+"]], which should return all the matching sequences of word-whitespace-word. I can't say whether this will be faster than your Partition code or not.
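For instance, on a short string, using greedy quantifiers (non-greedy ones such as \\w+? would match only single characters here). Note that StringCases matches are non-overlapping by default, so the word shared between consecutive bigrams is consumed; the Overlaps option can change this:

```mathematica
(* word-whitespace-word sequences; matches don't overlap by default *)
StringCases["i am free always", RegularExpression["\\w+\\W+\\w+"]]
(* {"i am", "free always"} *)
```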

There's a "Tips For Efficient Matching" at the bottom of that page.

You could apply "DeleteDuplicates" to trim the list before sorting.

If I were doing this in another language, I would store the n-grams in a hash table, with the text as the key and the instance count as the value. This would work well with a line-by-line file parser. However, it seems using a hash table in Mathematica is not straightforward.
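That said, a symbol's DownValues give a serviceable hash table. A sketch (addNGram is an illustrative name) that counts phrases as keys:

```mathematica
(* hash table via DownValues: phrase -> count, defaulting to 0 for unseen keys *)
ClearAll[count, addNGram];
count[_] = 0;  (* the general rule fires only when no specific key matches *)
addNGram[gram_] := (count[gram] = count[gram] + 1);

addNGram["my cheese"]; addNGram["my cheese"]; addNGram["i pad"];
count["my cheese"]   (* 2 *)
count["never seen"]  (* 0 *)
```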

And one observation: instead of running 4 passes, you could just generate all the 5-grams into a single file, and use simple command-line text processing to generate the 2-, 3-, and 4-grams from that file. Of course, this is only useful after you get the 5-gram extractor to run in a reasonable time.

【Aperson】
Answer 6 · 2019-01-16 11:20

These slides are the current best wisdom on how to deal with importing and working with large data sets:

http://library.wolfram.com/infocenter/Conferences/8025/

It covers some of the subjects mentioned here and gives graphs that show how much of a speedup you can expect from switching away from Import.
