As part of my work, I am working with very large text files and, in part, analyzing them for word and phrase frequency. I am running into difficulties with computing time, memory restrictions, and extracting relevant information.
For this program, I am taking a large text file (say 50 MB) which has already been cleaned up and turned into lower case, but otherwise it is just unstructured text. I am trying to generate lists of 'bigrams,' 'trigrams,' 'quadgrams,' and 'fivegrams' - respectively, combinations of recurring two-, three-, four-, and five-word phrases (e.g. "i am" is a bigram, "i am free" is a trigram, "i am free always" is a quadgram).
What am I currently doing?
Here is my current code, where inputlower is a list of all-lowercase words (scraped web data w/ Mathematica).
inputlower = Import["/directory/allTextLowered.txt"];

bigrams = Sort[Tally[Partition[inputlower, 2, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/bigrams.txt", bigrams];
Clear[bigrams];

trigrams = Sort[Tally[Partition[inputlower, 3, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/trigrams.txt", trigrams];
Clear[trigrams];

quadgrams = Sort[Tally[Partition[inputlower, 4, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/quadgrams.txt", quadgrams];
Clear[quadgrams];

fivegrams = Sort[Tally[Partition[inputlower, 5, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/fivegrams.txt", fivegrams];
In a way, it works well: I do get information generated, and on smaller scales I find that this code works fast enough that I can have something approximating a workable Manipulate[]
program. But when we're dealing with big inputs...
What is wrong with it when I use big files?
Most importantly, my output files are too big to be usable. Is there a way to specify a breaking point in the code: for example, I do not want any 'bigrams' that only appear once? If that proves to still leave too much information, would there be a way to specify that I do not want any 'bigrams' in the file unless they appear more than say 10 times? i.e. if "my cheese" appears 20 times, I want to know about it, but if "i pad" only appears once, maybe losing it would make the file more manageable?
Secondly, these processes take a long time: it took two to three hours to generate the bigram output alone. Am I approaching this problem in an efficient manner?
Thirdly, if I did have a large bigram file (~650MB+) that contained ALL of the information, is there a way for Mathematica to access information without loading it all into memory - i.e. taking a file named bigrams.txt, learning that it contains {{"i","am"},55}
without bogging the system down?
Edit
[As of 7 Dec 11, I've removed the example file that I've put up - thanks to all again]
These are my suggestions:
- I suggest using ReadList[file, Word]. Usually it is much faster than Import. This will also break the text into words.
- You can consider working with gzip-compressed files too. Import/Export support these seamlessly, but ReadList does not. For disk-limited operations this will actually be faster than reading/writing uncompressed data.
- Your Sorts may be slow (I haven't tested your operations with large files, so I'm not sure). See yesterday's question on how to do this fast.
- You can't break out of Tally before it finishes, but you can always use Select, Cases or DeleteCases to trim the bigram list before exporting.
- Finally, as an answer to your last question: I am afraid Mathematica is only efficient/convenient to work with if you load all data into memory. The system seems to be conceived to work well with in-memory data only. This is from personal experience.
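A minimal sketch combining these suggestions - ReadList to get words, Tally to count, and Select to drop rare bigrams before exporting. The file paths are placeholders, and the threshold of 10 matches the example in the question:

```mathematica
(* Read the file as a list of words; much faster than Import. *)
words = ReadList["/directory/allTextLowered.txt", Word];

(* Count bigrams, then keep only those appearing at least 10 times. *)
bigrams = Tally[Partition[words, 2, 1]];
frequent = Select[bigrams, #[[2]] >= 10 &];

(* Sort by descending count and export the trimmed list. *)
Export["/directory/bigrams.txt", SortBy[frequent, -#[[2]] &]];
```

This keeps the output file small, since singleton phrases (usually the bulk of the tally) never reach the disk.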
EDIT: Working with your 50 MB text file is slow, but still bearable on my (rather old and slow) machine. Just make sure you use SortBy. I couldn't get ReadList to handle UTF-8 properly, so you may need to stick to Import there.

Introduction
What I will propose is different from most suggestions given before, and based on a combination of indexing, hash tables, packed arrays, Compress, .mx files and DumpSave, and a few other things. The basic idea is to preprocess the file in a smart way, and save the preprocessed definitions in an .mx file for fast loading. Rather than base most of the work on reading from disk, I propose to shift the emphasis and do most of the work in memory, but find ways to load data from disk, store it in RAM, work with the data, and save it back to disk in a time- and memory-efficient manner. In trying to achieve this goal, I will use most of the efficient Mathematica constructions I am aware of, both for in-memory work and for interaction with the file system.

Code
Here is the code:
The above functionality is very memory-hungry: for processing of @Ian's file it took at some point nearly 5 GB of RAM. However, this is worth it, and one can also test the above with smaller files if there is not enough RAM. Generally, large files could be split into several parts to work around this problem.
Preprocessing and saving in optimized format
We now start. Preprocessing takes about a minute on my machine:
The symbols inputlower, wordIndexRules and ngrams here are some symbols I chose to use for the list of words in the file and for the hash tables. Here are some additional inputs illustrating how these symbols are used and what they mean.

The main idea here is that we use integer indices instead of words (strings), which allows us to utilize packed arrays for n-grams.
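Since the answer's full code is not reproduced here, a toy sketch of this indexing idea - words mapped to integer indices so that n-gram lists become packed integer arrays (uniqueWords, indexed and ngrams2 are hypothetical names):

```mathematica
(* Toy input standing in for the real word list. *)
words = {"i", "am", "free", "i", "am"};

(* Map each distinct word to an integer index. *)
uniqueWords = DeleteDuplicates[words];
wordIndexRules = Dispatch[Thread[uniqueWords -> Range[Length[uniqueWords]]]];

(* Replace words by indices and pack; 2-grams are now a packed array. *)
indexed = Developer`ToPackedArray[words /. wordIndexRules];
ngrams2 = Developer`ToPackedArray[Partition[indexed, 2, 1]];
```

Packed integer arrays use a fraction of the memory of string lists and make subsequent vectorized operations fast.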
Compressing and saving takes another half a minute:
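A simplified sketch of this save step (the answer's actual code uses self-uncompressing definitions; the path and the name inputlowerC are placeholders):

```mathematica
(* Compress the big word list, then dump all definitions to an .mx
   file for fast re-loading in a later session. *)
inputlowerC = Compress[inputlower];
DumpSave["/directory/ngrams.mx", {inputlowerC, wordIndexRules, ngrams}];
```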
The resulting .mx file is about 63 MB, which is about the size of the original file. Note that since part of what we save is a (self-compressed) variable inputlower, which contains all the input words in the original order, we are not losing any information as compared to the original file. In principle, one can from now on start working with the new .mx file only.

Working with the optimized file
We now start a new session by quitting the kernel. It takes almost no time to load the file (.mx format is extremely efficient):
Loading the list of words takes some time (self-uncompressing):
but we don't use it for anything - it is stored just in case. Loading the 2-grams and their frequencies:
Note that most of the time here was spent on self-uncompressing, which is selective (so, e.g., ngrams["NGrams",3] is still compressed). Loading the 3-grams and their frequencies:

The timings are decent, given the size of the lists. Note that neither DumpSave-Get nor Compress-Uncompress unpacks packed arrays, so our data is stored in Mathematica's memory pretty efficiently.

Here we uncompress the rules relating indices to words:
This is enough to start working with the data, but in the next section I outline some hints on how to make this work yet more efficient.
Efficient work with uncompressed data
If we try to find, for example, all positions of 2-grams with frequency 1, the naive way is this:
However, we can leverage the fact that we work with integer indices (instead of words) stored in a packed array. Here is one version of the custom position function (due to Norbert Pozar):
Using this, we get it 10 times faster (and one can use a function compiled to C, which is yet twice as fast):
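The exact function is not reproduced here, but the vectorized idea can be sketched with Pick and Unitize on a packed array, avoiding the pattern matcher entirely (fastPositions is a hypothetical name, not the answer's exact code):

```mathematica
(* Sketch: positions of elements equal to n in a packed integer
   array; x - n and Unitize are vectorized, so this stays fast. *)
fastPositions[x_List, n_Integer] :=
  Pick[Range[Length[x]], Unitize[x - n], 0];

fastPositions[{5, 1, 3, 1, 1}, 1]  (* -> {2, 4, 5} *)
```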
Here are some more convenience functions:
Using which, we can get many things pretty efficiently. For example, delete the 2-grams with frequency 1:
Or, 2-grams with frequencies less than 100 (this is a sub-optimal way to do it, but it is still pretty fast):
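A sketch of such frequency-threshold trimming, with hypothetical names freqs2 and ngrams2 standing in for the frequency and 2-gram arrays (toy data shown):

```mathematica
(* Toy data: parallel lists of 2-grams (as index pairs) and counts. *)
freqs2  = {120, 3, 99, 500};
ngrams2 = {{1, 2}, {2, 3}, {3, 4}, {4, 1}};

(* UnitStep gives 1 where freq >= 100; Pick keeps those entries. *)
keep = UnitStep[freqs2 - 100];
{Pick[ngrams2, keep, 1], Pick[freqs2, keep, 1]}
(* -> {{{1, 2}, {4, 1}}, {120, 500}} *)
```

Both UnitStep and Pick operate on packed arrays without unpacking, which is what keeps this fast on multi-million-element lists.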
The main idea is that integer indices play the role of "pointers" to words, and most things can be done with them. When needed, we can return to the normal words:
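Returning from indices to words can be sketched with reverse rules (uniqueWords and indexToWord are hypothetical names):

```mathematica
(* Reverse the word->index mapping and apply it to an index list. *)
uniqueWords = {"i", "am", "free"};
indexToWord = Dispatch[Thread[Range[Length[uniqueWords]] -> uniqueWords]];
{1, 2, 3} /. indexToWord  (* -> {"i", "am", "free"} *)
```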
Concluding remarks
The speed-up achieved here seems substantial. One can control how much RAM is occupied by the data by loading it in fine-grained chunks. The memory usage itself has been hugely optimized by utilizing packed arrays. The memory savings on disk are due to the combination of Compress and DumpSave. Hash tables, Dispatch-ed rules and self-uncompressing are techniques used to make it more convenient.

There is plenty of room here for further refinements. One can split the data into smaller chunks and compress/save those separately, to avoid high memory usage in intermediate steps. One can also split the data according to frequency ranges, and save it into separate files, to speed up the load/self-uncompress stage. For many files, one would need to generalize this, since global symbols for hashes were used here. This seems a good place to apply some OOP techniques. Generally, this is just a starting point, but my message is that this approach IMO has good potential for efficient work with such files.
To extend a comment I made, Read is a useful alternative to ReadList or Import. Like ReadList, you specify a type, and if you specify String it reads in the entire line. So, you can process the file one (or more) lines at a time. The only difficulty is that you have to watch for EndOfFile by hand.

To extend this to multiple lines at once, replace String above with a list, as long as the number of lines you wish to process at once, containing only String. This is best accomplished using ConstantArray[String, n] for more than a few lines. Of course, Word can be used instead, to process the file on a word-by-word basis.

Processing files line by line has a downside: if you need to Abort the process, strm will be left open. So, I suggest wrapping the code in CheckAbort, or using the functionality described here.

You might look at "String Patterns", which are Mathematica's version of regular expressions. Perhaps something like StringCases[data, RegularExpression["\\w+?\\W+?\\w+?"]], which should return all the matching sequences of word-whitespace-word. I can't say whether this will be faster than your partition code or not.

There's a "Tips for Efficient Matching" section at the bottom of that page.
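A hedged sketch of the line-by-line Read loop described earlier (the file name is a placeholder):

```mathematica
(* Open a stream and read one line per iteration until EndOfFile,
   which Read returns when the file is exhausted. *)
strm = OpenRead["/directory/allTextLowered.txt"];
While[(line = Read[strm, String]) =!= EndOfFile,
  (* process one line here, e.g. tally its bigrams *)
  ];
Close[strm];
```

Note that an Abort inside the loop would leave strm open, which is exactly the CheckAbort caveat mentioned above.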
You could apply "DeleteDuplicates" to trim the list before sorting.
If I were doing this in another language, I would store the n-grams in a hash table, with the text as the key and the instance count as the value. This would work well with a line-by-line file parser. However, it seems using a hash table in Mathematica is not straightforward.

And one observation: instead of running 4 passes, you could just generate all the 5-grams to a single file, and use simple command-line text processing to generate the 2-, 3- and 4-grams from that file. Of course, this is only useful after you get the 5-gram extractor to run in a reasonable time.
These slides are the current best wisdom on how to deal with importing and working with large sets of data:
http://library.wolfram.com/infocenter/Conferences/8025/
It covers some of the subjects mentioned here and gives some graphs which will show you how much of a speed up you can expect from switching away from Import.