In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data are stored in a text file. Loading into Excel is not possible
2) I need to sum a certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally scipy and Rpy can handle the large files when even when the files are so large that they cannot be fitted into memory
As @gsk3 noted,
bigmemory
is a great package for this, along with the packagesbiganalytics
andbigtabulate
(there are more, but these are worth checking out). There's alsoff
, though that isn't as easy to use.Common to both R and Python is support for HDF5 (see the
ncdf4
orNetCDF4
packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily usebigmemory
, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.Since this has the R tag I'll give some R solutions:
bigmemory
package http://www.cybaea.net/Blogs/Data/Big-data-for-R.htmlHow huge is your data, is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load text data into a numpy array. for example:
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
Neither Rpy or Scipy is necessary, although numpy may make it a bit easier. This problem seems ideally suited to a line-by-line parser. Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
Python's File I/O doesn't have bad performance, so you can just use the
file
module directly. You can see what functions are available in it by typinghelp (file)
in the interactive interpreter. Creating a file is part of the core language functionality and doesn't require you toimport file
.Something like:
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing,
re
for example (typehelp(re)
into the interactive interpreter).