Python: handling a large set of data. Scipy or Rpy

2020-05-24 05:25发布

问题:

In my python environment, the Rpy and Scipy packages are already installed.

The problem I want to tackle is such:

1) A huge set of financial data are stored in a text file. Loading into Excel is not possible

2) I need to sum a certain fields and get the totals.

3) I need to show the top 10 rows based on the totals.

Which package (Scipy or Rpy) is best suited for this task?

If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?

Speed is a concern. Ideally scipy and Rpy can handle the large files when even when the files are so large that they cannot be fitted into memory

回答1:

As @gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.

Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.



回答2:

Neither Rpy or Scipy is necessary, although numpy may make it a bit easier. This problem seems ideally suited to a line-by-line parser. Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.



回答3:

Python's File I/O doesn't have bad performance, so you can just use the file module directly. You can see what functions are available in it by typing help (file) in the interactive interpreter. Creating a file is part of the core language functionality and doesn't require you to import file.

Something like:

f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
    #line is a string type
    #do whatever you want to do on a per-line basis here, for example:
    print len(line)

Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.

I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).



回答4:

How huge is your data, is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load text data into a numpy array. for example:

import numpy as np
with file("data.csv", "rb") as f:
   title = f.readline()  # if your data have a title line.
   data = np.loadtxt(f, delimiter=",") # if your data splitted by ","
   print np.sum(data, axis=0)  # sum along 0 axis to get the sum of every column


回答5:

I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.

As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.

I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.



回答6:

Since this has the R tag I'll give some R solutions:

  • Overview http://www.r-bloggers.com/r-references-for-handling-big-data/
  • bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
  • XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
  • Hadoop interfaces to R (RHIPE, etc.)