Say I have the following variables and their corresponding values, which together represent a record:

```python
name = 'abc'
age = 23
weight = 60
height = 174
```
Please note that the values could be of different types (string, integer, float, a reference to any other object, etc.).
There will be many records (at least 100,000), and every record will be unique when all four of these variables (actually, their values) are taken together. In other words, no two records share all four values.
I am trying to find an efficient data structure in Python that will allow me to (store and) retrieve records based on any one of these variables in log(n) time complexity.
For example:

```python
def retrieve(name=None, age=None, weight=None, height=None):
    if name is not None and age is None and weight is None and height is None:
        pass  # get all records with the given name
    elif name is None and age is not None and weight is None and height is None:
        pass  # get all records with the given age
    # ... and so on for the other fields
    return records
```
The `retrieve` function should be called as follows:

```python
retrieve(name='abc')
```

The above should return:

```python
[{'name': 'abc', 'age': 23, 'weight': 50, 'height': 175},
 {'name': 'abc', 'age': 28, 'weight': 55, 'height': 170}, ...]
```

```python
retrieve(age=23)
```

The above should return:

```python
[{'name': 'abc', 'age': 23, 'weight': 50, 'height': 175},
 {'name': 'def', 'age': 23, 'weight': 65, 'height': 180}, ...]
```
And I may need to add one or two more variables to the record in the future, for example `sex = 'm'`. So the `retrieve` function must be scalable.
So, in short: is there a data structure in Python that will allow storing a record with n columns (name, age, sex, weight, height, etc.) and retrieving records based on any one of those columns in logarithmic (or, ideally, constant O(1)) look-up time?
No, there is none. But you could implement one yourself on the basis of one dictionary per value dimension - as long as your values are hashable, of course. If you implement a custom class for your records, each dictionary will contain references to the same record objects, which will save you some memory.
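For instance, here is a minimal sketch of that idea (the `Record` class and the two index dictionaries are illustrative):

```python
from collections import defaultdict

class Record(object):
    # __slots__ keeps the per-instance memory footprint small.
    __slots__ = ('name', 'age', 'weight', 'height')
    def __init__(self, name, age, weight, height):
        self.name, self.age = name, age
        self.weight, self.height = weight, height

# One dictionary per value dimension; each maps a (hashable) value
# to the same shared Record objects, so nothing is stored twice.
by_name = defaultdict(list)
by_age = defaultdict(list)

def add(record):
    by_name[record.name].append(record)
    by_age[record.age].append(record)

add(Record('abc', 23, 60, 174))
matches = by_age[23]  # O(1) average-time look-up
```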
Given http://wiki.python.org/moin/TimeComplexity, how about this:

- Have one dictionary per column of interest - `AGE`, `NAME`, etc.
- Have the keys of those dictionaries (`AGE`, `NAME`) be the possible values for the given column (35 or "m").
- Have a list of rows holding the actual values: `VALUES = [[35, "m"], ...]`.
- Have the values of the column dictionaries (`AGE`, `NAME`) be lists of indices into the `VALUES` list.
- Keep a fixed column order inside `VALUES`, so that you know the first column is age and the second is sex (you could avoid that and use dictionaries instead, but they introduce a large memory footprint, and with over 100K objects this may or may not be a problem).

Then the `retrieve` function could look like this:
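Here is a minimal sketch of that layout (the sample data, and the use of `SEX` rather than `NAME` for the second column to match the `[35, "m"]` rows, are illustrative choices):

```python
# Rows of raw values; by convention column 0 is age, column 1 is sex.
VALUES = [
    [35, "m"],
    [20, "f"],
    [35, "f"],
]

# One dictionary per column, mapping each value to the list of
# row indices in VALUES that contain it.
AGE = {35: [0, 2], 20: [1]}
SEX = {"m": [0], "f": [1, 2]}

COLUMNS = {"age": AGE, "sex": SEX}

def retrieve(column, value):
    # O(1) average dictionary look-up, then collect the matching rows.
    return [VALUES[i] for i in COLUMNS[column].get(value, [])]

print(retrieve("age", 35))
```

Then, this is what you get:

```python
[[35, 'm'], [35, 'f']]
```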
If you want a dictionary, you can do the following:
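A sketch of that variant, reusing the structures above plus an assumed fixed field order:

```python
FIELDS = ["age", "sex"]  # the fixed column order inside VALUES

def retrieve_dicts(column, value):
    # Same O(1) look-up, but pair each row with the field names.
    return [dict(zip(FIELDS, VALUES[i]))
            for i in COLUMNS[column].get(value, [])]

print(retrieve_dicts("age", 35))
# [{'age': 35, 'sex': 'm'}, {'age': 35, 'sex': 'f'}]
```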
but again, dictionaries are a little heavy on the memory side, so if you can go with lists of values it might be better.
Both dictionary and list retrieval are O(1) on average - the worst case for dictionaries is O(n) - so this should be pretty fast. Maintaining it will be a little bit of a pain, but not so much. To "write", you'd just have to append to the `VALUES` list and then append the new row's index in `VALUES` to each of the dictionaries (a sketch follows below).

Of course, the best thing would be to benchmark your actual implementation and look for potential improvements, but hopefully this makes sense and will get you going :)
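Such a "write" could look like the following sketch (`add_record` is a hypothetical helper built on the illustrative structures above):

```python
def add_record(row):
    # Append the new row, then register its index in each per-column dict.
    VALUES.append(row)
    index = len(VALUES) - 1
    for field, value in zip(FIELDS, row):
        COLUMNS[field].setdefault(value, []).append(index)

add_record([20, "m"])  # now retrievable through both AGE and SEX
```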
EDIT: Please note that, as @moooeeeep said, this will only work if your values are hashable and can therefore be used as dictionary keys.
You could achieve logarithmic time complexity in a relational database by using indexes (`O(log(n)**k)` with single-column indexes). Then, to retrieve data, just construct the appropriate SQL. Example:
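A minimal sketch using Python's built-in `sqlite3` module (the table layout and sample rows are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE records '
             '(name TEXT, age INTEGER, weight REAL, height REAL)')
# One single-column index per field we want to query on.
conn.execute('CREATE INDEX idx_name ON records (name)')
conn.execute('CREATE INDEX idx_age ON records (age)')
conn.executemany('INSERT INTO records VALUES (?, ?, ?, ?)',
                 [('abc', 23, 60.0, 174.0),
                  ('def', 23, 65.0, 180.0),
                  ('ghi', 30, 70.0, 176.0)])

def retrieve(**criteria):
    # Build the WHERE clause from whichever columns were supplied.
    # (Column names come from our own keyword args, not user input.)
    keys = sorted(criteria)
    where = ' AND '.join('%s = ?' % k for k in keys)
    sql = 'SELECT name, age, weight, height FROM records WHERE ' + where
    return conn.execute(sql, [criteria[k] for k in keys]).fetchall()

print(retrieve(age=23))
```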
Output:
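```python
[('abc', 23, 60.0, 174.0), ('def', 23, 65.0, 180.0)]
```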
There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it does have to achieve your goals and do so fairly efficiently.
For example, say your input was the following data in a comma-separated-value file called `employees.csv`, with the field names defined by its first line, along the lines of the hypothetical sample shown below. The working code that follows illustrates how to read and store this data in a list of records, and how to automatically create separate look-up tables for finding the records associated with the values contained in each of those records' fields.
The records are instances of a class created by `namedtuple`, which is very memory efficient because each one lacks the `__dict__` attribute that class instances normally contain. Using them makes it possible to access the fields of each record by name, using dot syntax like `record.fieldname`.

The look-up tables are `defaultdict(list)` instances, which provide dictionary-like O(1) look-up times on average and also allow multiple values to be associated with each key. So the look-up key is the value of the field being sought, and the data associated with it is a list of the integer indices of the `Person` records stored in the `employees` list with that value - so the tables will all be relatively small.

Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names; they are instead all taken from the first row of the csv input file when it's read in. Of course, when using an instance, all `retrieve()` method calls must provide valid field names.

Update
Modified so as not to create a look-up table for every unique value of every field when the data file is first read. Now the `retrieve()` method "lazily" creates them only when one is needed (and it saves/caches the result for future use). Also modified to work in Python 2.7+, including 3.x.