I am not sure what is happening under the hood, with regard to the Python object model, for the code below.
You can download the data for the ctabus.csv file from this link.
import csv

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides': rides})
    return records
# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461
# record route ids (object ids)
route_ids = set()
for row in rows:
route_ids.add(id(row['route']))
print(len(route_ids)) #690072
# unique_routes
unique_routes = set()
for row in rows:
unique_routes.add(row['route'])
print(len(unique_routes)) #185
When I call print(len(route_ids)), it prints 690072. Why did Python end up creating this many objects?
I expected this count to be either 185 or 736461: 185 because that is the number of unique routes (the length of unique_routes), and 736461 because that is the total number of records in the CSV file.
What is this weird number 690072?
I am trying to understand why the caching is only partial. Why can't Python perform full caching, something like below?
import csv

route_cache = {}

# some hack to cache
def cached_route(routename):
    if routename not in route_cache:
        route_cache[routename] = routename
    return route_cache[routename]

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            row[0] = cached_route(row[0])  # cache trick
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides': rides})
    return records
# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461
# unique_routes
unique_routes = set()
for row in rows:
unique_routes.add(row['route'])
print(len(unique_routes)) #185
# record route ids (object ids)
route_ids = set()
for row in rows:
route_ids.add(id(row['route']))
print(len(route_ids)) #185
A typical record from the file, once read_as_dicts has converted it, looks like the following:

{'route': '3', 'date': '01/01/2001', 'daytype': 'W', 'rides': 7354}

That means most of your immutable objects are strings and only the 'rides'-value is an integer.

For small integers (-5...255), Python3 keeps an integer pool - so these small integers feel like being cached (as long as PyLong_FromLong and Co. are used).

The rules are more complicated for strings - they are, as pointed out by @timgeb, interned. There is a great article about interning, even if it is about Python2.7 - but not much has changed since then. In a nutshell, the most important rules are:

- strings of length 0 and 1 are interned.
- longer strings are interned if they are created at compile time and consist only of characters allowed in identifiers (letters, digits and underscore).

All of the above are implementation details, but taking them into account we get the following for the record above:

- 'route', 'date', 'daytype', 'rides' are all interned because they are created at compile time of the function read_as_dicts and don't have "strange" characters.
- '3' and 'W' are interned because their length is only 1.
- '01/01/2001' isn't interned because it is longer than 1, is created at runtime and wouldn't qualify anyway because it has the character / in it.
- 7354 isn't from the small integer pool, because it is too large. But other entries might be from this pool.

This was an explanation for the current behavior, with only some objects being "cached".
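A quick way to see these rules in action (a minimal sketch; the exact results are CPython implementation details and can vary between versions):

# strings of length 1 are interned even when created at run time
a = str(3)
b = str(3)
print(a is b)                       # True on CPython

# longer strings built at run time are not interned
c = "".join(["01/01/", "2001"])
d = "".join(["01/01/", "2001"])
print(c is d)                       # False

# small integers (-5...255) come from the integer pool, larger ones don't
print(int("42") is int("42"))       # True
print(int("7354") is int("7354"))   # False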
But why doesn't Python cache all created strings/integers?
Let's start with integers. In order to be able to look up quickly whether an integer number has already been created (much faster than O(n)), one has to keep an additional look-up data structure, which needs additional memory. However, there are so many integers that the probability of hitting an already existing integer again is not very high, so in most cases the memory overhead for the look-up data structure will not be repaid.

Because strings need more memory, the relative (memory) cost of the look-up data structure isn't that high. But it doesn't make any sense to intern a 1000-character string, because the probability that a randomly created string has the very same characters is almost 0!

On the other hand, if for example a hash dictionary is used as the look-up structure, the calculation of the hash will take O(n) (n being the number of characters), which probably won't pay off for large strings.

Thus, Python makes a trade-off, which works pretty well in most scenarios - but it cannot be perfect in some special cases. Yet for those special scenarios you can optimize by hand using sys.intern().

Note: having the same id doesn't mean being the same object if the lifetimes of the two objects don't overlap - so your reasoning in the question isn't entirely watertight - but this is of no consequence in this special case.
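As an illustration, a minimal sketch of how sys.intern() could replace the hand-rolled route_cache from the question (the function name read_as_dicts_interned is made up here; the resulting id count should match the number of unique routes, but that is still an implementation-dependent observation):

import csv
import sys

def read_as_dicts_interned(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            records.append({
                'route': sys.intern(row[0]),  # force interning of the route string
                'date': row[1],
                'daytype': row[2],
                'rides': int(row[3])})
    return records

rows = read_as_dicts_interned('ctabus.csv')
print(len({id(row['route']) for row in rows}))  # expected: 185, one object per unique route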
There are 736461 elements in rows. Thus, you are adding id(row['route']) to the set route_ids 736461 times.

Since whatever id returns is guaranteed to be unique among simultaneously existing objects, we would expect route_ids to end up with 736461 items, minus the number of rows whose 'route' string is small enough to be cached and therefore shared with another row's 'route' string.

Turns out that in your specific case, that number is 736461 - 690072 == 46389.
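You can count that overlap directly with a small sketch like the following (variable names are just for illustration; the printed number should come out to 46389 for the data from the question):

seen = set()
reused = 0
for row in rows:
    obj_id = id(row['route'])
    if obj_id in seen:
        reused += 1          # this row's 'route' string is a shared (cached) object
    else:
        seen.add(obj_id)
print(reused)                # 736461 - 690072 == 46389 for the question's data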
Caching of small immutable objects (strings, integers) is an implementation detail you should not rely on - but here's a demo (run in a CPython interactive session; exact results may vary between versions):
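>>> a = 'hi'
>>> b = 'hi'
>>> a is b          # short, identifier-like strings are interned
True
>>> x = 'hello world!'
>>> y = 'hello world!'
>>> x is y          # longer / non-identifier strings entered separately are not
False
>>> int('42') is int('42')      # small integers come from the -5...255 pool
True
>>> int('7354') is int('7354')  # larger integers are separate objects
False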
In the end, there's probably a semantic error in your program. What do you want to do with the unique ids of the Python objects?