This one has me entirely baffled.
asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        #row = collections.OrderedDict([("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
        row = {"computer_name": key_host, "id": index, "hist_item": hist_item}
        asset_hist.append(row)
This code works perfectly with the collections line commented out. However, when I comment out the row = dict line and uncomment the collections line, things get very strange. There are about 4 million of these rows being generated and appended to asset_hist.
So, when I use row = dict, the entire loop finishes in about 10 milliseconds, lightning fast. When I use the ordered dictionary, I waited over 10 minutes and it still hadn't finished. Now, I know OrderedDict is supposed to be a little slower than a dict, but it's supposed to be about 10x slower at worst, and by my math it's actually about 100,000 times slower in this function.
I decided to print the index in the innermost loop to see what was happening. Interestingly enough, I noticed a sputtering in the console output. The index would print very fast on the screen and then stop for about 3-5 seconds before continuing on.
am_output.asset_history is a dictionary whose keys are hostnames and whose values are lists of strings. E.g.
am_output.asset_history = {"host1": ["string1", "string2", ...], "host2": ["string1", "string2", ...], ...}
EDIT: Sputter Analysis with OrderedDict
Total memory on this VM server: only 8 GB... need to get more provisioned.
LOOP NUM
184796 (~5 second wait, ~60% memory usage)
634481 (~5 second wait, ~65% memory usage)
1197564 (~5 second wait, ~70% memory usage)
1899247 (~5 second wait, ~75% memory usage)
2777296 (~5 second wait, ~80% memory usage)
3873730 (LONG WAIT... waited 20 minutes and gave up!, 88.3% memory usage, process is still running)
Where the wait happens changes with each run.
EDIT: Ran it again; this time it stopped on 3873333, close to the spot it stopped before. It stopped after forming the row, while trying to append... I didn't notice this on the last attempt, but it was there then too... the problem is with the append line, not the row line... I'm still baffled. Here's the row it produced right before the long stop (I added the row to the print statement)... hostname changed to protect the innocent:
3873333: OrderedDict([('computer_name', 'bg-fd5612ea'), ('id', 1), ('hist_item', "sys1 Normalizer (sys1-4): Domain Name cannot be determined from sys1 Name 'bg-fd5612ea'.")])
As your own tests prove, you're running out of memory. Even on CPython 3.6 (where plain dict is actually ordered, though not as a language guarantee yet), OrderedDict has significant memory overhead compared to dict; it's still implemented with a side-band linked list to preserve the order and support easy iteration, reordering with move_to_end, etc. You can tell just by checking with sys.getsizeof (exact results will differ by Python version and build bitwidth, 32 vs. 64 bit).
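A rough check along these lines shows the gap (a minimal sketch; the sample row and the approximate byte counts in the comments are illustrative, not exact):

import sys
from collections import OrderedDict

plain = {"computer_name": "host1", "id": 0, "hist_item": "example"}
ordered = OrderedDict(plain)

# Size of the containers themselves, ignoring the strings they reference
print(sys.getsizeof(plain))    # e.g. around 240 bytes for a three-key dict on a 64-bit 3.6 build
print(sys.getsizeof(ordered))  # roughly twice that for the OrderedDict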
Ignoring the data stored, the overhead for the OrderedDict here is nearly twice that of the plain dict. If you're making 4 million of these items, on my machine that would add overhead of a titch over 850 MB (on both 3.5 and 3.6).

It's likely the combination of all the other programs on your system, plus your Python program, is exceeding the RAM allocated to your machine, and you're stuck swap thrashing. In particular, whenever asset_hist has to expand for new entries, it's likely needing to page in large parts of it (that got paged out for lack of use), and whenever a cyclic garbage collection run triggers (a full GC occurs roughly every 70,000 allocations and deallocations by default), all the OrderedDicts get paged back in to check if they're still referenced outside of cycles (you could check whether the GC runs are the main problem by disabling cyclic GC via gc.disable()).
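To test that theory, a minimal sketch that wraps the loop from the question (am_output is assumed to exist as in the original code):

import gc

gc.disable()  # suspend cyclic garbage collection while building the list
try:
    asset_hist = []
    for key_host, val_hist_list in am_output.asset_history.items():
        for index, hist_item in enumerate(val_hist_list):
            asset_hist.append({"computer_name": key_host, "id": index, "hist_item": hist_item})
finally:
    gc.enable()  # restore normal collection once the loop is done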
Given your particular use case, I'd strongly recommend avoiding both dict and OrderedDict, though. The overhead of even dict, even the cheaper form on Python 3.6, is kind of extreme when you have a set of exactly three fixed keys over and over. Instead, use collections.namedtuple, which is designed for lightweight objects referenceable by either name or index (they act like regular tuples, but also allow accessing each value as a named attribute). That would dramatically reduce the memory cost of your program (and likely speed it up even when memory is not an issue). For example:
The only difference in use is that you replace row['computer_name'] with row.computer_name, or if you need all the values, you can unpack it like a regular tuple, e.g. comphost, idx, hist = row. If you need a true OrderedDict temporarily (don't store them for everything), you can call row._asdict() to get an OrderedDict with the same mapping as the namedtuple, but that's not normally needed. The memory savings are meaningful; on my system, the three-element namedtuple drops the per-item overhead to 72 bytes, less than a third that of even the 3.6 dict and less than a sixth of a 3.6 OrderedDict (and a three-element namedtuple remains 72 bytes on 3.5, where dict/OrderedDict are larger pre-3.6). It may save even more than that; tuples (and namedtuple by extension) are allocated as a single contiguous C struct, while dict and company are at least two allocations (one for the object structure, one or more for the dynamically resizable parts of the structure), each of which may pay allocator overhead and alignment costs.
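Continuing the sketch above, access looks like this (field names as assumed earlier):

row = asset_hist[0]
print(row.computer_name)   # attribute access replaces row['computer_name']
comphost, idx, hist = row  # plain tuple unpacking still works
ordered = row._asdict()    # an OrderedDict with the same mapping, only if one is really needed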
Either way, for your four million row scenario, using namedtuple would mean paying (beyond the cost of the values) overhead of around 275 MB total, vs. 915 (3.6) to 1,100 (3.5) MB for dict and 1,770 (3.6) to 1,950 (3.5) MB for OrderedDict. When you're talking about an 8 GB system, shaving 1.5 GB off your overhead is a major improvement.