My machine learning script produces a lot of data (millions of BTree
s contained in one root BTree
) and store it in ZODB's FileStorage
, mainly because all of it wouldn't fit in RAM. Script also frequently modifies previously added data.
When I increased the complexity of the problem, and thus more data needs to be stored, I noticed performance issues - script is now computing data on average from two to even ten times slower (the only thing that changed is amount of data to be stored and later retrieved to be changed).
I tried setting cache_size
to various values between 1000 and 50000. To be honest, the differences in speed were negligible.
I thought of switching to RelStorage
but unfortunately in the docs they mention only how to configure frameworks such as Zope or Plone. I'm using ZODB only.
I wonder if RelStorage
would be faster in my case.
Here's how I currently setup ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
It's clear for me that ZODB is currently the bottleneck of my script. I'm looking for advice on how I could solve this problem.
I chose ZODB beacuse I thought that NoSQL database would better fit my case and I liked the idea of the interface similar to Python's dict
.
Code and data structures:
root data structures:
if not hasattr(dbroot, 'actions_values'): dbroot.actions_values = BTree() if not hasattr(dbroot, 'games_played'): dbroot.games_played = 0
actions_values
is conceptually built as follows:actions_values = { # BTree str(state): { # BTree # contiains actions (coulmn to pick to be exact, as I'm working on agent playing Connect 4) # and their values(only actions previously taken by the angent are present here), e.g.: 1: 0.4356 5: 0.3456 }, # other states }
state
is a simple 2D array representing game board. Possible vales of it's fields are1
,2
orNone
:board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case
rows = 6
andcols = 7
)main loop:
should_play = 10000000 transactions_freq = 10000 packing_freq = 50000 player = ReinforcementPlayer(dbroot.actions_values, config) while dbroot.games_played < should_play: # max_epsilon at start and then linearly drops to min_epsilon: epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1) dbroot.games_played += 1 sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play)) sys.stdout.flush() board_state = player.play_game(epsilon) if(dbroot.games_played % transactions_freq == 0): print('Commiting...') transaction.commit() if(dbroot.games_played % packing_freq == 0): print('Packing DB...') connection.db().pack()
(
pack
ing also takes much time but it's not the main problem; I could pack database after program finishes)Code operating on
dbroot
(insideReinforcementPlayer
):def get_actions_with_values(self, player_id, state): if player_id == 1: lookup_state = state else: lookup_state = state.switch_players() lookup_state_str = str(lookup_state) if lookup_state_str in self.actions_values: return self.actions_values[lookup_state_str] mirror_lookup_state_str = str(lookup_state.mirror()) if mirror_lookup_state_str in self.actions_values: return self.mirror_actions(self.actions_values[mirror_lookup_state_str]) return None def get_value_of_action(self, player_id, state, action, default=0): actions = self.get_actions_with_values(player_id, state) if actions is None: return default return actions.get(action, default) def set_value_of_action(self, player_id, state, action, value): if player_id == 1: lookup_state = state else: lookup_state = state.switch_players() lookup_state_str = str(lookup_state) if lookup_state_str in self.actions_values: self.actions_values[lookup_state_str][action] = value return mirror_lookup_state_str = str(lookup_state.mirror()) if mirror_lookup_state_str in self.actions_values: self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value return self.actions_values[lookup_state_str] = BTree() self.actions_values[lookup_state_str][action] = value
(Functions with mirror in name simply reverse the columns (actions). It is done beacuse Connect 4 boards which are vertical reflections of each other are equivalent.)
After 550000 games len(dbroot.actions_values)
is 6018450.
According to iotop
IO operations take 90% of the time.
Just to be clear here, which BTree class are you actually using? An OOBTree?
Two aspects about those btrees:
1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they hold currently, but I did once try tweaking the C-code for them and recompile to hold a larger number as the value chosen was chosen nearly two decades ago.
2) It is sometime possible to construct very un-balanced Btrees. e.g. if you add values in sorted order (e.g. a timestamp that only ever increases) then you will end up with a tree that ends up being O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.
3) Rather than using an OOBTree you can use an OOBucket instead. This will end up being just a single pickle in the ZODB, so may end up too big in your use case, but if you are doing all the writes in a single transaction than it may be faster (at the expense of having to re-write the entire Bucket on an update).
-Matt
Using any (other) database would probably not help, as they are subject to same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (PostgreSQL + using SQL scripts) it might help, as the database engine would have more information to make intelligent choices how to execute the code, but there is nothing magical here and same things can be most likely done with ZODB with quite ease.
Some ideas what can be done:
Have indexes of data instead of loading full objects (equal to SQL "full table scan"). Keep intelligent preprocesses copies of data: indexes, sums, partials.
Make the objects themselves smaller (Python classes have
__slots__
trick)Use transactions in intelligent fashion. Don't try to process all data in a single big chunk.
Parallel processing - use all CPU cores instead of single threaded approach
Don't use BTrees - maybe there is something more efficient for your use case
Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.