When to commit data in ZODB

Posted 2019-06-25 21:13

I want to handle the data generated by the following piece of code:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)  # some pairwise operation
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

Since the dictionary is large (10000 keys, each mapping to 10000 lists of 3 elements), it is difficult to keep it in memory. I am looking for a solution that stores each key: value (list) pair to disk as soon as it is generated. It was suggested in Writing and reading a dictionary in specific format (Python) to use ZODB in combination with a BTree.

Bear with me if this is too naive, but my question is: when should one call transaction.commit() to commit the data? If I call it at the end of the inner loop, the generated file is very large (I am not sure why). Here is a snippet:

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.IOBTree import IOBTree
from persistent.list import PersistentList
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
btree_container = IOBTree()
root[0] = btree_container
for nodes in G.nodes():
    btree_container[nodes] = PersistentList()  # I was losing data prior to doing this

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
        transaction.commit()

What if I call it outside the two loops instead? Something like:

......
    ......
        score = someoperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
transaction.commit()

Is all the data held in memory until I call transaction.commit()? Again, I am not sure why, but this results in a smaller file on disk.

I want to minimize the amount of data held in memory. Any guidance would be appreciated!

Answer 1:

Your goal is to make your process manageable within memory constraints. To be able to do this with the ZODB as a tool you need to understand how ZODB transactions work, and how to use them.

Why your ZODB grows so large

First of all you need to understand what a transaction commit does here, which also explains why your Data.fs is getting so large.

The ZODB writes data out per transaction, where any persistent object that has changed gets written to disk. The important detail here is persistent object that has changed; the ZODB works in units of persistent objects.

Not every Python value is a persistent object. If I define a straight-up Python class, it will not be persistent, nor are any of the built-in Python types such as int or list. On the other hand, any class you define that inherits from persistence.Persistent is a persistent object. The BTrees set of classes, as well as the PersistentList class you use in your code, do inherit from Persistent.
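A minimal sketch of that distinction (the class names here are made up for illustration):

import persistent

class PlainNode(object):
    """Not persistent: stored by value inside whatever persistent
    object happens to reference it, and rewritten along with it."""
    def __init__(self, score):
        self.score = score

class TrackedNode(persistent.Persistent):
    """Persistent: gets its own database record, and is written out
    on commit only when it has actually changed."""
    def __init__(self, score):
        self.score = score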

Now, on a transaction commit, any persistent object that has changed is written to disk as part of that transaction. So any PersistentList object that has been appended to will be written in its entirety to disk. BTrees handle this a little more efficiently; they store Buckets, themselves persistent objects, which in turn hold the actually stored objects. So for every few new nodes you create, a Bucket is written to the transaction, not the whole BTree structure. Note that because the items held in the tree are themselves persistent objects, only references to them are stored in the Bucket records.

Now, the ZODB writes transaction data by appending it to the Data.fs file, and it does not remove old data automatically. It can construct the current state of the database by finding the most recent version of a given object in the store. This is why your Data.fs is growing so much: you are writing out new versions of larger and larger PersistentList instances as transactions are committed.

Removing the old data is called packing, which is similar to the VACUUM command in PostgreSQL and other relational databases. Simply call the .pack() method on the db variable to remove all old revisions, or use the t and days parameters of that method to set limits on how much history to retain: t is a time.time() timestamp (seconds since the epoch) before which you can pack, and days is the number of days of history to retain before the current time, or before t if specified. Packing should reduce your data file considerably, as the partial lists in older transactions are removed. Do note that packing is an expensive operation and can therefore take a while, depending on the size of your dataset.
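A short sketch of those packing calls, reusing the db object from the question's snippet:

import time

# Remove every old object revision up to now:
db.pack()

# Or keep a window of history: pack away revisions older than one day.
# `t` is a time.time() timestamp; `days` counts back from `t`
# (or from the current time if `t` is omitted).
db.pack(t=time.time(), days=1)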

Using transaction to manage memory

You are trying to build a very large dataset by using persistence to work around memory constraints, and you are using transactions to try to flush things to disk. Normally, however, committing a transaction signals that you have completed constructing your dataset, something you can use as one atomic whole.

What you need to use here is a savepoint. Savepoints are essentially subtransactions, a point during the whole transaction where you can ask for data to be temporarily stored on disk. They'll be made permanent when you commit the transaction. To create a savepoint, call the .savepoint method on the transaction:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, PersistentList()).append(
            [Hnodes, score, -1])
    transaction.savepoint(True)
transaction.commit()

In the above example I set the optimistic flag to True, meaning: I do not intend to roll back to this savepoint. Some storages do not support rolling back, and signalling that you do not need it makes your code work in such situations.
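For contrast, a non-optimistic savepoint hands back an object you can roll back to. A small sketch (the key 42 is arbitrary):

sp = transaction.savepoint()           # optimistic defaults to False
btree_container[42] = PersistentList()
sp.rollback()                          # undo everything since the savepoint;
                                       # the rest of the transaction survives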

Also note that the transaction.commit() happens when the whole data set has been processed, which is what a commit is supposed to achieve.

One thing a savepoint does is trigger a garbage collection of the ZODB caches, which means that any data not currently in use is removed from memory.
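If you ever need to trim the cache by hand, outside a savepoint, the connection object exposes the same machinery directly; a sketch using the connection from the question's snippet:

connection.cacheGC()        # trim the cache back toward its target size
connection.cacheMinimize()  # more aggressive: deactivate every unmodified object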

Note the 'not currently in use' part there; if any of your code holds on to large values in a variable, that data cannot be cleared from memory. As far as I can determine from the code you've shown us, this looks fine. But I do not know how your operations work or how you generate the nodes; be careful to avoid building complete lists in memory when an iterator will do, or building large dictionaries where all your lists of lists are referenced, for example.
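For example, the difference between the two habits looks like this (produce_value is a hypothetical stand-in for your node generation):

values = [produce_value(n) for n in range(10000)]  # whole list pinned in memory
values = (produce_value(n) for n in range(10000))  # generator: one item at a time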

You can experiment a little as to when you create savepoints; you could create one every time you've processed one Hnodes, or only when done with a Gnodes loop, as I've done above. You are constructing a list per Gnodes, so it would be kept in memory while looping over all of H.nodes() anyway, and flushing it to disk would probably only make sense once you've completed constructing it.

However, if you find that you need to clear memory more frequently, you should consider using either a BTrees.OOBTree.TreeSet class or a BTrees.IOBTree.BTree class instead of a PersistentList, to break your data up into more persistent objects. A TreeSet is ordered but not (easily) indexable, while a BTree can be used as a list by using simple incrementing index keys:

for i, Hnodes in enumerate(H.nodes()):
    ...
    btree_container.setdefault(Gnodes, IOBTree())[i] = [Hnodes, score, -1]
    if i % 100 == 0:
        transaction.savepoint(True)

The above code uses a BTree instead of a PersistentList, and creates a savepoint for every 100 Hnodes processed. Because the BTree uses buckets, which are persistent objects in their own right, the whole structure can be flushed to a savepoint more easily, without everything having to stay in memory while all of H.nodes() is processed.
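When you later need the per-Gnodes data back in list order, the inner BTree can simply be iterated in key order; a sketch, where some_gnode is a hypothetical key:

for i, triple in btree_container[some_gnode].items():
    Hnodes, score, flag = triple  # items() yields pairs sorted by the index key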



Answer 2:

What a transaction is depends on what needs to be "atomic" in your application. If a transaction fails, it will be rolled back to the previous state (that is, to the state at the last commit). It looks from your application code as though you want to calculate the value for each Gnodes. So your commit can go at the end of the Gnodes loop, like this:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
    # once we calculate the value for a Gnodes, commit
    transaction.commit()
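The rollback-on-failure behaviour mentioned above can also be made explicit; a sketch wrapping one Gnodes iteration:

try:
    # ... build the per-Gnodes list as above ...
    transaction.commit()
except Exception:
    transaction.abort()  # discard all changes since the last commit
    raise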

It also looks from your code as though the Hvalue computation does not depend on Gvalue or Gnodes. If it is an expensive operation, you are recomputing it for every Gnodes even though its result does not change. So I would move it out of the loop:

# Hnodes iterates over 10000 values
hvals = dict((Hnodes, someoperation(Hnodes)) for Hnodes in H.nodes())
# now you have a mapping of Hnodes to Hvalues

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes, Hvalue in hvals.iteritems():  # use .items() on Python 3
        score = someoperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
    # once we calculate the value for a given Gnodes, commit
    transaction.commit()


Source: when to commit data in ZODB
Tags: python zodb