Why does PyYAML use generators to construct objects?

Published 2020-02-06 05:33

Question:

I've been reading the PyYAML source code to try to understand how to define a proper constructor function that I can add with add_constructor. I have a pretty good understanding of how that code works now, but I still don't understand why the default YAML constructors in the SafeConstructor are generators. For example, the method construct_yaml_map of SafeConstructor:

def construct_yaml_map(self, node):
    data = {}
    yield data
    value = self.construct_mapping(node)
    data.update(value)

I understand how BaseConstructor.construct_object uses the generator to stub out an object first, deferring its population with data from the node when deep=False is passed to construct_mapping:

    if isinstance(data, types.GeneratorType):
        generator = data
        data = generator.next()
        if self.deep_construct:
            for dummy in generator:
                pass
        else:
            self.state_generators.append(generator)

And I understand how the data is generated in BaseConstructor.construct_document in the case where deep=False for construct_mapping.

def construct_document(self, node):
    data = self.construct_object(node)
    while self.state_generators:
        state_generators = self.state_generators
        self.state_generators = []
        for generator in state_generators:
            for dummy in generator:
                pass

What I don't understand is the benefit of stubbing out the data objects and working down through the objects by iterating over the generators in construct_document. Does this have to be done to support something in the YAML spec, or does it provide a performance benefit?

This answer on another question was somewhat helpful, but I don't understand why that answer does this:

def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    yield instance
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)

instead of this:

def foo_constructor(loader, node):
    state = loader.construct_mapping(node, deep=True)
    return Foo(**state)

I've tested that the latter form works for the examples posted on that other answer, but perhaps I am missing some edge case.

I am using version 3.10 of PyYAML, but it looks like the code in question is the same in the latest version (3.12) of PyYAML.

Answer 1:

In YAML you can have anchors and aliases. With that you can make self-referential structures, directly or indirectly.

If YAML did not allow self-reference, you could just construct all the children first and then create the parent structure in one go. But because of self-references, you might not yet have the child you need to "fill out" the structure that you are creating. By using the two-step process of the generator (I call this two-step because it has only one yield before you reach the end of the method), you can create an object partially and then fill it out with a self-reference, because the object already exists (i.e. its place in memory is defined).

The benefit is not speed; it is purely that self-reference becomes possible.
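The two-step process described above can be sketched without any YAML library. This is a toy model under stated assumptions, not PyYAML's actual code: ToyLoader, two_step_list and one_step_list are hypothetical names, and a plain Python list that contains itself stands in for an anchored node with an alias to itself.

```python
import types


class ToyLoader:
    """Toy stand-in for a YAML constructor; not PyYAML's actual code."""

    def __init__(self, list_constructor):
        self.list_constructor = list_constructor
        self.constructed = {}     # id(node) -> object built for that node
        self.in_progress = set()  # nodes whose constructor has not returned yet
        self.pending = []         # generators waiting to run their second step

    def construct_object(self, node):
        if not isinstance(node, list):
            return node                          # scalars are used as-is
        if id(node) in self.constructed:
            return self.constructed[id(node)]    # alias to an existing object
        if id(node) in self.in_progress:
            raise ValueError("found unconstructable recursive node")
        self.in_progress.add(id(node))
        data = self.list_constructor(self, node)
        if isinstance(data, types.GeneratorType):
            generator = data
            data = next(generator)               # step 1: obtain the empty stub
            self.pending.append(generator)       # step 2 is deferred
        self.constructed[id(node)] = data
        self.in_progress.discard(id(node))
        return data

    def construct_document(self, node):
        data = self.construct_object(node)
        while self.pending:
            pending, self.pending = self.pending, []
            for generator in pending:
                for _ in generator:              # run the generator to its end
                    pass
        return data


def two_step_list(loader, node):
    data = []
    yield data                                   # the stub exists from here on
    data.extend(loader.construct_object(child) for child in node)


def one_step_list(loader, node):
    # Needs every child before the parent exists, so a self-reference fails.
    return [loader.construct_object(child) for child in node]


# A self-referential input, comparable to the YAML document "&a [1, *a]".
node = [1]
node.append(node)

built = ToyLoader(two_step_list).construct_document(node)
print(built[1] is built)

try:
    ToyLoader(one_step_list).construct_document(node)
except ValueError as exc:
    print(exc)
```

With the generator-based constructor the stub is registered before its children are visited, so the inner reference resolves to the stub itself and the first print shows True. The one-step constructor reaches the same node again while it is still being built and raises the error instead.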

If you simplify the example from the answer you refer to a bit, the following loads:

import sys
import ruamel.yaml as yaml


class Foo(object):
    def __init__(self, s, l=None, d=None):
        self.s = s
        self.l1, self.l2 = l
        self.d = d


def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    yield instance
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)

yaml.add_constructor(u'!Foo', foo_constructor)

x = yaml.load('''
&fooref
!Foo
s: *fooref
l: [1, 2]
d: {try: this}
''', Loader=yaml.Loader)

yaml.dump(x, sys.stdout)

but if you change foo_constructor() to:

def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)
    return instance

(yield removed, final return added), you get a ConstructorError with the message:

found unconstructable recursive node 
  in "<unicode string>", line 2, column 1:
    &fooref

PyYAML should give a similar message. Inspect the traceback on that error and you can see where ruamel.yaml/PyYAML tries to resolve the alias in the source code.
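The reason the Foo.__new__(Foo) / __init__ split makes the difference can be seen in plain Python. The following is only a sketch: the anchors dictionary is a hypothetical stand-in for whatever table the loader keeps for anchored nodes, and the state dictionary is written out by hand instead of being produced by construct_mapping.

```python
class Foo(object):
    def __init__(self, s, l=None, d=None):
        self.s = s
        self.l1, self.l2 = l
        self.d = d


anchors = {}  # hypothetical stand-in for the loader's anchor bookkeeping

# Step 1 (everything before the yield): allocate the object without running
# __init__ and make it reachable under its anchor name.
instance = Foo.__new__(Foo)
anchors["fooref"] = instance

# Step 2 (everything after the yield): build the state. The alias *fooref can
# be resolved now because the object already exists, even though it is still
# uninitialised.
state = {"s": anchors["fooref"], "l": [1, 2], "d": {"try": "this"}}
instance.__init__(**state)

print(instance.s is instance)
```

In the version without yield nothing is registered until the function returns, so at the moment the alias is looked up there is no object to hand back.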