Say I have the following reST input:
Some text ...

:foo: bar

Some text ...
What I would like to end up with is a dict like this:
{"foo": "bar"}
I tried to use this:
tree = docutils.core.publish_parts(text)
It does parse the field list, but I end up with some pseudo-XML in tree["whole"]:
<document source="<string>">
    <docinfo>
        <field>
            <field_name>
                foo
            <field_body>
                <paragraph>
                    bar
Since the tree dict does not contain any other useful information, and the pseudo-XML is just a string, I am not sure how to parse the field list out of the reST document. How would I do that?
You can try something like the following code. Rather than using the publish_parts method, I have used publish_doctree to get the document tree that the pseudo-XML represents. I then convert it to an XML DOM in order to extract all the field elements, and take the first field_name and field_body elements of each field element.
from docutils.core import publish_doctree

# The field list must be separated from the surrounding paragraphs
# by blank lines, otherwise it is parsed as ordinary paragraph text.
source = """Some text ...

:foo: bar

Some text ...
"""

# Parse reStructuredText input, returning the Docutils doctree as
# an `xml.dom.minidom.Document` instance.
doctree = publish_doctree(source).asdom()

# Get all field elements in the document.
fields = doctree.getElementsByTagName('field')

d = {}
for field in fields:
    # I am assuming that each field has exactly one field_name and
    # one field_body, so taking the first match is enough.
    field_name = field.getElementsByTagName('field_name')[0]
    field_body = field.getElementsByTagName('field_body')[0]
    # Join the text of every child (e.g. each paragraph) of the body.
    d[field_name.firstChild.nodeValue] = " ".join(
        c.firstChild.nodeValue for c in field_body.childNodes)

print(d)  # Prints {'foo': 'bar'}
The xml.dom module isn't the easiest to work with (why do I need to use .firstChild.nodeValue rather than just .nodeValue, for example?), so you may wish to use the xml.etree.ElementTree module, which I find a lot easier to work with. If you use lxml, you can also use XPath notation to find all of the field, field_name and field_body elements.
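As a rough illustration of the XPath idea, here is a minimal sketch that uses the limited XPath subset built into xml.etree.ElementTree rather than lxml. The XML string is a hand-written stand-in for what publish_doctree(source).asdom().toxml() might produce (its exact shape is an assumption), and fields_from_xml is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the doctree XML (assumed shape, not real
# docutils output), so the sketch runs with the standard library only.
XML = (
    '<document source="&lt;string&gt;">'
    '<paragraph>Some text ...</paragraph>'
    '<field_list><field>'
    '<field_name>foo</field_name>'
    '<field_body><paragraph>bar</paragraph></field_body>'
    '</field></field_list>'
    '<paragraph>Some text ...</paragraph>'
    '</document>'
)


def fields_from_xml(xml_text):
    """Collect {field_name: field_body text} using ElementTree paths."""
    root = ET.fromstring(xml_text)
    result = {}
    # './/field' matches every <field> element at any depth.
    for field in root.findall('.//field'):
        name = field.findtext('field_name')
        body = field.find('field_body')
        # itertext() flattens the text of nested paragraphs.
        result[name] = ''.join(body.itertext())
    return result


print(fields_from_xml(XML))  # {'foo': 'bar'}
```

With lxml the same path expressions should carry over to its xpath() method, which also supports the full XPath 1.0 syntax.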
I have an alternative solution that I find to be less of a burden, but maybe more brittle. After reviewing the implementation of the Node class (https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/nodes.py), you will see that it supports a walk method that can be used to pull out the wanted data without having to create two different XML representations of your data. Here is what I am using now in my prototype code:
https://github.com/h4ck3rm1k3/gcc-introspector/blob/master/peewee_adaptor.py#L33
from docutils.core import publish_doctree
import docutils.nodes
import pprint
and then
def walk_docstring(prop):
    doc = prop.__doc__
    doctree = publish_doctree(doc)

    class Walker:
        def __init__(self, doc):
            # walk() reads visitor.document, so store the doctree here.
            self.document = doc
            self.fields = {}

        def dispatch_visit(self, x):
            # Called by walk() for every node; keep only field nodes.
            if isinstance(x, docutils.nodes.field):
                field_name = x.children[0].rawsource
                field_value = x.children[1].rawsource
                self.fields[field_name] = field_value

    w = Walker(doctree)
    doctree.walk(w)
    # the collected fields I wanted
    pprint.pprint(w.fields)
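A possibly less brittle variant of the same idea is to subclass docutils' own visitor class instead of hand-rolling dispatch_visit, and to read the text with astext() rather than rawsource. This is only a sketch, assuming docutils is installed; FieldCollector and extract_fields are hypothetical names, not part of docutils:

```python
from docutils import nodes
from docutils.core import publish_doctree


class FieldCollector(nodes.SparseNodeVisitor):
    """Hypothetical visitor that records every field as name -> body text."""

    def __init__(self, document):
        super().__init__(document)
        self.fields = {}

    def visit_field(self, node):
        # A field node has two children: field_name and field_body.
        field_name, field_body = node.children
        self.fields[field_name.astext()] = field_body.astext()


def extract_fields(source):
    """Parse reST text and return its fields as a plain dict."""
    doctree = publish_doctree(source)
    collector = FieldCollector(doctree)
    doctree.walk(collector)
    return collector.fields


print(extract_fields("Some text ...\n\n:foo: bar\n\nSome text ...\n"))
```

SparseNodeVisitor ignores node types that have no visit_ method, so only field nodes are handled here.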
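Here's my ElementTree implementation:
from docutils.core import publish_doctree
from xml.etree.ElementTree import fromstring

# The field list must be separated from the surrounding paragraphs
# by blank lines, otherwise it is parsed as ordinary paragraph text.
source = """Some text ...

:foo: bar

Some text ...
"""

def gen_fields(source):
    # Serialize the doctree to XML, then re-parse it with ElementTree.
    dom = publish_doctree(source).asdom()
    tree = fromstring(dom.toxml())
    for field in tree.iter(tag='field'):
        name = next(field.iter(tag='field_name'))
        body = next(field.iter(tag='field_body'))
        # Yield one single-entry dict per field.
        yield {name.text: ''.join(body.itertext())}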
Usage
>>> next(gen_fields(source))
{'foo': 'bar'}
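Since the generator yields one single-entry dict per field, getting a single dict like the one asked for in the question needs a small merge step. A minimal sketch, where merge_fields is a hypothetical helper and the sample list stands in for the output of gen_fields on a document with two fields:

```python
def merge_fields(field_dicts):
    """Fold an iterable of single-entry dicts into one dict (later keys win)."""
    merged = {}
    for entry in field_dicts:
        merged.update(entry)
    return merged


# Stand-in for what gen_fields(source) would yield for two fields.
sample = iter([{'foo': 'bar'}, {'baz': 'qux'}])
print(merge_fields(sample))  # {'foo': 'bar', 'baz': 'qux'}
```

With the generator above, this would be called as merge_fields(gen_fields(source)).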