As part of my graduation I have to migrate XML files to CouchDB. Converting the structure of the file to JSON is no problem at all, but there is one part that I can't figure out how to convert:
<p>We beg to send us immediately [...] <note>
<p>In the original, [...]</p>
</note><lb/><add>by post</add> one copy of
<title>A Book</title> by <persName>
<choice><abbr>Mrs.</abbr><expan>Misses</expan></choice>Jane Smith</persName>.
As soon<lb/> we know the <choice>
<sic>prize</sic>
<corr>price</corr>
</choice>the amount [...]<lb/> by post.<lb/>
</p>
I'd like to stick to JSON and not embed XML within JSON, as I would then need to implement XML capabilities in my program.
What would be 'best practices' or solutions to this problem?
The following will work, though it is a PITA to work with. Represent each tag as follows:
{
attr => {...},
tag => "...",
content => [...]
}
Here content is an array whose items are either text (outside of tags) or nested tags.
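As a rough illustration of the node shape described above, here is how it might look as real JSON built from JavaScript. The `el` helper is hypothetical (it is not part of any library), and always emitting an `attr` object, even when empty, is an assumption made to keep every node the same shape.

```javascript
// Hypothetical helper: builds one node in the { tag, attr, content } shape.
// attr defaults to {} and content to [] so that every node looks alike.
function el(tag, attr = {}, content = []) {
  return { tag, attr, content };
}

// <persName><choice><abbr>Mrs.</abbr><expan>Misses</expan></choice>Jane Smith</persName>
const persName = el("persName", {}, [
  el("choice", {}, [
    el("abbr", {}, ["Mrs."]),
    el("expan", {}, ["Misses"]),
  ]),
  "Jane Smith", // text that sits directly inside <persName>, after <choice>
]);

// The result is plain JSON, so it can be stored as (part of) a CouchDB document.
const json = JSON.stringify(persName);
```

Because text and child elements share one ordered array, the position of "Jane Smith" relative to the `choice` element is preserved.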
Ignoring whitespace and indentation your snippet would become something like:
{
tag => "p",
content => [
"We beg to send us immediately [...]",
{
tag => "note",
content => [
{
tag => "p",
content => [ "In the original, [...]" ]
}
]
},
{ tag => "lb" },
{
tag => "add",
content => [ "by post" ],
},
" one copy of ",
{
tag => "title",
content => [ "A Book" ],
},
" by ",
{
tag => "persName",
content => [
{
tag => "choice",
content => [ ... ]
}
],
},
...
]
}
(I got bored representing it, sorry.)
Note that the data structure is very repetitive and verbose. But you'll be processing the JSON programmatically, and for that it is very useful that the data structure is perfectly predictable and regular.
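To show why the regular shape pays off, here is a minimal sketch of two recursive helpers over such a tree. The names `textOf` and `findAll` are made up for this example, and the sample document is a shortened version of the snippet from the question.

```javascript
// Collect the plain text of a node and all its descendants, in document order.
function textOf(node) {
  if (typeof node === "string") return node;
  return (node.content || []).map(textOf).join("");
}

// Find every element with the given tag name, including the node itself.
function findAll(node, tag, found = []) {
  if (typeof node === "string") return found;
  if (node.tag === tag) found.push(node);
  for (const child of node.content || []) findAll(child, tag, found);
  return found;
}

// Shortened sample in the { tag, content } shape (attr omitted, as in the answer).
const doc = {
  tag: "p",
  content: [
    "one copy of ",
    { tag: "title", content: ["A Book"] },
    " by ",
    {
      tag: "persName",
      content: [
        {
          tag: "choice",
          content: [
            { tag: "abbr", content: ["Mrs."] },
            { tag: "expan", content: ["Misses"] },
          ],
        },
        "Jane Smith",
      ],
    },
    { tag: "lb" }, // empty element: no content key at all
  ],
};

const titles = findAll(doc, "title").map(textOf); // ["A Book"]
const allText = textOf(doc);
```

Note that `textOf` naively concatenates both branches of a `choice`; a real application would probably pick either `abbr` or `expan` (or `sic` or `corr`) before collecting text.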
Use Unicode to simplify the conversion of anonymous block boxes:
JSON.stringify({"domelement":
{
"p": "We beg to send us immediately [...]",
"note": {"p":"In the original, [...]"},
"add": "by post \u0022one copy of\u0022",
"title": "A Book \u0022by\u0022",
"choice": [{"abbr":"Mrs."}, {"expan":"Misses \u0022Jane Smith\u0022 \u0022As soon\u0022 \u0022we know the\u0022"}],
"choice": {"sic":"prize"},
"corr": "price \u0022the amount [...]\u0022"
}
})
References
- ECMAScript Strawman: Unicode Normalization