I have a very large XML file (1.25 GB) that I need to split into smaller files so that I can process it. The file contains linguistic data in which each text is opened and closed by the tags:
< text id="www.example.com>
and
< /text>
I would like to split the large file at these tags, so that, for example,
< text id="www.example.com>
Hello
< /text>
< text id="www.example.com>
This is
< /text>
< text id="www.example.com>
An Example
< /text>
would become three separate files, each beginning and ending with the "text" tags. For example:
File 1
< text id="www.example.com>
Hello
< /text>
File 2
< text id="www.example.com>
This is
< /text>
File 3
< text id="www.example.com>
An Example
< /text>
I suppose this could be done with a script in Perl, for instance, but I'm wondering if there's any kind of "one stop shop" way to split this file using standard Unix tools.
I know that the split command can break a large file into smaller files by line count or file size. However, is there a similar command that splits by XML tag?
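To make the goal concrete, here is a minimal Python sketch of the behavior I'm after. It is only an illustration, not a finished solution: it assumes each opening and closing "text" tag sits on its own line, as in the example above, and the function name, the "part" prefix and the output naming scheme are all made up for this sketch. It reads the input line by line, so the 1.25 GB file should never need to be held in memory.

```python
import re
from pathlib import Path

# Assumption: tags appear on their own lines, with or without a space after "<".
OPEN_TAG = re.compile(r"^\s*<\s*text\b")
CLOSE_TAG = re.compile(r"^\s*<\s*/\s*text\s*>")


def split_by_text_tag(src, out_dir, prefix="part"):
    """Write each <text>...</text> block of src to its own file in out_dir.

    Returns the number of files written. Lines outside any block are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    out = None  # currently open output file, or None when between blocks
    with open(src, encoding="utf-8") as fh:
        for line in fh:  # streams the input one line at a time
            if out is None and OPEN_TAG.match(line):
                count += 1
                name = f"{prefix}_{count:06d}.xml"  # hypothetical naming scheme
                out = open(out_dir / name, "w", encoding="utf-8")
            if out is not None:
                out.write(line)
                if CLOSE_TAG.match(line):
                    out.close()
                    out = None
    if out is not None:  # input ended inside an unclosed block
        out.close()
    return count
```

Calling something like split_by_text_tag("corpus.xml", "chunks") (both names are placeholders) would then leave one file per "text" block in the chunks directory. A command-line equivalent of this is what I'm hoping exists.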
Thanks in advance for any help!