Split file by XML tag

Posted 2019-06-05 18:55

Question:

I have a very large XML file (1.25 GB) that I need to split into smaller files to be able to process them. The file contains linguistic data that is headed and footed by the tags:

<text id="www.example.com">

and

</text>

I would like to split the larger file at these tags so that, for example,

<text id="www.example.com">

Hello

</text>

<text id="www.example.com">

This is

</text>

<text id="www.example.com">

An Example

</text>

would essentially become three different files, each with its beginning and end marked by the "text" tags. For example:

File 1

<text id="www.example.com">

Hello

</text>

File 2

<text id="www.example.com">

This is

</text>

File 3

<text id="www.example.com">

An Example

</text>

I suppose this could be done by scripting in Perl, for instance, but I'm wondering if there's any kind of "one-stop shop" way to split this file using Unix.

I know that the split command is useful for splitting a large file into smaller files based on line count or file size. However, is there a similar command that permits splitting by XML tag?
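To illustrate, this is the kind of line-based splitting I mean (the line count and the chunk_ prefix here are arbitrary); what I need is the same idea, but keyed on the <text> tags:

split -l 100000 input-file chunk_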

Thanks in advance for any help!

Answer 1:

The following awk command solves the problem, but unfortunately it caps out at around 1,000 output files, presumably because every output file is kept open and awk runs into the limit on simultaneously open files:

awk '{print $0 "</text>" > ("file" NR)}' RS='</text>' input-file
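If the cap really comes from the open-file limit, a variant that closes each output file as soon as it is written should avoid it. This is a sketch assuming GNU awk, since a multi-character RS is a gawk extension; the NF guard skips the empty record left after the final closing tag:

awk 'BEGIN { RS = "</text>" }
     NF    { ++n; f = "file" n; print $0 "</text>" > f; close(f) }' input-file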


Answer 2:

It's a lot more complicated than a simple awk command, and I don't know whether the file would be too big for it, but you could try using an XSLT 2.0 stylesheet with xsl:result-document to produce all of your files.

One advantage of using XSLT over a regex is that it would cope better if the file format changes slightly, or if there are attributes on the nodes you want to split on.
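A minimal sketch of such a stylesheet, assuming the <text> fragments are wrapped in a single root element (here called corpus, invented for illustration) so that the input is well-formed XML; the output file names are likewise made up:

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Write each <text> element into its own numbered output file -->
  <xsl:template match="/">
    <xsl:for-each select="/corpus/text">
      <xsl:result-document href="file{position()}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Running it requires an XSLT 2.0 processor such as Saxon.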



Answer 3:

The following Perl program, found here: Split one file into multiple files based on delimiter

#!/usr/bin/perl
use strict;
use warnings;

# Copy lines into res.0.txt, res.1.txt, ..., starting a new
# output file after every closing </text> tag.
open(my $in, '<', 'file.txt') or die "file.txt: $!";
my $cur = 0;
open(my $out, '>', "res.$cur.txt") or die "res.$cur.txt: $!";
while (<$in>) {
    print $out $_;
    if (m{^</text>}) {    # the closing tag ends the current chunk
        close($out);
        $cur++;
        open($out, '>', "res.$cur.txt") or die "res.$cur.txt: $!";
    }
}
close($out);

It also seems to do the trick, with no cap on the number of output files, since each file is closed before the next one is opened.

Cheers.