change string in file between two strings with cha

Provisional solution

This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there is one or more complete <Name>…</Name> entries and also a starting <Name> without the matching </Name> on the same line. Frankly, I'm not even sure how to start tackling that scenario.

The unhandled cases 3, 4, 7 are not valid XML — I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full <Name>…</Name> version. I'm leaving that as an exercise for the reader (beware the < in the character class — it would need to become a /).

`script.sed`

/<Name>/! b
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
b
}
/<Name>/,/<\/Name>/{
  # Handle up to 4 lines to the end-name tag
  /<\/Name>/! N
  /<\/Name>/! N
  /<\/Name>/! N
  /<\/Name>/! N
# s/^/ZZ/; s/$/AA/p
# s/^ZZ//; s/AA$//
  : l2
  s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
  t l2
}

The first line 'skips' processing of lines not containing <Name> (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering' except that there's a b to jump to the end of processing.

The new section is the /<Name>/,/<\/Name>/ code. This looks for <Name> on its own, and concatenates up to 4 lines until a </Name> is included in the pattern space. The two comment lines were used for debugging — they allowed me to see what was being treated as a unit. Except for the use of the label l2 in place of l1, the remainder is exactly the same as in the initial offering — sed regexes already accommodate newlines.
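As a rough illustration of the join-then-mask idea, here is a Python sketch rather than the sed script itself. The function names `mask` and `process` and the `max_extra` parameter are made up for this example, and the logic only approximates sed's range semantics. It appends up to 4 following lines until `</Name>` appears, then applies the same one-character-at-a-time substitution, with `re.DOTALL` standing in for the fact that the sed pattern space can contain newlines.

```python
import re

# Python rendering of the sed substitution; DOTALL lets '.' span newlines,
# as it does inside sed's pattern space.
MASK = re.compile(r"(<Name>\s*(X[X\s]*)?)[^X<\s](.*\s*</Name>)", re.DOTALL)

def mask(text):
    """Replace one name character per pass until nothing is left to replace."""
    while True:
        text, n = MASK.subn(r"\1X\3", text, count=1)
        if n == 0:
            return text

def process(lines, max_extra=4):
    """Join an opening <Name> line with up to max_extra following lines."""
    out = []
    it = iter(lines)
    for line in it:
        if "<Name>" in line and not re.search(r"<Name>.*</Name>", line):
            for _ in range(max_extra):
                if "</Name>" in line:
                    break
                nxt = next(it, None)
                if nxt is None:      # ran out of input before </Name>
                    break
                line += "\n" + nxt
        out.append(mask(line))
    return "\n".join(out)

print(process(["<Name>", "    Jim", "        </Name>", "<p>Jim</p>"]))
```

The three-line element comes out with `Jim` replaced by `XXX` and its indentation intact, while the `<p>` line is untouched.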

This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.
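For comparison, a minimal sketch of the parser-based route in Python, using only the standard library's `xml.etree.ElementTree`. It assumes the input is well-formed XML with a single root element and that each `<Name>` element contains only text; the function name `mask_name_elements` is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

def mask_name_elements(xml_text):
    """Replace every non-whitespace character inside <Name> elements with X."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("Name"):
        if elem.text:
            # \S leaves spaces and newlines alone, so the layout is preserved.
            elem.text = re.sub(r"\S", "X", elem.text)
    return ET.tostring(root, encoding="unicode")

doc = "<page><p>Jason</p><Name> Jason\nBourne </Name></page>"
print(mask_name_elements(doc))
```

Because the parser finds the element boundaries, split lines and several elements on one line need no special handling here.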

`data`

A slightly extended data file.

<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
        </Name>
<Name>
    Jim</Name>
<Name>
    Jim
        </Name>
<Name> Jason
Bourne </Name>
<Name> 
    Jason
        Bourne
            </Name>
<Name> Elijah </Name>
<Name>
Dennis
</Name>
<Name> Elijah
Wood </Name>
            <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name>
    <Name>Dennis The
Menace</Name>



<Name> Jason </Name>
to
<Name> XXXXX </Name>

2. (see no space)

 <Name>Jim</Name>
 to
 <Name>XXX</Name>

3.

<!--Name Jason /--> 
to 
<!--Name XXXXX /-->`

4.

<!--Name Jas /-->
to
<!--Name XXX /-->

starting tag, value and closing tag can all come in different line

5.

<Name>Jim
</Name>
to
<Name>XXX
</Name>

6.

<Name>
     Jim
       </Name>
to
<Name>
     XXX
       </Name>

7.

  <!--Name
     Jim
       /-->
to
  <!--Name
     XXX
       /-->

8.

<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>

9.

<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>

No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements like <Name Value /> are converted into XML comments. The mapping actually isn't crucial; the opening part doesn't match <Name> (and the tail doesn't match </Name>) so they'd not be processed anyway.

Output

$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> XXXXX
        </Name>
<Name>
    XXX</Name>
<Name>
    XXX
        </Name>
<Name> XXXXX
XXXXXX </Name>
<Name> 
    XXXXX
        XXXXXX
            </Name>
<Name> XXXXXX </Name>
<Name>
XXXXXX
</Name>
<Name> XXXXXX
XXXX </Name>
            <Name> XXXXXX
XXX XXXXXX </Name>
<Name>XXXXXX
XXXX</Name>
    <Name>XXXXXX XXX
XXXXXX</Name>



<Name> XXXXX </Name>
to
<Name> XXXXX </Name>

2. (see no space)

 <Name>XXX</Name>
 to
 <Name>XXX</Name>

3.

<!--Name Jason /--> 
to 
<!--Name XXXXX /-->`

4.

<!--Name Jas /-->
to
<!--Name XXX /-->

starting tag, value and closing tag can all come in different line

5.

<Name>XXX
</Name>
to
<Name>XXX
</Name>

6.

<Name>
     XXX
       </Name>
to
<Name>
     XXX
       </Name>

7.

  <!--Name
     Jim
       /-->
to
  <!--Name
     XXX
       /-->

8.

<Name> XXXXX </Name> <Name> XXXXXXX </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>

9.

<Name> XXXXX XXXXXXX </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
$

Initial offering

A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:

`script.sed`

/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
}

That is pretty contorted, to be polite about it. It looks for <Name> followed by zero or more spaces. That can be followed by \(X[X[[:space:]]*\)\{0,1\}, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as \1 in the replacement. Then there's a single character that isn't an X, < or space, followed by zero or more any characters, zero or more spaces, and </Name>. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label : l1 and the conditional branch t l1. All that operates only on a line with both <Name> and </Name>.
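To make the loop easier to follow, here is the same substitution transliterated into Python's `re` syntax. This is only a sketch: the function name `mask_names` is invented, and `\s` stands in for `[[:space:]]`. Each pass replaces exactly one character, which is what the `: l1` / `t l1` pair does in sed.

```python
import re

# Group 1: <Name>, optional spaces, and any run of X's/spaces already masked.
# Then one character that still needs masking, then the tail up to </Name>.
PATTERN = re.compile(r"(<Name>\s*(X[X\s]*)?)[^X<\s](.*\s*</Name>)")

def mask_names(line):
    while True:
        line, n = PATTERN.subn(r"\1X\3", line, count=1)
        if n == 0:          # nothing replaced: equivalent to 't l1' not branching
            return line

print(mask_names("<Name> Jason </Name>"))
print(mask_names("<Name> Jason </Name> <Name> Ignacio </Name>"))
```

The first call goes through `Xason`, `XXson`, and so on until the name is `XXXXX`. In the second, the first element is finished before the regex can match at the second `<Name>`.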

`data`

<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> Elijah </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>

Output

$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> XXXXXX </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>
$

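Note the single replacement part way through the otherwise unchanged lines at the end: the complete <Name> Elijah </Name> is masked, but the <Name> Dennis that continues onto the next line is not. That line is going to cause headaches for anything more.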

I've not worked out how the script would handle the various split-line cases, beyond noting that it would almost certainly need to join lines until the </Name> is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.

Try this python script:

$ cat script.py
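#!/usr/bin/python
import re
from bs4 import BeautifulSoup

# Parse the input file as XML.
with open('cases') as f:
    soup = BeautifulSoup(f, features="xml")

# Mask only the listed names, and only inside <Name> elements.
# The longest name comes first so 'Jason Ignacio' is masked as one unit.
for tag in soup.find_all('Name'):
    for name in 'Jason Ignacio', 'Jason', 'Jim':
        tag.string = re.sub(r'\b%s\b' % name, len(name)*'X', tag.string)
print(str(soup))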

This code is compatible with either python2 or python3.

To make it work, you may need to install the BeautifulSoup module (its "xml" parser also relies on lxml being installed). On a debian-like system:

apt-get install python-bs4

Or, for python3:

apt-get install python3-bs4

Example

Let's consider this input file:

$ cat cases
<page>
<p>Jason</p>
<Name> Jason </Name>
<p>Jason</p>
 <Name>Jim</Name>
<p>Jim</p>
<Name>Jim
</Name>
<Name>
     Jim
       </Name>
<Name> Jason </Name> <Name> Ignacio </Name>
<Name> Jason Ignacio </Name>
</page>

Let's run our script and observe the output:

$ python script.py
<?xml version="1.0" encoding="utf-8"?>
<page>
<p>Jason</p>
<Name> XXXXX </Name>
<p>Jason</p>
<Name>XXX</Name>
<p>Jim</p>
<Name>XXX
</Name>
<Name>
     XXX
       </Name>
<Name> XXXXX </Name> <Name> Ignacio </Name>
<Name> XXXXXXXXXXXXX </Name>
</page>

Note that the names in <p> tags are left alone. The code only changes the names in <Name> tags.

Also, as per the design, Jim, Jason, and Jason Ignacio are changed to X's but other names are left alone. Even Ignacio, if it appears without an adjacent Jason, is left alone.
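The order of the names in the loop matters, which the following standalone sketch tries to show on plain strings, without BeautifulSoup. The helper name `mask_listed` is invented here, and `re.escape` is an addition that the script above does not use.

```python
import re

def mask_listed(text, names):
    """Replace each listed name, matched on word boundaries, with X's."""
    for name in names:
        text = re.sub(r"\b%s\b" % re.escape(name), "X" * len(name), text)
    return text

names = ["Jason Ignacio", "Jason", "Jim"]
print(mask_listed(" Jason Ignacio ", names))        # whole name masked as one unit
print(mask_listed(" Jason Ignacio ", names[::-1]))  # only "Jason" is masked
print(mask_listed(" Ignacio ", names))              # unchanged
```

With the shorter name tried first, `Jason` is already X's by the time the two-word pattern is applied, so `Ignacio` survives. That is why the longest name is listed first.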