I want to replace the value inside a tag with an equal number of X characters. For example:
1.
<Name> Jason </Name>
to
<Name> XXXXX </Name>
2.
(see no space)
<Name>Jim</Name>
to
<Name>XXX</Name>
3.
<Name Jason />
to
<Name XXXXX />
4.
<Name Jas />
to
<Name XXX />
The starting tag, the value, and the closing tag can each be on a different line:
5.
<Name>Jim
</Name>
to
<Name>XXX
</Name>
6.
<Name>
Jim
</Name>
to
<Name>
XXX
</Name>
7.
<Name
Jim
/>
to
<Name
XXX
/>
8.
<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXXX </Name>
9.
<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
both are fine
I tried this, but it didn't work:
file=mylog.log
search_str="<Name>"
end_str="</Name>"
sed -i -E ':a; s/('"$search_str"'X*)[^X'"$end_str"']/\1X/; ta' "$file"
Please let me know how to do this in a bash script.
Update:
I also tried this. It worked for cases 1 to 5, but not for cases 6 and 7.
sed -i -E '/<Name>/{:a; /<\/Name>/bb; n; ba; :b; s/(<Name>X*)[^X\<]/\1X/; tb; }' "$file"
sed -i -E '/<Name[[:space:]]/{:a; /\/>/bb; n; ba; :b; s/(<Name[[:space:]]X*)[^X\/]/\1X/; tb; }' "$file"
Provisional solution
This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there is one or more complete <Name>…</Name> entries and also a starting <Name> without the matching </Name> on the same line. Frankly, I'm not even sure how to start tackling that scenario.
The unhandled cases 3, 4, 7 are not valid XML, and I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full <Name>…</Name> version. I'm leaving that as an exercise for the reader (beware the < in the character class; it would need to become a /).
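For comparison, cases 3, 4 and 7 can also be sketched outside sed. The following is a minimal Python illustration and not the sed mechanism suggested above. The function name mask_self_closing is hypothetical, and the sketch assumes that a <Name ... /> tag never carries real attributes and that the value never contains the sequence "/>".

```python
import re

# Matches "<Name", one whitespace character, then everything up to the
# nearest "/>". DOTALL lets the value span several lines (case 7).
SELF_CLOSING = re.compile(r"(<Name\s)(.*?)(/>)", re.DOTALL)


def mask_self_closing(text):
    """Replace every non-whitespace character of the value with X."""
    def repl(match):
        head, value, tail = match.groups()
        return head + re.sub(r"\S", "X", value) + tail
    return SELF_CLOSING.sub(repl, text)


if __name__ == "__main__":
    print(mask_self_closing("<Name Jason />"))
    print(mask_self_closing("<Name\nJim\n/>"))
```

Because the pattern requires whitespace straight after <Name, ordinary <Name>…</Name> elements are not touched by this sketch.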
script.sed
/<Name>/! b
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
b
}
/<Name>/,/<\/Name>/{
# Handle up to 4 lines to the end-name tag
/<\/Name>/! N
/<\/Name>/! N
/<\/Name>/! N
/<\/Name>/! N
# s/^/ZZ/; s/$/AA/p
# s/^ZZ//; s/AA$//
: l2
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l2
}
The first line 'skips' processing of lines not containing <Name> (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering' except that there's a b to jump to the end of processing.
The new section is the /<Name>/,/<\/Name>/ code. This looks for <Name> on its own, and concatenates up to 4 lines until a </Name> is included in the pattern space. The two comment lines were used for debugging; they allowed me to see what was being treated as a unit. Except for the use of the label l2 in place of l1, the remainder is exactly the same as in the initial offering, because sed regexes already accommodate newlines.
This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.
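The XML-parser route might look roughly like this in Python. This is only a sketch: it uses the standard library's xml.etree.ElementTree rather than any particular third-party parser, mask_names is a hypothetical name, and the input is assumed to be well-formed XML with a single root element. The data file shown here has no root element, so it would need a wrapper, and cases 3, 4 and 7 would have to be dealt with separately.

```python
import re
import xml.etree.ElementTree as ET


def mask_names(xml_text):
    """Mask the text of every <Name> element, keeping whitespace as it is."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("Name"):
        if elem.text:
            # Each non-whitespace character becomes one X, so spaces and
            # newlines inside the element keep their positions.
            elem.text = re.sub(r"\S", "X", elem.text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<page>\n<Name> Jason </Name>\n<Name>\nJim\n</Name>\n</page>"
    print(mask_names(sample))
```

Since the parser works on the whole document, line breaks between the tags and the value need no special handling.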
data
A slightly extended data file.
<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name>
Jim
</Name>
<Name> Jason
Bourne </Name>
<Name>
Jason
Bourne
</Name>
<Name> Elijah </Name>
<Name>
Dennis
</Name>
<Name> Elijah
Wood </Name>
<Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name>
<Name>Dennis The
Menace</Name>
<Name> Jason </Name>
to
<Name> XXXXX </Name>
2. (see no space)
<Name>Jim</Name>
to
<Name>XXX</Name>
3.
<!--Name Jason /-->
to
<!--Name XXXXX /-->`
4.
<!--Name Jas /-->
to
<!--Name XXX /-->
starting tag, value and closing tag can all come in different line
5.
<Name>Jim
</Name>
to
<Name>XXX
</Name>
6.
<Name>
Jim
</Name>
to
<Name>
XXX
</Name>
7.
<!--Name
Jim
/-->
to
<!--Name
XXX
/-->
8.
<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>
9.
<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements like <Name Value /> are converted into XML comments <!--Name Value /-->. The mapping actually isn't crucial; the opening part doesn't match <Name> (and the tail doesn't match </Name>) so they'd not be processed anyway.
Output
$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> XXXXX
</Name>
<Name>
XXX</Name>
<Name>
XXX
</Name>
<Name> XXXXX
XXXXXX </Name>
<Name>
XXXXX
XXXXXX
</Name>
<Name> XXXXXX </Name>
<Name>
XXXXXX
</Name>
<Name> XXXXXX
XXXX </Name>
<Name> XXXXXX
XXX XXXXXX </Name>
<Name>XXXXXX
XXXX</Name>
<Name>XXXXXX XXX
XXXXXX</Name>
<Name> XXXXX </Name>
to
<Name> XXXXX </Name>
2. (see no space)
<Name>XXX</Name>
to
<Name>XXX</Name>
3.
<!--Name Jason /-->
to
<!--Name XXXXX /-->`
4.
<!--Name Jas /-->
to
<!--Name XXX /-->
starting tag, value and closing tag can all come in different line
5.
<Name>XXX
</Name>
to
<Name>XXX
</Name>
6.
<Name>
XXX
</Name>
to
<Name>
XXX
</Name>
7.
<!--Name
Jim
/-->
to
<!--Name
XXX
/-->
8.
<Name> XXXXX </Name> <Name> XXXXXXX </Name>
to
<Name> XXXXX </Name> <Name> XXXXXX </Name>
9.
<Name> XXXXX XXXXXXX </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
$
Initial offering
A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:
script.sed
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
}
That is pretty contorted, to be polite about it. It looks for <Name> followed by zero or more spaces. That can be followed by \(X[X[[:space:]]*\)\{0,1\}, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as \1 in the replacement. Then there's a single character that isn't an X, < or space, followed by zero or more of any character, zero or more spaces, and </Name>; that tail is captured as \3. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label : l1 and the conditional branch t l1. All that operates only on a line with both <Name> and </Name>.
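To make the one-character-per-pass loop easier to follow, here is a rough Python transliteration of the same idea. It is an illustration only: mask_line is a hypothetical name, \s stands in for [[:space:]], and the inner sed group is written as a non-capturing group, so the tail is \2 here instead of \3.

```python
import re

# Group 1: "<Name>", optional whitespace, and any X's already written.
# Then the single character to replace (not X, not "<", not whitespace).
# Group 2: the rest of the line up to a "</Name>".
STEP = re.compile(r"(<Name>\s*(?:X[X\s]*)?)[^X<\s](.*\s*</Name>)")


def mask_line(line):
    """Replace one character per pass until nothing matches (like 't l1')."""
    while True:
        line, count = STEP.subn(r"\1X\2", line, count=1)
        if count == 0:
            return line


if __name__ == "__main__":
    print(mask_line("<Name> Jason </Name>"))
    print(mask_line("<Name> Elijah </Name> <Name> Dennis </Name>"))
```

As with the sed script, a line that has <Name> but no </Name> is returned unchanged.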
data
<Name> Jason </Name>
<Name>Jim</Name>
<Name> Jason Bourne </Name>
<Name> Elijah </Name> <Name> Dennis </Name>
<Name> Elijah Wood </Name> <Name> Dennis The Menace </Name>
<Name>Elijah Wood</Name> <Name>Dennis The Menace</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> Elijah </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>
Output
$ sed -f script.sed data
<Name> XXXXX </Name>
<Name>XXX</Name>
<Name> XXXXX XXXXXX </Name>
<Name> XXXXXX </Name> <Name> XXXXXX </Name>
<Name> XXXXXX XXXX </Name> <Name> XXXXXX XXX XXXXXX </Name>
<Name>XXXXXX XXXX</Name> <Name>XXXXXX XXX XXXXXX</Name>
<Name> Jason
</Name>
<Name>
Jim</Name>
<Name> Jason
Bourne </Name>
<Name> XXXXXX </Name> <Name> Dennis
</Name>
<Name> Elijah
Wood </Name> <Name> Dennis
The Menace </Name>
<Name>Elijah
Wood</Name> <Name>Dennis The
Menace</Name>
$
Note the replacement partway through the second half of the output: the complete <Name> Elijah </Name> element is masked, while the <Name> Dennis that opens on the same line is left alone. That line is going to cause headaches for anything more ambitious.
I've not worked out how the script would handle the various split-line cases, beyond noting that it would almost certainly need to join lines until the </Name> is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.
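One way to sidestep the line-joining problem, sketched here in Python and not in sed, is to read the whole input as a single string and let the pattern cross newlines. The name mask_text is hypothetical, and the sketch assumes the file fits in memory and that <Name> elements are never nested.

```python
import re

# Non-greedy match from each "<Name>" to the nearest "</Name>";
# DOTALL allows the value to contain newlines.
NAME_ELEMENT = re.compile(r"(<Name>)(.*?)(</Name>)", re.DOTALL)


def mask_text(text):
    """Mask the value of every <Name>...</Name> pair in the whole text."""
    def repl(match):
        start, value, end = match.groups()
        # Whitespace (including newlines) is kept; everything else becomes X.
        return start + re.sub(r"\S", "X", value) + end
    return NAME_ELEMENT.sub(repl, text)


if __name__ == "__main__":
    print(mask_text("<Name> Elijah </Name> <Name> Dennis\n</Name>"))
```

Under those assumptions the awkward line with one complete element and one element left open is handled the same way as any other.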
Try this python script:
$ cat script.py
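#!/usr/bin/python
import re
from bs4 import BeautifulSoup

# Parse the input as XML (the "xml" feature requires the lxml package).
with open('allcases') as f:
    soup = BeautifulSoup(f, features="xml")

# Only text inside <Name> tags is touched.
for tag in soup.findAll('Name'):
    # Longest name first, so 'Jason Ignacio' is masked as one unit.
    for name in 'Jason Ignacio', 'Jason', 'Jim':
        tag.string = re.sub(r'\b%s\b' % name, len(name)*'X', tag.string)

print(str(soup))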
This code is compatible with either python2 or python3.
To make it work, you may need to install the BeautifulSoup module. On a debian-like system:
apt-get install python-bs4
Or, for python3:
apt-get install python3-bs4
Example
Let's consider this input file:
$ cat allcases
<page>
<p>Jason</p>
<Name> Jason </Name>
<p>Jason</p>
<Name>Jim</Name>
<p>Jim</p>
<Name>Jim
</Name>
<Name>
Jim
</Name>
<Name> Jason </Name> <Name> Ignacio </Name>
<Name> Jason Ignacio </Name>
</page>
Let's run our script and observe the output:
$ python script.py
<?xml version="1.0" encoding="utf-8"?>
<page>
<p>Jason</p>
<Name> XXXXX </Name>
<p>Jason</p>
<Name>XXX</Name>
<p>Jim</p>
<Name>XXX
</Name>
<Name>
XXX
</Name>
<Name> XXXXX </Name> <Name> Ignacio </Name>
<Name> XXXXXXXXXXXXX </Name>
</page>
Note that the names in <p> tags are left alone. The code only changes the names in <Name> tags.
Also, as per the design, Jim, Jason, and Jason Ignacio are changed to X's but other names are left alone. Even Ignacio, if it appears without an adjacent Jason, is left alone.
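The word-boundary substitution can be tried on its own, without BeautifulSoup. In this small sketch the names mask_known and NAMES are hypothetical, and re.escape is added as a precaution that the script above does not use.

```python
import re

# Longest name first: once 'Jason Ignacio' has been turned into X's,
# the shorter 'Jason' pattern no longer finds anything to match there.
NAMES = ('Jason Ignacio', 'Jason', 'Jim')


def mask_known(text):
    """Replace each listed name, matched as whole words, with X's."""
    for name in NAMES:
        text = re.sub(r'\b%s\b' % re.escape(name), 'X' * len(name), text)
    return text


if __name__ == "__main__":
    print(mask_known(" Jason Ignacio "))
    print(mask_known(" Ignacio "))
```

The order of the tuple matters: with 'Jason' listed first, ' Jason Ignacio ' would come out as ' XXXXX Ignacio '.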