I want to replace the value inside a tag with an equal number of X characters. For example:
1.
<Name> Jason </Name>
to
<Name> XXXXX </Name>
2. (note: no spaces)
<Name>Jim</Name>
to
<Name>XXX</Name>
3.
<Name Jason />
to
<Name XXXXX />
4.
<Name Jas />
to
<Name XXX />
The starting tag, the value, and the closing tag can each be on a different line:
5.
<Name>Jim
</Name>
to
<Name>XXX
</Name>
6.
<Name>
Jim
</Name>
to
<Name>
XXX
</Name>
7.
<Name
Jim
/>
to
<Name
XXX
/>
8.
<Name> Jason </Name> <Name> Ignacio </Name>
to
<Name> XXXXX </Name> <Name> XXXXXXX </Name>
9.
<Name> Jason Ignacio </Name>
to
<Name> XXXXX XXXXXXX </Name>
or
<Name> XXXXXXXXXXXXX </Name>
Either is fine.
I tried this, but it didn't work:
file=mylog.log
search_str="<Name>"
end_str="</Name>"
sed -i -E ':a; s/('"$search_str"'X*)[^X'"$end_str"']/\1X/; ta' "$file"
Please let me know how to do this in a bash script.
Update:
I also tried the following. It worked for cases 1 to 5, but not for cases 6 and 7.
sed -i -E '/<Name>/{:a; /<\/Name>/bb; n; ba; :b; s/(<Name>X*)[^X\<]/\1X/; tb; }' "$file"
sed -i -E '/<Name[[:space:]]/{:a; /\/>/bb; n; ba; :b; s/(<Name[[:space:]]X*)[^X\/]/\1X/; tb; }' "$file"
Try this Python script:
This code is compatible with either Python 2 or Python 3.
To make it work, you may need to install the BeautifulSoup module. On a Debian-like system:
Or, for Python 3:
Example
Let's consider this input file:
Let's run our script and observe the output:
Note that the names in <p> tags are left alone. The code only changes the names in <Name> tags.
Also, as per the design, Jim, Jason, and Jason Ignacio are changed to X's but other names are left alone. Even Ignacio, if it appears without an adjacent Jason, is left alone.
Provisional solution
This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there are one or more complete <Name>…</Name> entries and also a starting <Name> without the matching </Name> on the same line. Frankly, I'm not even sure how to start tackling that scenario.
The unhandled cases 3, 4, 7 are not valid XML, and I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full <Name>…</Name> version. I'm leaving that as an exercise for the reader (beware the < in the character class; it would need to become a /).
script.sed
The first line 'skips' processing of lines not containing <Name> (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering' except that there's a b to jump to the end of processing.
The new section is the /<Name>/,/<\/Name>/ code. This looks for <Name> on its own, and concatenates up to 4 lines until a </Name> is included in the pattern space. The two comment lines were used for debugging; they allowed me to see what was being treated as a unit. Except for the use of the label l2 in place of l1, the remainder is exactly the same as in the initial offering, since sed regexes already accommodate newlines.
This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.
data
A slightly extended data file.
No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements like <Name Value /> are converted into XML comments <!--Name Value /-->. The mapping actually isn't crucial; the opening part doesn't match <Name> (and the tail doesn't match </Name>) so they'd not be processed anyway.
Output
Initial offering
A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:
script.sed
That is pretty contorted, to be polite about it. It looks for <Name> followed by zero or more spaces. That can be followed by \(X[X[[:space:]]*\)\{0,1\}, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as \1 in the replacement. Then there's a single character that isn't an X, < or space, followed by zero or more of any character, zero or more spaces, and </Name>. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label : l1 and the conditional branch t l1. All that operates only on a line with both <Name> and </Name>.
data
Output
Note the replacement part way through the end. That line is going to cause headaches for anything more.
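For readers who find the repeat-until-no-match loop hard to picture, here is a rough Python rendering of the mechanism described for script.sed. It is only a sketch: the regex is my reading of the prose description (not a copy of the sed script), and the names STEP and mask_line are made up for illustration.

```python
import re

# One step: keep "<Name>", any spaces, and any run of X's/spaces already masked,
# then match the next character that is not X, "<" or whitespace.
STEP = re.compile(r"(<Name>\s*(?:X[X\s]*)?)[^X<\s](.*?</Name>)")

def mask_line(line):
    """Replace one character per pass until no pass changes anything."""
    while True:
        # Equivalent in spirit to the sed substitute followed by "t l1"
        line, count = STEP.subn(r"\1X\2", line, count=1)
        if count == 0:
            return line

if __name__ == "__main__":
    print(mask_line("<Name> Jason </Name> <Name> Ignacio </Name>"))
```

Like the sed version, this only works on a single line that contains both <Name> and </Name>; it does nothing for the split-line cases.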
I've not worked out how the script would handle the various split-line cases, beyond that it would almost certainly need to join lines until the </Name> is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.
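One way to sidestep the line-joining problem is to read the whole file into memory and let the regex span newlines. The following is a minimal Python sketch of that idea, not a replacement for a real XML parser. It assumes the file fits in memory and that the tags look exactly like the ones in the question (no attributes, no nesting); the names PAIRED, SELF_CLOSED, mask_value and mask_names are invented here.

```python
import re

# <Name>...</Name>, where the value may span several lines (cases 1, 2, 5, 6, 8, 9)
PAIRED = re.compile(r"(<Name>)(.*?)(</Name>)", re.DOTALL)
# The non-standard <Name value /> form (cases 3, 4, 7)
SELF_CLOSED = re.compile(r"(<Name\s)(.*?)(/>)", re.DOTALL)

def mask_value(match):
    # Replace every non-whitespace character of the value with X,
    # leaving spaces and newlines where they were.
    masked = re.sub(r"\S", "X", match.group(2))
    return match.group(1) + masked + match.group(3)

def mask_names(text):
    text = PAIRED.sub(mask_value, text)
    return SELF_CLOSED.sub(mask_value, text)

if __name__ == "__main__":
    sample = "<Name>\nJim\n</Name>\n<Name Jason />\n"
    print(mask_names(sample), end="")
```

To apply it to the log file from the question, you would read mylog.log as text, pass the contents to mask_names, and write the result back. An opening <Name> with no closing tag anywhere later in the file is simply left untouched, but an unclosed <Name followed by whitespace would be matched up to the next /> in the file, so this is only safe on input shaped like the examples.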