Is there any automatic way to convert a piece of code from Python's old-style string formatting (using %) to the new style (using .format)? For example, consider the formatting of a PDB atom specification:
spec = "%-6s%5d %4s%1s%3s %1s%4d%1s %8.3f%8.3f%8.3f%6.2f%6.2f %2s%2s"
I've been converting some of these specifications by hand as needed, but this is both error-prone and time-consuming, as I have many such specifications.
The functionality of the two forms does not match up exactly, so there is no way you could automatically translate every % string into an equivalent {} string or (especially) vice versa.
Of course there is a lot of overlap, and many of the sub-parts of the two formatting languages are the same or very similar, so someone could write a partial converter (which could, e.g., raise an exception for non-convertible code).
For a small subset of the language like what you seem to be using, you could do it pretty trivially with a simple regex: every pattern starts with % and ends with one of [sdf], and something like {:\1\2} as a replacement pattern ought to be all you need.
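A minimal sketch of that regex idea, assuming only plain directives with an optional width and precision and no flags. The names `_SIMPLE` and `simple_convert` are made up for illustration, and the pattern is one possible reading of the suggestion above rather than a complete converter:

```python
import re

# Group 1 is the optional width/precision, group 2 is the type (s, d or f).
# Flags such as '-', mapping keys, '*' widths and %% are NOT handled here.
# Caveat: a width on %s (e.g. %4s) becomes {:4s}, which pads on the right,
# whereas % pads strings on the left.
_SIMPLE = re.compile(r'%(\d*(?:\.\d+)?)([sdf])')

def simple_convert(fmt):
    """Rewrite directives like %5d and %8.3f as {:5d} and {:8.3f}."""
    return _SIMPLE.sub(r'{:\1\2}', fmt)

old = "%5d %8.3f%6.2f"
new = simple_convert(old)
print(new)                                                  # {:5d} {:8.3f}{:6.2f}
print(old % (42, 1.5, 0.25) == new.format(42, 1.5, 0.25))   # True
```

For numeric fields the two forms agree. A directive like %-6s is left untouched by this pattern, and str.format would reject a - in a string field anyway, so the full PDB specification needs a little more than this.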
But why bother? Except as an exercise in writing parsers, what would be the benefit? The % operator is not deprecated, and using % with an existing % format string will obviously do at least as well as using format with a % format string converted to {}.
If you are looking at this as an exercise in writing parsers, I believe there's an incomplete example buried inside pyparsing.
Some differences that are hard to translate, off the top of my head:
- * for dynamic field width or precision; format has a similar feature, but does it differently.
- %(10)s, because format tries to interpret the key name as a number first, then falls back to a dict key.
- %(a[b])s, because format doesn't quote or otherwise separate the key from the rest of the field, so a variety of characters simply can't be used.
- %c takes integers or single-char strings; :c only integers.
- %r/%s/%a analogues are not part of the format string, but a separate part of the field (which also comes on the opposite side).
- %g and :g have slightly different cutoff rules.
- %a and !a don't do the exact same thing.
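A few of these differences can be seen directly in an interpreter. The snippet below is only an illustration with made-up values; it covers the dynamic width, %c, %r and numeric-key cases and does not try to demonstrate the %g or %a items:

```python
# Dynamic width: % reads it from the argument list before the value,
# while format takes it from a nested replacement field after the value.
dynamic_old = "%*d" % (6, 42)
dynamic_new = "{:{}d}".format(42, 6)
print(dynamic_old == dynamic_new)          # True, despite the swapped arguments

# %c accepts an int or a one-character string; :c accepts only an int.
print("%c%c" % (65, "B"))                  # AB
try:
    "{:c}".format("B")
except ValueError as exc:
    print("format rejected a str for :c:", exc)

# %r is a conversion type at the end of the directive; in format it is
# a !r conversion placed before the colon.
print("%10r" % "x" == "{!r:>10}".format("x"))   # True

# A numeric-looking mapping key is read by format as a positional index.
print("%(10)s" % {"10": "x"})              # x
try:
    "{10}".format(**{"10": "x"})
except IndexError as exc:
    print("format looked for positional argument 10:", exc)
```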
The actual differences aren't listed anywhere; you will have to dig them out by a thorough reading of the Format Specification Mini-Language vs. the printf-style String Formatting language.
The docs explain some of the differences. As far as I can tell (although I'm not very familiar with old-style format strings), the functionality of the new style is a superset of the functionality of the old style.
You'd have to do more tweaking to handle edge cases, but I think something simple like
re.sub(r'%(\w+)([sbcdoXnf...])', r'{:\1\2}', your_string)
would get you 90% of the way there. The remaining translation, going from things like %x to {0:x}, will be too complex for a regular expression to handle (without writing some ridiculously complex conditionals inside of your regex).
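One way to keep the regex itself simple is to pass a function to re.sub and put the conditionals there. The sketch below is an assumption-laden illustration, not a complete translator: it supports only an optional - flag, a width, a precision and the types s, d and f, raises ValueError for anything else, and the names `_TOKEN`, `_translate`, `convert` and the sample values are hypothetical.

```python
import re

# A literal brace, or a % followed by an optional supported directive.
# A bare % match means the directive is outside the supported subset.
_TOKEN = re.compile(
    r'(?P<brace>[{}])'
    r'|%(?:(?P<left>-?)(?P<width>\d*)(?P<prec>(?:\.\d+)?)(?P<kind>[sdf])'
    r'|(?P<pct>%))?'
)

def _translate(match):
    if match.group('brace'):
        return match.group('brace') * 2        # braces must be doubled for format
    if match.group('pct'):
        return '%'                             # %% is a literal percent sign
    kind, prec = match.group('kind'), match.group('prec')
    if kind is None or (kind == 'd' and prec):
        raise ValueError('unsupported directive at offset {}'.format(match.start()))
    width = match.group('width')
    if match.group('left'):
        align = '<'                            # %-6s: left-justify
    elif kind == 's' and width:
        align = '>'                            # % right-justifies strings, format does not
    else:
        align = ''
    if kind == 's':
        return '{!s:' + align + width + prec + '}'   # !s mirrors %s calling str()
    return '{:' + align + width + prec + kind + '}'

def convert(fmt):
    """Translate a simple %-format string to str.format syntax, or raise ValueError."""
    return _TOKEN.sub(_translate, fmt)

spec = "%-6s%5d %4s%1s%3s %1s%4d%1s %8.3f%8.3f%8.3f%6.2f%6.2f %2s%2s"
# Illustrative values only, not taken from a real PDB file.
values = ("ATOM", 1, "N", "", "MET", "A", 1, "", 27.34, 24.43, 2.614, 1.0, 9.67, "N", "")
print(convert(spec))
print(convert(spec).format(*values) == spec % values)   # True
```

Directives such as %(name)s, %*d or %x make `convert` raise instead of silently producing a wrong string, which seems the safer default when converting many specifications in bulk. Equivalence is only checked here for the argument types shown; for example, %d accepts a float while :d does not.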