I'm writing a utility (which happens to be in Python) that generates its output in the form of a TCL script. Given some arbitrary string variable (not unicode) in the Python, I want to produce a TCL line like
set s something
... which will set the TCL variable 's' to that exact string, regardless of what strange characters are in it. Without getting too weird, I don't want to make the output messier than needed. I believe a decent approach is:
1. if the string is not empty and contains only alphanumerics and some characters like . - _ (but definitely not $ " { } \), then it can be used as-is;
2. if it contains only printable characters and no double quotes or curly braces (and does not end in a backslash), then simply put {} around it;
3. otherwise, put "" around it after using \ escapes for " { } \ $ [ ], and \nnn escapes for non-printing characters.
Question: is that the full set of characters which need escaping inside double quotes? I can't find this in the docs. And did I miss something (I almost missed that strings for (2) can't end in \ for instance).
I know there are many other strings which can be quoted by {}, but it seems difficult to identify them easily. Also, it looks like non-printing characters (in particular, newline) are OK with (2) if you don't mind them being literally present in the TCL output.
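For experimenting, the three rules above can be sketched in Python 3 roughly as follows. This is only a sketch of the proposal as stated, not a verified quoting routine: the names tcl_word and tcl_set are made up for illustration, "printable" is assumed to mean ASCII space through tilde, and the input is assumed to contain only code points up to 0xFF (one character per byte).

```python
import re

# Rule 1: non-empty, only ASCII alphanumerics plus . - _ (an assumption).
_BARE = re.compile(r'[A-Za-z0-9._-]+')
# Rule 3: characters that get a backslash escape inside double quotes.
_BACKSLASHED = set('"{}\\$[]')


def _is_printable(ch):
    # Assumption: "printable" means ASCII space through tilde.
    return ' ' <= ch <= '~'


def tcl_word(s):
    """Return s quoted as a single Tcl word, using the mildest rule that fits."""
    # Rule 1: usable as-is.
    if _BARE.fullmatch(s):
        return s
    # Rule 2: printable, no quotes or braces, no trailing backslash.
    if (all(_is_printable(c) for c in s)
            and not any(c in '"{}' for c in s)
            and not s.endswith('\\')):
        return '{' + s + '}'
    # Rule 3: double quotes with backslash and three-digit octal escapes.
    out = []
    for c in s:
        if c in _BACKSLASHED:
            out.append('\\' + c)
        elif _is_printable(c):
            out.append(c)
        elif ord(c) <= 0xFF:
            out.append('\\%03o' % ord(c))
        else:
            raise ValueError('code point above 0xFF not handled: %r' % c)
    return '"' + ''.join(out) + '"'


def tcl_set(name, value):
    """Build the 'set name value' line; name is assumed to be a plain word."""
    return 'set %s %s' % (name, tcl_word(value))
```

Always emitting three octal digits keeps a following digit in the data from being absorbed into the escape.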
Tcl has very few metacharacters once you're inside a double-quoted string, and all of them can be quoted by putting a backslash in front of them. The characters you must quote are \ itself, $ and [ (plus " itself, since it would otherwise end the string), but it's considered good practice to also quote ], { and } so that the script itself is embeddable. (Tcl's own list command does this, except that it doesn't actually wrap the double quotes so it also handles backslashes and it will also try to use other techniques on “nice” strings. There's an algorithm for doing this, but I advise not bothering with that much complexity in your code; simple universal rules are much better for correct coding.)

The second step is to get the data into Tcl. If you are generating a file, your best option is to write it as UTF-8 and use the -encoding option to tclsh/wish or to the source command to explicitly state what the encoding is. (If you're inside the same process, write UTF-8 data into a string and evaluate that. Job Done.) That option (introduced in Tcl 8.5) is specifically for dealing with this sort of problem.

If that's not possible, you're going to have to fall back to adding additional quoting. The best thing is to then assume you've only got ASCII support available (a good lowest common denominator) and quote everything else as a separate step to the quoting described in the first paragraph. To quote, convert every Unicode character from U+0080 up to an escape sequence of the form \uXXXX, where XXXX are exactly four hex digits[1] and the other two are literal characters. Don't use the \xXX form, as that has some “surprising” misfeatures (alas).

[1] There's an open bug in Tcl about handling characters outside the Basic Multilingual Plane, part of which is that the \u form isn't able to cope. Fortunately, non-BMP characters are still reasonably rare in practice.

You really only need 2 rules,
You don't need to worry about newlines, non printable characters etc. They are valid in a literal string, and TCL has excellent Unicode support.
Edit: In light of your comment, you can do the following: escape [], {} and $, then wrap the output like this:
set s [subst { $output } ]
The beauty of Tcl is that it has a very simple grammar. No characters other than the 3 above need to be escaped.
Edit 2: One last try.
If you pass subst some options, you will only need to escape \ and {}:
set s [subst -nocommands -novariables { $output } ]
You would need to come up with a regex to convert non-printable characters to their escaped codes, however.
Good luck!
To do it right you should also specify the encoding your python string is in, typically sys.getdefaultencoding(). Otherwise you might garble encodings when translating it to Tcl.
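As a rough illustration of making the encoding explicit, here is a minimal Python 3 sketch. It assumes the input arrives as bytes in a known encoding (latin-1 below is just an example), and the helper names decode_input and write_tcl_script are hypothetical. The script is always written as UTF-8, so it can be read back with the -encoding option mentioned in the other answer.

```python
def decode_input(raw, encoding):
    """Turn raw bytes into text, failing loudly if the encoding is wrong."""
    return raw.decode(encoding, errors='strict')


def write_tcl_script(path, lines):
    """Write Tcl source lines to path, always encoded as UTF-8."""
    # newline='\n' stops Python from translating line endings on Windows.
    with open(path, 'w', encoding='utf-8', newline='\n') as fh:
        for line in lines:
            fh.write(line + '\n')


# Example use (the value here needs nothing more than braces):
#   text = decode_input(b'caf\xe9', 'latin-1')
#   write_tcl_script('example.tcl', ['set s {%s}' % text])
# and then run it with:
#   tclsh -encoding utf-8 example.tcl
```

Decoding with errors='strict' means a wrong guess about the input encoding raises an exception instead of silently producing garbled text in the Tcl script.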
If you have binary data in your string and want Tcl binary strings as a result this will always work:
Will look like a hex dump though, but well, it is a hex dump...
If you use any special encoding like UTF-8 you can enhance that a bit by using encoding convertfrom/convertto and the appropriate Python idiom.
You can of course refine this a bit, avoiding the \u encoding of all the non special chars, but the above is safe in any case.
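A minimal Python 3 sketch of the "escape everything as \u" idea and of the refinement mentioned above might look like this. Both function names and the SAFE character set are assumptions made for illustration; the refined version refuses characters outside the Basic Multilingual Plane instead of guessing how to encode them.

```python
# Characters left untouched by the refined variant (an arbitrary, cautious choice).
SAFE = set('abcdefghijklmnopqrstuvwxyz'
           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
           '0123456789 ._-')


def tcl_escape_all(data):
    # Quote bytes as a Tcl double-quoted word, one \uXXXX escape per byte.
    return '"' + ''.join('\\u%04x' % b for b in data) + '"'


def tcl_escape_unsafe(text):
    # Same idea for text, but characters in SAFE are emitted literally.
    out = []
    for ch in text:
        code = ord(ch)
        if ch in SAFE:
            out.append(ch)
        elif code <= 0xFFFF:
            out.append('\\u%04x' % code)   # exactly four hex digits
        else:
            raise ValueError('non-BMP character U+%X not handled' % code)
    return '"' + ''.join(out) + '"'


# Example use:
#   print('set s ' + tcl_escape_all(b'A$\x00'))
#   print('set s ' + tcl_escape_unsafe('a b$c'))
```

The first function gives the hex-dump look described above; the second keeps ordinary text readable while still escaping every character that Tcl could treat specially.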