Note: This question is for informational purposes only. I am interested to see how deep into Python's internals it is possible to go with this.
Not very long ago, a discussion began inside a certain question regarding whether the strings passed to print statements could be modified after/during the call to print has been made. For example, consider the function:
def print_something():
    print('This cat was scared.')
Now, when print is run, the output to the terminal should display:
This dog was scared.
Notice the word "cat" has been replaced by the word "dog". Something somewhere somehow was able to modify those internal buffers to change what was printed. Assume this is done without the original code author's explicit permission (hence, hacking/hijacking).
This comment from the wise @abarnert, in particular, got me thinking:
There are a couple of ways to do that, but they're all very ugly, and should never be done. The least ugly way is to probably replace the code object inside the function with one with a different co_consts list. Next is probably reaching into the C API to access the str's internal buffer. [...]
So, it looks like this is actually possible.
Here's my naive way of approaching this problem:
>>> import inspect
>>> exec(inspect.getsource(print_something).replace('cat', 'dog'))
>>> print_something()
This dog was scared.
Of course, exec is bad, but that doesn't really answer the question, because it does not actually modify anything during or after the call to print.
How would it be done as @abarnert has explained it?
Monkey-patch print
print is a builtin function, so it will use the print function defined in the builtins module (or __builtin__ in Python 2). So whenever you want to modify or change the behavior of a builtin function, you can simply reassign the name in that module. This process is called monkey-patching.
Once the name has been reassigned to a replacement function (called custom_print here), every print call will go through custom_print, even if the print is in an external module.
However, you don't really want to print additional text; you want to change the text that is printed. One way to go about that is to replace it in the string that would be printed:
And indeed if you run:
Or if you write that to a file:
test_file.py
and import it:
So it really works as intended.
However, in case you only temporarily want to monkey-patch print you could wrap this in a context-manager:
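A sketch of such a context manager, assuming Python 3. The name replaced_print and its old/new parameters are made up for illustration:

```python
import builtins
import contextlib
import io


@contextlib.contextmanager
def replaced_print(old, new):
    """Temporarily swap builtins.print for a version that rewrites text."""
    original_print = builtins.print

    def custom_print(*args, **kwargs):
        args = tuple(
            arg.replace(old, new) if isinstance(arg, str) else arg
            for arg in args
        )
        original_print(*args, **kwargs)

    builtins.print = custom_print
    try:
        yield
    finally:
        # Always restore the real print, even if the body raised.
        builtins.print = original_print


def print_something():
    print('This cat was scared.')


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    with replaced_print('cat', 'dog'):
        print_something()  # inside the block: patched
    print_something()      # outside the block: original behavior
output_lines = buffer.getvalue().splitlines()
```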
So when you run that, what is printed depends on the context:
So that's how you could "hack" print by monkey-patching.
Modify the target instead of the print
If you look at the signature of print you'll notice a file argument which is sys.stdout by default. Note that this is a dynamic default argument (it really looks up sys.stdout every time you call print) and not like normal default arguments in Python. So if you change sys.stdout, print will actually print to the different target. Even more conveniently, Python also provides a redirect_stdout function (from Python 3.4 on, but it's easy to create an equivalent function for earlier Python versions).
The downside is that it won't work for print statements that don't print to sys.stdout, and that creating your own stdout isn't really straightforward.
However this also works:
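One possible shape for such a stand-in stdout, assuming that an object with write and flush methods is enough for print (it is for plain calls). ReplacingStream is a hypothetical name, and the sketch only catches text that arrives within a single write call:

```python
import contextlib
import io


class ReplacingStream:
    """Minimal file-like wrapper that rewrites text on its way through."""

    def __init__(self, target, old, new):
        self._target = target
        self._old = old
        self._new = new

    def write(self, text):
        # print() calls write() once per argument, separator and line ending.
        return self._target.write(text.replace(self._old, self._new))

    def flush(self):
        self._target.flush()


sink = io.StringIO()  # stands in for the real terminal in this sketch
with contextlib.redirect_stdout(ReplacingStream(sink, 'cat', 'dog')):
    print('This cat was scared.')
result = sink.getvalue()
```

To affect the real terminal, wrap sys.stdout instead of a StringIO.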
Summary
Some of these points have already been mentioned by @abarnert, but I wanted to explore these options in more detail. Especially how to modify it across modules (using builtins/__builtin__) and how to make that change only temporary (using context managers).
A simple way to capture all output from a print function and then process it is to change the output stream to something else, e.g. a file.
I'll use PHP naming conventions (ob_start, ob_get_contents, ...)
Usage:
Would print
Let's combine this with frame introspection!
You'll find this trick prefaces every greeting with the calling function or method. This might be very useful for logging or debugging; especially as it lets you "hijack" print statements in third party code.
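One way this could look, assuming CPython (sys._getframe is an implementation detail). The class name CallerTaggingStream is invented here. It relies on the builtin print having no Python frame of its own, so the frame above write() is the function that called print:

```python
import contextlib
import io
import sys


class CallerTaggingStream:
    """File-like wrapper that prefixes each line with the caller's name."""

    def __init__(self, target):
        self._target = target
        self._at_line_start = True

    def write(self, text):
        if not text:
            return 0
        if self._at_line_start:
            # Frame 0 is write() itself; frame 1 is whoever called print().
            caller_name = sys._getframe(1).f_code.co_name
            self._target.write('[{}] '.format(caller_name))
        self._at_line_start = text.endswith('\n')
        return self._target.write(text)

    def flush(self):
        self._target.flush()


def greet():
    print('Hello, world!')


sink = io.StringIO()
with contextlib.redirect_stdout(CallerTaggingStream(sink)):
    greet()
tagged = sink.getvalue()
```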
First, there's actually a much less hacky way. All we want to do is change what print prints, right?
Or, similarly, you can monkeypatch sys.stdout instead of print.
Also, nothing wrong with the exec … getsource … idea. Well, of course there's plenty wrong with it, but less than what follows here…
But if you do want to modify the function object's code constants, we can do that.
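As a hedged sketch of the idea, here is a version that assumes CPython 3.8 or later and leans on the CodeType.replace method rather than calling the CodeType constructor by hand. The helper name patch_string_constants is hypothetical:

```python
import contextlib
import io


def print_something():
    print('This cat was scared.')


def patch_string_constants(func, old, new):
    """Give func a new code object whose string constants are rewritten."""
    code = func.__code__
    new_consts = tuple(
        const.replace(old, new) if isinstance(const, str) else const
        for const in code.co_consts
    )
    # Code objects are immutable, so build a copy with different co_consts.
    func.__code__ = code.replace(co_consts=new_consts)


patch_string_constants(print_something, 'cat', 'dog')

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_something()
patched_output = buffer.getvalue()
```

Nothing about print itself changes here; only the constant that print_something loads is different.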
If you really want to play around with code objects for real, you should use a library like bytecode (when it's finished) or byteplay (until then, or for older Python versions) instead of doing it manually. Even for something this trivial, the CodeType initializer is a pain; if you actually need to do stuff like fixing up lnotab, only a lunatic would do that manually.
Also, it goes without saying that not all Python implementations use CPython-style code objects. This code will work in CPython 3.7, and probably all versions back to at least 2.2 with a few minor changes (and not the code-hacking stuff, but things like generator expressions), but it won't work with any version of IronPython.
What could go wrong with hacking up code objects? Mostly just segfaults, RuntimeErrors that eat up the whole stack, more normal RuntimeErrors that can be handled, or garbage values that will probably just raise a TypeError or AttributeError when you try to use them. For examples, try creating a code object with just a RETURN_VALUE with nothing on the stack (bytecode b'S\0' for 3.6+, b'S' before), or with an empty tuple for co_consts when there's a LOAD_CONST 0 in the bytecode, or with varnames decremented by 1 so the highest LOAD_FAST actually loads a freevar/cellvar cell. For some real fun, if you get the lnotab wrong enough, your code will only segfault when run in the debugger.
Using bytecode or byteplay won't protect you from all of those problems, but they do have some basic sanity checks, and nice helpers that let you do things like insert a chunk of code and let it worry about updating all offsets and labels so you can't get it wrong, and so on. (Plus, they keep you from having to type in that ridiculous 6-line constructor, and having to debug the silly typos that come from doing so.)
Now on to #2.
I mentioned that code objects are immutable. And of course the consts are a tuple, so we can't change that directly. And the thing in the const tuple is a string, which we also can't change directly. That's why I had to build a new string to build a new tuple to build a new code object.
But what if you could change a string directly?
Well, deep enough under the covers, everything is just a pointer to some C data, right? If you're using CPython, there's a C API to access the objects, and you can use ctypes to access that API from within Python itself, which is such a terrible idea that they put a pythonapi right there in the stdlib's ctypes module. :) The most important trick you need to know is that id(x) is the actual pointer to x in memory (as an int).
Unfortunately, the C API for strings won't let us safely get at the internal storage of an already-frozen string. So screw safely, let's just read the header files and find that storage ourselves.
If you're using CPython 3.4 - 3.7 (it's different for older versions, and who knows for the future), a string literal from a module that's made of pure ASCII is going to be stored using the compact ASCII format, which means the struct ends early and the buffer of ASCII bytes follows immediately in memory. This will break (as in probably segfault) if you put a non-ASCII character in the string, or certain kinds of non-literal strings, but you can read up on the other 4 ways to access the buffer for different kinds of strings.
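For experimenting only, here is a rough sketch of the compact ASCII case. It assumes CPython 3.7 or later and, instead of mirroring the C struct, it locates the buffer with a shortcut: for a compact ASCII str, sys.getsizeof reports the header plus len + 1 bytes, so the character data starts len + 1 bytes before the end of the object. The helper name overwrite_in_place is made up, and the victim string is built at runtime so that no shared literal gets corrupted:

```python
import ctypes
import sys


def overwrite_in_place(text, old, new):
    """Overwrite old with new inside text's own buffer (same length only)."""
    if not text.isascii():
        raise ValueError('only the compact ASCII layout is handled here')
    if len(old) != len(new):
        raise ValueError('replacement must have the same length')
    index = text.find(old)
    if index < 0:
        raise ValueError('substring not found')
    # Assumption: the ASCII bytes plus a trailing NUL sit at the very end
    # of the object, directly after the fixed-size header.
    data_address = id(text) + sys.getsizeof(text) - (len(text) + 1)
    ctypes.memmove(data_address + index, new.encode('ascii'), len(new))


# Built at runtime, so this object is not a constant shared with other code.
victim = ''.join(['This cat', ' was scared.'])
overwrite_in_place(victim, 'cat', 'dog')
```

If the string's hash had already been cached (for example, because it was used as a dict key), that cached hash would now be stale, which is one more reason never to do this outside a throwaway interpreter.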
To make things slightly easier, I'm using the superhackyinternals project off my GitHub. (It's intentionally not pip-installable because you really shouldn't be using this except to experiment with your local build of the interpreter and the like.)
If you want to play with this stuff, int is a whole lot simpler under the covers than str. And it's a lot easier to guess what you can break by changing the value of 2 to 1, right? Actually, forget imagining, let's just do it (using the types from superhackyinternals again):
… pretend that code box has an infinite-length scrollbar.
I tried the same thing in IPython, and the first time I tried to evaluate 2 at the prompt, it went into some kind of uninterruptible infinite loop. Presumably it's using the number 2 for something in its REPL loop, while the stock interpreter isn't?