How exactly is Python Bytecode Run in CPython?

Posted 2019-01-20 23:54

Question:

I am trying to understand how Python works (because I use it all the time!). To my understanding, when you run something like python script.py, the script is converted to bytecode, and then the interpreter/VM/CPython (really just a C program) reads in the Python bytecode and executes the program accordingly.

How is this bytecode read in? Is it similar to how a text file is read in C? I am unsure how the Python code is converted to machine code. Is it the case that the Python interpreter (the python command in the CLI) is really just a precompiled C program that is already converted to machine code and then the python bytecode files are just put through that program? In other words, is my Python program never actually converted into machine code? Is the python interpreter already in machine code, so my script never has to be?

Answer 1:

Yes, your understanding is correct. There is basically (very basically) a giant switch statement inside the CPython interpreter that says "if the current opcode is so and so, do this and that".

http://hg.python.org/cpython/file/3.3/Python/ceval.c#l790
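To make the "giant switch" idea concrete, here is a minimal sketch of a stack machine written in Python. The opcode names and the (opcode, argument) tuple format are made up for illustration; they only loosely mirror CPython's real opcodes, and the real loop in ceval.c works on raw bytes and handles far more cases.

```python
# Toy stack machine: illustrates opcode dispatch, not CPython's actual bytecode format.
def run(program, names):
    stack = []
    for opcode, arg in program:
        if opcode == "LOAD_NAME":
            # Look the name up and push its value.
            stack.append(names[arg])
        elif opcode == "LOAD_CONST":
            # Push a literal value.
            stack.append(arg)
        elif opcode == "TRUE_DIVIDE":
            # Pop two operands, push the quotient.
            right = stack.pop()
            left = stack.pop()
            stack.append(left / right)
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    raise RuntimeError("program ended without RETURN_VALUE")


# Hypothetical program equivalent to the expression i / 3.
program = [
    ("LOAD_NAME", "i"),
    ("LOAD_CONST", 3),
    ("TRUE_DIVIDE", None),
    ("RETURN_VALUE", None),
]
result = run(program, {"i": 12})
print(result)
```

The if/elif chain plays the role of the C switch: one branch per opcode, each manipulating the value stack.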

Other implementations, like PyPy, have JIT compilation, i.e. they translate Python to machine code on the fly.



Answer 2:

If you want to see the bytecode of some code (whether source code, a live function object or code object, etc.), the dis module will tell you exactly what you need. For example:

>>> import dis
>>> dis.dis('i/3')
  1           0 LOAD_NAME                0 (i)
              3 LOAD_CONST               0 (3)
              6 BINARY_TRUE_DIVIDE
              7 RETURN_VALUE

The dis docs explain what each bytecode means. For example, LOAD_NAME:

Pushes the value associated with co_names[namei] onto the stack.

To understand this, you have to know that the bytecode interpreter is a virtual stack machine, and what co_names is. The inspect module docs have a nice table showing the most important attributes of the most important internal objects, so you can see that co_names is an attribute of code objects which holds a tuple of the names the bytecode refers to (such as global and attribute names; the local variables of a function live in co_varnames instead). In other words, LOAD_NAME 0 pushes the value associated with the 0th entry of co_names (and dis helpfully looks this up and sees that the 0th name is 'i').

And that's enough to see that a string of bytecodes isn't enough; the interpreter also needs the other attributes of the code object, and in some cases attributes of the function object (which is also where the locals and globals environments come from).
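As a small illustration of that point, the following sketch compiles the same expression and pokes at the resulting code object. The exact opcodes and the contents of co_consts vary between CPython versions, so the example prints them rather than relying on specific values.

```python
import dis

# Compile the expression from the disassembly example into a code object.
code = compile("i/3", "<string>", "eval")

# The raw bytecode is just bytes; on its own it does not say what "name 0" is.
print(type(code.co_code))

# The name table and constant table live next to the bytecode on the code object.
print(code.co_names)
print(code.co_consts)

# dis resolves the numeric arguments against those tables for display.
for ins in dis.get_instructions(code):
    print(ins.opname, ins.argrepr)

# Evaluating the code object needs an environment that supplies 'i'.
value = eval(code, {"i": 6})
print(value)
```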

The inspect module also has some tools that can help you further in investigating live code.

This is enough to figure out a lot of interesting stuff. For example, you probably know that Python figures out at compile time whether a variable in a function is local, closure, or global, based on whether you assign to it anywhere in the function body (and on any nonlocal or global statements); if you write three different functions and compare their disassembly (and the relevant other attributes) you can pretty easily figure out exactly what it must be doing.
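One way to run that experiment is sketched below: three functions that all refer to a variable called g (the function names are just illustrative). Opcode names differ across CPython versions, so the sketch leans on the code object attributes, which are stable, and prints the disassembly for you to compare by eye.

```python
import dis

g = 10


def uses_global():
    # No assignment to g in this body, so the compiler treats it as global.
    return g


def uses_local():
    # Assigning to g anywhere in the body makes it local to this function.
    g = 1
    return g


def make_closure():
    g = 2

    def inner():
        # g is assigned in the enclosing function, so here it is a free variable.
        return g

    return inner


uses_closure = make_closure()

for fn in (uses_global, uses_local, uses_closure):
    c = fn.__code__
    print(fn.__name__, "names:", c.co_names,
          "varnames:", c.co_varnames, "freevars:", c.co_freevars)
    dis.dis(fn)
```

The same identifier ends up in a different table each time, and the load instruction in each disassembly differs accordingly.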

(The one bit that's tricky here is understanding closure cells. To really get this, you will need to have 3 levels of functions, to see how the one in the middle forwards things along for the innermost one.)
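A minimal three-level sketch of that forwarding, with hypothetical function names, might look like this. Note that middle never mentions x in its own body, yet x still shows up among its free variables because inner needs it.

```python
def outer():
    x = "from outer"

    def middle():
        # middle does not use x itself; it only has to pass the cell along.
        def inner():
            return x

        return inner

    return middle


middle = outer()
inner = middle()

# outer owns the cell; middle and inner both see x as a free variable.
print(outer.__code__.co_cellvars)
print(middle.__code__.co_freevars)
print(inner.__code__.co_freevars)

# The cell reachable from inner still holds the value set in outer.
print(inner.__closure__[0].cell_contents)
```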


To understand how the bytecode is interpreted and how the stack machine works (in CPython), you need to look at the ceval.c source code. The answers by thy435 and eyquem already cover this.


Understanding how pyc files are read only takes a bit more information. Ned Batchelder has a great (if slightly out-of-date) blog post called The structure of .pyc files, which covers all of the tricky and not-well-documented parts. (Note that in 3.3, some of the gory code related to importing has been moved from C to Python, which makes it much easier to follow.) But basically, a pyc file is just some header info followed by the module's code object, serialized by marshal.
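The "header plus marshalled code object" layout can be checked by hand with a sketch like the one below. The 16-byte header size is an assumption that holds for CPython 3.7 and later; older versions used a shorter header, so adjust HEADER_SIZE if you try this on one of them. The file names are arbitrary.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

# Assumption: CPython 3.7+ header layout (4-byte magic, 4-byte flags, 8 more bytes).
HEADER_SIZE = 16

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        f.write("answer = 6 * 7\n")

    # Compile the source file to an explicit .pyc path and read it back as bytes.
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "demo.pyc"), doraise=True)
    with open(pyc, "rb") as f:
        data = f.read()

# The first four bytes are the magic number of the running interpreter.
magic_ok = data[:4] == importlib.util.MAGIC_NUMBER

# Everything after the header is the module's code object, serialized by marshal.
code = marshal.loads(data[HEADER_SIZE:])

# Executing that code object is (roughly) what importing the module does.
namespace = {}
exec(code, namespace)
print(magic_ok, namespace["answer"])
```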


Understanding how source gets compiled to bytecode is the fun part.

Design of CPython's Compiler explains how everything works. (Some of the other sections of the Python Developer's Guide are also useful.)

For the early stuff—tokenizing and parsing—you can just use the ast module to jump right to the point where it's time to do the actual compiling. Then see compile.c for how that AST gets turned into bytecode.
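As a quick sketch of that shortcut, the ast module can hand you the tree, and the built-in compile() accepts the tree directly, so you can watch the two stages separately. The file name "<ast>" is just a placeholder label.

```python
import ast

# Stage 1: tokenizing and parsing, producing an AST.
tree = ast.parse("i / 3", mode="eval")
print(ast.dump(tree))

# Stage 2: the actual compile step, from AST to a code object.
code = compile(tree, "<ast>", "eval")

# Running the code object with a value for i.
value = eval(code, {"i": 6})
print(value)
```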

The macros can be a bit tough to work through, but once you grasp the idea of how the compiler uses a stack to descend into blocks, and how it uses compiler_addop and friends to emit bytecodes at the current level, it all makes sense.

One thing that surprises most people at first is the way functions work. The function definition's body is compiled into a code object. Then the function definition itself is compiled into code (inside the enclosing function body, module, etc.) that, when executed, builds a function object from that code object. (Once you think about how closures must work, it's obvious why it works that way. Each instance of the closure is a separate function object with the same code object.)
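You can observe this from Python itself. In the sketch below (make_adder is a made-up example), the inner function's code object sits among the constants of the enclosing function's code, and every call to the enclosing function builds a fresh function object around that same code object.

```python
import types


def make_adder(n):
    def add(x):
        return x + n

    return add


add_one = make_adder(1)
add_two = make_adder(2)

# The body of add was compiled once and stored as a constant of make_adder's code.
nested = [c for c in make_adder.__code__.co_consts if isinstance(c, types.CodeType)]

# Two distinct function objects (each with its own closure)...
print(add_one is add_two)

# ...sharing one code object, which is the one stored in co_consts.
print(add_one.__code__ is add_two.__code__)
print(nested[0] is add_one.__code__)
```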


And now you're ready to start patching CPython to add your own statements, right? Well, as Changing CPython's Grammar shows, there's a lot of stuff to get right (and there's even more if you need to create new opcodes). You might find it easier to learn PyPy as well as CPython, and start hacking on PyPy first, and only come back to CPython once you know that what you're doing is sensible and doable.



Answer 3:

Having read the answer of thg4535, I am sure you will find the following explanation of ceval.c interesting: Hello, ceval.c!

This article is part of a series written by Yaniv Aknin, of whom I'm sort of a fan: Python's Innards