The Inner Workings of Python: Beyond the Surface
by Gary Worthington, More Than Monkeys

Python is famous for readability. You can write for x in items: and it just works, no boilerplate required. But what actually happens when you run:
python script.py
This article explains the internals of CPython, the reference implementation. We’ll walk through each layer: parsing, compilation, bytecode evaluation, memory management, and the Global Interpreter Lock (GIL), then look at performance characteristics, alternative implementations, and upcoming changes like Faster CPython and No-GIL Python.
This is not a “what is Python” primer. It’s aimed at engineers who want to understand how the interpreter works.
1. From Source to Execution
Python doesn’t execute raw text. Instead:
- Parsing: source code → Abstract Syntax Tree (AST)
- Compilation: AST → bytecode (cached in .pyc files)
- Evaluation: bytecode → executed by the CPython virtual machine
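You can drive all three stages by hand from Python itself. A minimal sketch using the built-in ast, compile, and exec:
import ast

source = "x = (3 + 4) * 5"
tree = ast.parse(source)                      # 1. parse: text -> AST
code = compile(tree, "<demo>", mode="exec")   # 2. compile: AST -> bytecode
namespace = {}
exec(code, namespace)                         # 3. evaluate: the VM runs the bytecode
print(namespace["x"])                         # 35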
2. Parsing: The PEG Parser
Since Python 3.9, CPython has used a PEG (Parsing Expression Grammar) parser. It replaced the old LL(1) parser and supports syntax the old parser could not express, such as match/case (see PEP 617).
import ast
tree = ast.parse("x = (3 + 4) * 5")
print(ast.dump(tree, indent=4))
Output (abridged):
Module(
    body=[
        Assign(
            targets=[Name(id='x')],
            value=BinOp(
                left=BinOp(left=Constant(value=3), op=Add(), right=Constant(value=4)),
                op=Mult(),
                right=Constant(value=5)
            )
        )
    ]
)
This AST is Python’s internal tree representation. Every if, for, or def is a node in this structure.
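You can see this directly by walking the tree: every construct shows up as its own node class. A small sketch:
import ast

snippet = "def f(items):\n    for x in items:\n        if x:\n            return x"
for node in ast.walk(ast.parse(snippet)):
    print(type(node).__name__)   # includes FunctionDef, For, If, Return: one node per construct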
3. Compilation: AST to Bytecode
The compiler translates AST nodes into bytecode instructions. Inspect with dis:
import dis
def add(a: int, b: int) -> int:
    return a + b
dis.dis(add)
  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE
(Output from CPython 3.10; on 3.11+ the addition appears as a single BINARY_OP instruction.)
Each line is one bytecode instruction: an opcode plus an optional argument. CPython defines on the order of 120 opcodes, though the exact count varies by version.
🔗 Disassembly with dis
🔗 Bytecode Instructions
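To check the count on your own build, the opcode module lists every defined instruction (a quick sketch; undefined slots appear as placeholder names):
import opcode

defined = [name for name in opcode.opname if not name.startswith("<")]
print(len(defined))   # roughly 120-170 depending on the CPython version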
4. Execution: The Eval Loop in ceval.c
At runtime, CPython executes bytecode in a loop inside Python/ceval.c (heavily simplified here):
for (;;) {
    opcode = NEXTOP();
    switch (opcode) {
        case LOAD_FAST:
            /* push a local variable onto the value stack */
            PUSH(f->f_localsplus[oparg]);
            break;
        case BINARY_ADD: {
            /* pop two operands, add them, push the result */
            PyObject *right = POP();
            PyObject *left = TOP();
            PyObject *sum = PyNumber_Add(left, right);
            SET_TOP(sum);
            Py_DECREF(left);
            Py_DECREF(right);
            break;
        }
        case RETURN_VALUE:
            retval = POP();
            goto exit_eval_frame;
    }
}
Every Python operation runs through this interpreter loop: stack manipulation, type-dispatched calls (PyNumber_Add, PyObject_GetAttr), and refcounting.
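You can watch the stack machine at work by disassembling an expression with mixed precedence (a quick sketch; the exact opcode names vary by version):
import dis

# b * c is evaluated first and its result left on the stack,
# then added to a: classic stack-machine ordering
dis.dis(compile("a + b * c", "<demo>", "eval"))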
5. Frames: Execution Contexts
Every function call in CPython creates a frame object. These are central to how the interpreter tracks execution state.
The C structure is defined in Include/cpython/frameobject.h (shown simplified; CPython 3.11 reworked frames into a leaner internal structure):
typedef struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;      /* previous frame, or NULL */
    PyCodeObject *f_code;       /* code object being executed */
    PyObject *f_builtins;       /* builtins dict */
    PyObject *f_globals;        /* globals dict */
    PyObject *f_locals;         /* locals dict */
    PyObject **f_valuestack;    /* value stack */
    int f_lasti;                /* index of last attempted instruction */
    int f_lineno;               /* current line number */
    ...
} PyFrameObject;
Each frame stores:
- f_back → a pointer to the previous frame (so the runtime maintains a call stack).
- f_code → the PyCodeObject containing bytecode.
- f_globals / f_locals → the current namespaces.
- f_valuestack → the operand stack used by the eval loop.
- f_lasti → the instruction pointer into bytecode.
Inspect from Python itself:
import inspect
def inner():
    frame = inspect.currentframe()
    print(f"Function: {frame.f_code.co_name}")
    print(f"Line: {frame.f_lineno}")
    print(f"Globals: {list(frame.f_globals.keys())[:5]}")
    print(f"Back: {frame.f_back.f_code.co_name}")
def outer():
    inner()
outer()
Output:
Function: inner
Line: 5
Globals: ['__name__', '__doc__', '__package__', '__loader__', '__spec__']
Back: outer
Because frames link together via f_back, Python debuggers, tracers, and profilers can walk the call stack. This flexibility is a feature, but also explains why Python function calls are heavier than in C — each call means allocating and initialising a new PyFrameObject.
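Walking the stack is exactly what a minimal profiler or debugger does. A sketch that follows f_back down to the bottom frame:
import inspect

def walk_stack():
    frame = inspect.currentframe()
    while frame is not None:             # follow f_back, like a debugger
        print(frame.f_code.co_name, "at line", frame.f_lineno)
        frame = frame.f_back

def level2():
    walk_stack()

def level1():
    level2()

level1()   # prints walk_stack, level2, level1, <module>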
6. Memory Management
Object Layout
Every Python object starts with a PyObject_HEAD:
typedef struct _object {
    Py_ssize_t ob_refcnt;           /* reference count */
    struct _typeobject *ob_type;    /* pointer to type object */
} PyObject;
- ob_refcnt → how many references point to the object.
- ob_type → pointer to the object’s type (e.g. <class 'int'>).
This uniform header is why everything in Python is an object: ints, strings, lists, even functions.
Reference Counting
CPython uses macros to manage refcounts:
#define Py_INCREF(op) ((op)->ob_refcnt++)
#define Py_DECREF(op)                       \
    if (--(op)->ob_refcnt == 0)             \
        _Py_Dealloc((PyObject *)(op))
From Python, you can observe this:
import sys
a = [1, 2, 3]
print(sys.getrefcount(a)) # usually 2: one for 'a', one for getrefcount arg
When ob_refcnt drops to zero, _Py_Dealloc frees the object.
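Refcounts move up and down as you bind and unbind names; you can watch it happen (exact numbers may differ slightly between versions):
import sys

a = [1, 2, 3]
print(sys.getrefcount(a))   # 2: the name 'a' plus the getrefcount argument
b = a                       # binding another name increments the count
print(sys.getrefcount(a))   # 3
del b                       # unbinding decrements it
print(sys.getrefcount(a))   # 2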
Cycles and Generational GC
Reference counting cannot handle cycles:
a = []
a.append(a) # list refers to itself
Even after del a, the refcount never reaches zero, so reference counting alone would leak the list. To handle this, CPython includes a cyclic garbage collector in Modules/gc.c. It uses a generational scheme:
- Generation 0: newly allocated container objects.
- Generation 1: survivors of a collection.
- Generation 2: long-lived objects.
Allocation thresholds determine when collections run. You can inspect them:
import gc
print(gc.get_threshold())
Typical output:
(700, 10, 10)
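You can also watch the per-generation bookkeeping with gc.get_count(), which reports how much pending work each generation has accumulated (counts will differ on your machine):
import gc

print(gc.get_count())   # e.g. (430, 2, 1): pending counts per generation
gc.collect(0)           # collect only generation 0
print(gc.get_count())   # generation 0 resets; survivors age into generation 1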
Demonstration:
import gc, weakref
class Node:
    def __init__(self):
        self.ref = None
a, b = Node(), Node()
a.ref, b.ref = b, a     # cycle
ref_a = weakref.ref(a)
del a, b
print(ref_a())   # still alive, cycle not freed yet
gc.collect()     # force GC
print(ref_a())   # None, collected
Typical output:
<__main__.Node object at 0x...>
None
7. The Global Interpreter Lock (GIL)
The GIL is a single mutex protecting Python’s object state. The acquisition logic lives in Python/ceval_gil.h (heavily simplified here; the real code uses condition variables):
static inline void take_gil(PyThreadState *tstate) {
    while (_Py_atomic_load_relaxed(&gil_locked)) {
        /* spin or wait */
    }
    gil_locked = 1;
    current_thread = tstate;
}
Why the GIL Exists
- Reference counting updates must be atomic
- Many C API calls are not thread safe
- A global lock simplified early interpreter design
Demonstration: sequential vs threads vs processes (CPU-bound)
import sys
import time
import threading
import multiprocessing as mp

# Make the thread switch interval explicit (0.005s is the default)
sys.setswitchinterval(0.005)

def cpu_task(n: int) -> None:
    """Tight Python loop that keeps hold of the GIL."""
    x = 0
    for _ in range(n):
        x += 1

def run_sequential(n: int, repeats: int) -> float:
    start = time.perf_counter()
    for _ in range(repeats):
        cpu_task(n)
    end = time.perf_counter()
    print(f"sequential start: {start:.6f}, end: {end:.6f}, elapsed: {end - start:.3f}s", flush=True)
    return end - start

def run_threads(n: int, workers: int) -> float:
    ts = [threading.Thread(target=cpu_task, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    end = time.perf_counter()
    print(f"{workers} threads start: {start:.6f}, end: {end:.6f}, elapsed: {end - start:.3f}s", flush=True)
    return end - start

def run_processes(n: int, workers: int) -> float:
    ps = [mp.Process(target=cpu_task, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    end = time.perf_counter()
    print(f"{workers} processes start: {start:.6f}, end: {end:.6f}, elapsed: {end - start:.3f}s", flush=True)
    return end - start

if __name__ == "__main__":
    N = 50_000_000
    seq = run_sequential(N, repeats=2)   # two back-to-back runs in one thread
    th = run_threads(N, workers=2)       # two threads concurrently
    pr = run_processes(N, workers=2)     # two processes concurrently
    print(f"\nSummary (lower is better):")
    print(f" sequential (2× tasks, 1 thread): {seq:.3f}s")
    print(f" 2 threads (GIL time-slicing): {th:.3f}s")
    print(f" 2 processes (true parallel): {pr:.3f}s")
Typical output (on a 2-core machine):
sequential start: 4934.729760, end: 4942.446973, elapsed: 7.717s
2 threads start: 4942.447424, end: 4951.109154, elapsed: 8.662s
2 processes start: 4951.109298, end: 4952.783269, elapsed: 1.674s
Summary (lower is better):
sequential (2× tasks, 1 thread): 7.717s
2 threads (GIL time-slicing): 8.662s
2 processes (true parallel): 1.674s
Threads do not reduce wall time because they contend for the GIL. Processes run in parallel across cores and cut the wall time to roughly a quarter.
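The picture flips for I/O-bound work: blocking calls release the GIL, so threads overlap their waiting. A minimal sketch, using time.sleep as a stand-in for network or disk waits:
import threading
import time

def io_task():
    time.sleep(1)   # blocking call; the GIL is released while waiting

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 blocking tasks took {time.perf_counter() - start:.2f}s")   # ~1s, not ~4s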
🔗 GIL documentation
🔗 David Beazley’s GIL talk
8. Performance Characteristics
Function Calls
Function calls create frames. Compare different call types:
import operator, timeit
def f(x): return x + 1
lambda_f = lambda x: x + 1
print("def func:", timeit.timeit("f(42)", globals=globals(), number=1_000_000))
print("lambda:", timeit.timeit("lambda_f(42)", globals=globals(), number=1_000_000))
print("operator.add:", timeit.timeit("operator.add(42,1)", globals=globals(), number=1_000_000))
print("native +:", timeit.timeit("42+1", globals=globals(), number=1_000_000))
def func: 0.20
lambda: 0.19
operator.add: 0.11
native +: 0.03
This shows that:
- A def function or lambda call costs roughly 200 ns per call, dominated by frame setup.
- A C built-in like operator.add is faster (~110 ns) because it is implemented in C and never creates a Python frame.
- Inline arithmetic is fastest (~30 ns). Note that CPython constant-folds 42+1 at compile time, so this case mostly measures loading a constant.
Attribute Lookup
Attribute access goes through __getattribute__, which checks the instance dict and then walks the class MRO, with __getattr__ as a final fallback. This is slower than field access in compiled languages.
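A small sketch of that lookup order (the class here is illustrative):
class A:
    x = "class attribute"                 # found by walking the MRO

    def __getattr__(self, name):          # called only when normal lookup fails
        return f"fallback for {name!r}"

a = A()
a.y = "instance attribute"                # stored in a.__dict__
print(a.y)   # instance dict wins first
print(a.x)   # then the class (and its MRO)
print(a.z)   # finally __getattr__: fallback for 'z'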
Python 3.11 introduced inline caches: bytecode instructions like LOAD_ATTR specialise at runtime to avoid repeated dictionary lookups (PEP 659).
Comprehensions vs Loops
List comprehensions are usually faster because the append happens via a dedicated LIST_APPEND opcode, instead of a method lookup and call on every iteration:
import timeit
print("listcomp:", timeit.timeit("[x*x for x in range(1000)]", number=10_000))
print("for+append:", timeit.timeit("res=[]\nfor x in range(1000): res.append(x*x)", number=10_000))
listcomp: 0.08
for+append: 0.12
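To see the difference, disassemble both forms: the comprehension appends with LIST_APPEND, while the explicit loop looks up and calls the append method each time around:
import dis

dis.dis("[x*x for x in range(1000)]")                        # look for LIST_APPEND
dis.dis("res = []\nfor x in range(1000): res.append(x*x)")   # attribute lookup + call per item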
9. Alternative Implementations
- PyPy: Uses a tracing JIT that detects hot loops, compiles them to machine code, and can deliver 4–10× speedups.
- Jython: Compiles Python code to JVM bytecode, allowing direct interop with Java libraries.
- IronPython: Targets the .NET CLR, giving access to .NET assemblies.
- MicroPython: Stripped-down implementation designed for microcontrollers, omits refcounting in favour of a mark-and-sweep GC.
Benchmark example (looping 10 million times):
import time
def loop_test(n: int) -> int:
    x = 0
    for _ in range(n):
        x += 1
    return x
start = time.perf_counter()
loop_test(10_000_000)
print(f"CPython 3.11: {time.perf_counter() - start:.2f}s")
- CPython 3.11: ~1.3 s
- PyPy 7.3: ~0.2 s
The difference is due to PyPy’s JIT, which avoids per-iteration interpreter overhead once the loop is traced and compiled.
10. The Future of Python
Faster CPython
The Faster CPython project (developed in the open on GitHub) aims for a 5× speedup by Python 3.13. Key techniques:
- Adaptive bytecode: instructions rewrite themselves into specialised forms (e.g. BINARY_OP_ADD_INT, BINARY_OP_ADD_FLOAT); see the sketch after this list.
- Inline caches: store the result of name and attribute lookups next to the opcode.
- Optimised eval loop: reduced overhead per opcode.
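On CPython 3.11+, you can watch specialisation happen with dis’s adaptive mode (a quick sketch; the warm-up count is an implementation detail):
import dis

def add_ints(a, b):
    return a + b

for _ in range(1000):   # warm the function up so its instructions specialise
    add_ints(1, 2)

dis.dis(add_ints, adaptive=True)   # BINARY_OP should appear as BINARY_OP_ADD_INT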
No-GIL Python (PEP 703)
Sam Gross’s proposal, PEP 703, removes the GIL by:
- Making refcount updates atomic.
- Introducing per-object locks.
- Using biased reference counting for performance.
Experimental results show CPU-bound multi-threaded programs scaling almost linearly with cores, though single-thread performance drops ~10%.
HPy: A New C API
Another future effort is HPy, a modern replacement for the CPython C API. It’s designed to remove assumptions about refcounting and make extension modules more portable across Python implementations.
11. Why This Matters
- Performance: explains why NumPy is fast (NumPy internals).
- Concurrency: clarifies why multiprocessing beats threads for CPU-bound work (multiprocessing docs).
- Memory leaks: shows how reference cycles are found and collected (gc module).
- Library design: highlights why C extensions can release the GIL (C-API docs).
Closing Thoughts
Python looks simple but relies on a complex engine: ASTs, bytecode, an eval loop, reference counting, a garbage collector, and the GIL.
Understanding these internals is not just an academic exercise. It explains both Python’s productivity and its performance limitations.
And with Faster CPython and PEP 703, the interpreter is evolving: faster, more parallel, and more predictable.
Gary Worthington is a software engineer, delivery consultant, and agile coach who helps teams move fast, learn faster, and scale when it matters. He writes about modern engineering, product thinking, and helping teams ship things that matter.
Through his consultancy, More Than Monkeys, Gary helps startups and scaleups improve how they build software — from tech strategy and agile delivery to product validation and team development.
Visit morethanmonkeys.co.uk to learn how we can help you build better, faster.
Follow Gary on LinkedIn for practical insights into engineering leadership, agile delivery, and team performance.