The Inner Workings of Python: Beyond the Surface
by Gary Worthington, More Than Monkeys

Python is famous for readability. You can write for x in items: and it just works, no boilerplate required. But what actually happens when you run:
python script.py
This article explains the internals of CPython, the reference implementation. We’ll walk through each layer: parsing, compilation, bytecode evaluation, memory management, and the Global Interpreter Lock (GIL), then look at performance characteristics, alternative implementations, and upcoming changes like Faster CPython and No-GIL Python.
This is not a “what is Python” primer. It’s aimed at engineers who want to understand how the interpreter works.
1. From Source to Execution
Python doesn’t execute raw text. Instead:
- Parsing: source code → Abstract Syntax Tree (AST)
- Compilation: AST → bytecode (cached in .pyc files)
- Evaluation: bytecode → executed by the CPython virtual machine
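You can drive all three stages by hand from Python itself. A minimal sketch using the built-in ast, compile, and exec:
import ast

source = "x = (3 + 4) * 5"
tree = ast.parse(source)                      # 1. parse: text -> AST
code = compile(tree, "<demo>", mode="exec")   # 2. compile: AST -> bytecode
namespace = {}
exec(code, namespace)                         # 3. evaluate: the VM runs the bytecode
print(namespace["x"])                         # 35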
2. Parsing: The PEG Parser
Since Python 3.9, CPython has used a PEG (Parsing Expression Grammar) parser. It replaced the old LL(1) parser and supports syntax the old parser could not express, such as match/case (see PEP 617).
import ast
tree = ast.parse("x = (3 + 4) * 5")
print(ast.dump(tree, indent=4))
Output (abridged):
Module(
    body=[
        Assign(
            targets=[Name(id='x')],
            value=BinOp(
                left=BinOp(left=Constant(value=3), op=Add(), right=Constant(value=4)),
                op=Mult(),
                right=Constant(value=5)
            )
        )
    ]
)
This AST is Python’s internal tree representation. Every if, for, or def is a node in this structure.
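You can see this directly by walking the tree: every construct shows up as its own node class. A small sketch:
import ast

snippet = "def f(items):\n    for x in items:\n        if x:\n            return x"
for node in ast.walk(ast.parse(snippet)):
    print(type(node).__name__)   # includes FunctionDef, For, If, Return: one node per construct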
3. Compilation: AST to Bytecode
The compiler translates AST nodes into bytecode instructions. Inspect with dis:
import dis
def add(a: int, b: int) -> int:
    return a + b
dis.dis(add)
  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE
(Output from CPython 3.10; on 3.11+ the addition appears as a single BINARY_OP instruction.)
Each line is one bytecode instruction: an opcode plus an optional argument. CPython defines on the order of 120 opcodes, though the exact count varies by version.
🔗 Disassembly with dis
🔗 Bytecode Instructions
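To check the count on your own build, the opcode module lists every defined instruction (a quick sketch; undefined slots appear as placeholder names):
import opcode

defined = [name for name in opcode.opname if not name.startswith("<")]
print(len(defined))   # roughly 120-170 depending on the CPython version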
4. Execution: The Eval Loop in ceval.c
At runtime, CPython executes bytecode in a loop inside Python/ceval.c (heavily simplified here):
for (;;) {
    opcode = NEXTOP();
    switch (opcode) {
        case LOAD_FAST:
            /* push a local variable onto the value stack */
            PUSH(f->f_localsplus[oparg]);
            break;
        case BINARY_ADD: {
            /* pop two operands, add them, push the result */
            PyObject *right = POP();
            PyObject *left = TOP();
            PyObject *sum = PyNumber_Add(left, right);
            SET_TOP(sum);
            Py_DECREF(left);
            Py_DECREF(right);
            break;
        }
        case RETURN_VALUE:
            retval = POP();
            goto exit_eval_frame;
    }
}
Every Python operation runs through this interpreter loop: stack manipulation, type-dispatched calls (PyNumber_Add, PyObject_GetAttr), and refcounting.
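You can watch the stack machine at work by disassembling an expression with mixed precedence (a quick sketch; the exact opcode names vary by version):
import dis

# b * c is evaluated first and its result left on the stack,
# then added to a: classic stack-machine ordering
dis.dis(compile("a + b * c", "<demo>", "eval"))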
5. Frames: Execution Contexts
Every function call in CPython creates a frame object. These are central to how the interpreter tracks execution state.
The C structure is defined in Include/cpython/frameobject.h (shown simplified; CPython 3.11 reworked frames into a leaner internal structure):
typedef struct _frame {
    PyObject_VAR_HEAD
    struct _frame *f_back;      /* previous frame, or NULL */
    PyCodeObject *f_code;       /* code object being executed */
    PyObject *f_builtins;       /* builtins dict */
    PyObject *f_globals;        /* globals dict */
    PyObject *f_locals;         /* locals dict */
    PyObject **f_valuestack;    /* value stack */
    int f_lasti;                /* index of last attempted instruction */
    int f_lineno;               /* current line number */
    ...
} PyFrameObject;
Each frame stores:
- f_back → a pointer to the previous frame (so the runtime maintains a call stack).
- f_code → the PyCodeObject containing bytecode.
- f_globals / f_locals → the current namespaces.
- f_valuestack → the operand stack used by the eval loop.
- f_lasti → the instruction pointer into bytecode.
Inspect from Python itself:
import inspect
def inner():
    frame = inspect.currentframe()
    print(f"Function: {frame.f_code.co_name}")
    print(f"Line: {frame.f_lineno}")
    print(f"Globals: {list(frame.f_globals.keys())[:5]}")
    print(f"Back: {frame.f_back.f_code.co_name}")
def outer():
    inner()
outer()
Output:
Function: inner
Line: 5
Globals: ['__name__', '__doc__', '__package__', '__loader__', '__spec__']
Back: outer
Because frames link together via f_back, Python debuggers, tracers, and profilers can walk the call stack. This flexibility is a feature, but also explains why Python function calls are heavier than in C — each call means allocating and initialising a new PyFrameObject.
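Walking the stack is exactly what a minimal profiler or debugger does. A sketch that follows f_back down to the bottom frame:
import inspect

def walk_stack():
    frame = inspect.currentframe()
    while frame is not None:             # follow f_back, like a debugger
        print(frame.f_code.co_name, "at line", frame.f_lineno)
        frame = frame.f_back

def level2():
    walk_stack()

def level1():
    level2()

level1()   # prints walk_stack, level2, level1, <module>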
6. Memory Management
Object Layout
Every Python object starts with a PyObject_HEAD:
typedef struct _object {
    Py_ssize_t ob_refcnt;           /* reference count */
    struct _typeobject *ob_type;    /* pointer to type object */
} PyObject;
- ob_refcnt → how many references point to the object.
- ob_type → pointer to the object’s type (e.g. <class 'int'>).
This uniform header is why everything in Python is an object: ints, strings, lists, even functions.
Reference Counting
CPython uses macros to manage refcounts:
#define Py_INCREF(op) ((op)->ob_refcnt++)
#define Py_DECREF(op)                       \
    if (--(op)->ob_refcnt == 0)             \
        _Py_Dealloc((PyObject *)(op))
From Python, you can observe this:
import sys
a = [1, 2, 3]
print(sys.getrefcount(a)) # usually 2: one for 'a', one for getrefcount arg
When ob_refcnt drops to zero, _Py_Dealloc frees the object.
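Refcounts move up and down as you bind and unbind names; you can watch it happen (exact numbers may differ slightly between versions):
import sys

a = [1, 2, 3]
print(sys.getrefcount(a))   # 2: the name 'a' plus the getrefcount argument
b = a                       # binding another name increments the count
print(sys.getrefcount(a))   # 3
del b                       # unbinding decrements it
print(sys.getrefcount(a))   # 2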
Cycles and Generational GC
Reference counting cannot handle cycles:
a = []
a.append(a) # list refers to itself
Even after del a, the refcount never reaches zero, so reference counting alone would leak the list. To handle this, CPython includes a cyclic garbage collector in Modules/gc.c. It uses a generational scheme:
- Generation 0: newly allocated container objects.
- Generation 1: survivors of a collection.
- Generation 2: long-lived objects.
Allocation thresholds determine when collections run. You can inspect them:
import gc
print(gc.get_threshold())
Typical output:
(700, 10, 10)
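You can also watch the per-generation bookkeeping with gc.get_count(), which reports how much pending work each generation has accumulated (counts will differ on your machine):
import gc

print(gc.get_count())   # e.g. (430, 2, 1): pending counts per generation
gc.collect(0)           # collect only generation 0
print(gc.get_count())   # generation 0 resets; survivors age into generation 1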
Demonstration:
import gc, weakref
class Node:
    def __init__(self):
        self.ref = None
a, b = Node(), Node()
a.ref, b.ref = b, a     # cycle
ref_a = weakref.ref(a)
del a, b
print(ref_a())   # still alive, cycle not freed yet
gc.collect()     # force GC
print(ref_a())   # None, collected
Typical output:
<__main__.Node object at 0x...>
None
7. The Global Interpreter Lock (GIL)
The GIL is a single mutex protecting Python’s object state. The acquisition logic lives in Python/ceval_gil.h (heavily simplified here; the real code uses condition variables):
static inline void take_gil(PyThreadState *tstate) {
    while (_Py_atomic_load_relaxed(&gil_locked)) {
        /* spin or wait */
    }
    gil_locked = 1;
    current_thread = tstate;
}
Why the GIL Exists
- Reference counting updates must be atomic
- Many C API calls are not thread safe
- A global lock simplified early interpreter design
Demonstration: sequential vs threads vs processes (CPU-bound)
import sys
import time
import threading
import multiprocessing as mp

# Make the thread switch interval explicit (0.005s is the default)
sys.setswitchinterval(0.005)

def cpu_task(n: int) -> None:
    """Tight Python loop that keeps hold of the GIL."""
    x = 0
    for _ in range(n):
        x += 1

def run_sequential(n: int, repeats: int) -> float:
    start = time.perf_counter()
    for _ in range(repeats):
        cpu_task(n)
    end = time.perf_counter()
    print(f"sequential start: {start:.6f}, end: {end:.6f}, elapsed: {end - start:.3f}s", flush=True)
    return end - start

def run_threads(n: int, workers: int) -> float:
    ts = [threading.Thread(target=cpu_task, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    end = time.perf_counter()
    print(f"{workers} threads start: {start:.6f}, end: {end:.6f}, elapsed: {end - start:.3f}s", flush=True)
    return end - start

def run_processes(n: int, workers: int) -> float:
    ps = [mp.Process(target=cpu_task, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    end = time.perf_counter()
    print(f"{workers} processes start: {start:.6f}, end: {end:.6f}, elapsed: {end - start:.3f}s", flush=True)
    return end - start

if __name__ == "__main__":
    N = 50_000_000
    seq = run_sequential(N, repeats=2)   # two back-to-back runs in one thread
    th = run_threads(N, workers=2)       # two threads concurrently
    pr = run_processes(N, workers=2)     # two processes concurrently
    print(f"\nSummary (lower is better):")
    print(f" sequential (2× tasks, 1 thread): {seq:.3f}s")
    print(f" 2 threads (GIL time-slicing): {th:.3f}s")
    print(f" 2 processes (true parallel): {pr:.3f}s")
Typical output (on a 2-core machine):
sequential start: 4934.729760, end: 4942.446973, elapsed: 7.717s
2 threads start: 4942.447424, end: 4951.109154, elapsed: 8.662s
2 processes start: 4951.109298, end: 4952.783269, elapsed: 1.674s
Summary (lower is better):
sequential (2× tasks, 1 thread): 7.717s
2 threads (GIL time-slicing): 8.662s
2 processes (true parallel): 1.674s
Threads do not reduce wall time because they contend for the GIL. Processes run in parallel across cores and cut the wall time to roughly a quarter.
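The picture flips for I/O-bound work: blocking calls release the GIL, so threads overlap their waiting. A minimal sketch, using time.sleep as a stand-in for network or disk waits:
import threading
import time

def io_task():
    time.sleep(1)   # blocking call; the GIL is released while waiting

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 blocking tasks took {time.perf_counter() - start:.2f}s")   # ~1s, not ~4s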
🔗 GIL documentation
🔗 David Beazley’s GIL talk
8. Performance Characteristics
Function Calls
Function calls create frames. Compare different call types:
import operator, timeit
def f(x): return x + 1
lambda_f = lambda x: x + 1
print("def func:", timeit.timeit("f(42)", globals=globals(), number=1_000_000))
print("lambda:", timeit.timeit("lambda_f(42)", globals=globals(), number=1_000_000))
print("operator.add:", timeit.timeit("operator.add(42,1)", globals=globals(), number=1_000_000))
print("native +:", timeit.timeit("42+1", globals=globals(), number=1_000_000))
def func: 0.20
lambda: 0.19
operator.add: 0.11
native +: 0.03
This shows that:
- A def function or lambda call costs roughly 200 ns per call, dominated by frame setup.
- A C built-in like operator.add is faster (~110 ns) because it is implemented in C and never creates a Python frame.
- Inline arithmetic is fastest (~30 ns). Note that CPython constant-folds 42+1 at compile time, so this case mostly measures loading a constant.
Attribute Lookup
Attribute access goes through __getattribute__, which checks the instance dict and then walks the class MRO, with __getattr__ as a final fallback. This is slower than field access in compiled languages.
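A small sketch of that lookup order (the class here is illustrative):
class A:
    x = "class attribute"                 # found by walking the MRO

    def __getattr__(self, name):          # called only when normal lookup fails
        return f"fallback for {name!r}"

a = A()
a.y = "instance attribute"                # stored in a.__dict__
print(a.y)   # instance dict wins first
print(a.x)   # then the class (and its MRO)
print(a.z)   # finally __getattr__: fallback for 'z'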
Python 3.11 introduced inline caches: bytecode instructions like LOAD_ATTR specialise at runtime to avoid repeated dictionary lookups (PEP 659).
Comprehensions vs Loops
List comprehensions are usually faster because the append happens via a dedicated LIST_APPEND opcode, instead of a method lookup and call on every iteration:
import timeit
print("listcomp:", timeit.timeit("[x*x for x in range(1000)]", number=10_000))
print("for+append:", timeit.timeit("res=[]\nfor x in range(1000): res.append(x*x)", number=10_000))
listcomp: 0.08
for+append: 0.12
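To see the difference, disassemble both forms: the comprehension appends with LIST_APPEND, while the explicit loop looks up and calls the append method each time around:
import dis

dis.dis("[x*x for x in range(1000)]")                        # look for LIST_APPEND
dis.dis("res = []\nfor x in range(1000): res.append(x*x)")   # attribute lookup + call per item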
9. Alternative Implementations
- PyPy: Uses a tracing JIT that detects hot loops, compiles them to machine code, and can deliver 4–10× speedups.
- Jython: Compiles Python code to JVM bytecode, allowing direct interop with Java libraries.
- IronPython: Targets the .NET CLR, giving access to .NET assemblies.
- MicroPython: Stripped-down implementation designed for microcontrollers, omits refcounting in favour of a mark-and-sweep GC.
Benchmark example (looping 10 million times):
import time
def loop_test(n: int) -> int:
    x = 0
    for _ in range(n):
        x += 1
    return x
start = time.perf_counter()
loop_test(10_000_000)
print(f"CPython 3.11: {time.perf_counter() - start:.2f}s")
- CPython 3.11: ~1.3 s
- PyPy 7.3: ~0.2 s
The difference is due to PyPy’s JIT, which avoids per-iteration interpreter overhead once the loop is traced and compiled.
10. The Future of Python
Faster CPython
The Faster CPython project (developed in the open on GitHub) aims for a 5× speedup by Python 3.13. Key techniques:
- Adaptive bytecode: instructions rewrite themselves into specialised forms (e.g. BINARY_OP_ADD_INT, BINARY_OP_ADD_FLOAT); see the sketch after this list.
- Inline caches: store the result of name and attribute lookups next to the opcode.
- Optimised eval loop: reduced overhead per opcode.
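On CPython 3.11+, you can watch specialisation happen with dis’s adaptive mode (a quick sketch; the warm-up count is an implementation detail):
import dis

def add_ints(a, b):
    return a + b

for _ in range(1000):   # warm the function up so its instructions specialise
    add_ints(1, 2)

dis.dis(add_ints, adaptive=True)   # BINARY_OP should appear as BINARY_OP_ADD_INT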
No-GIL Python (PEP 703)
Sam Gross’s proposal, PEP 703, removes the GIL by:
- Making refcount updates atomic.
- Introducing per-object locks.
- Using biased reference counting for performance.
Experimental results show CPU-bound multi-threaded programs scaling almost linearly with cores, though single-thread performance drops ~10%.
HPy: A New C API
Another future effort is HPy, a modern replacement for the CPython C API. It’s designed to remove assumptions about refcounting and make extension modules more portable across Python implementations.
11. Why This Matters
- Performance: explains why NumPy is fast (NumPy internals).
- Concurrency: clarifies why multiprocessing beats threads for CPU-bound work (multiprocessing docs).
- Memory leaks: shows how reference cycles are found and collected (gc module).
- Library design: highlights why C extensions can release the GIL (C-API docs).
Closing Thoughts
Python looks simple but relies on a complex engine: ASTs, bytecode, an eval loop, reference counting, a garbage collector, and the GIL.
Understanding these internals is not just an academic exercise. It explains both Python’s productivity and its performance limitations.
And with Faster CPython and PEP 703, the interpreter is evolving: faster, more parallel, and more predictable.
Gary Worthington is a software engineer, delivery consultant, and agile coach who helps teams move fast, learn faster, and scale when it matters. He writes about modern engineering, product thinking, and helping teams ship things that matter.
Through his consultancy, More Than Monkeys, Gary helps startups and scaleups improve how they build software — from tech strategy and agile delivery to product validation and team development.
Visit morethanmonkeys.co.uk to learn how we can help you build better, faster.
Follow Gary on LinkedIn for practical insights into engineering leadership, agile delivery, and team performance.