Introduction
LLMs (large language models) often produce overly verbose code, and natural language doesn’t map cleanly to implementation. This article shows how to constrain what an LLM can read, write, and execute, and how to validate the result with mypy. The idea is to define types and symbols in a .pyi stub file, then prompt the LLM to generate a Python implementation that adheres to those definitions, and finally check it using mypy.
Part of the inspiration is that “spec” has become a bigger buzzword, yet most uses I see are either markdown slop from the LLM or user-defined markdown, which is still just plaintext natural language.
[Diagram] The user defines example.pyi containing the stub `def add(a: int, b: int) -> int: ...`; the LLM then generates example.py containing the implementation `def add(a: int, b: int) -> int: return a + b`.
Why Python?
The LLM likely knows this language best. There’s a lot more friction in using a less common language such as Julia.
What is mypy?
mypy is a static type checker for Python that analyzes your code without running it. It can prevent things like calling a function with an integer where a string is expected.
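As a minimal illustration (a hypothetical `greet` function, not part of the project below), mypy can catch a mismatched argument type before the code ever runs:

```python
def greet(name: str) -> str:
    return "Hello, " + name

print(greet("world"))  # fine

# greet(42)
# mypy reports something like:
#   error: Argument 1 to "greet" has incompatible type "int"; expected "str"
```

With the bad call uncommented, `mypy` fails the check even though plain Python would only crash at runtime.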
Outline:
- Introduction
- Walkthrough
- Module Docstring instead of extra prompt
- Different Angle for User Error
- Future Work
- Conclusion
Walkthrough
In the following, I run through the whole process in my shell, in which we
- create a linux user and group agent:agents
- setup project dir
- define .pyi (python stub file which contains only type annotations)
- prompt the LLM to generate code
- run mypy
I was using byobu, a terminal multiplexer, to make the interactive setup easier, but I substitute sudo (switch-user-do) commands below so the output can be shown in one combined transcript.
Prerequisites
- linux user and group management (usermod, groupadd, useradd)
- uv (python venv and pkg manager)
- claude (claude-code) https://code.claude.com/docs/en/quickstart
you could use an alternative LLM
not used here, but black (via uv add black and uv run black users_store.py) can fix Python whitespace and formatting issues
Project Setup
On one of my servers (hostname arch), the setup looks roughly like this:
sudo mkdir -p /opt/sig_impl # a directory with default access for multiple users
cd /opt/sig_impl
Access Control
I haven’t seen this discussed much in the community, but file permissions/modes, combined with a separate user:group for the process running the LLM, are a great way to control what it can read, write, or execute within a project or across your file system.
sudo groupadd agents
sudo usermod --append --groups agents $USER
id $USER # verify your user is in the agents group
# creating a home for the new user because `claude` defaults to using the home directory for a config file
sudo useradd agent --create-home --groups agents
sudo chown --recursive $USER:agents . # update the project dir ownership
Continued project setup
uv init --name sig_impl
ls -valh # check default files created
uv add mypy
uv run mypy # verify setup
sudo chown --recursive $USER:agents . # update ownership again for the files uv just created
With this in place, we’ll run claude to generate code as the agent user via sudo (switch-user-do).
User-Defined Spec | users_store.pyi
For the actual example, we want the LLM to implement a users-store library that lets us easily create, read, update, or delete users in a DB.
users_store.pyi
from dataclasses import dataclass


@dataclass
class DBConfig:
    host: str
    port: int
    database: str
    user: str
    password: str


@dataclass
class User:
    id: str
    user_name: str


def init_db(db_config: DBConfig) -> None:
    """connects to the database using the db_config settings"""
    ...

def create_user(user_name: str) -> User: ...
def get_user(id: str) -> User: ...
def update_user(id: str, user: User) -> User: ...
def delete_user(id: str) -> User: ...
def test_module() -> None: ...

def command_line() -> None:
    """executed if file ran directly. does not take any options or args.
    simply runs test_module
    """
    ...
Don’t let the LLM slop around overwriting your instructions.
# give the group (agents) read-only access to the file
sudo chmod g=r users_store.pyi
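As a sanity check, you can verify mode bits programmatically. This sketch creates a scratch file whose group bits are read-only, analogous to what we want for users_store.pyi (assumes a POSIX system):

```python
import os
import stat
import tempfile

# create a scratch file and restrict the group to read-only,
# mirroring the desired permissions on the spec file
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o644)  # owner rw, group r, other r

mode = os.stat(path).st_mode
assert mode & stat.S_IRGRP            # group can read
assert not (mode & stat.S_IWGRP)      # group cannot write (or overwrite) the file
os.remove(path)
```

The same `os.stat` check on users_store.pyi itself would confirm the agent user, acting through group membership, cannot clobber your spec.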
Generated Implementation | users_store.py
I define some prompts here, but on reflection these should be tweaked via the .pyi instead of the actual prompt below. Ideally, the maintainer of a .pyi MCP server (if one exists one day) would tune a “system prompt” for this.
sudo --user agent claude \
--system-prompt 'if mypy fails, your changes fail and will be discarded' \
'implement users_store.py as per users_store.pyi specification'
The following was generated for me
users_store.py
from dataclasses import dataclass
import uuid


@dataclass
class DBConfig:
    host: str
    port: int
    database: str
    user: str
    password: str


@dataclass
class User:
    id: str
    user_name: str


# Module-level variable to store DB connection state
_db_connection: DBConfig | None = None

# In-memory storage for users (simulating a database)
_users_db: dict[str, User] = {}


def init_db(db_config: DBConfig) -> None:
    """connects to the database using the db_config settings"""
    global _db_connection
    _db_connection = db_config
    print(
        f"Connected to database at {db_config.host}:{db_config.port}/{db_config.database}"
    )


def create_user(user_name: str) -> User:
    """Create a new user with a generated UUID"""
    user_id = str(uuid.uuid4())
    user = User(id=user_id, user_name=user_name)
    _users_db[user_id] = user
    return user


def get_user(id: str) -> User:
    """Retrieve a user by their ID"""
    if id not in _users_db:
        raise KeyError(f"User with id '{id}' not found")
    return _users_db[id]


def update_user(id: str, user: User) -> User:
    """Update an existing user"""
    if id not in _users_db:
        raise KeyError(f"User with id '{id}' not found")
    _users_db[id] = user
    return user


def delete_user(id: str) -> User:
    """Delete a user and return the deleted user"""
    if id not in _users_db:
        raise KeyError(f"User with id '{id}' not found")
    user = _users_db[id]
    del _users_db[id]
    return user


def test_module() -> None:
    """Run basic tests to verify the module works correctly"""
    print("Running smoke tests...")

    # Test 1: Initialize database
    print("Test 1: Initializing database...")
    db_config = DBConfig(
        host="localhost",
        port=5432,
        database="testdb",
        user="testuser",
        password="testpass",
    )
    init_db(db_config)

    # Test 2: Create a user
    print("Test 2: Creating a user...")
    user1 = create_user("john_doe")
    print(f"Created user: {user1}")
    assert user1.user_name == "john_doe"
    assert user1.id is not None

    # Test 3: Get the user
    print("Test 3: Getting the user...")
    retrieved_user = get_user(user1.id)
    print(f"Retrieved user: {retrieved_user}")
    assert retrieved_user.id == user1.id
    assert retrieved_user.user_name == "john_doe"

    # Test 4: Update the user
    print("Test 4: Updating the user...")
    updated_user = User(id=user1.id, user_name="jane_doe")
    result = update_user(user1.id, updated_user)
    print(f"Updated user: {result}")
    assert result.user_name == "jane_doe"

    # Test 5: Delete the user
    print("Test 5: Deleting the user...")
    deleted_user = delete_user(user1.id)
    print(f"Deleted user: {deleted_user}")
    assert deleted_user.id == user1.id

    # Test 6: Verify user is deleted
    print("Test 6: Verifying user is deleted...")
    try:
        get_user(user1.id)
        assert False, "User should have been deleted"
    except KeyError:
        print("User successfully deleted (KeyError raised as expected)")

    print("\nAll smoke tests passed!")


def command_line() -> None:
    """executed if file ran directly. does not take any options or args.
    simply runs test_module
    """
    test_module()


if __name__ == "__main__":
    command_line()
There are a few implementation details above that may be concerning and that call for further user-defined detail in the .pyi docstrings or the prompt, such as:
- it defaults to an in-memory proxy “DB”
- which DB should be used?
- general ambiguity of requirements: is this module an interface to real DB(s), or a proxy to a DB for testing?
- program prerequisites, i.e.: running a DB locally, package deps, OS and hardware requirements, etc.
- how should resource management be handled, or should this module even be concerned with it?
- connections, pools, timeouts, concurrency, general performance, DB administration, etc.
These are all beside the point of this article, but they do show that the .pyi is too slim.
Static Analysis with mypy
uv run mypy --strict users_store.py
This could be run by the agent user via the LLM, claude, enabling a better automated feedback loop and more of the “agent”-like behavior people were hyping up over the last year (2025).
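A minimal driver for that loop might look like the following sketch. `type_check` is a hypothetical helper (not from the article's transcript), and it assumes mypy is installed in the active environment:

```python
import subprocess
import sys


def type_check(path: str) -> tuple[bool, str]:
    """Run mypy --strict on a file; return (passed, diagnostics)."""
    result = subprocess.run(
        [sys.executable, "-m", "mypy", "--strict", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


# Sketch of the feedback loop: feed failures back to the LLM until the check passes.
# `regenerate` is a hypothetical wrapper around the `claude` invocation shown earlier.
#
# passed, diagnostics = type_check("users_store.py")
# while not passed:
#     regenerate("users_store.py", extra_context=diagnostics)
#     passed, diagnostics = type_check("users_store.py")
```

Passing the mypy diagnostics back as extra context is what closes the loop: the agent's next attempt is constrained by the exact errors of its last one.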
Notes
The key point is that the LLM only has access to the module it’s intended to write, and that we prompted it with:
- signatures or type definitions of data types, functions, variables, classes, etc.
- docstrings defined for any of those user-defined constructs
- the actual natural language prompt to claude
I was a little surprised the docstring gets copied over. It would be annoying to maintain one docstring for the signature and a different one for its implementation, but at least that would let us distinguish between what I want and what the LLM “thinks” the implementation is. Ideally, the LLM should align only to what I specify and break things down into smaller functions. Regardless, the public API at least should be tested so that we can trust and verify it works.
Type Checking Validation
mypy does not fail if the .py doesn’t implement everything, whereas compiling a C header (.h) against a source file (.c) would fail. To add a layer of validation, we could parse the ASTs (abstract syntax trees) and compare the signatures in the spec against the LLM’s generated implementation. In the meantime, you can open the two files in your IDE of choice and spot-check the symbols defined in them. The .py will often have more symbols defined.
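A rough sketch of that AST comparison, using the standard library `ast` module on inline source strings for illustration (a real version would read the two files from disk):

```python
import ast


def top_level_functions(source: str) -> dict[str, list[str]]:
    """Map top-level function names to their parameter names."""
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in ast.parse(source).body
        if isinstance(node, ast.FunctionDef)
    }


stub = "def create_user(user_name: str) -> User: ..."
impl = "def create_user(user_name):\n    return user_name"

spec, gen = top_level_functions(stub), top_level_functions(impl)

# functions the spec requires but the implementation is missing
missing = spec.keys() - gen.keys()

# functions whose parameter names drifted from the spec
mismatched = [name for name in spec.keys() & gen.keys() if spec[name] != gen[name]]
```

This only checks names and parameters; comparing annotation nodes as well would catch type drift that mypy misses when a symbol is simply absent.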
mypy compares what’s specified in the .pyi against what’s used in the .py equivalent. The LLM has to copy definitions for data types like User and DBConfig. We can expect our spec to be implemented, import it from the .py, and use it as the public API.
Alternatively, we could take the same idea and apply it in an actual statically typed language such as C or Rust. User defines spec in a .h (C header file) then LLM generates its .c (C source file). We could also define interfaces/traits in Rust then prompt the LLM to implement those in another file, but this has less support for automatic validation.
Module Docstring instead of extra prompt
I thought of this afterwards, but it makes a lot more sense to define a module docstring in the .pyi than to refine the prompt. It keeps the overall intent grouped with the spec that breaks down the intended program further. The docstring doubles as documentation and as additional natural language guidance for the implementation. It’s placed at the top of the file. LLMs seem to default to Sphinx Napoleon / Google style in their docstrings.
An example would be:
"""
The users_store module that is the interface for CRUD operations in the DB on users.
Example::
import users_store
db_config = users_store.DBConfig(host="localhost", port=5432, database="testdb", user="testuser", password="")
users_store.init_db(db_config)
users_store.create_user("alice")
"""
Different Angle for User Error
The type of responsibility placed on the user in this system has shifted compared to pure vibe coding. Logic errors, the ETC (easy to change) factor, or performance issues can propagate from your definitions. As with all other LLM use, an expert in the loop is better than a novice.
Future Work
- put this in MCP (model context protocol)
- user defines some_module.pyi, tells LLM to implement it
- improve dev experience of LLM agent as a separate linux user
- additional layer of validation: parse AST and compare signatures
- experiment doing this in Rust or another language
- dogfood the idea more
Conclusion
We looked at a structured way to define what Python the LLM should generate, and we automatically validated the output, providing a self-feedback loop to the agent for one class of requirement: signatures and types.