Introduction
LLMs (large language models) often produce overly verbose code, and natural language doesn’t map cleanly to implementation. This article shows how to constrain what an LLM can read, write, and execute, and how to validate the result with mypy. The idea is to define types and symbols in a .pyi stub file, then prompt the LLM to generate a Python implementation that adheres to those definitions, and finally check it using mypy.
Part of the inspiration is that “spec” has become a bigger buzzword, yet most uses I see are either markdown slop from the LLM or user-defined markdown, which is still just plaintext natural language.
[Diagram] The user defines example.pyi containing the stub `def add(a: int, b: int) -> int: ...`; the LLM then generates example.py containing the implementation `def add(a: int, b: int) -> int: return a + b`.
Why Python?
The LLM likely knows this language best. There’s a lot more friction in using a less common language such as Julia.
What is mypy?
mypy is a static type checker for Python that analyzes your code without running it. It can prevent things like calling a function with an integer where a string is expected.
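As a minimal illustration (a hypothetical `greet` function, not part of the project below), mypy can catch a mismatched argument type before the code ever runs:

```python
def greet(name: str) -> str:
    return "Hello, " + name

print(greet("world"))  # fine

# greet(42)
# mypy reports something like:
#   error: Argument 1 to "greet" has incompatible type "int"; expected "str"
```

With the bad call uncommented, `mypy` fails the check even though plain Python would only crash at runtime.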
Outline:
- Introduction
- Walkthrough
- Module Docstring instead of extra prompt
- Different Angle for User Error
- Future Work
- Conclusion
Walkthrough
In the following, I run through the whole process in my shell, in which we
- create a linux user and group agent:agents
- setup project dir
- define .pyi (python stub file which contains only type annotations)
- prompt the LLM to generate code
- run mypy
I was using byobu, a terminal multiplexer, to make the interactive setup easier, but I substitute sudo (switch-user-do) commands below so the output can be shown in one combined transcript.
Prerequisites
- linux user and group management (usermod, groupadd, useradd)
- uv (python venv and pkg manager)
- claude (claude-code) https://code.claude.com/docs/en/quickstart
you could use an alternative LLM
not used here, but black (via uv add black and uv run black users_store.py) can fix Python whitespace and formatting issues
Project Setup
On one of my servers (hostname arch), the setup looks roughly like this:
sudo mkdir -p /opt/sig_impl # a directory with default access for multiple users
cd /opt/sig_impl
Access Control
I haven’t seen this discussed much in the community, but file permissions/modes, combined with a separate user:group for the process running the LLM, are a great way to control what it can read, write, or execute within a project or across your file system.
sudo groupadd agents
sudo usermod --append --groups agents $USER
id $USER # verify your user is in the agents group
# creating a home for the new user because `claude` defaults to using the home directory for a config file
sudo useradd agent --create-home --groups agents
sudo chown --recursive $USER:agents . # update the project dir ownership
Continued project setup
uv init --name sig_impl
ls -valh # check default files created
uv add mypy
uv run mypy # verify setup
sudo chown --recursive $USER:agents . # update ownership again for the files uv just created
With this in place, we’ll run claude to generate code as the agent user via sudo (switch-user-do).
User-Defined Spec | users_store.pyi
For the actual example, we want the LLM to implement a users-store library that lets us easily create, read, update, or delete users in a DB.
users_store.pyi
from dataclasses import dataclass


@dataclass
class DBConfig:
    host: str
    port: int
    database: str
    user: str
    password: str


@dataclass
class User:
    id: str
    user_name: str


def init_db(db_config: DBConfig) -> None:
    """connects to the database using the db_config settings"""
    ...

def create_user(user_name: str) -> User: ...
def get_user(id: str) -> User: ...
def update_user(id: str, user: User) -> User: ...
def delete_user(id: str) -> User: ...
def test_module() -> None: ...

def command_line() -> None:
    """executed if file ran directly. does not take any options or args.
    simply runs test_module
    """
    ...
Don’t let the LLM slop around overwriting your instructions.
# give the group (agents) read-only access to the file
sudo chmod g=r users_store.pyi
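As a sanity check, you can verify mode bits programmatically. This sketch creates a scratch file whose group bits are read-only, analogous to what we want for users_store.pyi (assumes a POSIX system):

```python
import os
import stat
import tempfile

# create a scratch file and restrict the group to read-only,
# mirroring the desired permissions on the spec file
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o644)  # owner rw, group r, other r

mode = os.stat(path).st_mode
assert mode & stat.S_IRGRP            # group can read
assert not (mode & stat.S_IWGRP)      # group cannot write (or overwrite) the file
os.remove(path)
```

The same `os.stat` check on users_store.pyi itself would confirm the agent user, acting through group membership, cannot clobber your spec.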
Generated Implementation | users_store.py
I define some prompts here, but on reflection these should be tweaked via the .pyi instead of the actual prompt below. Ideally, the maintainer of a .pyi MCP server (if one exists one day) would tune a “system prompt” for this.
sudo --user agent claude \
--system-prompt 'if mypy fails, your changes fail and will be discarded' \
'implement users_store.py as per users_store.pyi specification'
The following was generated for me
users_store.py
from dataclasses import dataclass
import uuid


@dataclass
class DBConfig:
    host: str
    port: int
    database: str
    user: str
    password: str


@dataclass
class User:
    id: str
    user_name: str


# Module-level variable to store DB connection state
_db_connection: DBConfig | None = None

# In-memory storage for users (simulating a database)
_users_db: dict[str, User] = {}


def init_db(db_config: DBConfig) -> None:
    """connects to the database using the db_config settings"""
    global _db_connection
    _db_connection = db_config
    print(
        f"Connected to database at {db_config.host}:{db_config.port}/{db_config.database}"
    )


def create_user(user_name: str) -> User:
    """Create a new user with a generated UUID"""
    user_id = str(uuid.uuid4())
    user = User(id=user_id, user_name=user_name)
    _users_db[user_id] = user
    return user


def get_user(id: str) -> User:
    """Retrieve a user by their ID"""
    if id not in _users_db:
        raise KeyError(f"User with id '{id}' not found")
    return _users_db[id]


def update_user(id: str, user: User) -> User:
    """Update an existing user"""
    if id not in _users_db:
        raise KeyError(f"User with id '{id}' not found")
    _users_db[id] = user
    return user


def delete_user(id: str) -> User:
    """Delete a user and return the deleted user"""
    if id not in _users_db:
        raise KeyError(f"User with id '{id}' not found")
    user = _users_db[id]
    del _users_db[id]
    return user


def test_module() -> None:
    """Run basic tests to verify the module works correctly"""
    print("Running smoke tests...")

    # Test 1: Initialize database
    print("Test 1: Initializing database...")
    db_config = DBConfig(
        host="localhost",
        port=5432,
        database="testdb",
        user="testuser",
        password="testpass",
    )
    init_db(db_config)

    # Test 2: Create a user
    print("Test 2: Creating a user...")
    user1 = create_user("john_doe")
    print(f"Created user: {user1}")
    assert user1.user_name == "john_doe"
    assert user1.id is not None

    # Test 3: Get the user
    print("Test 3: Getting the user...")
    retrieved_user = get_user(user1.id)
    print(f"Retrieved user: {retrieved_user}")
    assert retrieved_user.id == user1.id
    assert retrieved_user.user_name == "john_doe"

    # Test 4: Update the user
    print("Test 4: Updating the user...")
    updated_user = User(id=user1.id, user_name="jane_doe")
    result = update_user(user1.id, updated_user)
    print(f"Updated user: {result}")
    assert result.user_name == "jane_doe"

    # Test 5: Delete the user
    print("Test 5: Deleting the user...")
    deleted_user = delete_user(user1.id)
    print(f"Deleted user: {deleted_user}")
    assert deleted_user.id == user1.id

    # Test 6: Verify user is deleted
    print("Test 6: Verifying user is deleted...")
    try:
        get_user(user1.id)
        assert False, "User should have been deleted"
    except KeyError:
        print("User successfully deleted (KeyError raised as expected)")

    print("\nAll smoke tests passed!")


def command_line() -> None:
    """executed if file ran directly. does not take any options or args.
    simply runs test_module
    """
    test_module()


if __name__ == "__main__":
    command_line()
There are a few implementation details above that may be concerning and that call for further user-defined detail in the .pyi docstrings or the prompt, such as:
- it defaults to an in-memory proxy “DB”
- which DB should be used?
- general ambiguity of requirements: is this module an interface to real DB(s), or a proxy to a DB for testing?
- program prerequisites, i.e.: running a DB locally, package deps, OS and hardware requirements, etc.
- how should resource management be handled, or should this module even be concerned with it?
- connections, pools, timeouts, concurrency, general performance, DB administration, etc.
These are all beside the point of this article, but they do show that the .pyi is too slim.
Static Analysis with mypy
uv run mypy --strict users_store.py
This could be run by the agent user via the LLM, claude, enabling a better automated feedback loop and more of the “agent”-like behavior people were hyping up over the last year (2025).
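A minimal driver for that loop might look like the following sketch. `type_check` is a hypothetical helper (not from the article's transcript), and it assumes mypy is installed in the active environment:

```python
import subprocess
import sys


def type_check(path: str) -> tuple[bool, str]:
    """Run mypy --strict on a file; return (passed, diagnostics)."""
    result = subprocess.run(
        [sys.executable, "-m", "mypy", "--strict", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


# Sketch of the feedback loop: feed failures back to the LLM until the check passes.
# `regenerate` is a hypothetical wrapper around the `claude` invocation shown earlier.
#
# passed, diagnostics = type_check("users_store.py")
# while not passed:
#     regenerate("users_store.py", extra_context=diagnostics)
#     passed, diagnostics = type_check("users_store.py")
```

Passing the mypy diagnostics back as extra context is what closes the loop: the agent's next attempt is constrained by the exact errors of its last one.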
Notes
The key point is that the LLM only has access to the module it’s intended to write, and that we prompted it with:
- signatures or type definitions of data types, functions, variables, classes, etc.
- docstrings defined for any of those user-defined constructs
- the actual natural language prompt to claude
I was a little surprised the docstring gets copied over. It would be annoying to maintain one docstring for the signature and a different one for its implementation, but at least that would let us distinguish between what I want and what the LLM “thinks” the implementation is. Ideally, the LLM should align only to what I specify and break things down into smaller functions. Regardless, the public API at least should be tested so that we can trust and verify it works.
Type Checking Validation
mypy does not fail if the .py doesn’t implement everything, whereas compiling a C header (.h) against a source file (.c) would fail. To add a layer of validation, we could parse the ASTs (abstract syntax trees) and compare the signatures in the spec against the LLM’s generated implementation. In the meantime, you can open the two files in your IDE of choice and spot-check the symbols defined in them. The .py will often have more symbols defined.
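A rough sketch of that AST comparison, using the standard library `ast` module on inline source strings for illustration (a real version would read the two files from disk):

```python
import ast


def top_level_functions(source: str) -> dict[str, list[str]]:
    """Map top-level function names to their parameter names."""
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in ast.parse(source).body
        if isinstance(node, ast.FunctionDef)
    }


stub = "def create_user(user_name: str) -> User: ..."
impl = "def create_user(user_name):\n    return user_name"

spec, gen = top_level_functions(stub), top_level_functions(impl)

# functions the spec requires but the implementation is missing
missing = spec.keys() - gen.keys()

# functions whose parameter names drifted from the spec
mismatched = [name for name in spec.keys() & gen.keys() if spec[name] != gen[name]]
```

This only checks names and parameters; comparing annotation nodes as well would catch type drift that mypy misses when a symbol is simply absent.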
mypy compares what’s specified in the .pyi against what’s used in the .py equivalent. The LLM has to copy definitions for data types like User and DBConfig. We can expect our spec to be implemented, import it from the .py, and use it as the public API.
Alternatively, we could take the same idea and apply it in an actual statically typed language such as C or Rust. User defines spec in a .h (C header file) then LLM generates its .c (C source file). We could also define interfaces/traits in Rust then prompt the LLM to implement those in another file, but this has less support for automatic validation.
Module Docstring instead of extra prompt
I thought of this afterwards, but it makes a lot more sense to define a module docstring in the .pyi than to refine the prompt. It keeps the overall intent grouped with the spec that breaks down the intended program further. The docstring doubles as documentation and as additional natural language guidance for the implementation. It’s placed at the top of the file. LLMs seem to default to Sphinx Napoleon / Google style in their docstrings.
An example would be:
"""
The users_store module that is the interface for CRUD operations in the DB on users.
Example::
import users_store
db_config = users_store.DBConfig(host="localhost", port=5432, database="testdb", user="testuser", password="")
users_store.init_db(db_config)
users_store.create_user("alice")
"""
Different Angle for User Error
The type of responsibility placed on the user in this system has shifted compared to pure vibe coding. Logic errors, the ETC (easy to change) factor, or performance issues can propagate from your definitions. As with all other LLM use, an expert in the loop is better than a novice.
Future Work
- put this in MCP (model context protocol)
- user defines some_module.pyi, tells LLM to implement it
- improve dev experience of LLM agent as a separate linux user
- additional layer of validation: parse AST and compare signatures
- experiment doing this in Rust or another language
- dogfood the idea more
Conclusion
We looked at a structured way to define what Python the LLM should generate, and we automatically validated the output, providing a self-feedback loop to the agent for one class of requirement: signatures and types.