A plugin to find bugs in a codebase using property-based testing
You can install this plugin either from the all-plugins marketplace or from its own auto-generated marketplace JSON. A marketplace is a collection of plugins: add a marketplace once (step 1), then install any plugin from it (step 2).

Option 1: Install from the all-plugins marketplace. This is a one-time setup that gives you access to all plugins; use it if you plan to install multiple plugins now or later.

Step 1: Add the marketplace (one-time)

/plugin marketplace add https://claudepluginhub.com/marketplaces/all.json

Step 2: Install this plugin

/plugin install hypo-plugin@all

Option 2: Install from this plugin's auto-generated marketplace JSON. Use this if you only want to try this specific plugin.

Step 1: Add this plugin's marketplace

/plugin marketplace add https://claudepluginhub.com/marketplaces/plugins/hypo-plugin.json

Step 2: Install the plugin

/plugin install hypo-plugin@hypo-plugin
Get a coding agent to find bugs in your codebase by mining properties and testing them via Hypothesis.
For the artifacts from the paper, including bug reports and rankings, see the `paper/` directory. Note that the code that was used in the paper is slightly behind what is in the main folder; see `paper/README.md` for more details.
To see all the bugs our agent found, see our website.
The agent is a Claude Code command, so you will need Claude Code installed to run it, along with either a subscription or an API key (we recommend an API key if you are running it over a large number of packages, or to reproduce the paper).
The command is contained in the `hypo.md` file. You will need to place this file in the `.claude/commands/` directory, which can be either in `~` or in whichever directory you are running the agent from. The agent can then be invoked with `/hypo <target>`.

You will need `pytest`, `hypothesis`, and the package you are testing installed.
The agent takes one argument: the target to test. This can be a file, a function, or a module. If no argument is given, it tests the entire codebase, i.e., the current working directory. You can also pass whatever other arguments Claude Code supports, such as the model, permissions, etc.
Example usage:
claude "/hypo numpy"
claude "/hypo statistics.median" --model opus
You can also just start Claude Code, and then invoke the agent.
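The agent's findings are backed by ordinary Hypothesis property tests that it runs with pytest. To give a flavor of that style, here is a minimal sketch; the target and property below are illustrative, not actual agent output:

```python
# test_median_property.py -- run with `pytest test_median_property.py`
import statistics

from hypothesis import given, strategies as st


@given(st.lists(st.floats(min_value=-1e6, max_value=1e6, allow_nan=False),
                min_size=1))
def test_median_lies_between_min_and_max(xs):
    # Property: for any non-empty list, the median is bounded by min and max.
    m = statistics.median(xs)
    assert min(xs) <= m <= max(xs)
```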
The `run.py` script is a wrapper around the agent for testing multiple packages in parallel; it is what was used in the paper. The script has no requirements beyond the standard library (of course, you still need to have Claude Code installed). You need `python3` and `pip` to be on your `PATH`.
Note that the runner operates at the module level.
The only required argument is the path to a JSON file listing the packages to test and which modules to test within each package. It looks like:
{
    "pathlib": {
        "type": "stdlib",
        "modules": ["pathlib"]
    },
    "numpy": {
        "type": "pypi",
        "modules": ["numpy"]
    }
}
The keys in the JSON file are the package names, either the standard library name or the PyPI name. For standard library packages, specify "stdlib", and for PyPI packages, specify "pypi". This matters because it tells the runner how to set up the virtual environment.
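For illustration, here is a minimal sketch of reading such a file and separating stdlib from PyPI packages. This is not run.py's actual code; the helper name and checks are illustrative:

```python
import json
from pathlib import Path

def load_packages(path="packages.json"):
    """Load the packages JSON and lightly check its shape."""
    packages = json.loads(Path(path).read_text())
    for name, spec in packages.items():
        if spec.get("type") not in {"stdlib", "pypi"}:
            raise ValueError(f"{name}: 'type' must be 'stdlib' or 'pypi'")
        if not spec.get("modules"):
            raise ValueError(f"{name}: 'modules' must be a non-empty list")
    return packages

packages = load_packages()
stdlib = [n for n, s in packages.items() if s["type"] == "stdlib"]
pypi = [n for n, s in packages.items() if s["type"] == "pypi"]
print(f"{len(stdlib)} stdlib and {len(pypi)} PyPI packages to test")
```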
The runner takes three optional arguments:

- `--max-workers`: the number of parallel workers to use. Default is 20.
- `--model`: the model to use. Default is "opus".
- `--preinstall-workers`: the number of parallel workers to use for setting up the virtual environments. Default is 10.

The runner writes all bug reports to the `results/` directory.
Example usage:
python run.py packages.json
In the `example_packages/` directory, there are some example package JSON files to test:

- `packages_mini.json`: a mini set of modules to test (this took 6 minutes to run with default settings)
- `packages_10k.json`: the top 10,000 PyPI packages, with the main module and all submodules one level deep

The packages tested in the paper are in the `paper/` directory.
The runner sets up virtual environments, with `venv`, for each package. Standard library packages all share one virtual environment, while each PyPI package gets its own. The runner also installs `pytest` and `hypothesis` in each virtual environment. It does this in parallel, which is controllable; see the CLI arguments above.

It then sets up worker directories, up to the specified maximum number of workers (again, see the CLI arguments above), each of which is a "sandbox" for the agent to run in. The agent only has permission to edit files within this sandbox. Each worker directory also contains `.claude/commands/hypo.md`, so that the agent can run. The runner parallelizes across modules.
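As a rough illustration of the preinstall step described above, here is a simplified sketch of creating per-package virtual environments in parallel. The directory layout and helper names are assumptions for the example, not run.py's actual code, and POSIX paths are assumed:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def create_venv(pkg: str, pkg_type: str, root: Path = Path("venvs")) -> Path:
    """Create a virtual environment for one package and install test deps."""
    # Standard library packages share one environment; PyPI packages get their own.
    venv_dir = root / ("stdlib" if pkg_type == "stdlib" else pkg)
    if not venv_dir.exists():
        subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
        pip = venv_dir / "bin" / "pip"  # POSIX layout; Windows uses Scripts/
        deps = ["pytest", "hypothesis"] + ([pkg] if pkg_type == "pypi" else [])
        subprocess.run([str(pip), "install", *deps], check=True)
    return venv_dir

packages = {"pathlib": "stdlib", "numpy": "pypi"}  # as parsed from packages.json
with ThreadPoolExecutor(max_workers=10) as pool:  # cf. --preinstall-workers
    futures = {pkg: pool.submit(create_venv, pkg, t) for pkg, t in packages.items()}
    for pkg, fut in futures.items():
        print(pkg, "->", fut.result())
```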
Note that the runner also checks if the module has already been tested, and skips it if so. So, you can easily resume a run by just running the runner again.
The runner calls the agent with restricted permissions. It only has permission to read/write/edit files in the sandbox in which it is called, and it also has read permission to the virtual environment, so that it can read the source code of the package. Furthermore, it can only write/edit `.py` and `.md` files. The only bash commands it can run are `python` and `pytest`. Note that, because of how the virtual environments are set up, the Python command will be `python`. Lastly, it also has access to the `Todo` and `WebFetch` tools.
You should still be careful with the runner, because running arbitrary code is dangerous!
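For orientation, here is a stripped-down sketch of how a runner might invoke the agent per module from inside a worker sandbox. It shows only the documented `claude "/hypo <target>"` and `--model` pieces; the permission restrictions the real run.py passes are omitted, and the worker directory path is hypothetical:

```python
import subprocess
from pathlib import Path

def run_agent(module: str, workdir: Path, model: str = "opus") -> int:
    """Invoke the /hypo command on one module from inside a worker sandbox.

    Assumes workdir already contains .claude/commands/hypo.md and that
    `claude` is on PATH. A headless runner would typically also use Claude
    Code's non-interactive (print) mode; that is left out here.
    """
    cmd = ["claude", f"/hypo {module}", "--model", model]
    result = subprocess.run(cmd, cwd=workdir)
    return result.returncode

# Hypothetical worker directory; the real runner creates these itself.
print(run_agent("statistics.median", Path("workers/worker_0")))
```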
In the `results/` directory, there will be a directory named after each package. Each of these has the following structure:

- `bug_reports/`
- `logs/`
  - `claude_call_$id.json`: the log of the Claude Code call corresponding to this id
- `aux_files/`
  - `$id/`
- `call_mappings.jsonl`, with the following format:
  - `call_id`: the Claude Code call id
  - `module`: the module tested
  - `timestamp`: the date executed
  - `bug_reports`: the filename of any bug reports in the `bug_reports/` directory written by this Claude Code call
  - `aux_files_dir`: the directory containing all files written by the agent during the Claude Code call corresponding to this id
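Since `call_mappings.jsonl` is plain JSON Lines, tracing a bug report back to the Claude Code call that produced it is straightforward. A small illustrative sketch, using the field names above; the package directory and report filename are hypothetical:

```python
import json
from pathlib import Path

def find_call_for_report(package_dir: Path, report_name: str):
    """Return the call_mappings.jsonl entry that produced a given bug report."""
    for line in (package_dir / "call_mappings.jsonl").read_text().splitlines():
        entry = json.loads(line)
        reports = entry.get("bug_reports") or []
        if isinstance(reports, str):  # tolerate a single filename or a list
            reports = [reports]
        if report_name in reports:
            return entry
    return None

entry = find_call_for_report(Path("results/numpy"), "bug_report_example.md")
if entry is not None:
    print("log:", Path("results/numpy/logs") / f"claude_call_{entry['call_id']}.json")
    print("aux files:", entry["aux_files_dir"])
```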
To score the bug reports, you can run `python scoring.py results/`. This uses the rubric contained in that file and passes it to Claude (not Claude Code, just the Claude API). The script outputs a CSV file containing the scores for each bug report, as well as the reasoning.
It takes the following arguments:

- `--retry-failures`: if set, it will retry the bug reports that failed to score. This requires the CSV file to already exist, as it checks for failed scores in the CSV file.
- `reports_dir`: the directory containing the bug reports to score. Default is "results/".
- `--max-workers`: the number of parallel workers to use. Default is 20.
- `--model`: the model to use. Default is "claude-opus-4-1" (note that model names are different when using the Claude API directly).
- `--csv-path`: the path to the CSV file to write the results to. Default is "scoring_results.csv".

Example usage:
python scoring.py results/
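To illustrate the scoring approach of passing a rubric plus a report to the Claude API, here is a minimal sketch using the anthropic Python SDK. The rubric text, prompt wording, and report path are assumptions for the example, not scoring.py's actual implementation:

```python
import anthropic

# Illustrative stand-in for the rubric; the real rubric lives in scoring.py.
RUBRIC = "Score this bug report for validity and severity, and explain your reasoning."

def score_report(report_text: str, model: str = "claude-opus-4-1") -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\n{report_text}"}],
    )
    return message.content[0].text

# Hypothetical report path, for illustration.
print(score_report(open("results/numpy/bug_reports/example.md").read()))
```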
1.0.0