Test-smell taxonomy

A curated, polyglot catalog of test-suite failure modes that are detectable, fixable, and language-independent: the subset of test-suite problems where an LLM's semantic judgment beats a linter's syntactic one.

The taxonomy is foundational. It names smells, describes how to see them, and prescribes how to fix them. It does not describe any particular tool.

The catalog

Ordered roughly by how much semantic reasoning each entry demands: lower-numbered entries need reasoning a linter cannot do; higher-numbered entries lean more on mechanical signals.

| #  | Slug                   | Core move                                                   | Severity | Detection scope       |
|----|------------------------|-------------------------------------------------------------|----------|-----------------------|
| 1  | deliverable-fossils    | rename + regroup per product behavior                       | High     | per-test, cross-suite |
| 2  | semantic-redundancy    | cluster, pick canonical, fold/delete the rest               | High     | cross-suite           |
| 3  | wrong-level            | relocate to correct pyramid tier                            | Medium   | cross-suite           |
| 4  | naming-lies            | rename test or strengthen body to match the claim           | Medium   | per-test              |
| 5  | vacuous-assertion      | strengthen the oracle                                       | High     | per-test              |
| 6  | pseudo-tested          | add assertion that kills the no-op mutant                   | High     | per-test              |
| 7  | tautology-theatre      | delete or rewrite to exercise real SUT                      | Critical | per-test              |
| 8  | over-specified-mock    | relax to behavior-relevant interaction only                 | High     | per-test              |
| 9  | implementation-coupled | drive through public API instead                            | High     | per-test              |
| 10 | presentation-coupled   | parse then assert semantics, not formatting                 | Medium   | per-test              |
| 11 | conditional-logic      | split or pin the precondition                               | Medium   | per-test              |
| 12 | shared-state           | move setup to per-test factory / restore globals            | Medium   | per-file              |
| 13 | mystery-guest          | inline a 1–3 line summary of relevant fixture shape         | Low      | per-test              |
| 14 | rotten-green           | delete the empty/dead scaffold or add the missing assertion | Low      | per-test              |
| 15 | monolithic-test-file   | split file by behavior domain                               | Medium   | per-file              |

Severity is a relative-harm/safety hint: how bad the smell is for the suite, weighted by how safe the canonical fix is. Critical smells can usually be deleted outright because they were killing no mutants. Lower severities need transforms and correspondingly more reviewer attention. No severity is a mandate to act; it is input to prioritization.
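To make the severity logic concrete, here is a minimal pytest sketch of two entries from opposite ends of the scale. Every name in it (`Order`, `parse_order`, the mocked client) is a hypothetical illustration, not part of the taxonomy:

```python
from dataclasses import dataclass
from unittest.mock import Mock
import json

# Hypothetical SUT, inlined so the sketch is self-contained.
@dataclass
class Order:
    id: int
    qty: int

def parse_order(raw: str) -> Order:
    data = json.loads(raw)
    return Order(id=data["id"], qty=data["qty"])

# 5. vacuous-assertion (High): the SUT runs, but the oracle
# accepts almost any outcome, so most mutants survive.
def test_parse_order_vacuous():
    assert parse_order('{"id": 7, "qty": 2}') is not None

# Canonical fix: strengthen the oracle to pin real behavior.
def test_parse_order_extracts_fields():
    assert parse_order('{"id": 7, "qty": 2}') == Order(id=7, qty=2)

# 7. tautology-theatre (Critical): the test round-trips its own
# mock and never touches the SUT, so it kills no mutants at all.
# That is why outright deletion is the safe canonical fix.
def test_fetch_order_tautology():
    client = Mock()
    client.fetch.return_value = {"id": 7}
    assert client.fetch("7") == {"id": 7}
```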

Non-goals

These are covered by existing tooling. Where a linter, mutation tool, or codemod runner already does the work deterministically, this taxonomy defers.

  • Syntactic smell counts (TsDetect-style scoreboards). The EMSE 2022 follow-up study [1] found classical smell counts uncorrelated with maintenance pain, and that machine-generated tests actually score better on smell detectors while being semantically worse. Optimizing for smell counts is an explicit anti-goal.
  • Net-new test generation (handled by tools like CoverUp [2]).
  • Framework migrations (handled by jest-codemods, OpenRewrite, unittest2pytest, and similar).
  • Flaky detection (handled by DeFlaker, pytest-rerunfailures, test-retry plugins). This catalog names flakiness root causes when they surface as shared-state or conditional-logic; detection itself is out of scope.

Governor rules

Every prescribed fix in the catalog is bounded by the governor rules in principles (knowledge-DRY not syntactic-DRY, no extract-for-testability, no speculative code, commit-before-refactor) and by the broader principle that a refactor must preserve regression-detection power.
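As one illustration of knowledge-DRY versus syntactic-DRY applied to a semantic-redundancy fold, a minimal pytest sketch (the shipping-fee rules and all names are hypothetical):

```python
import pytest

# Hypothetical SUT: two distinct shipping-fee rules.
def shipping_fee(subtotal: float) -> float:
    if subtotal >= 50.0:
        return 0.0   # rule: free shipping at the threshold
    return 4.99      # rule: flat fee below the threshold

# Syntactic-DRY sees near-duplicate lines and wants one test;
# knowledge-DRY keeps both oracles, because each pins a distinct
# rule whose regression a mutant could otherwise slip past.
def test_free_shipping_at_threshold():
    assert shipping_fee(50.0) == 0.0

def test_flat_fee_below_threshold():
    assert shipping_fee(49.99) == 4.99

# A knowledge-preserving fold is allowed: parametrization removes
# the textual duplication while keeping every oracle intact.
@pytest.mark.parametrize(("subtotal", "fee"), [(50.0, 0.0), (49.99, 4.99)])
def test_shipping_fee_rules(subtotal, fee):
    assert shipping_fee(subtotal) == fee
```

The parametrized form is the only fold the governor rules permit here: it removes duplication without dropping an oracle, so regression-detection power is preserved.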


  1. Panichella, A. et al. (2022). Test Smells 20 Years Later: Detectability, Validity, and Reliability. Empirical Software Engineering, 27, Article 170. https://link.springer.com/article/10.1007/s10664-022-10207-5. Concludes that classical smell catalogs correlate poorly with real maintenance pain, and warns specifically against optimizing smell counts as a KPI.

  2. Altmayer Pizzorno, J. & Berger, E. (2025). CoverUp: Effective High-Coverage Test Generation for Python. Proceedings of the ACM on Software Engineering (FSE 2025). https://arxiv.org/abs/2403.16218. Reference implementation: https://github.com/plasma-umass/coverup