# Test-smell taxonomy
A curated, polyglot catalog of test-suite failure modes that are seeable, fixable, and language-independent — the subset of test-suite problems where an LLM's semantic judgment beats a linter's syntactic one.
The taxonomy is foundational. It names smells, describes how to see them, and prescribes how to fix them. It does not describe any particular tool.
## The catalog
Ordered roughly by how much semantic reasoning each entry demands: lower-numbered entries need judgment a linter cannot supply; higher-numbered entries lean on mechanical signals.
| # | Slug | Core move | Severity | Detection scope |
|---|------|-----------|----------|-----------------|
| 1 | `deliverable-fossils` | rename + regroup per product behavior | High | per-test, cross-suite |
| 2 | `semantic-redundancy` | cluster, pick canonical, fold/delete the rest | High | cross-suite |
| 3 | `wrong-level` | relocate to correct pyramid tier | Medium | cross-suite |
| 4 | `naming-lies` | rename test or strengthen body to match the claim | Medium | per-test |
| 5 | `vacuous-assertion` | strengthen the oracle | High | per-test |
| 6 | `pseudo-tested` | add assertion that kills the no-op mutant | High | per-test |
| 7 | `tautology-theatre` | delete or rewrite to exercise the real SUT | Critical | per-test |
| 8 | `over-specified-mock` | relax to behavior-relevant interaction only | High | per-test |
| 9 | `implementation-coupled` | drive through the public API instead | High | per-test |
| 10 | `presentation-coupled` | parse, then assert semantics, not formatting | Medium | per-test |
| 11 | `conditional-logic` | split or pin the precondition | Medium | per-test |
| 12 | `shared-state` | move setup to per-test factory / restore globals | Medium | per-file |
| 13 | `mystery-guest` | inline a 1–3 line summary of the relevant fixture shape | Low | per-test |
| 14 | `rotten-green` | delete the empty/dead scaffold or add the missing assertion | Low | per-test |
| 15 | `monolithic-test-file` | split file by behavior domain | Medium | per-file |
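To make the per-test entries concrete, here is a minimal Python sketch of smell #5, `vacuous-assertion`, and its canonical fix. `normalize_slug` is a hypothetical SUT invented for illustration, not part of the catalog:

```python
# Hypothetical SUT for illustration: a slug normalizer.
def normalize_slug(raw):
    """Lowercase, trim, and hyphenate a raw title into a slug."""
    return "-".join(raw.strip().lower().split())

# Smell #5, vacuous-assertion: the test runs the SUT, but its oracle
# cannot fail for any non-crashing implementation.
def test_normalize_slug_vacuous():
    result = normalize_slug("  Hello World  ")
    assert result is not None  # true for any returned value

# Canonical fix: strengthen the oracle to pin the actual behavior.
def test_normalize_slug_strengthened():
    assert normalize_slug("  Hello World  ") == "hello-world"

test_normalize_slug_vacuous()
test_normalize_slug_strengthened()
```

The vacuous variant survives every mutant of `normalize_slug` that still returns something; the strengthened oracle is what gives the test regression-detection power.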
Severity is a relative-harm/safety hint: how bad the smell is for the suite, weighted by how safe the canonical fix is. Critical smells can usually be deleted outright because they were killing no mutants. Lower severities need transforms and correspondingly more reviewer attention. No severity is a mandate to act; it is input to prioritization.
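A sketch of why the Critical entry is usually a safe delete, using smell #7, `tautology-theatre`. `apply_discount` is a hypothetical SUT for illustration; the smelly test compares the SUT's output to itself, so it passes for every implementation:

```python
# Hypothetical SUT for illustration.
def apply_discount(price, rate):
    return price * (1 - rate)

# Smell #7, tautology-theatre: the expected value is derived from the SUT
# itself, so the assertion holds for every implementation and kills no
# mutants. The canonical fix is deletion.
def test_apply_discount_tautological():
    expected = apply_discount(100.0, 0.5)
    assert apply_discount(100.0, 0.5) == expected

# Or, if the behavior is worth a test, rewrite it to exercise the real SUT
# against an independently known value.
def test_apply_discount_real():
    assert apply_discount(100.0, 0.5) == 50.0

test_apply_discount_tautological()
test_apply_discount_real()
```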
## Non-goals
These are covered by existing tooling. Where a linter, mutation tool, or codemod runner already does the work deterministically, this taxonomy defers.
- Syntactic smell counts (TsDetect-style scoreboards). The EMSE 2023 follow-up study[^1] found classical smell counts uncorrelated with maintenance pain, and that machine-generated tests actually score better on smell detectors while being semantically worse. Optimizing for smell counts is an explicit anti-goal.
- Net-new test generation (handled by tools like CoverUp[^2]).
- Framework migrations (handled by jest-codemods, OpenRewrite, unittest2pytest, and similar).
- Flaky detection (handled by DeFlaker, pytest-rerunfailures, test-retry plugins). This catalog names flakiness root causes when they surface as `shared-state` or `conditional-logic`; detection itself is out of scope.
## Governor rules
Every prescribed fix in the catalog is bounded by the governor rules in principles: knowledge-DRY not syntactic-DRY, no extract-for-testability, no speculative code, commit-before-refactor, and by the broader principle that a refactor must preserve regression-detection power.
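As one illustration of a bounded transform that preserves regression-detection power, here is a sketch of the canonical fix for smell #8, `over-specified-mock`, using Python's `unittest.mock`. `notify` and its `mailer` collaborator are hypothetical, invented for this example:

```python
from unittest import mock

# Hypothetical SUT for illustration: notify sends a templated message.
def notify(mailer, user):
    mailer.send(to=user, subject="Welcome", body=f"Hi {user}!")

# Smell #8, over-specified-mock: the test pins the exact call shape,
# including incidental template wording, so any cosmetic copy change
# breaks it without any behavior having regressed.
def test_notify_overspecified():
    mailer = mock.Mock()
    notify(mailer, "ada")
    mailer.send.assert_called_once_with(
        to="ada", subject="Welcome", body="Hi ada!")

# Canonical fix: relax to the behavior-relevant interaction only —
# exactly one message went out, addressed to the right user.
def test_notify_relaxed():
    mailer = mock.Mock()
    notify(mailer, "ada")
    assert mailer.send.call_count == 1
    assert mailer.send.call_args.kwargs["to"] == "ada"

test_notify_overspecified()
test_notify_relaxed()
```

The relaxed test still fails if the notification is dropped, duplicated, or misaddressed, which is the regression signal the original was actually buying.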
[^1]: Panichella, A. et al. (2023). Test Smells 20 Years Later: a Large-Scale Study. *Empirical Software Engineering*, 28(4). https://link.springer.com/article/10.1007/s10664-022-10207-5. Concludes that classical smell catalogs correlate poorly with real maintenance pain, and warns specifically against optimizing smell counts as a KPI.

[^2]: Pizzorno, J. & Berger, E. (2025). CoverUp: Effective High-Coverage Test Generation for Python. PACM SE 2025. https://arxiv.org/abs/2403.16218. Reference implementation: https://github.com/plasma-umass/coverup.