Test-smell taxonomy

A curated, polyglot catalog of test-suite failure modes that are detectable, fixable, and language-independent: the subset of test-suite problems where an LLM's semantic judgment beats a linter's syntactic one.

The taxonomy is foundational. It names smells, describes how to see them, and prescribes how to fix them. It does not describe any particular tool.

The catalog

Ordered roughly by how much semantic reasoning each entry demands: lower-numbered entries need reasoning a linter cannot do; higher-numbered entries lean more on mechanical signals.

| #  | Slug                   | Core move                                                   | Severity | Detection scope       |
|----|------------------------|-------------------------------------------------------------|----------|-----------------------|
| 1  | deliverable-fossils    | rename + regroup per product behavior                       | High     | per-test, cross-suite |
| 2  | semantic-redundancy    | cluster, pick canonical, fold/delete the rest               | High     | cross-suite           |
| 3  | wrong-level            | relocate to correct pyramid tier                            | Medium   | cross-suite           |
| 4  | naming-lies            | rename test or strengthen body to match the claim           | Medium   | per-test              |
| 5  | vacuous-assertion      | strengthen the oracle                                       | High     | per-test              |
| 6  | pseudo-tested          | add assertion that kills the no-op mutant                   | High     | per-test              |
| 7  | tautology-theatre      | delete or rewrite to exercise real SUT                      | Critical | per-test              |
| 8  | over-specified-mock    | relax to behavior-relevant interaction only                 | High     | per-test              |
| 9  | implementation-coupled | drive through public API instead                            | High     | per-test              |
| 10 | presentation-coupled   | parse then assert semantics, not formatting                 | Medium   | per-test              |
| 11 | conditional-logic      | split or pin the precondition                               | Medium   | per-test              |
| 12 | shared-state           | move setup to per-test factory / restore globals            | Medium   | per-file              |
| 13 | mystery-guest          | inline a 1–3 line summary of relevant fixture shape         | Low      | per-test              |
| 14 | rotten-green           | delete the empty/dead scaffold or add the missing assertion | Low      | per-test              |
| 15 | monolithic-test-file   | split file by behavior domain                               | Medium   | per-file              |

Severity is a relative-harm/safety hint: how bad the smell is for the suite, weighted by how safe the canonical fix is. Critical smells can usually be deleted outright because they were killing no mutants. Lower severities need transforms and correspondingly more reviewer attention. No severity is a mandate to act; it is input to prioritization.
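To make the severity logic concrete, here is a minimal pytest sketch of two entries from opposite ends of the scale. Every name in it (`Order`, `parse_order`, the mocked client) is a hypothetical illustration, not part of the taxonomy:

```python
from dataclasses import dataclass
from unittest.mock import Mock
import json

# Hypothetical SUT, inlined so the sketch is self-contained.
@dataclass
class Order:
    id: int
    qty: int

def parse_order(raw: str) -> Order:
    data = json.loads(raw)
    return Order(id=data["id"], qty=data["qty"])

# 5. vacuous-assertion (High): the SUT runs, but the oracle
# accepts almost any outcome, so most mutants survive.
def test_parse_order_vacuous():
    assert parse_order('{"id": 7, "qty": 2}') is not None

# Canonical fix: strengthen the oracle to pin real behavior.
def test_parse_order_extracts_fields():
    assert parse_order('{"id": 7, "qty": 2}') == Order(id=7, qty=2)

# 7. tautology-theatre (Critical): the test round-trips its own
# mock and never touches the SUT, so it kills no mutants at all.
# That is why outright deletion is the safe canonical fix.
def test_fetch_order_tautology():
    client = Mock()
    client.fetch.return_value = {"id": 7}
    assert client.fetch("7") == {"id": 7}
```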

Non-goals

These are covered by existing tooling. Where a linter, mutation tool, or codemod runner already does the work deterministically, this taxonomy defers.

  • Syntactic smell counts (TsDetect-style scoreboards). The EMSE 2022 follow-up study [1] found classical smell counts uncorrelated with maintenance pain, and that machine-generated tests actually score better on smell detectors while being semantically worse. Optimizing for smell counts is an explicit anti-goal.
  • Net-new test generation (handled by tools like CoverUp [2]).
  • Framework migrations (handled by jest-codemods, OpenRewrite, unittest2pytest, and similar).
  • Flaky detection (handled by DeFlaker, pytest-rerunfailures, test-retry plugins). This catalog names flakiness root causes when they surface as shared-state or conditional-logic; detection itself is out of scope.

Governor rules

Every prescribed fix in the catalog is bounded by the governor rules in principles (knowledge-DRY not syntactic-DRY, no extract-for-testability, no speculative code, commit-before-refactor) and by the broader principle that a refactor must preserve regression-detection power.
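As one illustration of knowledge-DRY versus syntactic-DRY applied to a semantic-redundancy fold, a minimal pytest sketch (the shipping-fee rules and all names are hypothetical):

```python
import pytest

# Hypothetical SUT: two distinct shipping-fee rules.
def shipping_fee(subtotal: float) -> float:
    if subtotal >= 50.0:
        return 0.0   # rule: free shipping at the threshold
    return 4.99      # rule: flat fee below the threshold

# Syntactic-DRY sees near-duplicate lines and wants one test;
# knowledge-DRY keeps both oracles, because each pins a distinct
# rule whose regression a mutant could otherwise slip past.
def test_free_shipping_at_threshold():
    assert shipping_fee(50.0) == 0.0

def test_flat_fee_below_threshold():
    assert shipping_fee(49.99) == 4.99

# A knowledge-preserving fold is allowed: parametrization removes
# the textual duplication while keeping every oracle intact.
@pytest.mark.parametrize(("subtotal", "fee"), [(50.0, 0.0), (49.99, 4.99)])
def test_shipping_fee_rules(subtotal, fee):
    assert shipping_fee(subtotal) == fee
```

The parametrized form is the only fold the governor rules permit here: it removes duplication without dropping an oracle, so regression-detection power is preserved.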


  1. Panichella, A. et al. (2022). Test Smells 20 Years Later: Detectability, Validity, and Reliability. Empirical Software Engineering, 27, Article 170. https://link.springer.com/article/10.1007/s10664-022-10207-5. Concludes that classical smell catalogs correlate poorly with real maintenance pain, and warns specifically against optimizing smell counts as a KPI.

  2. Altmayer Pizzorno, J. & Berger, E. (2025). CoverUp: Effective High-Coverage Test Generation for Python. Proceedings of the ACM on Software Engineering (FSE 2025). https://arxiv.org/abs/2403.16218. Reference implementation: https://github.com/plasma-umass/coverup