Deep Dive · Testing & Quality
Testing in the Age of GenAI: Why the Pyramid Still Matters (But the Timeline Changed)
A veteran technology executive breaks down how generative AI is reshaping software testing — not by replacing best practices, but by making them faster, smarter, and more enforceable than ever.
By Jon Sinclair · Founder, ClinRS AI · Former VP of Engineering, Advarra
The question keeps coming up
After publishing my piece on building ClinRS with generative AI, I heard similar questions from a few different people in my network: “What about testing? Where does it fit with the changing model?”
It's a fair question — and an important one. When you're moving fast with GenAI tools, testing is the first thing people assume you're skipping. The assumption is that speed and quality are in tension. In my experience, that's exactly backwards. You cannot blindly accept that code written by LLMs is done well and free of bugs. GenAI doesn't give you an excuse to skip testing. It gives you the tools to do it better and more consistently than ever before.
Here's how I think about it.
Start with the plan — not the code
Before any of this matters, you need a solid plan. I'll be covering this in depth in a future article, but the short version is this: I don't turn agents loose on a vague description. Assisted by GenAI tools, I build out architecture documents, design specs, and granular Jira stories, all reviewed for security and threat modeling, before a single line of code gets written. The quality of what comes out of the AI is directly proportional to the quality of what goes in.
That foundation is what makes everything else in this article work. Keep that in mind as we walk through the execution phase.
The testing pyramid still applies
Let's be clear: the fundamentals haven't changed. The testing pyramid — lots of unit tests at the base, integration tests in the middle, UI tests at the top — is still the right model. GenAI didn't invalidate 30 years of software engineering wisdom. What it changed is the speed at which you can build and maintain tests at each level, and the timing of when certain types of tests make sense to invest in.
Unit tests: your foundation, built in from day one
Unit tests are non-negotiable. They're built into the implementation process — not added afterward. When I'm working with an agent to implement a feature, one of the standing rules in my workflow is that tests get written alongside the code, not as an afterthought.
GenAI is genuinely good at generating unit tests. But here's the catch: it can also generate tests that look like they're testing something without actually doing so. Trivial assertions. Tests that pass by design rather than by verification. Coverage that checks a box without checking the logic.
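To make the difference concrete, here is a minimal sketch in Python's built-in unittest. The function `apply_discount` and both test classes are hypothetical, invented purely for illustration. The first class shows the kind of test that looks like coverage but verifies nothing; the second pins down actual behavior.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class VacuousTests(unittest.TestCase):
    """Tests that pass regardless of what value apply_discount returns."""

    def test_runs(self):
        apply_discount(100.0, 10)
        self.assertTrue(True)  # asserts nothing about the result

    def test_returns_something(self):
        # Any number at all satisfies this, including a wrong one.
        self.assertIsNotNone(apply_discount(100.0, 10))


class MeaningfulTests(unittest.TestCase):
    """Tests that fail if the logic is wrong."""

    def test_ten_percent_off(self):
        self.assertEqual(apply_discount(100.0, 10), 90.0)

    def test_zero_and_full_discount(self):
        self.assertEqual(apply_discount(80.0, 0), 80.0)
        self.assertEqual(apply_discount(80.0, 100), 0.0)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)
```

Both classes pass today. Only the second one would start failing if someone changed the arithmetic, which is the whole point of having the test.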
This is where the agent test review comes in — more on that shortly.
Integration tests and mock systems: where GenAI really shines
This is the area I'm most excited about, and the one that should change how engineering teams handle integration testing.
Integration testing has always required access to the systems you're integrating with — or a high-fidelity mock of those systems. Building and maintaining those mocks used to be expensive. It took real engineering time to create a mock backend that accurately simulated an external system's behavior, and even more time to keep it up to date as that system evolved.
For systems with a published API specification, GenAI has largely solved this problem.
If a third-party system has a public-facing API — whether documented via Swagger, OpenAPI, or any other public specification — tools like Cursor or Claude can access that documentation and generate a fully functional mock backend in a fraction of the time it used to take. The mock persists data, responds correctly to API calls, and provides a realistic integration testing harness without needing access to the actual system.
The backend of the mock doesn't need to be sophisticated. It doesn't need to replicate every business rule of the source system. It just needs to respond with the right shape of data at the right endpoints. That's something GenAI handles extremely well, especially when the API specification is clearly defined.
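As a rough illustration of how small such a mock can be, here is a sketch of an in-memory backend. Everything in it is an assumption made for the example: the `/studies` resource, its fields, and the `MockStudyApi` class do not correspond to any real system. It persists data for the life of the process and returns the right shape at the right endpoints, and nothing more. In practice you would put a thin HTTP layer in front of `handle`; that wiring is left out to keep the sketch short.

```python
import itertools
import json
from typing import Any, Dict, Optional, Tuple


class MockStudyApi:
    """In-memory stand-in for a hypothetical third-party /studies API."""

    def __init__(self) -> None:
        self._studies: Dict[int, Dict[str, Any]] = {}
        self._ids = itertools.count(1)  # sequential fake IDs: 1, 2, 3, ...

    def handle(
        self, method: str, path: str, body: Optional[str] = None
    ) -> Tuple[int, Dict[str, Any]]:
        """Route a request and return (status_code, json_payload)."""
        parts = [p for p in path.split("/") if p]
        if parts[:1] != ["studies"] or len(parts) > 2:
            return 404, {"error": "not found"}

        if method == "POST" and len(parts) == 1:
            data = json.loads(body or "{}")
            if "title" not in data:
                return 400, {"error": "title is required"}
            study = {"id": next(self._ids), "title": data["title"], "status": "draft"}
            self._studies[study["id"]] = study
            return 201, study

        if method == "GET" and len(parts) == 1:
            return 200, {"items": list(self._studies.values())}

        if method == "GET" and len(parts) == 2:
            study = self._studies.get(int(parts[1])) if parts[1].isdigit() else None
            return (200, study) if study else (404, {"error": "not found"})

        return 405, {"error": "method not allowed"}


# Usage: create a record, then read it back, the way an integration test would.
api = MockStudyApi()
status, created = api.handle("POST", "/studies", json.dumps({"title": "Fake trial"}))
status, fetched = api.handle("GET", f"/studies/{created['id']}")
```

Keeping the routing in a plain method means the same mock can back fast in-process tests and, with a small server wrapper, tests that go over real HTTP.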
Honestly, this is a great, relatively low-risk place to start for engineers who want to build something useful with generative AI. Mock systems use fake data (no PHI or confidentiality exposure risk), and if you miss something in a mock, you will hopefully catch it in end-to-end (e2e) testing prior to launch. Just make sure you feed what you find in e2e testing back into the mock so it is more accurate next time.
The result: higher-fidelity integration tests, maintained more easily, at a fraction of the cost. This is a genuinely transformative capability for teams building integrations, which in healthcare and life sciences is almost everyone.
UI tests: wait until you're stable
This is the most important timing decision in the whole pyramid, and one that trips up a lot of teams adopting GenAI development.
When you're iterating quickly on a UI — which you will be, because that's one of the things GenAI makes extremely fast — your interface is a moving target. Writing comprehensive UI tests against a UI that's going to change significantly next week is expensive and demoralizing. You'll spend more time fixing tests than building features.
The right approach: hold off on heavy UI test investment until you're genuinely happy with the interface. Once the UX stabilizes, that's when you invest in UI automation. At that point, those tests are valuable — they protect something you want to protect, and they won't need constant revision.
This isn't a reason to skip UI tests. It's a reason to sequence them correctly.
Compliance and accessibility: bake it in, don't bolt it on
There's been significant attention in the academic medical center and healthcare community around WCAG accessibility compliance. The rules are in place. Enforcement is coming. And for patient-facing platforms, accessibility isn't optional — it's a requirement.
There are a number of solid libraries out there to help teams keep their applications compliant. One that I came across recently is axe-core, the open-source accessibility testing engine from Deque. I implemented axe-core on the ClinRS corporate site to evaluate how well it would work as part of an automated compliance strategy. The short verdict: it's genuinely useful, and I'd recommend it, but go in with realistic expectations. The tool covers a meaningful portion of WCAG criteria automatically, and having it run in your CI/CD pipeline means you catch a real class of problems on every commit rather than discovering them in an audit.
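One way to turn a scan into a pipeline gate is sketched below. It assumes an earlier CI step has already run axe-core against the page and saved the results object as JSON; that object carries a `violations` list whose entries have an `id`, an `impact` level, and the affected `nodes`. The function name, the default threshold, and the sample data are assumptions for illustration, not a description of the ClinRS setup.

```python
from typing import Any, Dict, List

# Impact levels axe-core assigns to violations, least to most severe.
IMPACT_ORDER = ["minor", "moderate", "serious", "critical"]


def blocking_violations(results: Dict[str, Any], fail_at: str = "serious") -> List[str]:
    """Return one summary line per violation at or above the `fail_at` impact."""
    threshold = IMPACT_ORDER.index(fail_at)
    lines = []
    for violation in results.get("violations", []):
        impact = violation.get("impact")
        if impact not in IMPACT_ORDER:
            impact = "minor"  # assumption: treat a missing impact as the lowest level
        if IMPACT_ORDER.index(impact) >= threshold:
            node_count = len(violation.get("nodes", []))
            lines.append(f"{violation['id']} ({impact}): {node_count} element(s)")
    return lines


# Hand-written sample in the shape of an axe results object (not real scan output).
sample = {
    "violations": [
        {"id": "color-contrast", "impact": "serious", "nodes": [{}, {}]},
        {"id": "region", "impact": "moderate", "nodes": [{}]},
    ]
}
problems = blocking_violations(sample)
# problems == ["color-contrast (serious): 2 element(s)"]
```

A CI step could exit non-zero whenever the returned list is non-empty. Where to set the threshold is a judgment call: starting at "serious" keeps the gate from blocking every commit while the backlog of lower-impact findings is worked down.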
What it doesn't do is cover everything. Automated tools are only as good as what they're designed to check — and experienced engineers know there's often a gap between what's technically flagged and what's actually the right thing to do for your users. Knowing that the automated tools have a ceiling, I had GenAI generate a manual review checklist to supplement them — a structured set of checks covering the scenarios the automated tools can't evaluate. That checklist is meant to be run after the automated pass to fill in the gaps.
To give a tangible example, I'll point to the CSS media feature prefers-reduced-motion. It wasn't flagged by axe, and correctly so: honoring it isn't a requirement under WCAG AA, which is the target for Section 508 compliance. But it appeared on the manual checklist as a best-practice consideration for users who have a reduced-motion preference enabled on their device or in their browser. Recognizing it as a simple, meaningful improvement for my site, I had GenAI implement the CSS changes necessary to honor that preference and reduce unnecessary motion on the site. It was a quick, targeted change that took minutes and made the experience meaningfully better for users who would never have complained but who benefit from the consideration.
Automated suites enforce the baseline continuously. A GenAI-generated manual checklist covers what automation can't. And human judgment — informed by domain knowledge and genuine care for the end user — decides what's worth acting on. Together, they're far better than the alternative, which for most teams is a compliance gap that only shows up when someone complains or an auditor arrives.
GenAI makes all three layers practical. Setting up the automated suites, generating the manual checklist, and implementing the fixes — none of it requires the kind of time investment that used to make thorough accessibility work a luxury.
Agent-driven code and test review: a second pair of eyes, always
This is the piece that goes beyond the standard testing conversation — and it's become one of the most valuable parts of my workflow as a solo engineer.
When you're the only human on the project, there's no code review culture. No one to catch the thing you missed. No senior engineer looking over your shoulder. That's a real risk, and I take it seriously.
What I've built instead is a layered agent review process that runs before I ever take my own pass at the code. Here's the order of operations:
1. Agent implementation: The agent writes the code based on well-specified stories.
2. Agent code review (security pass): A separate agent reviews the code with fresh eyes, specifically looking for security vulnerabilities: injection risks, authentication gaps, insecure data handling, anything that would fail a security audit.
3. Agent code review (quality and standards pass): Another pass checking that the code follows the conventions established in the project's Cursor and Claude markdown files. Is this consistent with how we've done everything else? Does it follow the patterns we've set? Does it read like it belongs here?
4. Agent bug review: One more pass just looking for logical errors, edge cases, null handling, off-by-one issues. The things that don't cause test failures but show up in production at the worst possible time.
5. Unit test validation: All new tests and all existing tests must pass.
6. Agent test review: A dedicated pass reviewing the quality of the tests themselves. Not just coverage, but quality. Are these tests actually verifying the right behavior? Are they testing logic or testing noise? This is where you catch the "assertTrue(true)" problem before it makes it into your codebase.
7. Human final review: I go through everything one last time before merge. By this point, the obvious issues are already caught. My review is focused on whether the overall solution is right, not whether the individual lines are clean.
The economics of this are compelling. Each of these agent passes takes a few minutes. If any of them surface something real, that's time well spent. If they don't, you've lost a few minutes and gained confidence. Iterate on which passes yield value for your specific context — and drop the ones that don't.
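Deciding which passes to keep is easier with a record of what each one has actually found. Here is a minimal sketch of that bookkeeping, under loud assumptions: each review pass is modeled as a plain function that takes the change under review and returns a list of findings, and the two passes shown are trivial stand-ins for real agent calls. It covers only the automated review passes, not the unit test run or the human review.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A review pass takes the change under review and returns its findings.
ReviewPass = Callable[[str], List[str]]


@dataclass
class ReviewPipeline:
    passes: List[Tuple[str, ReviewPass]]  # (name, pass) in execution order
    history: Dict[str, List[int]] = field(default_factory=dict)

    def run(self, change: str) -> Dict[str, List[str]]:
        """Run every pass in order and record how many findings each produced."""
        report = {}
        for name, review in self.passes:
            findings = review(change)
            report[name] = findings
            self.history.setdefault(name, []).append(len(findings))
        return report

    def low_yield_passes(self, min_runs: int = 3) -> List[str]:
        """Passes that have run at least `min_runs` times without any finding."""
        return [
            name
            for name, counts in self.history.items()
            if len(counts) >= min_runs and sum(counts) == 0
        ]


# Hypothetical stand-ins for agent calls.
def security_pass(change: str) -> List[str]:
    return ["possible SQL injection"] if "execute(f" in change else []


def style_pass(change: str) -> List[str]:
    return []


pipeline = ReviewPipeline([("security", security_pass), ("style", style_pass)])
for change in ['cur.execute(f"SELECT {x}")', "x = 1", "y = 2"]:
    pipeline.run(change)
candidates = pipeline.low_yield_passes()  # ["style"]
```

A pass that shows up as low-yield is a candidate for a better prompt before it is a candidate for removal. An empty history can mean the code is clean, or it can mean the pass is not looking hard enough.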
Where do testers fit in?
If you've read this far and you work in testing, you might be wondering where you fit in a world where agents are writing tests, reviewing test quality, and flagging coverage gaps automatically.
The answer is: in a more important role than before.
In larger engineering organizations, a test lead isn't just someone who writes test cases. They're the person who understands the full picture of what's covered, where the test suite is brittle, and where the gaps are most likely to cause production incidents. They bring a systems-level view to quality that individual engineers — focused on their own features — can't always maintain.
That role doesn't go away with GenAI. If anything, it becomes more critical. The agents can execute the testing strategy, but someone still needs to own it. Someone needs to evaluate bugs that slip through and ask: how did our tests miss this? What does that tell us about where our coverage is weak? How do we make the agent review prompts smarter next time?
For engineers who specialize in testing, this is actually an opportunity. The repetitive, mechanical parts of the job get automated. What remains is the judgment work — understanding the system holistically, identifying the brittle seams, and continuously improving the testing infrastructure so that the agents are working from better inputs. That's a high-value, high-impact role in any engineering organization, and it's one that GenAI makes more visible, not less necessary.
The through-line
What ties all of this together is the same principle from my first article: GenAI amplifies the engineer, it doesn't replace them. A disciplined testing approach doesn't become less important when you're moving fast — it becomes more important, because you're shipping more, faster, and the surface area for problems expands.
What GenAI gives you is the ability to maintain that discipline without the overhead that used to make it painful. Mocks that used to take days to build. Accessibility tests that used to require external audits. Code reviews that used to require a second engineer. Test-quality reviews that used to require a QA lead.
All of that is now available to a team of one — or a small team that used to have to choose which corners to cut.
You don't have to cut corners anymore. That's the point.
What's next
In an upcoming article, I'll go deep on the planning and story-building process that makes all of this work — how I architect features, break them into security-reviewed stories, and use that structured input to get consistent, auditable output from the agents. If the testing process is the engine, the planning process is the fuel.
This is what ClinRS is built to do
ClinRS applies an AI-first engineering philosophy to help healthcare and life sciences companies build with high influence and high impact — without the overhead of a traditional engineering build-out. Whether you need fractional CTO leadership, a technical partner for a specific build, or a strategic sounding board for your engineering roadmap, the model is the same: senior judgment, AI-amplified velocity, domain-grounded precision.
If you're scaling a healthcare or life sciences product and you're thinking through how to build a testing strategy that holds up under compliance scrutiny — let's talk.