Machine Learning Validation: The Accuracy Problem
AI Foundations Series - Part 4B
February 23, 2026
Introduction
In Part 4A, we covered what machine learning (ML) is, how training data works, and why the shift from deterministic to probabilistic outputs fundamentally changes validation. We ended with a question: what does “accurate” actually mean for ML — and was it ever clearly defined for the processes it replaces?
ALCOA — Attributable, Legible, Contemporaneous, Original, Accurate — is the bedrock of data integrity in regulated environments. Four of those five principles are process controls you can enforce: you can attribute a record, make it legible, timestamp it, preserve the original. The fifth — Accurate — is different. It’s the one that assumes a relationship to truth. And it’s the one that breaks first when AI enters the picture.
This article pulls the thread on accuracy. It unravels further than you’d expect — and not just for AI.
Start with a distinction that matters more than it first appears: precision and accuracy are different problems.
Precision is definable. You can constrain it, spec it, test it, and enforce it. You can state that a measurement must be within ±0.01 and verify that it is. Precision is a boundary you draw and hold.
Accuracy is something else entirely. The word itself is deceptive — it implies a relationship to truth that may not exist at any level in the chain. You think “>15” is a simple threshold, but is the true value 14.999999? Was it rounded to 15? What was the instrument resolution? Was there parallax on an analog reading? What transcription steps and rounding decisions are embedded in that number before it ever reaches the system that evaluates it? Start pulling the thread on “accurate” and it unravels fast — from a supposedly simple statement, you end up chasing your tail.
And then we hand an ML model this entire stack of accumulated imprecision and ambiguity and ask it to… what, exactly? Be more accurate than that? Replicate it? Against what standard?
Accuracy is always context-dependent — what counts as “accurate” depends on what the system is doing and what process it’s standing in for. That’s true for rule-based systems, manual processes, and ML alike. The difference is that ML forces you to confront it. The critical step is knowing, for each model and each use case, how close to truth you can actually get. If you expect accuracy to equal truth in a case where that’s not achievable, the entire endeavor is doomed — not because the model failed, but because the expectation was wrong.
What “Accurate” Means — and Doesn’t
ML models produce outputs on a spectrum of testability, and the validation approach has to match.
Classification (potential adverse event: yes/no, image: normal/abnormal) is the most testable — you have labeled data, you can measure sensitivity and specificity against it. But even here, “accurate” means “agrees with the labels.” If those labels were based on expert decisions rather than confirmed outcomes, your accuracy metric measures agreement with human opinion, not correctness.
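For the classification case, the metrics are straightforward to compute. Here is a minimal sketch with invented labels and model outputs — and note that "accuracy" here means nothing more than agreement with those labels:

```python
# Illustrative only: sensitivity and specificity against a labeled set.
# The labels are hypothetical; in practice they come from expert review,
# so these metrics measure agreement with that review, not correctness
# in any absolute sense.

def sensitivity_specificity(labels, predictions):
    """Sensitivity (true positive rate) and specificity (true negative
    rate) for binary yes/no outputs, measured against the labels."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    tn = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # expert-assigned labels
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # model outputs

sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

If two experts would have labeled some of those ten examples differently, the computed numbers inherit that ambiguity.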
Prediction is harder. A model that says “this batch will yield 85%” can be evaluated after the fact — but the actual yield was influenced by variables the model didn’t see. Was the prediction wrong, or did conditions change? Regression toward observed outcomes is measurable but always retrospective, and attribution is ambiguous.
Scoring and ranking may be fundamentally untestable in absolute terms. A model that assigns a 73% enrollment probability to a clinical site isn’t “wrong” if the site fails to enroll — a 73% probability means 27% of the time it won’t happen. You can validate calibration across many predictions (do 73% of “73% likely” events actually occur?) but not any individual output.
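Checking calibration across many predictions amounts to binning: group outputs by predicted probability and compare each bin's mean prediction with the observed event rate. A rough sketch, with synthetic data and an arbitrary bin count:

```python
# Minimal calibration check. The data and bin count are illustrative;
# a real assessment would also consider bin sample sizes and
# confidence intervals around the observed rates.
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=10):
    """Return {bin_index: (mean_predicted, observed_rate, count)}."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # e.g. 0.73 -> bin 7
        bins[idx].append((p, y))
    table = {}
    for idx, pairs in sorted(bins.items()):
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table[idx] = (mean_p, observed, len(pairs))
    return table

# Synthetic example: one hundred 73%-probability predictions,
# of which 73 actually occurred -- well calibrated by construction.
table = calibration_table([0.73] * 100, [1] * 73 + [0] * 27)
print(table)
```

A well-calibrated bin shows an observed rate close to its mean predicted probability; any single prediction in that bin still cannot be judged right or wrong on its own.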
The validation question isn’t “is the model right?” It’s “does the model perform well enough for this intended use, what happens when it’s wrong, can we detect when it’s wrong, and can we even define ‘wrong’ for this output type?”
The Uncomfortable Foundation
If you can’t define what “accurate” means for a given output, you can’t define what an error is. If you can’t define error, you can’t define an error rate. And if you can’t define an error rate, your acceptance criteria are judgment-based rather than empirically grounded — which means your validation conclusion rests on assumptions, not measurements.
This isn’t an ML problem. It’s a validation problem that ML makes impossible to ignore. Even rule-based systems — the “easy” case — rest on unexamined assumptions about accuracy. “The rule fires when X > 5” is testable, but two things are taken on faith. First, is 5 the right threshold? You know it’s what someone wrote in the requirement. You tested that the system implements it correctly. You never tested that the requirement was correct. Second — is the value of X accurate? The value that arrived at the rule engine as 5.2 originated from an instrument with calibration tolerances, read by a person, entered into a system, possibly transcribed from another. By the time the rule evaluates it, you’ve accumulated sources of measurement error, transcription error, and rounding decisions. The rule fires correctly — it sees 5.2, threshold is 5, action triggered. The system works fine. But was the value actually 5.2?
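The rounding point can be made concrete in a few lines. The values are invented, but the mechanism is real: the rule evaluates correctly both times, and the outcome still depends on a rounding decision made before the value ever reached the rule engine:

```python
# Hypothetical illustration: the rule "fires when X > 5" is implemented
# correctly, yet whether it fires depends on an upstream rounding step.
THRESHOLD = 5

raw = 5.04                 # what the instrument actually measured
recorded = round(raw, 1)   # what was transcribed: 5.0

print(raw > THRESHOLD)       # True  -> rule fires on the raw value
print(recorded > THRESHOLD)  # False -> rule stays silent on the recorded one
```

Both evaluations are "correct." The system works fine. The discrepancy lives entirely outside the validated boundary.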
CSV is scoped to system behavior, not data truth. It always has been. Within the system boundary, validation ensures you know what happened to every data point from the time it was entered into the system to the present. That’s data integrity, and it’s real. What it doesn’t address is whether the value entered was correct in the first place. The industry confronted this with SDV — verifying that the CRF matched the source (i.e., the medical chart), but never whether the source was right. That reckoning didn’t just expose the assumption. It forced a practical response: not all data points carry equal risk, so focus verification effort on the ones that matter most. That thinking drove the shift to risk-based monitoring. ML validation requires the same approach — you can’t verify every output or every training label, so identify where the risk is highest and focus there.
ML strips away that convenience for three reasons. First, the “expected results” (training labels) are visibly subjective — someone labeled thousands of examples, and if you look, you can measure the disagreement rate between reviewers. The subjectivity is quantifiable in a way it never was for a requirements document. Second, the outputs are explicitly probabilistic — the system tells you it’s not certain, which traditional software never did. And third, ML degradation tends to be distributed and subtle, whereas rule-based systems often fail in discrete, detectable ways.
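That reviewer disagreement rate is directly computable. A minimal sketch, assuming two reviewers labeled the same binary examples (the ratings below are invented): raw disagreement, plus Cohen's kappa, which corrects observed agreement for chance:

```python
# Quantifying label subjectivity between two reviewers. Ratings are
# hypothetical: 1 = "adverse event", 0 = "not an adverse event".

def disagreement_rate(a, b):
    """Fraction of examples the two reviewers labeled differently."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # expected chance agreement from each reviewer's label frequencies
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

reviewer_1 = [1, 1, 1, 0, 0, 0, 1, 0]
reviewer_2 = [1, 1, 0, 0, 0, 1, 1, 0]
print(disagreement_rate(reviewer_1, reviewer_2))        # 0.25
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))   # 0.5
```

If the people who produced the training labels only agree with each other at kappa 0.5, that is a ceiling worth knowing before asking the model to "agree with the labels."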
The industry has been able to avoid this conversation for decades because deterministic systems let you test against a defined expected result and call it validated. ML doesn’t offer that shortcut. The question was always there. ML just won’t let you look away from it.
These aren’t AI questions. They’re validation questions — precision, accuracy, reference standards, data provenance, what “correct” actually means. They should have been standard practice for every validated system, every data integrity assessment, every acceptance criteria definition. The industry didn’t ask them because the framework didn’t force it. ML forces it. And if that makes you uncomfortable about the rule-based systems we covered in Part 3 — the ones that seemed so cut and dried — it should. The same unexamined assumptions are there. They’re just easier to not see.
The Double Standard
We accept that humans are fallible — quality systems are built around that assumption. We don’t extend the same acceptance to computers, because we programmed them. For rule-based systems, that expectation is reasonable. For ML, it’s a category error — the system is probabilistic by design, and a defined error rate isn’t a defect, it’s the operating model.
Organizations are demanding ML prove accuracy against a standard they never established for the manual process it replaces. The human doing the work was validated by pedigree — trained, qualified, following an approved SOP. Error rates were seldom formalized as validation evidence. Nobody defined what “correct” meant beyond “what the experienced person decided.” The accuracy of the existing process was assumed, not demonstrated.
Now ML arrives and suddenly you need statistical performance baselines, acceptance criteria, drift monitoring — all measured against a benchmark that doesn’t exist because the manual process was never quantified.
The double standard is real: AI is being held to a measurement standard the industry never applied to anything else. That doesn’t mean ML shouldn’t be validated rigorously. It means the rigor should be proportionate and honest about what it’s actually measuring — and what was never measured before.
The Practical Framework
None of the above means you can’t validate ML. It means you stop pretending validation proves accuracy in absolute terms — for ML or anything else — and focus on what you can actually establish.
Life sciences professionals make defensible decisions with incomplete information every day. That’s what risk assessment is. You don’t need to solve the epistemology to validate an ML system. You need to answer four practical questions:
What’s the consequence when it’s wrong? This is assessable without defining absolute accuracy. If the ML flags a case for human review and gets it wrong, the consequence is wasted reviewer time. If it triggers an automated action on a manufacturing line, the consequence could be product impact. Validation rigor follows consequence, not complexity.
Is it better than current state? “Better than what we’re doing now” is a defensible acceptance standard — and often an achievable one. Even a rough characterization of the manual process error profile gives you a baseline. Where you genuinely can’t measure current state, document the assumption in your validation rationale.
Can you detect when it degrades? Monitor performance over time rather than trying to prove accuracy at a single point in time. Trend data tells you more than a snapshot validation ever did. Define thresholds. Measure. Respond when thresholds are breached.
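One way to make "define thresholds, measure, respond" concrete is a rolling-window check against an acceptance threshold. The window size, threshold value, and class name below are illustrative choices, not recommendations:

```python
# Sketch of threshold-based performance monitoring: track agreement
# with adjudicated reference results over a rolling window and flag
# when it drops below a predefined acceptance threshold.
from collections import deque

class PerformanceMonitor:
    def __init__(self, window=100, threshold=0.90):
        self.results = deque(maxlen=window)   # True = output matched reference
        self.threshold = threshold

    def record(self, model_output, reference):
        """Record one adjudicated output; return True if the rolling
        accuracy has fallen below the threshold."""
        self.results.append(model_output == reference)
        if len(self.results) < self.results.maxlen:
            return False   # not enough data in the window to judge yet
        accuracy = sum(self.results) / len(self.results)
        return accuracy < self.threshold
```

The response to a breach — revert to manual, retrain, investigate — belongs in the procedure, not the code; the code only makes degradation visible.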
Are you controlling the variables that impact accuracy? This means managing training data quality and provenance, monitoring model performance, detecting drift, setting confidence thresholds that route uncertain outputs for review, and calibrating human oversight to actual risk rather than applying it as a blanket assumption. You can’t guarantee every output is correct. But you can demonstrate that your controls are proportionate to the risk.
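Confidence-threshold routing, one of the controls listed above, can be sketched in a few lines. The threshold and function name are hypothetical; in practice the cutoff would come out of the risk assessment, not a default:

```python
# Sketch of confidence-based routing: outputs below a confidence
# threshold go to a human review queue instead of automated handling.

def route_batch(outputs, threshold=0.85):
    """Split (prediction, confidence) pairs into auto-accept and
    human-review queues based on a confidence cutoff."""
    auto, review = [], []
    for prediction, confidence in outputs:
        (auto if confidence >= threshold else review).append(prediction)
    return auto, review

scored = [("flag", 0.93), ("flag", 0.61), ("clear", 0.88)]
auto, review = route_batch(scored)
print(auto)     # confident outputs, handled automatically
print(review)   # uncertain outputs, queued for a qualified reviewer
```

The fraction landing in the review queue is itself a useful metric: if it creeps up over time, that is an early drift signal before accuracy measurably falls.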
This is the argument for ML, not against it. The manual process had the same accuracy problems — you just couldn’t see them. These measurements were rarely part of the validation conversation. Inter-rater reliability on deviation classifications, error rates for experienced scientists making visual assessments, accuracy of clinical data entry — these were seldom quantified, and almost never formalized as acceptance criteria for the process the ML is replacing.
ML forces you to define your reference standard, measure performance against it, acknowledge the limitations of that standard, and monitor for degradation over time. The challenges aren’t new — but ML makes them visible and measurable. And visible, measurable problems are manageable problems. That’s more honest and more rigorous than “a qualified person did it and signed off.”
The life sciences industry already knows how to do this. Dual reader adjudication in solid tumor image analysis is exactly this process: two experts evaluate independently; when they disagree, an adjudicator resolves the discrepancy by selecting Reader 1’s assessment, Reader 2’s, or another of their own; and you proceed — knowing the process has limitations, having acknowledged them, having defined how you’ll handle them. You don’t wait for perfect accuracy because it doesn’t exist and you have a trial to run.
It’s like getting in your car to drive to another state knowing the oil is a little low, the brake fluid should be replaced in about 5,000 miles, the battery is halfway through its life, and the tire pressure is a bit low. You don’t refuse to drive until every system is perfect. You assess which imperfections are tolerable and which ones will kill you, and you act accordingly. Low tire pressure — monitor it. Brake fluid at 5,000 — schedule it. Brakes not functioning — you don’t drive, you take an Uber.
That’s risk-based decision making. That’s CSA. And that’s how ML validation works in practice: acknowledge the limitations, define how you’ll manage them, monitor for degradation, and proceed.
What Comes Next
Several issues surfaced in this article that aren’t specific to ML — they apply across AI categories with increasing intensity. These will get dedicated treatment in future articles:
Human-in-the-loop as a validation control. HITL is necessary but not sufficient — and it degrades with use. The SDV-to-RBM transition is the direct precedent.
Monitoring, drift, and bail conditions. When does the model stop being acceptable? What triggers a revert to manual process, a rebuild, or a retraining cycle? Who owns the decision?
Data integrity frameworks for AI. This article focused on accuracy — one of ALCOA’s five principles. But the others have their own problems with AI: legibility inverts when polished AI output masks uncertainty, attribution gets murky for model-generated content, and “original” loses meaning for outputs synthesized from patterns across millions of inputs. Each AI category breaks different ALCOA principles in different ways.
Organizational readiness. ML monitoring requires cross-functional ownership — data science, quality, process owners, IT — with roles defined before deployment. Most organizations aren’t structured for this yet.
Validation approaches across AI categories. The practical mechanics of what to validate, acceptance criteria design, testing methodologies, and digital validation platform capabilities deserve consolidated treatment rather than repetition in each category article.
Discussion
The accuracy problem isn’t going away. As AI categories advance through this series — deep learning, NLP, LLMs, generative AI — the concept of “correct output” becomes progressively harder to define. What we’ve established here for ML applies, with increasing force, to everything that follows.
Questions worth exploring:
- Have you measured the error rate of the manual process the ML is replacing? If not — how are you defining acceptance criteria?
- When your organization validates ML, is training data provenance documented as a risk factor — or treated as a technical detail?
- How do you handle the double standard conversation internally? When leadership asks “how do we know the AI is accurate?” — what’s your answer?
- Do you know where your ML training data falls on the provenance spectrum — confirmed outcomes or expert decisions? Has that distinction informed your validation approach?
I’d welcome hearing how different organizations are navigating these questions — especially anyone who’s found practical approaches that are both defensible and honest about what they’re actually measuring.
Next in this series: Category 3 — Deep Learning and Neural Networks — where ML scales into layered architectures that can process images, audio, and unstructured text. More powerful, less transparent, and harder to validate.