Multi-Agent AI Architecture
EvidAI's screening and analysis capabilities are powered by a proprietary multi-agent consensus system, the first of its kind in evidence synthesis. In validation, this architecture reached 96.2% sensitivity, compared with 89% for a single model.
The Problem with Single-Model AI
Traditional AI tools for literature screening use a single language model. This approach has fundamental limitations:
| Issue | Impact on Research |
|---|---|
| Model Blind Spots | Each model has unique weaknesses in certain domains |
| Hallucination Risk | Single models can confidently produce incorrect outputs |
| No Self-Correction | Errors propagate without detection |
| Inconsistent Performance | Quality varies unpredictably across topics |
| Audit Concerns | "Black box" decisions problematic for regulatory work |
The Evidence: In validation studies, single-model screening achieves 85-89% sensitivity, which means 11-15% of relevant studies are missed. For pharmaceutical systematic reviews, where missing a relevant study could impact patient safety, this miss rate is unacceptable.
EvidAI's Multi-Agent Solution
Architecture Overview
Our system uses four specialized AI models that independently evaluate every paper:
                   ┌────────────────────────────────┐
                   │          PAPER INPUT           │
                   │   Title, Abstract, Metadata    │
                   └───────────────┬────────────────┘
                                   │
        ┌─────────────────┬────────┴────────┬─────────────────┐
        ▼                 ▼                 ▼                 ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    MODEL A    │ │    MODEL B    │ │    MODEL C    │ │    MODEL D    │
│   (GPT-4o)    │ │ (Claude 3.5)  │ │ (Gemini 1.5)  │ │(EvidAI Custom)│
│  Weight: 35%  │ │  Weight: 35%  │ │  Weight: 20%  │ │  Weight: 10%  │
│               │ │               │ │               │ │ (Fine-tuned)  │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘
        └─────────────────┴────────┬────────┴─────────────────┘
                                   ▼
                   ┌─────────────────────────────┐
                   │      CONSENSUS ENGINE       │
                   │  • Weighted voting          │
                   │  • Confidence calibration   │
                   │  • Disagreement detection   │
                   │  • Reasoning synthesis      │
                   └───────────────┬─────────────┘
           ┌───────────────────────┼───────────────────────┐
           ▼                       ▼                       ▼
   ┌───────────────┐      ┌─────────────────┐      ┌───────────────┐
   │    INCLUDE    │      │  HUMAN REVIEW   │      │    EXCLUDE    │
   │    (>90%)     │      │    (50-90%)     │      │    (>90%)     │
   │ Auto-proceed  │      │     Flagged     │      │ Auto-proceed  │
   └───────────────┘      └─────────────────┘      └───────────────┘
Why Four Models?
| Model | Strengths | Role in Consensus |
|---|---|---|
| GPT-4o | Strong general reasoning, broad knowledge | Primary evaluator |
| Claude 3.5 | Careful analysis, follows instructions precisely | Verification layer |
| Gemini 1.5 | Large context window, document understanding | Context specialist |
| EvidAI Custom | Domain expertise, calibrated for medicine | Medical validation |
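The ensemble described above can be written down as a small configuration. The following is a minimal sketch, not EvidAI's actual code: the `ModelSpec` class, its field names, and `validate_weights` are hypothetical, and only the model names, roles, and weights come from the diagram and table.

```python
from dataclasses import dataclass
from math import isclose


@dataclass(frozen=True)
class ModelSpec:
    """One voter in the ensemble. Field names are illustrative assumptions."""
    name: str
    role: str
    weight: float  # share of the weighted vote, as shown in the diagram


ENSEMBLE = [
    ModelSpec("GPT-4o", "Primary evaluator", 0.35),
    ModelSpec("Claude 3.5", "Verification layer", 0.35),
    ModelSpec("Gemini 1.5", "Context specialist", 0.20),
    ModelSpec("EvidAI Custom", "Medical validation", 0.10),
]


def validate_weights(models: list[ModelSpec]) -> None:
    """Raise if the vote weights do not sum to 1 (within float tolerance)."""
    total = sum(m.weight for m in models)
    if not isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")


validate_weights(ENSEMBLE)  # passes: 0.35 + 0.35 + 0.20 + 0.10 = 1.0
```

Checking that the weights sum to 1 keeps the consensus score on the same 0-100% scale as the individual model confidences.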
How Consensus Works
Step 1: Independent Evaluation
Each model receives the same input and criteria but evaluates independently:
INPUT TO ALL MODELS:
─────────────────────────────────────────────────────────────
Study: "Effect of dapagliflozin on renal outcomes in patients
with type 2 diabetes: DAPA-CKD trial results"
Inclusion Criteria:
• Population: Adults with type 2 diabetes
• Intervention: SGLT2 inhibitors
• Comparator: Placebo or standard care
• Outcome: Renal endpoints
• Design: Randomized controlled trial
Evaluate: Should this study be INCLUDED or EXCLUDED?
─────────────────────────────────────────────────────────────
MODEL A (GPT-4o) Response:
├── Decision: INCLUDE
├── Confidence: 97%
└── Reasoning: "Clearly an RCT (DAPA-CKD is well-known).
Population is T2DM patients. SGLT2 inhibitor intervention.
Renal outcomes match criteria."
MODEL B (Claude 3.5) Response:
├── Decision: INCLUDE
├── Confidence: 96%
└── Reasoning: "DAPA-CKD was a randomized trial of dapagliflozin
(an SGLT2 inhibitor) vs placebo in diabetic CKD patients.
All PICO elements satisfied."
MODEL C (Gemini 1.5) Response:
├── Decision: INCLUDE
├── Confidence: 98%
└── Reasoning: "Published 2020 NEJM landmark trial. Meets all
stated criteria. Primary outcome was renal composite."
MODEL D (EvidAI Custom) Response:
├── Decision: INCLUDE
├── Confidence: 97%
└── Reasoning: "High-quality RCT matching all criteria. n=4,304.
Key study for this review topic."
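One way to picture the independent evaluation step is a fan-out in which every evaluator gets the same study and criteria and none sees another's answer. This is a sketch under stated assumptions: the `Verdict` type, the `Evaluator` signature, and the stub evaluators are hypothetical stand-ins for real model calls, with the canned values taken from the example above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    model: str
    decision: str      # "INCLUDE" or "EXCLUDE"
    confidence: float  # 0.0 to 1.0
    reasoning: str


# Hypothetical evaluator signature: (study text, criteria text) -> Verdict.
Evaluator = Callable[[str, str], Verdict]


def make_stub(model: str, decision: str, confidence: float) -> Evaluator:
    """Return a canned evaluator standing in for a real model call."""
    def evaluate(study: str, criteria: str) -> Verdict:
        return Verdict(model, decision, confidence, "stubbed response")
    return evaluate


def evaluate_independently(
    study: str, criteria: str, evaluators: list[Evaluator]
) -> list[Verdict]:
    """Run every evaluator on the same input; none sees another's output."""
    with ThreadPoolExecutor(max_workers=len(evaluators)) as pool:
        futures = [pool.submit(ev, study, criteria) for ev in evaluators]
        return [f.result() for f in futures]  # same order as `evaluators`


evaluators = [
    make_stub("GPT-4o", "INCLUDE", 0.97),
    make_stub("Claude 3.5", "INCLUDE", 0.96),
    make_stub("Gemini 1.5", "INCLUDE", 0.98),
    make_stub("EvidAI Custom", "INCLUDE", 0.97),
]
verdicts = evaluate_independently(
    "Effect of dapagliflozin on renal outcomes in patients with type 2 diabetes",
    "Adults with T2DM; SGLT2 inhibitors; placebo or standard care; renal endpoints; RCT",
    evaluators,
)
for v in verdicts:
    print(f"{v.model}: {v.decision} ({v.confidence:.0%})")
```

Collecting results only after all calls have been submitted is what keeps the evaluations independent in this sketch.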
Step 2: Weighted Consensus Calculation
CONSENSUS CALCULATION:
─────────────────────────────────────────────────────────────
Weighted Confidence Scores:
├── GPT-4o: 97% × 0.35 = 33.95%
├── Claude 3.5: 96% × 0.35 = 33.60%
├── Gemini 1.5: 98% × 0.20 = 19.60%
└── EvidAI: 97% × 0.10 = 9.70%
──────────
CONSENSUS CONFIDENCE: 96.85%
Agreement Level: 4/4 models agree (UNANIMOUS)
Decision: INCLUDE
Routing: AUTOMATIC (confidence >90%, unanimous)
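The arithmetic above is a plain weighted sum, which can be reproduced in a few lines. This is an illustrative sketch; the `weighted_confidence` function name is an assumption, while the weights and confidences are the ones shown in the calculation.

```python
# Weights as shown in the architecture diagram.
WEIGHTS = {"GPT-4o": 0.35, "Claude 3.5": 0.35, "Gemini 1.5": 0.20, "EvidAI": 0.10}


def weighted_confidence(confidences: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-model confidences (percent in, percent out)."""
    return sum(weights[model] * conf for model, conf in confidences.items())


unanimous_include = {"GPT-4o": 97, "Claude 3.5": 96, "Gemini 1.5": 98, "EvidAI": 97}
score = weighted_confidence(unanimous_include, WEIGHTS)
print(f"Consensus confidence: {score:.2f}%")  # Consensus confidence: 96.85%
```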
Step 3: Disagreement Detection
When models disagree, the system flags the paper for human review:
DISAGREEMENT EXAMPLE:
─────────────────────────────────────────────────────────────
Study: "Case series of empagliflozin use in gestational diabetes"
MODEL VOTES:
├── GPT-4o: INCLUDE (72%) - "Diabetes + SGLT2i"
├── Claude 3.5: EXCLUDE (78%) - "Gestational ≠ Type 2"
├── Gemini 1.5: EXCLUDE (65%) - "Wrong population"
└── EvidAI: EXCLUDE (71%) - "Not T2DM per criteria"
CONSENSUS CONFIDENCE: 62.4%
Agreement Level: 3/4 models agree on EXCLUDE
Decision: EXCLUDE (tentative)
Routing: HUMAN REVIEW REQUIRED
Flagged Reason: "Population mismatch uncertainty. Models
disagree on whether gestational diabetes qualifies under
'adults with type 2 diabetes' criteria."
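The routing logic in this example can be sketched as follows. Several things here are assumptions rather than documented behavior: the tentative decision is taken to be the side with the larger summed weight, anything that is not both unanimous and above 90% is sent to human review, and the consensus confidence is passed in as a given number because the text does not state how a dissenting vote is folded into that figure. The function names are hypothetical.

```python
from collections import defaultdict

# Weights from the architecture diagram; the threshold is the >90% auto-proceed band.
WEIGHTS = {"GPT-4o": 0.35, "Claude 3.5": 0.35, "Gemini 1.5": 0.20, "EvidAI": 0.10}
AUTO_THRESHOLD = 90.0  # percent


def tally(votes: dict[str, str]) -> tuple[str, int]:
    """Return the weighted-majority decision and how many models voted for it."""
    weight_by_decision: dict[str, float] = defaultdict(float)
    for model, decision in votes.items():
        weight_by_decision[decision] += WEIGHTS[model]
    winner = max(weight_by_decision, key=lambda d: weight_by_decision[d])
    agreeing = sum(1 for d in votes.values() if d == winner)
    return winner, agreeing


def route(votes: dict[str, str], consensus_confidence: float) -> tuple[str, str]:
    """Assumed rule: auto-proceed only when unanimous AND above the threshold."""
    winner, agreeing = tally(votes)
    unanimous = agreeing == len(votes)
    if unanimous and consensus_confidence > AUTO_THRESHOLD:
        return winner, "AUTOMATIC"
    return winner, "HUMAN REVIEW REQUIRED"


split = {"GPT-4o": "INCLUDE", "Claude 3.5": "EXCLUDE",
         "Gemini 1.5": "EXCLUDE", "EvidAI": "EXCLUDE"}
print(route(split, 62.4))  # ('EXCLUDE', 'HUMAN REVIEW REQUIRED')

agreed = dict.fromkeys(WEIGHTS, "INCLUDE")
print(route(agreed, 96.85))  # ('INCLUDE', 'AUTOMATIC')
```

Requiring both unanimity and high confidence is the conservative choice: a single dissenting model is enough to put a paper in front of a reviewer.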
Performance Comparison
Accuracy Metrics
| Metric | Single Model | Multi-Agent | Improvement (percentage points) |
|---|---|---|---|
| Sensitivity | 89% | 96.2% | +7.2 |
| Specificity | 86% | 93.5% | +7.5 |
| Human Agreement | 85% | 94.1% | +9.1 |
| Auto-Decidable | 65% | 85% | +20 |
What This Means
96.2% Sensitivity: For every 100 relevant studies in your database, EvidAI correctly identifies about 96 of them. Missing roughly 4% compares favorably to human dual-review teams (typically 90-95% sensitivity).
85% Auto-Decidable: For most reviews, 85% of papers can be processed automatically without human intervention, leaving about 15% for manual screening.
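For readers who want the definitions behind these figures, sensitivity and specificity come straight from a confusion matrix. The counts below are invented purely so the arithmetic lands on the headline percentages; they are not EvidAI validation data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly relevant studies that were marked INCLUDE."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of truly irrelevant studies that were marked EXCLUDE."""
    return tn / (tn + fp)


# Hypothetical counts for illustration only.
tp, fn = 481, 19     # 500 relevant studies, 19 of them missed
tn, fp = 1870, 130   # 2,000 irrelevant studies, 130 wrongly included

print(f"Sensitivity: {sensitivity(tp, fn):.1%}")  # Sensitivity: 96.2%
print(f"Specificity: {specificity(tn, fp):.1%}")  # Specificity: 93.5%
```

Sensitivity is the number that matters most for screening, because a false negative is a relevant study that never reaches the review.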
Confidence Calibration
What Calibration Means
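Our confidence scores are calibrated: decisions made at a stated confidence of around 90% turn out to be correct about 90% of the time. In the table below, the observed accuracy for each confidence band falls inside that band: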
| Stated Confidence | Actual Accuracy | Calibration |
|---|---|---|
| 90-100% | 96.2% | ✅ Well-calibrated |
| 80-90% | 87.4% | ✅ Well-calibrated |
| 70-80% | 74.1% | ✅ Well-calibrated |
| 60-70% | 68.3% | ✅ Well-calibrated |
| 50-60% | 54.2% | ✅ Well-calibrated |
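A table like the one above can be produced by bucketing past decisions by stated confidence and measuring accuracy within each bucket. The sketch below assumes a record format of (stated confidence in percent, whether the decision matched the expert label); the function name, the record format, and the toy records are all hypothetical.

```python
from collections import defaultdict


def calibration_table(records, bin_width=10):
    """Map each confidence band (low, high) to observed accuracy in percent.

    records: iterable of (stated confidence in percent, was_correct) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for confidence, correct in records:
        # Clamp 100% into the top band so 90-100 is a single bucket.
        low = min(int(confidence // bin_width) * bin_width, 100 - bin_width)
        totals[low] += 1
        hits[low] += int(correct)
    return {
        (low, low + bin_width): 100.0 * hits[low] / totals[low]
        for low in sorted(totals)
    }


# Toy records, far too few for a real calibration check.
toy = [(95, True), (92, True), (98, True), (91, False), (100, True),
       (55, True), (58, False)]
for (low, high), accuracy in calibration_table(toy).items():
    print(f"{low}-{high}%: {accuracy:.1f}% accurate")
```

With this toy data the 90-100% band comes out at 80.0% accuracy, which would count as overconfident; a well-calibrated system shows accuracy inside or above each band.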
How We Achieved Calibration
- 10,000+ validation decisions against human expert panels
- Temperature and sampling optimization per model
- Continuous recalibration from user feedback
- Domain-specific tuning for medical literature
The EvidAI Custom Model
Purpose
Our proprietary fine-tuned model provides domain expertise that general models lack:
| Capability | General Models | EvidAI Custom |
|---|---|---|
| Medical terminology | Good | Excellent |
| Study design recognition | Variable | Highly accurate |
| PICO element extraction | Adequate | Optimized |
| Quality signal detection | Basic | Advanced |
Training Data
- 50,000+ labeled screening decisions
- Expert annotation from Cochrane methodologists
- Continuous learning from platform usage
- Domain coverage: Medicine, Nursing, Psychology, Public Health
Why This Matters
For Researchers
- Higher quality reviews: Fewer missed studies
- Less effort: 85% auto-screening
- Audit confidence: Clear reasoning for every decision
For Pharmaceutical/Regulatory
- Defensible decisions: Every ruling documented with multi-model consensus
- Reduced risk: Hallucination risk minimized through cross-validation
- Efficiency: Faster evidence synthesis without quality compromise
For the Field
- New standard: Multi-agent consensus should become the baseline
- Reproducibility: Same inputs produce consistent outputs
- Transparency: Complete visibility into AI reasoning
Industry First: EvidAI is the only evidence synthesis platform using multi-agent AI consensus. This architecture represents 18+ months of development and continuous optimization.