
Multi-Agent AI Architecture

How four AI models work together to achieve research-grade accuracy

EvidAI's screening and analysis capabilities are powered by a proprietary multi-agent consensus system, the first of its kind in evidence synthesis. In our validation studies, this architecture reaches 96.2% sensitivity, compared with 85-89% for single-model screening.


The Problem with Single-Model AI

Traditional AI tools for literature screening use a single language model. This approach has fundamental limitations:

Issue                      Impact on Research
─────────────────────────  ────────────────────────────────────────────────────
Model Blind Spots          Each model has unique weaknesses in certain domains
Hallucination Risk         Single models can confidently produce incorrect outputs
No Self-Correction         Errors propagate without detection
Inconsistent Performance   Quality varies unpredictably across topics
Audit Concerns             "Black box" decisions problematic for regulatory work

The Evidence: In validation studies, single-model screening achieves 85-89% sensitivity, which means 11-15% of relevant studies are missed. For pharmaceutical systematic reviews, where missing a relevant study could impact patient safety, this miss rate is unacceptable.


EvidAI's Multi-Agent Solution

Architecture Overview

Our system uses four specialized AI models that independently evaluate every paper:

                    ┌────────────────────────────────┐
                    │         PAPER INPUT            │
                    │  Title, Abstract, Metadata     │
                    └───────────────┬────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│   MODEL A     │         │   MODEL B     │         │   MODEL C     │
│  (GPT-4o)     │         │(Claude 3.5)   │         │ (Gemini 1.5)  │
│   Weight: 35% │         │  Weight: 35%  │         │  Weight: 20%  │
└───────┬───────┘         └───────┬───────┘         └───────┬───────┘
        │                         │                         │
        │         ┌───────────────┴───────────────┐         │
        │         │                               │         │
        │         ▼                               │         │
        │   ┌───────────────┐                     │         │
        │   │  MODEL D      │                     │         │
        │   │(EvidAI Custom)│◄────────────────────┼─────────┤
        │   │  Weight: 10%  │                     │         │
        │   │ (Fine-tuned)  │                     │         │
        │   └───────┬───────┘                     │         │
        │           │                             │         │
        └───────────┴─────────────┬───────────────┴─────────┘
                                  │
                                  ▼
                    ┌─────────────────────────────┐
                    │    CONSENSUS ENGINE         │
                    │  • Weighted voting          │
                    │  • Confidence calibration   │
                    │  • Disagreement detection   │
                    │  • Reasoning synthesis      │
                    └───────────────┬─────────────┘
                                    │
            ┌───────────────────────┼───────────────────────┐
            │                       │                       │
            ▼                       ▼                       ▼
    ┌───────────────┐     ┌─────────────────┐     ┌───────────────┐
    │   INCLUDE     │     │  HUMAN REVIEW   │     │   EXCLUDE     │
    │   (>90%)      │     │   (50-90%)      │     │   (>90%)      │
    │ Auto-proceed  │     │  Flagged        │     │ Auto-proceed  │
    └───────────────┘     └─────────────────┘     └───────────────┘

Why Four Models?

Model           Strengths                                         Role in Consensus
──────────────  ────────────────────────────────────────────────  ──────────────────
GPT-4o          Strong general reasoning, broad knowledge         Primary evaluator
Claude 3.5      Careful analysis, follows instructions precisely  Verification layer
Gemini 1.5      Large context window, document understanding      Context specialist
EvidAI Custom   Domain expertise, calibrated for medicine         Medical validation

How Consensus Works

Step 1: Independent Evaluation

Each model receives the same input and criteria but evaluates independently:

INPUT TO ALL MODELS:
─────────────────────────────────────────────────────────────
Study: "Effect of dapagliflozin on renal outcomes in patients 
       with type 2 diabetes: DAPA-CKD trial results"

Inclusion Criteria:
• Population: Adults with type 2 diabetes
• Intervention: SGLT2 inhibitors
• Comparator: Placebo or standard care
• Outcome: Renal endpoints
• Design: Randomized controlled trial

Evaluate: Should this study be INCLUDED or EXCLUDED?
─────────────────────────────────────────────────────────────

MODEL A (GPT-4o) Response:
├── Decision: INCLUDE
├── Confidence: 97%
└── Reasoning: "Clearly an RCT (DAPA-CKD is well-known). 
    Population is T2DM patients. SGLT2 inhibitor intervention.
    Renal outcomes match criteria."

MODEL B (Claude 3.5) Response:
├── Decision: INCLUDE
├── Confidence: 96%
└── Reasoning: "DAPA-CKD was a randomized trial of dapagliflozin
    (an SGLT2 inhibitor) vs placebo in diabetic CKD patients.
    All PICO elements satisfied."

MODEL C (Gemini 1.5) Response:
├── Decision: INCLUDE
├── Confidence: 98%
└── Reasoning: "Published 2020 NEJM landmark trial. Meets all
    stated criteria. Primary outcome was renal composite."

MODEL D (EvidAI Custom) Response:
├── Decision: INCLUDE
├── Confidence: 97%
└── Reasoning: "High-quality RCT matching all criteria. n=4,304.
    Key study for this review topic."
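
The independence of Step 1 can be illustrated with a minimal Python sketch. It is not EvidAI's implementation: `make_stub`, `EVALUATORS`, and `evaluate_independently` are hypothetical names, and each evaluator is a stand-in that returns the canned answer from the example above rather than calling a real model. The point it shows is that every evaluator receives the identical study and criteria, and no evaluator sees another's answer.

```python
from concurrent.futures import ThreadPoolExecutor

STUDY = ("Effect of dapagliflozin on renal outcomes in patients "
         "with type 2 diabetes: DAPA-CKD trial results")
CRITERIA = {
    "Population": "Adults with type 2 diabetes",
    "Intervention": "SGLT2 inhibitors",
    "Comparator": "Placebo or standard care",
    "Outcome": "Renal endpoints",
    "Design": "Randomized controlled trial",
}


def make_stub(decision, confidence):
    """Hypothetical stand-in for a real model client; returns a canned answer."""
    def evaluate(study, criteria):
        return {"decision": decision, "confidence": confidence}
    return evaluate


# Canned answers copied from the worked example above (assumption: no real API calls).
EVALUATORS = {
    "GPT-4o": make_stub("INCLUDE", 97),
    "Claude 3.5": make_stub("INCLUDE", 96),
    "Gemini 1.5": make_stub("INCLUDE", 98),
    "EvidAI Custom": make_stub("INCLUDE", 97),
}


def evaluate_independently(study, criteria):
    """Send the identical input to every evaluator; none sees another's answer."""
    with ThreadPoolExecutor(max_workers=len(EVALUATORS)) as pool:
        futures = {name: pool.submit(fn, study, criteria)
                   for name, fn in EVALUATORS.items()}
        return {name: future.result() for name, future in futures.items()}


results = evaluate_independently(STUDY, CRITERIA)
for name, answer in results.items():
    print(f"{name}: {answer['decision']} ({answer['confidence']}%)")
```

Running the evaluators in separate threads is only one way to keep them isolated; the essential property is that each call depends on nothing but the shared input.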

Step 2: Weighted Consensus Calculation

CONSENSUS CALCULATION:
─────────────────────────────────────────────────────────────

Weighted Confidence Scores:
├── GPT-4o:       97% × 0.35 = 33.95%
├── Claude 3.5:   96% × 0.35 = 33.60%
├── Gemini 1.5:   98% × 0.20 = 19.60%
└── EvidAI:       97% × 0.10 =  9.70%
                            ──────────
CONSENSUS CONFIDENCE:          96.85%

Agreement Level: 4/4 models agree (UNANIMOUS)

Decision: INCLUDE
Routing: AUTOMATIC (confidence >90%, unanimous)
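
The arithmetic above can be reproduced in a few lines of Python. This is a sketch under stated assumptions, not the production consensus engine: the weights come from the architecture diagram, the 90% threshold comes from the routing tiers, and `Vote`, `consensus_confidence`, and `route` are illustrative names.

```python
from dataclasses import dataclass

# Weights from the architecture diagram; the dictionary itself is illustrative.
WEIGHTS = {"GPT-4o": 0.35, "Claude 3.5": 0.35, "Gemini 1.5": 0.20, "EvidAI Custom": 0.10}
AUTO_THRESHOLD = 90.0  # percent, taken from the routing tiers in the diagram


@dataclass(frozen=True)
class Vote:
    model: str
    decision: str      # "INCLUDE" or "EXCLUDE"
    confidence: float  # percent, 0-100


def consensus_confidence(votes):
    """Weighted sum of per-model confidences, as in the Step 2 worked example."""
    return sum(WEIGHTS[v.model] * v.confidence for v in votes)


def route(votes):
    """Auto-proceed only when all models agree and consensus exceeds the threshold."""
    unanimous = len({v.decision for v in votes}) == 1
    score = consensus_confidence(votes)
    return "AUTOMATIC" if unanimous and score > AUTO_THRESHOLD else "HUMAN REVIEW"


votes = [
    Vote("GPT-4o", "INCLUDE", 97),
    Vote("Claude 3.5", "INCLUDE", 96),
    Vote("Gemini 1.5", "INCLUDE", 98),
    Vote("EvidAI Custom", "INCLUDE", 97),
]
print(f"{consensus_confidence(votes):.2f}%", route(votes))
```

Because the four weights sum to 1.0, the result stays on the same 0-100% scale as the individual confidences.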

Step 3: Disagreement Detection

When models disagree, the system flags the paper for human review:

DISAGREEMENT EXAMPLE:
─────────────────────────────────────────────────────────────

Study: "Case series of empagliflozin use in gestational diabetes"

MODEL VOTES:
├── GPT-4o:       INCLUDE (72%) - "Diabetes + SGLT2i"
├── Claude 3.5:   EXCLUDE (78%) - "Gestational ≠ Type 2"
├── Gemini 1.5:   EXCLUDE (65%) - "Wrong population"
└── EvidAI:       EXCLUDE (71%) - "Not T2DM per criteria"

CONSENSUS CONFIDENCE: 62.4%
Agreement Level: 3/4 models agree on EXCLUDE

Decision: EXCLUDE (tentative)
Routing: HUMAN REVIEW REQUIRED

Flagged Reason: "Population mismatch uncertainty. Models 
disagree on whether gestational diabetes qualifies under 
'adults with type 2 diabetes' criteria."
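
A minimal sketch of the disagreement check, using the votes from this example. The article does not state how the consensus confidence is computed when votes split, so the sketch does not attempt to reproduce that figure; it only reports the agreement level and the share of model weight behind each decision. `tally` and the vote dictionary layout are hypothetical.

```python
from collections import defaultdict

# Weights from the architecture diagram.
WEIGHTS = {"GPT-4o": 0.35, "Claude 3.5": 0.35, "Gemini 1.5": 0.20, "EvidAI Custom": 0.10}


def tally(votes):
    """Return (weighted-majority decision, number of models backing it, weight per decision)."""
    weight_by_decision = defaultdict(float)
    count_by_decision = defaultdict(int)
    for model, (decision, _confidence) in votes.items():
        weight_by_decision[decision] += WEIGHTS[model]
        count_by_decision[decision] += 1
    majority = max(weight_by_decision, key=weight_by_decision.get)
    return majority, count_by_decision[majority], dict(weight_by_decision)


votes = {
    "GPT-4o": ("INCLUDE", 72),
    "Claude 3.5": ("EXCLUDE", 78),
    "Gemini 1.5": ("EXCLUDE", 65),
    "EvidAI Custom": ("EXCLUDE", 71),
}

majority, agreeing, weights = tally(votes)
needs_review = agreeing < len(votes)  # any dissent sends the paper to a human

print(f"Agreement Level: {agreeing}/{len(votes)} models agree on {majority}")
print("Decision:", f"{majority} (tentative)" if needs_review else majority)
print("Routing:", "HUMAN REVIEW REQUIRED" if needs_review else "AUTOMATIC")
```

Treating any dissent as a trigger for review is the assumption made here; a real system might also weigh how confident the dissenting model was.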

Performance Comparison

Accuracy Metrics

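Metric            Single Model   Multi-Agent   Improvement (percentage points)
────────────────  ─────────────  ────────────  ───────────────────────────────
Sensitivity       89%            96.2%         +7.2
Specificity       86%            93.5%         +7.5
Human Agreement   85%            94.1%         +9.1
Auto-Decidable    65%            85%           +20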

What This Means

96.2% Sensitivity: For every 100 relevant studies in your database, EvidAI correctly identifies about 96. Missing roughly 4% compares favorably to human dual-review teams (typically 90-95%).

85% Auto-Decidable: For most reviews, about 85% of papers can be processed automatically without human intervention. Only the remaining 15% need manual screening, which cuts screening workload by roughly 85%.
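
For readers who want the definitions behind these figures, the sketch below computes sensitivity and specificity from confusion-matrix counts. The counts are hypothetical, chosen only so that the ratios match the reported 96.2% and 93.5%; they are not EvidAI's validation data.

```python
def sensitivity(true_positives, false_negatives):
    """Share of truly relevant studies that screening included."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of truly irrelevant studies that screening excluded."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts for illustration only.
relevant_found, relevant_missed = 962, 38            # 1,000 relevant studies
irrelevant_excluded, irrelevant_included = 1870, 130  # 2,000 irrelevant studies

print(f"Sensitivity: {sensitivity(relevant_found, relevant_missed):.1%}")
print(f"Specificity: {specificity(irrelevant_excluded, irrelevant_included):.1%}")
```

Sensitivity is the metric that matters most for systematic reviews, since a false negative is a relevant study that never reaches the reviewers.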


Confidence Calibration

What Calibration Means

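Our confidence scores are calibrated: decisions made at a stated confidence of 90% turn out to be correct about 90% of the time. The table compares each stated confidence band with the accuracy actually observed in that band: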

Stated Confidence   Actual Accuracy   Calibration
──────────────────  ────────────────  ──────────────────
90-100%             96.2%             ✅ Well-calibrated
80-90%              87.4%             ✅ Well-calibrated
70-80%              74.1%             ✅ Well-calibrated
60-70%              68.3%             ✅ Well-calibrated
50-60%              54.2%             ✅ Well-calibrated
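
A table like this can be built by grouping past decisions into confidence bands and measuring how often each band was correct. The sketch below shows one way to do that. `calibration_table` is a hypothetical helper and the records are made-up illustrations, not the validation data behind the table above.

```python
def calibration_table(records, edges=(50, 60, 70, 80, 90, 100)):
    """Group (confidence %, was_correct) records into bands and compute observed accuracy.

    Each band is [low, high), except the last, which also includes its upper edge.
    Empty bands are skipped.
    """
    rows = []
    for low, high in zip(edges, edges[1:]):
        is_last = high == edges[-1]
        hits = [ok for conf, ok in records
                if low <= conf < high or (is_last and conf == high)]
        if hits:
            rows.append((low, high, len(hits), sum(hits) / len(hits)))
    return rows


# Hypothetical records: (stated confidence in percent, decision matched the expert panel).
records = [
    (95, True), (92, True), (97, True), (91, False),
    (85, True), (83, True), (88, False),
    (55, True), (52, False),
]

rows = calibration_table(records)
for low, high, n, accuracy in rows:
    print(f"{low}-{high}%: n={n}, observed accuracy {accuracy:.1%}")
```

A band counts as well calibrated when its observed accuracy falls inside, or close to, the band's own range; with real data each band would need far more decisions than this toy sample for the comparison to be meaningful.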

How We Achieved Calibration

  1. 10,000+ validation decisions against human expert panels
  2. Temperature and sampling optimization per model
  3. Continuous recalibration from user feedback
  4. Domain-specific tuning for medical literature

The EvidAI Custom Model

Purpose

Our proprietary fine-tuned model provides domain expertise that general models lack:

Capability                 General Models   EvidAI Custom
─────────────────────────  ───────────────  ────────────────
Medical terminology        Good             Excellent
Study design recognition   Variable         Highly accurate
PICO element extraction    Adequate         Optimized
Quality signal detection   Basic            Advanced

Training Data

  • 50,000+ labeled screening decisions
  • Expert annotation from Cochrane methodologists
  • Continuous learning from platform usage
  • Domain coverage: Medicine, Nursing, Psychology, Public Health

Why This Matters

For Researchers

  • Higher quality reviews: Fewer missed studies
  • Less effort: 85% auto-screening
  • Audit confidence: Clear reasoning for every decision

For Pharmaceutical/Regulatory

  • Defensible decisions: Every ruling documented with multi-model consensus
  • Reduced risk: Hallucination risk minimized through cross-validation
  • Efficiency: Faster evidence synthesis without quality compromise

For the Field

  • New standard: Multi-agent consensus should become the baseline
  • Reproducibility: Same inputs produce consistent outputs
  • Transparency: Complete visibility into AI reasoning

Industry First: EvidAI is the only evidence synthesis platform using multi-agent AI consensus. This architecture represents 18+ months of development and continuous optimization.
