MoE: Mixture of Experts—a neural network architecture that activates only a subset of specialized sub-networks (experts) per input to save compute
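The routing idea behind MoE can be sketched in a few lines. This is a minimal NumPy illustration, not any particular model's implementation: the expert matrices, router weights, and dimensions are all hypothetical, and only the top-k experts selected per token are actually computed.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 8 small linear "experts", route each token to its top 2.
n_experts, d_model, top_k = 8, 16, 2
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts)) * 0.1

def moe_layer(x):
    # x: (tokens, d_model). Router scores decide which experts fire per token.
    gates = softmax(x @ router)                   # (tokens, n_experts)
    top = np.argsort(gates, axis=-1)[:, -top_k:]  # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = gates[t, top[t]]
        w = w / w.sum()                           # renormalize over chosen experts
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])  # only k of 8 experts run
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # → (4, 16)
```

The compute saving comes from the inner loop: each token touches k experts instead of all of them, while total parameter count stays large.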
CoT: Chain of Thought—a prompting or training method where the model generates intermediate reasoning steps before the final answer
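In its prompting form, CoT amounts to showing the model worked reasoning before asking a new question. A minimal sketch (the arithmetic examples are made up for illustration):

```python
# Few-shot chain-of-thought prompt: the demonstration includes intermediate
# reasoning steps before the final answer, nudging the model to do the same.
demo = (
    "Q: A pack has 12 pencils. How many pencils are in 3 packs?\n"
    "A: Each pack has 12 pencils. 3 packs have 3 * 12 = 36 pencils. "
    "The answer is 36.\n"
)
question = "Q: A box holds 8 apples. How many apples are in 5 boxes?\nA:"
prompt = demo + "\n" + question
print(prompt)
```

In the training form, the same step-by-step traces appear as target outputs rather than in-context demonstrations.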
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that scores each sampled output against the average reward of its group, removing the need for a separate critic model when guiding the model's reasoning improvements
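The group-relative part of GRPO can be shown directly: each output's advantage is its reward standardized against the other outputs sampled for the same prompt. A sketch with hypothetical rewards (the surrounding policy-gradient machinery is omitted):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Advantage of each sampled output = its reward standardized against the
    group mean and std, so no learned value/critic model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical rewards for 4 answers sampled for one prompt.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # mean-zero; above-average outputs get positive advantage
```

These advantages then weight the policy-gradient update, so outputs that beat their group average are reinforced and the rest are suppressed.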
USMLE: United States Medical Licensing Examination—a standardized three-step examination for medical licensure in the U.S.
RLHF: Reinforcement Learning from Human Feedback—a technique to align model behavior with human preferences using reward signals
SFT: Supervised Fine-Tuning—training a model on labeled datasets to learn specific task behaviors before RL
Prompt Injection: A security attack where malicious inputs manipulate the model into ignoring its original instructions or safety constraints
Hallucination: When an AI generates plausible-sounding but factually incorrect or fabricated information
MLA: Multi-head Latent Attention—an attention variant that compresses keys and values into a low-rank latent vector, shrinking the KV cache and making long-context inference cheaper
Self-reflection: The model's ability to critique and revise its own reasoning steps during the generation process