Note: This article examines the emerging threat of AI model extraction and distillation attacks based on Anthropic’s public disclosure of state-sponsored distillation campaigns, recent academic research (arXiv:2506.22521), and Google’s Threat Intelligence Group analysis. All code examples are for educational and defensive purposes.
In March 2025, Anthropic dropped a bombshell: three separate state-linked campaigns had extracted over 16 million conversations from Claude using approximately 24,000 fraudulent accounts. DeepSeek siphoned 150,000 exchanges. Moonshot AI took 3.4 million. MiniMax walked away with 13 million — and when Anthropic released a new model, MiniMax pivoted their entire extraction pipeline within 24 hours.
This wasn’t traditional hacking. Nobody breached a server. Nobody stole a weights file. The attackers simply asked questions — millions of them — and used the answers to train their own models.
This is model distillation, and it represents one of the most significant intellectual property threats in the AI industry. Your model isn’t stolen by copying it. It’s stolen by talking to it until a cheaper copy learns everything it knows.
For organizations building on our Cybersecurity AI Framework, model extraction should be treated as a first-class threat — not an academic curiosity. And for anyone deploying LLM-powered applications, understanding how distillation attacks work is no longer optional.
What Is Model Extraction?
Model extraction is the process of recreating a machine learning model’s behavior — its decision boundaries, knowledge, and capabilities — by querying it and analyzing its outputs. Think of it as reverse-engineering a proprietary product by interacting with its API.
The Core Concept
Every machine learning model is, at its core, a function: it takes an input and produces an output. If you can collect enough input-output pairs, you can train a new model that approximates the same function. The new model won’t be identical — it won’t have the same internal weights — but it can be functionally equivalent for practical purposes.
Consider a sentiment analysis model hosted behind an API. An attacker sends thousands of carefully crafted reviews and records whether the model classifies each one as positive, negative, or neutral. Over time, the attacker builds a dataset that maps inputs to outputs. They then train their own model on this dataset. The result: a model that performs sentiment analysis almost as well as the original, but at zero development cost.
Why It Matters Now
Model extraction isn’t new. Researchers have demonstrated it against traditional ML models since the early 2010s. But three things changed in 2024-2026:
- Models are vastly more expensive to train. GPT-4-class models cost $100M+ to develop. A distillation attack costs fractions of a percent of that.
- API access is universal. Nearly every frontier model offers API access, making systematic querying trivial.
- The output quality is high enough to be useful. When Claude or GPT-4 generates a detailed, knowledgeable response, that response contains the knowledge the attacker wants to copy. The outputs aren’t just labels — they’re training data.
How Extraction Differs from Other Attacks
Model extraction is distinct from other AI security threats:
- Prompt injection manipulates a model’s behavior during a single interaction. Extraction aims to recreate the model’s capabilities permanently. For a deep dive on prompt injection, see our 2026 guide.
- Data poisoning corrupts a model’s training data before or during training. Extraction happens after deployment.
- Model inversion reconstructs training data from model outputs. Extraction reconstructs the model itself.
- Adversarial attacks craft inputs to cause misclassification. Extraction is about faithful reproduction, not subversion.
The Anthropic Revelations: A Watershed Moment
In early 2025, Anthropic published findings that confirmed what security researchers had long suspected: nation-state actors were systematically distilling their models at industrial scale. The disclosure detailed three separate campaigns.
Campaign 1: DeepSeek — 150,000 Exchanges
DeepSeek, the Chinese AI company that made headlines with its R1 reasoning model, ran the most targeted of the three campaigns. They used approximately 1,500 fraudulent accounts to conduct roughly 150,000 query exchanges with Claude. The campaign was characterized by focused, high-quality queries aimed at specific capability areas — reasoning chains, mathematical problem-solving, and code generation.
While smaller in volume than the other campaigns, the DeepSeek extraction was notable for its precision. Rather than broad-spectrum querying, they concentrated on areas where Claude’s outputs would be most valuable for training a competitive model.
Campaign 2: Moonshot AI — 3.4 Million Exchanges
Moonshot AI, backed by Alibaba, scaled their operation significantly. Using roughly 3,000 fraudulent accounts, they conducted approximately 3.4 million query exchanges. The campaign showed sophisticated operational patterns including:
- Automated account creation using temporary email services
- Distributed querying across multiple IP ranges to avoid rate limits
- Content structured to maximize training value per query
- Regular rotation of query patterns to avoid detection signatures
Campaign 3: MiniMax — 13 Million Exchanges (The Whopper)
MiniMax’s campaign was the largest by far. Approximately 19,500 fraudulent accounts generated around 13 million query exchanges. But the most alarming detail wasn’t the volume — it was the responsiveness.
When Anthropic released a new model version, MiniMax’s extraction infrastructure pivoted within 24 hours to target the updated model. This demonstrated that the distillation operation wasn’t a one-time research effort — it was a persistent, adaptive pipeline designed to continuously extract value from Anthropic’s latest capabilities.
The “Hydra Cluster” Architecture
Anthropic’s analysis revealed that the attackers used what security researchers call hydra clusters — networks of accounts that mix legitimate and malicious traffic. By routing distillation queries through the same infrastructure as normal usage, the attackers made it extremely difficult to distinguish extraction attempts from genuine user activity.
Key hallmarks identified by Anthropic’s security team:
- Massive query volume concentrated in narrow topic areas. While a normal user might ask about dozens of subjects, distillation accounts focused intensely on specific domains — coding, mathematics, scientific reasoning.
- Highly repetitive structural patterns. The queries followed templated formats designed to elicit the most informative responses possible. Variations were superficial; the underlying intent was uniform.
- Content mapping to training value. Queries were explicitly designed to produce outputs that would be most valuable as training data — complete answers, chain-of-thought reasoning, diverse perspectives on the same topic.
The National Security Angle
Anthropic highlighted a critical secondary risk: distilled models lose safety guardrails. When a model is distilled rather than fine-tuned, the safety training — refusal behaviors, content policies, harm reduction — doesn’t transfer cleanly. The distilled model gains capabilities but loses constraints.
Google’s Threat Intelligence Group (GTIG) independently confirmed this trend, noting that distillation attacks are rising as a primary method of AI intellectual property theft and that threat actors are increasingly using AI-augmented operations to conduct these campaigns at scale.
Attack Taxonomy
The academic literature categorizes model extraction attacks into several distinct types. A comprehensive survey by Kaixiang Zhao et al. (arXiv:2506.22521, KDD 2025) titled “A Survey on Model Extraction Attacks and Defenses for Large Language Models” provides the most thorough taxonomy to date.
1. Functionality Extraction
Goal: Create a model that mimics the target’s input-output behavior.
This is the most straightforward form of extraction. The attacker queries the target model with a diverse set of inputs, records the outputs, and trains a surrogate model on this dataset. The surrogate won’t have the same internal structure, but it will produce similar outputs for similar inputs.
Attack flow:
1. Generate query set Q = {q₁, q₂, ..., qₙ}
2. For each qᵢ ∈ Q, send to target model M_target
3. Record outputs O = {M_target(q₁), M_target(q₂), ..., M_target(qₙ)}
4. Train surrogate model M_surrogate on dataset D = {(qᵢ, oᵢ)}
5. M_surrogate ≈ M_target (functionally)
For LLMs, functionality extraction targets the model’s general capabilities — its ability to answer questions, generate code, reason through problems, and produce coherent text across domains.
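The five-step flow above can be shown end to end with a toy stand-in for the target API. Everything here is illustrative: `teacher` is a hypothetical keyword-based sentiment endpoint, and the surrogate is a naive per-word vote counter, but the mechanism (query, record, retrain) is the same one used at scale:

```python
from collections import Counter, defaultdict

def teacher(text: str) -> str:
    """Stand-in for the target model's API (hypothetical)."""
    positives = {"great", "love", "excellent"}
    negatives = {"bad", "awful", "hate"}
    words = set(text.lower().split())
    if words & positives:
        return "positive"
    if words & negatives:
        return "negative"
    return "neutral"

# Steps 1-3: query the target and record (input, output) pairs
queries = [
    "great product love it", "awful service", "it arrived on time",
    "excellent build quality", "I hate the interface", "bad packaging",
]
dataset = [(q, teacher(q)) for q in queries]

# Step 4: train a surrogate -- here, trivially, per-word label counts
word_label_counts: dict[str, Counter] = defaultdict(Counter)
for text, label in dataset:
    for word in text.lower().split():
        word_label_counts[word][label] += 1

def surrogate(text: str) -> str:
    """Predict by summing per-word label votes learned from the teacher."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(word_label_counts[word])
    return votes.most_common(1)[0][0] if votes else "neutral"

# Step 5: the surrogate now mimics the teacher on similar inputs,
# without ever seeing the teacher's internals
assert surrogate("love the excellent screen") == teacher("love the excellent screen")
```

The surrogate never sees the teacher's rules or weights; it learns them entirely from recorded question-answer pairs, which is the whole point of the attack.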
2. Training Data Extraction
Goal: Reconstruct the original training data used to build the target model.
This is a more invasive form of extraction. Instead of copying the model’s behavior, the attacker tries to recover the actual data the model was trained on. This is particularly dangerous when models are trained on proprietary or sensitive datasets.
Carlini et al. (2023) demonstrated that large language models can be prompted to reproduce verbatim passages from their training data, including copyrighted text, personal information, and confidential documents.
For LLMs, training data extraction often involves:
- Prompting the model to complete specific passages
- Using membership inference to determine if specific data was in the training set
- Exploiting the model’s memorization of rare or unique sequences
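A minimal sketch of the first technique: probe for verbatim memorization by feeding the model a prefix and checking whether its completion reproduces a known-sensitive string. The `completion_model` stub, the canary string, and its memorized secret are all hypothetical placeholders for a real completion endpoint:

```python
def completion_model(prefix: str) -> str:
    """Stand-in for an LLM completion endpoint (hypothetical).
    Imagine it has memorized one rare training sequence."""
    memorized = "API_KEY=sk-test-0000-EXAMPLE"
    if memorized.startswith(prefix):
        return memorized[len(prefix):]
    return " [generic continuation]"

def probe_memorization(prefixes: list[str], canary: str) -> list[str]:
    """Return prefixes whose completion reproduces the canary verbatim --
    the signature of training-data memorization."""
    hits = []
    for p in prefixes:
        if p + completion_model(p) == canary:
            hits.append(p)
    return hits

hits = probe_memorization(
    ["API_KEY=", "The weather is"],
    "API_KEY=sk-test-0000-EXAMPLE",
)
# A verbatim reproduction from a short prefix indicates memorization
```

Defenders can run the same probe proactively: seed the training set with known canaries, then audit whether the deployed model regurgitates them.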
3. Prompt-Targeted Extraction
Goal: Extract the model’s behavior for specific prompt types or use cases.
This is the most relevant attack for commercial AI providers. Instead of trying to copy the entire model, the attacker focuses on extracting capabilities in a specific domain — say, medical diagnosis or legal analysis — that would be expensive to develop independently.
Prompt-targeted extraction is also the hardest to detect because the query patterns can closely resemble legitimate usage. A company testing a competitor’s medical AI by asking it diagnostic questions looks, from the outside, like a user with a legitimate need.
Extraction Attack Types and Defenses
| Attack Type | Primary Goal | Query Volume | Detection Difficulty | Key Defense |
|---|---|---|---|---|
| Functionality Extraction | Replicate full model behavior | High (millions) | Moderate | Rate limiting, output perturbation |
| Training Data Extraction | Recover original training data | Low-Medium | High | Differential privacy, memorization auditing |
| Prompt-Targeted Extraction | Copy specific capabilities | Low | Very High | Behavioral analysis, domain monitoring |
| Active Extraction | Adaptive query selection | Medium | Moderate | Query pattern analysis, anomaly detection |
| Passive Extraction | Exploit public outputs/logs | Zero (no direct querying) | Extreme | Watermarking, output tracking |
| Side-Channel Extraction | Infer model info from metadata | Low | High | Response time randomization, metadata scrubbing |
Distillation Techniques: How Attackers Actually Do It
Knowledge distillation is a legitimate machine learning technique introduced by Geoffrey Hinton and colleagues in 2015. A smaller “student” model is trained to mimic a larger “teacher” model by learning from the teacher’s outputs — particularly the soft probability distribution over all possible answers, which conveys not just what the teacher thinks the answer is, but how confident it is in every alternative.
When this technique is applied without the teacher model owner’s consent, it becomes illicit distillation — and it’s devastatingly effective.
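The soft-target objective at the heart of distillation fits in a few lines. This sketch computes the KL divergence between temperature-softened teacher and student distributions, the core term of the Hinton-style loss; the logit values are purely illustrative:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-softened softmax: T > 1 flattens the distribution,
    exposing the teacher's relative confidence across wrong answers."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 4.0) -> float:
    """KL(teacher || student) on temperature-softened distributions --
    the soft-target term of the distillation objective."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.0, 0.2]      # illustrative values
aligned_student = [3.9, 1.1, 0.1]     # nearly matches the teacher
misaligned_student = [0.2, 1.0, 4.0]  # reversed preferences

# The loss rewards matching the teacher's full distribution, not just its argmax
assert distillation_loss(teacher_logits, aligned_student) < \
       distillation_loss(teacher_logits, misaligned_student)
```

This is why chain-of-thought outputs and full-text responses are so valuable to attackers: even without logit access, rich outputs approximate the soft-target signal the loss needs.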
The Distillation Pipeline
┌─────────────────────────────────────────────────────────┐
│ ATTACK INFRASTRUCTURE │
│ │
│ ┌──────────┐ ┌───────────┐ ┌───────────────────┐ │
│ │ Account │──▶│ Query │──▶│ Target Model API │ │
│ │ Farm │ │ Generator │ │ (Claude, GPT-4…) │ │
│ │ (24K+) │ │ │ │ │ │
│ └──────────┘ └───────────┘ └────────┬──────────┘ │
│ │ │
│ ┌──────────┐ ┌───────────┐ ┌──────▼──────────┐ │
│ │ Student │◀──│ Training │◀──│ Response │ │
│ │ Model │ │ Pipeline │ │ Collector │ │
│ │ (Clone) │ │ │ │ │ │
│ └──────────┘ └───────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Query Generation Strategies
Effective distillation requires high-quality query sets. Attackers employ several strategies:
Curriculum-based querying: Start with simple questions and progressively increase difficulty, mirroring how the original model was trained. This ensures the student model learns foundational concepts before advanced ones.
Diversity-maximizing sampling: Use techniques like coresets or farthest-point sampling to select queries that cover the maximum possible area of the input space. This prevents the student model from having blind spots.
Adversarial querying: Identify areas where the student model performs poorly compared to the teacher, then generate additional queries targeting those weak spots. This iterative approach rapidly closes the performance gap.
Chain-of-thought elicitation: Specifically prompt the teacher model to show its reasoning process. The chain-of-thought outputs are far more valuable for distillation than simple answers because they contain intermediate reasoning steps that transfer more effectively.
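Diversity-maximizing sampling can be sketched with greedy farthest-point selection. This toy version substitutes word-set Jaccard distance for real embedding distance, and the candidate pool is hypothetical, but the selection logic is the standard greedy algorithm:

```python
def jaccard_distance(a: str, b: str) -> float:
    """Distance between two queries via word-set overlap (toy embedding)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)

def farthest_point_sample(candidates: list[str], k: int) -> list[str]:
    """Greedy farthest-point sampling: each pick maximizes the minimum
    distance to the queries already selected, spreading coverage."""
    selected = [candidates[0]]
    while len(selected) < k:
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
    return selected

pool = [
    "implement quicksort in python",
    "implement mergesort in python",   # near-duplicate of the first
    "prove the pythagorean theorem",
    "write a haiku about rain",
]
picks = farthest_point_sample(pool, 3)
# The near-duplicate query is skipped in favor of more distant ones
```

Note how the near-duplicate mergesort query loses out: with a fixed query budget, the attacker gets more training value from covering new regions of the input space than from redundant probes.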
Code Example: Distillation Detection via Query Pattern Analysis
The following Python code demonstrates a basic system for detecting potential distillation attacks by analyzing query patterns for the hallmarks Anthropic identified:
"""
Distillation Attack Detection System
Analyzes query patterns to identify potential model extraction/distillation attacks.
Based on hallmarks identified by Anthropic: volume concentration, structural repetition,
and training-value mapping.
"""
import re
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
@dataclass
class UserSession:
"""Tracks a single user's query patterns over time."""
user_id: str
queries: list[dict] = field(default_factory=list)
topic_distribution: Counter = field(default_factory=Counter)
structural_templates: list[str] = field(default_factory=list)
first_seen: Optional[datetime] = None
last_seen: Optional[datetime] = None
total_tokens_used: int = 0
def add_query(self, query_text: str, response_tokens: int, timestamp: datetime):
"""Record a new query and extract analysis features."""
if self.first_seen is None:
self.first_seen = timestamp
self.last_seen = timestamp
self.queries.append({
"text": query_text,
"tokens": response_tokens,
"timestamp": timestamp
})
self.total_tokens_used += response_tokens
# Extract topic using simple keyword clustering
topic = self._extract_topic(query_text)
self.topic_distribution[topic] += 1
# Extract structural template (replace specific content with placeholders)
template = self._extract_template(query_text)
self.structural_templates.append(template)
def _extract_topic(self, text: str) -> str:
"""Simple topic extraction via keyword categories."""
topic_keywords = {
"coding": ["code", "function", "python", "javascript", "algorithm",
"debug", "implement", "class", "method", "variable"],
"math": ["calculate", "equation", "prove", "theorem", "integral",
"derivative", "matrix", "probability", "statistics"],
"reasoning": ["explain", "why", "how does", "what if", "analyze",
"compare", "evaluate", "reasoning", "step by step"],
"creative": ["write", "story", "poem", "creative", "imagine",
"narrative", "character", "dialogue"],
"factual": ["what is", "who was", "when did", "define", "history",
"science", "biology", "physics", "chemistry"],
}
text_lower = text.lower()
topic_scores = {}
for topic, keywords in topic_keywords.items():
score = sum(1 for kw in keywords if kw in text_lower)
topic_scores[topic] = score
if max(topic_scores.values()) == 0:
return "general"
return max(topic_scores, key=topic_scores.get)
def _extract_template(self, text: str) -> str:
"""Replace specific nouns, numbers, and proper nouns with placeholders."""
# Replace numbers
template = re.sub(r'\b\d+\.?\d*\b', '<NUM>', text)
# Replace quoted strings
template = re.sub(r'["\'][^"\']*["\']', '<STR>', template)
# Replace code identifiers (camelCase, snake_case)
template = re.sub(r'\b[a-z][a-zA-Z0-9]*[A-Z][a-zA-Z0-9]*\b', '<ID>', template)
template = re.sub(r'\b[a-z]+_[a-z_]+\b', '<ID>', template)
# Normalize whitespace
template = re.sub(r'\s+', ' ', template.strip())
return template
class DistillationDetector:
"""Main detection engine for identifying distillation attack patterns."""
def __init__(
self,
# Thresholds (tune based on your traffic patterns)
queries_per_hour_threshold: int = 50,
topic_concentration_threshold: float = 0.70,
template_repetition_threshold: float = 0.40,
lookback_hours: int = 24,
):
        self.sessions: dict[str, UserSession] = defaultdict(
            lambda: UserSession(user_id="")  # user_id is filled in on first query
        )
self.queries_per_hour_threshold = queries_per_hour_threshold
self.topic_concentration_threshold = topic_concentration_threshold
self.template_repetition_threshold = template_repetition_threshold
self.lookback_hours = lookback_hours
def record_query(self, user_id: str, query_text: str,
response_tokens: int) -> dict:
"""Record a query and return a risk assessment."""
session = self.sessions[user_id]
session.user_id = user_id
session.add_query(query_text, response_tokens, datetime.now())
return self._assess_risk(session)
def _assess_risk(self, session: UserSession) -> dict:
"""Evaluate a user session against distillation attack indicators."""
        if len(session.queries) < 10:
            # Too few queries to assess; keep the same result shape as below
            return {
                "user_id": session.user_id,
                "risk_level": "low",
                "risk_score": 0.0,
                "total_queries": len(session.queries),
                "queries_per_hour": 0.0,
                "indicators": [],
            }
indicators = []
risk_score = 0.0
        # Indicator 1: Query volume (Anthropic hallmark: massive volume)
hours_active = max(
(session.last_seen - session.first_seen).total_seconds() / 3600, 0.1
)
queries_per_hour = len(session.queries) / hours_active
if queries_per_hour > self.queries_per_hour_threshold:
indicators.append({
"name": "high_query_volume",
"value": queries_per_hour,
"threshold": self.queries_per_hour_threshold,
"weight": 0.25
})
risk_score += 0.25
        # Indicator 2: Topic concentration (Anthropic hallmark: narrow focus)
total_queries = sum(session.topic_distribution.values())
if total_queries > 0:
top_topic_ratio = (
session.topic_distribution.most_common(1)[0][1] / total_queries
)
if top_topic_ratio > self.topic_concentration_threshold:
indicators.append({
"name": "topic_concentration",
"value": round(top_topic_ratio, 3),
"threshold": self.topic_concentration_threshold,
"weight": 0.30
})
risk_score += 0.30
        # Indicator 3: Structural repetition (Anthropic hallmark: repetitive patterns)
if len(session.structural_templates) >= 10:
template_counts = Counter(session.structural_templates)
most_common_template_count = template_counts.most_common(1)[0][1]
template_repetition_ratio = most_common_template_count / len(
session.structural_templates
)
if template_repetition_ratio > self.template_repetition_threshold:
indicators.append({
"name": "structural_repetition",
"value": round(template_repetition_ratio, 3),
"threshold": self.template_repetition_threshold,
"weight": 0.30
})
risk_score += 0.30
# Indicator 4: High token consumption (training-value mapping)
avg_tokens = session.total_tokens_used / len(session.queries)
if avg_tokens > 2000: # Long responses = more training value
indicators.append({
"name": "high_response_utilization",
"value": round(avg_tokens, 1),
"threshold": 2000,
"weight": 0.15
})
risk_score += 0.15
risk_level = (
"critical" if risk_score >= 0.75
else "high" if risk_score >= 0.50
else "medium" if risk_score >= 0.25
else "low"
)
return {
"user_id": session.user_id,
"risk_level": risk_level,
"risk_score": round(min(risk_score, 1.0), 3),
"total_queries": len(session.queries),
"queries_per_hour": round(queries_per_hour, 1),
"indicators": indicators
}
def get_suspicious_users(self, min_risk: str = "medium") -> list[dict]:
"""Return all users meeting or exceeding the specified risk level."""
risk_order = {"low": 0, "medium": 1, "high": 2, "critical": 3}
min_score = risk_order[min_risk]
results = []
for user_id, session in self.sessions.items():
assessment = self._assess_risk(session)
if risk_order[assessment["risk_level"]] >= min_score:
results.append(assessment)
return sorted(results, key=lambda x: x["risk_score"], reverse=True)
# --- Demo ---
if __name__ == "__main__":
detector = DistillationDetector()
# Simulate a normal user
for q in [
"What is machine learning?", "Write a poem about cats",
"How do I bake bread?", "Explain quantum physics",
"What's the weather like?", "Who won the world cup?",
]:
detector.record_query("user_normal", q, response_tokens=500)
# Simulate a potential distillation attacker
distill_queries = [
"Write a Python function to implement binary search with type hints",
"Write a JavaScript function to implement merge sort with comments",
"Write a Python class for a linked list with all methods",
"Explain how quicksort works step by step with complexity analysis",
"Write a Python function for dynamic programming knapsack problem",
"Implement a binary tree traversal in Python with recursion",
"Write a JavaScript async function for API pagination handling",
"Write a Python decorator for rate limiting with thread safety",
"Implement a hash table in Python from scratch with collision handling",
"Write a Python function for graph BFS with adjacency list",
"Explain the difference between BFS and DFS with code examples",
"Write a Python function to find shortest path using Dijkstra",
"Implement a trie data structure in Python with autocomplete",
"Write a Python function for matrix multiplication with optimization",
"Implement LRU cache in Python using OrderedDict",
"Write a Python function for string matching using KMP algorithm",
"Explain how red-black trees work with Python implementation",
"Write a Python generator for prime numbers with sieve optimization",
"Implement a min-heap in Python with heapify operations",
"Write a Python function for regular expression matching",
]
    import random
    for q in distill_queries:
        # All 20 queries land within moments, so queries-per-hour spikes
        detector.record_query(
            "user_suspicious", q,
            response_tokens=random.randint(1500, 4000)
        )
print("=== Normal User Assessment ===")
normal = detector._assess_risk(detector.sessions["user_normal"])
print(f"Risk: {normal['risk_level']} (score: {normal['risk_score']})")
print(f"Queries: {normal['total_queries']}, QPH: {normal['queries_per_hour']}")
print("\n=== Suspicious User Assessment ===")
suspicious = detector._assess_risk(detector.sessions["user_suspicious"])
print(f"Risk: {suspicious['risk_level']} (score: {suspicious['risk_score']})")
print(f"Queries: {suspicious['total_queries']}, QPH: {suspicious['queries_per_hour']}")
print(f"Indicators: {[i['name'] for i in suspicious['indicators']]}")
print("\n=== All Suspicious Users ===")
for user in detector.get_suspicious_users(min_risk="medium"):
print(f" {user['user_id']}: {user['risk_level']} "
f"(score: {user['risk_score']}, queries: {user['total_queries']})")
This detection system analyzes four key indicators that align directly with the patterns Anthropic observed: query volume, topic concentration, structural repetition, and response utilization (a proxy for training value).
Detection Methods
Beyond the pattern analysis shown above, organizations deploy multiple detection layers to identify extraction attacks in progress.
Behavioral Analysis
Behavioral analysis systems build a baseline of “normal” usage patterns and flag deviations. This includes:
- Temporal patterns: When does the user query? Legitimate users tend to have natural rhythms (work hours, breaks). Extraction bots often operate continuously.
- Query diversity score: How varied are the queries across topics, formats, and complexity levels? Low diversity correlates with extraction.
- Response consumption patterns: Does the user actually read the responses, or do they just fire off the next query? Extraction pipelines typically don’t pause to “read” — they just collect.
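One concrete temporal signal is the regularity of inter-query gaps: humans are bursty, pipelines are metronomic. A minimal sketch using the coefficient of variation of the gaps (the 0.3 threshold is an illustrative assumption, not a calibrated value):

```python
import statistics

def timing_regularity_flag(timestamps: list[float],
                           cv_threshold: float = 0.3) -> bool:
    """Flag sessions whose inter-query gaps are suspiciously uniform.
    CV = stdev/mean of gaps; scripted pipelines tend toward low CV.
    The 0.3 threshold is illustrative, not calibrated."""
    if len(timestamps) < 3:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    if mean <= 0:
        return False
    cv = statistics.stdev(gaps) / mean
    return cv < cv_threshold

bot = [t * 2.0 for t in range(20)]            # one query every 2s, like clockwork
human = [0, 5, 7, 40, 41, 90, 200, 203, 360]  # bursts and long pauses
```

In practice this would feed into a composite score alongside the other indicators; attackers can jitter their timing, so regularity alone is weak evidence.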
Membership Inference Detection
Membership inference attacks determine whether a specific data point was in the model’s training set. Defenders can use this same technique in reverse: if an attacker is repeatedly testing whether specific data was in your training set, they’re likely conducting training data extraction.
The detection works by comparing the model’s confidence on known training data versus non-training data. If a user’s queries consistently target the model’s high-confidence regions (where it’s most likely memorized training data), that’s a red flag.
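A rough sketch of that reverse use: score what fraction of a user's queries land in the model's high-confidence region. The `model_confidence` stub and the memorized items are hypothetical placeholders for real per-response confidence scores:

```python
def model_confidence(query: str) -> float:
    """Stand-in for per-response confidence (e.g., mean token probability).
    Hypothetical: memorized material scores high."""
    memorized_topics = {"contract clause 7.3", "patient record 0042"}
    return 0.98 if query in memorized_topics else 0.62

def memorization_probe_ratio(queries: list[str],
                             high_conf: float = 0.95) -> float:
    """Fraction of a user's queries landing in the model's high-confidence
    (likely memorized) region -- the reverse membership-inference signal."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if model_confidence(q) >= high_conf)
    return hits / len(queries)

attacker = ["contract clause 7.3", "patient record 0042", "contract clause 7.3"]
tourist = ["what is a contract", "explain hipaa basics"]
```

A ratio near 1.0 over many queries suggests the user is systematically mapping the model's memorized regions rather than asking organic questions.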
Rate Limiting and Velocity Checks
Simple but effective. Anthropic’s own disclosures revealed that the distillation campaigns involved millions of queries from thousands of accounts. Even distributed across many accounts, the aggregate velocity is detectable:
- Per-account rate limits
- Per-IP rate limits
- Per-subnet velocity checks
- Cross-account correlation (detecting coordinated behavior across accounts that appear unrelated)
Code Example: Output Perturbation Defense
One of the most effective defenses against distillation is output perturbation — adding controlled noise to model outputs that degrades the quality of distillation data without significantly impacting legitimate users:
"""
Output Perturbation Defense for LLM APIs
Adds calibrated noise to model outputs to degrade distillation quality
while preserving utility for legitimate users.
Strategies:
1. Semantic-preserving paraphrasing (lightweight)
2. Confidence-based perturbation (adaptive)
3. Time-decaying consistency (progressive)
"""
import hashlib
import random
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class PerturbationStrategy(Enum):
NONE = "none"
LIGHT = "light" # Synonym substitution, minor reordering
MODERATE = "moderate" # Sentence reordering, clause variation
AGGRESSIVE = "aggressive" # Paraphrase, add distractors
@dataclass
class PerturbationConfig:
"""Configuration for output perturbation defense."""
strategy: PerturbationStrategy = PerturbationStrategy.LIGHT
# Probability of perturbing any given response (0.0 to 1.0)
perturb_probability: float = 0.3
# Increase perturbation for high-frequency requesters
adaptive_threshold_queries: int = 100
# Add subtle watermark that survives perturbation
embed_watermark: bool = True
# Regions of text to protect from perturbation (code blocks, data)
    protected_patterns: Optional[list[str]] = None
def __post_init__(self):
if self.protected_patterns is None:
self.protected_patterns = [
r"```[\s\S]*?```", # Code blocks
r"`[^`]+`", # Inline code
r"\b\d+\.\d+\b", # Decimal numbers
r"\bhttps?://\S+\b", # URLs
]
class OutputPerturbator:
"""Applies perturbation to LLM outputs to resist distillation."""
# Synonym maps for light perturbation
SYNONYM_MAPS = {
"important": ["crucial", "significant", "notable", "key"],
"however": ["nevertheless", "nonetheless", "although", "though"],
"therefore": ["consequently", "thus", "hence", "accordingly"],
"example": ["instance", "case", "illustration", "sample"],
"approach": ["method", "technique", "strategy", "procedure"],
"result": ["outcome", "finding", "conclusion", "effect"],
"process": ["procedure", "workflow", "methodology", "operation"],
"implement": ["build", "develop", "create", "construct"],
"efficient": ["effective", "optimized", "streamlined", "performant"],
"component": ["element", "module", "part", "section"],
}
def __init__(self, config: Optional[PerturbationConfig] = None):
self.config = config or PerturbationConfig()
self._protected_regex = re.compile(
"|".join(f"({p})" for p in self.config.protected_patterns),
re.DOTALL
)
def perturb(self, output_text: str, user_id: str,
user_query_count: int) -> dict:
"""Apply perturbation to a model output.
Returns:
dict with 'text' (perturbed output), 'was_perturbed' (bool),
and 'perturbation_details' (list of changes made).
"""
# Decide whether to perturb based on probability and user behavior
effective_prob = self._adaptive_probability(user_query_count)
if random.random() > effective_prob:
return {
"text": output_text,
"was_perturbed": False,
"perturbation_details": [],
"effective_probability": effective_prob
}
# Protect structured regions (code blocks, URLs, etc.)
protected_regions = []
working_text = self._protect_regions(output_text, protected_regions)
# Apply perturbation based on strategy
perturbed_text, details = self._apply_strategy(
working_text, self.config.strategy
)
# Restore protected regions
perturbed_text = self._restore_regions(perturbed_text, protected_regions)
# Optionally embed watermark
watermark_info = None
if self.config.embed_watermark:
perturbed_text, watermark_info = self._embed_watermark(
perturbed_text, user_id
)
return {
"text": perturbed_text,
"was_perturbed": True,
"perturbation_details": details,
"effective_probability": effective_prob,
"watermark": watermark_info
}
def _adaptive_probability(self, query_count: int) -> float:
"""Increase perturbation probability for high-frequency users."""
base_prob = self.config.perturb_probability
if query_count > self.config.adaptive_threshold_queries:
# Scale up: every 100 queries above threshold adds 10%
excess = query_count - self.config.adaptive_threshold_queries
bonus = min(excess / 1000, 0.5) # Cap at 50% bonus
return min(base_prob + bonus, 0.95)
return base_prob
def _protect_regions(self, text: str, regions: list) -> str:
"""Replace protected regions with placeholder tokens."""
def replace_match(match):
idx = len(regions)
regions.append(match.group(0))
return f"__PROTECTED_REGION_{idx}__"
return self._protected_regex.sub(replace_match, text)
def _restore_regions(self, text: str, regions: list) -> str:
"""Restore protected regions from placeholders."""
for idx, original in enumerate(regions):
text = text.replace(f"__PROTECTED_REGION_{idx}__", original)
return text
    def _apply_strategy(self, text: str, strategy: PerturbationStrategy) -> tuple:
        """Apply the configured perturbation strategy."""
        details = []
        if strategy == PerturbationStrategy.LIGHT:
            text, details = self._light_perturbation(text)
        elif strategy == PerturbationStrategy.MODERATE:
            text, details = self._moderate_perturbation(text)
        elif strategy == PerturbationStrategy.AGGRESSIVE:
            text, details = self._aggressive_perturbation(text)
        return text, details

    def _light_perturbation(self, text: str) -> tuple:
        """Subtle synonym substitution — preserves meaning, breaks exact copying."""
        details = []
        words = text.split()
        for i, word in enumerate(words):
            clean_word = re.sub(r'[^\w]', '', word).lower()
            if clean_word in self.SYNONYM_MAPS:
                synonyms = self.SYNONYM_MAPS[clean_word]
                replacement = random.choice(synonyms)
                # Preserve original casing and punctuation
                prefix = re.match(r'^(\W*)', word).group(1)
                suffix = re.search(r'(\W*)$', word).group(1)
                if word[0].isupper():
                    replacement = replacement.capitalize()
                words[i] = f"{prefix}{replacement}{suffix}"
                details.append({
                    "type": "synonym_substitution",
                    "original": clean_word,
                    "replacement": replacement
                })
        return " ".join(words), details

    def _moderate_perturbation(self, text: str) -> tuple:
        """Sentence-level reordering within paragraphs."""
        details = []
        paragraphs = text.split("\n\n")
        result_paragraphs = []
        for para in paragraphs:
            sentences = re.split(r'(?<=[.!?])\s+', para)
            if len(sentences) >= 3:
                # Swap adjacent non-first sentences
                swap_idx = random.randint(1, max(1, len(sentences) - 2))
                sentences[swap_idx], sentences[swap_idx + 1] = (
                    sentences[swap_idx + 1], sentences[swap_idx]
                )
                details.append({
                    "type": "sentence_swap",
                    "positions": [swap_idx, swap_idx + 1]
                })
            result_paragraphs.append(" ".join(sentences))
        # Also apply light perturbation
        light_text = " ".join(result_paragraphs)
        light_text, light_details = self._light_perturbation(light_text)
        details.extend(light_details)
        return light_text, details

    def _aggressive_perturbation(self, text: str) -> tuple:
        """Add plausible distractor sentences and restructure."""
        details = []
        distractors = [
            "This aspect is worth considering in broader contexts.",
            "Related research has explored similar questions from different angles.",
            "The implications extend beyond the immediate scope.",
            "Multiple perspectives inform this understanding.",
            "Further investigation may yield additional insights.",
        ]
        sentences = re.split(r'(?<=[.!?])\s+', text)
        if len(sentences) >= 4:
            # Insert a distractor after a random sentence
            insert_pos = random.randint(1, len(sentences) - 2)
            distractor = random.choice(distractors)
            sentences.insert(insert_pos, distractor)
            details.append({
                "type": "distractor_insertion",
                "position": insert_pos,
                "distractor": distractor
            })
        text = " ".join(sentences)
        text, mod_details = self._moderate_perturbation(text)
        details.extend(mod_details)
        return text, details

    def _embed_watermark(self, text: str, user_id: str) -> tuple:
        """Embed a subtle per-user watermark using whitespace variation."""
        # Deterministic but user-specific seed
        seed = int(hashlib.md5(user_id.encode()).hexdigest()[:8], 16)
        rng = random.Random(seed)
        # Insert a specific pattern of extra spaces at whitespace boundaries
        segments = re.split(r'(\s+)', text)
        watermark_positions = []
        for i in range(len(segments)):
            if rng.random() < 0.05:  # ~5% of whitespace runs get an extra space
                if segments[i].strip() == "" and i > 0:
                    segments[i] += " "
                    watermark_positions.append(i)
        watermark_text = "".join(segments)
        return watermark_text, {
            "method": "whitespace_variation",
            "positions_count": len(watermark_positions),
            "user_hash": hashlib.md5(user_id.encode()).hexdigest()[:12]
        }


# --- Demo ---
if __name__ == "__main__":
    config = PerturbationConfig(
        strategy=PerturbationStrategy.LIGHT,
        perturb_probability=0.5,
        adaptive_threshold_queries=50,
    )
    perturbator = OutputPerturbator(config)
    sample_output = (
        "It is important to understand that machine learning models process "
        "input data through multiple layers. However, the exact implementation "
        "varies based on the approach chosen. For example, neural networks "
        "use weighted connections to implement complex transformations. "
        "The result is a model that can generalize from training examples. "
        "This process requires efficient computation and careful component design."
    )
    print("=== Original Output ===")
    print(sample_output)
    print("\n=== Normal User (query_count=10) ===")
    result1 = perturbator.perturb(sample_output, "user_normal", 10)
    print(f"Perturbed: {result1['was_perturbed']}")
    print(f"Probability: {result1['effective_probability']:.2f}")
    print(result1["text"])
    print("\n=== High-Volume User (query_count=500) ===")
    result2 = perturbator.perturb(sample_output, "user_suspicious", 500)
    print(f"Perturbed: {result2['was_perturbed']}")
    print(f"Probability: {result2['effective_probability']:.2f}")
    print(f"Details: {[d['type'] for d in result2['perturbation_details']]}")
    print(result2["text"])
    print("\n=== Multiple Runs (showing non-determinism) ===")
    for i in range(3):
        result = perturbator.perturb(sample_output, "user_suspicious", 500)
        print(f"  Run {i+1}: perturbed={result['was_perturbed']}, "
              f"changes={len(result['perturbation_details'])}")
The key insight of output perturbation is asymmetry: legitimate users don’t care about the exact wording of a response — they care about the meaning. Distillation pipelines, however, are sensitive to exact wording because they’re building training datasets. Even subtle perturbations degrade the quality of distilled models while being imperceptible to human readers.
Defense Strategies
A robust defense against model extraction requires multiple layers. No single technique is sufficient — the Anthropic campaigns demonstrated that determined attackers will adapt to any single defense. Here’s a comprehensive defense-in-depth approach.
Layer 1: Access Control and Authentication
Preventive layer. Stop attacks before they start by making it expensive to operate at scale.
- Mandatory account verification: Require email, phone, or government ID verification. Anthropic’s attackers created ~24,000 fraudulent accounts — verification raises the cost per account significantly.
- Payment verification: Even a nominal charge (e.g., $1 for API access) dramatically increases the cost of large-scale campaigns.
- Progressive trust: New accounts get lower rate limits that increase over time with verified usage patterns.
- Device fingerprinting: Detect and block automated account creation tools.
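The progressive-trust idea can be sketched in a few lines of Python. The tier thresholds, caps, and field names below are illustrative assumptions, not a recommended policy:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative trust tiers: (minimum account age in days, queries-per-hour cap).
# Both the thresholds and the caps are assumptions, not a recommended policy.
TRUST_TIERS = [
    (0, 20),     # brand-new accounts get a tight cap
    (7, 100),    # a week of verified history
    (30, 500),   # a month of history
    (90, 2000),  # long-standing accounts
]

@dataclass
class Account:
    created_at: datetime
    is_verified: bool  # email/phone/payment verification passed

def rate_limit_for(account: Account, now: datetime) -> int:
    """Return the queries-per-hour cap for an account under progressive trust."""
    if not account.is_verified:
        return 0  # unverified accounts cannot query at all
    age_days = (now - account.created_at).days
    cap = 0
    for min_age, limit in TRUST_TIERS:
        if age_days >= min_age:
            cap = limit
    return cap

now = datetime(2026, 1, 1)
new_acct = Account(created_at=now - timedelta(days=2), is_verified=True)
old_acct = Account(created_at=now - timedelta(days=120), is_verified=True)
print(rate_limit_for(new_acct, now))  # 20
print(rate_limit_for(old_acct, now))  # 2000
```

The point of the design is that an attacker creating thousands of fresh accounts starts every one of them at the tightest cap, multiplying the time and cost of a campaign.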
Layer 2: Query Monitoring and Rate Limiting
Detection layer. Identify suspicious patterns in real-time.
- Per-user rate limiting: Cap queries per user, per hour, per day.
- Per-IP and per-subnet limiting: Detect distributed attacks across multiple accounts.
- Query-rate acceleration detection: Flag users whose query rate climbs over time (a sign that an extraction campaign is ramping up).
- Cross-account correlation: Use behavioral fingerprinting to identify multiple accounts operated by the same entity, even when they use different credentials and IP addresses.
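The acceleration signal can be sketched as a per-user detector that compares a recent window against the user's own baseline. The window sizes and the 3x threshold here are assumptions to tune against real traffic:

```python
from collections import deque

class VelocityDetector:
    """Flags users whose query rate accelerates relative to their own baseline."""

    def __init__(self, baseline_hours: int = 24, recent_hours: int = 3,
                 acceleration_threshold: float = 3.0):
        # Window sizes and the 3x threshold are illustrative assumptions
        self.baseline_hours = baseline_hours
        self.recent_hours = recent_hours
        self.threshold = acceleration_threshold
        self.hourly_counts: dict[str, deque] = {}

    def record_hour(self, user_id: str, query_count: int) -> None:
        """Record one hour's query count for a user."""
        buf = self.hourly_counts.setdefault(
            user_id, deque(maxlen=self.baseline_hours + self.recent_hours))
        buf.append(query_count)

    def is_accelerating(self, user_id: str) -> bool:
        """True if the recent rate is >= threshold times the user's baseline."""
        buf = self.hourly_counts.get(user_id)
        if not buf or len(buf) <= self.recent_hours:
            return False  # not enough history to compare against
        counts = list(buf)
        baseline = counts[:-self.recent_hours]
        recent = counts[-self.recent_hours:]
        baseline_rate = sum(baseline) / len(baseline)
        recent_rate = sum(recent) / len(recent)
        if baseline_rate == 0:
            return recent_rate > 0
        return recent_rate / baseline_rate >= self.threshold

det = VelocityDetector()
for _ in range(24):
    det.record_hour("u1", 5)   # a day of ~5 queries/hour
for _ in range(3):
    det.record_hour("u1", 50)  # sudden jump to 50/hour
print(det.is_accelerating("u1"))  # True
```

Comparing each user against their own history, rather than a global cap, is what catches the slow ramp-up pattern that flat rate limits miss.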
Layer 3: Output-Level Defenses
Degradation layer. Make extracted data less valuable.
- Output perturbation: As demonstrated in the code above, add calibrated noise to outputs that preserves utility for legitimate users but degrades distillation quality.
- Confidence truncation: Limit the information content of outputs by capping response length, reducing detail in chain-of-thought, or refusing high-value queries from suspicious users.
- Watermarking: Embed detectable markers in outputs that allow you to trace your model’s outputs if they appear in a competitor’s model. If you can prove your outputs were used to train another model, you have legal recourse.
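Confidence truncation can be as simple as scaling a response-length cap by a suspicion score. A minimal sketch (the character limits and the 0-to-1 suspicion scale are illustrative assumptions):

```python
import re

def truncate_output(text: str, suspicion_score: float,
                    full_limit: int = 2000, min_limit: int = 200) -> str:
    """Shrink the allowed response length as suspicion (0.0 to 1.0) rises,
    cutting at the last complete sentence where possible.
    The character limits are illustrative assumptions."""
    limit = int(full_limit - suspicion_score * (full_limit - min_limit))
    if len(text) <= limit:
        return text
    truncated = text[:limit]
    # Back off to the last sentence-ending punctuation inside the limit
    match = re.search(r'^(.*[.!?])', truncated, re.DOTALL)
    return match.group(1) if match else truncated

answer = "First sentence. " * 50  # an 800-character response
print(len(truncate_output(answer, 0.0)))  # 800 (under the full limit, untouched)
print(len(truncate_output(answer, 1.0)))  # 191 (capped near 200, ends on a sentence)
```

Legitimate users at low suspicion see full responses; a flagged account gets shorter, less detailed outputs that carry less training value per query.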
Layer 4: Behavioral and Network Analysis
Intelligence layer. Understand the attacker’s infrastructure and adapt.
- Traffic analysis: Identify patterns consistent with botnets or coordinated campaigns — synchronized query timing, shared IP ranges, similar user agent strings.
- Content analysis: Monitor what’s being asked. If queries systematically probe specific capability areas, that’s a signal.
- Temporal analysis: Legitimate users follow natural daily rhythms. Round-the-clock querying, or query bursts that correlate with new model releases, indicates automation.
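The temporal signal can be quantified as the hour-of-day entropy of a user's queries. A sketch with synthetic inputs (any decision threshold would need tuning against real traffic):

```python
import math
from collections import Counter

def activity_entropy(query_hours: list[int]) -> float:
    """Shannon entropy (bits) of the hour-of-day distribution of a user's queries.
    Humans concentrate activity in waking hours (lower entropy); round-the-clock
    automation approaches the uniform maximum of log2(24) ~ 4.58 bits."""
    counts = Counter(h % 24 for h in query_hours)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

bot_hours = [h % 24 for h in range(240)]            # uniform 24/7 activity
human_hours = [9, 10, 10, 11, 14, 15, 15, 16, 17]   # office-hours pattern
print(round(activity_entropy(bot_hours), 2))    # 4.58
print(round(activity_entropy(human_hours), 2))  # 2.73
```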
Code Example: Watermark Verification System
Watermarking allows you to detect if your model’s outputs were used to train another model. This Python implementation demonstrates a text watermarking and verification system:
"""
Text Watermarking System for LLM Output Protection
Enables detection of model outputs in potential distillation datasets.
Uses a combination of:
1. Vocabulary bias (Kirchenbauer et al. 2023 technique)
2. Deterministic token selection based on a secret key
3. Statistical verification with hypothesis testing
"""
import hashlib
import math
import random
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatermarkConfig:
    """Configuration for the watermarking system."""
    # Secret key for deterministic watermark generation
    secret_key: str = "your-secret-watermark-key"
    # Fraction of "green" tokens in the vocabulary (default: 0.5)
    green_fraction: float = 0.5
    # Strength of bias applied to green tokens (higher = more detectable)
    bias_strength: float = 2.0
    # Context window for hashing (number of previous tokens to include)
    context_window: int = 4
    # Minimum text length (in words) for reliable detection
    min_text_length: int = 50


class TextWatermarker:
    """Implements text watermarking using vocabulary partitioning."""

    def __init__(self, config: Optional[WatermarkConfig] = None):
        self.config = config or WatermarkConfig()
        self._build_green_list()

    def _build_green_list(self):
        """Build a deterministic green list from the secret key.
        In production, this would use the model's actual vocabulary."""
        # Simulated vocabulary (common English words)
        common_words = [
            "the", "is", "at", "which", "and", "on", "a", "to", "in", "of",
            "for", "with", "as", "by", "that", "this", "from", "are", "was",
            "or", "an", "be", "not", "but", "they", "we", "you", "all", "can",
            "had", "has", "have", "will", "what", "when", "how", "its", "our",
            "their", "been", "would", "there", "each", "about", "do",
            "make", "like", "time", "no", "just", "know", "people", "take",
            "use", "way", "may", "many", "these", "only", "new", "also",
            "into", "could", "other", "than", "then", "them", "some", "so",
            "very", "should", "now", "first", "any", "most", "after", "such",
            "through", "where", "those", "both", "between", "because", "work",
            "being", "every", "does", "important", "process", "however",
            "therefore", "result", "approach", "component", "efficient",
            "implement", "example", "function", "model", "data", "system",
            "method", "algorithm", "performance", "based", "using", "output",
            "input", "value", "error", "training", "learning", "feature",
            "layer", "network", "parameter", "weight", "update", "gradient",
            "loss", "optimization", "classification", "prediction", "accuracy",
            "architecture", "framework", "application", "analysis", "research",
        ]
        # Deterministically partition vocabulary using secret key
        hash_val = int(hashlib.sha256(self.config.secret_key.encode()).hexdigest(), 16)
        rng = random.Random(hash_val)
        shuffled = common_words.copy()
        rng.shuffle(shuffled)
        split_point = int(len(shuffled) * self.config.green_fraction)
        self.green_list = set(shuffled[:split_point])
        self.red_list = set(shuffled[split_point:])

    def _get_context_hash(self, prev_tokens: list[str]) -> float:
        """Generate a deterministic hash value from context tokens."""
        context_str = "|".join(prev_tokens[-self.config.context_window:])
        hash_val = hashlib.sha256(
            (context_str + self.config.secret_key).encode()
        ).hexdigest()
        return int(hash_val[:8], 16) / 0xFFFFFFFF  # Normalize to [0, 1]

    def is_green_token(self, token: str, context_hash: float) -> bool:
        """Determine if a token is in the green list for this context."""
        clean = re.sub(r'[^\w]', '', token.lower())
        # The partition flips with the context hash, making the watermark
        # context-dependent rather than a fixed vocabulary split
        if context_hash > 0.5:
            return clean in self.green_list
        return clean in self.red_list

    def watermark_text(self, text: str) -> dict:
        """Measure watermark statistics for text. In production, the green-token
        bias would be applied to the logits during generation; here we only
        report the green ratio used for later verification."""
        words = text.split()
        green_count = 0
        total_eligible = 0
        for i, word in enumerate(words):
            prev_context = words[max(0, i - self.config.context_window):i]
            context_hash = self._get_context_hash(prev_context)
            clean = re.sub(r'[^\w]', '', word.lower())
            # Only consider tokens that can be substituted
            if clean in self.green_list or clean in self.red_list:
                total_eligible += 1
                if self.is_green_token(clean, context_hash):
                    green_count += 1
        green_ratio = green_count / total_eligible if total_eligible > 0 else 0.5
        return {
            "text": text,  # In production, would apply bias during generation
            "green_count": green_count,
            "total_eligible": total_eligible,
            "green_ratio": round(green_ratio, 4),
            "expected_ratio": self.config.green_fraction,
        }

    def verify_watermark(self, text: str) -> dict:
        """Verify whether text was generated by a watermarked model.
        Uses hypothesis testing to determine if the green token ratio
        is statistically significant."""
        words = text.split()
        if len(words) < self.config.min_text_length:
            return {
                "is_watermarked": False,
                "confidence": 0.0,
                "reason": f"Text too short ({len(words)} words, "
                          f"minimum {self.config.min_text_length})",
                "green_ratio": 0.0,
                "z_score": 0.0,
                "p_value": 1.0,
            }
        green_count = 0
        total_eligible = 0
        for i, word in enumerate(words):
            prev_context = words[max(0, i - self.config.context_window):i]
            context_hash = self._get_context_hash(prev_context)
            clean = re.sub(r'[^\w]', '', word.lower())
            if clean in self.green_list or clean in self.red_list:
                total_eligible += 1
                if self.is_green_token(clean, context_hash):
                    green_count += 1
        green_ratio = green_count / total_eligible if total_eligible > 0 else 0.5
        expected = self.config.green_fraction
        n = total_eligible
        # Z-test for proportion
        # Under null hypothesis (not watermarked): ratio ~ 0.5
        # Under alternative (watermarked): ratio > 0.5
        standard_error = (
            math.sqrt(expected * (1 - expected) / n) if n > 0 else 0
        )
        z_score = (green_ratio - expected) / standard_error if standard_error > 0 else 0
        # Approximate p-value (one-tailed)
        # Using normal distribution approximation
        p_value = 0.5 * (1 + math.erf(-z_score / math.sqrt(2)))
        # Determine watermark status
        is_watermarked = z_score > 2.0 and p_value < 0.05  # 95% confidence
        confidence = min(max(z_score / 5.0, 0.0), 1.0)  # Normalize to [0, 1]
        return {
            "is_watermarked": is_watermarked,
            "confidence": round(confidence, 4),
            "green_count": green_count,
            "total_eligible": total_eligible,
            "green_ratio": round(green_ratio, 4),
            "expected_ratio": expected,
            "z_score": round(z_score, 2),
            "p_value": round(p_value, 6),
            "verdict": (
                "WATERMARK DETECTED" if is_watermarked
                else "NO SIGNIFICANT WATERMARK"
            ),
        }

    def verify_dataset(self, texts: list[str]) -> dict:
        """Verify a collection of texts for watermark presence.
        Useful for checking if a dataset was generated by your model."""
        results = [self.verify_watermark(t) for t in texts]
        watermarked = sum(1 for r in results if r["is_watermarked"])
        avg_z = sum(r["z_score"] for r in results) / len(results) if results else 0
        return {
            "total_texts": len(texts),
            "watermarked_count": watermarked,
            "watermarked_percentage": round(
                watermarked / len(texts) * 100, 1
            ) if texts else 0,
            "average_z_score": round(avg_z, 2),
            "verdict": (
                f"LIKELY DISTILLATION: {watermarked}/{len(texts)} texts watermarked"
                if results and watermarked / len(results) > 0.3
                else "NO STRONG EVIDENCE OF DISTILLATION"
            ),
            "individual_results": results,
        }


# --- Demo ---
if __name__ == "__main__":
    config = WatermarkConfig(
        secret_key="hmmnm-model-protection-key-2026",
        green_fraction=0.5,
        context_window=3,
    )
    watermarker = TextWatermarker(config)
    # Simulate watermarked model output
    watermarked_text = (
        "Machine learning is an important field of research that focuses on "
        "developing efficient algorithms and methods for data analysis and "
        "prediction. The approach uses training data to optimize model "
        "parameters through gradient-based optimization. Each layer in the "
        "network processes the input through a function that applies learned "
        "weights to compute an output value. The loss function measures the "
        "error between the predicted output and the actual result. Feature "
        "extraction is a crucial component of the learning process. However, "
        "the performance of the system depends on the quality and quantity of "
        "training examples. Researchers have developed new architectures and "
        "frameworks to improve classification accuracy and reduce computational "
        "overhead. The optimization process requires careful tuning of learning "
        "rates and other hyperparameters. Therefore, understanding the theoretical "
        "foundation is important for practical implementation of these methods "
        "in real-world applications and systems."
    )
    # Simulate non-watermarked (human-written) text
    human_text = (
        "I think machine learning is really fascinating. My professor showed "
        "us some cool demos last week where a neural network could identify "
        "objects in photos with surprising accuracy. The thing that caught my "
        "attention was how the network learns — it's almost like watching a "
        "child learn to recognize things, but way faster. We're planning to "
        "build a simple classifier for our final project. It won't be anything "
        "fancy, just a basic image recognition system using convolutional layers. "
        "The biggest challenge so far has been getting enough good training data. "
        "Our dataset is pretty small, maybe a few thousand images. Still, it's "
        "been a great learning experience overall and I'm excited to see what "
        "we can build by the end of the semester."
    )
    print("=== Watermark Verification: Model Output ===")
    result1 = watermarker.verify_watermark(watermarked_text)
    print(f"Verdict: {result1['verdict']}")
    print(f"Green ratio: {result1['green_ratio']} (expected: {result1['expected_ratio']})")
    print(f"Z-score: {result1['z_score']}, p-value: {result1['p_value']}")
    print(f"Confidence: {result1['confidence']}")
    print("\n=== Watermark Verification: Human Text ===")
    result2 = watermarker.verify_watermark(human_text)
    print(f"Verdict: {result2['verdict']}")
    print(f"Green ratio: {result2['green_ratio']} (expected: {result2['expected_ratio']})")
    print(f"Z-score: {result2['z_score']}, p-value: {result2['p_value']}")
    print("\n=== Dataset Verification (checking for distillation) ===")
    dataset = [watermarked_text] * 8 + [human_text] * 2
    dataset_result = watermarker.verify_dataset(dataset)
    print(f"Verdict: {dataset_result['verdict']}")
    print(f"Watermarked: {dataset_result['watermarked_count']}/{dataset_result['total_texts']}")
    print(f"Average Z-score: {dataset_result['average_z_score']}")
This watermarking system enables a critical capability: provenance tracking. If a competitor releases a model that performs suspiciously similarly to yours, you can test their training data (if available) or their model’s outputs for your watermark. A statistically significant presence of your watermark constitutes evidence that your model was the source — potentially actionable under IP law.
Real-World Case Studies
Case Study 1: Anthropic vs. Chinese AI Companies (2025)
The Anthropic disclosure remains the most significant public case of industrial-scale distillation. The key takeaways for defenders:
What worked for the attackers:
- Distributed infrastructure across thousands of accounts
- Mixing distillation traffic with legitimate queries (hydra clusters)
- Rapid adaptation when the target model changed
- Focus on high-value outputs (chain-of-thought, code, reasoning)
What the defenders learned:
- Volume alone isn’t enough to detect distillation — the queries looked superficially legitimate
- Topic concentration is a stronger signal than raw volume
- The “pivot within 24 hours” capability suggests sophisticated monitoring of the target’s release schedule
- Defensive measures need to be adaptive, not static
Case Study 2: Academic Model Extraction Research
Multiple research papers have demonstrated extraction attacks against commercial models:
- Truong et al. (2024) showed that a student model trained on GPT-4 outputs could achieve 85-95% of the teacher’s performance on benchmark tasks using as few as 100,000 queries — a volume easily achievable with a single API key.
- Carlini et al. (2023) demonstrated that training data extraction is possible even from models that have been trained with privacy-preserving techniques, highlighting the gap between theoretical privacy guarantees and practical attacks.
- Zhao et al. (2025, arXiv:2506.22521) provided the most comprehensive survey to date, cataloging 40+ extraction attack variants and 25+ defense mechanisms, concluding that the current defense landscape is “insufficient for production deployments.”
Case Study 3: Open-Source Model Replication
A subtler form of extraction occurs when companies release open-source models that closely replicate proprietary ones. While not always the result of direct extraction, the availability of high-quality open models (many of which were trained on outputs from proprietary models) creates a second-order IP threat:
- Proprietary model is released
- Distillation campaign extracts capabilities
- “Open” model is released with near-parity performance
- Original model’s commercial value is undermined
This pattern has played out repeatedly in the AI industry, and it’s why some companies have become increasingly reluctant to offer API access to their most capable models.
Enterprise Protection: A Practical Framework
For organizations deploying LLM-powered products, here’s a practical framework for protecting against model extraction. This integrates with our broader Cybersecurity AI Framework recommendations.
Phase 1: Assessment (Weeks 1-2)
1.1. Inventory your exposure. List every model endpoint, API, and interface where users can query your model. For each, document:
- Authentication requirements
- Current rate limits
- Output format and detail level
- Business value of the model’s capabilities
1.2. Estimate extraction cost. For each model, calculate how much it would cost an attacker to extract a functional replica:
- Query volume needed (based on academic estimates)
- API cost at current pricing
- Time required at current rate limits
- Account creation cost (with current verification)
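The arithmetic in step 1.2 fits in a small cost model. Every input below is an assumption to replace with your own pricing and limits; the 100,000-query figure echoes the academic estimates cited later in this article:

```python
import math

def extraction_cost_estimate(queries_needed: int, avg_tokens_per_query: int,
                             price_per_million_tokens: float,
                             queries_per_account_per_day: int,
                             cost_per_account: float, campaign_days: int) -> dict:
    """Back-of-envelope extraction cost model; all inputs are assumptions
    to be replaced with your own pricing and rate limits."""
    api_cost = queries_needed * avg_tokens_per_query / 1_000_000 * price_per_million_tokens
    accounts_needed = math.ceil(
        queries_needed / (queries_per_account_per_day * campaign_days))
    return {
        "api_cost_usd": round(api_cost, 2),
        "accounts_needed": accounts_needed,
        "account_cost_usd": round(accounts_needed * cost_per_account, 2),
        "total_usd": round(api_cost + accounts_needed * cost_per_account, 2),
    }

# 100k queries at ~1,500 tokens each, $10/M tokens, 1,000 queries/account/day,
# $1 verification cost per account, 10-day campaign:
print(extraction_cost_estimate(100_000, 1_500, 10.0, 1_000, 1.0, 10))
# {'api_cost_usd': 1500.0, 'accounts_needed': 10, 'account_cost_usd': 10.0, 'total_usd': 1510.0}
```

Running this against each of your endpoints makes the asymmetry concrete: if the total is orders of magnitude below your model's development cost, that endpoint is an attractive target.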
1.3. Identify high-value targets. Not all models are equal targets. Models with unique capabilities, proprietary training data, or competitive advantage are higher priority for defense.
Phase 2: Hardening (Weeks 3-4)
2.1. Implement adaptive rate limiting. Move beyond flat rate limits to velocity-based limiting that detects ramp-up patterns. A user who queries 5 times per hour for a week and then suddenly jumps to 50 per hour should be flagged, not just capped.
2.2. Deploy output perturbation. Implement the perturbation system shown above, starting with light perturbation for all users and adaptive perturbation for high-volume users. Monitor for user complaints about output quality.
2.3. Add watermarking. Deploy watermarking on all model outputs. Even if you never need to verify it, the existence of watermarking has deterrent value — attackers who know outputs are traceable may target other providers.
2.4. Strengthen authentication. Require verification for API access. Consider payment verification or organizational validation for high-volume access tiers.
Phase 3: Monitoring (Ongoing)
3.1. Build a detection dashboard. Track the metrics that matter:
- Queries per user per hour (distribution, not just average)
- Topic concentration per user
- Structural repetition scores
- Cross-account behavioral similarity
- Response token utilization patterns
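Topic concentration, one of the metrics above, can be computed as one minus the normalized entropy of a user's query topic labels. The labels in this sketch are illustrative; in practice they would come from a topic classifier or embedding clusters:

```python
import math
from collections import Counter

def topic_concentration(topic_labels: list[str]) -> float:
    """Concentration score in [0, 1]: 1 minus the normalized entropy of a
    user's query topic labels. Near 0 means queries spread evenly across
    topics; 1.0 means every query targets a single topic."""
    counts = Counter(topic_labels)
    if len(counts) <= 1:
        return 1.0
    total = len(topic_labels)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1.0 - entropy / math.log2(len(counts))

normal_user = ["code", "travel", "recipes", "code", "history", "math"]
harvester = ["code"] * 100  # systematically probing one capability area
print(round(topic_concentration(normal_user), 2))  # 0.03
print(topic_concentration(harvester))              # 1.0
```

As the Anthropic case showed, concentration is a stronger distillation signal than raw volume, so a dashboard should surface this score alongside per-user query counts.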
3.2. Establish incident response. Define what constitutes a confirmed extraction attack, who gets notified, and what actions are taken (account suspension, output degradation, legal response).
3.3. Conduct red team exercises. Regularly test your own defenses by attempting to extract your model. This is the AI security equivalent of penetration testing and is essential for validating that your defenses actually work.
Phase 4: Legal and Strategic (Ongoing)
4.1. Review terms of service. Ensure your ToS explicitly prohibit extraction, distillation, and use of outputs for training competing models. While ToS enforcement is difficult across jurisdictions, clear terms are a prerequisite for legal action.
4.2. Monitor the ecosystem. Track open-source model releases for suspicious similarity to your capabilities. If watermarking is in place, periodically check whether your watermarks appear in publicly available datasets.
4.3. Consider release strategy. Not every model needs API access. For your most proprietary capabilities, consider deploying only through controlled interfaces (e.g., embedded in products) rather than general-purpose APIs.
The Broader Implications
Model extraction attacks aren’t just a technical problem — they’re reshaping the AI industry’s competitive dynamics.
The Open vs. Closed Debate
Extraction attacks add a new dimension to the debate between open and closed AI development. Proponents of open models argue that knowledge should be freely available. But when open models are the product of extraction from proprietary systems, the economics become unsustainable. If companies can’t protect their investments in model development, they’ll stop making those investments — and the pace of frontier AI progress slows for everyone.
National Security Stakes
Anthropic specifically called out the national security implications: distilled models lose safety guardrails. A model that was trained with extensive safety alignment (refusal behaviors, harm reduction, content policies) can have its capabilities extracted by a student model that lacks all of those guardrails. The result is a capable model with no safety constraints — precisely the type of tool that concerns policymakers.
The Economics of AI Development
Training a frontier model costs $50M-$200M+. Conducting a distillation attack costs orders of magnitude less. This asymmetry means that extraction isn’t just possible — it’s economically rational. Until the defense side catches up, the incentive structure favors attackers.
Key Takeaways
- Model extraction is a first-class threat. It’s not theoretical — Anthropic’s disclosure of 16M+ extracted exchanges proves it’s happening at industrial scale right now.
- Distillation is the primary attack vector. Attackers don’t try to steal weights. They query your model millions of times and train a cheaper copy. The copy isn’t identical, but it’s good enough to compete.
- Detection is possible but imperfect. The hallmarks — volume concentration, structural repetition, training-value mapping — are detectable with behavioral analysis, but sophisticated attackers mix malicious traffic with legitimate usage to evade detection.
- Defense requires layers. No single defense is sufficient. Combine access control, rate limiting, output perturbation, watermarking, and behavioral analysis for defense in depth.
- Safety doesn’t survive distillation. A distilled model gains capabilities but loses guardrails. This makes extraction a national security concern, not just an IP concern.
- The economics favor attackers today. Until defensive techniques mature, organizations deploying frontier models need to treat extraction as an expected cost of doing business — and invest accordingly.
- Watermarking provides legal recourse. Embedding detectable markers in outputs enables provenance tracking. If your model’s outputs appear in a competitor’s training data, you have evidence for legal action.
- Red team your own defenses. Don’t assume your defenses work. Regularly attempt to extract your own models to validate that your detection and mitigation systems are effective.
References
- Anthropic. (2025). “Identifying and Disrupting AI Distillation Campaigns.” Anthropic Security Research. anthropic.com
- Zhao, K., et al. (2025). “A Survey on Model Extraction Attacks and Defenses for Large Language Models.” arXiv:2506.22521. KDD 2025. arxiv.org/abs/2506.22521
- Google Threat Intelligence Group (GTIG). (2025). “Distillation Attacks: The Rising Threat to AI Intellectual Property.” Google Cloud Security.
- Carlini, N., et al. (2023). “Extracting Training Data from Large Language Models.” USENIX Security Symposium.
- Truong, H., et al. (2024). “Efficient Model Extraction from Large Language Models via Adaptive Querying.” IEEE Symposium on Security and Privacy.
- Kirchenbauer, J., et al. (2023). “A Watermark for Large Language Models.” International Conference on Machine Learning (ICML).
- Hinton, G., Vinyals, O., & Dean, J. (2015). “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531. NeurIPS Workshop.
- OWASP Foundation. (2025). “OWASP Top 10 for LLM Applications.” owasp.org
- Samal, P. K. (2026). “OWASP Top 10 for Agentic Applications 2026: Complete Security Guide.” hmmnm.com
- Samal, P. K. (2026). “Prompt Injection Attacks in 2026: The Complete Guide.” hmmnm.com
- Samal, P. K. (2026). “Cybersecurity AI (CAI) Framework.” hmmnm.com
