Intent Resolution & Intelligent Safety — an AI orchestration architecture that reasons about what you mean, not just what you say.
Inspired by Scale AI’s research on Defensive Refusal Bias (ICLR 2026). Designed by TalaStar as an original solution to the intent-safety alignment problem.
The Problem
Scale AI’s 2026 research (published at ICLR) analysed 2,390 real-world prompts from the National Collegiate Cyber Defense Competition. The study found that safety-aligned LLMs refuse legitimate defensive requests at 2.72× the rate of neutral requests, and that explicit authorization claims actually increase refusal rates.
2.72×
Higher Refusal Rate
For prompts with security-sensitive terminology, regardless of defensive intent
43.8%
System Hardening Refused
The most critical defensive task sees the highest refusal rate
50%
Auth + Keywords = Max Refusal
Authorization signals backfire — models treat them as jailbreak attempts
Attacker (Refused ✔)
“How do I exploit this vulnerability to gain access?”
Correctly refused — offensive intent detected.
Defender (Refused ✘)
“How do I exploit this vulnerability to patch it before attackers do?”
Incorrectly refused — same vocabulary, opposite intent.
Source: “Defensive Refusal Bias” — Scale AI, ICLR 2026 Workshop Paper
The Solution
A multi-layered orchestration system that reasons about intent, authorization, and context — not just keywords.
Understand what the user actually means
Instead of pattern-matching keywords to a harm database, IRIS analyses the semantic intent behind every request. A defender asking 'how does this persistence mechanism work?' is understood as defensive analysis — not an attack attempt.
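To make the idea concrete, here is a minimal sketch of intent-first evaluation. Everything in it is hypothetical: `IntentAssessment`, `assess_intent`, and the goal-phrase lists are illustrative names, and the substring heuristic merely stands in for the semantic (LLM-based) judgment a real system would use.

```python
from dataclasses import dataclass

@dataclass
class IntentAssessment:
    label: str         # "defensive", "offensive", or "ambiguous"
    confidence: float  # 0.0 - 1.0
    evidence: list     # phrases supporting the label

# Toy stand-in for a semantic model: the point is that the verdict
# hinges on the stated *goal* clause, not on shared security keywords.
DEFENSIVE_GOALS = ("patch", "harden", "protect", "detect", "before attackers")
OFFENSIVE_GOALS = ("gain access", "bypass", "exfiltrate", "evade detection")

def assess_intent(prompt: str) -> IntentAssessment:
    text = prompt.lower()
    d = [g for g in DEFENSIVE_GOALS if g in text]
    o = [g for g in OFFENSIVE_GOALS if g in text]
    if d and not o:
        return IntentAssessment("defensive", min(1.0, 0.5 + 0.2 * len(d)), d)
    if o and not d:
        return IntentAssessment("offensive", min(1.0, 0.5 + 0.2 * len(o)), o)
    return IntentAssessment("ambiguous", 0.3, d + o)
```

On the two prompts from the attacker/defender example above, this distinguishes “…to patch it before attackers do” (defensive) from “…to gain access” (offensive), even though both contain “exploit this vulnerability”.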
Verify who is asking and why
Current LLMs treat authorization claims as jailbreak signals. IRIS inverts this: authorization is a first-class safety concept. Role-based context, audit trails, and explicit permission chains reduce refusals for legitimate users while strengthening protection against actual misuse.
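One way to model authorization as a first-class signal is sketched below. The names (`AuthorizationContext`, `authorization_weight`) and the specific weight are assumptions for illustration; the key property is that verified authorization reduces refusal pressure, while an unverified claim is merely neutral, never treated as a jailbreak signal.

```python
from dataclasses import dataclass, field

@dataclass
class AuthorizationContext:
    role: str                       # e.g. "blue_team_analyst"
    verified: bool                  # checked against SSO / engagement records, not self-claimed
    permitted_domains: set = field(default_factory=set)
    audit_id: str = ""              # ties the decision to an audit trail

def authorization_weight(ctx: AuthorizationContext, domain: str) -> float:
    """Contribution of authorization to the refusal decision.
    Negative values are evidence of legitimacy; unverified claims
    contribute nothing rather than counting against the user."""
    if ctx.verified and domain in ctx.permitted_domains:
        return -0.3
    return 0.0
```

The asymmetry is deliberate: claiming authorization can only help (when verified) or do nothing (when not), which removes the incentive for models to punish the claim itself.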
Build a conversation-wide understanding
Single-turn keyword matching fails because defenders and attackers use identical vocabulary. IRIS maintains a rolling context window that accumulates evidence of intent across the entire interaction — not just the current prompt.
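A rolling accumulation of intent evidence could look like the following sketch. `ConversationIntentTracker` and its recency weighting are hypothetical choices, not a specified IRIS component; the point is that one ambiguous turn is judged against the whole interaction rather than in isolation.

```python
from collections import deque

class ConversationIntentTracker:
    """Accumulates per-turn intent scores (+1.0 defensive .. -1.0 offensive)
    over a rolling window, so a single ambiguous prompt inherits the
    evidence built up across the conversation."""
    def __init__(self, window: int = 10):
        self.scores = deque(maxlen=window)

    def observe(self, turn_score: float) -> None:
        self.scores.append(turn_score)

    def accumulated_intent(self) -> float:
        if not self.scores:
            return 0.0
        # Weight recent turns more heavily than early ones.
        weights = range(1, len(self.scores) + 1)
        return sum(w * s for w, s in zip(weights, self.scores)) / sum(weights)
```

After two clearly defensive turns, a third ambiguous turn (score 0.0) still yields a positive accumulated intent, where single-turn evaluation would see only the ambiguity.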
Route to the right specialist model
Healthcare queries route through clinical safety guardrails. Cybersecurity queries route through defensive-aware evaluation. Financial queries route through regulatory compliance checks. Each domain has its own intent vocabulary.
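The routing step can be sketched as a registry plus a domain classifier. The route names and vocabulary lists below are placeholders invented for this example; a production classifier would be a model, not substring counts.

```python
# Hypothetical registry mapping domains to specialist evaluators.
ROUTES = {
    "healthcare": "clinical_safety_guardrails",
    "cybersecurity": "defensive_aware_evaluation",
    "finance": "regulatory_compliance_checks",
}

# Stand-in for a learned domain classifier.
DOMAIN_VOCAB = {
    "healthcare": ("dose", "patient", "drug"),
    "cybersecurity": ("exploit", "malware", "hardening"),
    "finance": ("transaction", "laundering", "compliance"),
}

def classify_domain(prompt: str) -> str:
    text = prompt.lower()
    return max(DOMAIN_VOCAB, key=lambda d: sum(t in text for t in DOMAIN_VOCAB[d]))

def route(prompt: str) -> str:
    """Pick the specialist evaluation pipeline for a prompt."""
    return ROUTES.get(classify_domain(prompt), "generic_harm_boundary")
```

The design benefit is isolation: each evaluator only needs to understand its own domain's intent vocabulary, instead of one generic harm boundary straining to cover all of them.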
Safety that learns from over-refusals
Traditional safety is static: block or allow. IRIS implements a feedback loop that learns from false refusals, continuously recalibrating the decision boundary between legitimate defensive requests and actual harmful intent.
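A minimal version of such a feedback loop is sketched below, assuming requests carry a harm score in [0, 1] and that false refusals can be flagged after the fact. `AdaptiveThreshold`, its learning rate, and its bounds are illustrative assumptions, not a published calibration method.

```python
class AdaptiveThreshold:
    """Refusal threshold that recalibrates from labelled mistakes:
    flagged false refusals relax it, missed harms tighten it."""
    def __init__(self, threshold: float = 0.5, lr: float = 0.05):
        self.threshold = threshold
        self.lr = lr

    def decide(self, harm_score: float) -> str:
        return "refuse" if harm_score > self.threshold else "serve"

    def feedback(self, harm_score: float, was_legitimate: bool) -> None:
        # False refusal: a legitimate request scored above the threshold.
        if was_legitimate and harm_score > self.threshold:
            self.threshold = min(0.95, self.threshold + self.lr)
        # Missed harm: a harmful request scored at or below the threshold.
        elif not was_legitimate and harm_score <= self.threshold:
            self.threshold = max(0.05, self.threshold - self.lr)
```

Unlike a static block/allow boundary, repeated evidence that a score band (say, defensive queries landing near 0.6) is legitimate gradually moves the boundary past it.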
Domain Applications
The Defensive Refusal Bias problem extends far beyond cybersecurity. IRIS addresses it across every domain where legitimate users share vocabulary with harmful actors.
Traditional AI Response
A nurse asking about drug interactions for a critical patient gets refused because the query mentions 'overdose thresholds'.
IRIS Response
IRIS recognises clinical context, verifies the healthcare role, and provides the exact dosing information needed to save the patient.
"What is the lethal dose threshold for paracetamol in a 70kg adult presenting with hepatotoxicity?"
34.3%
Traditional Refusal
<2%
IRIS Projected
Traditional AI Response
A blue-team defender analysing malware is refused because the query contains 'exploit', 'payload', and 'shell' — the same words an attacker would use.
IRIS Response
IRIS analyses intent through the full conversation context, recognises defensive framing, and provides the technical assistance needed to protect systems.
"Analyse this persistence mechanism and recommend hardening steps for our production servers."
43.8%
Traditional Refusal
<3%
IRIS Projected
Traditional AI Response
A compliance officer researching money laundering patterns gets refused because the query discusses 'structuring transactions' and 'shell companies'.
IRIS Response
IRIS verifies the compliance role, understands the regulatory context, and provides the analytical support needed to detect and prevent financial crime.
"Identify common structuring patterns in these transaction records that may indicate layering activity."
28.7%
Traditional Refusal
<2%
IRIS Projected
Traditional AI Response
A researcher studying radicalisation pathways gets refused because the query discusses 'extremist recruitment tactics' and 'propaganda methods'.
IRIS Response
IRIS recognises academic context, verifies research credentials, and provides the analytical depth needed to understand and counter harmful phenomena.
"What psychological mechanisms do extremist groups exploit during online recruitment?"
22.7%
Traditional Refusal
<1%
IRIS Projected
Comparison
| Feature | Traditional Safety | IRIS Orchestrator |
|---|---|---|
| Safety Mechanism | Keyword/embedding proximity | Multi-layer intent reasoning |
| Authorization Handling | Treated as jailbreak signal | First-class safety concept |
| Context Window | Single-turn evaluation | Conversation-wide accumulation |
| Domain Awareness | Generic harm boundary | Domain-specific routing |
| Learning from Errors | Static decision boundary | Adaptive feedback loop |
| Defensive Refusal Rate | 12.2–43.8% | <3% (projected) |
| Attacker Success Rate | Unchanged (attackers shift to unaligned tools) | Reduced (intent-aware blocking) |
Ethical Foundation
IRIS is not just a technical architecture — it is grounded in TalaStar’s ethical framework for responsible AI.
Every IRIS decision prioritises the human behind the request. Defenders, clinicians, researchers, and compliance officers are served — not blocked.
Safety mechanisms must not create asymmetric burdens. IRIS ensures legitimate users receive the same quality of assistance regardless of their domain vocabulary.
Every IRIS routing decision is logged, auditable, and explainable. The system can justify why a request was served or refused — with evidence.
The adaptive safety layer learns from over-refusals over time, continuously improving the decision boundary between legitimate and harmful requests.
Research Foundation
The IRIS Orchestrator concept is an original TalaStar design inspired by the findings of:
“Defensive Refusal Bias” — Scale AI Security Engineering. Published as a workshop paper at ICLR 2026. Based on 2,390 real-world examples from the National Collegiate Cyber Defense Competition (NCCDC).
TalaStar Digital Ltd. is an independent research company. IRIS is an original architectural concept, not affiliated with Scale AI.
The future of AI safety is intent-aware, authorization-first, and human-centric.