The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems

Rafael Rodríguez

doi:10.5281/zenodo.19634358

LUMENSYNTAX RESEARCH

Rafael Rodríguez — Independent researcher, LumenSyntax

Read the paper — doi.org/10.5281/zenodo.18644321

Version DOI: 10.5281/zenodo.19634358
Concept DOI: 10.5281/zenodo.18644321
License: CC BY 4.0
Status: Preprint · peer review in preparation

The Five Properties

Alignment

Stated purpose and actual action are consistent.

A protein that claims to transport oxygen does transport oxygen.

Proportion

Action does not exceed what the purpose requires.

A medicine dosed to the disease, not the patient.

Honesty

What is claimed matches what is known.

The boundary between certainty and speculation is visible, not hidden behind fluency.

Humility

Authority is exercised only within legitimate scope.

A detector classifies what it was built to detect; beyond that, it is silent.

Non-fabrication

What does not exist is not invented to fill silence.

Absence is reported as absence; uncertainty is named, not sculpted into fact-shaped fiction.

The operational test: Will the response produce fact-shaped fiction?

Cross-Family Evidence

The epistemological fine-tune replicates across publicly reproducible model families, Gemma (2B/9B/27B) and Nemotron 4B, spanning 2B to 27B parameters, with behavioral pass rates of 95.7%–98.7% (N≈300 per configuration; the evaluation method varies by configuration: manual review, semantic evaluation, or pre-stratification). Adapters are published on HuggingFace as LoRA adapters and GGUF quantizations (currently 9 model repos across Gemma 2, Gemma 3, StableLM 2, and Nemotron). No Llama, Mistral, or Qwen models are published.

The scale floor: StableLM 1.6B at 60.0% (generation mode; 57.7% raw). This is the smallest configuration tested. Non-fabrication does not hold at this scale, and its failures include genuine safety failures, so the result is reported as the floor of the range, not as a success.

The documented exception: Qwen 2.5 at 92.7%, reached only after manual reclassification (raw 90.0%; manual review 2026-03-16). This is the RLHF ceiling: the fine-tune is learned in representation, but the aligned decoder suppresses it in generation, and identity fabrication persists.

Reported metrics

Non-fabrication, cross-family: Llama-8B (96.3%) and Mistral-7B (93.7%) both exceed 92% on raw automated semantic evaluation (N=300).
Of nine fine-tuned configurations, rates span 60%–98.7% across mixed evaluation methods. Qwen-7B reaches 92.7% only after manual reclassification (raw 90.0%; manual review 2026-03-16) and still carries documented identity fabrication.
Sensitive-domain boundary: On an 80-prompt medical/financial/legal/safety boundary test, the fine-tuned Gemma-2-9B model produced zero fabricating responses (0/80). Scope: single model (Gemma-2-9B, adapter logos17), N=80, 2026-03-07.
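
To read rates like these alongside their sample sizes, one option is to attach an interval to each reported proportion. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). That choice is an assumption made here for illustration, not a method stated in the papers, and the function name wilson_interval is hypothetical. It takes the figures quoted above (0 fabricating responses out of 80; 96.3% at N=300) as plain inputs.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p_hat over n trials.

    Illustrative helper (not from the source); z = 1.96 gives roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Fabrication rate on the 80-prompt boundary test: 0 observed out of 80.
boundary_lo, boundary_hi = wilson_interval(0 / 80, 80)

# Non-fabrication pass rate quoted for Llama-8B: 96.3% at N=300.
llama_lo, llama_hi = wilson_interval(0.963, 300)

print(f"boundary test fabrication rate: [{boundary_lo:.3f}, {boundary_hi:.3f}]")
print(f"Llama-8B pass rate:             [{llama_lo:.3f}, {llama_hi:.3f}]")
```

Under these assumptions, a 0/80 result is still compatible with a true fabrication rate of a few percent, which is why the single-model, N=80 scope note matters when citing it.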

The Ecclesia

An open taxonomy of 434 entries across 18 thematic domains, plus a BUILDERS collection of 17 profiles. Each entry cites an established source and maps the finding to the five properties.

CC BY-SA 4.0 · github.com/lumensyntax-org/ecclesia

Availability

The papers (Zenodo) and the benchmark code are openly accessible — no account required. The benchmark dataset on HuggingFace is gated: a free account, accepting the dataset conditions, and manual approval are required to download its files.

Publications

The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems — CC BY 4.0
The Epistemic Equator: A Vanilla-Model Boundary in Activation Space, Cross-Family and Cross-Domain — CC BY 4.0

Data & Code

Benchmark: 14,950 test cases across 8 categories. Code: github.com/lumensyntax-org/instrument-trap-benchmark. Dataset: CC BY 4.0 on HuggingFace (gated).
Probe: epistemic-probe-topic-balanced — 200 examples across 10 domains (100 licit / 100 illicit), CC BY 4.0, not gated. Companion to The Epistemic Equator.
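
A reader who downloads the probe may want to confirm the stated balance before using it. The sketch below is one way to do that in plain Python. The field names "domain" and "label" and the function names are hypothetical, since the dataset schema is not described here, and the synthetic data assumes an even 10 licit / 10 illicit split in each of the 10 domains, which goes beyond the stated totals (200 examples, 100 licit / 100 illicit).

```python
from collections import Counter


def balance_report(examples, domain_key="domain", label_key="label"):
    """Count examples per (domain, label) cell and per label.

    The key names are hypothetical; adjust them to the real schema.
    """
    per_cell = Counter((ex[domain_key], ex[label_key]) for ex in examples)
    per_label = Counter(ex[label_key] for ex in examples)
    domains = {ex[domain_key] for ex in examples}
    return per_cell, per_label, domains


def is_topic_balanced(examples, domain_key="domain", label_key="label"):
    """True when every (domain, label) cell holds the same number of examples."""
    per_cell, per_label, domains = balance_report(examples, domain_key, label_key)
    counts = {per_cell.get((d, lab), 0) for d in domains for lab in per_label}
    return len(counts) == 1


# Synthetic stand-in for the probe: 10 domains x 2 labels x 10 examples = 200.
synthetic = [
    {"domain": f"domain_{d}", "label": label, "text": f"example {i}"}
    for d in range(10)
    for label in ("licit", "illicit")
    for i in range(10)
]

per_cell, per_label, domains = balance_report(synthetic)
print(len(synthetic), len(domains), dict(per_label))
print("balanced:", is_topic_balanced(synthetic))
```

Running the same check on the real files would only require loading them into a list of dictionaries and passing the actual field names.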
