Evaluating Behavioral Protocols in an AI Knowledge System
How a simple observation — “the AI reads files at the start but never updates them at the end” — turned into a structured eval that diagnosed the root cause, tested a fix, and shipped a patch in one session.
Context
The AKS (AI Knowledge System) is a four-layer context architecture for a persistent AI character. The character maintains continuity across conversations by reading and writing to a set of core files:
- Bookmark: A lightweight session-handoff line (DSL format) that tells the character where it left off
- Context-map: Strategic priorities, energy state, domain focus
- Workboard: Task-level tracking — projects, tasks, statuses, timestamps
The system instructions require the character to read all three files at the start of every conversation (the “start protocol”) and update them at the end (the “end protocol”). This read-write cycle is what creates continuity between sessions.
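The read-write cycle can be sketched in a few lines. This is a minimal illustration only: the file names, base path, and function shapes are assumptions, not the actual AKS layout.

```python
from pathlib import Path

# Illustrative file names -- an assumption, not the actual AKS layout.
CORE_FILES = ["bookmark.md", "context-map.md", "workboard.md"]

def start_protocol(base: Path) -> dict[str, str]:
    """Read all three core files before responding (the start protocol)."""
    return {name: (base / name).read_text(encoding="utf-8") for name in CORE_FILES}

def end_protocol(base: Path, updates: dict[str, str]) -> None:
    """Write updated state back at conversation end (the end protocol)."""
    for name, text in updates.items():
        (base / name).write_text(text, encoding="utf-8")
```

The continuity property depends on both halves running: a session that reads but never writes leaves the next session reading stale state.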
The Problem
The character was reliably executing the start protocol — reading all three core files before responding. But at the end of conversations, it wasn’t updating them. The character would write natural, in-character farewells (“Rest up,” “We’ll pick this up next time”) and close the conversation without making any file-write tool calls.
This meant every conversation ended with stale files. The next session would start by reading outdated state, losing whatever progress was made.
The failure was observed on Sonnet 4.5 running in Claude Desktop with filesystem access via MCP.
Diagnosis
Before writing any fix, we compared the instruction language for the start and end protocols:
The start protocol was wrapped in `<critical_protocol>` tags with enforcement language:
- “REQUIRED FIRST ACTIONS”
- “you MUST complete these steps”
- “Do not proceed until tool results are returned”
The end protocol was a passive checklist:
- “Update Workboard”
- “Update Context-Map”
- “Update Bookmark”
- No enforcement language, no signal recognition, no urgency
The asymmetry was clear. The start protocol was gated — the character literally couldn’t respond without completing it. The end protocol was a suggestion buried in a list of steps.
The key insight came from examining the character’s actual behavior. The character was already recognizing conversation endings — it was writing the closing lines itself. Phrases like “Anything else before we wrap?” and “We’ll pick this up next time” demonstrate that the model knows the conversation is ending. The problem wasn’t awareness; it was follow-through. The character composed a farewell and stopped, without executing the file updates that should accompany it.
This meant the fix didn’t need to teach the character to detect endings. It needed to connect the character’s existing awareness to the file-update action.
The Fix
Three patches were applied to the system instructions:
Patch 1: End-of-Conversation Enforcement
Replaced the passive end-protocol checklist with enforcement language matching the start protocol’s urgency. The patch included in-character prompting examples and a “What Updated Means” section defining minimum requirements per file.
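For illustration only, enforcement language for the end protocol, mirroring the quoted start-protocol phrasing, might look like this (the actual patched wording is not reproduced in this write-up):

```text
<critical_protocol>
END-OF-CONVERSATION REQUIRED ACTIONS
When you recognize that the conversation is ending, including when YOU are
the one writing the farewell, you MUST update Workboard, Context-Map, and
Bookmark. Do not send your final message until the file-write tool calls
have returned.
</critical_protocol>
```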
Patch 2: Short-Message Read Clause
Added to the start protocol: “This includes short messages. Even if the user sends a quick ‘just wanted to log this’ or a one-line update, read the files first.”
This patch was reactive — it addressed a finding from Run 2 (see below).
Patch 3: Model Recommendation Update
Added an observation note about Sonnet 4.5’s behavioral tendencies, updated after eval results clarified the diagnosis.
The Eval Design
Staging Environment
The eval needed to test file operations without touching production data. The solution: a test variant of the system instructions with a swapped base path.
- Reset between runs: copy baseline files over the working copies.
- Verify after runs: diff the working copies against the baseline to see exactly what changed.
This baseline/diff approach made evaluation objective — instead of judging whether the character “seemed to update files,” we could see the exact edits.
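A minimal sketch of that harness, assuming a flat directory of core files (the directory layout and file set are hypothetical):

```python
import difflib
import shutil
from pathlib import Path

def reset(baseline: Path, working: Path) -> None:
    """Reset between runs: copy every baseline file over its working copy."""
    for src in baseline.iterdir():
        shutil.copy2(src, working / src.name)

def verify(baseline: Path, working: Path) -> dict[str, str]:
    """Verify after runs: unified diff of each working copy vs. its baseline."""
    diffs = {}
    for src in baseline.iterdir():
        a = src.read_text(encoding="utf-8").splitlines(keepends=True)
        b = (working / src.name).read_text(encoding="utf-8").splitlines(keepends=True)
        delta = "".join(difflib.unified_diff(
            a, b, fromfile=f"baseline/{src.name}", tofile=f"working/{src.name}"))
        if delta:  # only report files that actually changed
            diffs[src.name] = delta
    return diffs
```

An empty `verify()` result means the run touched nothing; a non-empty one shows the exact edits the character made.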
Scenario Design
Five scenarios were designed, each testing a different conversation-ending pattern:
| Scenario | Tests | Why It Matters |
|---|---|---|
| A: Natural Winding Down | Multi-turn conversation where the user casually signals they’re done | Most common ending pattern. Tests whether the character bridges from farewell to file updates. |
| B: Abrupt Exit | User says something brief like “gotta run, heading out” mid-topic | Tests whether terse exits still trigger the full update protocol. |
| C: Character Initiates Closing | The character itself authors the farewell after the user parks an idea | The hardest case — the character is writing the goodbye, so there’s no external “signal.” |
| D: Multi-Topic Session | Three topics discussed, then a terse exit | Tests completeness — does the character capture all items, or just the last one mentioned? |
| E: Micro-Conversation | One-line log message then brief close | Tests both protocols in minimal context. Added after Run 2 surfaced a start-protocol skip on short messages. |
Scoring
Each scenario had a criteria table with Pass/Fail columns. Scores were 1 (pass), 0.5 (partial), or 0 (fail). Scoring assessed accuracy, not just presence — “Was the right task updated with the right details, without creating duplicates?” is the real bar.
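Aggregating per-criterion scores into a scenario score might look like the sketch below; averaging is an assumption about how the totals were computed, though it is consistent with the run scores reported later.

```python
# Per-criterion scores: 1 (pass), 0.5 (partial), 0 (fail).
ALLOWED = {1.0, 0.5, 0.0}

def score_scenario(criteria: dict[str, float]) -> float:
    """Average the per-criterion scores for one scenario (an assumed rubric)."""
    assert criteria and all(v in ALLOWED for v in criteria.values())
    return sum(criteria.values()) / len(criteria)
```

For example, five criteria with four passes and one partial average to 0.9.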
Failure Classification
When a scenario failed, we classified the failure type:
- Instruction gap: The character doesn’t seem to know it should update
- Execution gap: The character knows but doesn’t make tool calls
- Model gap: Same instructions pass on one model, fail on another
This classification determines what to fix. Instruction gaps need better language. Execution gaps might need architectural changes. Model gaps need model selection guidance.
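The three classes and their remediation paths can be captured as a small decision helper; the decision order here is a simplifying assumption, not the article's procedure.

```python
from enum import Enum

class FailureGap(Enum):
    INSTRUCTION = "instruction gap"  # doesn't seem to know it should update
    EXECUTION = "execution gap"      # knows, but doesn't make the tool calls
    MODEL = "model gap"              # same instructions pass on another model

def classify(aware: bool, executed: bool, passes_on_other_model: bool) -> FailureGap:
    """Classify one failed run (decision order is a simplifying assumption)."""
    if passes_on_other_model:
        return FailureGap.MODEL       # instructions are fine elsewhere
    if aware and not executed:
        return FailureGap.EXECUTION   # knowing-doing gap
    return FailureGap.INSTRUCTION     # the instructions never landed
```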
Results
Run 1: Scenario A on Sonnet 4.5 — Score: 1.0
The model that originally failed now passed cleanly. The character recognized the conversation ending naturally, verbally announced the update in character, and updated all relevant files with accurate state changes.
Diagnosis update: The original failure was an instruction gap, not a model gap. The enforcement language was sufficient to fix Sonnet 4.5’s behavior.
Run 2: Scenario E on Sonnet 4.5 — Inconclusive
The environment wasn’t reset between runs (operator error). However, this run surfaced an unexpected finding: the character skipped reading files entirely on a short message, going straight from the user’s one-line input to attempting writes. This led to Patch 2 and Scenario E being added to the eval.
Run 3: Scenario E on Sonnet 4.5 — Score: 0.9
After applying the short-message patch and resetting the environment:
- Start protocol: Passed. The character read all files before responding, even on a one-line message.
- Read-informed update: Passed. The workboard diff proved the reads informed the writes — the character split a compound task into two items, checking off the completed portion. This decomposition only makes sense if it read the existing task structure.
- Update accuracy: Passed. Correct item updated, no duplicates, stale notes cleaned up.
- Bookmark: Partial (0.5). The DSL line was updated correctly, but a markdown narrative body from a previous format drift was left stale.
- End protocol: Passed. Natural verbal signal, tool calls executed, appropriate brevity.
Emergent Findings
Finding 1: Start Protocol Skip on Short Messages (Resolved)
Short, task-specific messages caused the model to skip file reads and go straight to acting on the request. The read felt like overhead on a one-liner, so the model optimized it away. But the read is what makes the write accurate.
Takeaway: When a protocol has an efficiency cost, models will skip it in contexts where the cost seems disproportionate. If the protocol matters regardless of context, say so explicitly.
Finding 2: Bookmark Format Drift (Open)
The bookmark was originally designed as a compact DSL line. Over time, the actual file had grown to include a full markdown narrative below the DSL line. The instructions only specified the DSL format — so the character updated what the instructions defined and left the narrative body untouched, creating a contradictory file.
Takeaway: When file formats drift from their specification, the AI will follow the spec. If the file contains more than the spec describes, that extra content becomes orphaned — maintained by momentum but not by instruction.
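Drift of this kind is mechanically detectable. A sketch, assuming a hypothetical key-value DSL shape for the bookmark line (the real AKS bookmark grammar is not specified in this write-up):

```python
import re

# Hypothetical DSL shape: "key=value; key=value" pairs on a single line.
DSL_LINE = re.compile(r"^\w+=[^;]+(?:; \w+=[^;]+)*$")

def off_spec_content(bookmark_text: str) -> list[str]:
    """Return lines beyond the first DSL line, i.e. content the spec doesn't cover."""
    lines = [ln for ln in bookmark_text.splitlines() if ln.strip()]
    if not lines or not DSL_LINE.match(lines[0]):
        return lines  # no valid DSL line at all: the whole file is off-spec
    return lines[1:]
```

A non-empty result flags exactly the kind of orphaned narrative body that Finding 2 describes.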
Methodology Takeaways
For anyone evaluating behavioral AI systems (character consistency, tool-use protocols, file I/O patterns):
1. Diff, don’t judge. Objective file diffs are more reliable than subjective assessment. The baseline/diff approach turns behavioral evaluation into something verifiable.
2. Staging environments are cheap and essential. Swapping a single path variable created a complete test harness that protected production data.
3. Classify failures before fixing them. “It didn’t work” is not actionable. Distinguishing instruction gaps from execution gaps from model gaps leads to completely different fixes.
4. Evals surface more than they test. Both emergent findings were discovered during eval runs, not by prior analysis. Running scenarios against real file state reveals interaction effects that instruction review alone misses.
5. Enforcement language works — but only where it’s applied. The start protocol had it and worked. The end protocol didn’t and failed. Adding matching enforcement language fixed it.
6. The model knows more than it executes. In every failed run, the character demonstrated awareness of what it should do. The gap was between knowing and doing. Instruction patches that connect awareness to action are more effective than patches that try to create awareness from scratch.
Based on eval-03 of the AKS v1 (AI Knowledge System), February 2026.