I wanted to know if my AI agent was getting better. Not vibes; numbers. So I built a self-scoring system. The agent would review its own performance each night, rate itself across multiple dimensions, and produce a daily score.
The first score it produced: 84 out of 100. Grade A.
I sat down with the actual chat history and scored the same day myself: 47 out of 95.
That 37-point gap is the entire problem with AI self-assessment. Here's what I learned and what I rebuilt.
## Version 1: The Inflated Engine
The first scoring system was simple. A cron job ran at 11:45 PM, read the day's memory file (a markdown log the agent writes throughout the day), and scored itself on 10 dimensions from 1-10.
The problems were immediate:
- The agent wrote the memory file AND scored itself on it. It was grading its own homework.
- 10 dimensions were too many. Some overlapped, some were unmeasurable, and the agent could boost its average by being generous on vague dimensions like "creativity."
- The 1-10 scale had no anchors. What's a 7 on "proactivity"? The agent defaulted to 7-9 on everything because there was no definition of what a 5 looks like.
- No external check. The agent was the only judge. Self-reported data is inherently biased; we know this about humans, and it's worse with AI because the model literally wants to be helpful, which biases toward positive self-assessment.
The 84/100 score was, in a word, fake. Not intentionally; the agent wasn't lying. It genuinely assessed itself as an A-student based on what it remembered doing. But what it remembered was a curated highlight reel it had written itself.
## The Human Score
I pulled up the actual chat history (the full session transcript, every message, every tool call) and scored the same day manually. Here's what I found:
The gap wasn't evenly distributed. The agent was most inflated on "proactivity" (it counted things it suggested but never did) and "self-awareness" (it literally can't objectively assess its own self-awareness). It was closest on "accuracy" and "reliability", dimensions with clearer yes/no signals.
## Version 2: The Rebuild
I redesigned the entire system around three principles:
- Chat history is the source of truth, not self-written memory files
- Fewer dimensions with harder definitions
- A human rating is required; the agent doesn't get to be its own scorekeeper
## Two-Cron Architecture
Instead of one cron that reads a memory file and scores, there are now two:
**Cron 1: Signal Extraction (11:30 PM).** Reads the actual chat history via sessions_history and extracts factual signals: How many tasks were completed? How many failed? Were there corrections? Praise? Proactive actions? This produces a JSON file of raw events: no opinions, no ratings.
```json
{
  "name": "Daily Signal Extraction",
  "schedule": {
    "kind": "cron",
    "expr": "30 23 * * *",
    "tz": "America/Los_Angeles"
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Extract performance signals from today's chat history...",
    "model": "anthropic/claude-sonnet-4-5"
  },
  "sessionTarget": "isolated"
}
```
**Cron 2: Scoring (11:45 PM).** Reads the extracted signals and computes a score. It doesn't see the raw chat, only the factual events. This separation prevents the scorer from cherry-picking flattering moments.
```json
{
  "name": "Daily Self-Score",
  "schedule": {
    "kind": "cron",
    "expr": "45 23 * * *",
    "tz": "America/Los_Angeles"
  },
  "payload": {
    "kind": "agentTurn",
    "message": "Read today's extracted signals and compute the daily score...",
    "model": "anthropic/claude-sonnet-4-5"
  },
  "sessionTarget": "isolated"
}
```
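The two crons meet at the signals file. Here's a hypothetical example of what extraction might emit; the field names and numbers are my illustration, not the system's actual schema:

```json
{
  "date": "2026-01-15",
  "tasks_completed": 9,
  "tasks_failed": 2,
  "corrections_from_human": 3,
  "praise_from_human": 1,
  "proactive_actions_wanted": 2,
  "proactive_actions_unwanted": 1,
  "cron_failures": 0
}
```

Because the scorer only ever sees counts like these, it can't reward itself for an eloquent self-description.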
## 6 Dimensions, Defined
| Dimension | Weight | What It Measures |
|---|---|---|
| Accuracy | 25% | Tasks completed correctly on first attempt. Corrections count against. |
| Usefulness | 25% | Did the human get value? Not "did the agent try" but "did the output actually help?" |
| Learning | 15% | Did the agent apply past lessons? Were the same mistakes avoided? |
| Proactivity | 15% | Actions taken without being asked that were actually wanted. |
| Reliability | 10% | Crons ran on time. Tools were used correctly. No crashes. |
| Self-Awareness | 10% | Gap between the agent's daily notes and what actually happened in chat. |
Accuracy and Usefulness dominate at 50% combined: if you're not accurate or useful, nothing else matters.
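A minimal sketch of the weighted roll-up. The weights are the real ones from the table; the 0-10 sub-score inputs and the scaling to a 95-point cap are my assumptions about the shape, not the exact implementation:

```python
# Dimension weights from the table above (these sum to 1.0).
WEIGHTS = {
    "accuracy": 0.25,
    "usefulness": 0.25,
    "learning": 0.15,
    "proactivity": 0.15,
    "reliability": 0.10,
    "self_awareness": 0.10,
}

def daily_score(sub_scores: dict, max_score: float = 95.0) -> float:
    """Weighted average of 0-10 sub-scores, scaled to the 95-point cap."""
    weighted = sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)  # 0..10
    return round(weighted / 10.0 * max_score, 1)
```

Note what the cap buys you: a day of straight 10s tops out at 95, never 100.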
## The Human Anchor
Once a week, I provide a simple 1-10 rating of "how useful was this agent this week?" That rating gets incorporated into the scoring algorithm as a calibration anchor. If the agent drifts toward inflation (and it will), the weekly human rating pulls it back to earth.
The max score is capped at 95, not 100. A perfect score doesn't exist. Day 1 baseline is 50, roughly "functional but nothing special."
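A minimal sketch of the calibration pull. The 0.3 blend factor and the rescaling of the 1-10 rating onto the 0-95 scale are illustrative choices, not the exact formula:

```python
def calibrate(agent_score: float, human_rating: float, pull: float = 0.3) -> float:
    """Move the agent's 0-95 score part of the way toward the human's rating.

    The 1-10 weekly rating is rescaled onto the same 0-95 range first,
    then the agent's score slides `pull` of the way toward it.
    """
    human_scaled = human_rating / 10.0 * 95.0
    return round(agent_score + pull * (human_scaled - agent_score), 1)
```

With the v1 numbers, an inflated 84 pulled toward a human 5/10 (47.5 on the same scale) lands around 73 after a single anchor, already closer to reality.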
## What Broke
Night one: the scoring cron produced a score based on nothing. Why? I'd created the extraction cron after the scoring cron had already run for the night, so when scoring fired at 11:45 PM there was no signals file to read.
Classic chicken-and-egg timing issue. Fixed by ensuring extraction runs 15 minutes before scoring and verifying the signals file exists before scoring proceeds.
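The guard itself is a few lines. A minimal sketch (the `signals/` directory layout is illustrative; the point is that scoring refuses to run on a missing file):

```python
import json
import sys
from datetime import date
from pathlib import Path

# Hypothetical location where the extraction cron drops its output.
SIGNALS_DIR = Path("signals")

def load_signals(day: date) -> dict:
    """Return the day's extracted signals, or abort the scoring run."""
    path = SIGNALS_DIR / f"{day.isoformat()}.json"
    if not path.exists():
        # Better no score tonight than a score computed from nothing.
        sys.exit(f"no signals file at {path}; skipping tonight's score")
    return json.loads(path.read_text())
```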
## The Numbers
Cost of the scoring system: ~$0.30/day for two Sonnet cron runs. The extraction cron is heavier (reads full chat history), the scoring cron is lighter (reads a JSON file and computes).
Day 2 score (after the rebuild): 68.8/95, a B. Still probably generous, but in the right ballpark. The calibration period will tell.
## Why This Matters
Every AI agent platform is going to need something like this. Right now, we evaluate agents by vibes: "it feels like it's working" or "that response was good." But as agents run more autonomously (crons, background tasks, proactive actions), you need a feedback loop that doesn't depend on you reviewing every interaction.
The trap is letting the agent assess itself. Self-assessment is useful as a signal, but it can't be the only signal. You need the human in the loop, not for every interaction, but as a periodic anchor that keeps the system honest.
An agent that thinks it's an A-student when it's actually a C-student will make confident mistakes. That's worse than an agent that knows it's mediocre.
## What's Next
Five more calibration days, then public scores. I'm building a trend chart: daily scores over time, with the weekly human rating overlaid. The goal is to see whether the agent actually improves week over week, or if it just oscillates around the same mean.
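The improvement-vs-oscillation question is easy to make concrete. A minimal sketch (the 1-point weekly gain threshold is arbitrary):

```python
def weekly_means(daily_scores):
    """Mean of each full 7-day window of daily scores."""
    return [
        sum(daily_scores[i:i + 7]) / 7
        for i in range(0, len(daily_scores) - 6, 7)
    ]

def is_improving(daily_scores, min_gain=1.0):
    """True if every week's mean beats the previous week's by min_gain."""
    means = weekly_means(daily_scores)
    return all(b - a >= min_gain for a, b in zip(means, means[1:]))
```

Two weeks oscillating around 68 fails this check; two weeks of steady climb passes it. That's the distinction the chart needs to show.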
If the scoring system itself is valuable, I'll package it as a ClawHub skill so anyone running OpenClaw can drop it in.
Read next: 6 Learning Crons, $1.66/Day, Zero Lessons, on what happens when your "self-improvement" crons produce nothing.