ahmedhesham.dev

Turning Agent Sessions into Product Decisions


At Stakpak we build a DevOps AI agent. Users talk to it like a colleague — “deploy this to staging,” “why is this pod crashing,” “write me a Terraform module for our new service.” Like any AI product, we track anonymized usage patterns and session metadata through Langfuse to understand how the agent performs — what actions it takes, where it errors, how conversations flow.

After a few months we had thousands of these sessions. Each one captures the shape of a conversation — around 100K tokens of interaction patterns, tool usage sequences, and outcome signals.

Somewhere in that data was everything we needed: where the agent fails, what tasks it struggles with, which tool chains break most often. The problem was we had no way to ask.

The brute-force approach is dead on arrival

Thousands of sessions at ~100K tokens each. That's hundreds of millions of tokens. You can't feed that to an LLM. You can try batching — splitting the corpus into a hundred-odd groups of 50 sessions — but a single pass runs close to $2,000. And that's before you realize batching doesn't even work for this kind of analysis.

Here’s why. An agent analyzing batched sessions only sees what’s in front of it. It can’t follow a lead. It notices an error pattern in batch 14 but has no way to check whether the same thing happened in batch 87. You’re forced to pre-filter, which means you need to already know what you’re looking for.

That defeats the entire point.

What we needed was something that lets an analysis agent scan everything cheaply, then drill into what matters — on its own, following its own leads.

OpenViking changed the math

OpenViking is a context database from ByteDance (Volcengine), released in early 2026. The idea is simple and it works: for every piece of content you store, it automatically generates three tiers of representation.

| Level | What it is | Size | Used for |
| --- | --- | --- | --- |
| L0 | One-sentence abstract | ~100 tokens | Semantic search across everything |
| L1 | Structured overview | ~2K tokens | Understanding what happened without the full transcript |
| L2 | Original content | Full session | Hard evidence when you need it |

Think about what this means for analysis. An agent scans all the L0 abstracts — a few hundred thousand tokens. Finds a few hundred semantic matches. Reads those L1 overviews — another few hundred thousand tokens. Narrows it down to the sessions that actually matter. Pulls up a handful of full transcripts for evidence.

Under 2M tokens. A few dollars. The same analysis that would cost $2,000 with brute force.
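The arithmetic is easy to sanity-check. A quick back-of-envelope calculation, where the session count, match counts, and per-token price are all illustrative assumptions rather than measured figures:

```python
# Illustrative cost comparison. All numbers (session count, token price,
# match counts) are assumptions for the sketch, not production figures.
PRICE_PER_M = 5.0  # USD per million input tokens (assumed)

sessions, tokens_per_session = 4_000, 100_000
brute_force_tokens = sessions * tokens_per_session  # 400M tokens

tiered_tokens = (
    sessions * 100              # scan every L0 abstract (~100 tokens each)
    + 300 * 2_000               # read ~300 matching L1 overviews (~2K each)
    + 8 * tokens_per_session    # pull ~8 full L2 transcripts for evidence
)

def cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_M

print(f"brute force: {brute_force_tokens:,} tokens ≈ ${cost(brute_force_tokens):,.0f}")
print(f"tiered scan: {tiered_tokens:,} tokens ≈ ${cost(tiered_tokens):,.0f}")
```

The tiered pass stays under 2M tokens even with generous match counts, which is where the two-orders-of-magnitude cost gap comes from.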

And the agent self-directs. It follows leads. It decides what to drill into. We don’t pre-filter, we don’t guess — we just let it explore.

How I built the pipeline

The architecture is Langfuse → S3 → OpenViking. Extract sessions, stage them, load them. OpenViking handles the rest.

Langfuse Sessions → [Extract] → S3 Staging (JSON + Manifests) → [Load] → OpenViking (L0 / L1 / L2)
→ downstream: Failure Analysis · Task Profiling · Gap Detection
Orchestration: Dagu, daily at 4am UTC

I split it into two stages for a reason:

Extract pulls sessions from Langfuse by date, converts them from OTEL GenAI semantic format to OpenViking’s message schema, and writes each one to S3. This parallelizes cleanly — just API calls and S3 writes.
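The transform is the only interesting part of extract. A sketch, where the field names on both sides ("input"/"output" on the OTEL GenAI side, "role"/"content" on the target side) are assumptions, not the exact production schema:

```python
# Hypothetical sketch of the extract transform. Field names on both the
# Langfuse/OTEL side and the target message schema are assumptions.
def otel_to_messages(observations: list[dict]) -> list[dict]:
    """Flatten OTEL GenAI-style observations into a flat message list."""
    messages = []
    for obs in observations:
        for msg in obs.get("input") or []:
            messages.append({"role": msg.get("role", "user"),
                             "content": msg.get("content", "")})
        if obs.get("output") is not None:
            messages.append({"role": "assistant",
                             "content": str(obs["output"])})
    return messages
```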

Load reads from S3 and pushes into OpenViking: create session, add messages, commit. This runs sequentially because each commit triggers async processing — L0/L1 generation, vectorization, memory extraction.
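The load loop looks roughly like this, with the client calls written as stand-ins (create_session / add_messages / commit are assumed names here, not the documented OpenViking API):

```python
# Hypothetical load step. The client methods used here are stand-ins
# for whatever the real OpenViking SDK exposes, not its actual API.
def load_sessions(client, staged: list[dict]) -> list[str]:
    """Sequentially push staged sessions; each commit kicks off async
    L0/L1 generation, vectorization, and memory extraction server-side."""
    loaded = []
    for session in staged:
        sid = client.create_session(external_id=session["session_id"])
        client.add_messages(sid, session["messages"])
        client.commit(sid)
        loaded.append(session["session_id"])
    return loaded
```

Running it sequentially keeps the async processing triggered by each commit from stacking up on the server.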

The split gives me resilience. OpenViking down? Extract still finishes. Need to reload a day? Data’s in S3. Something fails mid-load? Each date has its own manifest, so I know exactly what succeeded.
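A minimal sketch of how a per-date manifest makes loads resumable. The real manifests live in S3; the file layout and JSON shape here are assumptions, but the resume logic is the same:

```python
# Minimal per-date manifest sketch: record which session IDs loaded so
# a failed run can resume without re-pushing completed sessions.
# File layout and JSON shape are assumptions for illustration.
import json
import pathlib

def mark_done(manifest: pathlib.Path, session_id: str) -> None:
    done = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    done.add(session_id)
    manifest.write_text(json.dumps(sorted(done)))

def pending(manifest: pathlib.Path, all_ids: list[str]) -> list[str]:
    done = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    return [sid for sid in all_ids if sid not in done]
```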

For scheduling I use Dagu — a single Go binary, web dashboard, no Postgres or Redis. The entire daily workflow is one YAML file:

```yaml
# etl-daily.yaml
schedule: "0 4 * * *"
steps:
  - name: extract
    command: uv run python -m flows.pipeline extract
  - name: load
    command: uv run python -m flows.pipeline load
```

Runs at 4am UTC. For a pipeline this size, Airflow or Temporal would be absurd.

Everything runs on one box

The whole stack — OpenViking, Caddy for TLS, Dagu — lives on a single EC2 t3.medium. Two vCPUs, 4GB RAM, 256GB gp3 storage. Under $100/month.

Some hard-won lessons:

VLM concurrency will murder a small instance. OpenViking's defaults assume beefy hardware. On my t3.medium, the default concurrency during commits ate all the memory and killed the process. Setting max_concurrent: 2 fixed it instantly; the pipeline has been rock solid since.

There’s no upsert. OpenViking generates a new internal UUID per session regardless of your external ID. I handle idempotency at S3 — one file per session ID per date. Simple and bulletproof.
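The key scheme is trivial. A sketch, where the prefix and layout are assumptions rather than the exact production paths:

```python
# One object per session ID per date: re-running extract for a day
# overwrites the same keys instead of creating duplicates.
# Prefix and layout are assumptions for illustration.
def staging_key(date: str, session_id: str, prefix: str = "staging") -> str:
    return f"{prefix}/{date}/{session_id}.json"
```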

I found and fixed a bug in OpenViking itself. Each account backend was spinning up its own RocksDB instance on the same storage path — permanent lock contention. One-liner fix: share a single adapter across backends.

```python
# Before: every backend fights for the same RocksDB lock
self._adapter = create_collection_adapter(config)

# After: share one adapter, no contention
self._adapter = shared_adapter or create_collection_adapter(config)
```

Patched it locally via Docker volume mount and submitted it upstream.

We take privacy seriously

Stakpak already strips secrets and credentials client-side — API keys, cloud account IDs, and common secret patterns never leave the user’s machine. On top of that, the ETL layer strips remaining PII before anything touches OpenViking. OpenViking itself runs on a private instance behind TLS, no public endpoint, encrypted S3 with least-privilege IAM. Staging expires after 30 days, backups after 90.
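A minimal sketch of the ETL-side scrub, assuming simple regex-based patterns; the real pipeline's rules are more extensive, and these three patterns are illustrative:

```python
# Minimal PII scrub sketch. These patterns are illustrative assumptions;
# the production scrubber covers far more cases.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "<token>"),
]

def scrub(text: str) -> str:
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text
```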

We built this the way we’d want our own sessions handled.

Now we can actually ask questions

This is the part that matters. Thousands of sessions, indexed and searchable. For the first time, we can interrogate agent behavior at scale instead of guessing.

Where does the agent fail? Searching for “error timeout retry failed” surfaces every session where the agent hit a wall. L1 overviews tell us whether it was a hallucinated command, a tool timeout, a permission issue, or a broken chain of reasoning. We categorize failure modes across the entire corpus without reading a single full transcript.
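That drill-down can be sketched as a loop over the three tiers. The client methods here (search_l0, get_l1, get_l2) are hypothetical stand-ins, not the documented OpenViking API:

```python
# Hypothetical three-tier drill-down. search_l0 / get_l1 / get_l2 are
# stand-ins for the real API, used only to show the shape of the loop.
def find_failure_sessions(client, query: str, keyword: str, evidence: int = 10):
    hits = client.search_l0(query)                       # cheap: ~100-token abstracts
    overviews = [client.get_l1(h["session_id"]) for h in hits]
    suspects = [o for o in overviews if keyword in o["summary"].lower()]
    return [client.get_l2(o["session_id"]) for o in suspects[:evidence]]
```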

What tasks does the agent struggle with? A semantic search for “deployment” pulls up sessions across Kubernetes, Terraform, Docker Compose, Salesforce. We can see which DevOps tasks take the most turns to complete — and which ones the agent gives up on. Multi-step Terraform plans with conditional logic? The agent chokes. Simple rollbacks? Nailed every time.

What capabilities are missing? When the agent repeatedly falls back to asking for help — or produces a plan but can’t execute it — that pattern shows up. OpenViking’s memory extraction captures these as queryable “event” records. We can see exactly where the agent’s tool chain has gaps.

How do failure modes compare across platforms? Extracted memories organize into entities, events, and outcomes. Kubernetes sessions have different failure signatures than Terraform sessions. Docker Compose rarely fails, but when it does it’s always networking. These patterns give us a structured map of agent weaknesses that no error dashboard could produce.

The data was always there. We just couldn’t afford to read it — until now.

The stack

| Component | What it does |
| --- | --- |
| Langfuse | Session tracking — source of truth |
| OpenViking | Tiered context storage + semantic search |
| S3 | ETL staging + backups |
| Dagu | Workflow scheduling — single binary, one YAML |
| EC2 t3.medium | Runs everything |

The ETL is plain Python. httpx for APIs, boto3 for S3, no framework. Retry logic with exponential backoff. The transform handles OTEL GenAI message format with safe JSON parsing.
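The retry helper is the kind of thing you write once. A sketch assuming exponential backoff with jitter; the delays and attempt counts here are illustrative, not the production values:

```python
# Retry-with-backoff sketch. Delays, jitter, and attempt counts are
# illustrative assumptions, not the pipeline's actual tuning.
import random
import time

def with_retry(fn, attempts: int = 5, base: float = 0.5):
    """Call fn(); on exception, sleep base * 2**attempt (plus jitter)
    and retry, re-raising after the final attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base * 2 ** attempt + random.random() * 0.1)
```

Wrapping each Langfuse page fetch and S3 write in something like this is what lets the extract stage shrug off transient API hiccups.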

Daily updates and analysis runs cost a few dollars each.

What’s next

The indexed sessions are the foundation. Next: automated analysis agents that run weekly — scanning for quality regressions, new failure patterns, cost anomalies, and capability gaps. Each run produces a report that feeds directly into what we build.

The insight was never in any single session. It’s in the patterns across thousands of them — the failure modes that repeat, the tool chains that break, the tasks that take too many turns. The hard part was making those patterns cheap enough to explore.

If you’re building an AI agent and wondering why it fails at certain tasks — index everything, let agents analyze agents, and follow the data.

We’re building Stakpak. If you’re working on something similar, find me on X or LinkedIn.
