The prompt is the least interesting part of a production agent stack.
What matters more is whether the system remembers the right things, whether you can measure quality, and whether a human can inspect a bad run without playing detective for an hour.
Most teams throw everything into one bucket and call it memory. That is how systems get weird.
Working context, run history, and durable knowledge each have a different job. Mixing them into one store creates both cost and drift.
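One way to keep those jobs separate is three stores with their own retention rules. A minimal sketch — the class and layer names here are hypothetical, not a prescribed design:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)      # current task context; cleared every run
    episodic: list[str] = field(default_factory=list)     # past run summaries; capped, not forever
    semantic: dict[str, str] = field(default_factory=dict)  # durable facts, keyed and overwritable

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def end_run(self, summary: str, cap: int = 50) -> None:
        # Episodic memory keeps only recent summaries; working memory resets.
        self.episodic.append(summary)
        self.episodic = self.episodic[-cap:]
        self.working.clear()

    def build_context(self) -> str:
        # Each layer contributes on its own terms instead of one shared bucket.
        facts = "; ".join(f"{k}={v}" for k, v in self.semantic.items())
        recent = " | ".join(self.episodic[-3:])
        return f"facts: {facts}\nrecent runs: {recent}\ntask: {' '.join(self.working)}"
```

The point is not these particular names; it is that each store has an explicit retention rule, so nothing accumulates by accident.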
If your agent matters to the business, it needs evals. Not abstract benchmark talk. Real fixture-based checks that compare outputs against what good looks like for your workflow.
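A fixture-based check can be very small: stored cases, each with an input and the facts a good answer must contain. A sketch, assuming a hypothetical JSON fixture format and a `run_agent` stand-in for your real agent call:

```python
import json
from pathlib import Path

def run_eval(run_agent, fixture_path: str) -> dict:
    """Run every fixture case through the agent and report which ones failed."""
    cases = json.loads(Path(fixture_path).read_text())
    failures = []
    for case in cases:
        output = run_agent(case["input"])
        # A case fails if any required fact is missing from the output.
        missing = [f for f in case["must_contain"] if f.lower() not in output.lower()]
        if missing:
            failures.append({"id": case["id"], "missing": missing})
    return {"total": len(cases), "failed": len(failures), "failures": failures}
```

Substring matching is crude, but it is deterministic, cheap to run on every change, and the failure list names exactly which cases to inspect.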
The eval loop should tell you three things fast:
Operators need transcripts, tool traces, cost visibility, and clear failure markers. Otherwise every incident becomes folklore instead of engineering.
When a system starts failing, the team should be able to inspect one run, see where it went wrong, and decide whether the fix belongs in the prompt, the workflow, the tool contract, or the memory layer.
The teams that win here usually do the same boring things well:
If you want help designing that stack, start with the AI Agent Architecture teardown. If you are still deciding what the first version should look like, read what to build first.