Game of Agents: Why Building Multi-Agent Systems Feels Like a TV Drama
Shashank Rajak
Mar 10, 2026
11 min read

For decades, software engineering was deterministic. It was binary. You wrote code, you ran tests, and an input cleanly resulted in an output. It was beautifully predictable.
Recently, while building complex, multi-agent AI systems, I found myself sharing my daily frustrations and breakthroughs with a non-coder friend.
I was explaining the architecture: "I had to downgrade the supervisor agent’s role because it kept giving bad instructions," and "I need to bring in a specialized sub-agent just to extract data and handle the JSON parsing," and "Oh great, the writer agent just completely hallucinated the season finale."
Listening to my daily rants, she laughed and said, "This thing sounds less like coding and more like a chaotic Game of Thrones episode or a daily soap opera. Har din koi naya character aa jata hai, kabhi koi mar jaata hai (Every day a new character shows up, or someone gets killed!)."
She was spot on. Her joke made me pause and reflect. As an engineer whose instinct is to write precise logic, it feels a bit bizarre to spend hours doing "character building" and tweaking system prompts. But that is exactly what this new era of AI development demands. The raw code simply gives the agents their tools and capabilities, allowing them to interact with external systems—it builds the stage. The real work happens inside the orchestration.
At least for me, the transition from traditional coding to agentic architecture has been a completely new learning experience. The very first lesson was understanding why we even need this complexity. You quickly realize that a single AI model, no matter how powerful, has its limits. These models have limited context windows and are fantastic for one-off tasks. But when you are trying to automate a truly complex workflow, a single agent just isn't sufficient. If you ask one model to plan a marketing campaign, scrape web data, analyze the results, and write a final report all at once, its context window gets bloated. It loses focus, gets confused, and hallucinates.
That is why we build multi-agent systems. Instead of one mega-brain trying to do everything, we break the problem down. It's like the battle-tested Divide and Conquer approach from classical algorithms. We spin up a Planner agent for strategy, an Executor agent with tools to scrape data, and a Writer agent to piece it all together. They form an assembly line, with each actor playing its own character.
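To make the assembly line concrete, here is a minimal, runnable sketch of that Planner, Executor, Writer decomposition. `run_agent` is a hypothetical stand-in for a real LLM call; it is stubbed out here so the pipeline shape is visible on its own:

```python
# Sketch of a Planner -> Executor -> Writer assembly line.
# `run_agent` stands in for a real LLM call with a role-specific
# system prompt; the stub just records what each agent was asked.

def run_agent(role: str, task: str, context: str = "") -> str:
    return f"[{role}] handled: {task}" + (f" (given: {context})" if context else "")

def campaign_pipeline(goal: str) -> str:
    # Each stage gets a narrow task plus only the previous stage's output.
    plan = run_agent("Planner", f"outline a strategy for: {goal}")
    data = run_agent("Executor", "scrape data for each topic", context=plan)
    report = run_agent("Writer", "draft the final report", context=data)
    return report

print(campaign_pipeline("launch a new product"))
```

The point of the sketch is the shape, not the stub: each agent sees a small, focused task instead of the entire problem.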
But solving the "one-man band" problem introduces a whole new headache: Orchestration.
The hardest part of building these multi-agent systems hasn't been writing the code—it’s been the digital directing. Getting three independent, unpredictable AI personalities to successfully handshake and complete a workflow without descending into chaos feels less like traditional engineering and more like the tricky job of managing a cohesive team. I found myself defining clear roles, carefully observing how they interact, and learning exactly when to "kill off" or "demote" an agent who kept ruining the scene.
Here is what I've learned it takes to be the "showrunner" of a complex multi-agent system, based on my recent attempts to wrangle them.
1. The Audition (Defining Roles and Powers)
In a traditional application, I was used to writing a function for a specific task. But in an agentic system, I had to define a persona, establish its core objective, and hand it a very specific set of tools. Tools are simply superpowers that let these actors interact with external systems.
When making these casting decisions for my agents, three things became immediately clear:
- Right-Sizing the Cast: Just as you wouldn't cast Rajpal Yadav in an intense, dramatic lead role meant for Amitabh Bachchan or Shah Rukh Khan, you have to match the model to the stakes. I quickly learned I couldn't give a fast, lightweight model sweeping control over the system. But the rule goes both ways: you also shouldn't waste an expensive, state-of-the-art (SOTA) superstar model on simple data extraction. I started using small, focused sub-agents for supporting tasks, saving my heavy frontier models purely for the "starring roles" (like the Planner strategizing the campaign).
- Mimicking the Real World (With Limits): I found that the best way to design these agents is to think about how real humans naturally delegate work in an office, and mimic that structure. But there is a catch: you have to constantly keep the AI's limitations in mind. Unlike humans, models can lose the plot if you overload their context windows, and they lack true, lived domain expertise (most of the work goes into making these models think and act like real domain experts). On the plus side, they can process massive amounts of scraped data in parallel. You have to design the workflow to play perfectly to those strengths and weaknesses.
- The Audition (Giving the Right Context): I made the mistake early on of throwing a thousand lines of random marketing data at an agent and giving it access to every tool I had. It failed miserably. I learned that the system prompts, allowed tools, and specific context are essentially the agent's audition. If the boundaries aren't crystal clear (e.g., "You are only the data scraper, do not write the report"), the agent will step out of line. Because LLMs are inherently chatty by design, without a hard stop, they will just keep generating tokens and hallucinating extra work.
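A tightly scoped agent definition along these lines might look like the sketch below. The field names and tool names (`fetch_url`, `write_report`) are illustrative, not tied to any particular SDK:

```python
# Hypothetical agent definition: a persona, a hard scope limit in the
# prompt, and an explicit allow-list of tools. Everything here is a
# sketch, not a real SDK's schema.

scraper_agent = {
    "name": "data_scraper",
    "system_prompt": (
        "You are only the data scraper. Fetch the pages you are given and "
        "return the raw data. DO NOT analyze the data and DO NOT write the "
        "report. Stop after emitting the data."
    ),
    "allowed_tools": ["fetch_url", "read_file"],  # everything else is off-limits
}

def check_tool_call(agent: dict, tool: str) -> bool:
    # A hard stop enforced outside the prompt: reject any tool call
    # that is not on the agent's allow-list.
    return tool in agent["allowed_tools"]

assert check_tool_call(scraper_agent, "fetch_url")
assert not check_tool_call(scraper_agent, "write_report")
```

The allow-list check matters because a prompt alone is a suggestion; the code-level gate is what actually keeps a chatty model inside its role.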
2. The Script and the Unscripted (Prompts vs. Probability)
In traditional code, the syntax was a rigid script. My machine never ad-libbed. In multi-agent systems, I found that the prompt is only the suggestion of a script. Because AI models are unpredictable, they almost always try to go off-script.
To handle this constant off-script behavior, I had to establish a few ground rules:
- Writing Bulletproof Specs: I realized that designing an agent required writing very clear, strict rules. I had to explicitly define what the agent could do, and more importantly, what it could not do (e.g., "Create a 3-step marketing campaign, DO NOT attempt to write the final blog posts"). But be careful not to be too loose or too rigid when writing these rules. Give the model enough room to think and then make a decision within your defined boundary.
- The Ad-Lib (Going Off-Script): When one of my agents hallucinated (made things up), it broke character entirely. It felt like it ruined the plot of my workflow, like the Writer agent inventing totally fake Google search trends or the Supervisor passing made-up sales figures to the Writer agent. I quickly realized that while these agents are incredibly smart and naturally unpredictable, for mission-critical tasks, you just can't rely on guesswork. At some point, you need absolute certainty. This pushed me to build strict checks, like forcing the Executor agent to only output scraped data in a specific JSON format, to bring a much-needed level of determinism to the chaos.
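That determinism check can be as simple as validating the Executor's raw output before it ever reaches the next agent. A minimal sketch, assuming a made-up schema with `topic` and `results` keys:

```python
import json

# Validate the Executor's raw text before passing it downstream: it must
# parse as JSON and carry the expected keys, otherwise the orchestrator
# rejects it (and can retry) instead of forwarding hallucinated prose.
# The schema below is illustrative.

REQUIRED_KEYS = {"topic", "results"}

def validate_executor_output(raw: str) -> dict:
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Executor output is not valid JSON: {exc}")
    if not isinstance(payload, dict):
        raise ValueError("Executor output must be a JSON object")
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"Executor output missing keys: {missing}")
    return payload

ok = validate_executor_output('{"topic": "seo trends", "results": [1, 2]}')
print(ok["topic"])
```

On a validation failure, the orchestrator can loop back with the error message instead of letting bad data ripple through the whole cast.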
3. The Table Read (Multi-Agent Cohesion)
Before a show films, the cast sits around a table to read the script and see how the characters play off each other. This is exactly what integration testing started to look like for my multi-agent systems.
Here is how I learned to manage those interactions:
- Managing the Dynamic: When I had multiple agents building the marketing strategy, I saw firsthand that they couldn't just operate in a vacuum. I needed a highly structured supervisor pattern determining the exact "speaking order." The Planner must outline the topics before the Executor scrapes, and the Executor must finish before the Writer crafts the report. Some agents can run in parallel to speed up the work. And when an agent fails, you need a way to recover with a feedback loop; these are all dynamics you have to design for.
- The Unscripted Arguments: Sometimes my agents would disagree, or worse, get stuck in a recursive loop where the Planner and Writer politely apologized to each other infinitely because the data wasn't right. I had to step in and actively debug the dynamics, refining exactly how the marketing research state was passed from one agent to the next.
- The File System Handoff: One massive breakthrough in cohesion was how agents pass data to each other. Initially, passing massive amounts of scraped data from the Executor directly into the Writer's prompt severely bloated the context window and caused hallucinations. I quickly learned that agent handoff is best done via the file system. You write the payload to a local file and just pass the file path to the next agent. A huge thanks to the Claude Agent SDK for making this architectural pattern so evident and seamless to implement.
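A minimal, SDK-independent sketch of that handoff pattern (the payload shape and function names are illustrative):

```python
import json
import tempfile
from pathlib import Path

# File-system handoff: instead of pasting bulky scraped data into the
# next agent's prompt, write it to disk and hand over only the path.

def executor_handoff(scraped: list) -> str:
    # The Executor writes its heavy payload to a file...
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(scraped, f)
        return f.name  # ...and only this short string enters the next prompt

def writer_pickup(path: str) -> list:
    # The Writer (or its file-reading tool) loads the payload on demand.
    return json.loads(Path(path).read_text())

handle = executor_handoff([{"source": "blog", "text": "..."}])
print(handle)  # a short file path, not thousands of tokens of data
```

The Writer's context now carries a file path of a few dozen tokens instead of the entire scrape, which is exactly what keeps its context window lean.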
4. The Cutting Room Floor (Careful Evaluation and Refinement)
In a great drama, some characters don't make the cut. They get written off, get less screen time, or are "killed." This was exactly what my testing and fixing process looked like.
Sometimes, you just have to make the hard calls:
- Downgrading the Role: If my main Planner agent kept making bad choices or sending the wrong instructions to the Executor, I had to "demote" it. I took away its freedom, reduced the tools it could use, and brought in a different setup to manage the workflow.
- Casting a New Sub-Agent: When the Writer agent got confused trying to read messy HTML code, I would "fire" it from that job and create a new, highly focused "JSON Parser" sub-agent just to handle that one specific data extraction problem.
- Rethinking the Team: Sometimes the drama was just too much. As the showrunner, I had to make the tough call to combine roles, remove agents entirely, or even replace a smart AI agent with a simple, boring Python script perfectly suited to process a marketing CSV file.
Masterfully directing your AI cast through these four building stages is crucial, but it's only half the battle. Just like in movie production, you can’t have a hit show if you don’t think about your audience (latency) and your production budget (token cost).
5. The Runtime (Controlling Latency)
Just as nobody wants to sit in a theater and watch a bloated, 4-hour director's cut of a film, your users don't want to wait forever for a response. Multi-agent systems are inherently slow. While the Planner thinks, the Executor scrapes, and the Writer drafts, the clock is continuously ticking. You have to aggressively optimize the workflow, run sub-agents in parallel wherever possible, and ensure the "runtime" of your agentic workflow doesn't test your audience's patience.
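Here is a small sketch of the parallelism point using Python's asyncio. `call_sub_agent` is a stub standing in for a slow, network-bound LLM call:

```python
import asyncio

# Independent sub-agent calls run concurrently instead of one after
# another. `call_sub_agent` is a stub for a real API round-trip.

async def call_sub_agent(name: str, task: str) -> str:
    await asyncio.sleep(0.1)  # pretend this is a slow network call
    return f"{name} finished: {task}"

async def run_in_parallel() -> list:
    # Three independent scrapes overlap, so wall time is roughly one
    # round-trip (~0.1s here) instead of three in sequence (~0.3s).
    return await asyncio.gather(
        call_sub_agent("scraper-1", "competitor pricing"),
        call_sub_agent("scraper-2", "search trends"),
        call_sub_agent("scraper-3", "social mentions"),
    )

results = asyncio.run(run_in_parallel())
print(results)
```

The same idea applies whatever framework you use: any two agents with no data dependency between them are candidates for a concurrent "scene."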
6. The Production Budget (Token Costs)
Making a multi-agent system is like funding a massive blockbuster movie—it is an expensive production. Every single time an agent reasons, calls a tool, or talks to another agent, you are paying for API calls. With multiple agents passing data back and forth, token counts inflate rapidly. Before you "greenlight" a complex agentic architecture, you have to be rigorous about the ROI. You must carefully evaluate: do you actually need a whole cast of AI agents for this task, or could a simple, boring Python script do the job for a fraction of the cost?
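A quick back-of-the-envelope calculation helps before greenlighting. The prices and token counts below are made-up placeholders; substitute your provider's real rates:

```python
# Rough cost estimate for one multi-agent workflow run. All numbers
# here are hypothetical placeholders, not real provider pricing.

PRICE_PER_1K_INPUT = 0.003   # USD per 1K input tokens (made up)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (made up)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Three agents, each re-reading the growing context: note how the
# input-token side inflates at every handoff.
agents = [(2_000, 500), (6_000, 1_500), (9_000, 2_000)]  # (in, out) per agent
total = sum(run_cost(i, o) for i, o in agents)
print(f"estimated cost per workflow run: ${total:.4f}")
```

Multiply that per-run figure by your expected daily volume and compare it against the boring-Python-script alternative; sometimes the script wins.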
Conclusion: Directing the Chaos
In my experience so far, when building these multi-agent workflows, the bulk of the effort isn't in writing the code anymore. It's in the careful design, testing, and fixing of the team dynamics.
Surprisingly, as I kept working, I found myself growing genuinely attached to these agents. I started learning their quirks: knowing exactly when they would fail, what their weak points were, and how they handled pressure. It felt exactly like managing a real human team. Different "people" act differently, and I realized I needed completely different approaches to bring everyone onboard to get the final act done.
I feel like I am no longer just an engineer. I have become an architect, a manager, and a dramatic showrunner. It taught me that while knowing how to code is important, the real job now is mastering the art of taming the chaos of a digital cast.
Of course, once you've designed this beautiful, dramatic cast of agents, orchestrating and actually deploying that multi-agent system into a production environment is a completely different engineering challenge. But I'll save that story for the next blog.