AI Agents in Real-World Deployments

by FormulatedBy | Business


AI agents are on the rise, quickly shifting from buzzword to reality. Agentic systems are moving fast from research labs and demos to production systems that can plan, act, and learn from user workflows. The large language models (LLMs) that power these agents are evolving from general-purpose foundation models into narrow, specialized agentic stacks, and the architecture is shifting from multi-turn chat interfaces to task-completion layers. The question is no longer how well the model chats, but how much the agent can accomplish given the right context.

In this article, we’ll cover what AI agents are in a practical setting, what agent development looks like across its building blocks and components, and how evaluation is adapting to reliably measure performance. We’ll also cover “vibe evals” with humans in the loop, common failure modes, and best practices that help deploy agents successfully in production.

What are AI Agents?

Agents can loosely be characterized by their ability to accomplish tasks that require sequential steps. They are a step up from single-turn or multi-turn chat interfaces: agents can connect to external tools, collect relevant context on the fly, reason dynamically when needed, and act in the real world to execute a task.

These agents are powered by a reasoning model that uses multi-step planning to generate its own dynamic workflow toward an objective. Agentic systems also come with various harnesses that make them work effectively across scenarios. The core feedback loop is planning, acting, verifying the outcome, and using that feedback to refine the internal plan.
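The plan-act-verify loop above can be sketched in a few lines of Python. This is purely illustrative: `plan`, `act`, and `verify` are hypothetical stand-ins for a model call, a tool execution, and an outcome check.

```python
def run_agent(objective, plan, act, verify, max_steps=5):
    """Minimal plan-act-verify loop: refine the plan with feedback each turn.

    `plan`, `act`, and `verify` are caller-supplied stand-ins for the model
    call, tool execution, and outcome check in a real agentic system.
    """
    feedback = None
    for _ in range(max_steps):
        step = plan(objective, feedback)   # model proposes the next action
        outcome = act(step)                # execute, e.g. via a tool call
        ok, feedback = verify(outcome)     # check the result, produce feedback
        if ok:
            return outcome
    return None  # step budget exhausted without a verified outcome
```

The `max_steps` cap matters in practice: without it, an agent that never satisfies `verify` will loop forever and burn tokens.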

Common use cases

The most common AI agent deployments center around a few core use cases:

  • Customer support: Agents help resolve customer issues on software platforms, then triage and escalate with additional context from user profiles and ticketing systems.
  • Software workflow automation: Agents connect with existing SaaS services and internal tools to automate business processes at scale.
  • Monitoring and operations: Agents monitor existing processes for anomalies, propose fixes, and execute parametrically controlled changes to optimize operations.
  • Automation to autonomy: Many workflows have moved from prediction in static, rigid settings to adaptive systems in changing, dynamic environments. AI agents use flexible reasoning with tool calling to orchestrate multi-step actions and continuously improve through interactions. Domains such as design, manufacturing, and scientific workflows are ripe for this disruption.

Components for development and deployment

  1. Reasoning LLMs

AI agents in production use reasoning LLMs to plan and orchestrate. Some agents even use multiple models to route different types of queries, from simple to complex and from light-context to context-heavy scenarios. Many agent SDKs provide rich features: parallel tool calling, guardrails that prevent steering into unintended scenarios, and harnesses that ensure better observability.

A recurring rule of thumb is to start with a simple agent architecture, keep the context lean rather than bloated, and add guardrails for additional safety.

  2. Tools

Agents need tools to connect to external systems, collect data for relevant context and knowledge infusion, or use existing prompt templates to orchestrate and manage sub-agents. Tool design matters: tool names and descriptions should be clear and well defined. Tool fatigue sets in not just as the number of tools grows but also when tool definitions overlap and confuse.

Note that humans may clearly distinguish between tool descriptions, yet if the agent cannot see the distinction it will choose the wrong tools. Tool definitions should be optimized for the chosen agent, which takes trial and error. The easiest way to experiment is usually a hierarchical tool structure: a few high-level tools, each grouping related tools beneath it.

Strict tool schemas that spell out input/output formats and types are important for standardization. The Model Context Protocol (MCP) has emerged as the open-source protocol for tool calling and become the industry standard across many applications. Tool clarity is therefore of the utmost importance for optimizing the agent’s reasoning capabilities and reducing cascading failures.
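Here is what a strict, MCP-style tool definition can look like: a name, a description that tells the agent when to use it, and a JSON Schema for the inputs. The `lookup_order` tool is hypothetical, and the minimal validator below only checks structure, not types; real servers validate against the full schema.

```python
# A hypothetical MCP-style tool definition: name, description, and a strict
# JSON Schema for inputs ("inputSchema" is the field name MCP uses).
LOOKUP_ORDER_TOOL = {
    "name": "lookup_order",
    "description": "Fetch a single order by its ID. Use ONLY when the user "
                   "provides an explicit order number.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID string"},
        },
        "required": ["order_id"],
        "additionalProperties": False,  # reject unexpected arguments early
    },
}

def validate_args(tool: dict, args: dict) -> bool:
    """Minimal structural check of arguments against the tool's schema.
    A real implementation would use a full JSON Schema validator."""
    schema = tool["inputSchema"]
    # Reject unknown keys when the schema forbids them.
    if schema.get("additionalProperties") is False:
        if set(args) - set(schema["properties"]):
            return False
    # All required keys must be present.
    return all(key in args for key in schema.get("required", []))
```

Rejecting malformed arguments at the boundary, rather than letting a bad tool call fail downstream, is one of the cheapest ways to cut cascading failures.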

  3. Memory

A knowledge database is an underrated feature that helps an agent learn continuously from feedback and from the improvements it made on past inputs, especially when those lessons generalize across the scenarios the agent encounters. A retrieval system over this memory reduces the error rate and lets the agent keep learning over its lifetime.
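The retrieve-before-act pattern can be illustrated with a deliberately naive memory store. Production systems would typically use embeddings and a vector store; the keyword-overlap scoring here is just a stand-in.

```python
class AgentMemory:
    """Naive keyword-overlap memory. A real deployment would use embeddings
    and a vector store; this only illustrates the retrieve-before-act pattern."""

    def __init__(self):
        self.entries = []  # list of (lesson, word_set) learned from past runs

    def remember(self, lesson: str):
        """Store a lesson learned from a past run, indexed by its words."""
        self.entries.append((lesson, set(lesson.lower().split())))

    def retrieve(self, query: str, k: int = 3):
        """Return the k lessons with the most word overlap with the query."""
        words = set(query.lower().split())
        ranked = sorted(self.entries, key=lambda e: len(words & e[1]),
                        reverse=True)
        return [lesson for lesson, _ in ranked[:k]]
```

Injecting the retrieved lessons into the agent's context before it plans is what turns one-off fixes into reusable behavior.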

  4. Tracing & Observability

Rapid iteration on an agent’s performance is crucial to reach a working deployment with strong debuggability. If developers cannot see every step the agent took (tool-call arguments, plans, error modes, latency, token usage, etc.), it is very difficult to improve and optimize it. Observability tools and agent run hooks provide end-to-end tracing and monitoring that make debugging easier and more intuitive.
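A bare-bones version of such tracing is a decorator that records arguments, result, latency, and errors for every tool call. This sketch logs to an in-process list; a real deployment would stream these records to an observability backend.

```python
import functools
import time

TRACE = []  # in production this would stream to an observability backend

def traced(fn):
    """Record arguments, result or error, and latency for each call to `fn`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        try:
            record["result"] = fn(*args, **kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)  # capture failures, then re-raise
            raise
        finally:
            record["latency_s"] = time.perf_counter() - start
            TRACE.append(record)  # record every call, success or failure
    return wrapper
```

Decorating every tool with `@traced` means the full trajectory (which tools, with what arguments, how slow, where it failed) is recoverable after any run.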

  5. Evaluation & Benchmarks

Evaluations make agentic systems reliable and robust, moving them from interesting demos to a scalable product. They help teams catch performance regressions early as architectural complexity grows over time. Evaluations also inform the tradeoffs between performance, latency, and cost: benchmarking these parameters supports data-driven decisions and helps navigate the tradeoffs more efficiently.
Multi-dimensional metrics ground performance quantitatively, while task-level success rates and trajectory quality provide a holistic view of real-world effectiveness.
Agentic systems tend toward feature bloat, so it’s important to ensure improvements translate directly into better user outcomes.
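A minimal eval harness that reports multi-dimensional metrics might look like the following. The `agent` interface (a callable returning answer, step count, and token count) and the case format are assumptions for illustration.

```python
def evaluate(agent, cases):
    """Run `agent` over fixed eval cases and report multi-dimensional metrics:
    task success rate, mean steps (trajectory length), mean tokens (cost proxy).

    Assumed interface: `agent(input) -> (answer, steps, tokens)`, and each
    case is a dict with "input" and "expected" keys.
    """
    results = [agent(case["input"]) for case in cases]
    n = len(cases)
    return {
        "success_rate": sum(ans == c["expected"]
                            for (ans, _, _), c in zip(results, cases)) / n,
        "mean_steps": sum(steps for _, steps, _ in results) / n,
        "mean_tokens": sum(tokens for _, _, tokens in results) / n,
    }
```

Tracking steps and tokens alongside success rate is what surfaces the performance/latency/cost tradeoff: an "improvement" that raises success by one point while doubling mean tokens is visible immediately.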

  6. Human-in-the-loop evals

Metrics and quantitative evaluations only go so far in measuring AI agent quality. The growing number and complexity of moving parts in the system can create blind spots that a benchmarking system cannot capture. Human experts applying structured rubrics to traces and outcomes provide an important signal that agents actually feel helpful and trustworthy. Unnecessary steps, unreasonable plans, or outcomes that seem off to end users can all be caught by what has colloquially been termed “vibe evals”. Operationalizing this could involve weekly review runs, labeling failure themes, and capturing repeated patterns to address in the next iteration of agent improvements.

Common challenges

With LLMs being inherently stochastic, agent reliability and non-determinism are common issues in production systems. Some recurring themes:

  • Defining success criteria for the entire AI agent is incredibly difficult, with competing objectives and tradeoffs to navigate.
  • Workflow complexity explodes with too many agents, tools, context, and token usage, resulting in performance drops and technical debt that is difficult to remove.
  • Incorrect tool trajectories: agents commonly fail by calling the wrong tools because of overlapping tool descriptions, passing incorrect arguments, or misunderstanding tool outputs. Cascading failures across multi-step execution are common, and hard limits on observable metrics such as token usage or number of tool calls can trip guardrails with false positives.
  • Dangerous agent actions, especially system tool calls and write operations, admin permissions, and early fallback triggering, all indicate unintended edge cases that haven’t been evaluated.
  • High token usage, from context overload, lack of context management, unoptimized tool calls that re-gather context every time, or agents stuck in loops, can increase costs sharply.
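The hard limits mentioned above can be enforced with a per-run budget guard. The caps below are illustrative defaults, not recommendations; tuning them per workload is exactly how the false-positive tripping described earlier is avoided.

```python
class BudgetExceeded(Exception):
    """Raised when a run blows past its token or tool-call cap."""

class RunBudget:
    """Hard caps on tokens and tool calls for a single agent run.

    The default limits are illustrative; tune them per workload so the
    guardrail doesn't trip on legitimate long-running tasks.
    """

    def __init__(self, max_tokens=50_000, max_tool_calls=25):
        self.max_tokens, self.max_tool_calls = max_tokens, max_tool_calls
        self.tokens = 0
        self.tool_calls = 0

    def charge(self, tokens=0, tool_calls=0):
        """Record usage after each step; abort the run once a cap is crossed."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.tokens > self.max_tokens or self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(
                f"{self.tokens} tokens / {self.tool_calls} tool calls"
            )
```

Calling `charge()` after every model or tool step turns runaway loops into a clean, observable abort instead of a silent cost explosion.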

Best practices for production

These are some of the best practices for developing agentic systems and deploying them in production. They directly tackle the common challenges above and push agent performance:

  1. Start agent development with the simplest architecture, a few tools, and a tighter scope, so you can empirically observe agent performance in its custom domain.
  2. Define the right set of successful outcomes and the happy path for agent execution, and map business goals to key performance metrics.
  3. Iterate constantly on guardrails, prompts, tool schemas and definitions, and evaluation pipelines to get standardized performance metrics. Traces need to be the debugging interface, and observability a first-class feature.
  4. Agents in safety-critical domains need humans in the loop for additional quality checks, reducing points of failure and handling edge cases.

Conclusion

Interest in agentic applications is growing across domains, with enterprise use cases making up the bulk of agent deployments. There’s still room for consumer agentic products to flourish and succeed at scale, but the rapid rise of agent infrastructure looks almost like the building of a parallel agent economy, where agents will have their own version of the internet to act on websites, complete financial transactions, do research, and much more.

The jury’s still out on the ROI of building agentic products, especially with enterprise users reporting disappointing productivity gains from agentic tools. Reliability is still an issue, and earning user trust with such systems will be difficult. Starting narrow with a tight feedback loop and then expanding capabilities looks set to emerge as the general pattern of agentic product iteration.

Author: Vivek Pandit

Post Category: Business