Advanced AI Agent Features

Take your AI agent to the next level with powerful capabilities and optimizations

Introduction to Advanced Features

Once you've built a basic AI agent, you can enhance it with advanced features that make it more powerful, useful, and responsive. In this guide, we'll explore techniques for adding sophisticated capabilities to your agent, including enhanced memory systems, complex reasoning patterns, tool chaining, and more.

This guide builds on the core principles from our First Agent Tutorial:

  • Clear task definition: Define exactly what your agent should and shouldn't do
  • Thoughtful prompt engineering: Guide the agent's behavior with well-crafted instructions
  • Robust tool integration: Give your agent the capabilities it needs to succeed
  • User-centric design: Create agents that solve real problems for users

With those fundamentals in place, we'll add the following advanced capabilities:

  • Enhanced Memory Systems: Implement sophisticated memory mechanisms to help your agent maintain context over long interactions.
  • Chain-of-Thought Reasoning: Enable your agent to break down complex problems and reason through multi-step solutions.
  • Advanced Tool Integration: Connect your agent to multiple tools and external systems with sophisticated routing.
  • Retrieval-Augmented Generation: Implement RAG to give your agent access to specific knowledge bases and documents.
  • Multi-Agent Systems: Create systems where multiple specialized agents collaborate to solve complex problems.
  • Performance Optimization: Apply techniques to improve response quality, reduce latency, and manage costs.

Enhanced Memory Systems

Basic AI agents typically have limited context windows, making it difficult to maintain information over long interactions. Enhanced memory systems solve this problem by storing, retrieving, and managing information effectively.

Types of Memory Systems

Memory Type | Description | Best Used For
Short-term (Buffer) | Maintains recent conversation history | Immediate context in conversations
Long-term (Vector DB) | Stores important information permanently | User preferences, facts, decisions
Episodic | Organizes memories into related episodes | Task sequences, conversation threads
Working | Temporarily holds information for the current task | Multi-step reasoning processes

Implementing a Vector-Based Memory System

Vector databases are ideal for semantic memory systems that can retrieve information based on meaning rather than exact matching:

# Example: Implementing a vector-based memory system with LangChain and Chroma

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain

# Initialize embedding model
embeddings = OpenAIEmbeddings()

# Create a vector store to hold memories
memory_db = Chroma(embedding_function=embeddings, collection_name="agent_memories")

# Function to add a new memory
def store_memory(text, metadata=None):
    # Split long texts into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(text)
    
    # Store in vector database with optional metadata
    memory_db.add_texts(texts=texts, metadatas=[metadata] * len(texts) if metadata else None)
    print(f"Stored new memory: {text[:50]}...")

# Function to retrieve relevant memories
def retrieve_memories(query, k=3):
    docs = memory_db.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]

# Example usage in an agent
def agent_with_memory(user_input):
    # Retrieve relevant memories based on user input
    relevant_memories = retrieve_memories(user_input)
    
    # Use memories to enhance the context for the response
    context = "\n".join(["Relevant information:", *relevant_memories])
    
    # Generate response using the enriched context
    llm = OpenAI(temperature=0)
    response = llm(f"Context: {context}\nUser question: {user_input}\nResponse:")
    
    # Store this interaction as a new memory
    store_memory(f"User: {user_input}\nAgent: {response}")
    
    return response

Pro Tip: Memory Summarization

For long interactions, implement periodic summarization of memories to prevent context overflow while preserving important information:

  • Use the LLM itself to generate summaries of conversation history
  • Store both detailed memories and their summaries
  • Implement a hierarchy of memory: recent details + summarized history (a minimal sketch follows this list)
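
Here is a minimal sketch of that hierarchy, reusing the llm callable and store_memory function from the example above. The turn counts and prompt wording are illustrative assumptions, not a prescribed recipe:

# Example (sketch): periodic summarization to keep the working buffer small

def compact_conversation(conversation_turns, keep_recent=20):
    """Summarize older turns with the LLM; keep recent turns verbatim."""
    if len(conversation_turns) <= keep_recent:
        return conversation_turns  # Nothing to compact yet

    older, recent = conversation_turns[:-keep_recent], conversation_turns[-keep_recent:]

    # Use the LLM itself to compress older history into a summary
    summary = llm(
        "Summarize this conversation, preserving user preferences, decisions, "
        "and facts that may matter later:\n\n" + "\n".join(older)
    )

    # Keep the summary in long-term storage alongside the detailed memories
    store_memory(summary, metadata={"type": "summary"})

    # The active buffer becomes: summary of the past + recent details
    return ["Summary of earlier conversation: " + summary] + recent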

Chain-of-Thought Reasoning

Chain-of-Thought (CoT) reasoning enables your agent to break down complex problems into smaller steps and think through each step sequentially. This significantly improves performance on tasks requiring multi-step reasoning.

Implementing Chain-of-Thought

  1. Explicit Prompting for Reasoning

    Modify your agent's prompt to explicitly ask for step-by-step thinking:

    # Example: Chain-of-Thought prompt
    
    def cot_prompt(question):
        return f"""
    Question: {question}
    
    To solve this problem, I need to think through this step by step:
    1. First, I'll understand what is being asked.
    2. Then, I'll break down the problem into smaller parts.
    3. For each part, I'll apply relevant knowledge or techniques.
    4. Finally, I'll combine the results to form my answer.
    
    Let me work through this systematically:
    """
    
    # Example usage
    question = "If a company's revenue grew by 15% to $690,000, what was the original revenue?"
    response = llm(cot_prompt(question))
  2. Self-Consistency Techniques

    Generate multiple reasoning paths and select the most consistent answer:

    # Example: Self-consistency with multiple reasoning paths
    
    def solve_with_self_consistency(question, num_paths=3):
        results = []
        
        for i in range(num_paths):
            # Sample a fresh reasoning path (temperature > 0 yields varied outputs)
            response = llm(cot_prompt(question), temperature=0.7)
            
            # Extract the final answer
            # This is a simplified extraction - you may need more robust parsing
            lines = response.split('\n')
            final_answer = lines[-1] if "answer" in lines[-1].lower() else response
            
            results.append(final_answer)
        
        # Find the most common answer
        from collections import Counter
        answer_counts = Counter(results)
        most_common_answer = answer_counts.most_common(1)[0][0]
        
        return most_common_answer
  3. Reflection Mechanisms

    Allow your agent to review and critique its own reasoning:

    # Example: Implementing reflection
    
    def reflective_reasoning(question):
        # First reasoning attempt
        initial_reasoning = llm(cot_prompt(question), temperature=0.5)
        
        # Prompt for reflection
        reflection_prompt = f"""
    I solved this problem as follows:
    {initial_reasoning}
    
    Now I'll reflect on my solution:
    1. Did I understand the problem correctly?
    2. Did I make any calculation errors?
    3. Is my reasoning logically sound?
    4. Are there any assumptions I made that might be incorrect?
    5. Is there a more elegant or efficient approach?
    
    My reflection:
    """
        
        # Generate reflection
        reflection = llm(reflection_prompt)
        
        # Final revised answer based on reflection
        final_answer_prompt = f"""
    Original problem: {question}
    
    My initial solution:
    {initial_reasoning}
    
    My reflection:
    {reflection}
    
    Based on my reflection, my revised and final answer is:
    """
        
        final_answer = llm(final_answer_prompt)
        return final_answer

Common Pitfall: Hallucination in Complex Reasoning

Even with Chain-of-Thought, agents can confidently present incorrect reasoning. Mitigate this by:

  • Implementing verification steps for critical calculations (see the sketch after this list)
  • Using tool calls for mathematical operations rather than relying on the LLM
  • Adding explicit fact-checking mechanisms for each reasoning step
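
For example, critical arithmetic can be checked in code rather than trusted from the model. Below is a minimal sketch; the helper names, expression format, and tolerance are illustrative assumptions:

# Example (sketch): verify a numeric claim with real arithmetic instead of the LLM

import ast
import operator

# Safe evaluator for simple arithmetic expressions (no arbitrary eval)
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expression):
    """Evaluate a basic arithmetic expression like '690000 / 1.15'."""
    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

def verify_step(claimed_value, expression, tolerance=1e-6):
    """Check a reasoning step's arithmetic claim against an exact computation."""
    actual = safe_eval(expression)
    return abs(actual - claimed_value) <= tolerance, actual

# e.g. the agent claims the original revenue was 600,000 given 690,000 after 15% growth
ok, actual = verify_step(600000, "690000 / 1.15")
print(ok, actual)  # True if the claim matches the computed value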

Advanced Tool Integration

While basic agents might use one or two tools, advanced agents can leverage diverse tools and decide which ones to use based on the task at hand. Effective tool integration requires careful design of tool selection and orchestration.

Tool Orchestration Patterns

ReAct Pattern

Interleaving reasoning and action, where the agent thinks about what tool to use, uses it, then observes the result before the next step.

Function Calling

Structured tool use where the agent explicitly calls functions with specific parameters, enabling more reliable tool interactions.

Tool Chaining

Sequential use of multiple tools, where the output from one tool becomes the input to another, enabling complex workflows (a minimal sketch follows).
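
As a minimal illustration of tool chaining, the sketch below pipes the output of a mock geocoding tool into a mock forecast tool. Both tools and their fields are placeholders, not real APIs:

# Example (sketch): tool chaining - the output of one tool feeds the next

import json

def geocode_city(city):
    # Mock geocoder: in a real system this would call a geocoding API
    return {"city": city, "lat": 48.85, "lon": 2.35}

def get_forecast(lat, lon):
    # Mock forecast: in a real system this would call a weather API
    return {"lat": lat, "lon": lon, "forecast": "Sunny, 22°C"}

def weather_for_city(city):
    """Chain two tools: geocode the city, then fetch the forecast for its coordinates."""
    location = geocode_city(city)                               # Tool 1
    forecast = get_forecast(location["lat"], location["lon"])   # Tool 2 uses Tool 1's output
    return json.dumps({"city": city, **forecast})

print(weather_for_city("Paris"))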

Implementing Function Calling with OpenAI

# Example: Function calling with OpenAI

from openai import OpenAI
import json
import requests
from datetime import datetime

client = OpenAI()

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g., San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# Implement the actual functions
def get_weather(location, unit="celsius"):
    # In a real implementation, this would call a weather API
    # This is a mock implementation
    weather_data = {
        "location": location,
        "temperature": "22" if unit == "celsius" else "72",
        "unit": unit,
        "condition": "Sunny",
        "humidity": "45%"
    }
    return json.dumps(weather_data)

def search_web(query):
    # In a real implementation, this would call a search API
    # This is a mock implementation
    return json.dumps({
        "results": [
            {"title": f"Result for {query}", "snippet": "This is a sample search result."}
        ]
    })

# Agent with tool-use capability
def agent_with_tools(user_input):
    messages = [{"role": "user", "content": user_input}]
    
    # First, let the model decide which tool to use (if any)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    
    # If the model wants to use tools
    if tool_calls:
        # Add the model's response planning to use tools
        messages.append(response_message)
        
        # Process each tool call
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            
            # Call the appropriate function
            if function_name == "get_weather":
                function_response = get_weather(**function_args)
            elif function_name == "search_web":
                function_response = search_web(**function_args)
            else:
                function_response = f"Error: Function {function_name} not found"
            
            # Append the function response to messages
            messages.append({
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response
            })
        
        # Get the final response after tool use
        second_response = client.chat.completions.create(
            model="gpt-4",
            messages=messages
        )
        
        return second_response.choices[0].message.content
    else:
        # Model chose not to use tools
        return response_message.content

Implementing Dynamic Tool Selection

For agents with many tools, implement a dynamic tool selection system:

# Example: Dynamic tool selection based on query analysis

class ToolRegistry:
    def __init__(self):
        self.tools = {}
        self.tool_descriptions = {}
    
    def register_tool(self, name, function, description):
        self.tools[name] = function
        self.tool_descriptions[name] = description
    
    def get_relevant_tools(self, query, max_tools=3):
        """Select the most relevant tools for a given query"""
        # In a real implementation, use embeddings or LLM to rank tools
        # This is a simplified implementation
        tool_scores = {}
        
        for name, description in self.tool_descriptions.items():
            # Simple keyword matching (use embeddings in a real system)
            score = sum(keyword in query.lower() for keyword in description.lower().split())
            tool_scores[name] = score
        
        # Get top N tools
        relevant_tools = sorted(tool_scores.items(), key=lambda x: x[1], reverse=True)[:max_tools]
        return [name for name, score in relevant_tools if score > 0]
    
    def execute_tool(self, name, **kwargs):
        if name in self.tools:
            return self.tools[name](**kwargs)
        else:
            return f"Error: Tool '{name}' not found"

# Usage example
registry = ToolRegistry()
registry.register_tool("get_weather", get_weather, "Get weather information for a location")
registry.register_tool("search_web", search_web, "Search the web for information")
# Register more tools...

def agent_with_dynamic_tools(user_input):
    # Select relevant tools for this query
    relevant_tool_names = registry.get_relevant_tools(user_input)
    relevant_tools = [t for t in tools if t["function"]["name"] in relevant_tool_names]
    
    # Only provide relevant tools to the model
    # Rest of the implementation follows the previous example...

Tool Design Best Practices

Follow these principles for effective tool integration:

  • Atomic functionality: Each tool should do one thing well
  • Clear interfaces: Use descriptive names and documentation
  • Robust error handling: Tools should fail gracefully with helpful error messages (see the wrapper sketch after this list)
  • Rate limiting: Implement safeguards against excessive tool use
  • Stateless when possible: Prefer stateless tools for reliability
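
The sketch below shows one way to apply the error-handling and rate-limiting points with a decorator around any tool function. The decorator name, limits, and error format are illustrative assumptions:

# Example (sketch): a decorator that adds graceful errors and a simple rate limit to any tool

import time
import functools

def safe_tool(max_calls_per_minute=30):
    """Wrap a tool so it fails gracefully and respects a simple rate limit."""
    def decorator(func):
        call_times = []

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = time.time()
            # Forget calls older than 60 seconds, then check the remaining budget
            call_times[:] = [t for t in call_times if now - t < 60]
            if len(call_times) >= max_calls_per_minute:
                return {"error": f"Rate limit reached for '{func.__name__}', try again shortly"}
            call_times.append(now)
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Fail gracefully with a message the agent can reason about
                return {"error": f"Tool '{func.__name__}' failed: {exc}"}
        return wrapper
    return decorator

@safe_tool(max_calls_per_minute=10)
def lookup_order_status(order_id):
    ...  # call the real backend here (hypothetical tool for illustration)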

Retrieval-Augmented Generation (RAG)

RAG systems enable your agent to access and leverage specific knowledge bases, documentation, or other content that may not be in the model's training data.

Building an Effective RAG System

  1. Document Processing

    Prepare your documents for retrieval:

    # Example: Processing documents for RAG
    
    from langchain.document_loaders import DirectoryLoader, TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    
    # Load documents
    loader = DirectoryLoader('./documents/', glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
    text_chunks = text_splitter.split_documents(documents)
    
    # Create embeddings and store in vector database
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_documents(text_chunks, embeddings, collection_name="document_store")
  2. Retrieval Strategy

    Implement effective retrieval logic:

    # Example: Advanced retrieval strategies
    
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor
    
    # Basic retriever
    basic_retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
    
    # Enhanced retriever with LLM-based document filtering
    llm = OpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)
    compression_retriever = ContextualCompressionRetriever(
        base_retriever=basic_retriever,
        base_compressor=compressor
    )
    
    # Function to retrieve relevant context
    def get_relevant_context(query, advanced=True):
        if advanced:
            docs = compression_retriever.get_relevant_documents(query)
        else:
            docs = basic_retriever.get_relevant_documents(query)
        
        return "\n\n".join([doc.page_content for doc in docs])
  3. Query Transformation

    Improve retrieval with query optimization:

    # Example: Query transformation for better retrieval
    
    def generate_search_queries(original_query):
        """Generate multiple search queries to improve retrieval results."""
        prompt = f"""
    Given the original search query: "{original_query}"
    Generate 3 alternative search queries that:
    1. Rephrase the question using different terminology
    2. Break down complex queries into simpler sub-queries
    3. Add relevant context or specify domain information
    
    Format each alternative query on a new line.
    """
        response = llm(prompt)
        # Parse response to extract queries
        alternative_queries = [line.strip() for line in response.split('\n') if line.strip()]
        # Include the original query
        all_queries = [original_query] + alternative_queries
        return all_queries
    
    def enhanced_retrieval(query):
        # Generate multiple search queries
        search_queries = generate_search_queries(query)
        
        # Retrieve documents for each query
        all_docs = []
        for search_query in search_queries:
            docs = basic_retriever.get_relevant_documents(search_query)
            all_docs.extend(docs)
        
        # Remove duplicates and rank by relevance
        unique_docs = {}
        for doc in all_docs:
            doc_id = hash(doc.page_content)
            if doc_id not in unique_docs:
                unique_docs[doc_id] = doc
        
        # Return the most relevant unique documents
        from langchain.retrievers import BM25Retriever
        bm25_retriever = BM25Retriever.from_documents(list(unique_docs.values()))
        final_docs = bm25_retriever.get_relevant_documents(query)
        
        return final_docs[:5]  # Return top 5 most relevant documents

Integrating RAG with Your Agent

# Example: RAG-powered agent

from langchain.agents import initialize_agent, Tool
from langchain.memory import ConversationBufferMemory

# Define tools including RAG
tools = [
    Tool(
        name="DocumentSearch",
        func=lambda q: "\n".join(doc.page_content for doc in enhanced_retrieval(q)),
        description="Useful for when you need to find specific information in documents. Input should be a search query."
    ),
    # Add other tools like web search, calculator, etc.
]

# Set up memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Initialize the agent
agent = initialize_agent(
    tools=tools,
    llm=llm,
    memory=memory,
    agent="chat-conversational-react-description",
    verbose=True
)

# Agent handler function
def rag_agent_response(user_input):
    try:
        response = agent.run(input=user_input)
        return response
    except Exception as e:
        return f"I encountered an error: {str(e)}"

RAG System Challenges

Be aware of these common challenges when implementing RAG:

  • Hallucination: Even with retrieval, models may generate incorrect facts (a grounding sketch follows this list)
  • Content contradictions: Retrieved documents may contain conflicting information
  • Context window limits: Retrieved content must fit within model's context window
  • Retrieval quality: Semantic search may miss important information
  • Data freshness: Vector stores need updating when source content changes
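
One common mitigation for the hallucination and contradiction issues is to force the model to answer only from the retrieved context and to admit when the context is insufficient. Below is a minimal sketch that reuses get_relevant_context and the llm callable from the earlier examples; the prompt wording is an assumption:

# Example (sketch): grounded answering - only use retrieved context, otherwise say so

def grounded_answer(question):
    context = get_relevant_context(question)  # From the retrieval example above

    prompt = f"""Answer the question using ONLY the context below.
Quote or reference the passage you relied on.
If the context does not contain the answer, reply exactly: "I don't know based on the available documents."

Context:
{context}

Question: {question}
Answer:"""

    return llm(prompt)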

Multi-Agent Systems

Multi-agent systems distribute complex tasks across multiple specialized agents that collaborate to solve problems. This approach can improve robustness, scalability, and specialization.

Multi-Agent Architectures

Architecture | Description | Best For
Hub and Spoke | Central coordinator delegates to specialist agents | Task decomposition, diverse specializations
Debate Framework | Multiple agents critique and refine each other's work | Complex reasoning, reducing bias
Assembly Line | Sequential processing where each agent handles one step | Well-defined processes with distinct phases
Hierarchical | Management hierarchy with increasing abstraction | Complex systems requiring multiple levels of planning

Implementing a Hub and Spoke Architecture

# Example: Hub and spoke multi-agent system

class AgentManager:
    def __init__(self, llm):
        self.llm = llm
        self.specialist_agents = {}
    
    def register_specialist(self, name, description, handler_function):
        """Register a specialist agent with the manager"""
        self.specialist_agents[name] = {
            "description": description,
            "handler": handler_function
        }
    
    def route_task(self, user_query):
        """Determine which specialist should handle this query"""
        agent_descriptions = "\n".join([
            f"- {name}: {details['description']}" 
            for name, details in self.specialist_agents.items()
        ])
        
        routing_prompt = f"""
Based on the user query, determine which specialist agent should handle this task.
Available specialists:
{agent_descriptions}

User query: "{user_query}"

Select the most appropriate specialist by name. If multiple specialists are needed, 
list them in order of priority. If no specialist is appropriate, respond with "DIRECT_RESPONSE".
"""
        
        response = self.llm(routing_prompt)
        # Extract agent name(s) - in a real system, use more robust parsing
        selected_agent = response.strip()
        
        return selected_agent
    
    def process_query(self, user_query):
        """Process user query by routing to appropriate specialist(s)"""
        selected_agent = self.route_task(user_query)
        
        if selected_agent == "DIRECT_RESPONSE":
            # No specialist needed, respond directly
            return self.generate_direct_response(user_query)
        
        # Check if the selected agent exists
        if selected_agent in self.specialist_agents:
            # Route to the specialist
            return self.specialist_agents[selected_agent]["handler"](user_query)
        else:
            # Fallback if routing returned an invalid agent
            return self.generate_direct_response(user_query)
    
    def generate_direct_response(self, user_query):
        """Generate a direct response when no specialist is needed"""
        response = self.llm(f"User query: {user_query}\nResponse:")
        return response

# Example usage
manager = AgentManager(llm)

# Register specialist agents
manager.register_specialist(
    name="ResearchAgent",
    description="Handles in-depth research queries requiring information synthesis",
    handler_function=lambda q: research_agent_handler(q)
)

manager.register_specialist(
    name="CodeAgent",
    description="Specializes in writing, explaining, and debugging code",
    handler_function=lambda q: code_agent_handler(q)
)

manager.register_specialist(
    name="DataAnalysisAgent",
    description="Processes and analyzes data, creates visualizations",
    handler_function=lambda q: data_analysis_handler(q)
)

# Process user query
response = manager.process_query("Can you help me analyze this CSV file of sales data?")

Implementing a Debate Framework

# Example: Debate framework for complex reasoning

def debate_framework(question, num_agents=3, rounds=2):
    """
    Use a debate framework where multiple agents discuss a question
    to arrive at a more accurate answer
    """
    # Initialize debate with the question
    debate_history = [f"Question: {question}\n\nThe agents will debate this question."]
    
    # Create agent personas with different perspectives
    agent_personas = [
        "You are a critical thinker who questions assumptions and looks for logical flaws.",
        "You are a creative thinker who considers unconventional approaches and possibilities.",
        "You are a detail-oriented analyst who focuses on facts and empirical evidence."
    ][:num_agents]
    
    # Conduct the debate for the specified number of rounds
    for round_num in range(1, rounds + 1):
        debate_history.append(f"\n\n--- Round {round_num} ---")
        
        # Each agent takes a turn
        for agent_idx, persona in enumerate(agent_personas):
            agent_prompt = f"""
{persona}

Below is the debate so far:
{''.join(debate_history)}

As Agent {agent_idx + 1}, provide your perspective on the question. 
If this is not the first round, respond to the points made by other agents.
Be concise but thorough in your reasoning.
"""
            # Get this agent's contribution
            agent_response = llm(agent_prompt, max_tokens=500)
            
            # Add to debate history
            debate_history.append(f"\n\nAgent {agent_idx + 1}:\n{agent_response}")
    
    # Final synthesis prompt
    synthesis_prompt = f"""
A debate was conducted on the following question:
{question}

The full debate transcript is below:
{''.join(debate_history)}

Synthesize the key insights from this debate into a comprehensive answer.
Highlight areas of agreement and disagreement, and provide a balanced conclusion.
"""
    
    final_answer = llm(synthesis_prompt, max_tokens=800)
    return final_answer

Multi-Agent Communication Strategies

Consider these approaches for agent-to-agent communication:

  • Structured messages: Standardized formats (JSON, XML) for reliable parsing (see the sketch after this list)
  • Shared memory: Common knowledge base that all agents can access
  • Broadcast/subscribe: Agents subscribe to relevant information channels
  • Mediator pattern: Central component manages communication between agents
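
For example, a structured agent-to-agent message might look like the sketch below. The field names and schema are illustrative, not a standard:

# Example (sketch): structured agent-to-agent messages

import json
import uuid
from datetime import datetime, timezone

def make_message(sender, recipient, task, payload, reply_to=None):
    """Build a structured message that any agent (or mediator) can parse."""
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sender": sender,
        "recipient": recipient,
        "task": task,            # e.g. "analyze_csv", "summarize_research"
        "payload": payload,      # task-specific data
        "reply_to": reply_to,    # id of the message being answered, if any
    }

# A coordinator delegates a task to a specialist
request = make_message(
    sender="AgentManager",
    recipient="DataAnalysisAgent",
    task="analyze_csv",
    payload={"file": "sales.csv", "metric": "monthly_revenue"},
)
print(json.dumps(request, indent=2))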

Performance Optimization

As AI agents become more complex and handle more sophisticated tasks, optimizing their performance becomes critical. This involves improving response quality, reducing latency, managing costs, and ensuring scalability.

Why Optimization Matters

Performance optimization directly impacts user experience, operational costs, and system reliability. Optimizing your AI agents helps:

  • Improve user experience with faster, more responsive interactions
  • Reduce operational costs by minimizing unnecessary API calls and token usage
  • Enhance system reliability by preventing bottlenecks and handling larger workloads
  • Support scaling as your user base and functionality grows

Core Optimization Techniques

Here are several proven strategies to improve the performance and efficiency of your AI agents:

Model Cascading

Use smaller, faster models for initial processing and larger models only when necessary. This approach balances cost and quality efficiently.

# Example: Model cascading approach

def cascading_response(user_query):
    # 1. Try with small, fast model first
    fast_model = "gpt-3.5-turbo"
    response = client.chat.completions.create(
        model=fast_model,
        messages=[{"role": "user", "content": user_query}]
    )
    fast_response = response.choices[0].message.content
    
    # 2. Check confidence or quality using a heuristic
    confidence_check = client.chat.completions.create(
        model=fast_model,
        messages=[
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": fast_response},
            {"role": "user", "content": "On a scale of 1-10, how confident are you in this response? Just provide a number."}
        ]
    )
    try:
        confidence = int(confidence_check.choices[0].message.content.strip())
    except ValueError:
        confidence = 0  # If the model doesn't return a clean number, fall back to the powerful model
    
    # 3. If confidence is high, return the fast response
    if confidence >= 7:
        return fast_response
    
    # 4. Otherwise, use the more powerful model
    powerful_model = "gpt-4"
    response = client.chat.completions.create(
        model=powerful_model,
        messages=[{"role": "user", "content": user_query}]
    )
    return response.choices[0].message.content

Caching Strategies

Implement smart caching to avoid redundant computation and API calls, significantly reducing latency and costs.

# Example: Semantic caching for LLM responses

import hashlib
import json
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

class SemanticCache:
    def __init__(self, embedding_model="all-MiniLM-L6-v2"):
        # Initialize embedding model
        self.embedding_model = SentenceTransformer(embedding_model)
        
        # Initialize cache store
        self.cache = {}
        
        # Initialize FAISS index for fast similarity search
        embedding_dim = self.embedding_model.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatL2(embedding_dim)
        
        # Keep track of query-to-index mapping
        self.query_map = []
    
    def _get_embedding(self, text):
        """Generate embedding for text"""
        return self.embedding_model.encode([text])[0]
    
    def get_response(self, query, generate_func, threshold=0.92):
        """Get response from cache or generate new one"""
        # Generate embedding for query
        query_embedding = self._get_embedding(query)
        
        # If we have cached items, search for similar queries
        if len(self.query_map) > 0:
            # Search for similar queries
            D, I = self.index.search(np.array([query_embedding]).astype('float32'), 1)
            distance = D[0][0]
            
            # Convert distance to similarity score (higher is better)
            similarity = 1 / (1 + distance)
            
            if similarity > threshold:
                # Retrieve cached response
                cache_key = self.query_map[I[0][0]]
                return self.cache[cache_key], True  # Second value indicates cache hit
        
        # Cache miss - generate new response
        response = generate_func(query)
        
        # Add to cache
        cache_key = hashlib.md5(query.encode()).hexdigest()
        self.cache[cache_key] = response
        
        # Add to index
        self.index.add(np.array([query_embedding]).astype('float32'))
        self.query_map.append(cache_key)
        
        return response, False  # Second value indicates cache miss


# Example usage (separated from class definition)
def example_usage():
    from openai import OpenAI
    
    # Initialize your LLM client
    client = OpenAI(api_key="your-api-key")
    
    # Initialize the semantic cache
    semantic_cache = SemanticCache()

    def get_cached_response(query):
        def generate_response(q):
            # This is where you'd call your LLM API
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": q}]
            )
            return response.choices[0].message.content
        
        response, cache_hit = semantic_cache.get_response(query, generate_response)
        if cache_hit:
            print("Cache hit!")
        else:
            print("Cache miss - generated new response")
        
        return response
    
    # Example queries
    print(get_cached_response("What is the capital of France?"))
    print(get_cached_response("Tell me about Paris, the capital city of France"))  # Should hit cache


if __name__ == "__main__":
    example_usage()
            

Request Batching

Group multiple operations to reduce API overhead and improve throughput, especially useful for embedding and processing multiple inputs.

# Example: Request batching for multiple operations

import asyncio
import time
from collections import deque

class BatchProcessor:
    def __init__(self, process_func, max_batch_size=10, max_wait_time=0.5):
        self.process_func = process_func
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        
        self.queue = deque()
        self.processing = False
        self.last_batch_time = time.time()
    
    async def add_request(self, item):
        # Create a future to track this item's result
        future = asyncio.Future()
        
        # Add to queue
        self.queue.append((item, future))
        
        # Start processing if needed
        if not self.processing:
            asyncio.create_task(self._process_batch())
        
        # Return future so caller can await it
        return await future
    
    async def _process_batch(self):
        self.processing = True
        
        while self.queue:
            # Determine if we should process a batch now
            current_time = time.time()
            queue_size = len(self.queue)
            time_since_last_batch = current_time - self.last_batch_time
            
            should_process = (
                queue_size >= self.max_batch_size or
                time_since_last_batch >= self.max_wait_time
            )
            
            if not should_process:
                # Wait a bit before checking again
                await asyncio.sleep(0.1)
                continue
            
            # Process a batch
            batch_size = min(self.max_batch_size, queue_size)
            batch_items = []
            batch_futures = []
            
            for _ in range(batch_size):
                item, future = self.queue.popleft()
                batch_items.append(item)
                batch_futures.append(future)
            
            # Process the batch and get results
            try:
                batch_results = await self.process_func(batch_items)
                
                # Set results for each future
                for i, future in enumerate(batch_futures):
                    future.set_result(batch_results[i])
            except Exception as e:
                # Propagate error to all futures
                for future in batch_futures:
                    future.set_exception(e)
            
            # Update tracking variables
            self.last_batch_time = time.time()
        
        self.processing = False

# Example usage with embeddings
async def process_embeddings_batch(texts):
    """Process a batch of texts into embeddings"""
    # In a real implementation, this would call an embedding API
    embeddings = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts
    )
    return [embedding.embedding for embedding in embeddings.data]

# Create batch processor
embedding_batcher = BatchProcessor(process_embeddings_batch)

# Example usage
async def get_embedding(text):
    return await embedding_batcher.add_request(text)
            

Token Optimization

Minimize token usage to reduce costs and improve response times while maintaining output quality.

# Example: Token optimization techniques

class TokenOptimizer:
    def __init__(self, max_context_tokens=4000):
        self.max_context_tokens = max_context_tokens
    
    def count_tokens(self, text):
        """Estimate token count in text - real implementation would use tokenizer"""
        # Approximate estimate (in a real implementation, use the specific model's tokenizer)
        return len(text.split()) * 1.3  # rough estimate
    
    def prioritize_context(self, context_items, query, reserved_tokens=500):
        """Prioritize context items to fit within token limit"""
        # Reserve tokens for the query and response
        available_tokens = self.max_context_tokens - reserved_tokens
        
        # Get token count for query
        query_tokens = self.count_tokens(query)
        available_tokens -= query_tokens
        
        # Prioritize and select context items
        selected_items = []
        current_tokens = 0
        
        # Sort context items by relevance (in a real implementation, use semantic relevance)
        # For this example, each item is assumed to carry a precomputed 'relevance' score
        sorted_items = sorted(context_items, key=lambda x: x.get('relevance', 0), reverse=True)
        
        for item in sorted_items:
            item_tokens = self.count_tokens(item['text'])
            
            if current_tokens + item_tokens <= available_tokens:
                selected_items.append(item)
                current_tokens += item_tokens
            else:
                # If the item is too large, we could truncate it instead of skipping
                # (implementation depends on the specific use case)
                continue
        
        return selected_items
    
    def optimize_prompt(self, system_prompt, user_messages, context_items=None):
        """Optimize a complete prompt for token efficiency"""
        # Count tokens in fixed parts
        system_tokens = self.count_tokens(system_prompt)
        
        # Calculate tokens used by user messages (excluding current query)
        message_tokens = sum(self.count_tokens(msg['content']) for msg in user_messages)
        
        # Reserve tokens for response
        reserved_tokens = 500
        
        # Calculate available tokens for context
        available_tokens = self.max_context_tokens - system_tokens - message_tokens - reserved_tokens
        
        # If we have context items, prioritize them
        optimized_context = []
        if context_items and available_tokens > 0:
            # Get current query
            current_query = user_messages[-1]['content'] if user_messages else ""
            
            # Prioritize context
            optimized_context = self.prioritize_context(
                context_items, 
                current_query,
                reserved_tokens
            )
        
        # If we're still over budget, trim conversation history
        # (keeping the most recent messages)
        if system_tokens + message_tokens + reserved_tokens > self.max_context_tokens:
            # Keep the most recent messages, dropping older ones
            preserved_messages = []
            current_tokens = system_tokens + reserved_tokens
            
            # Process messages in reverse order (newest first)
            for msg in reversed(user_messages):
                msg_tokens = self.count_tokens(msg['content'])
                
                if current_tokens + msg_tokens <= self.max_context_tokens:
                    preserved_messages.insert(0, msg)  # Add to front of list
                    current_tokens += msg_tokens
                else:
                    break  # Stop once we can't add more messages
            
            user_messages = preserved_messages
        
        return {
            "system_prompt": system_prompt,
            "user_messages": user_messages,
            "context_items": optimized_context
        }

Scaling Considerations

Consideration | Technique | Impact
Latency | Asynchronous processing, connection pooling | Reduced waiting time for users
Cost Management | Token optimization, model cascading | Lower API costs, efficient resource use
Throughput | Request batching, load balancing | Higher system capacity
Reliability | Circuit breakers, exponential backoff | Graceful handling of failures
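
As one example of the Reliability row, retries with exponential backoff and jitter can wrap any flaky API call. A minimal sketch follows; the retry counts and delays are illustrative:

# Example (sketch): exponential backoff for flaky API calls

import time
import random

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry `call` with exponential backoff and jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries

# Usage: wrap the LLM call in a zero-argument lambda
# response = with_backoff(lambda: client.chat.completions.create(
#     model="gpt-4", messages=[{"role": "user", "content": "Hello"}]))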

Performance Monitoring

Implement comprehensive monitoring to identify bottlenecks (a minimal instrumentation sketch follows this list):

  • Response times: Track latency at each stage of processing
  • Token usage: Monitor input and output tokens to control costs
  • Error rates: Track failures and categorize error types
  • Cache efficiency: Measure hit rates and optimization opportunities
  • User satisfaction: Collect feedback on agent responses
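
The sketch below wraps an OpenAI chat call (using the client from earlier examples) and logs latency, token usage, and errors. The function name and logging setup are illustrative; in production you would export these metrics to your monitoring system of choice:

# Example (sketch): lightweight instrumentation around an LLM call

import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent_metrics")

def monitored_completion(messages, model="gpt-4"):
    start = time.time()
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        latency = time.time() - start
        usage = response.usage  # prompt_tokens, completion_tokens, total_tokens
        logger.info(
            "model=%s latency=%.2fs prompt_tokens=%d completion_tokens=%d",
            model, latency, usage.prompt_tokens, usage.completion_tokens,
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error("model=%s latency=%.2fs error=%s", model, time.time() - start, type(e).__name__)
        raise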

Advanced Optimization Patterns


Progressive Enhancement

Deliver a basic response quickly, then enhance it with additional details as they become available.

  • Prioritize critical information delivery
  • Stream responses when possible (see the streaming sketch below)
  • Use background processing for enrichment
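
For example, streaming lets users see output as it is generated. The sketch below uses the OpenAI streaming interface with the client from earlier examples; the function name is illustrative:

# Example (sketch): stream the response so users see partial output immediately

def stream_response(user_query, model="gpt-3.5-turbo"):
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_query}],
        stream=True,
    )
    full_text = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # deliver tokens as they arrive
            full_text.append(delta)
    print()
    return "".join(full_text)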

Precomputation

Compute expensive operations in advance and store results for quick retrieval.

  • Generate embeddings for common queries (a sketch follows this list)
  • Preprocess documents for retrieval
  • Build knowledge graphs offline
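
A minimal sketch of precomputing embeddings for common queries offline so they can be loaded instantly at startup. The query list, file path, and helper names are illustrative assumptions; the embedding model matches the one used in the caching example:

# Example (sketch): precompute embeddings for common queries offline

import numpy as np
from sentence_transformers import SentenceTransformer

COMMON_QUERIES = [
    "What are your pricing plans?",
    "How do I reset my password?",
    "What is your refund policy?",
]

def precompute_query_embeddings(path="common_query_embeddings.npz"):
    """Run offline (e.g., as a nightly job); load the result at startup."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(COMMON_QUERIES)
    np.savez(path, queries=np.array(COMMON_QUERIES), vectors=vectors)

def load_precomputed_embeddings(path="common_query_embeddings.npz"):
    data = np.load(path)
    return dict(zip(data["queries"].tolist(), data["vectors"]))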

Client-Side Optimization

Offload suitable tasks to the client to reduce server load and improve responsiveness.

  • Implement request debouncing
  • Cache responses locally
  • Compress data transmissions

Optimization Action Plan

Follow these steps to implement a comprehensive optimization strategy for your AI agent:

  1. Establish Performance Baselines

    Measure current performance metrics to identify optimization opportunities and track improvements.

  2. Identify Bottlenecks

    Use profiling tools to find the slowest components and highest cost operations in your agent system.

  3. Implement Core Optimizations

    Apply the techniques described above, starting with those that address your biggest bottlenecks.

  4. Monitor and Iterate

    Continuously track performance metrics and adjust your optimization strategy as usage patterns evolve.

Remember that optimization is an ongoing process, not a one-time effort. As your agent evolves and usage patterns change, regularly revisit your optimization strategy.

Go Live with Your Agent

Your agent is ready. Now make it available to the world: learn how to deploy it to the cloud, as an API, or in your app.
