Advanced AI Agent Features

Take your AI agent to the next level with powerful capabilities and optimizations

Introduction to Advanced Features

Once you've built a basic AI agent, you can enhance it with advanced features that make it more powerful, useful, and responsive. In this guide, we'll explore techniques for adding sophisticated capabilities to your agent, including enhanced memory systems, complex reasoning patterns, tool chaining, and more.

This guide builds on the core principles from our First Agent Tutorial:

  • Clear task definition: Define exactly what your agent should and shouldn't do
  • Thoughtful prompt engineering: Guide the agent's behavior with well-crafted instructions
  • Robust tool integration: Give your agent the capabilities it needs to succeed
  • User-centric design: Create agents that solve real problems for users

With those fundamentals in place, we'll add the following advanced capabilities:

  • Enhanced Memory Systems: Implement sophisticated memory mechanisms to help your agent maintain context over long interactions.
  • Chain-of-Thought Reasoning: Enable your agent to break down complex problems and reason through multi-step solutions.
  • Advanced Tool Integration: Connect your agent to multiple tools and external systems with sophisticated routing.
  • Retrieval-Augmented Generation: Implement RAG to give your agent access to specific knowledge bases and documents.
  • Multi-Agent Systems: Create systems where multiple specialized agents collaborate to solve complex problems.
  • Performance Optimization: Apply techniques to improve response quality, reduce latency, and manage costs.

Enhanced Memory Systems

Basic AI agents typically have limited context windows, making it difficult to maintain information over long interactions. Enhanced memory systems solve this problem by storing, retrieving, and managing information effectively.

Types of Memory Systems

Memory Type | Description | Best Used For
Short-term (Buffer) | Maintains recent conversation history | Immediate context in conversations
Long-term (Vector DB) | Stores important information permanently | User preferences, facts, decisions
Episodic | Organizes memories into related episodes | Task sequences, conversation threads
Working | Temporarily holds information for the current task | Multi-step reasoning processes

Implementing a Vector-Based Memory System

Vector databases are ideal for semantic memory systems that can retrieve information based on meaning rather than exact matching:

# Example: Implementing a vector-based memory system with LangChain and Chroma

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain

# Initialize embedding model
embeddings = OpenAIEmbeddings()

# Create a vector store to hold memories
memory_db = Chroma(embedding_function=embeddings, collection_name="agent_memories")

# Function to add a new memory
def store_memory(text, metadata=None):
    # Split long texts into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(text)
    
    # Store in vector database with optional metadata
    memory_db.add_texts(texts=texts, metadatas=[metadata] * len(texts) if metadata else None)
    print(f"Stored new memory: {text[:50]}...")

# Function to retrieve relevant memories
def retrieve_memories(query, k=3):
    docs = memory_db.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]

# Example usage in an agent
def agent_with_memory(user_input):
    # Retrieve relevant memories based on user input
    relevant_memories = retrieve_memories(user_input)
    
    # Use memories to enhance the context for the response
    context = "\n".join(["Relevant information:", *relevant_memories])
    
    # Generate response using the enriched context
    llm = OpenAI(temperature=0)
    response = llm(f"Context: {context}\nUser question: {user_input}\nResponse:")
    
    # Store this interaction as a new memory
    store_memory(f"User: {user_input}\nAgent: {response}")
    
    return response

Pro Tip: Memory Summarization

For long interactions, implement periodic summarization of memories to prevent context overflow while preserving important information:

  • Use the LLM itself to generate summaries of conversation history
  • Store both detailed memories and their summaries
  • Implement a hierarchy of memory: recent details + summarized history (a minimal sketch follows this list)
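
Here is a minimal sketch of that hierarchy, reusing the llm callable and store_memory function from the example above. The turn counts and prompt wording are illustrative assumptions, not a prescribed recipe:

# Example (sketch): periodic summarization to keep the working buffer small

def compact_conversation(conversation_turns, keep_recent=20):
    """Summarize older turns with the LLM; keep recent turns verbatim."""
    if len(conversation_turns) <= keep_recent:
        return conversation_turns  # Nothing to compact yet

    older, recent = conversation_turns[:-keep_recent], conversation_turns[-keep_recent:]

    # Use the LLM itself to compress older history into a summary
    summary = llm(
        "Summarize this conversation, preserving user preferences, decisions, "
        "and facts that may matter later:\n\n" + "\n".join(older)
    )

    # Keep the summary in long-term storage alongside the detailed memories
    store_memory(summary, metadata={"type": "summary"})

    # The active buffer becomes: summary of the past + recent details
    return ["Summary of earlier conversation: " + summary] + recent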

Chain-of-Thought Reasoning

Chain-of-Thought (CoT) reasoning enables your agent to break down complex problems into smaller steps and think through each step sequentially. This significantly improves performance on tasks requiring multi-step reasoning.

Implementing Chain-of-Thought

  1. Explicit Prompting for Reasoning

    Modify your agent's prompt to explicitly ask for step-by-step thinking:

    # Example: Chain-of-Thought prompt
    
    def cot_prompt(question):
        return f"""
    Question: {question}
    
    To solve this problem, I need to think through this step by step:
    1. First, I'll understand what is being asked.
    2. Then, I'll break down the problem into smaller parts.
    3. For each part, I'll apply relevant knowledge or techniques.
    4. Finally, I'll combine the results to form my answer.
    
    Let me work through this systematically:
    """
    
    # Example usage
    question = "If a company's revenue grew by 15% to $690,000, what was the original revenue?"
    response = llm(cot_prompt(question))
  2. Self-Consistency Techniques

    Generate multiple reasoning paths and select the most consistent answer:

    # Example: Self-consistency with multiple reasoning paths
    
    def solve_with_self_consistency(question, num_paths=3):
        results = []
        
        for i in range(num_paths):
            # Sample a fresh reasoning path (temperature > 0 yields varied outputs)
            response = llm(cot_prompt(question), temperature=0.7)
            
            # Extract the final answer
            # This is a simplified extraction - you may need more robust parsing
            lines = response.split('\n')
            final_answer = lines[-1] if "answer" in lines[-1].lower() else response
            
            results.append(final_answer)
        
        # Find the most common answer
        from collections import Counter
        answer_counts = Counter(results)
        most_common_answer = answer_counts.most_common(1)[0][0]
        
        return most_common_answer
  3. Reflection Mechanisms

    Allow your agent to review and critique its own reasoning:

    # Example: Implementing reflection
    
    def reflective_reasoning(question):
        # First reasoning attempt
        initial_reasoning = llm(cot_prompt(question), temperature=0.5)
        
        # Prompt for reflection
        reflection_prompt = f"""
    I solved this problem as follows:
    {initial_reasoning}
    
    Now I'll reflect on my solution:
    1. Did I understand the problem correctly?
    2. Did I make any calculation errors?
    3. Is my reasoning logically sound?
    4. Are there any assumptions I made that might be incorrect?
    5. Is there a more elegant or efficient approach?
    
    My reflection:
    """
        
        # Generate reflection
        reflection = llm(reflection_prompt)
        
        # Final revised answer based on reflection
        final_answer_prompt = f"""
    Original problem: {question}
    
    My initial solution:
    {initial_reasoning}
    
    My reflection:
    {reflection}
    
    Based on my reflection, my revised and final answer is:
    """
        
        final_answer = llm(final_answer_prompt)
        return final_answer

Common Pitfall: Hallucination in Complex Reasoning

Even with Chain-of-Thought, agents can confidently present incorrect reasoning. Mitigate this by:

  • Implementing verification steps for critical calculations (see the sketch after this list)
  • Using tool calls for mathematical operations rather than relying on the LLM
  • Adding explicit fact-checking mechanisms for each reasoning step
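
For example, critical arithmetic can be checked in code rather than trusted from the model. Below is a minimal sketch; the helper names, expression format, and tolerance are illustrative assumptions:

# Example (sketch): verify a numeric claim with real arithmetic instead of the LLM

import ast
import operator

# Safe evaluator for simple arithmetic expressions (no arbitrary eval)
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expression):
    """Evaluate a basic arithmetic expression like '690000 / 1.15'."""
    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

def verify_step(claimed_value, expression, tolerance=1e-6):
    """Check a reasoning step's arithmetic claim against an exact computation."""
    actual = safe_eval(expression)
    return abs(actual - claimed_value) <= tolerance, actual

# e.g. the agent claims the original revenue was 600,000 given 690,000 after 15% growth
ok, actual = verify_step(600000, "690000 / 1.15")
print(ok, actual)  # True if the claim matches the computed value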

Advanced Tool Integration

While basic agents might use one or two tools, advanced agents can leverage diverse tools and decide which ones to use based on the task at hand. Effective tool integration requires careful design of tool selection and orchestration.

Tool Orchestration Patterns

ReAct Pattern

Interleaving reasoning and action, where the agent thinks about what tool to use, uses it, then observes the result before the next step.

Function Calling

Structured tool use where the agent explicitly calls functions with specific parameters, enabling more reliable tool interactions.

Tool Chaining

Sequential use of multiple tools, where the output from one tool becomes the input to another, enabling complex workflows (a minimal sketch follows).
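
As a minimal illustration of tool chaining, the sketch below pipes the output of a mock geocoding tool into a mock forecast tool. Both tools and their fields are placeholders, not real APIs:

# Example (sketch): tool chaining - the output of one tool feeds the next

import json

def geocode_city(city):
    # Mock geocoder: in a real system this would call a geocoding API
    return {"city": city, "lat": 48.85, "lon": 2.35}

def get_forecast(lat, lon):
    # Mock forecast: in a real system this would call a weather API
    return {"lat": lat, "lon": lon, "forecast": "Sunny, 22°C"}

def weather_for_city(city):
    """Chain two tools: geocode the city, then fetch the forecast for its coordinates."""
    location = geocode_city(city)                               # Tool 1
    forecast = get_forecast(location["lat"], location["lon"])   # Tool 2 uses Tool 1's output
    return json.dumps({"city": city, **forecast})

print(weather_for_city("Paris"))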

Implementing Function Calling with OpenAI

# Example: Function calling with OpenAI

from openai import OpenAI
import json
import requests
from datetime import datetime

client = OpenAI()

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g., San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# Implement the actual functions
def get_weather(location, unit="celsius"):
    # In a real implementation, this would call a weather API
    # This is a mock implementation
    weather_data = {
        "location": location,
        "temperature": "22" if unit == "celsius" else "72",
        "unit": unit,
        "condition": "Sunny",
        "humidity": "45%"
    }
    return json.dumps(weather_data)

def search_web(query):
    # In a real implementation, this would call a search API
    # This is a mock implementation
    return json.dumps({
        "results": [
            {"title": f"Result for {query}", "snippet": "This is a sample search result."}
        ]
    })

# Agent with tool-use capability
def agent_with_tools(user_input):
    messages = [{"role": "user", "content": user_input}]
    
    # First, let the model decide which tool to use (if any)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    
    # If the model wants to use tools
    if tool_calls:
        # Add the model's response planning to use tools
        messages.append(response_message)
        
        # Process each tool call
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            
            # Call the appropriate function
            if function_name == "get_weather":
                function_response = get_weather(**function_args)
            elif function_name == "search_web":
                function_response = search_web(**function_args)
            else:
                function_response = f"Error: Function {function_name} not found"
            
            # Append the function response to messages
            messages.append({
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response
            })
        
        # Get the final response after tool use
        second_response = client.chat.completions.create(
            model="gpt-4",
            messages=messages
        )
        
        return second_response.choices[0].message.content
    else:
        # Model chose not to use tools
        return response_message.content

Implementing Dynamic Tool Selection

For agents with many tools, implement a dynamic tool selection system:

# Example: Dynamic tool selection based on query analysis

class ToolRegistry:
    def __init__(self):
        self.tools = {}
        self.tool_descriptions = {}
    
    def register_tool(self, name, function, description):
        self.tools[name] = function
        self.tool_descriptions[name] = description
    
    def get_relevant_tools(self, query, max_tools=3):
        """Select the most relevant tools for a given query"""
        # In a real implementation, use embeddings or LLM to rank tools
        # This is a simplified implementation
        tool_scores = {}
        
        for name, description in self.tool_descriptions.items():
            # Simple keyword matching (use embeddings in a real system)
            score = sum(keyword in query.lower() for keyword in description.lower().split())
            tool_scores[name] = score
        
        # Get top N tools
        relevant_tools = sorted(tool_scores.items(), key=lambda x: x[1], reverse=True)[:max_tools]
        return [name for name, score in relevant_tools if score > 0]
    
    def execute_tool(self, name, **kwargs):
        if name in self.tools:
            return self.tools[name](**kwargs)
        else:
            return f"Error: Tool '{name}' not found"

# Usage example
registry = ToolRegistry()
registry.register_tool("get_weather", get_weather, "Get weather information for a location")
registry.register_tool("search_web", search_web, "Search the web for information")
# Register more tools...

def agent_with_dynamic_tools(user_input):
    # Select relevant tools for this query
    relevant_tool_names = registry.get_relevant_tools(user_input)
    relevant_tools = [t for t in tools if t["function"]["name"] in relevant_tool_names]
    
    # Only provide relevant tools to the model
    # Rest of the implementation follows the previous example...

Tool Design Best Practices

Follow these principles for effective tool integration:

  • Atomic functionality: Each tool should do one thing well
  • Clear interfaces: Use descriptive names and documentation
  • Robust error handling: Tools should fail gracefully with helpful error messages (see the wrapper sketch after this list)
  • Rate limiting: Implement safeguards against excessive tool use
  • Stateless when possible: Prefer stateless tools for reliability
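
The sketch below shows one way to apply the error-handling and rate-limiting points with a decorator around any tool function. The decorator name, limits, and error format are illustrative assumptions:

# Example (sketch): a decorator that adds graceful errors and a simple rate limit to any tool

import time
import functools

def safe_tool(max_calls_per_minute=30):
    """Wrap a tool so it fails gracefully and respects a simple rate limit."""
    def decorator(func):
        call_times = []

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            now = time.time()
            # Forget calls older than 60 seconds, then check the remaining budget
            call_times[:] = [t for t in call_times if now - t < 60]
            if len(call_times) >= max_calls_per_minute:
                return {"error": f"Rate limit reached for '{func.__name__}', try again shortly"}
            call_times.append(now)
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Fail gracefully with a message the agent can reason about
                return {"error": f"Tool '{func.__name__}' failed: {exc}"}
        return wrapper
    return decorator

@safe_tool(max_calls_per_minute=10)
def lookup_order_status(order_id):
    ...  # call the real backend here (hypothetical tool for illustration)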

Retrieval-Augmented Generation (RAG)

RAG systems enable your agent to access and leverage specific knowledge bases, documentation, or other content that may not be in the model's training data.

Building an Effective RAG System

  1. Document Processing

    Prepare your documents for retrieval:

    # Example: Processing documents for RAG
    
    from langchain.document_loaders import DirectoryLoader, TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    
    # Load documents
    loader = DirectoryLoader('./documents/', glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
    text_chunks = text_splitter.split_documents(documents)
    
    # Create embeddings and store in vector database
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_documents(text_chunks, embeddings, collection_name="document_store")
  2. Retrieval Strategy

    Implement effective retrieval logic:

    # Example: Advanced retrieval strategies
    
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor
    
    # Basic retriever
    basic_retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
    
    # Enhanced retriever with LLM-based document filtering
    llm = OpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)
    compression_retriever = ContextualCompressionRetriever(
        base_retriever=basic_retriever,
        base_compressor=compressor
    )
    
    # Function to retrieve relevant context
    def get_relevant_context(query, advanced=True):
        if advanced:
            docs = compression_retriever.get_relevant_documents(query)
        else:
            docs = basic_retriever.get_relevant_documents(query)
        
        return "\n\n".join([doc.page_content for doc in docs])
  3. Query Transformation

    Improve retrieval with query optimization:

    # Example: Query transformation for better retrieval
    
    def generate_search_queries(original_query):
        """Generate multiple search queries to improve retrieval results."""
        prompt = f"""
    Given the original search query: "{original_query}"
    Generate 3 alternative search queries that:
    1. Rephrase the question using different terminology
    2. Break down complex queries into simpler sub-queries
    3. Add relevant context or specify domain information
    
    Format each alternative query on a new line.
    """
        response = llm(prompt)
        # Parse response to extract queries
        alternative_queries = [line.strip() for line in response.split('\n') if line.strip()]
        # Include the original query
        all_queries = [original_query] + alternative_queries
        return all_queries
    
    def enhanced_retrieval(query):
        # Generate multiple search queries
        search_queries = generate_search_queries(query)
        
        # Retrieve documents for each query
        all_docs = []
        for search_query in search_queries:
            docs = basic_retriever.get_relevant_documents(search_query)
            all_docs.extend(docs)
        
        # Remove duplicates and rank by relevance
        unique_docs = {}
        for doc in all_docs:
            doc_id = hash(doc.page_content)
            if doc_id not in unique_docs:
                unique_docs[doc_id] = doc
        
        # Return the most relevant unique documents
        from langchain.retrievers import BM25Retriever
        bm25_retriever = BM25Retriever.from_documents(list(unique_docs.values()))
        final_docs = bm25_retriever.get_relevant_documents(query)
        
        return final_docs[:5]  # Return top 5 most relevant documents

Integrating RAG with Your Agent

# Example: RAG-powered agent

from langchain.agents import initialize_agent, Tool
from langchain.memory import ConversationBufferMemory

# Define tools including RAG
tools = [
    Tool(
        name="DocumentSearch",
        func=lambda q: "\n".join(doc.page_content for doc in enhanced_retrieval(q)),
        description="Useful for when you need to find specific information in documents. Input should be a search query."
    ),
    # Add other tools like web search, calculator, etc.
]

# Set up memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Initialize the agent
agent = initialize_agent(
    tools=tools,
    llm=llm,
    memory=memory,
    agent="chat-conversational-react-description",
    verbose=True
)

# Agent handler function
def rag_agent_response(user_input):
    try:
        response = agent.run(input=user_input)
        return response
    except Exception as e:
        return f"I encountered an error: {str(e)}"

RAG System Challenges

Be aware of these common challenges when implementing RAG:

  • Hallucination: Even with retrieval, models may generate incorrect facts (a grounding sketch follows this list)
  • Content contradictions: Retrieved documents may contain conflicting information
  • Context window limits: Retrieved content must fit within model's context window
  • Retrieval quality: Semantic search may miss important information
  • Data freshness: Vector stores need updating when source content changes
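
One common mitigation for the hallucination and contradiction issues is to force the model to answer only from the retrieved context and to admit when the context is insufficient. Below is a minimal sketch that reuses get_relevant_context and the llm callable from the earlier examples; the prompt wording is an assumption:

# Example (sketch): grounded answering - only use retrieved context, otherwise say so

def grounded_answer(question):
    context = get_relevant_context(question)  # From the retrieval example above

    prompt = f"""Answer the question using ONLY the context below.
Quote or reference the passage you relied on.
If the context does not contain the answer, reply exactly: "I don't know based on the available documents."

Context:
{context}

Question: {question}
Answer:"""

    return llm(prompt)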

Multi-Agent Systems

Multi-agent systems distribute complex tasks across multiple specialized agents that collaborate to solve problems. This approach can improve robustness, scalability, and specialization.

Multi-Agent Architectures

Architecture | Description | Best For
Hub and Spoke | Central coordinator delegates to specialist agents | Task decomposition, diverse specializations
Debate Framework | Multiple agents critique and refine each other's work | Complex reasoning, reducing bias
Assembly Line | Sequential processing where each agent handles one step | Well-defined processes with distinct phases
Hierarchical | Management hierarchy with increasing abstraction | Complex systems requiring multiple levels of planning

Implementing a Hub and Spoke Architecture

# Example: Hub and spoke multi-agent system

class AgentManager:
    def __init__(self, llm):
        self.llm = llm
        self.specialist_agents = {}
    
    def register_specialist(self, name, description, handler_function):
        """Register a specialist agent with the manager"""
        self.specialist_agents[name] = {
            "description": description,
            "handler": handler_function
        }
    
    def route_task(self, user_query):
        """Determine which specialist should handle this query"""
        agent_descriptions = "\n".join([
            f"- {name}: {details['description']}" 
            for name, details in self.specialist_agents.items()
        ])
        
        routing_prompt = f"""
Based on the user query, determine which specialist agent should handle this task.
Available specialists:
{agent_descriptions}

User query: "{user_query}"

Select the most appropriate specialist by name. If multiple specialists are needed, 
list them in order of priority. If no specialist is appropriate, respond with "DIRECT_RESPONSE".
"""
        
        response = self.llm(routing_prompt)
        # Extract agent name(s) - in a real system, use more robust parsing
        selected_agent = response.strip()
        
        return selected_agent
    
    def process_query(self, user_query):
        """Process user query by routing to appropriate specialist(s)"""
        selected_agent = self.route_task(user_query)
        
        if selected_agent == "DIRECT_RESPONSE":
            # No specialist needed, respond directly
            return self.generate_direct_response(user_query)
        
        # Check if the selected agent exists
        if selected_agent in self.specialist_agents:
            # Route to the specialist
            return self.specialist_agents[selected_agent]["handler"](user_query)
        else:
            # Fallback if routing returned an invalid agent
            return self.generate_direct_response(user_query)
    
    def generate_direct_response(self, user_query):
        """Generate a direct response when no specialist is needed"""
        response = self.llm(f"User query: {user_query}\nResponse:")
        return response

# Example usage
manager = AgentManager(llm)

# Register specialist agents
manager.register_specialist(
    name="ResearchAgent",
    description="Handles in-depth research queries requiring information synthesis",
    handler_function=lambda q: research_agent_handler(q)
)

manager.register_specialist(
    name="CodeAgent",
    description="Specializes in writing, explaining, and debugging code",
    handler_function=lambda q: code_agent_handler(q)
)

manager.register_specialist(
    name="DataAnalysisAgent",
    description="Processes and analyzes data, creates visualizations",
    handler_function=lambda q: data_analysis_handler(q)
)

# Process user query
response = manager.process_query("Can you help me analyze this CSV file of sales data?")

Implementing a Debate Framework

# Example: Debate framework for complex reasoning

def debate_framework(question, num_agents=3, rounds=2):
    """
    Use a debate framework where multiple agents discuss a question
    to arrive at a more accurate answer
    """
    # Initialize debate with the question
    debate_history = [f"Question: {question}\n\nThe agents will debate this question."]
    
    # Create agent personas with different perspectives
    agent_personas = [
        "You are a critical thinker who questions assumptions and looks for logical flaws.",
        "You are a creative thinker who considers unconventional approaches and possibilities.",
        "You are a detail-oriented analyst who focuses on facts and empirical evidence."
    ][:num_agents]
    
    # Conduct the debate for the specified number of rounds
    for round_num in range(1, rounds + 1):
        debate_history.append(f"\n\n--- Round {round_num} ---")
        
        # Each agent takes a turn
        for agent_idx, persona in enumerate(agent_personas):
            agent_prompt = f"""
{persona}

Below is the debate so far:
{''.join(debate_history)}

As Agent {agent_idx + 1}, provide your perspective on the question. 
If this is not the first round, respond to the points made by other agents.
Be concise but thorough in your reasoning.
"""
            # Get this agent's contribution
            agent_response = llm(agent_prompt, max_tokens=500)
            
            # Add to debate history
            debate_history.append(f"\n\nAgent {agent_idx + 1}:\n{agent_response}")
    
    # Final synthesis prompt
    synthesis_prompt = f"""
A debate was conducted on the following question:
{question}

The full debate transcript is below:
{''.join(debate_history)}

Synthesize the key insights from this debate into a comprehensive answer.
Highlight areas of agreement and disagreement, and provide a balanced conclusion.
"""
    
    final_answer = llm(synthesis_prompt, max_tokens=800)
    return final_answer

Multi-Agent Communication Strategies

Consider these approaches for agent-to-agent communication:

  • Structured messages: Standardized formats (JSON, XML) for reliable parsing (see the sketch after this list)
  • Shared memory: Common knowledge base that all agents can access
  • Broadcast/subscribe: Agents subscribe to relevant information channels
  • Mediator pattern: Central component manages communication between agents
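
For example, a structured agent-to-agent message might look like the sketch below. The field names and schema are illustrative, not a standard:

# Example (sketch): structured agent-to-agent messages

import json
import uuid
from datetime import datetime, timezone

def make_message(sender, recipient, task, payload, reply_to=None):
    """Build a structured message that any agent (or mediator) can parse."""
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sender": sender,
        "recipient": recipient,
        "task": task,            # e.g. "analyze_csv", "summarize_research"
        "payload": payload,      # task-specific data
        "reply_to": reply_to,    # id of the message being answered, if any
    }

# A coordinator delegates a task to a specialist
request = make_message(
    sender="AgentManager",
    recipient="DataAnalysisAgent",
    task="analyze_csv",
    payload={"file": "sales.csv", "metric": "monthly_revenue"},
)
print(json.dumps(request, indent=2))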

Performance Optimization

As AI agents become more complex and handle more sophisticated tasks, optimizing their performance becomes critical. This involves improving response quality, reducing latency, managing costs, and ensuring scalability.

Why Optimization Matters

Performance optimization directly impacts user experience, operational costs, and system reliability. Optimizing your AI agents helps:

  • Improve user experience with faster, more responsive interactions
  • Reduce operational costs by minimizing unnecessary API calls and token usage
  • Enhance system reliability by preventing bottlenecks and handling larger workloads
  • Support scaling as your user base and functionality grows

Core Optimization Techniques

Here are several proven strategies to improve the performance and efficiency of your AI agents:

Model Cascading

Use smaller, faster models for initial processing and larger models only when necessary. This approach balances cost and quality efficiently.

# Example: Model cascading approach

def cascading_response(user_query):
    # 1. Try with small, fast model first
    fast_model = "gpt-3.5-turbo"
    response = client.chat.completions.create(
        model=fast_model,
        messages=[{"role": "user", "content": user_query}]
    )
    fast_response = response.choices[0].message.content
    
    # 2. Check confidence or quality using a heuristic
    confidence_check = client.chat.completions.create(
        model=fast_model,
        messages=[
            {"role": "user", "content": user_query},
            {"role": "assistant", "content": fast_response},
            {"role": "user", "content": "On a scale of 1-10, how confident are you in this response? Just provide a number."}
        ]
    )
    try:
        confidence = int(confidence_check.choices[0].message.content.strip())
    except ValueError:
        confidence = 0  # If the model doesn't return a clean number, fall back to the powerful model
    
    # 3. If confidence is high, return the fast response
    if confidence >= 7:
        return fast_response
    
    # 4. Otherwise, use the more powerful model
    powerful_model = "gpt-4"
    response = client.chat.completions.create(
        model=powerful_model,
        messages=[{"role": "user", "content": user_query}]
    )
    return response.choices[0].message.content

Caching Strategies

Implement smart caching to avoid redundant computation and API calls, significantly reducing latency and costs.

# Example: Semantic caching for LLM responses

import hashlib
import json
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

class SemanticCache:
    def __init__(self, embedding_model="all-MiniLM-L6-v2"):
        # Initialize embedding model
        self.embedding_model = SentenceTransformer(embedding_model)
        
        # Initialize cache store
        self.cache = {}
        
        # Initialize FAISS index for fast similarity search
        embedding_dim = self.embedding_model.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatL2(embedding_dim)
        
        # Keep track of query-to-index mapping
        self.query_map = []
    
    def _get_embedding(self, text):
        """Generate embedding for text"""
        return self.embedding_model.encode([text])[0]
    
    def get_response(self, query, generate_func, threshold=0.92):
        """Get response from cache or generate new one"""
        # Generate embedding for query
        query_embedding = self._get_embedding(query)
        
        # If we have cached items, search for similar queries
        if len(self.query_map) > 0:
            # Search for similar queries
            D, I = self.index.search(np.array([query_embedding]).astype('float32'), 1)
            distance = D[0][0]
            
            # Convert distance to similarity score (higher is better)
            similarity = 1 / (1 + distance)
            
            if similarity > threshold:
                # Retrieve cached response
                cache_key = self.query_map[I[0][0]]
                return self.cache[cache_key], True  # Second value indicates cache hit
        
        # Cache miss - generate new response
        response = generate_func(query)
        
        # Add to cache
        cache_key = hashlib.md5(query.encode()).hexdigest()
        self.cache[cache_key] = response
        
        # Add to index
        self.index.add(np.array([query_embedding]).astype('float32'))
        self.query_map.append(cache_key)
        
        return response, False  # Second value indicates cache miss


# Example usage (separated from class definition)
def example_usage():
    from openai import OpenAI
    
    # Initialize your LLM client
    client = OpenAI(api_key="your-api-key")
    
    # Initialize the semantic cache
    semantic_cache = SemanticCache()

    def get_cached_response(query):
        def generate_response(q):
            # This is where you'd call your LLM API
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": q}]
            )
            return response.choices[0].message.content
        
        response, cache_hit = semantic_cache.get_response(query, generate_response)
        if cache_hit:
            print("Cache hit!")
        else:
            print("Cache miss - generated new response")
        
        return response
    
    # Example queries
    print(get_cached_response("What is the capital of France?"))
    print(get_cached_response("Tell me about Paris, the capital city of France"))  # Should hit cache


if __name__ == "__main__":
    example_usage()
            

Request Batching

Group multiple operations to reduce API overhead and improve throughput, especially useful for embedding and processing multiple inputs.

# Example: Request batching for multiple operations

import asyncio
import time
from collections import deque

class BatchProcessor:
    def __init__(self, process_func, max_batch_size=10, max_wait_time=0.5):
        self.process_func = process_func
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        
        self.queue = deque()
        self.processing = False
        self.last_batch_time = time.time()
    
    async def add_request(self, item):
        # Create a future to track this item's result
        future = asyncio.Future()
        
        # Add to queue
        self.queue.append((item, future))
        
        # Start processing if needed
        if not self.processing:
            asyncio.create_task(self._process_batch())
        
        # Return future so caller can await it
        return await future
    
    async def _process_batch(self):
        self.processing = True
        
        while self.queue:
            # Determine if we should process a batch now
            current_time = time.time()
            queue_size = len(self.queue)
            time_since_last_batch = current_time - self.last_batch_time
            
            should_process = (
                queue_size >= self.max_batch_size or
                time_since_last_batch >= self.max_wait_time
            )
            
            if not should_process:
                # Wait a bit before checking again
                await asyncio.sleep(0.1)
                continue
            
            # Process a batch
            batch_size = min(self.max_batch_size, queue_size)
            batch_items = []
            batch_futures = []
            
            for _ in range(batch_size):
                item, future = self.queue.popleft()
                batch_items.append(item)
                batch_futures.append(future)
            
            # Process the batch and get results
            try:
                batch_results = await self.process_func(batch_items)
                
                # Set results for each future
                for i, future in enumerate(batch_futures):
                    future.set_result(batch_results[i])
            except Exception as e:
                # Propagate error to all futures
                for future in batch_futures:
                    future.set_exception(e)
            
            # Update tracking variables
            self.last_batch_time = time.time()
        
        self.processing = False

# Example usage with embeddings
async def process_embeddings_batch(texts):
    """Process a batch of texts into embeddings"""
    # In a real implementation, this would call an embedding API
    embeddings = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts
    )
    return [embedding.embedding for embedding in embeddings.data]

# Create batch processor
embedding_batcher = BatchProcessor(process_embeddings_batch)

# Example usage
async def get_embedding(text):
    return await embedding_batcher.add_request(text)
            

Token Optimization

Minimize token usage to reduce costs and improve response times while maintaining output quality.

# Example: Token optimization techniques

class TokenOptimizer:
    def __init__(self, max_context_tokens=4000):
        self.max_context_tokens = max_context_tokens
    
    def count_tokens(self, text):
        """Estimate token count in text - real implementation would use tokenizer"""
        # Approximate estimate (in a real implementation, use the specific model's tokenizer)
        return len(text.split()) * 1.3  # rough estimate
    
    def prioritize_context(self, context_items, query, reserved_tokens=500):
        """Prioritize context items to fit within token limit"""
        # Reserve tokens for the query and response
        available_tokens = self.max_context_tokens - reserved_tokens
        
        # Get token count for query
        query_tokens = self.count_tokens(query)
        available_tokens -= query_tokens
        
        # Prioritize and select context items
        selected_items = []
        current_tokens = 0
        
        # Sort context items by relevance (in a real implementation, use semantic relevance)
        # For this example, each item is assumed to carry a precomputed 'relevance' score
        sorted_items = sorted(context_items, key=lambda x: x.get('relevance', 0), reverse=True)
        
        for item in sorted_items:
            item_tokens = self.count_tokens(item['text'])
            
            if current_tokens + item_tokens <= available_tokens:
                selected_items.append(item)
                current_tokens += item_tokens
            else:
                # If the item is too large, we could truncate it instead of skipping
                # (implementation depends on the specific use case)
                continue
        
        return selected_items
    
    def optimize_prompt(self, system_prompt, user_messages, context_items=None):
        """Optimize a complete prompt for token efficiency"""
        # Count tokens in fixed parts
        system_tokens = self.count_tokens(system_prompt)
        
        # Calculate tokens used by user messages (excluding current query)
        message_tokens = sum(self.count_tokens(msg['content']) for msg in user_messages)
        
        # Reserve tokens for response
        reserved_tokens = 500
        
        # Calculate available tokens for context
        available_tokens = self.max_context_tokens - system_tokens - message_tokens - reserved_tokens
        
        # If we have context items, prioritize them
        optimized_context = []
        if context_items and available_tokens > 0:
            # Get current query
            current_query = user_messages[-1]['content'] if user_messages else ""
            
            # Prioritize context
            optimized_context = self.prioritize_context(
                context_items, 
                current_query,
                reserved_tokens
            )
        
        # If we're still over budget, trim conversation history
        # (keeping the most recent messages)
        if system_tokens + message_tokens + reserved_tokens > self.max_context_tokens:
            # Keep the most recent messages, dropping older ones
            preserved_messages = []
            current_tokens = system_tokens + reserved_tokens
            
            # Process messages in reverse order (newest first)
            for msg in reversed(user_messages):
                msg_tokens = self.count_tokens(msg['content'])
                
                if current_tokens + msg_tokens <= self.max_context_tokens:
                    preserved_messages.insert(0, msg)  # Add to front of list
                    current_tokens += msg_tokens
                else:
                    break  # Stop once we can't add more messages
            
            user_messages = preserved_messages
        
        return {
            "system_prompt": system_prompt,
            "user_messages": user_messages,
            "context_items": optimized_context
        }

Scaling Considerations

Consideration | Technique | Impact
Latency | Asynchronous processing, connection pooling | Reduced waiting time for users
Cost Management | Token optimization, model cascading | Lower API costs, efficient resource use
Throughput | Request batching, load balancing | Higher system capacity
Reliability | Circuit breakers, exponential backoff | Graceful handling of failures
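
As one example of the Reliability row, retries with exponential backoff and jitter can wrap any flaky API call. A minimal sketch follows; the retry counts and delays are illustrative:

# Example (sketch): exponential backoff for flaky API calls

import time
import random

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry `call` with exponential backoff and jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries

# Usage: wrap the LLM call in a zero-argument lambda
# response = with_backoff(lambda: client.chat.completions.create(
#     model="gpt-4", messages=[{"role": "user", "content": "Hello"}]))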

Performance Monitoring

Implement comprehensive monitoring to identify bottlenecks (a minimal instrumentation sketch follows this list):

  • Response times: Track latency at each stage of processing
  • Token usage: Monitor input and output tokens to control costs
  • Error rates: Track failures and categorize error types
  • Cache efficiency: Measure hit rates and optimization opportunities
  • User satisfaction: Collect feedback on agent responses
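
The sketch below wraps an OpenAI chat call (using the client from earlier examples) and logs latency, token usage, and errors. The function name and logging setup are illustrative; in production you would export these metrics to your monitoring system of choice:

# Example (sketch): lightweight instrumentation around an LLM call

import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent_metrics")

def monitored_completion(messages, model="gpt-4"):
    start = time.time()
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        latency = time.time() - start
        usage = response.usage  # prompt_tokens, completion_tokens, total_tokens
        logger.info(
            "model=%s latency=%.2fs prompt_tokens=%d completion_tokens=%d",
            model, latency, usage.prompt_tokens, usage.completion_tokens,
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error("model=%s latency=%.2fs error=%s", model, time.time() - start, type(e).__name__)
        raise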

Advanced Optimization Patterns


Progressive Enhancement

Deliver a basic response quickly, then enhance it with additional details as they become available.

  • Prioritize critical information delivery
  • Stream responses when possible (see the streaming sketch below)
  • Use background processing for enrichment
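
For example, streaming lets users see output as it is generated. The sketch below uses the OpenAI streaming interface with the client from earlier examples; the function name is illustrative:

# Example (sketch): stream the response so users see partial output immediately

def stream_response(user_query, model="gpt-3.5-turbo"):
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_query}],
        stream=True,
    )
    full_text = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # deliver tokens as they arrive
            full_text.append(delta)
    print()
    return "".join(full_text)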

Precomputation

Compute expensive operations in advance and store results for quick retrieval.

  • Generate embeddings for common queries (a sketch follows this list)
  • Preprocess documents for retrieval
  • Build knowledge graphs offline
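
A minimal sketch of precomputing embeddings for common queries offline so they can be loaded instantly at startup. The query list, file path, and helper names are illustrative assumptions; the embedding model matches the one used in the caching example:

# Example (sketch): precompute embeddings for common queries offline

import numpy as np
from sentence_transformers import SentenceTransformer

COMMON_QUERIES = [
    "What are your pricing plans?",
    "How do I reset my password?",
    "What is your refund policy?",
]

def precompute_query_embeddings(path="common_query_embeddings.npz"):
    """Run offline (e.g., as a nightly job); load the result at startup."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(COMMON_QUERIES)
    np.savez(path, queries=np.array(COMMON_QUERIES), vectors=vectors)

def load_precomputed_embeddings(path="common_query_embeddings.npz"):
    data = np.load(path)
    return dict(zip(data["queries"].tolist(), data["vectors"]))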

Client-Side Optimization

Offload suitable tasks to the client to reduce server load and improve responsiveness.

  • Implement request debouncing
  • Cache responses locally
  • Compress data transmissions

Optimization Action Plan

Follow these steps to implement a comprehensive optimization strategy for your AI agent:

  1. Establish Performance Baselines

    Measure current performance metrics to identify optimization opportunities and track improvements.

  2. Identify Bottlenecks

    Use profiling tools to find the slowest components and highest cost operations in your agent system.

  3. Implement Core Optimizations

    Apply the techniques described above, starting with those that address your biggest bottlenecks.

  4. Monitor and Iterate

    Continuously track performance metrics and adjust your optimization strategy as usage patterns evolve.

Remember that optimization is an ongoing process, not a one-time effort. As your agent evolves and usage patterns change, regularly revisit your optimization strategy.

Go Live with Your Agent

Your agent is ready. Now make it available to the world: learn how to deploy it to the cloud, as an API, or in your app.
