Learn to build powerful AI agents for specific tasks
Take your AI agent to the next level with powerful capabilities and optimizations
Once you've built a basic AI agent, you can enhance its capabilities with advanced features that make it more powerful, useful, and responsive. In this guide, we'll explore techniques for adding sophisticated capabilities to your AI agent, including enhanced memory systems, complex reasoning patterns, tool chaining, and more.
Building on the foundations from our First Agent Tutorial, we'll now focus on:
- Advanced memory systems: Implement sophisticated memory mechanisms to help your agent maintain context over long interactions.
- Complex reasoning: Enable your agent to break down complex problems and reason through multi-step solutions.
- Tool chaining and orchestration: Connect your agent to multiple tools and external systems with sophisticated routing.
- Retrieval-Augmented Generation (RAG): Implement RAG to give your agent access to specific knowledge bases and documents.
- Multi-agent systems: Create systems where multiple specialized agents collaborate to solve complex problems.
- Performance optimization: Techniques to improve response quality, reduce latency, and manage costs.
Basic AI agents typically have limited context windows, making it difficult to maintain information over long interactions. Enhanced memory systems solve this problem by storing, retrieving, and managing information effectively.
Memory Type | Description | Best Used For |
---|---|---|
Short-term (Buffer) | Maintains recent conversation history | Immediate context in conversations |
Long-term (Vector DB) | Stores important information permanently | User preferences, facts, decisions |
Episodic | Organizes memories into related episodes | Task sequences, conversation threads |
Working | Temporarily holds information for current task | Multi-step reasoning processes |
Vector databases are ideal for semantic memory systems that can retrieve information based on meaning rather than exact matching:
# Example: Implementing a vector-based memory system with LangChain and Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
# Initialize embedding model
embeddings = OpenAIEmbeddings()
# Create a vector store to hold memories
memory_db = Chroma(embedding_function=embeddings, collection_name="agent_memories")
# Function to add a new memory
def store_memory(text, metadata=None):
# Split long texts into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(text)
# Store in vector database with optional metadata
memory_db.add_texts(texts=texts, metadatas=[metadata] * len(texts))
print(f"Stored new memory: {text[:50]}...")
# Function to retrieve relevant memories
def retrieve_memories(query, k=3):
docs = memory_db.similarity_search(query, k=k)
return [doc.page_content for doc in docs]
# Example usage in an agent
def agent_with_memory(user_input):
# Retrieve relevant memories based on user input
relevant_memories = retrieve_memories(user_input)
# Use memories to enhance the context for the response
context = "\n".join(["Relevant information:", *relevant_memories])
# Generate response using the enriched context
llm = OpenAI(temperature=0)
response = llm(f"Context: {context}\nUser question: {user_input}\nResponse:")
# Store this interaction as a new memory
store_memory(f"User: {user_input}\nAgent: {response}")
return response
For long interactions, implement periodic summarization of memories to prevent context overflow while preserving important information, as sketched below.
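A minimal sketch of this pattern, assuming the llm callable and store_memory helper from the example above (the turn threshold and summary prompt are illustrative choices):

# Example: Periodic memory summarization (illustrative sketch)
conversation_buffer = []            # raw turns accumulated since the last summary
SUMMARIZE_EVERY_N_TURNS = 10        # illustrative threshold

def record_turn(user_input, agent_response):
    """Record a conversation turn and summarize the buffer once it grows large."""
    conversation_buffer.append(f"User: {user_input}\nAgent: {agent_response}")
    if len(conversation_buffer) >= SUMMARIZE_EVERY_N_TURNS:
        transcript = "\n\n".join(conversation_buffer)
        summary = llm(
            "Summarize the key facts, decisions, and user preferences from this "
            f"conversation segment as short bullet points:\n\n{transcript}\n\nSummary:"
        )
        # Persist the compact summary instead of every raw turn
        store_memory(summary, metadata={"type": "summary"})
        conversation_buffer.clear()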
Chain-of-Thought (CoT) reasoning enables your agent to break down complex problems into smaller steps and think through each step sequentially. This significantly improves performance on tasks requiring multi-step reasoning.
Modify your agent's prompt to explicitly ask for step-by-step thinking:
# Example: Chain-of-Thought prompt
def cot_prompt(question):
return f"""
Question: {question}
To solve this problem, I need to think through this step by step:
1. First, I'll understand what is being asked.
2. Then, I'll break down the problem into smaller parts.
3. For each part, I'll apply relevant knowledge or techniques.
4. Finally, I'll combine the results to form my answer.
Let me work through this systematically:
"""
# Example usage
question = "If a company's revenue grew by 15% to $690,000, what was the original revenue?"
response = llm(cot_prompt(question))
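For reference, a correct chain of reasoning should conclude that the original revenue was $690,000 / 1.15 = $600,000.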
Generate multiple reasoning paths and select the most consistent answer:
# Example: Self-consistency with multiple reasoning paths
def solve_with_self_consistency(question, num_paths=3):
results = []
for i in range(num_paths):
# Generate a reasoning path with a different seed
response = llm(cot_prompt(question), temperature=0.7)
# Extract the final answer
# This is a simplified extraction - you may need more robust parsing
lines = response.split('\n')
final_answer = lines[-1] if "answer" in lines[-1].lower() else response
results.append(final_answer)
# Find the most common answer
from collections import Counter
answer_counts = Counter(results)
most_common_answer = answer_counts.most_common(1)[0][0]
return most_common_answer
Allow your agent to review and critique its own reasoning:
# Example: Implementing reflection
def reflective_reasoning(question):
# First reasoning attempt
initial_reasoning = llm(cot_prompt(question), temperature=0.5)
# Prompt for reflection
reflection_prompt = f"""
I solved this problem as follows:
{initial_reasoning}
Now I'll reflect on my solution:
1. Did I understand the problem correctly?
2. Did I make any calculation errors?
3. Is my reasoning logically sound?
4. Are there any assumptions I made that might be incorrect?
5. Is there a more elegant or efficient approach?
My reflection:
"""
# Generate reflection
reflection = llm(reflection_prompt)
# Final revised answer based on reflection
final_answer_prompt = f"""
Original problem: {question}
My initial solution:
{initial_reasoning}
My reflection:
{reflection}
Based on my reflection, my revised and final answer is:
"""
final_answer = llm(final_answer_prompt)
return final_answer
Even with Chain-of-Thought, agents can confidently present incorrect reasoning. Mitigate this by:
- Generating multiple reasoning paths and keeping the most consistent answer (self-consistency, shown above)
- Adding an explicit reflection or review step before committing to a final answer
- Running a separate verification pass that re-derives the answer independently (see the sketch below)
- Verifying key facts and calculations with external tools, such as a calculator or code execution, rather than trusting the model's arithmetic
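The verification pass can be as simple as a second model call that re-derives the answer from scratch. A minimal sketch, assuming the llm callable used in the earlier examples:

# Example: Lightweight verification pass (illustrative sketch)
def verify_answer(question, proposed_answer):
    """Ask the model to independently re-derive and check a proposed answer."""
    verification_prompt = f"""
    Question: {question}
    Proposed answer: {proposed_answer}

    Independently solve the question from scratch, then state whether the proposed
    answer is CORRECT or INCORRECT, and give the corrected answer if needed.
    """
    return llm(verification_prompt, temperature=0)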
While basic agents might use one or two tools, advanced agents can leverage diverse tools and decide which ones to use based on the task at hand. Effective tool integration requires careful design of tool selection and orchestration.
ReAct (reason and act): Interleaving reasoning and action, where the agent thinks about what tool to use, uses it, then observes the result before deciding on the next step.
Function calling: Structured tool use where the agent explicitly calls functions with specific parameters, enabling more reliable tool interactions.
Tool chaining: Sequential use of multiple tools where the output of one tool becomes the input to another, enabling complex workflows.
# Example: Function calling with OpenAI
from openai import OpenAI
import json
import requests
from datetime import datetime
client = OpenAI()
# Define available tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g., San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_web",
"description": "Search the web for information",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
}
},
"required": ["query"]
}
}
}
]
# Implement the actual functions
def get_weather(location, unit="celsius"):
# In a real implementation, this would call a weather API
# This is a mock implementation
weather_data = {
"location": location,
"temperature": "22" if unit == "celsius" else "72",
"unit": unit,
"condition": "Sunny",
"humidity": "45%"
}
return json.dumps(weather_data)
def search_web(query):
# In a real implementation, this would call a search API
# This is a mock implementation
return json.dumps({
"results": [
{"title": f"Result for {query}", "snippet": "This is a sample search result."}
]
})
# Agent with tool-use capability
def agent_with_tools(user_input):
messages = [{"role": "user", "content": user_input}]
# First, let the model decide which tool to use (if any)
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
tools=tools,
tool_choice="auto"
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
# If the model wants to use tools
if tool_calls:
# Add the model's response planning to use tools
messages.append(response_message)
# Process each tool call
for tool_call in tool_calls:
function_name = tool_call.function.name
function_args = json.loads(tool_call.function.arguments)
# Call the appropriate function
if function_name == "get_weather":
function_response = get_weather(**function_args)
elif function_name == "search_web":
function_response = search_web(**function_args)
else:
function_response = f"Error: Function {function_name} not found"
# Append the function response to messages
messages.append({
"tool_call_id": tool_call.id,
"role": "tool",
"name": function_name,
"content": function_response
})
# Get the final response after tool use
second_response = client.chat.completions.create(
model="gpt-4",
messages=messages
)
return second_response.choices[0].message.content
else:
# Model chose not to use tools
return response_message.content
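The same loop can be extended into a simple tool chain, where one tool's output feeds the next. A minimal sketch reusing the mock get_weather and search_web functions and the json import above; the chain itself is hypothetical:

# Example: Simple tool chaining (illustrative sketch)
def run_tool_chain(initial_input, steps):
    """Run tools in sequence; each step builds its arguments from the previous result."""
    result = initial_input
    for tool_func, build_args in steps:
        result = tool_func(**build_args(result))
    return result

# Hypothetical chain: search for a city, then fetch its weather
chain = [
    (search_web, lambda text: {"query": f"largest city in {text}"}),
    (get_weather, lambda search_json: {
        "location": json.loads(search_json)["results"][0]["title"]
    }),
]
print(run_tool_chain("California", chain))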
For agents with many tools, implement a dynamic tool selection system:
# Example: Dynamic tool selection based on query analysis
class ToolRegistry:
def __init__(self):
self.tools = {}
self.tool_descriptions = {}
def register_tool(self, name, function, description):
self.tools[name] = function
self.tool_descriptions[name] = description
def get_relevant_tools(self, query, max_tools=3):
"""Select the most relevant tools for a given query"""
# In a real implementation, use embeddings or LLM to rank tools
# This is a simplified implementation
tool_scores = {}
for name, description in self.tool_descriptions.items():
# Simple keyword matching (use embeddings in a real system)
score = sum(keyword in query.lower() for keyword in description.lower().split())
tool_scores[name] = score
# Get top N tools
relevant_tools = sorted(tool_scores.items(), key=lambda x: x[1], reverse=True)[:max_tools]
return [name for name, score in relevant_tools if score > 0]
def execute_tool(self, name, **kwargs):
if name in self.tools:
return self.tools[name](**kwargs)
else:
return f"Error: Tool '{name}' not found"
# Usage example
registry = ToolRegistry()
registry.register_tool("get_weather", get_weather, "Get weather information for a location")
registry.register_tool("search_web", search_web, "Search the web for information")
# Register more tools...
def agent_with_dynamic_tools(user_input):
# Select relevant tools for this query
relevant_tool_names = registry.get_relevant_tools(user_input)
relevant_tools = [t for t in tools if t["function"]["name"] in relevant_tool_names]
# Only provide relevant tools to the model
# Rest of the implementation follows the previous example...
Follow these principles for effective tool integration:
- Write clear, specific tool descriptions so the model can reliably choose between similar tools
- Validate arguments against each tool's schema before execution (see the sketch below)
- Return structured, machine-readable results and report tool failures back to the model instead of crashing
- Offer only the tools relevant to the current query, as in the dynamic selection example above, to keep prompts small and choices unambiguous
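As a small illustration of the validation principle, the sketch below checks arguments against the JSON schema already present in each tool definition before executing it. It uses the third-party jsonschema package, which is an assumption not made elsewhere in this guide:

# Example: Validating tool arguments before execution (illustrative sketch)
from jsonschema import validate, ValidationError

def safe_execute(tool_def, implementation, arguments):
    """Check arguments against the tool's JSON schema before calling it."""
    schema = tool_def["function"]["parameters"]
    try:
        validate(instance=arguments, schema=schema)
    except ValidationError as e:
        return f"Error: invalid arguments for {tool_def['function']['name']}: {e.message}"
    return implementation(**arguments)

# Usage with the get_weather tool defined earlier
print(safe_execute(tools[0], get_weather, {"location": "San Francisco, CA"}))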
RAG systems enable your agent to access and leverage specific knowledge bases, documentation, or other content that may not be in the model's training data.
Prepare your documents for retrieval:
# Example: Processing documents for RAG
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
# Load documents
loader = DirectoryLoader('./documents/', glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", " ", ""]
)
text_chunks = text_splitter.split_documents(documents)
# Create embeddings and store in vector database
embeddings = OpenAIEmbeddings()
vector_store = Chroma.from_documents(text_chunks, embeddings, collection_name="document_store")
Implement effective retrieval logic:
# Example: Advanced retrieval strategies
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Basic retriever
basic_retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
# Enhanced retriever with LLM-based document filtering
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_retriever=basic_retriever,
base_compressor=compressor
)
# Function to retrieve relevant context
def get_relevant_context(query, advanced=True):
if advanced:
docs = compression_retriever.get_relevant_documents(query)
else:
docs = basic_retriever.get_relevant_documents(query)
return "\n\n".join([doc.page_content for doc in docs])
Improve retrieval with query optimization:
# Example: Query transformation for better retrieval
def generate_search_queries(original_query):
"""Generate multiple search
queries to improve retrieval results"""
prompt = f"""
Given the original search query: "{original_query}"
Generate 3 alternative search queries that:
1. Rephrase the question using different terminology
2. Break down complex queries into simpler sub-queries
3. Add relevant context or specify domain information
Format each alternative query on a new line.
"""
response = llm(prompt)
# Parse response to extract queries
alternative_queries = [line.strip() for line in response.split('\n') if line.strip()]
# Include the original query
all_queries = [original_query] + alternative_queries
return all_queries
def enhanced_retrieval(query):
# Generate multiple search queries
search_queries = generate_search_queries(query)
# Retrieve documents for each query
all_docs = []
for search_query in search_queries:
docs = basic_retriever.get_relevant_documents(search_query)
all_docs.extend(docs)
# Remove duplicates and rank by relevance
unique_docs = {}
for doc in all_docs:
doc_id = hash(doc.page_content)
if doc_id not in unique_docs:
unique_docs[doc_id] = doc
# Return the most relevant unique documents
from langchain.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(list(unique_docs.values()))
final_docs = bm25_retriever.get_relevant_documents(query)
return final_docs[:5] # Return top 5 most relevant documents
# Example: RAG-powered agent
from langchain.agents import initialize_agent, Tool
from langchain.memory import ConversationBufferMemory
# Define tools including RAG
tools = [
Tool(
name="DocumentSearch",
func=lambda q: "\n".join(doc.page_content for doc in enhanced_retrieval(q)),
description="Useful for when you need to find specific information in documents. Input should be a search query."
),
# Add other tools like web search, calculator, etc.
]
# Set up memory
memory = ConversationBufferMemory(memory_key="chat_history")
# Initialize the agent
agent = initialize_agent(
tools=tools,
llm=llm,
memory=memory,
agent="chat-conversational-react-description",
verbose=True
)
# Agent handler function
def rag_agent_response(user_input):
try:
response = agent.run(input=user_input)
return response
except Exception as e:
return f"I encountered an error: {str(e)}"
Be aware of these common challenges when implementing RAG:
- Chunking trade-offs: chunks that are too large dilute relevance, while chunks that are too small lose context.
- Retrieval misses: when nothing relevant is retrieved, the agent should say so rather than hallucinate an answer from thin context.
- Stale knowledge: source documents change over time, so plan for re-embedding and re-indexing.
- Context budget: retrieved passages compete with conversation history for the model's context window, which is why compression and prioritization matter.
Multi-agent systems distribute complex tasks across multiple specialized agents that collaborate to solve problems. This approach can improve robustness, scalability, and specialization.
Architecture | Description | Best For |
---|---|---|
Hub and Spoke | Central coordinator delegates to specialist agents | Task decomposition, diverse specializations |
Debate Framework | Multiple agents critique and refine each other's work | Complex reasoning, reducing bias |
Assembly Line | Sequential processing where each agent handles one step | Well-defined processes with distinct phases |
Hierarchical | Management hierarchy with increasing abstraction | Complex systems requiring multiple levels of planning |
# Example: Hub and spoke multi-agent system
class AgentManager:
def __init__(self, llm):
self.llm = llm
self.specialist_agents = {}
def register_specialist(self, name, description, handler_function):
"""Register a specialist agent with the manager"""
self.specialist_agents[name] = {
"description": description,
"handler": handler_function
}
def route_task(self, user_query):
"""Determine which specialist should handle this query"""
agent_descriptions = "\n".join([
f"- {name}: {details['description']}"
for name, details in self.specialist_agents.items()
])
routing_prompt = f"""
Based on the user query, determine which specialist agent should handle this task.
Available specialists:
{agent_descriptions}
User query: "{user_query}"
Select the most appropriate specialist by name. If multiple specialists are needed,
list them in order of priority. If no specialist is appropriate, respond with "DIRECT_RESPONSE".
"""
response = self.llm(routing_prompt)
# Extract agent name(s) - in a real system, use more robust parsing
selected_agent = response.strip()
return selected_agent
def process_query(self, user_query):
"""Process user query by routing to appropriate specialist(s)"""
selected_agent = self.route_task(user_query)
if selected_agent == "DIRECT_RESPONSE":
# No specialist needed, respond directly
return self.generate_direct_response(user_query)
# Check if the selected agent exists
if selected_agent in self.specialist_agents:
# Route to the specialist
return self.specialist_agents[selected_agent]["handler"](user_query)
else:
# Fallback if routing returned an invalid agent
return self.generate_direct_response(user_query)
def generate_direct_response(self, user_query):
"""Generate a direct response when no specialist is needed"""
response = self.llm(f"User query: {user_query}\nResponse:")
return response
# Example usage
manager = AgentManager(llm)
# Register specialist agents
manager.register_specialist(
name="ResearchAgent",
description="Handles in-depth research queries requiring information synthesis",
handler_function=lambda q: research_agent_handler(q)
)
manager.register_specialist(
name="CodeAgent",
description="Specializes in writing, explaining, and debugging code",
handler_function=lambda q: code_agent_handler(q)
)
manager.register_specialist(
name="DataAnalysisAgent",
description="Processes and analyzes data, creates visualizations",
handler_function=lambda q: data_analysis_handler(q)
)
# Process user query
response = manager.process_query("Can you help me analyze this CSV file of sales data?")
# Example: Debate framework for complex reasoning
def debate_framework(question, num_agents=3, rounds=2):
"""
Use a debate framework where multiple agents discuss a question
to arrive at a more accurate answer
"""
# Initialize debate with the question
debate_history = [f"Question: {question}\n\nThe agents will debate this question."]
# Create agent personas with different perspectives
agent_personas = [
"You are a critical thinker who questions assumptions and looks for logical flaws.",
"You are a creative thinker who considers unconventional approaches and possibilities.",
"You are a detail-oriented analyst who focuses on facts and empirical evidence."
][:num_agents]
# Conduct the debate for the specified number of rounds
for round_num in range(1, rounds + 1):
debate_history.append(f"\n\n--- Round {round_num} ---")
# Each agent takes a turn
for agent_idx, persona in enumerate(agent_personas):
agent_prompt = f"""
{persona}
Below is the debate so far:
{''.join(debate_history)}
As Agent {agent_idx + 1}, provide your perspective on the question.
If this is not the first round, respond to the points made by other agents.
Be concise but thorough in your reasoning.
"""
# Get this agent's contribution
agent_response = llm(agent_prompt, max_tokens=500)
# Add to debate history
debate_history.append(f"\n\nAgent {agent_idx + 1}:\n{agent_response}")
# Final synthesis prompt
synthesis_prompt = f"""
A debate was conducted on the following question:
{question}
The full debate transcript is below:
{''.join(debate_history)}
Synthesize the key insights from this debate into a comprehensive answer.
Highlight areas of agreement and disagreement, and provide a balanced conclusion.
"""
final_answer = llm(synthesis_prompt, max_tokens=800)
return final_answer
Consider these approaches for agent-to-agent communication:
- Structured messages: exchange messages with explicit sender, recipient, and intent fields rather than free-form text (see the sketch below)
- Shared memory: let agents read from and write to a common store, such as the vector memory described earlier, instead of passing everything through prompts
- Coordinator-mediated routing: have a manager agent, as in the hub-and-spoke example, relay and translate messages between specialists
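A minimal sketch of a structured message format; the field names and intents are illustrative, not a standard protocol:

# Example: Structured message format for agent-to-agent communication (illustrative sketch)
from dataclasses import dataclass, field, asdict
from datetime import datetime
import json
import uuid

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    intent: str          # e.g., "request", "result", "critique"
    content: str
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())

    def to_json(self):
        return json.dumps(asdict(self))

# A coordinator asking a specialist for help
message = AgentMessage(
    sender="AgentManager",
    recipient="DataAnalysisAgent",
    intent="request",
    content="Summarize monthly revenue trends from the uploaded sales CSV.",
)
print(message.to_json())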
As AI agents become more complex and handle more sophisticated tasks, optimizing their performance becomes critical. This involves improving response quality, reducing latency, managing costs, and ensuring scalability.
Performance optimization directly impacts user experience, operational costs, and system reliability. Optimizing your AI agents helps:
- Reduce response latency for a smoother user experience
- Lower API and infrastructure costs
- Increase throughput so the system scales under load
- Improve reliability and consistency of responses
Here are several proven strategies to improve the performance and efficiency of your AI agents:
Use smaller, faster models for initial processing and larger models only when necessary. This approach balances cost and quality efficiently.
# Example: Model cascading approach
def cascading_response(user_query):
# 1. Try with small, fast model first
fast_model = "gpt-3.5-turbo"
response = client.chat.completions.create(
model=fast_model,
messages=[{"role": "user", "content": user_query}]
)
fast_response = response.choices[0].message.content
# 2. Check confidence or quality using a heuristic
confidence_check = client.chat.completions.create(
model=fast_model,
messages=[
{"role": "user", "content": user_query},
{"role": "assistant", "content": fast_response},
{"role": "user", "content": "On a scale of 1-10, how confident are you in this response? Just provide a number."}
]
)
confidence = int(confidence_check.choices[0].message.content.strip())
# 3. If confidence is high, return the fast response
if confidence >= 7:
return fast_response
# 4. Otherwise, use the more powerful model
powerful_model = "gpt-4"
response = client.chat.completions.create(
model=powerful_model,
messages=[{"role": "user", "content": user_query}]
)
return response.choices[0].message.content
Implement smart caching to avoid redundant computation and API calls, significantly reducing latency and costs.
# Example: Semantic caching for LLM responses
import hashlib
import json
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
class SemanticCache:
def __init__(self, embedding_model="all-MiniLM-L6-v2"):
# Initialize embedding model
self.embedding_model = SentenceTransformer(embedding_model)
# Initialize cache store
self.cache = {}
# Initialize FAISS index for fast similarity search
embedding_dim = self.embedding_model.get_sentence_embedding_dimension()
self.index = faiss.IndexFlatL2(embedding_dim)
# Keep track of query-to-index mapping
self.query_map = []
def _get_embedding(self, text):
"""Generate embedding for text"""
return self.embedding_model.encode([text])[0]
def get_response(self, query, generate_func, threshold=0.92):
"""Get response from cache or generate new one"""
# Generate embedding for query
query_embedding = self._get_embedding(query)
# If we have cached items, search for similar queries
if len(self.query_map) > 0:
# Search for similar queries
D, I = self.index.search(np.array([query_embedding]).astype('float32'), 1)
distance = D[0][0]
# Convert distance to similarity score (higher is better)
similarity = 1 / (1 + distance)
if similarity > threshold:
# Retrieve cached response
cache_key = self.query_map[I[0][0]]
return self.cache[cache_key], True # Second value indicates cache hit
# Cache miss - generate new response
response = generate_func(query)
# Add to cache
cache_key = hashlib.md5(query.encode()).hexdigest()
self.cache[cache_key] = response
# Add to index
self.index.add(np.array([query_embedding]).astype('float32'))
self.query_map.append(cache_key)
return response, False # Second value indicates cache miss
# Example usage (separated from class definition)
def example_usage():
from openai import OpenAI
# Initialize your LLM client
client = OpenAI(api_key="your-api-key")
# Initialize the semantic cache
semantic_cache = SemanticCache()
def get_cached_response(query):
def generate_response(q):
# This is where you'd call your LLM API
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": q}]
)
return response.choices[0].message.content
response, cache_hit = semantic_cache.get_response(query, generate_response)
if cache_hit:
print("Cache hit!")
else:
print("Cache miss - generated new response")
return response
# Example queries
print(get_cached_response("What is the capital of France?"))
print(get_cached_response("Tell me about Paris, the capital city of France")) # Should hit cache
if __name__ == "__main__":
example_usage()
Group multiple operations to reduce API overhead and improve throughput, especially useful for embedding and processing multiple inputs.
# Example: Request batching for multiple operations
import asyncio
import time
from collections import deque
class BatchProcessor:
def __init__(self, process_func, max_batch_size=10, max_wait_time=0.5):
self.process_func = process_func
self.max_batch_size = max_batch_size
self.max_wait_time = max_wait_time
self.queue = deque()
self.processing = False
self.last_batch_time = time.time()
async def add_request(self, item):
# Create a future to track this item's result
future = asyncio.Future()
# Add to queue
self.queue.append((item, future))
# Start processing if needed
if not self.processing:
asyncio.create_task(self._process_batch())
# Return future so caller can await it
return await future
async def _process_batch(self):
self.processing = True
while self.queue:
# Determine if we should process a batch now
current_time = time.time()
queue_size = len(self.queue)
time_since_last_batch = current_time - self.last_batch_time
should_process = (
queue_size >= self.max_batch_size or
time_since_last_batch >= self.max_wait_time
)
if not should_process:
# Wait a bit before checking again
await asyncio.sleep(0.1)
continue
# Process a batch
batch_size = min(self.max_batch_size, queue_size)
batch_items = []
batch_futures = []
for _ in range(batch_size):
item, future = self.queue.popleft()
batch_items.append(item)
batch_futures.append(future)
# Process the batch and get results
try:
batch_results = await self.process_func(batch_items)
# Set results for each future
for i, future in enumerate(batch_futures):
future.set_result(batch_results[i])
except Exception as e:
# Propagate error to all futures
for future in batch_futures:
future.set_exception(e)
# Update tracking variables
self.last_batch_time = time.time()
self.processing = False
# Example usage with embeddings
async def process_embeddings_batch(texts):
"""Process a batch of texts into embeddings"""
# In a real implementation, this would call an embedding API
embeddings = client.embeddings.create(
model="text-embedding-ada-002",
input=texts
)
return [embedding.embedding for embedding in embeddings.data]
# Create batch processor
embedding_batcher = BatchProcessor(process_embeddings_batch)
# Example usage
async def get_embedding(text):
return await embedding_batcher.add_request(text)
Minimize token usage to reduce costs and improve response times while maintaining output quality.
# Example: Token optimization techniques
class TokenOptimizer:
def __init__(self, max_context_tokens=4000):
self.max_context_tokens = max_context_tokens
def count_tokens(self, text):
"""Estimate token count in text - real implementation would use tokenizer"""
# Approximate estimate (in a real implementation, use the specific model's tokenizer)
return len(text.split()) * 1.3 # rough estimate
def prioritize_context(self, context_items, query, reserved_tokens=500):
"""Prioritize context items to fit within token limit"""
# Reserve tokens for the query and response
available_tokens = self.max_context_tokens - reserved_tokens
# Get token count for query
query_tokens = self.count_tokens(query)
available_tokens -= query_tokens
# Prioritize and select context items
selected_items = []
current_tokens = 0
# Sort context items by relevance (in a real implementation, use semantic relevance)
# For this example, we'll use recency as a proxy for relevance
sorted_items = sorted(context_items, key=lambda x: x.get('relevance', 0), reverse=True)
for item in sorted_items:
item_tokens = self.count_tokens(item['text'])
if current_tokens + item_tokens <= available_tokens:
selected_items.append(item)
current_tokens += item_tokens
else:
# If the item is too large, we could truncate it instead of skipping
# (implementation depends on the specific use case)
continue
return selected_items
def optimize_prompt(self, system_prompt, user_messages, context_items=None):
"""Optimize a complete prompt for token efficiency"""
# Count tokens in fixed parts
system_tokens = self.count_tokens(system_prompt)
# Calculate tokens used by user messages (excluding current query)
message_tokens = sum(self.count_tokens(msg['content']) for msg in user_messages)
# Reserve tokens for response
reserved_tokens = 500
# Calculate available tokens for context
available_tokens = self.max_context_tokens - system_tokens - message_tokens - reserved_tokens
# If we have context items, prioritize them
optimized_context = []
if context_items and available_tokens > 0:
# Get current query
current_query = user_messages[-1]['content'] if user_messages else ""
# Prioritize context
optimized_context = self.prioritize_context(
context_items,
current_query,
reserved_tokens
)
# If we're still over budget, trim conversation history
# (keeping the most recent messages)
if system_tokens + message_tokens + reserved_tokens > self.max_context_tokens:
# Keep the most recent messages, dropping older ones
preserved_messages = []
current_tokens = system_tokens + reserved_tokens
# Process messages in reverse order (newest first)
for msg in reversed(user_messages):
msg_tokens = self.count_tokens(msg['content'])
if current_tokens + msg_tokens <= self.max_context_tokens:
preserved_messages.insert(0, msg) # Add to front of list
current_tokens += msg_tokens
else:
break # Stop once we can't add more messages
user_messages = preserved_messages
return {
"system_prompt": system_prompt,
"user_messages": user_messages,
"context_items": optimized_context
}
Consideration | Technique | Impact |
---|---|---|
Latency | Asynchronous processing, connection pooling | Reduced waiting time for users |
Cost Management | Token optimization, model cascading | Lower API costs, efficient resource use |
Throughput | Request batching, load balancing | Higher system capacity |
Reliability | Circuit breakers, exponential backoff | Graceful handling of failures |
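As an example of the reliability techniques in the table, here is a minimal retry-with-exponential-backoff wrapper; the helper name and default parameters are illustrative:

# Example: Retry with exponential backoff (illustrative sketch)
import random
import time

def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry a flaky call with exponentially increasing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay *= random.uniform(0.5, 1.5)  # jitter avoids synchronized retries
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage with the OpenAI client from earlier examples:
# response = call_with_backoff(lambda: client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": "Hello"}]
# ))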
Implement comprehensive monitoring to identify bottlenecks:
- End-to-end request latency and per-model-call latency
- Token usage and cost per request
- Cache hit rates
- Tool and API error rates and retry counts
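A lightweight way to start is to instrument your agent's entry points. The decorator below records latency and success status per call; the metric fields and in-memory list are illustrative, and in production you would forward these records to your monitoring system:

# Example: Lightweight instrumentation for agent calls (illustrative sketch)
import functools
import time

metrics = []

def instrument(operation_name):
    """Decorator that records wall-clock latency and success/failure per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                metrics.append({
                    "operation": operation_name,
                    "latency_s": time.perf_counter() - start,
                    "status": status,
                })
        return wrapper
    return decorator

# Wrap any agent entry point, e.g. the memory-enabled agent from earlier:
# monitored_agent = instrument("agent_response")(agent_with_memory)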
Progressive response delivery: Deliver a basic response quickly, then enhance it with additional details as they become available.
Precomputation: Compute expensive operations in advance and store results for quick retrieval.
Client-side processing: Offload suitable tasks to the client to reduce server load and improve responsiveness.
Follow these steps to implement a comprehensive optimization strategy for your AI agent:
1. Measure current performance metrics to identify optimization opportunities and track improvements.
2. Use profiling tools to find the slowest components and highest-cost operations in your agent system.
3. Apply the techniques described above, starting with those that address your biggest bottlenecks.
4. Continuously track performance metrics and adjust your optimization strategy as usage patterns evolve.
Remember that optimization is an ongoing process, not a one-time effort. As your agent evolves and usage patterns change, regularly revisit your optimization strategy.