This section covers containerization, API serving, experiment tracking, and automation. You'll also build a complete RAG agent project.
14. FastAPI for Model Serving (Intermediate)
Create src/main.py to build a production-ready API.
import os
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
# Load environment variables first
load_dotenv()
# Initialize rate limiter
limiter = Limiter(key_func=get_remote_address)
app = FastAPI(title="AI Agent Server", version="1.0.0")
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
# --- State object to hold the model ---
class AppState:
    llm = None

# --- Pydantic model for request body ---
class PredictionRequest(BaseModel):
    prompt: str
    model: str = "gpt-4o-mini"
    temperature: float = 0.5

# --- App startup event to initialize the model ---
@app.on_event("startup")
async def startup_event():
    # Initialize with error handling
    if not os.getenv("OPENAI_API_KEY"):
        print("CRITICAL: OPENAI_API_KEY is not set. The /predict endpoint will not work.")
        AppState.llm = None
    else:
        try:
            AppState.llm = ChatOpenAI(model="gpt-4o-mini")
            print("OpenAI client initialized successfully.")
        except Exception as e:
            print(f"CRITICAL: Could not initialize OpenAI client: {e}")
            AppState.llm = None

# --- API Endpoints ---
@app.post("/predict")
@limiter.limit("5/minute")  # Limit to 5 requests per minute per IP
async def predict(payload: PredictionRequest, request: Request):
    # slowapi requires a parameter named `request` of type starlette Request,
    # so the JSON body arrives as `payload`.
    if AppState.llm is None:
        raise HTTPException(
            status_code=503,
            detail="AI model is not available. Check server logs for initialization errors."
        )
    try:
        messages = [HumanMessage(content=payload.prompt)]
        # Use the model instance from the app state
        # (this minimal example ignores the optional `model` and `temperature` fields)
        response = AppState.llm.invoke(messages)
        return {"response": response.content}
    except Exception as e:
        # Catch potential API errors during invocation
        raise HTTPException(status_code=500, detail=f"An error occurred while processing the request: {e}")

@app.get("/health")
async def health_check():
    health_status = {
        "status": "healthy",
        "message": "AI Agent Server is running",
        "model_initialized": AppState.llm is not None
    }
    return health_status
Run the development server:
pip install fastapi==0.115.6 uvicorn==0.32.1 pydantic==2.10.1 slowapi==0.1.9
uvicorn src.main:app --reload --port 8000
Navigate to http://localhost:8000/docs to see interactive API documentation.
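You can also exercise the endpoint from the command line with curl (the prompt text below is arbitrary; remember the endpoint is limited to 5 requests per minute per IP):
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a one-line greeting."}'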
Production Note: For production, consider adding request timeouts and authentication to your FastAPI endpoints. See FastAPI Security for best practices.
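As a starting point, here is a minimal sketch of header-based API-key authentication using FastAPI's built-in APIKeyHeader; the header name X-API-Key and the API_KEY environment variable are illustrative choices, not part of the code above:
# Minimal API-key auth sketch (header name and env var are illustrative)
import os
from fastapi import Depends, HTTPException
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def require_api_key(api_key: str = Depends(api_key_header)):
    # Reject requests whose key doesn't match the one configured in the environment
    if not api_key or api_key != os.getenv("API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

# Protect an endpoint by declaring the dependency:
# @app.post("/predict", dependencies=[Depends(require_api_key)])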
20. Integrated Mini-Project: RAG Agent with a FastAPI Endpoint (Advanced)
This final example ties together several concepts from this guide into a single, functional application. We will build a simple Retrieval-Augmented Generation (RAG) API using FastAPI, LangChain, and ChromaDB.
What this project demonstrates:
- Project Structure: Using the src/ directory for modular code
- Dependency Management: Using packages like fastapi, langchain, and chromadb
- Vector Databases: Setting up a persistent ChromaDB store
- Advanced Chains: Building a RAG chain with modern LangChain (LCEL)
- API Serving: Exposing the RAG chain through a secure FastAPI endpoint
- Environment Variables: Loading API keys correctly with python-dotenv
Project Goal
To create an API endpoint /query that accepts a question, searches a small knowledge base for relevant context, and uses an LLM to generate an answer based on that context.
Step 1: Update Project Structure and Dependencies
First, ensure your project has the following structure and that the necessary packages are installed. We will create three new files: src/vector_store.py, src/rag_chain.py, and src/main_rag_api.py.
my-ai-agent/
├── .venv/
├── .env
├── chroma_db/             # Will be created automatically by ChromaDB
├── src/
│   ├── __init__.py
│   ├── vector_store.py    # New: Logic for setting up ChromaDB
│   ├── rag_chain.py       # New: Logic for the RAG chain
│   └── main_rag_api.py    # New: The FastAPI application
└── ...
Ensure you have the required packages installed:
See fastapi, uvicorn, langchain, langchain-openai, langchain-community, openai, chromadb, python-dotenv, sentence-transformers on PyPI.
pip install fastapi uvicorn langchain langchain-openai langchain-community openai chromadb python-dotenv sentence-transformers
ChromaDB can fall back to a local embedding model (via the sentence-transformers package) if you don't provide an embedding function, but we will explicitly use OpenAI's embeddings for better performance.
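If you would rather keep embeddings fully local (for example, to avoid OpenAI embedding costs), a sketch of a drop-in alternative using langchain-community's HuggingFaceEmbeddings wrapper is shown below; the model name is just a common default and is not part of the project code:
# Optional sketch: local embeddings via sentence-transformers instead of OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# Downloads the model on first use; "all-MiniLM-L6-v2" is a small, widely used default
local_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Pass `local_embeddings` to Chroma.from_documents(...) in src/vector_store.py
# in place of OpenAIEmbeddings().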
Step 2: Create the Vector Store (src/vector_store.py)
This module will handle setting up our document store. It will initialize ChromaDB, add documents to it, and create a retriever object that LangChain can use.
# src/vector_store.py
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Define the persistent directory
CHROMA_PATH = "./chroma_db"

def get_retriever():
    """
    Initializes and returns a ChromaDB retriever from a predefined set of documents.
    """
    # Sample documents for our knowledge base
    docs = [
        Document(
            page_content="VS Code is a lightweight but powerful source code editor from Microsoft.",
            metadata={"source": "doc1", "topic": "tools"}
        ),
        Document(
            page_content="A virtual environment is a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.",
            metadata={"source": "doc2", "topic": "python"}
        ),
        Document(
            page_content="RAG, or Retrieval-Augmented Generation, is a technique for enhancing the accuracy and reliability of large language models (LLMs) with facts fetched from external sources.",
            metadata={"source": "doc3", "topic": "ai"}
        ),
        Document(
            page_content="FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.8+ based on standard Python type hints.",
            metadata={"source": "doc4", "topic": "tools"}
        ),
    ]

    # Initialize OpenAI embeddings
    embeddings = OpenAIEmbeddings()

    # Create a persistent ChromaDB client
    # This will save the vector store to disk in the 'chroma_db' directory
    db_client = chromadb.PersistentClient(path=CHROMA_PATH)

    # Create or load the vector store
    # (note: from_documents re-adds these sample docs each time it is called)
    vectorstore = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        client=db_client,
        persist_directory=CHROMA_PATH
    )

    # Create and return a retriever
    # 'k=2' means it will retrieve the top 2 most relevant documents
    return vectorstore.as_retriever(search_kwargs={"k": 2})

if __name__ == '__main__':
    # A simple test to verify the retriever is working
    print("Initializing and testing the vector store...")
    retriever = get_retriever()
    test_query = "What is RAG?"
    results = retriever.invoke(test_query)
    print(f"Retrieved {len(results)} documents for query: '{test_query}'")
    for doc in results:
        print(f"- {doc.page_content}")
    print("\nVector store setup complete and verified.")
Step 3: Create the RAG Chain (src/rag_chain.py)
This module defines the core logic of our AI. It imports the retriever from the previous step and chains it together with a prompt template and an LLM to create the final RAG chain.
# src/rag_chain.py
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from src.vector_store import get_retriever

def get_rag_chain():
    """
    Creates and returns a RAG chain using the vector store retriever.
    """
    retriever = get_retriever()

    # RAG prompt template
    template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Keep the answer concise.

Context: {context}

Question: {question}

Answer:"""
    prompt = ChatPromptTemplate.from_template(template)

    # Initialize the LLM
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # Create the RAG chain using LangChain Expression Language (LCEL)
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_chain

if __name__ == '__main__':
    # A simple test to verify the chain is working
    print("Testing the RAG chain...")
    chain = get_rag_chain()
    response = chain.invoke("What is FastAPI?")
    print(response)
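Both src/vector_store.py and src/rag_chain.py include a small __main__ self-test. Assuming your OPENAI_API_KEY is exported in the shell (neither module calls load_dotenv itself), you can run them from the project root:
python -m src.vector_store
python -m src.rag_chain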
Step 4: Build the FastAPI App (src/main_rag_api.py)
This is the entry point for our API. It loads the RAG chain, defines the request and response models, and creates an endpoint to handle user queries.
# src/main_rag_api.py
import os
from functools import lru_cache

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from dotenv import load_dotenv
from src.rag_chain import get_rag_chain

# Load environment variables from .env file
load_dotenv()

# Initialize the FastAPI app
app = FastAPI(
    title="RAG API Server",
    version="1.0",
    description="A simple API server for a Retrieval-Augmented Generation agent.",
)

@lru_cache(maxsize=1)
def get_cached_rag_chain():
    """Build the RAG chain once and reuse it across requests."""
    return get_rag_chain()

# --- Pydantic Models for Request and Response ---
class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    answer: str

# --- API Endpoints ---
@app.get("/", summary="Health Check")
async def health_check():
    """A simple health check endpoint to confirm the server is running."""
    return {"status": "ok", "message": "RAG API is running"}

@app.post("/query", response_model=QueryResponse, summary="Query the RAG Agent")
async def query_agent(request: QueryRequest):
    """
    Receives a question, processes it through the RAG chain, and returns the answer.
    """
    if not os.getenv("OPENAI_API_KEY"):
        raise HTTPException(status_code=500, detail="OPENAI_API_KEY not found in environment variables.")
    if not request.question:
        raise HTTPException(status_code=400, detail="Question field cannot be empty.")
    try:
        # Get the cached RAG chain instance (built on the first request)
        rag_chain = get_cached_rag_chain()
        answer = rag_chain.invoke(request.question)
        return QueryResponse(answer=answer)
    except Exception as e:
        # A generic error handler for issues during chain invocation
        raise HTTPException(status_code=500, detail=f"An error occurred: {e}")

# To run this app:
# uvicorn src.main_rag_api:app --reload --port 8000
Step 5: Run and Test Your Integrated Application
With all the files in place, you can now run your API server.
- Ensure your .env file contains your OPENAI_API_KEY.
- Start the Server: Open your terminal (with the virtual environment activated) and run:
  uvicorn src.main_rag_api:app --reload --port 8000
- Test via Interactive Docs: Open your browser and navigate to http://127.0.0.1:8000/docs. You will see the FastAPI interface.
  - Expand the /query endpoint.
  - Click "Try it out".
  - Enter a question in the request body, such as: "What is the purpose of a virtual environment?"
  - Click "Execute". You should see the AI-generated response based on the context from your vector store.
- Test via curl (Optional):
curl -X POST "http://127.0.0.1:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "What is VS Code?"}'
Expected output:
{"answer":"VS Code is a lightweight but powerful source code editor from Microsoft."}
Production Tip: For public deployments, add authentication and rate limiting to your FastAPI endpoints. See FastAPI Security for best practices.
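Here is a minimal sketch of how the /query endpoint from Step 4 could be rate-limited with slowapi, the same library used in the FastAPI section above. The "10/minute" limit is an arbitrary example, and the names app, QueryRequest, QueryResponse, and get_cached_rag_chain from src/main_rag_api.py are assumed:
# Sketch: add per-IP rate limiting to src/main_rag_api.py with slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query", response_model=QueryResponse, summary="Query the RAG Agent")
@limiter.limit("10/minute")  # example limit; slowapi needs a `request: Request` parameter
async def query_agent(payload: QueryRequest, request: Request):
    # Same logic as Step 4, with the JSON body renamed to `payload`
    answer = get_cached_rag_chain().invoke(payload.question)
    return QueryResponse(answer=answer)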