AI-Powered Help & QnA System¶
Status: Production-Ready
Repository: Private Company Repository
Executive Summary¶
A production-ready RAG (Retrieval-Augmented Generation) system designed for enterprise documentation Q&A. It features hybrid retrieval, multi-provider LLM fallback, voice assistant capabilities, and comprehensive evaluation metrics. Built entirely on free-tier components, it achieves a 75%+ cost reduction through intelligent caching.
System Architecture¶
graph TB
subgraph "Client Layer"
WebUI["React Web Interface"]
VoiceUI["GenTalk Voice Assistant"]
end
subgraph "API Layer - FastAPI"
API["REST API"]
Session["Session Management"]
end
subgraph "RAG Orchestration"
RAG["RAG Orchestrator"]
Router["Query Router - BMS or CMS"]
end
subgraph "Retrieval Layer"
Dense["Dense Vector Search - pgvector HNSW"]
Sparse["Sparse BM25 Search - pg_trgm"]
RRF["RRF Fusion - Hybrid retrieval"]
Rerank["BGE Reranker Large - Cross-encoder"]
end
subgraph "LLM Layer"
Primary["Cerebras Llama 70B"]
Secondary["Gemini 2 Flash"]
Tertiary["DeepSeek R1"]
Fallback["Groq or OpenRouter"]
end
subgraph "Storage Layer"
PG["PostgreSQL 16 + pgvector"]
Redis["Redis 7 - Response cache"]
Embed["BGE-large-en - 1024-dim"]
end
WebUI --> API
VoiceUI --> API
API --> Session
Session --> RAG
RAG --> Router
Router --> Dense
Router --> Sparse
Dense --> RRF
Sparse --> RRF
RRF --> Rerank
Rerank --> Primary
Primary -.->|Fallback| Secondary
Secondary -.->|Fallback| Tertiary
Tertiary -.->|Fallback| Fallback
Primary --> Redis
Dense --> PG
Sparse --> PG
PG --> Embed
classDef clientStyle fill:#E1F5FE,stroke:#01579B,stroke-width:2px
classDef apiStyle fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
classDef retrievalStyle fill:#F3E5F5,stroke:#4A148C,stroke-width:2px
classDef llmStyle fill:#E8F5E9,stroke:#1B5E20,stroke-width:2px
classDef storageStyle fill:#FFE0B2,stroke:#E65100,stroke-width:2px
class WebUI,VoiceUI clientStyle
class API,Session,RAG,Router apiStyle
class Dense,Sparse,RRF,Rerank retrievalStyle
class Primary,Secondary,Tertiary,Fallback llmStyle
class PG,Redis,Embed storageStyle
System Overview¶
The system provides intelligent question-answering for enterprise Broadcast Management System (BMS) and Content Management System (CMS) documentation using state-of-the-art RAG techniques.
Key Metrics:
- 40K+ lines of documentation indexed
- <50ms health check response
- 2-5s end-to-end RAG latency
- 75%+ cache hit rate
- Accuracy target: ≥75% top-1 correctness
- Citation coverage: ≥90%
- Hallucination rate: ≤5%
Technology Stack¶
Backend Framework¶
- API: FastAPI with async/await support
- Server: Uvicorn ASGI server
- Validation: Pydantic models for request/response
- Python: 3.11+ for modern features
Database & Vector Store¶
- Database: PostgreSQL 16 with pgvector extension
- Vector Indexing: HNSW (Hierarchical Navigable Small World)
- Sparse Search: pg_trgm for BM25-style keyword matching
- Schema: Optimized with indexes and partitioning
Embeddings & Reranking¶
- Embeddings: BGE-large-en-v1.5 (1024-dim, MIT license)
- MTEB Score: 63.5% (higher than OpenAI's text-embedding-ada-002)
- Reranker: BGE-reranker-large (cross-encoder)
- Reranking Score: 67.6% MTEB reranking
- Model Size: 1.3GB embeddings + 560MB reranker
LLM Providers (Free-Tier Cloud APIs)¶
Primary: Cerebras¶
- Model: Llama 3.3 70B / Llama 3.1 8B
- Free Tier: 1M tokens/day
- Speed: Ultra-fast inference (thousands of tokens per second)
- Use Case: Primary generation
Secondary: Google Gemini¶
- Model: Gemini 2.0 Flash
- Free Tier: 1.5M tokens/day
- Speed: Excellent balance
- Use Case: Secondary fallback
Tertiary: DeepSeek¶
- Model: DeepSeek R1 (Reasoner)
- Features: GPT-4 level reasoning
- Cost: 27x lower than GPT-4
- Use Case: Complex reasoning tasks
Quaternary: Groq¶
- Model: Llama 70B / Gemma 9B
- Free Tier: 14.4K TPM
- Speed: Fastest inference
- Use Case: Fast simple queries
Final Fallback: OpenRouter¶
- Model: Llama 3.2 3B (free tier)
- Use Case: Last resort fallback
Caching & Performance¶
- Cache: Redis 7 for response caching
- Hit Rate: 75%+, driving the cost reduction noted above
- TTL: Configurable per query type
- Invalidation: Smart invalidation on document updates
Frontend¶
- Framework: React 18
- Build Tool: Vite for fast development
- HTTP Client: Axios for API communication
- Voice: GenTalk voice assistant with audio recording/playback
Infrastructure¶
- Containerization: Docker Compose for local/staging
- Orchestration: Kubernetes manifests for production
- CI/CD: GitHub Actions workflows (planned)
- Monitoring: Health checks, metrics endpoints
Core Features¶
Hybrid Retrieval System¶
The system combines two complementary search methods:
Dense Vector Search (pgvector)¶
- Semantic similarity using BGE embeddings
- HNSW indexing for fast approximate nearest neighbors
- Captures meaning and context
- Best for conceptual queries
Sparse Keyword Search (BM25 via pg_trgm)¶
- Traditional keyword matching
- Exact term matching with scoring
- Best for specific terminology
- Native PostgreSQL, no extra service
RRF Fusion (Reciprocal Rank Fusion)¶
- Combines results from both methods
- Weighted scoring algorithm
- Balances semantic and keyword relevance
- Improves overall accuracy by 10-15%
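The fusion step itself is small. The sketch below illustrates RRF over two ranked lists of chunk IDs; the constant k=60 and the equal per-method weights are illustrative assumptions rather than the production configuration.

def rrf_fuse(dense_ids, sparse_ids, k=60, dense_weight=1.0, sparse_weight=1.0):
    """Fuse two ranked lists of chunk IDs into one ranking via Reciprocal Rank Fusion."""
    scores = {}
    for rank, chunk_id in enumerate(dense_ids, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + dense_weight / (k + rank)
    for rank, chunk_id in enumerate(sparse_ids, start=1):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + sparse_weight / (k + rank)
    # Higher fused score means more relevant.
    return sorted(scores, key=scores.get, reverse=True)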
Multi-Stage Reranking¶
- Initial Retrieval: Fetch top-N candidates (typically 50-100)
- RRF Fusion: Combine dense + sparse results
- Cross-Encoder Reranking: BGE-reranker-large scores top-K (typically 20)
- Final Selection: Top-8 most relevant chunks for context
This multi-stage approach significantly improves precision while maintaining performance.
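A minimal reranking sketch using the sentence-transformers CrossEncoder API is shown below; the candidate dict shape and the top_k default are assumptions for illustration.

from sentence_transformers import CrossEncoder

# Load the cross-encoder mentioned above (runs on CPU by default).
reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query, candidates, top_k=8):
    """Score (query, chunk_text) pairs and keep the top_k highest-scoring chunks."""
    pairs = [(query, candidate["text"]) for candidate in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]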
Multi-Provider LLM Fallback¶
Automatic fallback chain ensures high availability:
Cerebras Llama 70B (primary)
↓ [if fails or rate limited]
Google Gemini 2.0 Flash
↓ [if fails]
DeepSeek R1 (for complex queries)
↓ [if fails]
Groq Gemma 9B (fast fallback)
↓ [if fails]
OpenRouter (final fallback)
Benefits:
- Zero downtime from single-provider outages
- Cost optimization (use the fastest/cheapest provider when possible)
- Quality optimization (use the best model for the query type)
- Rate limit protection
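The chain reduces to a simple ordered loop. The sketch below assumes each provider client exposes an async generate() method and a name attribute; these interfaces are illustrative, not the actual client code.

import logging

async def generate_with_fallback(prompt, providers):
    """Try each provider in order; return the first successful answer."""
    for provider in providers:
        try:
            return await provider.generate(prompt)
        except Exception as exc:  # rate limits, timeouts, provider outages
            logging.warning("Provider %s failed: %s", provider.name, exc)
    raise RuntimeError("All LLM providers failed")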
System Detection¶
Automatically detects whether a query is about BMS or CMS:
- Automatic Mode: Analyzes query keywords
- Manual Override: User can specify system
- Filtered Retrieval: Only searches relevant documentation
- Improved Accuracy: Reduces cross-contamination
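A keyword-based detector of this kind might look like the sketch below; the keyword sets are illustrative placeholders, not the production vocabulary.

BMS_KEYWORDS = {"advertiser", "campaign", "broadcast", "spot", "traffic"}
CMS_KEYWORDS = {"article", "asset", "publish", "workflow", "template"}

def detect_system(question: str) -> str | None:
    """Return 'BMS', 'CMS', or None when the query is ambiguous (search both)."""
    tokens = set(question.lower().split())
    bms_hits = len(tokens & BMS_KEYWORDS)
    cms_hits = len(tokens & CMS_KEYWORDS)
    if bms_hits == cms_hits:
        return None
    return "BMS" if bms_hits > cms_hits else "CMS"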
GenTalk Voice Assistant¶
Full voice interaction capabilities:
- Audio Recording: Browser-based audio capture
- Device Diagnostics: Automatic microphone detection
- Voice-to-Text: Speech recognition
- Text-to-Speech: Natural audio playback
- Session Management: Voice conversation history
- Accessibility: Alternative input method beyond typed queries
Citation System¶
Every answer includes transparent source attribution:
{
"answer": "To create an advertiser...",
"citations": [
{
"index": 1,
"chunk_id": "uuid-...",
"section_path": "BMS > Advertiser Management > Creating Advertiser",
"heading": "Creating a New Advertiser",
"similarity_score": 0.89
}
]
}
Benefits:
- Transparency: Users see the source of each statement
- Trust: Answers can be verified against source documents
- Navigation: Click through to the full document section
- Audit Trail: Track which docs are most useful
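On the API side, the citation payload shown above maps naturally onto a Pydantic model; the field types below are assumptions inferred from the JSON example.

from pydantic import BaseModel

class Citation(BaseModel):
    """One entry of the citations array returned with every answer."""
    index: int
    chunk_id: str
    section_path: str
    heading: str
    similarity_score: float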
Architecture Components¶
RAG Orchestrator¶
Central coordinator for the RAG pipeline:
- Query Analysis: Detect system (BMS/CMS), extract intent
- Cache Check: Look for cached response
- Retrieval: Hybrid dense + sparse search
- Reranking: Cross-encoder scoring
- Context Assembly: Format top chunks for LLM
- Generation: Call LLM with fallback chain
- Citation Extraction: Parse and validate citations
- Response Formatting: Structure final JSON response
- Cache Store: Store response for future queries
Document Ingestion Pipeline¶
Processes markdown documentation into searchable chunks:
- Loading: Read markdown files with metadata
- Cleaning: Remove artifacts, normalize formatting
- Chunking: Header-aware recursive splitting (512 tokens, 50 overlap)
- Embedding: Generate BGE embeddings (1024-dim vectors)
- Metadata: Extract section paths, headings, source info
- Storage: Insert into PostgreSQL with pgvector
- Indexing: Create HNSW index for fast search
- Validation: Verify anchor integrity (98%+ success rate)
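A simplified header-aware chunker is sketched below. It splits on markdown headings first and then windows long sections; sizes are in characters here for brevity, whereas the pipeline above works in tokens (512 with 50 overlap).

import re

def chunk_markdown(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split on markdown headings, then window any section that is still too long."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)  # keep each heading with its body
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        start = 0
        while start < len(section):
            chunks.append(section[start:start + max_chars])
            start += max_chars - overlap
    return chunks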
Caching Strategy¶
Multi-level caching for performance:
L1 Cache (Redis)
- Full response caching
- TTL: 1 hour for static docs, 5 minutes for dynamic content
- Key: hash(query + system + top_k)
- Invalidation: on document updates

L2 Cache (In-Memory)
- Embedding cache (query embeddings)
- Model cache (loaded models kept in memory)
- Frequently accessed chunks

L3 Cache (PostgreSQL)
- Materialized views for statistics
- Query result cache at the database level
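The L1 layer follows the key scheme above. The sketch below uses redis-py's asyncio client; the key prefix and TTL default are illustrative assumptions.

import hashlib
import json
import redis.asyncio as redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(query: str, system: str | None, top_k: int) -> str:
    raw = f"{query}|{system}|{top_k}"
    return "rag:response:" + hashlib.sha256(raw.encode()).hexdigest()

async def get_cached(query, system, top_k):
    hit = await cache.get(cache_key(query, system, top_k))
    return json.loads(hit) if hit else None

async def set_cached(query, system, top_k, response, ttl=3600):
    # TTL: 1 hour for static docs; pass ttl=300 for dynamic content.
    await cache.set(cache_key(query, system, top_k), json.dumps(response), ex=ttl)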
API Endpoints¶
Core Endpoints¶
POST /v1/ask¶
Ask a question and get an AI-generated answer.
Request:
{
"question": "How do I create a new advertiser?",
"top_k": 8,
"system": "BMS"
}
Response:
{
"question": "How do I create a new advertiser?",
"answer": "To create a new advertiser in BMS...",
"system": "BMS",
"citations": [...],
"context": [...],
"metadata": {
"retrieval_time_ms": 95,
"generation_time_ms": 1850,
"total_time_ms": 2100,
"cache_hit": false,
"model_used": "cerebras-llama-70b"
}
}
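For illustration, a client call against this endpoint could look like the httpx sketch below; the base URL is a local-development assumption.

import httpx

async def ask(question: str, system: str = "BMS", top_k: int = 8) -> dict:
    """POST a question to /v1/ask and return the parsed JSON response."""
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        resp = await client.post(
            "/v1/ask",
            json={"question": question, "top_k": top_k, "system": system},
            timeout=30.0,
        )
        resp.raise_for_status()
        return resp.json()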
GET /health¶
System health check.
GET /api/v1/llm/cache/stats¶
Cache performance statistics.
POST /api/v1/sessions/create¶
Create new chat session.
Monitoring Endpoints¶
- /metrics: Prometheus-compatible metrics
- /health/deep: Detailed system health
- /api/v1/llm/tokens/usage: Token usage tracking
Performance Optimization¶
Database Optimization¶
Indexes:
- pgvector HNSW index for vector search (<100ms)
- B-tree indexes on metadata fields
- pg_trgm GIN index for keyword search

Connection Pooling:
- asyncpg pool with 10-50 connections
- Connection reuse and health checks
- Automatic retry on connection failures

Query Optimization:
- Prepared statements for common queries
- Efficient vector distance calculations
- Appropriately limited result sets
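The index DDL and pool sizing described above could be set up roughly as follows; the table and column names (chunks, embedding, content) are illustrative assumptions, not the actual schema.

import asyncpg

async def setup_database(dsn: str) -> asyncpg.Pool:
    """Create the connection pool and the vector/keyword indexes."""
    pool = await asyncpg.create_pool(dsn, min_size=10, max_size=50)
    async with pool.acquire() as conn:
        await conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_chunks_embedding "
            "ON chunks USING hnsw (embedding vector_cosine_ops)"
        )
        await conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_chunks_content_trgm "
            "ON chunks USING gin (content gin_trgm_ops)"
        )
    return pool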
Embedding Optimization¶
CPU-Based Inference:
- No GPU required (uses sentence-transformers)
- Batch processing for efficiency
- Model quantization for speed (planned)

Caching:
- Cache query embeddings
- Pre-compute document embeddings
- Keep frequently used query embeddings in memory
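A CPU-only embedding setup with a simple query-embedding cache is sketched below; the batch size and cache size are illustrative.

from functools import lru_cache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu")

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple[float, ...]:
    # Returned as a tuple so results are hashable and cacheable (1024-dim for BGE-large).
    return tuple(model.encode(query, normalize_embeddings=True))

def embed_documents(texts: list[str]):
    # Batch encoding amortizes tokenization and inference cost on CPU.
    return model.encode(texts, batch_size=32, normalize_embeddings=True)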
LLM Optimization¶
Response Caching:
- 75%+ cache hit rate in production
- Significantly reduces API costs
- Faster response times (<100ms for cache hits)

Fallback Chain:
- Start with the fastest/cheapest provider
- Escalate only when needed
- Track success rates per provider

Prompt Engineering:
- Optimized prompts for each provider
- Token-efficient prompt design
- Clear instructions reduce regeneration
Evaluation & Quality Metrics¶
Gold Standard Dataset¶
- 200-300 curated Q&A pairs
- Covers common user questions
- Includes edge cases and difficult queries
- Regularly updated based on production queries
Accuracy Metrics¶
Top-1 Correctness: ≥75% target
- Is the top answer correct?
- Measured against the gold standard

Top-3 Correctness: ≥90% target
- Is the correct answer in the top 3?
- Allows for ranking improvements

Citation Coverage: ≥90% target
- Does the answer include proper citations?
- Are citations valid and relevant?

Hallucination Rate: ≤5% target
- Does the answer contain unsupported claims?
- Verified against source documents
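Top-k correctness over the gold set can be computed in a few lines; the shape of each gold item (question plus expected chunk_id) is an assumption about how the dataset is stored.

def top_k_correctness(gold_items, retrieve_fn, k=1):
    """Fraction of gold questions whose expected chunk appears in the top-k results."""
    if not gold_items:
        return 0.0
    hits = 0
    for item in gold_items:
        results = retrieve_fn(item["question"])[:k]
        if item["expected_chunk_id"] in [r["chunk_id"] for r in results]:
            hits += 1
    return hits / len(gold_items)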
Performance Metrics¶
- P50 Latency: <2s
- P95 Latency: <5s
- P99 Latency: <10s
- Cache Hit Rate: >50%
- Uptime: >99.5%
Quality Assurance¶
Automated Testing:
- Regression tests on the gold standard
- Nightly evaluation runs
- CI/CD gates on accuracy metrics

Human Evaluation:
- Random sample review (10% of queries)
- Thumbs up/down feedback
- Support ticket analysis

Continuous Improvement:
- Feedback loop from user interactions
- Failed queries added to the test set
- Model and prompt iteration
Deployment Options¶
Docker Compose (Development/Staging)¶
# Start all services
docker-compose up -d
# Services include:
# - PostgreSQL 16 (with pgvector)
# - Redis 7
# - FastAPI backend
# - React frontend
Kubernetes (Production)¶
Resources:
- Namespace: enterprise-ai-help
- StatefulSet for PostgreSQL
- Deployment for API (3+ replicas)
- Deployment for frontend
- Service for load balancing
- Ingress for external access
- ConfigMap for configuration
- Secrets for credentials

High Availability:
- Multi-replica API deployment
- PostgreSQL replication (planned)
- Redis Sentinel for HA (planned)
- Health checks and automatic restart
Infrastructure Requirements¶
Minimum (Development):
- 4GB RAM
- 2 CPU cores
- 10GB disk

Recommended (Production):
- 16GB RAM
- 8 CPU cores
- 100GB SSD
- 10Gbps network
Security Considerations¶
API Security:
- Input validation on all requests
- Rate limiting per IP/user
- SQL injection prevention (parameterized queries)
- XSS protection in the frontend

Data Security:
- Encrypted database connections
- Secrets in environment variables
- No hardcoded credentials
- Kubernetes secrets for production

LLM Security:
- Prompt injection detection (regex + classifier)
- Output filtering for sensitive data
- API key rotation
- Usage monitoring and alerts
Documentation¶
The project includes comprehensive documentation:
- Project Plan (106KB): 5-week implementation plan with quality gates
- Architecture: System design and component relationships
- API Documentation: OpenAPI/Swagger specifications
- Setup Guides: Development and production deployment
- Feature Docs: GenTalk, evaluation, hybrid retrieval
- Runbooks: Common operations and troubleshooting
Cost Analysis¶
Development (Free Tier)¶
- PostgreSQL: Free (Supabase 500MB or local)
- Redis: Free (Redis Cloud 30MB or local)
- LLMs: 100% free-tier APIs
- Hosting: Local development

Total: $0/month
Production (Small Scale)¶
- Database: $25/month (managed PostgreSQL)
- Redis: $15/month (managed Redis)
- LLMs: $0 (free tier sufficient for <1M queries/month)
- Hosting: $50/month (3 replicas on cloud)

Total: $90/month
Savings from Caching¶
- Without cache: ~$500/month in LLM costs
- With cache (75% hit rate): ~$125/month

Savings: $375/month (75% reduction)
Future Enhancements¶
Planned Features:
- Multi-language support (translation layer)
- Advanced analytics dashboard
- User feedback analysis
- Automated retraining pipeline
- Vector store optimization
- GraphRAG for complex queries
- Fine-tuned embeddings on domain data

Infrastructure Improvements:
- Database replication for HA
- Read replicas for scaling
- Advanced monitoring (Prometheus/Grafana)
- Distributed tracing
- Log aggregation (ELK stack)
Links¶
- Repository: Private Company Repository
- Documentation: Project plan, implementation summary, and API documentation