AI-Powered Help & QnA System

Status: Production-Ready · Repository: Private Company Repository

Executive Summary

A production-ready RAG (Retrieval-Augmented Generation) system designed for enterprise documentation Q&A. It features hybrid retrieval, multi-provider LLM fallback, voice assistant capabilities, and comprehensive evaluation metrics. The system is built entirely on free-tier components, with intelligent caching delivering a 75%+ cost reduction.

System Architecture

graph TB
    subgraph "Client Layer"
        WebUI["React Web Interface"]
        VoiceUI["GenTalk Voice Assistant"]
    end

    subgraph "API Layer - FastAPI"
        API["REST API"]
        Session["Session Management"]
    end

    subgraph "RAG Orchestration"
        RAG["RAG Orchestrator"]
        Router["Query Router - BMS or CMS"]
    end

    subgraph "Retrieval Layer"
        Dense["Dense Vector Search - pgvector HNSW"]
        Sparse["Sparse BM25 Search - pg_trgm"]
        RRF["RRF Fusion - Hybrid retrieval"]
        Rerank["BGE Reranker Large - Cross-encoder"]
    end

    subgraph "LLM Layer"
        Primary["Cerebras Llama 70B"]
        Secondary["Gemini 2.0 Flash"]
        Tertiary["DeepSeek R1"]
        Fallback["Groq or OpenRouter"]
    end

    subgraph "Storage Layer"
        PG["PostgreSQL 16 + pgvector"]
        Redis["Redis 7 - Response cache"]
        Embed["BGE-large-en - 1024-dim"]
    end

    WebUI --> API
    VoiceUI --> API
    API --> Session
    Session --> RAG
    RAG --> Router
    Router --> Dense
    Router --> Sparse
    Dense --> RRF
    Sparse --> RRF
    RRF --> Rerank
    Rerank --> Primary
    Primary -.->|Fallback| Secondary
    Secondary -.->|Fallback| Tertiary
    Tertiary -.->|Fallback| Fallback

    Primary --> Redis
    Dense --> PG
    Sparse --> PG
    PG --> Embed

    classDef clientStyle fill:#E1F5FE,stroke:#01579B,stroke-width:2px
    classDef apiStyle fill:#FFF9C4,stroke:#F57F17,stroke-width:2px
    classDef retrievalStyle fill:#F3E5F5,stroke:#4A148C,stroke-width:2px
    classDef llmStyle fill:#E8F5E9,stroke:#1B5E20,stroke-width:2px
    classDef storageStyle fill:#FFE0B2,stroke:#E65100,stroke-width:2px

    class WebUI,VoiceUI clientStyle
    class API,Session,RAG,Router apiStyle
    class Dense,Sparse,RRF,Rerank retrievalStyle
    class Primary,Secondary,Tertiary,Fallback llmStyle
    class PG,Redis,Embed storageStyle

System Overview

The system provides intelligent question-answering for enterprise Broadcast Management System (BMS) and Content Management System (CMS) documentation using state-of-the-art RAG techniques.

Key Metrics:

  • 40K+ lines of documentation indexed
  • <50ms health check response
  • 2-5s end-to-end RAG latency
  • 75%+ cache hit rate
  • Accuracy target: ≥75% top-1 correctness
  • Citation coverage: ≥90%
  • Hallucination rate: ≤5%

Technology Stack

Backend Framework

  • API: FastAPI with async/await support
  • Server: Uvicorn ASGI server
  • Validation: Pydantic models for request/response
  • Python: 3.11+ for modern features

Database & Vector Store

  • Database: PostgreSQL 16 with pgvector extension
  • Vector Indexing: HNSW (Hierarchical Navigable Small World)
  • Sparse Search: pg_trgm for BM25-style keyword matching
  • Schema: Optimized with indexes and partitioning

Embeddings & Reranking

  • Embeddings: BGE-large-en-v1.5 (1024-dim, MIT license)
  • MTEB Score: 63.5% (higher than OpenAI's text-embedding-ada-002)
  • Reranker: BGE-reranker-large (cross-encoder)
  • Reranking Score: 67.6% on the MTEB reranking benchmark
  • Model Size: 1.3GB embeddings + 560MB reranker

LLM Providers (Free-Tier Cloud APIs)

Primary: Cerebras

  • Model: Llama 3.3 70B / Llama 3.1 8B
  • Free Tier: 1M tokens/day
  • Speed: Ultra-fast inference (thousands of tokens per second)
  • Use Case: Primary generation

Secondary: Google Gemini

  • Model: Gemini 2.0 Flash
  • Free Tier: 1.5M tokens/day
  • Speed: Excellent balance
  • Use Case: Secondary fallback

Tertiary: DeepSeek

  • Model: DeepSeek R1 (Reasoner)
  • Features: GPT-4 level reasoning
  • Cost: 27x lower than GPT-4
  • Use Case: Complex reasoning tasks

Quaternary: Groq

  • Model: Llama 70B / Gemma 9B
  • Free Tier: 14.4K TPM
  • Speed: Fastest inference
  • Use Case: Fast simple queries

Final Fallback: OpenRouter

  • Model: Llama 3.2 3B (free tier)
  • Use Case: Last resort fallback

Caching & Performance

  • Cache: Redis 7 for response caching
  • Hit Rate: 75%+, driving the 75%+ cost reduction
  • TTL: Configurable per query type
  • Invalidation: Smart invalidation on document updates

Frontend

  • Framework: React 18
  • Build Tool: Vite for fast development
  • HTTP Client: Axios for API communication
  • Voice: GenTalk voice assistant with audio recording/playback

Infrastructure

  • Containerization: Docker Compose for local/staging
  • Orchestration: Kubernetes manifests for production
  • CI/CD: GitHub Actions workflows (planned)
  • Monitoring: Health checks, metrics endpoints

Core Features

Hybrid Retrieval System

The system combines two complementary search methods, both sketched in code after the lists below:

Dense Vector Search (pgvector)

  • Semantic similarity using BGE embeddings
  • HNSW indexing for fast approximate nearest neighbors
  • Captures meaning and context
  • Best for conceptual queries

Sparse Keyword Search (BM25 via pg_trgm)

  • Traditional keyword matching
  • Exact term matching with scoring
  • Best for specific terminology
  • Native PostgreSQL, no extra service
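
A minimal sketch of what the two retrieval arms might look like with asyncpg, assuming a hypothetical chunks table with content text and embedding vector(1024) columns (the real schema and scoring may differ):

import asyncpg

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts a "[x1,x2,...]" literal cast to ::vector
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

async def dense_search(pool: asyncpg.Pool, query_vec: list[float], top_n: int = 50):
    # <=> is pgvector's cosine-distance operator; the HNSW index accelerates it
    return await pool.fetch(
        """SELECT id, content, 1 - (embedding <=> $1::vector) AS score
           FROM chunks ORDER BY embedding <=> $1::vector LIMIT $2""",
        to_pgvector(query_vec), top_n,
    )

async def sparse_search(pool: asyncpg.Pool, query: str, top_n: int = 50):
    # pg_trgm trigram similarity as the keyword-matching arm
    return await pool.fetch(
        """SELECT id, content, similarity(content, $1) AS score
           FROM chunks WHERE content % $1
           ORDER BY score DESC LIMIT $2""",
        query, top_n,
    )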

RRF Fusion (Reciprocal Rank Fusion)

  • Combines results from both methods
  • Weighted scoring algorithm
  • Balances semantic and keyword relevance
  • Improves overall accuracy by 10-15% (see the fusion sketch below)
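
RRF itself fits in a few lines; this sketch assumes both result lists arrive best-first and uses k = 60, the common default:

def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60,
             dense_weight: float = 1.0, sparse_weight: float = 1.0) -> list[str]:
    # Each item earns weight / (k + rank) from every ranking it appears in
    scores: dict[str, float] = {}
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)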

Multi-Stage Reranking

  1. Initial Retrieval: Fetch top-N candidates (typically 50-100)
  2. RRF Fusion: Combine dense + sparse results
  3. Cross-Encoder Reranking: BGE-reranker-large scores top-K (typically 20)
  4. Final Selection: Top-8 most relevant chunks for context

This multi-stage approach significantly improves precision while maintaining performance.
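
The cross-encoder stage maps directly onto sentence-transformers' CrossEncoder with the BGE reranker named above; treat this as an illustrative sketch rather than the production code:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, chunks: list[str], top_k: int = 8) -> list[str]:
    # Score each (query, chunk) pair jointly, then keep the best top_k
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]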

Multi-Provider LLM Fallback

An automatic fallback chain ensures high availability, sketched in code below:

Cerebras Llama 70B (primary)
  ↓ [if fails or rate limited]
Google Gemini 2.0 Flash
  ↓ [if fails]
DeepSeek R1 (for complex queries)
  ↓ [if fails]
Groq Gemma 9B (fast fallback)
  ↓ [if fails]
OpenRouter (final fallback)

Benefits:

  • Zero downtime from single provider outages
  • Cost optimization (use the fastest/cheapest provider when possible)
  • Quality optimization (use the best model for the query type)
  • Rate limit protection
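
A minimal sketch of the chain itself, where the entries in providers are hypothetical async wrappers around each provider's SDK, ordered as in the diagram above:

async def generate_with_fallback(prompt: str, providers: list) -> str:
    # providers: ordered async callables, e.g. [call_cerebras, call_gemini, ...]
    last_error: Exception | None = None
    for provider in providers:
        try:
            return await provider(prompt)
        except Exception as exc:  # rate limit, timeout, or outage: move down the chain
            last_error = exc
    raise RuntimeError("all LLM providers failed") from last_error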

System Detection

Automatically detects whether a query is about BMS or CMS (a toy router is sketched after the list below):

  • Automatic Mode: Analyzes query keywords
  • Manual Override: User can specify system
  • Filtered Retrieval: Only searches relevant documentation
  • Improved Accuracy: Reduces cross-contamination
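
A toy version of the router; the keyword sets here are invented examples, not the production lists:

BMS_KEYWORDS = {"advertiser", "campaign", "broadcast", "spot", "traffic"}
CMS_KEYWORDS = {"article", "asset", "publish", "workflow", "template"}

def detect_system(question: str, override: str | None = None) -> str:
    # Manual override always wins
    if override:
        return override
    tokens = set(question.lower().split())
    # Pick whichever documentation set the query mentions more
    return "BMS" if len(tokens & BMS_KEYWORDS) >= len(tokens & CMS_KEYWORDS) else "CMS"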

GenTalk Voice Assistant

Full voice interaction capabilities:

  • Audio Recording: Browser-based audio capture
  • Device Diagnostics: Automatic microphone detection
  • Voice-to-Text: Speech recognition
  • Text-to-Speech: Natural audio playback
  • Session Management: Voice conversation history
  • Accessibility: Alternative input method for users who need or prefer voice

Citation System

Every answer includes transparent source attribution:

{
  "answer": "To create an advertiser...",
  "citations": [
    {
      "index": 1,
      "chunk_id": "uuid-...",
      "section_path": "BMS > Advertiser Management > Creating Advertiser",
      "heading": "Creating a New Advertiser",
      "similarity_score": 0.89
    }
  ]
}

Benefits:

  • Transparency: Users see the source of each answer
  • Trust: Answers can be verified against source documents
  • Navigation: Click through to the full document section
  • Audit Trail: Track which docs are most useful

Architecture Components

RAG Orchestrator

Central coordinator for the RAG pipeline; a skeleton sketch follows the list:

  1. Query Analysis: Detect system (BMS/CMS), extract intent
  2. Cache Check: Look for cached response
  3. Retrieval: Hybrid dense + sparse search
  4. Reranking: Cross-encoder scoring
  5. Context Assembly: Format top chunks for LLM
  6. Generation: Call LLM with fallback chain
  7. Citation Extraction: Parse and validate citations
  8. Response Formatting: Structure final JSON response
  9. Cache Store: Store response for future queries
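
A skeleton tying the nine steps together; deps bundles placeholder implementations of the components described in this document:

async def answer(question: str, deps, top_k: int = 8) -> dict:
    system = deps.detect_system(question)                       # 1. query analysis
    if (hit := await deps.cache_get(question, system, top_k)):  # 2. cache check
        return hit
    candidates = await deps.hybrid_retrieve(question, system)   # 3. dense + sparse + RRF
    chunks = deps.rerank(question, candidates, top_k)           # 4. cross-encoder
    prompt = deps.build_prompt(question, chunks)                # 5. context assembly
    raw = await deps.generate_with_fallback(prompt)             # 6. LLM with fallback
    citations = deps.extract_citations(raw, chunks)             # 7. citation extraction
    response = {"question": question, "answer": raw,
                "system": system, "citations": citations}       # 8. response formatting
    await deps.cache_set(question, system, top_k, response)     # 9. cache store
    return response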

Document Ingestion Pipeline

Processes markdown documentation into searchable chunks; a chunking and embedding sketch follows the list:

  1. Loading: Read markdown files with metadata
  2. Cleaning: Remove artifacts, normalize formatting
  3. Chunking: Header-aware recursive splitting (512 tokens, 50 overlap)
  4. Embedding: Generate BGE embeddings (1024-dim vectors)
  5. Metadata: Extract section paths, headings, source info
  6. Storage: Insert into PostgreSQL with pgvector
  7. Indexing: Create HNSW index for fast search
  8. Validation: Verify anchor integrity (98%+ success rate)
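
Steps 3-4 might look like the sketch below, which approximates the 512-token window with whitespace tokens for brevity (the real splitter is header-aware):

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Sliding window: each chunk repeats the last `overlap` tokens of the previous one
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]

def embed_chunks(chunks: list[str]):
    # Returns 1024-dim vectors ready for pgvector insertion
    return embedder.encode(chunks, batch_size=32, normalize_embeddings=True)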

Caching Strategy

Multi-level caching for performance; a key/TTL sketch follows the cache tiers below:

L1 Cache (Redis)

  • Full response caching
  • TTL: 1 hour for static docs, 5 minutes for dynamic content
  • Key: hash(query + system + top_k)
  • Invalidation: on document updates

L2 Cache (In-Memory)

  • Embedding cache (query embeddings)
  • Model cache (loaded models kept in memory)
  • Frequently accessed chunks

L3 Cache (PostgreSQL)

  • Materialized views for statistics
  • Query result cache at the database level
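
One plausible shape for the L1 key and TTL logic, using redis-py's asyncio client; the key format and TTLs mirror the description above, everything else is an assumption:

import hashlib
import json
import redis.asyncio as redis

r = redis.Redis()

def cache_key(query: str, system: str, top_k: int) -> str:
    # Key: hash(query + system + top_k), namespaced for easy invalidation
    raw = f"{query}|{system}|{top_k}"
    return "rag:" + hashlib.sha256(raw.encode()).hexdigest()

async def cache_set(query: str, system: str, top_k: int, response: dict,
                    static_doc: bool = True) -> None:
    ttl = 3600 if static_doc else 300  # 1 hour for static docs, 5 minutes for dynamic
    await r.set(cache_key(query, system, top_k), json.dumps(response), ex=ttl)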

API Endpoints

Core Endpoints

POST /v1/ask

Ask a question and get an AI-generated answer.

Request:

{
  "question": "How do I create a new advertiser?",
  "top_k": 8,
  "system": "BMS"
}

Response:

{
  "question": "How do I create a new advertiser?",
  "answer": "To create a new advertiser in BMS...",
  "system": "BMS",
  "citations": [...],
  "context": [...],
  "metadata": {
    "retrieval_time_ms": 95,
    "generation_time_ms": 1850,
    "total_time_ms": 2100,
    "cache_hit": false,
    "model_used": "cerebras-llama-70b"
  }
}
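
For example, calling the endpoint from Python (the base URL is a placeholder):

import httpx

resp = httpx.post(
    "http://localhost:8000/v1/ask",
    json={"question": "How do I create a new advertiser?", "top_k": 8, "system": "BMS"},
    timeout=30.0,
)
resp.raise_for_status()
print(resp.json()["answer"])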

GET /health

System health check.

GET /api/v1/llm/cache/stats

Cache performance statistics.

POST /api/v1/sessions/create

Create new chat session.

Monitoring Endpoints

  • /metrics: Prometheus-compatible metrics
  • /health/deep: Detailed system health
  • /api/v1/llm/tokens/usage: Token usage tracking

Performance Optimization

Database Optimization

Indexes:

  • pgvector HNSW index for vector search (<100ms)
  • B-tree indexes on metadata fields
  • pg_trgm GIN index for keyword search

Connection Pooling:

  • Asyncpg pool with 10-50 connections
  • Connection reuse and health checks
  • Automatic retry on connection failures

Query Optimization:

  • Prepared statements for common queries
  • Efficient vector distance calculations
  • Appropriately limited result sets
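
The described pool range maps directly onto asyncpg.create_pool; the timeout and recycling values below are illustrative assumptions:

import asyncpg

async def make_pool(dsn: str) -> asyncpg.Pool:
    return await asyncpg.create_pool(
        dsn,
        min_size=10,                            # keep a warm floor of connections
        max_size=50,                            # cap concurrent connections
        command_timeout=30,                     # fail fast on stuck queries
        max_inactive_connection_lifetime=300,   # recycle idle connections
    )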

Embedding Optimization

CPU-Based Inference:

  • No GPU required (uses sentence-transformers)
  • Batch processing for efficiency
  • Model quantization for speed (planned)

Caching:

  • Query embeddings cached
  • Document embeddings pre-computed
  • Frequently used queries kept in memory

LLM Optimization

Response Caching:

  • 75%+ cache hit rate in production
  • Significantly reduces API costs
  • Faster response times (<100ms for cache hits)

Fallback Chain:

  • Start with the fastest/cheapest provider
  • Escalate only when needed
  • Track success rates per provider

Prompt Engineering:

  • Optimized prompts for each provider
  • Token-efficient prompt design
  • Clear instructions reduce regeneration

Evaluation & Quality Metrics

Gold Standard Dataset

  • 200-300 curated Q&A pairs
  • Covers common user questions
  • Includes edge cases and difficult queries
  • Regularly updated based on production queries

Accuracy Metrics

Top-1 Correctness: ≥75% target

  • Is the top answer correct?
  • Measured against the gold standard

Top-3 Correctness: ≥90% target

  • Is the correct answer in the top 3?
  • Allows for ranking improvements

Citation Coverage: ≥90% target

  • Does the answer include proper citations?
  • Are the citations valid and relevant?

Hallucination Rate: ≤5% target

  • Does the answer contain unsupported claims?
  • Verified against source documents
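
A toy evaluation loop over such a gold set; the per-case structure (an is_correct checker alongside each question) is a hypothetical convention:

def evaluate(gold: list[dict], ask) -> dict:
    # `ask` is the system under test: question -> {"answer": ..., "citations": [...]}
    top1 = cited = 0
    for case in gold:
        result = ask(case["question"])
        if case["is_correct"](result["answer"]):
            top1 += 1
        if result.get("citations"):
            cited += 1
    n = len(gold)
    return {"top1_correctness": top1 / n, "citation_coverage": cited / n}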

Performance Metrics

  • P50 Latency: <2s
  • P95 Latency: <5s
  • P99 Latency: <10s
  • Cache Hit Rate: >50%
  • Uptime: >99.5%

Quality Assurance

Automated Testing:

  • Regression tests on the gold standard
  • Nightly evaluation runs
  • CI/CD gates on accuracy metrics

Human Evaluation:

  • Random sample review (10% of queries)
  • Thumbs up/down feedback
  • Support ticket analysis

Continuous Improvement:

  • Feedback loop from user interactions
  • Failed queries added to the test set
  • Model and prompt iteration

Deployment Options

Docker Compose (Development/Staging)

# Start all services
docker-compose up -d

# Services include:
# - PostgreSQL 16 (with pgvector)
# - Redis 7
# - FastAPI backend
# - React frontend

Kubernetes (Production)

Resources:

  • Namespace: enterprise-ai-help
  • StatefulSet for PostgreSQL
  • Deployment for the API (3+ replicas)
  • Deployment for the frontend
  • Service for load balancing
  • Ingress for external access
  • ConfigMap for configuration
  • Secrets for credentials

High Availability:

  • Multi-replica API deployment
  • PostgreSQL replication (planned)
  • Redis Sentinel for HA (planned)
  • Health checks and automatic restart

Infrastructure Requirements

Minimum (Development):

  • 4GB RAM
  • 2 CPU cores
  • 10GB disk

Recommended (Production):

  • 16GB RAM
  • 8 CPU cores
  • 100GB SSD
  • 10Gbps network

Security Considerations

API Security:

  • Input validation on all requests
  • Rate limiting per IP/user
  • SQL injection prevention (parameterized queries)
  • XSS protection in the frontend

Data Security:

  • Encrypted database connections
  • Secrets in environment variables
  • No hardcoded credentials
  • Kubernetes secrets for production

LLM Security:

  • Prompt injection detection (regex + classifier)
  • Output filtering for sensitive data
  • API key rotation
  • Usage monitoring and alerts

Documentation

The project includes comprehensive documentation:

  • Project Plan (106KB): 5-week implementation plan with quality gates
  • Architecture: System design and component relationships
  • API Documentation: OpenAPI/Swagger specifications
  • Setup Guides: Development and production deployment
  • Feature Docs: GenTalk, evaluation, hybrid retrieval
  • Runbooks: Common operations and troubleshooting

Cost Analysis

Development (Free Tier)

  • PostgreSQL: Free (Supabase 500MB or local)
  • Redis: Free (Redis Cloud 30MB or local)
  • LLMs: 100% free-tier APIs
  • Hosting: Local development

Total: $0/month

Production (Small Scale)

  • Database: $25/month (managed PostgreSQL)
  • Redis: $15/month (managed Redis)
  • LLMs: $0 (free tier sufficient for <1M queries/month)
  • Hosting: $50/month (3 replicas on cloud)

Total: $90/month

Savings from Caching

  • Without cache: ~$500/month in LLM costs
  • With cache (75% hit rate): ~$125/month

Savings: $375/month (75% reduction)

Future Enhancements

Planned Features:

  • Multi-language support (translation layer)
  • Advanced analytics dashboard
  • User feedback analysis
  • Automated retraining pipeline
  • Vector store optimization
  • GraphRAG for complex queries
  • Fine-tuned embeddings on domain data

Infrastructure Improvements:

  • Database replication for HA
  • Read replicas for scaling
  • Advanced monitoring (Prometheus/Grafana)
  • Distributed tracing
  • Log aggregation (ELK stack)

  • Repository: Private Company Repository
  • Documentation: Project plan, implementation summary, and API documentation
