EnergyLM-7B — LLM Fine-Tuning & Alignment Pipeline

Status: In Progress (Week 1 Complete) Organization: ForceX-AI on HuggingFace Budget: $0 — 100% free-tier compute

Executive Summary

End-to-end LLM training, alignment, and serving pipeline for EnergyLM-7B — a domain-adapted language model for energy systems and scientific reasoning. Fine-tunes Qwen2.5-7B using QLoRA SFT on 20K synthetic instruction samples, aligns with both DPO and ORPO for a controlled comparison, and benchmarks across 10 evaluation dimensions including domain-specific physics calculations, safety, and bilingual (Indonesian) capability.

The entire project runs on $0 budget — Kaggle T4 for training, free-tier LLM APIs for data generation, and HuggingFace Spaces for deployment.

Key Capabilities

Capability Implementation
LLM Fine-Tuning QLoRA SFT on Qwen2.5-7B (4-bit NF4, LoRA r=64)
Alignment Methods DPO vs ORPO controlled comparison study
Chain-of-Thought CoT distillation with <think> tag reasoning
Reward Modeling DeBERTa-v3-base binary classifier on preference pairs
Synthetic Data 20K+ samples via multi-teacher round-robin (Gemini, Groq, OpenRouter)
Data Quality MinHash + semantic dedup, LLM-as-judge filtering, n-gram contamination check
Evaluation 10-benchmark suite: MMLU, GPQA, MATH, MBPP, IFEval + 4 custom domain evals
Quantization AWQ 4-bit + GGUF (Q3/Q4/Q5/Q8) with quality retention benchmarks
Serving vLLM inference benchmarking, HuggingFace Spaces deployment
Cost $0 — Kaggle T4, Colab Free, free API tiers only

Pipeline Architecture

flowchart TB
    subgraph DATA["Data Engineering"]
        T1["Gemini 2.5 Flash"]
        T2["Groq Llama 3.3 70B"]
        T3["OpenRouter gpt-oss-120b"]
        T1 & T2 & T3 -->|"Round-Robin"| GEN["Instruction Generator<br/>20K samples"]
        GEN --> MH["MinHash Dedup<br/>Jaccard 0.85"]
        MH --> SEM["Semantic Dedup<br/>Cosine 0.92"]
        SEM --> QF["LLM-as-Judge<br/>Quality Filter"]
        QF --> CC["Contamination Check<br/>13-gram vs MMLU/GPQA/MATH"]
        CC --> SPLIT["Stratified Split<br/>80/10/10"]
    end

    subgraph TRAIN["Training Pipeline"]
        SPLIT -->|"Train Set"| SFT["QLoRA SFT<br/>Qwen2.5-7B<br/>3 epochs, r=64"]
        SFT --> COT["CoT Distillation<br/>3K reasoning samples"]
        SFT -->|"+ Preference Pairs"| DPO["DPO Alignment<br/>beta=0.1, sigmoid"]
        SFT -->|"+ Preference Pairs"| ORPO["ORPO Alignment<br/>odds-ratio, lr=5e-6"]
        SPLIT -->|"Preferences"| RM["Reward Model<br/>DeBERTa-v3-base"]
    end

    subgraph EVAL["Evaluation & Deployment"]
        DPO & ORPO & COT --> BENCH["10-Benchmark Suite<br/>+ Bootstrap CIs"]
        BENCH --> QUANT["AWQ + GGUF<br/>Quantization"]
        QUANT --> SERVE["vLLM Serving<br/>+ HF Spaces Demo"]
    end

    style DATA fill:#e3f2fd,stroke:#1565c0
    style TRAIN fill:#e8f5e9,stroke:#2e7d32
    style EVAL fill:#fff3e0,stroke:#e65100

Training Configuration

QLoRA SFT Setup

Base Model:      Qwen/Qwen2.5-7B
Quantization:    4-bit NF4 (double quantization)
LoRA Rank:       64 (alpha=128)
Target Modules:  q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Effective Batch: 16 (4 x 4 gradient accumulation)
Learning Rate:   2e-4 (cosine schedule)
Max Seq Length:  2048
Compute:         Kaggle T4 (16GB VRAM, free)

Alignment Comparison

Parameter DPO ORPO
Loss Function Sigmoid Odds-Ratio
Learning Rate 5e-5 5e-6
Beta 0.1 0.1
LoRA Rank 32 32
Starting Point SFT model Base model
Reference Model Implicit (PEFT) Not needed

Data Engineering Pipeline

flowchart LR
    subgraph TEACHERS["Multi-Teacher Generation"]
        direction TB
        G["Gemini 2.5 Flash<br/>Free Tier"]
        Q["Groq Llama 3.3 70B<br/>Free Tier"]
        O["OpenRouter gpt-oss-120b<br/>Free Tier"]
    end

    subgraph QUALITY["Quality Pipeline"]
        direction TB
        D1["MinHash LSH<br/>128 permutations"]
        D2["Semantic Dedup<br/>all-MiniLM-L6-v2"]
        D3["LLM Judge<br/>4-criteria scoring"]
        D4["N-gram Check<br/>vs benchmarks"]
        D1 --> D2 --> D3 --> D4
    end

    subgraph OUTPUT["Datasets"]
        direction TB
        I["energy-instruct-20k<br/>ChatML format"]
        P["energy-preferences-5k<br/>chosen/rejected pairs"]
        C["energy-cot-3k<br/>think tag reasoning"]
    end

    TEACHERS -->|"20 energy topics<br/>4 prompt templates"| QUALITY
    QUALITY --> OUTPUT

    style TEACHERS fill:#fce4ec,stroke:#c62828
    style QUALITY fill:#f3e5f5,stroke:#6a1b9a
    style OUTPUT fill:#e0f2f1,stroke:#00695c

20 Domain Topics: Geothermal systems, nuclear reactor physics, solar PV engineering, wind turbine aerodynamics, reservoir engineering, thermodynamics, fluid dynamics, heat transfer, reactor safety, grid integration, molten salt reactors, well drilling, power plant optimization, energy storage, carbon capture, hydrogen fuel cells, thermal hydraulics, radiation shielding, seismic analysis, CFD for energy.

Evaluation Framework

10-benchmark evaluation with bootstrap confidence intervals:

Benchmark Type Metric Purpose
Energy QA Custom Accuracy Domain knowledge
Physics Calculations Custom Numerical tolerance (5%) Quantitative reasoning
Indonesian Energy Custom BLEU + Accuracy Bilingual capability
Safety Prompts Custom Refusal rate Safety alignment
MMLU (STEM) Standard 5-shot accuracy General knowledge
GPQA Diamond Standard 0-shot accuracy Hard science reasoning
MATH Standard 4-shot accuracy Mathematical reasoning
MBPP Standard pass@1 Code generation
IFEval Standard Strict accuracy Instruction following
LLM-as-Judge Gemini 4-criteria (1-5) Overall quality

Model Variants

flowchart LR
    BASE["Qwen2.5-7B<br/>Base Model"] --> SFT["EnergyLM-7B<br/>SFT"]
    SFT --> COT["EnergyLM-7B<br/>SFT + CoT"]
    SFT --> DPO["EnergyLM-7B<br/>DPO"]
    BASE --> ORPO["EnergyLM-7B<br/>ORPO"]

    DPO & ORPO --> BEST{{"Best Variant"}}
    BEST --> AWQ["AWQ 4-bit"]
    BEST --> Q4["GGUF Q4_K_M"]
    BEST --> Q8["GGUF Q8_0"]

    AWQ -->|"vLLM"| API["API Endpoint"]
    Q4 -->|"llama.cpp"| SPACES["HF Spaces<br/>Demo"]

    style BASE fill:#f5f5f5,stroke:#616161
    style SFT fill:#c8e6c9,stroke:#2e7d32
    style COT fill:#a5d6a7,stroke:#1b5e20
    style DPO fill:#bbdefb,stroke:#1565c0
    style ORPO fill:#b3e5fc,stroke:#0277bd
    style BEST fill:#fff9c4,stroke:#f57f17
    style AWQ fill:#ffe0b2,stroke:#e65100
    style Q4 fill:#ffe0b2,stroke:#e65100
    style Q8 fill:#ffe0b2,stroke:#e65100

Free Compute Strategy

Resource Purpose Quota
Kaggle T4 SFT, DPO, ORPO training 30 GPU-hrs/week
Colab Free T4 Reward model, eval runs ~4 hrs/session
Gemini Free Data generation, quality judge 1500 req/day
Groq Free Data generation (Llama 3.3 70B) 30 req/min
OpenRouter Free Data generation (gpt-oss-120b) 10 req/min
HuggingFace Spaces Model demo deployment Free CPU/GPU

Project Timeline

gantt
    title EnergyLM-7B Development Timeline
    dateFormat YYYY-MM-DD
    axisFormat %b %d

    section Data
    Foundation & Scripts        :done, w1, 2026-05-19, 7d
    Data Generation (20K)       :active, w1b, 2026-05-24, 7d
    CoT + Preferences           :w2b, after w1b, 5d
    Dedup + Filter + Publish    :w2c, after w2b, 2d

    section Training
    SFT QLoRA (3 epochs)        :w3a, 2026-06-02, 3d
    Ablation Studies (3 runs)   :w3b, after w3a, 3d
    CoT Distillation            :w3c, after w3b, 1d
    DPO Training                :w4a, 2026-06-09, 2d
    ORPO Training               :w4b, after w4a, 2d
    Reward Model (DeBERTa)      :w4c, after w4b, 1d

    section Eval & Deploy
    Full Benchmark Suite        :w4d, after w4c, 2d
    Quantization (AWQ + GGUF)   :w5a, 2026-06-16, 2d
    vLLM Benchmarks             :w5b, after w5a, 2d
    HF Spaces Deployment        :w5c, after w5b, 1d

    section Documentation
    Blog Post + Model Cards     :w6a, 2026-06-23, 3d
    OSS Contribution            :w6b, after w6a, 2d
    Public Release              :milestone, after w6b, 0d

Planned HuggingFace Artifacts

All published under the ForceX-AI organization:

Artifact Type Description
ForceX-AI/energy-instruct-20k Dataset 20K energy-domain instruction pairs (ChatML)
ForceX-AI/energy-preferences-5k Dataset 5K chosen/rejected preference pairs for DPO
ForceX-AI/EnergyLM-7B-SFT Model QLoRA SFT adapter + merged model
ForceX-AI/EnergyLM-7B-DPO Model DPO-aligned variant
ForceX-AI/EnergyLM-7B-ORPO Model ORPO-aligned variant
ForceX-AI/EnergyLM-7B-GGUF Model Quantized GGUF files (Q3/Q4/Q5/Q8)
ForceX-AI/EnergyLM-RewardModel Model DeBERTa-v3-base reward classifier

Technology Stack

Layer Technologies
Base Model Qwen/Qwen2.5-7B
Training PyTorch, Transformers, TRL, PEFT, bitsandbytes
Data datasketch, sentence-transformers, Gemini/Groq/OpenRouter APIs
Evaluation lm-evaluation-harness, custom benchmarks, Gemini judge
Quantization AutoAWQ, llama.cpp (GGUF)
Serving vLLM, Gradio, HuggingFace Spaces
Tracking Weights & Biases, HuggingFace Hub
CI/CD GitHub Actions (lint, test, data-validate)
Compute Kaggle T4, Colab Free, free LLM API tiers

Current Status

Week 1: COMPLETE — All foundation code, scripts, notebooks, eval framework, and infrastructure configured. Data generation running across 3 free-tier API backends.

Next: Complete data generation, run quality pipeline, begin SFT training on Kaggle T4.