# Validation Methodology: Compromised Validation Using Actuals
**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation
---
## Executive Summary
This validation uses **actual values instead of forecasts** for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.
**Expected Impact**: Results will be **20-40% more optimistic** than production reality.
---
## Features Using Actuals (Not Forecasts)
### 1. Weather Features (375 features)
**Compromise**: Using actual weather values instead of weather forecasts
**Production Reality**: Weather forecasts contain errors that propagate to flow predictions
**Impact**:
- Weather forecast errors typically 1-3°C for temperature, 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts
**Why Compromised**:
- OpenMeteo API does not provide historical forecast archives
- Only current forecasts + historical actuals available
- Cannot reconstruct "forecast as of Oct 1" for October validation
### 2. CNEC Outage Features (176 features)
**Compromise**: Using actual outages instead of planned outage forecasts
**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)
**Impact**:
- Outage forecast accuracy ~80-90% (planned outages fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts
**Why Compromised**:
- ENTSO-E Transparency API does not easily expose outage version history
- Could potentially collect with advanced queries (future work)
- Current dataset contains final outage data, not forecasts
### 3. LTA Features (40 features)
**Compromise**: Using actual LTA values instead of values forward-filled from D+0
**Production Reality**: LTA published weeks ahead, minimal uncertainty
**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)
**Why Compromised**:
- JAO API could provide this, but requires additional implementation
- LTA uncertainty minimal compared to weather/load forecasts
### 4. Load Forecast Features (12 features)
**Compromise**: Using actual demand instead of day-ahead load forecasts
**Production Reality**: Load forecasts typically have 1-3% MAPE
**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts
**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
---
## Features Using Correct Data (No Compromise)
### Temporal Features (12 features)
- Hour, day, month, weekday encodings
- **Always known perfectly** - no forecast error possible
### Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- **Only used in context window** - not forecast ahead
- Correct usage: these values are known up to run_date (see the sketch below)
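As a minimal sketch of how this time-aware cut can be enforced (assuming a pandas DataFrame with a DatetimeIndex and a hypothetical `run_date`; column handling and the actual DynamicForecast interface may differ):
```python
import pandas as pd

def split_context_and_horizon(features: pd.DataFrame, run_date: pd.Timestamp,
                              horizon_hours: int = 24):
    """Split features so historical values are only visible up to run_date.

    Rows at or before run_date form the context window; rows after it form
    the forecast horizon and must only contain covariates known in advance
    (temporal encodings, forecasts), never realized historical values.
    """
    context = features.loc[:run_date]
    horizon = features.loc[run_date + pd.Timedelta(hours=1):
                           run_date + pd.Timedelta(hours=horizon_hours)]
    return context, horizon
```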
---
## Expected Optimism Bias Summary
| Feature Category | Count | Forecast Error | Bias Contribution |
|-----------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |
**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.
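As a rough illustration, treating the per-category biases as approximately additive (a simplifying assumption; the headline +20-40% rounds this combination):
```python
# Assumed-additive combination of per-category optimism bias ranges (fractions of MAE).
bias_ranges = {
    "weather": (0.15, 0.25),
    "load_forecasts": (0.05, 0.10),
    "cnec_outages": (0.03, 0.07),
    "lta": (0.00, 0.01),
}

low = sum(lo for lo, _ in bias_ranges.values())   # ~0.23
high = sum(hi for _, hi in bias_ranges.values())  # ~0.43

validation_mae = 100.0  # MW, hypothetical validation result
print(f"Expected production MAE: {validation_mae * (1 + low):.0f}"
      f"-{validation_mae * (1 + high):.0f} MW")   # roughly 123-143 MW
```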
---
## Validation Framing
### What This Validation Proves
- ✅ **Pipeline Correctness**: DynamicForecast system works mechanically
- ✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
- ✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
- ✅ **Lower Bound**: Establishes best-case performance envelope
- ✅ **Comparative Studies**: Fair baseline for model comparisons
### What This Validation Does NOT Prove
- ❌ **Production Accuracy**: Real MAE will be 20-40% higher
- ❌ **Operational Readiness**: Requires prospective validation
- ❌ **Feature Importance**: Cannot isolate weather vs. structural effects
- ❌ **Forecast Skill**: Using perfect information, not forecasts
---
## Precedents in ML Forecasting Literature
This compromised approach is **common and accepted** in ML research when properly documented:
### Academic Precedents
1. **IEEE Power & Energy Society Journals**:
- Many load/renewable forecasting papers use actual weather for validation
- Framed as "perfect weather information" scenarios
- Cited to establish theoretical performance bounds
2. **Energy Forecasting Competitions**:
- Some tracks explicitly provide actual values for covariates
- Focus on model architecture, not forecast accuracy
- Clearly labeled as "oracle" scenarios
3. **Weather-Dependent Forecasting**:
- Wind power forecasting research often uses actual wind observations
- Standard practice when evaluating model capacity independently
### Key Requirement
**Explicit documentation** of limitations (as provided in this document).
---
## Mitigation Strategies
### 1. Clear Communication
- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as "lower bound on production MAE"
- Never claim production-ready without prospective validation
### 2. Ablation Studies (Future Work)
- Remove weather features → measure MAE increase
- Remove outage features → measure contribution
- Quantify: "Weather contributes ~X MW to MAE" (see the ablation sketch below)
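A minimal sketch of such an ablation loop; the feature-group column prefixes and the `evaluate_mae` callable are hypothetical placeholders, not the project's actual API:
```python
import pandas as pd

# Hypothetical column-name prefixes identifying each feature group.
FEATURE_GROUPS = {
    "weather": "weather_",
    "cnec_outages": "outage_",
    "lta": "lta_",
}

def ablation_study(features: pd.DataFrame, evaluate_mae) -> dict:
    """Measure how much each feature group contributes to accuracy.

    evaluate_mae(features) is assumed to run the full validation and
    return MAE in MW.
    """
    baseline = evaluate_mae(features)
    results = {"baseline_mae_mw": baseline}
    for name, prefix in FEATURE_GROUPS.items():
        ablated = features.drop(columns=[c for c in features.columns
                                         if c.startswith(prefix)])
        results[name] = evaluate_mae(ablated) - baseline  # MAE increase in MW
    return results
```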
### 3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production (see the sketch below)
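A minimal sketch of the degradation step, assuming temperature columns in °C and load columns in MW identified by hypothetical prefixes; the noise levels follow the bullets above:
```python
import numpy as np
import pandas as pd

def degrade_to_pseudo_forecasts(features: pd.DataFrame,
                                temp_sigma_c: float = 2.0,
                                load_rel_error: float = 0.02,
                                seed: int = 42) -> pd.DataFrame:
    """Add synthetic forecast error to actuals so they behave more like real forecasts."""
    rng = np.random.default_rng(seed)
    noisy = features.copy()

    temp_cols = [c for c in noisy.columns if c.startswith("weather_temp_")]  # hypothetical prefix
    load_cols = [c for c in noisy.columns if c.startswith("load_")]          # hypothetical prefix

    # Additive Gaussian error for temperature (sigma = 2 degC).
    noisy[temp_cols] += rng.normal(0.0, temp_sigma_c, size=noisy[temp_cols].shape)
    # Multiplicative Gaussian error for load (~2% relative, approximating a 2% MAPE).
    noisy[load_cols] *= 1.0 + rng.normal(0.0, load_rel_error, size=noisy[load_cols].shape)
    return noisy
```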
### 4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1 (see the archiving sketch below)
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)
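One simple way to support the prospective run is to snapshot each day's forecast inputs keyed by issue date. The sketch below assumes a hypothetical `fetch_forecasts` callable; in practice it would wrap the OpenMeteo/ENTSO-E/JAO collectors already in the pipeline:
```python
from datetime import date
from pathlib import Path

def archive_daily_forecasts(fetch_forecasts, out_dir: str = "forecast_snapshots") -> Path:
    """Store today's day-ahead forecasts so prospective validation can replay them later.

    fetch_forecasts() is assumed to return a DataFrame of weather/load/outage
    forecasts as available right now (the "as-of" view for today's run).
    """
    issue_date = date.today().isoformat()
    snapshot = fetch_forecasts()
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    file = path / f"forecasts_asof_{issue_date}.parquet"
    snapshot.to_parquet(file)
    return file
```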
---
## Comparison to Baseline Models
Even with compromised validation, comparisons are **valid** if:
- ✅ **All models use same compromised data** (fair comparison)
- ✅ **Baseline models clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE
Example:
```
Model | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence | 250 MW | 1.00x (baseline)
Seasonal Naive | 210 MW | 0.84x
Chronos 2 (ours) | 120 MW | 0.48x ← Valid comparison
```
---
## Validation Results Interpretation Guide
### If Sept MAE = 100 MW
- **Lower bound established**: Pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: Still below 134 MW target? ✅ Good sign
- **Action**: Proceed to prospective validation
### If Sept MAE = 150 MW
- **Lower bound established**: 150 MW with perfect info
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: Above 134 MW target ❌ Problem
- **Action**: Investigate errors before production
### If Sept MAE = 200+ MW
- **Systematic issue**: Even perfect information insufficient
- **Action**: Debug feature engineering, check for bugs
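The guide above can be codified in a small helper; the 1.2-1.4 multiplier and the 134 MW target come from this document, while the function itself is illustrative:
```python
def interpret_validation_mae(sept_mae_mw: float, target_mw: float = 134.0) -> str:
    """Translate a compromised-validation MAE into an expected production range and verdict."""
    prod_low, prod_high = sept_mae_mw * 1.2, sept_mae_mw * 1.4
    if sept_mae_mw >= 200:
        verdict = "systematic issue even with perfect information: debug features/pipeline"
    elif prod_low > target_mw:
        verdict = "expected production MAE above target: investigate before production"
    else:
        verdict = "within reach of target: proceed to prospective validation"
    return (f"Validation MAE {sept_mae_mw:.0f} MW -> expected production "
            f"{prod_low:.0f}-{prod_high:.0f} MW; {verdict}")
```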
---
## Recommended Reporting Language
### Good ✅
> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."
### Acceptable ✅
> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."
### Misleading ❌
> "The system achieves 120 MW MAE on validation data and is ready for production."
*(Omits limitations, implies production readiness)*
---
## Conclusion
This compromised validation approach is:
- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation
**Next Steps**:
1. Run Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days
---
**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."