# Validation Methodology: Compromised Validation Using Actuals

**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation

---

## Executive Summary

This validation uses **actual values instead of forecasts** for several key feature categories, due to API limitations preventing retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.

**Expected Impact**: Results will be **20-40% more optimistic** than production reality.

---

## Features Using Actuals (Not Forecasts)

### 1. Weather Features (375 features)

**Compromise**: Using actual weather values instead of weather forecasts

**Production Reality**: Weather forecasts contain errors that propagate to flow predictions

**Impact**:
- Weather forecast errors are typically 1-3°C for temperature, 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts

**Why Compromised**:
- OpenMeteo API does not provide historical forecast archives
- Only current forecasts + historical actuals are available
- Cannot reconstruct the "forecast as of Oct 1" for October validation
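For reference, the actuals used in this compromise can be pulled from Open-Meteo's historical archive endpoint. A minimal sketch (coordinates, date range, and variable list are illustrative placeholders; a production run would call a forecast endpoint instead):

```python
# Sketch: retrieve *actual* weather from Open-Meteo's archive endpoint.
# In production this call would be replaced by a day-ahead forecast
# request; coordinates and variables are illustrative placeholders.
import requests

resp = requests.get(
    "https://archive-api.open-meteo.com/v1/archive",
    params={
        "latitude": 50.85,            # placeholder grid point
        "longitude": 4.35,
        "start_date": "2025-09-01",
        "end_date": "2025-09-30",
        "hourly": "temperature_2m,wind_speed_10m",
    },
    timeout=30,
)
resp.raise_for_status()
hourly = resp.json()["hourly"]  # parallel lists keyed by variable name
```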
### 2. CNEC Outage Features (176 features)

**Compromise**: Using actual outages instead of planned outage forecasts

**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)

**Impact**:
- Outage forecast accuracy is ~80-90% (planned outages are fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts

**Why Compromised**:
- ENTSO-E Transparency API does not easily expose outage version history
- Could potentially be collected with advanced queries (future work)
- Current dataset contains final outage data, not forecasts
### 3. LTA Features (40 features)

**Compromise**: Using actual LTA values instead of values forward-filled from D+0

**Production Reality**: LTA is published weeks ahead, so uncertainty is minimal

**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)

**Why Compromised**:
- The JAO API could provide this, but it requires additional implementation
- LTA uncertainty is minimal compared to weather/load forecasts
### 4. Load Forecast Features (12 features)

**Compromise**: Using actual demand instead of day-ahead load forecasts

**Production Reality**: Load forecasts have 1-3% MAPE

**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts

**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
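Closing this gap looks mechanically simple; a hedged sketch of the missing collection step, assuming entsoe-py's documented client methods (token, bidding zone, and date range are placeholders):

```python
# Sketch of the missing collection step, assuming the entsoe-py client.
# Token, bidding zone, and date range are placeholders.
import pandas as pd
from entsoe import EntsoePandasClient

client = EntsoePandasClient(api_key="YOUR_TOKEN")  # placeholder token
start = pd.Timestamp("2025-09-01", tz="Europe/Brussels")
end = pd.Timestamp("2025-10-01", tz="Europe/Brussels")

# Day-ahead total load forecast vs. actual load for one zone
forecast = client.query_load_forecast("DE_LU", start=start, end=end)
actual = client.query_load("DE_LU", start=start, end=end)
```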
---

## Features Using Correct Data (No Compromise)

### Temporal Features (12 features)

- Hour, day, month, weekday encodings
- **Always known perfectly** - no forecast error possible
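A minimal sketch of how such encodings are typically derived; the cyclical hour encoding shown here is illustrative, not necessarily the exact scheme used in the feature set:

```python
# Sketch of deterministic calendar features; calendar fields come
# straight from the target timestamp, so there is no forecast error.
import numpy as np
import pandas as pd

def temporal_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    return pd.DataFrame({
        "hour_sin": np.sin(2 * np.pi * index.hour / 24),
        "hour_cos": np.cos(2 * np.pi * index.hour / 24),
        "weekday": index.weekday,   # 0 = Monday
        "month": index.month,
        "day": index.day,
    }, index=index)
```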
### Historical Features (1,899 features)

- Prices, generation, demand, lags, CNEC bindings
- **Only used in context window** - not forecast ahead
- Correct usage: these are known values up to run_date
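A minimal sketch of the time-aware guard this implies (function name, window length, and column handling are illustrative, not the actual extraction code):

```python
# Sketch of the leakage guard: only history strictly before run_date
# enters the context window. Window length is illustrative.
import pandas as pd

def build_context(history: pd.DataFrame,
                  run_date: pd.Timestamp,
                  context_hours: int = 168) -> pd.DataFrame:
    """Return the last `context_hours` rows strictly before run_date."""
    past = history.loc[history.index < run_date]
    return past.tail(context_hours)
```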
---

## Expected Optimism Bias Summary

| Feature Category | Count | Forecast Error | Bias Contribution |
|------------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |

**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.
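The arithmetic behind this interpretation, as a one-line helper (no assumptions beyond the +20-40% band above):

```python
# Pure arithmetic: map a compromised-validation MAE to the expected
# production range using the +20-40% bias band.
def production_mae_range(validation_mae: float,
                         bias=(0.20, 0.40)) -> tuple[float, float]:
    return validation_mae * (1 + bias[0]), validation_mae * (1 + bias[1])

print(production_mae_range(100.0))  # (120.0, 140.0)
```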
---

## Validation Framing

### What This Validation Proves

- ✅ **Pipeline Correctness**: DynamicForecast system works mechanically
- ✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
- ✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
- ✅ **Lower Bound**: Establishes best-case performance envelope
- ✅ **Comparative Studies**: Fair baseline for model comparisons

### What This Validation Does NOT Prove

- ❌ **Production Accuracy**: Real MAE will be 20-40% higher
- ❌ **Operational Readiness**: Requires prospective validation
- ❌ **Feature Importance**: Cannot isolate weather vs. structural effects
- ❌ **Forecast Skill**: Using perfect information, not forecasts

---

## Precedents in ML Forecasting Literature

This compromised approach is **common and accepted** in ML research when properly documented:

### Academic Precedents

1. **IEEE Power & Energy Society journals**:
   - Many load/renewable forecasting papers use actual weather for validation
   - Framed as "perfect weather information" scenarios
   - Cited to establish theoretical performance bounds
2. **Energy forecasting competitions**:
   - Some tracks explicitly provide actual values for covariates
   - Focus is on model architecture, not forecast accuracy
   - Clearly labeled as "oracle" scenarios
3. **Weather-dependent forecasting**:
   - Wind power forecasting research often uses actual wind observations
   - Standard practice when evaluating model capacity independently

### Key Requirement

**Explicit documentation** of limitations (as provided in this document).

---

## Mitigation Strategies

### 1. Clear Communication

- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as a "lower bound on production MAE"
- Never claim production readiness without prospective validation

### 2. Ablation Studies (Future Work)

- Remove weather features → measure MAE increase
- Remove outage features → measure contribution
- Quantify: "weather contributes ~X MW to MAE"
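A sketch of what this ablation loop could look like; `evaluate_mae` is a hypothetical placeholder for the existing backtest harness, and the group-to-column mapping would come from the feature registry:

```python
# Sketch of the ablation loop. evaluate_mae is a placeholder for the
# backtest harness; groups maps a group name to its feature columns.
import pandas as pd

def evaluate_mae(features: pd.DataFrame) -> float:
    """Placeholder: run the backtest on this feature set, return MAE (MW)."""
    raise NotImplementedError

def ablation_report(features: pd.DataFrame,
                    groups: dict[str, list[str]]) -> dict[str, float]:
    baseline = evaluate_mae(features)  # full feature set
    # MAE increase when a group is removed = that group's contribution
    return {
        name: evaluate_mae(features.drop(columns=cols)) - baseline
        for name, cols in groups.items()
    }
```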
### 3. Synthetic Forecast Degradation (Future Work)

- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production
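A sketch of this degradation step, assuming the weather and load columns can be selected by name; the noise magnitudes follow the bullets above:

```python
# Sketch: degrade the "oracle" actuals with forecast-like noise, then
# re-evaluate. Column lists are placeholders for the real feature names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducible noise

def degrade(features: pd.DataFrame,
            temp_cols: list[str],
            load_cols: list[str]) -> pd.DataFrame:
    noisy = features.copy()
    # Additive Gaussian noise for temperature (sigma = 2 degrees C)
    noisy[temp_cols] += rng.normal(0.0, 2.0, size=noisy[temp_cols].shape)
    # Multiplicative Gaussian noise for load (sigma = 0.02, ~2% MAPE)
    noisy[load_cols] *= rng.normal(1.0, 0.02, size=noisy[load_cols].shape)
    return noisy
```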
### 4. Prospective Validation (November 2025+)

- Collect proper forecasts daily, starting Nov 1
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)
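The key mechanical requirement is stamping every snapshot with its issue date so "forecast as of day D" can be reconstructed later. A minimal sketch (filename convention and storage format are our choice, not prescribed):

```python
# Sketch: persist each day's forecast frame stamped with its issue date,
# so "forecast as of day D" can be replayed later.
from datetime import date
from pathlib import Path
import pandas as pd

def snapshot(df: pd.DataFrame, name: str,
             out_dir: Path = Path("forecast_snapshots")) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}_{date.today().isoformat()}.parquet"
    df.to_parquet(path)  # requires pyarrow or fastparquet
    return path
```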
---

## Comparison to Baseline Models

Even with compromised validation, comparisons are **valid** if:

- ✅ **All models use the same compromised data** (fair comparison)
- ✅ **Baseline models are clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE

Example:

```
Model               | Sept MAE (Compromised) | Relative to Persistence
--------------------|------------------------|------------------------
Persistence         | 250 MW                 | 1.00x (baseline)
Seasonal Naive      | 210 MW                 | 0.84x
Chronos 2 (ours)    | 120 MW                 | 0.48x ← Valid comparison
```
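The same comparison as code, using the example numbers above:

```python
# Relative skill vs. persistence, using the example numbers above.
maes = {"persistence": 250.0, "seasonal_naive": 210.0, "chronos2": 120.0}
relative = {model: mae / maes["persistence"] for model, mae in maes.items()}
print(relative)  # {'persistence': 1.0, 'seasonal_naive': 0.84, 'chronos2': 0.48}
```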
---

## Validation Results Interpretation Guide

### If Sept MAE = 100 MW

- **Lower bound established**: pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: estimate straddles the 134 MW target ✅ Good sign
- **Action**: proceed to prospective validation

### If Sept MAE = 150 MW

- **Lower bound established**: 150 MW with perfect information
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: well above the 134 MW target ❌ Problem
- **Action**: investigate errors before production

### If Sept MAE = 200+ MW

- **Systematic issue**: even perfect information is insufficient
- **Action**: debug feature engineering, check for bugs
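This guide reduces to a small decision rule; a sketch mirroring the thresholds above (134 MW target, +20-40% bias band):

```python
# Decision rule mirroring the guide above: compare the estimated
# production range (validation MAE +20-40%) against the 134 MW target.
def interpret(validation_mae: float, target: float = 134.0) -> str:
    low, high = validation_mae * 1.2, validation_mae * 1.4
    if validation_mae >= 200:
        return "systematic issue: debug feature engineering and pipeline"
    if low <= target:
        return (f"good sign (est. production {low:.0f}-{high:.0f} MW); "
                "proceed to prospective validation")
    return (f"above target (est. production {low:.0f}-{high:.0f} MW); "
            "investigate errors before production")
```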
---

## Recommended Reporting Language

### Good ✅

> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."

### Acceptable ✅

> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish a performance ceiling; prospective validation with real forecasts is required for operational deployment."

### Misleading ❌

> "The system achieves 120 MW MAE on validation data and is ready for production."

*(Omits limitations, implies production readiness)*

---

## Conclusion

This compromised validation approach is:

- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation

**Next Steps**:
1. Run the Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective-validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days

---

**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."