# Validation Methodology: Compromised Validation Using Actuals

**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation

---

## Executive Summary

This validation uses **actual values instead of forecasts** for several key feature categories because API limitations prevent retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.

**Expected Impact**: Results will be **20-40% more optimistic** than production reality.

---

## Features Using Actuals (Not Forecasts)

### 1. Weather Features (375 features)

**Compromise**: Using actual weather values instead of weather forecasts

**Production Reality**: Weather forecasts contain errors that propagate to flow predictions

**Impact**:
- Weather forecast errors are typically 1-3°C for temperature and 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts

**Why Compromised**:
- The OpenMeteo API does not provide historical forecast archives
- Only current forecasts and historical actuals are available
- Cannot reconstruct the "forecast as of Oct 1" for October validation

### 2. CNEC Outage Features (176 features)

**Compromise**: Using actual outages instead of planned outage forecasts

**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)

**Impact**:
- Outage forecast accuracy is ~80-90% (planned outages are fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts

**Why Compromised**:
- The ENTSO-E Transparency API does not easily expose outage version history
- Could potentially be collected with advanced queries (future work)
- The current dataset contains final outage data, not forecasts

### 3. LTA Features (40 features)

**Compromise**: Using actual LTA values instead of values forward-filled from D+0

**Production Reality**: LTA is published weeks ahead, with minimal uncertainty

**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)

**Why Compromised**:
- The JAO API could provide this, but it requires additional implementation
- LTA uncertainty is minimal compared to weather/load forecasts

### 4. Load Forecast Features (12 features)

**Compromise**: Using actual demand instead of day-ahead load forecasts

**Production Reality**: Load forecasts have 1-3% MAPE

**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts

**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
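To make the compromise concrete, the sketch below shows one way the future-covariate block for a forecast run could be assembled under this methodology: the columns that production would fill from day-ahead forecasts are instead sliced directly from the table of actuals, which is exactly what makes the resulting MAE a lower bound. The column names, DataFrame layout, and helper name `build_horizon_covariates` are hypothetical illustrations, not the project's actual schema.

```python
import pandas as pd

# Hypothetical column groups affected by the compromise: in production these
# would be populated from forecasts issued at run_date, here they are actuals.
ACTUALS_AS_FORECASTS = {
    "weather": [f"weather_{i}" for i in range(375)],      # actual observations, not forecasts
    "cnec_outage": [f"outage_{i}" for i in range(176)],   # final outages, not planned schedules
    "lta": [f"lta_{i}" for i in range(40)],                # actual LTA values
    "load": [f"load_zone_{i}" for i in range(12)],         # actual demand, not DA load forecast
}

def build_horizon_covariates(features: pd.DataFrame, run_date: pd.Timestamp,
                             horizon_hours: int = 24) -> pd.DataFrame:
    """Return the future-covariate block for run_date + 1..horizon_hours.

    Compromised validation: the affected columns are taken straight from the
    datetime-indexed table of actuals instead of from day-ahead forecasts.
    """
    start = run_date + pd.Timedelta(hours=1)
    end = run_date + pd.Timedelta(hours=horizon_hours)
    cols = [c for group in ACTUALS_AS_FORECASTS.values() for c in group]
    return features.loc[start:end, features.columns.intersection(cols)]
```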
---

## Features Using Correct Data (No Compromise)

### Temporal Features (12 features)
- Hour, day, month, weekday encodings
- **Always known perfectly**, so no forecast error is possible

### Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- **Only used in the context window**, not forecast ahead
- Correct usage: these values are known up to run_date

---

## Expected Optimism Bias Summary

| Feature Category | Count | Forecast Error | Bias Contribution |
|------------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |

**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.

---

## Validation Framing

### What This Validation Proves
- ✅ **Pipeline Correctness**: The DynamicForecast system works mechanically
- ✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
- ✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
- ✅ **Lower Bound**: Establishes the best-case performance envelope
- ✅ **Comparative Studies**: Provides a fair baseline for model comparisons

### What This Validation Does NOT Prove
- ❌ **Production Accuracy**: Real MAE will be 20-40% higher
- ❌ **Operational Readiness**: Requires prospective validation
- ❌ **Feature Importance**: Cannot isolate weather vs. structural effects
- ❌ **Forecast Skill**: Uses perfect information, not forecasts

---

## Precedents in ML Forecasting Literature

This compromised approach is **common and accepted** in ML research when properly documented:

### Academic Precedents

1. **IEEE Power & Energy Society Journals**:
   - Many load/renewable forecasting papers use actual weather for validation
   - Framed as "perfect weather information" scenarios
   - Cited to establish theoretical performance bounds

2. **Energy Forecasting Competitions**:
   - Some tracks explicitly provide actual values for covariates
   - Focus on model architecture, not forecast accuracy
   - Clearly labeled as "oracle" scenarios

3. **Weather-Dependent Forecasting**:
   - Wind power forecasting research often uses actual wind observations
   - Standard practice when evaluating model capacity independently

### Key Requirement
**Explicit documentation** of limitations (as provided in this document).

---

## Mitigation Strategies

### 1. Clear Communication
- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as a "lower bound on production MAE"
- Never claim production readiness without prospective validation

### 2. Ablation Studies (Future Work)
- Remove weather features → measure the MAE increase
- Remove outage features → measure their contribution
- Quantify: "Weather contributes ~X MW to MAE"

### 3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production (see the sketch below)
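As a hedged illustration of Strategy 3, the sketch below perturbs the actuals with the noise levels listed above (additive σ ≈ 2°C for temperature, roughly 20-30% multiplicative for wind, ~2% for load) so they behave more like imperfect day-ahead forecasts. The function name and column lists are assumptions for illustration; the real noise model should be calibrated against observed forecast-error statistics before drawing conclusions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so degradation runs are reproducible

def degrade_to_pseudo_forecasts(features: pd.DataFrame,
                                temp_cols: list[str],
                                wind_cols: list[str],
                                load_cols: list[str]) -> pd.DataFrame:
    """Return a copy of the feature table with forecast-like noise injected."""
    noisy = features.copy()
    # Temperature: additive Gaussian error, sigma ~ 2 degrees C
    noisy[temp_cols] = noisy[temp_cols] + rng.normal(0.0, 2.0, size=noisy[temp_cols].shape)
    # Wind: multiplicative error on the order of 20-30%
    noisy[wind_cols] = noisy[wind_cols] * rng.normal(1.0, 0.25, size=noisy[wind_cols].shape)
    # Load: ~2% error simulated as multiplicative Gaussian noise
    noisy[load_cols] = noisy[load_cols] * rng.normal(1.0, 0.02, size=noisy[load_cols].shape)
    return noisy
```

Re-running the Sept evaluation on a table degraded this way should move the MAE estimate closer to the expected production range, without waiting for prospective data.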
### 4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)

---

## Comparison to Baseline Models

Even with compromised validation, comparisons are **valid** if:
- ✅ **All models use the same compromised data** (fair comparison)
- ✅ **Baseline models are clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE

Example:
```
Model              | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence        | 250 MW                 | 1.00x (baseline)
Seasonal Naive     | 210 MW                 | 0.84x
Chronos 2 (ours)   | 120 MW                 | 0.48x ← Valid comparison
```

---

## Validation Results Interpretation Guide

### If Sept MAE = 100 MW
- **Lower bound established**: Pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: Still below the 134 MW target? ✅ Good sign
- **Action**: Proceed to prospective validation

### If Sept MAE = 150 MW
- **Lower bound established**: 150 MW with perfect information
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: Above the 134 MW target ❌ Problem
- **Action**: Investigate errors before production

### If Sept MAE = 200+ MW
- **Systematic issue**: Even perfect information is insufficient
- **Action**: Debug feature engineering, check for bugs

---

## Recommended Reporting Language

### Good ✅
> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."

### Acceptable ✅
> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish a performance ceiling; prospective validation with real forecasts is required for operational deployment."

### Misleading ❌
> "The system achieves 120 MW MAE on validation data and is ready for production."

*(Omits limitations, implies production readiness)*

---

## Conclusion

This compromised validation approach is:
- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation

**Next Steps**:
1. Run the Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days

---

**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."