# Validation Methodology: Compromised Validation Using Actuals

**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation

---

## Executive Summary

This validation uses **actual values instead of forecasts** for several key feature categories because API limitations prevent retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.

**Expected Impact**: Results will be **20-40% more optimistic** than production reality.

---

## Features Using Actuals (Not Forecasts)

### 1. Weather Features (375 features)

**Compromise**: Using actual weather values instead of weather forecasts

**Production Reality**: Weather forecasts contain errors that propagate to flow predictions

**Impact**:
- Weather forecast errors are typically 1-3°C for temperature and 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts

**Why Compromised**:
- The OpenMeteo API does not provide historical forecast archives
- Only current forecasts and historical actuals are available
- Cannot reconstruct the "forecast as of Oct 1" for October validation

### 2. CNEC Outage Features (176 features)

**Compromise**: Using actual outages instead of planned outage forecasts

**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)

**Impact**:
- Outage forecast accuracy is ~80-90% (planned outages are fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts

**Why Compromised**:
- The ENTSO-E Transparency API does not easily expose outage version history
- Could potentially be collected with advanced queries (future work)
- The current dataset contains final outage data, not forecasts

### 3. LTA Features (40 features)

**Compromise**: Using actual LTA values instead of values forward-filled from D+0

**Production Reality**: LTA is published weeks ahead, with minimal uncertainty

**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)

**Why Compromised**:
- The JAO API could provide this, but it requires additional implementation
- LTA uncertainty is minimal compared to weather/load forecasts

### 4. Load Forecast Features (12 features)

**Compromise**: Using actual demand instead of day-ahead load forecasts

**Production Reality**: Load forecasts have 1-3% MAPE

**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts

**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
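To make the compromise concrete, the sketch below shows one way the future-covariate block for a forecast run could be assembled under this methodology: the columns that production would fill from day-ahead forecasts are instead sliced directly from the table of actuals, which is exactly what makes the resulting MAE a lower bound. The column names, DataFrame layout, and helper name `build_horizon_covariates` are hypothetical illustrations, not the project's actual schema.

```python
import pandas as pd

# Hypothetical column groups affected by the compromise: in production these
# would be populated from forecasts issued at run_date, here they are actuals.
ACTUALS_AS_FORECASTS = {
    "weather": [f"weather_{i}" for i in range(375)],      # actual observations, not forecasts
    "cnec_outage": [f"outage_{i}" for i in range(176)],   # final outages, not planned schedules
    "lta": [f"lta_{i}" for i in range(40)],                # actual LTA values
    "load": [f"load_zone_{i}" for i in range(12)],         # actual demand, not DA load forecast
}

def build_horizon_covariates(features: pd.DataFrame, run_date: pd.Timestamp,
                             horizon_hours: int = 24) -> pd.DataFrame:
    """Return the future-covariate block for run_date + 1..horizon_hours.

    Compromised validation: the affected columns are taken straight from the
    datetime-indexed table of actuals instead of from day-ahead forecasts.
    """
    start = run_date + pd.Timedelta(hours=1)
    end = run_date + pd.Timedelta(hours=horizon_hours)
    cols = [c for group in ACTUALS_AS_FORECASTS.values() for c in group]
    return features.loc[start:end, features.columns.intersection(cols)]
```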
---

## Features Using Correct Data (No Compromise)

### Temporal Features (12 features)
- Hour, day, month, weekday encodings
- **Always known perfectly**, so no forecast error is possible

### Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- **Only used in the context window**, not forecast ahead
- Correct usage: these values are known up to run_date

---

## Expected Optimism Bias Summary

| Feature Category | Count | Forecast Error | Bias Contribution |
|------------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |

**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.

---

## Validation Framing

### What This Validation Proves
- ✅ **Pipeline Correctness**: The DynamicForecast system works mechanically
- ✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
- ✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
- ✅ **Lower Bound**: Establishes the best-case performance envelope
- ✅ **Comparative Studies**: Provides a fair baseline for model comparisons

### What This Validation Does NOT Prove
- ❌ **Production Accuracy**: Real MAE will be 20-40% higher
- ❌ **Operational Readiness**: Requires prospective validation
- ❌ **Feature Importance**: Cannot isolate weather vs. structural effects
- ❌ **Forecast Skill**: Uses perfect information, not forecasts

---

## Precedents in ML Forecasting Literature

This compromised approach is **common and accepted** in ML research when properly documented:

### Academic Precedents

1. **IEEE Power & Energy Society Journals**:
   - Many load/renewable forecasting papers use actual weather for validation
   - Framed as "perfect weather information" scenarios
   - Cited to establish theoretical performance bounds

2. **Energy Forecasting Competitions**:
   - Some tracks explicitly provide actual values for covariates
   - Focus on model architecture, not forecast accuracy
   - Clearly labeled as "oracle" scenarios

3. **Weather-Dependent Forecasting**:
   - Wind power forecasting research often uses actual wind observations
   - Standard practice when evaluating model capacity independently

### Key Requirement
**Explicit documentation** of limitations (as provided in this document).

---

## Mitigation Strategies

### 1. Clear Communication
- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as a "lower bound on production MAE"
- Never claim production readiness without prospective validation

### 2. Ablation Studies (Future Work)
- Remove weather features → measure the MAE increase
- Remove outage features → measure their contribution
- Quantify: "Weather contributes ~X MW to MAE"

### 3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production (see the sketch below)
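As a hedged illustration of Strategy 3, the sketch below perturbs the actuals with the noise levels listed above (additive σ ≈ 2°C for temperature, roughly 20-30% multiplicative for wind, ~2% for load) so they behave more like imperfect day-ahead forecasts. The function name and column lists are assumptions for illustration; the real noise model should be calibrated against observed forecast-error statistics before drawing conclusions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so degradation runs are reproducible

def degrade_to_pseudo_forecasts(features: pd.DataFrame,
                                temp_cols: list[str],
                                wind_cols: list[str],
                                load_cols: list[str]) -> pd.DataFrame:
    """Return a copy of the feature table with forecast-like noise injected."""
    noisy = features.copy()
    # Temperature: additive Gaussian error, sigma ~ 2 degrees C
    noisy[temp_cols] = noisy[temp_cols] + rng.normal(0.0, 2.0, size=noisy[temp_cols].shape)
    # Wind: multiplicative error on the order of 20-30%
    noisy[wind_cols] = noisy[wind_cols] * rng.normal(1.0, 0.25, size=noisy[wind_cols].shape)
    # Load: ~2% error simulated as multiplicative Gaussian noise
    noisy[load_cols] = noisy[load_cols] * rng.normal(1.0, 0.02, size=noisy[load_cols].shape)
    return noisy
```

Re-running the Sept evaluation on a table degraded this way should move the MAE estimate closer to the expected production range, without waiting for prospective data.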
### 4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)

---

## Comparison to Baseline Models

Even with compromised validation, comparisons are **valid** if:
- ✅ **All models use the same compromised data** (fair comparison)
- ✅ **Baseline models are clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE

Example:
```
Model              | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence        | 250 MW                 | 1.00x (baseline)
Seasonal Naive     | 210 MW                 | 0.84x
Chronos 2 (ours)   | 120 MW                 | 0.48x ← Valid comparison
```

---

## Validation Results Interpretation Guide

### If Sept MAE = 100 MW
- **Lower bound established**: Pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: Still below the 134 MW target? ✅ Good sign
- **Action**: Proceed to prospective validation

### If Sept MAE = 150 MW
- **Lower bound established**: 150 MW with perfect information
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: Above the 134 MW target ❌ Problem
- **Action**: Investigate errors before production

### If Sept MAE = 200+ MW
- **Systematic issue**: Even perfect information is insufficient
- **Action**: Debug feature engineering, check for bugs

---

## Recommended Reporting Language

### Good ✅
> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."

### Acceptable ✅
> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish a performance ceiling; prospective validation with real forecasts is required for operational deployment."

### Misleading ❌
> "The system achieves 120 MW MAE on validation data and is ready for production."

*(Omits limitations, implies production readiness)*

---

## Conclusion

This compromised validation approach is:
- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation

**Next Steps**:
1. Run the Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days

---

**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."