Spaces:

evgueni-p
/

fbmc-chronos2

Sleeping

App Files Files Community

fbmc-chronos2 / doc /JAO_Data_Treatment_Plan.md

Evgueni Poloukarov

feat: complete Marimo data exploration notebook with FBMC methodology documentation

82da022 about 1 month ago

preview code

raw

history blame contribute delete

118 kB

	# JAO Data Treatment & Feature Extraction Plan
	## Complete Guide for FBMC Flow Forecasting MVP

	Last Updated: October 29, 2025
	Project: FBMC Zero-Shot Forecasting with Chronos 2
	Timeline: 5-Day MVP
	Version: 5.0 - OUTAGE DURATION FEATURES (Added temporal outage metrics for enhanced forecasting)

	---

	## Table of Contents

	1. [Data Acquisition Strategy](#1-data-acquisition-strategy)
	2. [JAO Data Series Catalog](#2-jao-data-series-catalog)
	3. [Data Cleaning Procedures](#3-data-cleaning-procedures)
	4. [CNEC Treatment](#4-cnec-treatment)
	5. [PTDF Treatment](#5-ptdf-treatment)
	6. [RAM Treatment](#6-ram-treatment)
	7. [Shadow Prices Treatment](#7-shadow-prices-treatment)
	8. [Feature Engineering Pipeline](#8-feature-engineering-pipeline)
	9. [Quality Assurance](#9-quality-assurance)

	---

	## 1. Data Acquisition Strategy

	### 1.1 Timeline and Scope

	Historical Data Period: October 2023 - September 2025 (24 months)
	- Purpose: Feature baselines and historical context
	- Total: ~17,520 hourly records per series

	Collection Priority:
	1. Day 1 Morning (2 hours): Max BEX files (TARGET VARIABLE - highest priority!)
	2. Day 1 Morning (2 hours): CNEC files (critical for constraint identification)
	3. Day 1 Morning (2 hours): PTDF matrices (network sensitivity)
	4. Day 1 Afternoon (1.5 hours): LTN values (future covariates)
	5. Day 1 Afternoon (1.5 hours): Min/Max Net Positions (domain boundaries)
	6. Day 1 Afternoon (1.5 hours): ATC Non-Core borders (loop flow drivers)
	7. Day 1 Afternoon (1.5 hours): RAM values (capacity margins)
	8. Day 1 Afternoon (1.5 hours): Shadow prices (economic signals)

	### 1.2 jao-py Python Library Usage

	Installation & Setup:
	```bash
	# Install jao-py using uv package manager
	.venv/Scripts/uv.exe pip install jao-py>=0.6.2

	# Verify installation
	.venv/Scripts/python.exe -c "from jao import JaoPublicationToolPandasClient; print('jao-py installed successfully')"
	```

	Python Data Collection:
	```python
	import pandas as pd
	from jao import JaoPublicationToolPandasClient
	import time

	# Initialize client (no API key required for public data)
	client = JaoPublicationToolPandasClient()

	# Define date range (24 months: Oct 2023 - Sept 2025)
	# IMPORTANT: jao-py requires pandas Timestamp with timezone (UTC)
	start_date = pd.Timestamp('2023-10-01', tz='UTC')
	end_date = pd.Timestamp('2025-09-30', tz='UTC')

	# Collect MaxBEX data (TARGET VARIABLE) - day by day with rate limiting
	maxbex_data = []
	current_date = start_date
	while current_date <= end_date:
	df = client.query_maxbex(current_date)
	maxbex_data.append(df)
	current_date += pd.Timedelta(days=1)
	time.sleep(5) # Rate limiting: 5 seconds between requests

	# Combine and save
	maxbex_df = pd.concat(maxbex_data)
	maxbex_df.to_parquet('data/raw/jao/maxbex_2023_2025.parquet')

	# Collect Active Constraints (CNECs + PTDFs in ONE call!)
	cnec_data = []
	current_date = start_date
	while current_date <= end_date:
	df = client.query_active_constraints(current_date)
	cnec_data.append(df)
	current_date += pd.Timedelta(days=1)
	time.sleep(5) # Rate limiting: 5 seconds between requests

	# Combine and save
	cnec_df = pd.concat(cnec_data)
	cnec_df.to_parquet('data/raw/jao/cnecs_ptdfs_2023_2025.parquet')
	```

	Key Methods Available:
	- `query_maxbex(day)` - Maximum Bilateral Exchange capacity (TARGET)
	- `query_active_constraints(day)` - CNECs with shadow prices, RAM, and PTDFs
	- `query_lta(d_from, d_to)` - Long Term Allocations (perfect future covariate)
	- `query_minmax_np(day)` - Min/Max Net Positions (domain boundaries)
	- `query_net_position(day)` - Actual net positions
	- `query_monitoring(day)` - Monitoring data (additional RAM/shadow prices)
	```

	Expected Output Files:
	```
	data/raw/jao/
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnecs_2024_2025.parquet (~500 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ ptdfs_2024_2025.parquet (~800 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ rams_2024_2025.parquet (~400 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ shadow_prices_2024_2025.parquet (~300 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ presolved_2024_2025.parquet (~200 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ d2cf_2024_2025.parquet (~600 MB - OPTIONAL)
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ metadata/
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnec_definitions.parquet
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ zone_ptdf_mapping.parquet
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ download_log.json
	```

	---

	## 2. JAO Data Series Catalog

	### 2.1 Max BEX Data Series (TARGET VARIABLE)

	File: `Core_DA_Results_[date].xml` or `Core_Max_Exchanges_[date].xml`

	CRITICAL: This is the TARGET VARIABLE we are forecasting - without Max BEX, the model cannot be trained or evaluated!

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `border_id` \| String \| Border identifier (e.g., "DE-FR", "DE-NL") \| Border tracking \| REQUIRED \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `max_bex` \| Float (MW) \| Maximum Bilateral Exchange Capacity \| TARGET VARIABLE \| CRITICAL - must have \|
	\| `direction` \| String \| Direction of capacity (e.g., "DEÃ¢â€ â€™FR") \| Directional features \| Parse from border \|

	Description:
	Max BEX represents the actual available cross-border capacity after FBMC optimization. It is the OUTPUT of the FBMC calculation process and represents how much capacity is actually available for bilateral exchanges after accounting for:
	- All CNEC constraints (RAM limits)
	- PTDF sensitivities
	- LTN allocations (capacity already committed)
	- Remedial actions
	- All network constraints

	IMPORTANT: Commercial vs Physical Capacity
	- MaxBEX = commercial hub-to-hub trading capacity, NOT physical interconnector ratings
	- FBMC Core has 12 bidding zones: AT, BE, CZ, DE-LU, FR, HR, HU, NL, PL, RO, SI, SK
	- All 132 zone-pair combinations exist (12 × 11 bidirectional pairs)
	- This includes "virtual borders" - zone pairs without direct physical interconnectors
	- Example: FR→HU capacity exists despite no physical FR-HU interconnector
	- Power flows through AC grid network via DE, AT, CZ
	- PTDFs quantify how this exchange affects every CNEC in the network
	- MaxBEX = optimization result considering ALL network constraints

	Borders to Collect (ALL 132 zone pairs):
	- JAO provides MaxBEX for all 12 × 11 = 132 bidirectional zone-pair combinations
	- Includes both physical borders (e.g., DE→FR, AT→CZ) and virtual borders (e.g., FR→HU, BE→PL)
	- Each direction (A→B vs B→A) has independent capacity values
	- Full list: All permutations of (AT, BE, CZ, DE, FR, HR, HU, NL, PL, RO, SI, SK)
	- Example physical borders: DE→FR, DE→NL, AT→CZ, BE→NL, FR→BE
	- Example virtual borders: FR→HU, AT→HR, BE→RO, NL→SK, CZ→HR

	Collection Priority: DAY 1 MORNING (FIRST PRIORITY)
	This is the ground truth for training and validation - collect before any other data series!

	Storage Schema:
	```python
	max_bex_schema = {
	'timestamp': pl.Datetime,
	'border_id': pl.Utf8,
	'max_bex': pl.Float32, # TARGET VARIABLE
	'direction': pl.Utf8
	}
	```

	Expected File Size: ~1.2-1.5 GB for 24 months × 132 zone pairs × 17,520 hours (actual: ~200 MB compressed as Parquet)

	---

	### 2.2 LTN Data Series (Long Term Nominations)

	File: `Core_LTN_[date].xml` or available through `Core_ltn.R` JAOPuTo function

	CRITICAL: LTN values are PERFECT FUTURE COVARIATES because they result from yearly/monthly capacity auctions and are known with certainty for the forecast horizon!

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `border_id` \| String \| Border identifier (e.g., "DE-FR", "SI-HR") \| Border tracking \| REQUIRED \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `ltn_allocated` \| Float (MW) \| Capacity already allocated in long-term auctions \| INPUT FEATURE & FUTURE COVARIATE \| 0.0 (no LT allocation) \|
	\| `direction` \| String \| Direction of nomination \| Directional features \| Parse from border \|
	\| `auction_type` \| String \| "yearly" or "monthly" \| Allocation source tracking \| Required field \|

	Description:
	LTN represents capacity already committed through long-term (yearly and monthly) auctions conducted by JAO. This capacity is guaranteed and known in advance, making it an exceptionally valuable feature:

	Historical Context (What Happened):
	- LTN values show how much capacity was already allocated
	- Higher LTN Ã¢â€ â€™ lower available day-ahead capacity (Max BEX)
	- Direct relationship: `Max BEX Ã¢â€°Ë† Theoretical Max - LTN - Other Constraints`

	Future Covariates (What We Know Will Happen):
	- Yearly auctions: Results are known for the entire year ahead
	- Monthly auctions: Results are known for the month ahead
	- When forecasting D+1 to D+14, LTN values are 100% certain (already allocated)
	- This is GOLD STANDARD for future covariates - no forecasting needed!

	Borders with LTN:
	Most Core borders have LTN=0 (Financial Transmission Rights), but important exceptions:
	- SI-HR (Slovenia-Croatia): Physical Transmission Rights (PTRs) still in use
	- Some external Core borders may have PTRs
	- Even when LTN=0, including the series confirms this to the model

	Impact on Max BEX:
	Example: If DE-FR has 500 MW LTN allocated:
	- Theoretical capacity: 3000 MW
	- LTN reduction: -500 MW
	- After network constraints: -400 MW
	- Result: Max BEX Ã¢â€°Ë† 2100 MW

	Collection Priority: DAY 1 AFTERNOON (HIGH PRIORITY - future covariates)

	Storage Schema:
	```python
	ltn_schema = {
	'timestamp': pl.Datetime,
	'border_id': pl.Utf8,
	'ltn_allocated': pl.Float32, # Amount already committed
	'direction': pl.Utf8,
	'auction_type': pl.Utf8 # yearly/monthly
	}
	```

	Expected File Size: ~100-160 MB for 24 months Ãƒâ€” 20 borders Ãƒâ€” 17,520 hours

	JAOPuTo Download Command:
	```bash
	# Download LTN data for 24 months
	java -jar JAOPuTo.jar \
	--start-date 2023-10-01 \
	--end-date 2025-09-30 \
	--data-type LTN \
	--region CORE \
	--output-format parquet \
	--output-dir ./data/raw/jao/ltn/
	```

	Future Covariate Usage:
	```python
	# LTN is known in advance from auction results
	# For forecast period D+1 to D+14, we have PERFECT information
	def prepare_ltn_future_covariates(ltn_df, forecast_start_date, prediction_horizon_hours=336):
	"""
	Extract LTN values for future forecast horizon
	These are KNOWN with certainty (auction results)
	"""
	forecast_end_date = forecast_start_date + pd.Timedelta(hours=prediction_horizon_hours)

	future_ltn = ltn_df.filter(
	(pl.col('timestamp') >= forecast_start_date) &
	(pl.col('timestamp') < forecast_end_date)
	)

	return future_ltn # No forecasting needed - these are actual commitments!
	```

	---

	### 2.3 Min/Max Net Position Data Series

	File: `Core_MaxNetPositions_[date].xml` or available through `Core_maxnetpositions.R` JAOPuTo function

	CRITICAL: Min/Max Net Positions define the feasible domain for each bidding zone - the boundaries within which net positions can move without violating network constraints.

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `zone` \| String \| Bidding zone (e.g., "DE_LU", "FR", "BE") \| Zone tracking \| REQUIRED \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `net_pos_min` \| Float (MW) \| Minimum feasible net position \| DOMAIN BOUNDARY \| CRITICAL - must have \|
	\| `net_pos_max` \| Float (MW) \| Maximum feasible net position \| DOMAIN BOUNDARY \| CRITICAL - must have \|
	\| `actual_net_pos` \| Float (MW) \| Actual net position after market clearing \| Reference value \| 0.0 if missing \|

	Description:
	Min/Max Net Positions represent the feasible space in which each bidding zone can operate without violating any CNEC constraints. These values are extracted from the union of:
	- Final flow-based domain
	- Final bilateral exchange restrictions (LTA inclusion)
	- All CNEC constraints simultaneously satisfied

	Why This Matters for Forecasting:
	1. Tight ranges indicate constrained systems:
	- If `net_pos_max - net_pos_min` is small Ã¢â€ â€™ limited flexibility Ã¢â€ â€™ likely lower Max BEX
	- If range is large Ã¢â€ â€™ system has degrees of freedom Ã¢â€ â€™ potentially higher Max BEX

	2. Zone-specific stress indicators:
	- A zone with narrow net position range is "boxed in"
	- This propagates to adjacent borders (reduces their capacity)

	3. Derived features:
	```python
	# Range = degrees of freedom
	net_pos_range = net_pos_max - net_pos_min

	# Margin = how close to limits (utilization)
	net_pos_margin = (actual_net_pos - net_pos_min) / (net_pos_max - net_pos_min)

	# Stress index = inverse of range (tighter = more stressed)
	zone_stress = 1.0 / (net_pos_range + 1.0) # +1 to avoid division by zero
	```

	Example Interpretation:
	For Germany (DE_LU) at a specific hour:
	- `net_pos_min` = -8000 MW (maximum import)
	- `net_pos_max` = +12000 MW (maximum export)
	- `actual_net_pos` = +6000 MW (actual export after market clearing)
	- Range = 20,000 MW (large flexibility)
	- Margin = (6000 - (-8000)) / 20000 = 0.70 (70% utilization toward export limit)

	If another hour shows:
	- `net_pos_min` = -2000 MW
	- `net_pos_max` = +3000 MW
	- Range = 5,000 MW (VERY constrained - system is tight!)
	- This hour will likely have lower Max BEX on German borders

	Zones to Collect (12 Core zones):
	- DE_LU (Germany-Luxembourg)
	- FR (France)
	- BE (Belgium)
	- NL (Netherlands)
	- AT (Austria)
	- CZ (Czech Republic)
	- PL (Poland)
	- SK (Slovakia)
	- HU (Hungary)
	- SI (Slovenia)
	- HR (Croatia)
	- RO (Romania)

	Collection Priority: DAY 1 AFTERNOON (HIGH PRIORITY)

	Storage Schema:
	```python
	netpos_schema = {
	'timestamp': pl.Datetime,
	'zone': pl.Utf8,
	'net_pos_min': pl.Float32, # Minimum feasible (import limit)
	'net_pos_max': pl.Float32, # Maximum feasible (export limit)
	'actual_net_pos': pl.Float32 # Actual market clearing value
	}
	```

	Expected File Size: ~160-200 MB for 24 months Ãƒâ€” 12 zones Ãƒâ€” 17,520 hours

	JAOPuTo Download Command:
	```bash
	# Download Min/Max Net Positions for 24 months
	java -jar JAOPuTo.jar \
	--start-date 2023-10-01 \
	--end-date 2025-09-30 \
	--data-type MAX_NET_POSITIONS \
	--region CORE \
	--output-format parquet \
	--output-dir ./data/raw/jao/netpos/
	```

	Feature Engineering:
	```python
	def engineer_netpos_features(df: pl.DataFrame) -> pl.DataFrame:
	"""
	Create net position features from Min/Max data
	"""
	df = df.with_columns([
	# 1. Range (degrees of freedom)
	(pl.col('net_pos_max') - pl.col('net_pos_min'))
	.alias('net_pos_range'),

	# 2. Utilization margin (0 = at min, 1 = at max)
	((pl.col('actual_net_pos') - pl.col('net_pos_min')) /
	(pl.col('net_pos_max') - pl.col('net_pos_min')))
	.alias('net_pos_margin'),

	# 3. Stress index (inverse of range, normalized)
	(1.0 / (pl.col('net_pos_max') - pl.col('net_pos_min') + 100.0))
	.alias('zone_stress_index'),

	# 4. Distance to limits (minimum of distances to both boundaries)
	pl.min_horizontal([
	pl.col('actual_net_pos') - pl.col('net_pos_min'),
	pl.col('net_pos_max') - pl.col('actual_net_pos')
	]).alias('distance_to_nearest_limit'),

	# 5. Asymmetry (is the feasible space symmetric around zero?)
	((pl.col('net_pos_max') + pl.col('net_pos_min')) / 2.0)
	.alias('domain_asymmetry')
	])

	return df
	```

	System-Level Aggregations:
	```python
	def aggregate_netpos_system_wide(df: pl.DataFrame) -> pl.DataFrame:
	"""
	Create system-wide net position stress indicators
	"""
	system_features = df.groupby('timestamp').agg([
	# Average range across all zones
	pl.col('net_pos_range').mean().alias('avg_zone_flexibility'),

	# Minimum range (most constrained zone)
	pl.col('net_pos_range').min().alias('min_zone_flexibility'),

	# Count of highly constrained zones (range < 5000 MW)
	(pl.col('net_pos_range') < 5000).sum().alias('n_constrained_zones'),

	# System-wide stress (average of zone stress indices)
	pl.col('zone_stress_index').mean().alias('system_stress_avg'),

	# Maximum zone stress (tightest zone)
	pl.col('zone_stress_index').max().alias('system_stress_max')
	])

	return system_features
	```

	---

	### 2.4 CNEC Data Series

	File: `Core_DA_CC_CNEC_[date].xml` (daily publication at 10:30 CET)

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `cnec_id` \| String \| Unique identifier (e.g., "DE_CZ_TIE_001") \| CNEC tracking \| N/A - Required field \|
	\| `cnec_name` \| String \| Human-readable name (e.g., "Line RÃƒÆ’Ã‚Â¶hrsdorf-Hradec N-1 ABC") \| Documentation \| Forward-fill \|
	\| `contingency` \| String \| N-1 element causing constraint \| CNEC classification \| "basecase" if missing \|
	\| `tso` \| String \| Responsible TSO (e.g., "50Hertz", "Ãƒâ€žÃ…â€™EPS") \| Geographic features \| Parse from `cnec_id` \|
	\| `monitored_element` \| String \| Physical line/transformer being monitored \| CNEC grouping \| Required field \|
	\| `fmax` \| Float (MW) \| Maximum flow limit under contingency \| Normalization baseline \| CRITICAL - must have \|
	\| `ram_before` \| Float (MW) \| Initial remaining available margin \| Historical patterns \| 0.0 (conservative) \|
	\| `ram_after` \| Float (MW) \| Final RAM after remedial actions \| Primary feature \| Forward-fill from `ram_before` \|
	\| `flow_fb` \| Float (MW) \| Final flow after market coupling \| Flow patterns \| 0.0 if unconstrained \|
	\| `presolved` \| Boolean \| Was CNEC binding/active? \| Key target feature \| False (not binding) \|
	\| `shadow_price` \| Float (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW) \| Lagrange multiplier for constraint \| Economic signal \| 0.0 if not binding \|
	\| `direction` \| String \| "import" or "export" from perspective \| Directionality \| Parse from flow sign \|
	\| `voltage_level` \| Integer (kV) \| Voltage level (e.g., 220, 380) \| CNEC classification \| 380 (most common) \|
	\| `timestamp` \| Datetime \| Delivery hour (CET/CEST) \| Time index \| REQUIRED \|
	\| `business_day` \| Date \| Market day (D for delivery D+1) \| File organization \| Derive from timestamp \|

	Storage Schema:
	```python
	cnec_schema = {
	'timestamp': pl.Datetime,
	'business_day': pl.Date,
	'cnec_id': pl.Utf8,
	'cnec_name': pl.Utf8,
	'tso': pl.Utf8,
	'fmax': pl.Float32,
	'ram_after': pl.Float32, # Primary field
	'ram_before': pl.Float32,
	'flow_fb': pl.Float32,
	'presolved': pl.Boolean,
	'shadow_price': pl.Float32,
	'contingency': pl.Utf8,
	'voltage_level': pl.Int16
	}
	```

	Collection Decision:
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: All fields above
	- ÃƒÂ¢Ã‚ÂÃ…â€™ SKIP: Internal TSO-specific IDs, validation flags, intermediate calculation steps

	### 2.5 PTDF Data Series

	File: `Core_DA_CC_PTDF_[date].xml` (published D-2 and updated D-1)

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `cnec_id` \| String \| Links to CNEC \| Join key \| REQUIRED \|
	\| `zone` \| String \| Bidding zone (e.g., "DE_LU", "FR") \| Sensitivity mapping \| REQUIRED \|
	\| `ptdf_value` \| Float \| Sensitivity coefficient (-1.0 to +1.0) \| Core feature \| 0.0 (no impact) \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `version` \| String \| "D-2" or "D-1" \| Data freshness \| Use latest available \|
	\| `is_hvdc` \| Boolean \| Is this a HVDC link sensitivity? \| Separate treatment \| False \|

	PTDF Matrix Structure:
	```
	DE_LU FR NL BE AT CZ PL ... (12 zones)
	CNEC_001 0.42 -0.18 0.15 0.08 0.05 0.12 -0.02 ...
	CNEC_002 -0.35 0.67 -0.12 0.25 0.03 -0.08 0.15 ...
	CNEC_003 0.28 -0.22 0.45 -0.15 0.18 0.08 -0.05 ...
	...
	CNEC_200 ... ... ... ... ... ... ... ...
	```

	Collection Decision:
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: `ptdf_value` for all zones ÃƒÆ’Ã¢â‚¬â€ all CNECs ÃƒÆ’Ã¢â‚¬â€ all hours
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: Only "D-1" version (most recent available for historical data)
	- ÃƒÂ¢Ã‚ÂÃ…â€™ SKIP: D-2 version, intermediate updates, HVDC-specific matrices

	Dimensionality:
	- Raw: ~200 CNECs ÃƒÆ’Ã¢â‚¬â€ 12 zones ÃƒÆ’Ã¢â‚¬â€ 17,520 hours = ~42 million values
	- Hybrid Storage: Top 50 CNECs with individual PTDFs, Tier-2 with border aggregates

	### 2.6 RAM Data Series

	File: `Core_DA_CC_CNEC_[date].xml` (within CNEC records)

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `cnec_id` \| String \| CNEC identifier \| Join key \| REQUIRED \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `minram` \| Float (MW) \| Minimum RAM (70% rule) \| Compliance checking \| 0.7 ÃƒÆ’Ã¢â‚¬â€ fmax \|
	\| `initial_ram` \| Float (MW) \| RAM before coordination \| Historical baseline \| Forward-fill \|
	\| `final_ram` \| Float (MW) \| RAM after validation/remedial actions \| Primary feature \| CRITICAL \|
	\| `ram_net_position` \| Float (MW) \| Net position at which RAM calculated \| Market condition \| 0.0 \|
	\| `validation_adjustment` \| Float (MW) \| TSO adjustments during validation \| Adjustment patterns \| 0.0 \|
	\| `remedial_actions_applied` \| Boolean \| Were remedial actions used? \| Constraint stress \| False \|

	Collection Decision:
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: `final_ram`, `minram`, `initial_ram`
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: `remedial_actions_applied` (binary indicator)
	- ÃƒÂ¢Ã‚ÂÃ…â€™ SKIP: Detailed remedial action descriptions (too granular)

	### 2.7 Shadow Price Data Series

	File: `Core_DA_Results_[date].xml` (post-market publication)

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `cnec_id` \| String \| CNEC identifier \| Join key \| REQUIRED \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `shadow_price` \| Float (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW) \| Lagrange multiplier \| Economic signal \| 0.0 \|
	\| `shadow_price_before_minram` \| Float (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW) \| Before 70% rule applied \| Pre-constraint value \| 0.0 \|
	\| `binding_duration` \| Integer (minutes) \| How long CNEC was binding \| Persistence \| 0 \|
	\| `peak_shadow_price_4h` \| Float (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW) \| Max in 4-hour window \| Volatility signal \| shadow_price \|

	Collection Decision:
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: `shadow_price` (primary)
	- ÃƒÂ¢Ã…â€œÃ¢â‚¬Â¦ COLLECT: `binding_duration` (for persistence features)
	- ÃƒÂ¢Ã‚ÂÃ…â€™ SKIP: `shadow_price_before_minram` (less relevant for forecasting)

	---

	### 2.8 D2CF (Day-2 Congestion Forecast) - OPTIONAL

	Decision: SKIP for MVP

	Rationale:
	1. Temporal misalignment: D2CF provides TSO forecasts for D+2, but we're forecasting D+1 to D+14
	2. Forecast contamination: These are forecasts, not ground truth - introduces noise
	3. Better alternatives: ENTSO-E actual generation/load provides cleaner signals
	4. Scope reduction: Keeps MVP focused on high-signal features

	If collected later:
	- `load_forecast_d2cf`: TSO load forecast
	- `generation_forecast_d2cf`: TSO generation forecast
	- `net_position_forecast_d2cf`: Expected net position

	---

	### 2.9 ATC for Non-Core Borders Data Series

	File: `Core_ATC_External_Borders_[date].xml` or ENTSO-E Transparency Platform

	\| Field Name \| Data Type \| Description \| Use in Features \| Missing Value Strategy \|
	\|------------\|-----------\|-------------\|-----------------\|----------------------\|
	\| `border_id` \| String \| Non-Core border (e.g., "FR-UK", "FR-ES") \| Border tracking \| REQUIRED \|
	\| `timestamp` \| Datetime \| Delivery hour \| Time index \| REQUIRED \|
	\| `atc_forward` \| Float (MW) \| Available Transfer Capacity (forward direction) \| Loop flow driver \| 0.0 \|
	\| `atc_backward` \| Float (MW) \| Available Transfer Capacity (backward direction) \| Loop flow driver \| 0.0 \|

	Description:
	Available Transfer Capacity on borders connecting Core to non-Core regions. These affect loop flows through Core and impact Core CNEC constraints.

	Why This Matters:
	Large flows on external borders create loop flows through Core:
	- FR-UK flows affect FR-BE, FR-DE capacities via loop flows
	- Swiss transit (FR-CH-AT-DE) impacts Core CNECs significantly
	- Nordic flows through DE-DK affect German internal CNECs
	- Italian flows through AT-IT affect Austrian CNECs

	Key Non-Core Borders to Collect (~14 borders):
	1. FR-UK (IFA, ElecLink, IFA2 interconnectors) - Major loop flow driver
	2. FR-ES (Pyrenees corridor)
	3. FR-CH (Alpine corridor to Switzerland)
	4. DE-CH (Basel-Laufenburg)
	5. AT-CH (Alpine corridor)
	6. DE-DK (to Nordic region)
	7. PL-SE (Baltic to Nordic)
	8. PL-LT (Baltic connection)
	9. AT-SI (SEE gateway)
	10. AT-IT (Brenner Pass)
	11. HU-RO (SEE connection)
	12. HU-RS (Balkans)
	13. HR-RS (Balkans)
	14. SI-IT (SEE to Italy)

	Collection Method:
	1. JAO Publication Tool: Core_ATC_External_Borders
	2. ENTSO-E Transparency Platform: Fallback if JAO incomplete

	Collection Priority: DAY 1 AFTERNOON (after core data)

	Storage Schema:
	```python
	atc_external_schema = {
	'timestamp': pl.Datetime,
	'border_id': pl.Utf8,
	'atc_forward': pl.Float32,
	'atc_backward': pl.Float32
	}
	```

	Expected File Size: ~120-160 MB for 24 months Ãƒâ€” 14 borders

	ENTSO-E API Collection (if needed):
	```python
	from entsoe import EntsoePandasClient

	non_core_borders = [
	('FR', 'UK'), ('FR', 'ES'), ('FR', 'CH'),
	('DE', 'CH'), ('DE', 'DK'),
	('AT', 'CH'), ('AT', 'IT'),
	('PL', 'SE'), ('PL', 'LT')
	]

	for zone1, zone2 in non_core_borders:
	atc_data = client.query_offered_capacity(
	country_code_from=zone1,
	country_code_to=zone2,
	start=start_date,
	end=end_date
	)
	```

	---

	## 3. Data Cleaning Procedures

	### 3.1 Missing Value Handling

	Priority Order:
	1. Forward-fill (for continuous series with gradual changes)
	2. Zero-fill (for event-based fields where absence means "not active")
	3. Interpolation (for weather-related smooth series)
	4. Drop (only if >30% missing in critical fields)

	Field-Specific Strategies:

	\| Field \| Strategy \| Justification \|
	\|-------\|----------\|---------------\|
	\| `ram_after` \| Forward-fill (max 2 hours), then linear interpolation \| Capacity changes gradually \|
	\| `presolved` \| Zero-fill ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ False \| Missing = not binding \|
	\| `shadow_price` \| Zero-fill \| Missing = no congestion cost \|
	\| `ptdf_value` \| Zero-fill \| Missing = no sensitivity \|
	\| `fmax` \| NO FILL - MUST HAVE \| Critical for normalization \|
	\| `flow_fb` \| Zero-fill \| Missing = no flow \|
	\| `cnec_name` \| Forward-fill \| Descriptive only \|

	Implementation:
	```python
	def clean_jao_data(df: pl.DataFrame) -> pl.DataFrame:
	"""Clean JAO data with field-specific strategies"""

	# 1. Handle critical fields (fail if missing)
	critical_fields = ['timestamp', 'cnec_id', 'fmax']
	for field in critical_fields:
	if df[field].null_count() > 0:
	raise ValueError(f"Critical field {field} has missing values")

	# 2. Forward-fill continuous series
	df = df.with_columns([
	pl.col('ram_after').fill_null(strategy='forward').fill_null(strategy='backward'),
	pl.col('ram_before').fill_null(strategy='forward').fill_null(strategy='backward'),
	pl.col('cnec_name').fill_null(strategy='forward')
	])

	# 3. Zero-fill event indicators
	df = df.with_columns([
	pl.col('presolved').fill_null(False),
	pl.col('shadow_price').fill_null(0.0),
	pl.col('flow_fb').fill_null(0.0),
	pl.col('remedial_actions_applied').fill_null(False)
	])

	# 4. Linear interpolation for remaining RAM gaps (max 2 hours)
	df = df.with_columns([
	pl.col('ram_after').interpolate(method='linear')
	])

	# 5. Fill contingency with "basecase" if missing
	df = df.with_columns([
	pl.col('contingency').fill_null('basecase')
	])

	return df
	```

	### 3.2 Outlier Detection & Treatment

	RAM Outliers:
	```python
	def detect_ram_outliers(df: pl.DataFrame) -> pl.DataFrame:
	"""Flag and clip RAM outliers"""

	# 1. Physical constraints
	df = df.with_columns([
	# RAM cannot exceed Fmax
	pl.when(pl.col('ram_after') > pl.col('fmax'))
	.then(pl.col('fmax'))
	.otherwise(pl.col('ram_after'))
	.alias('ram_after'),

	# RAM cannot be negative (clip to 0)
	pl.when(pl.col('ram_after') < 0)
	.then(0.0)
	.otherwise(pl.col('ram_after'))
	.alias('ram_after')
	])

	# 2. Statistical outliers (sudden unrealistic jumps)
	# Flag if RAM drops >80% in 1 hour (likely data error)
	df = df.with_columns([
	(pl.col('ram_after').diff() / pl.col('ram_after').shift(1) < -0.8)
	.fill_null(False)
	.alias('outlier_sudden_drop')
	])

	# 3. Replace flagged outliers with interpolation
	df = df.with_columns([
	pl.when(pl.col('outlier_sudden_drop'))
	.then(None)
	.otherwise(pl.col('ram_after'))
	.alias('ram_after')
	]).with_columns([
	pl.col('ram_after').interpolate()
	])

	return df.drop('outlier_sudden_drop')
	```

	PTDF Outliers:
	```python
	def clean_ptdf_outliers(ptdf_matrix: np.ndarray) -> np.ndarray:
	"""Clean PTDF values to physical bounds"""

	# PTDFs theoretically in [-1, 1] but can be slightly outside due to rounding
	# Clip to [-1.5, 1.5] to catch errors while allowing small overruns
	ptdf_matrix = np.clip(ptdf_matrix, -1.5, 1.5)

	# Replace any NaN/Inf with 0 (no sensitivity)
	ptdf_matrix = np.nan_to_num(ptdf_matrix, nan=0.0, posinf=0.0, neginf=0.0)

	return ptdf_matrix
	```

	Shadow Price Outliers:
	```python
	def clean_shadow_prices(df: pl.DataFrame) -> pl.DataFrame:
	"""Handle extreme shadow prices"""

	# Shadow prices above ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬500/MW are rare but possible during extreme scarcity
	# Cap at ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬1000/MW to prevent contamination (99.9th percentile historically)
	df = df.with_columns([
	pl.when(pl.col('shadow_price') > 1000)
	.then(1000)
	.otherwise(pl.col('shadow_price'))
	.alias('shadow_price')
	])

	return df
	```

	### 3.3 Timestamp Alignment

	Challenge: JAO publishes data for "delivery day D+1" on "business day D"

	Solution:
	```python
	def align_jao_timestamps(df: pl.DataFrame) -> pl.DataFrame:
	"""Convert business day to delivery hour timestamps"""

	# JAO format: business_day='2025-10-28', delivery_hour=1-24
	# Convert to: timestamp='2025-10-29 00:00' to '2025-10-29 23:00' (UTC)

	df = df.with_columns([
	# Delivery timestamp = business_day + 1 day + (delivery_hour - 1)
	(pl.col('business_day') + pl.duration(days=1) +
	pl.duration(hours=pl.col('delivery_hour') - 1))
	.alias('timestamp')
	])

	# Convert from CET/CEST to UTC for consistency
	df = df.with_columns([
	pl.col('timestamp').dt.convert_time_zone('UTC')
	])

	return df
	```

	### 3.4 Duplicate Handling

	Sources of Duplicates:
	1. D-2 and D-1 PTDF updates (keep D-1 only)
	2. Corrected publications (keep latest by publication time)
	3. Intraday updates (out of scope for day-ahead forecasting)

	Deduplication Strategy:
	```python
	def deduplicate_jao_data(df: pl.DataFrame) -> pl.DataFrame:
	"""Remove duplicate records, keeping most recent version"""

	# For PTDF data: keep D-1 version
	if 'version' in df.columns:
	df = df.filter(pl.col('version') == 'D-1')

	# For all data: keep latest publication time per (timestamp, cnec_id)
	df = (df
	.sort('publication_time', descending=True)
	.unique(subset=['timestamp', 'cnec_id'], keep='first'))

	return df
	```

	---

	## 4. CNEC Treatment

	### 4.1 CNEC Selection Strategy

	Goal: Reduce from ~2,000 CNECs published daily by TSOs to Top 200 most impactful CNECs

	Two-Tier Approach:
	- Tier 1 (Top 50 CNECs): Full feature detail (5 features each = 250 features)
	- Tier 2 (Next 150 CNECs): Selective features (2 features each = 300 features)
	- Total: 200 CNECs tracked

	Selection Criteria:
	1. Binding Frequency (40% weight): How often was CNEC active?
	2. Economic Impact (30% weight): Average shadow price when binding
	3. RAM Utilization (20% weight): How close to limit (low RAM)?
	4. Geographic Coverage (10% weight): Ensure all borders represented

	Implementation:
	```python
	def select_top_cnecs(df: pl.DataFrame, n_cnecs: int = 200) -> dict[str, list[str]]:
	"""
	Select top 200 most impactful CNECs from ~2,000 published daily
	Returns: {'tier1': list of 50 CNECs, 'tier2': list of 150 CNECs}
	"""

	cnec_impact = df.groupby('cnec_id').agg([
	# Binding frequency (times active / total hours)
	(pl.col('presolved').sum() / pl.col('presolved').count())
	.alias('binding_freq'),

	# Average shadow price when binding
	pl.col('shadow_price').filter(pl.col('presolved')).mean()
	.alias('avg_shadow_price'),

	# RAM utilization (% of hours below 30% of Fmax)
	((pl.col('ram_after') < 0.3 * pl.col('fmax')).sum() /
	pl.col('ram_after').count())
	.alias('low_ram_freq'),

	# Days appeared (consistency metric)
	pl.col('cnec_id').count().alias('days_appeared'),

	# Geographic zone (parse from CNEC ID)
	pl.col('tso').first().alias('tso')
	])

	# Calculate composite impact score
	cnec_impact = cnec_impact.with_columns([
	(0.40 * pl.col('binding_freq') +
	0.30 * (pl.col('avg_shadow_price') / 100) + # Normalize to [0,1]
	0.20 * pl.col('low_ram_freq') +
	0.10 * (pl.col('days_appeared') / 365)) # Consistency bonus
	.alias('impact_score')
	])

	# Select top N, ensuring geographic diversity
	top_cnecs = []

	# First, get top 10 per major bidding zone border (ensures coverage)
	for border in ['DE_FR', 'DE_NL', 'DE_CZ', 'FR_BE', 'AT_CZ', 'DE_AT',
	'FR_ES', 'AT_IT', 'DE_PL', 'BE_NL']:
	border_cnecs = (cnec_impact
	.filter(pl.col('cnec_id').str.contains(border))
	.sort('impact_score', descending=True)
	.head(10))
	top_cnecs.extend(border_cnecs['cnec_id'].to_list())

	# Remove duplicates (some CNECs may match multiple borders)
	top_cnecs = list(set(top_cnecs))

	# Then add remaining highest-impact CNECs to reach n_cnecs
	remaining = (cnec_impact
	.filter(~pl.col('cnec_id').is_in(top_cnecs))
	.sort('impact_score', descending=True)
	.head(n_cnecs - len(top_cnecs)))
	top_cnecs.extend(remaining['cnec_id'].to_list())

	# Split into two tiers
	tier1_cnecs = top_cnecs[:50] # Top 50 get full feature detail
	tier2_cnecs = top_cnecs[50:200] # Next 150 get selective features

	return {
	'tier1': tier1_cnecs,
	'tier2': tier2_cnecs,
	'all': top_cnecs[:200]
	}
	```

	Why 200 CNECs?
	- From ~2,000 published daily, most are rarely binding or have minimal economic impact
	- Top 200 captures >95% of all binding events and congestion costs
	- Manageable feature space while preserving critical network information
	- Two-tier approach balances detail vs. dimensionality

	### 4.2 CNEC Masking Strategy

	Problem: Not all CNECs are published every day (only those considered relevant by TSOs)

	Solution: Create a "Master CNEC Set" with masking

	```python
	def create_cnec_master_set(df: pl.DataFrame, top_cnecs: list[str]) -> pl.DataFrame:
	"""Create complete CNEC ÃƒÆ’Ã¢â‚¬â€ timestamp matrix with masking"""

	# 1. Create complete timestamp ÃƒÆ’Ã¢â‚¬â€ CNEC grid
	all_timestamps = df.select('timestamp').unique().sort('timestamp')
	cnec_template = pl.DataFrame({'cnec_id': top_cnecs})

	# Cartesian product: all timestamps ÃƒÆ’Ã¢â‚¬â€ all CNECs
	complete_grid = all_timestamps.join(cnec_template, how='cross')

	# 2. Left join with actual data
	complete_data = complete_grid.join(
	df.select(['timestamp', 'cnec_id', 'ram_after', 'presolved',
	'shadow_price', 'fmax']),
	on=['timestamp', 'cnec_id'],
	how='left'
	)

	# 3. Add masking indicator
	complete_data = complete_data.with_columns([
	# Mask = 1 if CNEC was published (data available)
	# Mask = 0 if CNEC not published (implicitly unconstrained)
	(~pl.col('ram_after').is_null()).cast(pl.Int8).alias('cnec_mask')
	])

	# 4. Fill missing values according to strategy
	complete_data = complete_data.with_columns([
	# If not published, assume unconstrained:
	# - RAM = Fmax (maximum margin)
	# - presolved = False (not binding)
	# - shadow_price = 0 (no congestion)
	pl.col('ram_after').fill_null(pl.col('fmax')),
	pl.col('presolved').fill_null(False),
	pl.col('shadow_price').fill_null(0.0)
	])

	return complete_data
	```

	Resulting Schema:
	```
	timestamp cnec_id ram_after presolved shadow_price cnec_mask fmax
	2024-10-01 00:00 DE_CZ_TIE_001 450.2 True 12.5 1 800
	2024-10-01 00:00 DE_FR_LINE_005 800.0 False 0.0 0 800 # Not published
	2024-10-01 01:00 DE_CZ_TIE_001 432.1 True 15.2 1 800
	...
	```

	### 4.3 CNEC Feature Derivation

	From master CNEC set, derive:

	```python
	def engineer_cnec_features(df: pl.DataFrame) -> pl.DataFrame:
	"""Create CNEC-based features for forecasting"""

	df = df.with_columns([
	# 1. Margin ratio (normalized RAM)
	(pl.col('ram_after') / pl.col('fmax')).alias('margin_ratio'),

	# 2. Binding frequency (7-day rolling)
	pl.col('presolved').cast(pl.Int8).rolling_mean(window_size=168)
	.over('cnec_id').alias('binding_freq_7d'),

	# 3. Binding frequency (30-day rolling)
	pl.col('presolved').cast(pl.Int8).rolling_mean(window_size=720)
	.over('cnec_id').alias('binding_freq_30d'),

	# 4. MinRAM compliance
	(pl.col('ram_after') < 0.7 * pl.col('fmax'))
	.cast(pl.Int8).alias('below_minram'),

	# 5. RAM volatility (7-day rolling std)
	pl.col('ram_after').rolling_std(window_size=168)
	.over('cnec_id').alias('ram_volatility_7d'),

	# 6. Shadow price volatility
	pl.col('shadow_price').rolling_std(window_size=168)
	.over('cnec_id').alias('shadow_price_volatility_7d'),

	# 7. CNEC criticality score
	(1 - pl.col('margin_ratio')).alias('criticality'),

	# 8. Sudden RAM drops (binary flag)
	((pl.col('ram_after') - pl.col('ram_after').shift(1)) < -0.2 * pl.col('fmax'))
	.cast(pl.Int8).alias('sudden_ram_drop')
	])

	return df
	```

	---

	## 5. PTDF Treatment - Hybrid Approach

	### 5.1 Strategy: Preserve Causality for Critical CNECs

	Challenge: PTDF matrix contains critical network physics but is high-dimensional
	- 200 CNECs Ãƒâ€” 12 zones Ãƒâ€” 17,520 hours = ~42 million values
	- PTDFs encode how zone injections affect each CNEC (network sensitivity)
	- Critical requirement: Preserve CNEC-specific causality chain:
	- `outage_active_cnec_X` Ã¢â€ â€™ network sensitivity (PTDF) Ã¢â€ â€™ `presolved_cnec_X`

	Solution: Two-Tier Hybrid Approach

	#### Tier 1 (Top 50 CNECs): Individual PTDF Features
	Preserve all 12 zone sensitivities for each of the 50 most critical CNECs
	- Result: 50 CNECs Ãƒâ€” 12 zones = 600 individual PTDF features
	- Rationale: These are most impactful CNECs - model needs full network physics
	- Enables learning: "When `outage_active_cnec_42` = 1 AND high DE_LU injection (via `ptdf_cnec_42_DE_LU` sensitivity), THEN `presolved_cnec_42` likely = 1"

	#### Tier 2 (Next 150 CNECs): Border-Level PTDF Aggregates
	Aggregate PTDF sensitivities by border-zone pairs
	- Result: 10 borders Ãƒâ€” 12 zones = 120 aggregate PTDF features
	- Rationale: Captures regional network patterns without full individual resolution
	- Preserves interpretability: "How sensitive are DE-CZ tier-2 CNECs to German generation?"
	- Still allows causal learning: `outage_active_cnec_tier2` Ã¢â€ â€™ `avg_ptdf_border` Ã¢â€ â€™ `presolved_cnec_tier2`

	Total PTDF Features: 720 (600 individual + 120 aggregate)

	Why This Approach?
	1. Ã¢Å“â€¦ Preserves CNEC-specific causality for top 50 - most critical constraints
	2. Ã¢Å“â€¦ Avoids PCA mixing - no loss of CNEC identification
	3. Ã¢Å“â€¦ Interpretable - can explain which zones affect which borders
	4. Ã¢Å“â€¦ Dimensionality reduction - 1,800 tier-2 values Ã¢â€ â€™ 120 features (15:1)
	5. Ã¢Å“â€¦ Maintains geographic structure - DE-CZ separate from FR-BE

	### 5.2 Implementation: Top 50 Individual PTDFs

	```python
	def extract_top50_ptdfs(ptdf_matrix: np.ndarray,
	top_50_cnec_ids: list[str],
	all_cnec_ids: list[str],
	zones: list[str]) -> pl.DataFrame:
	"""
	Extract individual PTDF values for top 50 CNECs

	Args:
	ptdf_matrix: Shape (n_timestamps, n_cnecs, n_zones)
	top_50_cnec_ids: List of 50 most critical CNEC identifiers
	all_cnec_ids: Full list of CNEC IDs matching matrix dimension
	zones: List of 12 Core zone identifiers ['DE_LU', 'FR', 'BE', 'NL', ...]

	Returns:
	DataFrame with 600 PTDF features (50 CNECs Ãƒâ€” 12 zones)
	"""

	# Find indices of top 50 CNECs in the full matrix
	top_50_indices = [all_cnec_ids.index(cnec) for cnec in top_50_cnec_ids]

	# Extract PTDF values for top 50 CNECs
	# Shape: (n_timestamps, 50, 12)
	top_50_ptdfs = ptdf_matrix[:, top_50_indices, :]

	# Reshape to flat features: (n_timestamps, 600)
	n_timestamps = top_50_ptdfs.shape[0]
	features_dict = {}

	for cnec_idx, cnec_id in enumerate(top_50_cnec_ids):
	for zone_idx, zone in enumerate(zones):
	feature_name = f'ptdf_cnec_{cnec_id}_{zone}'
	features_dict[feature_name] = top_50_ptdfs[:, cnec_idx, zone_idx]

	print(f"Ã¢Å“â€œ Extracted {len(features_dict)} individual PTDF features for top 50 CNECs")
	return pl.DataFrame(features_dict)
	```

	### 5.3 Implementation: Tier-2 Border Aggregates

	```python
	def aggregate_tier2_ptdfs_by_border(ptdf_matrix: np.ndarray,
	tier2_cnec_ids: list[str],
	all_cnec_ids: list[str],
	zones: list[str]) -> pl.DataFrame:
	"""
	Aggregate PTDF sensitivities for tier-2 CNECs by border-zone pairs

	Args:
	ptdf_matrix: Shape (n_timestamps, n_cnecs, n_zones)
	tier2_cnec_ids: List of 150 tier-2 CNEC identifiers
	all_cnec_ids: Full list of CNEC IDs
	zones: List of 12 Core zone identifiers

	Returns:
	DataFrame with 120 features (10 borders Ãƒâ€” 12 zones)
	"""

	# Define major borders
	major_borders = ['DE_FR', 'DE_NL', 'DE_CZ', 'FR_BE', 'AT_CZ',
	'DE_AT', 'BE_NL', 'AT_IT', 'DE_PL', 'CZ_PL']

	# Find indices of tier-2 CNECs
	tier2_indices = [all_cnec_ids.index(cnec) for cnec in tier2_cnec_ids]

	# Extract tier-2 PTDF subset
	tier2_ptdfs = ptdf_matrix[:, tier2_indices, :] # (n_timestamps, 150, 12)

	features_dict = {}

	for border in major_borders:
	# Find tier-2 CNECs belonging to this border
	border_cnec_indices = [
	i for i, cnec_id in enumerate(tier2_cnec_ids)
	if border.replace('_', '').lower() in cnec_id.lower()
	]

	if len(border_cnec_indices) == 0:
	# No tier-2 CNECs for this border - use zeros
	for zone in zones:
	features_dict[f'avg_ptdf_{border}_{zone}_tier2'] = np.zeros(tier2_ptdfs.shape[0])
	continue

	# Average PTDF across all tier-2 CNECs on this border
	for zone_idx, zone in enumerate(zones):
	border_zone_ptdfs = tier2_ptdfs[:, border_cnec_indices, zone_idx]
	avg_ptdf = np.mean(border_zone_ptdfs, axis=1)
	features_dict[f'avg_ptdf_{border}_{zone}_tier2'] = avg_ptdf

	print(f"Ã¢Å“â€œ Aggregated {len(features_dict)} border-zone PTDF features for tier-2")
	return pl.DataFrame(features_dict)
	```

	### 5.4 Complete PTDF Pipeline

	```python
	def prepare_ptdf_features(ptdf_matrix: np.ndarray,
	top_50_cnecs: list[str],
	tier2_cnecs: list[str],
	all_cnec_ids: list[str],
	zones: list[str]) -> pl.DataFrame:
	"""
	Complete PTDF feature engineering pipeline

	Returns:
	DataFrame with 720 PTDF features total:
	- 600 individual features (top 50 CNECs Ãƒâ€” 12 zones)
	- 120 aggregate features (10 borders Ãƒâ€” 12 zones for tier-2)
	"""

	# 1. Extract individual PTDFs for top 50
	top50_features = extract_top50_ptdfs(
	ptdf_matrix, top_50_cnecs, all_cnec_ids, zones
	)

	# 2. Aggregate tier-2 PTDFs by border
	tier2_features = aggregate_tier2_ptdfs_by_border(
	ptdf_matrix, tier2_cnecs, all_cnec_ids, zones
	)

	# 3. Combine
	all_ptdf_features = pl.concat([top50_features, tier2_features], how='horizontal')

	print(f"\nÃ¢Å“â€œ PTDF Features Generated:")
	print(f" Top 50 individual: {top50_features.shape[1]} features")
	print(f" Tier-2 aggregated: {tier2_features.shape[1]} features")
	print(f" Total PTDF features: {all_ptdf_features.shape[1]}")

	return all_ptdf_features
	```

	### 5.5 PTDF Feature Validation

	```python
	def validate_ptdf_features(ptdf_features: pl.DataFrame) -> None:
	"""Validate PTDF feature quality"""

	# 1. Check for valid range (PTDFs should be in [-1, 1])
	outliers = []
	for col in ptdf_features.columns:
	if col.startswith('ptdf_') or col.startswith('avg_ptdf_'):
	values = ptdf_features[col]
	min_val, max_val = values.min(), values.max()

	if min_val < -1.5 or max_val > 1.5:
	outliers.append(f"{col}: [{min_val:.3f}, {max_val:.3f}]")

	if outliers:
	print(f"Ã¢Å¡Â WARNING: {len(outliers)} PTDF features outside expected [-1, 1] range")
	for out in outliers[:5]: # Show first 5
	print(f" {out}")

	# 2. Check for constant features (no variance)
	constant_features = []
	for col in ptdf_features.columns:
	if col.startswith('ptdf_') or col.startswith('avg_ptdf_'):
	if ptdf_features[col].std() < 1e-6:
	constant_features.append(col)

	if constant_features:
	print(f"Ã¢Å¡Â WARNING: {len(constant_features)} PTDF features have near-zero variance")

	# 3. Summary statistics
	print(f"\nÃ¢Å“â€œ PTDF Feature Validation Complete:")
	print(f" Features in valid range: {len(ptdf_features.columns) - len(outliers)}")
	print(f" Features with variance: {len(ptdf_features.columns) - len(constant_features)}")
	```


	---

	## 6. RAM Treatment

	### 6.1 RAM Normalization

	Challenge: Absolute RAM values vary widely across CNECs (50 MW to 2000 MW)

	Solution: Normalize to margin ratio

	```python
	def normalize_ram_values(df: pl.DataFrame) -> pl.DataFrame:
	"""Normalize RAM to comparable scale"""

	df = df.with_columns([
	# 1. Margin ratio: RAM / Fmax
	(pl.col('ram_after') / pl.col('fmax')).alias('margin_ratio'),

	# 2. MinRAM compliance ratio: RAM / (0.7 ÃƒÆ’Ã¢â‚¬â€ Fmax)
	(pl.col('ram_after') / (0.7 * pl.col('fmax'))).alias('minram_compliance'),

	# 3. Percentile within CNEC's historical distribution
	pl.col('ram_after').rank(method='average').over('cnec_id')
	.alias('ram_percentile')
	])

	# Cap margin_ratio at 1.0 (cannot exceed Fmax)
	df = df.with_columns([
	pl.when(pl.col('margin_ratio') > 1.0)
	.then(1.0)
	.otherwise(pl.col('margin_ratio'))
	.alias('margin_ratio')
	])

	return df
	```

	### 6.2 RAM Time Series Features

	```python
	def engineer_ram_features(df: pl.DataFrame) -> pl.DataFrame:
	"""Create rolling window features from RAM values"""

	df = df.with_columns([
	# 1. Moving averages
	pl.col('ram_after').rolling_mean(window_size=24).alias('ram_ma_24h'),
	pl.col('ram_after').rolling_mean(window_size=168).alias('ram_ma_7d'),
	pl.col('ram_after').rolling_mean(window_size=720).alias('ram_ma_30d'),

	# 2. Volatility measures
	pl.col('ram_after').rolling_std(window_size=168).alias('ram_std_7d'),
	pl.col('ram_after').rolling_std(window_size=720).alias('ram_std_30d'),

	# 3. Percentiles (vs 90-day history)
	pl.col('ram_after')
	.rolling_quantile(quantile=0.1, window_size=2160)
	.alias('ram_p10_90d'),
	pl.col('ram_after')
	.rolling_quantile(quantile=0.5, window_size=2160)
	.alias('ram_median_90d'),

	# 4. Rate of change
	(pl.col('ram_after').diff(1) / pl.col('ram_after').shift(1))
	.alias('ram_pct_change_1h'),
	(pl.col('ram_after').diff(24) / pl.col('ram_after').shift(24))
	.alias('ram_pct_change_24h'),

	# 5. Binary indicators
	(pl.col('ram_after') < 0.3 * pl.col('fmax'))
	.cast(pl.Int8).alias('low_ram_flag'),

	((pl.col('ram_after') - pl.col('ram_after').shift(1)) < -0.2 * pl.col('fmax'))
	.cast(pl.Int8).alias('sudden_drop_flag'),

	# 6. MinRAM violation counter (7-day window)
	(pl.col('ram_after') < 0.7 * pl.col('fmax'))
	.cast(pl.Int8)
	.rolling_sum(window_size=168)
	.alias('minram_violations_7d')
	])

	return df
	```

	### 6.3 Cross-Border RAM Aggregation

	Create system-level RAM features:

	```python
	def aggregate_ram_system_wide(df: pl.DataFrame) -> pl.DataFrame:
	"""Aggregate RAM across all CNECs to system-level features"""

	system_features = df.groupby('timestamp').agg([
	# 1. Average margin across all CNECs
	pl.col('margin_ratio').mean().alias('system_avg_margin'),

	# 2. Minimum margin (most constrained CNEC)
	pl.col('margin_ratio').min().alias('system_min_margin'),

	# 3. Count of binding CNECs
	pl.col('presolved').sum().alias('n_binding_cnecs'),

	# 4. Count of CNECs below minRAM
	(pl.col('ram_after') < 0.7 * pl.col('fmax')).sum()
	.alias('n_cnecs_below_minram'),

	# 5. Standard deviation of margins (uniformity)
	pl.col('margin_ratio').std().alias('margin_std'),

	# 6. Total economic cost (sum of shadow prices)
	pl.col('shadow_price').sum().alias('total_congestion_cost'),

	# 7. Max shadow price
	pl.col('shadow_price').max().alias('max_shadow_price')
	])

	return system_features
	```

	---

	## 7. Shadow Prices Treatment

	### 7.1 Shadow Price Features

	```python
	def engineer_shadow_price_features(df: pl.DataFrame) -> pl.DataFrame:
	"""Create features from shadow prices (economic signals)"""

	df = df.with_columns([
	# 1. Rolling statistics
	pl.col('shadow_price').rolling_mean(window_size=24)
	.alias('shadow_price_ma_24h'),

	pl.col('shadow_price').rolling_max(window_size=24)
	.alias('shadow_price_max_24h'),

	pl.col('shadow_price').rolling_std(window_size=168)
	.alias('shadow_price_volatility_7d'),

	# 2. Binary indicators
	(pl.col('shadow_price') > 0).cast(pl.Int8)
	.alias('has_congestion_cost'),

	(pl.col('shadow_price') > 50).cast(pl.Int8)
	.alias('high_congestion_flag'), # ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬50/MW threshold

	# 3. Economic stress indicator (count of expensive hours in 7d)
	(pl.col('shadow_price') > 50).cast(pl.Int8)
	.rolling_sum(window_size=168)
	.alias('expensive_hours_7d'),

	# 4. Shadow price persistence (how long has it been >0?)
	# Create groups of consecutive non-zero shadow prices
	((pl.col('shadow_price') > 0) != (pl.col('shadow_price').shift(1) > 0))
	.cum_sum()
	.alias('shadow_price_regime'),

	])

	# 5. Duration in current regime
	df = df.with_columns([
	pl.col('shadow_price_regime')
	.rank(method='dense').over('shadow_price_regime')
	.alias('hours_in_regime')
	])

	return df
	```

	---

	## 8. Feature Engineering Pipeline

	### 8.1 Complete Feature Matrix Construction

	Final feature set: ~1,060 features (Option A Architecture)
	- Historical Context: 730 features (what happened in past 21 days)
	- Future Covariates: 280 features (what we know will happen in next 14 days)
	- System-Level: 50 features (real-time aggregates)

	Architecture Rationale:
	- Top 50 CNECs: Full detail (5 metrics each = 250 features)
	- Most impactful CNECs get complete representation
	- Tier-2 150 CNECs: Selective detail (300 features)
	- Binary indicators: `presolved` (binding) + `outage_active` (150 + 150 = 300 features)
	- Aggregated metrics: RAM, shadow prices by border (30 features)
	- Preserves high-signal discrete events while reducing continuous redundancy
	- Supporting Features: 380 features
	- PTDF patterns, border capacities, weather, temporal, interactions

	```python
	class JAOFeatureEngineer:
	"""
	Complete JAO feature engineering pipeline - Option A Architecture

	Architecture:
	- Top 50 CNECs: Full detail (5 metrics ÃƒÆ’Ã¢â‚¬â€ 50 = 250 features)
	- Tier-2 150 CNECs: Selective (presolved + outage_active = 300 features)
	- Tier-2 aggregated: Border-level RAM/shadow prices (30 features)
	- Supporting: PTDF, borders, system, temporal, weather (150 features)

	Total: ~730 historical features + ~280 future covariates

	Inputs: Raw JAO data (CNECs, PTDFs, RAMs, shadow prices) + ENTSO-E outages
	Outputs: Engineered feature matrix (512h ÃƒÆ’Ã¢â‚¬â€ 730 features) +
	Future covariates (336h ÃƒÆ’Ã¢â‚¬â€ 280 features)
	"""

	def __init__(self, top_n_cnecs: int = 50, tier2_n_cnecs: int = 150,
	ptdf_components: int = 10):
	self.top_n_cnecs = top_n_cnecs
	self.tier2_n_cnecs = tier2_n_cnecs
	self.total_cnecs = top_n_cnecs + tier2_n_cnecs # 200 total
	self.ptdf_components = ptdf_components
	self.pca_model = None
	self.scaler = None
	self.top_cnecs = None
	self.tier2_cnecs = None

	def fit(self, historical_data: dict):
	"""
	Fit on 12-month historical data to establish baselines

	Args:
	historical_data: {
	'cnecs': pl.DataFrame,
	'ptdfs': np.ndarray,
	'rams': pl.DataFrame,
	'shadow_prices': pl.DataFrame,
	'outages': pl.DataFrame (from ENTSO-E via EIC matching)
	}
	"""

	print("Fitting JAO feature engineer on historical data...")

	# 1. Select top 200 CNECs (50 + 150)
	print(f" - Selecting top {self.total_cnecs} CNECs...")
	all_selected = select_top_cnecs(
	historical_data['cnecs'],
	n_cnecs=self.total_cnecs
	)
	self.top_cnecs = all_selected[:self.top_n_cnecs]
	self.tier2_cnecs = all_selected[self.top_n_cnecs:]

	print(f" Top 50: {self.top_cnecs[:3]}... (full detail)")
	print(f" Tier-2 150: {self.tier2_cnecs[:3]}... (selective detail)")

	# 2. Engineer PTDF features (hybrid approach)
	print(" - Engineering PTDF features...")
	ptdf_compressed, self.pca_model, self.scaler = reduce_ptdf_dimensions(
	historical_data['ptdfs'],
	n_components=self.ptdf_components
	)

	# 3. Calculate historical baselines (for percentiles, etc.)
	print(" - Computing historical baselines...")
	self.ram_baseline = historical_data['rams'].groupby('cnec_id').agg([
	pl.col('ram_after').mean().alias('ram_mean'),
	pl.col('ram_after').std().alias('ram_std'),
	pl.col('ram_after').quantile(0.1).alias('ram_p10'),
	pl.col('ram_after').quantile(0.9).alias('ram_p90')
	])

	print("ÃƒÂ¢Ã…â€œÃ¢â‚¬Å“ Feature engineer fitted")

	def transform(self,
	data: dict,
	start_time: str,
	end_time: str) -> tuple[np.ndarray, np.ndarray]:
	"""
	Transform JAO data to feature matrices

	Args:
	data: Same structure as fit()
	start_time: Start of context window (e.g., 21 days before prediction)
	end_time: End of context window (prediction time)

	Returns:
	historical_features: (n_hours, 730) array of historical context
	future_features: (n_hours_future, 280) array of future covariates
	"""

	n_hours = len(pd.date_range(start_time, end_time, freq='H'))
	historical_features = np.zeros((n_hours, 730))

	print("Engineering historical features...")

	# Filter to time window
	cnec_data = (data['cnecs']
	.filter(pl.col('timestamp').is_between(start_time, end_time)))

	outage_data = (data['outages']
	.filter(pl.col('timestamp').is_between(start_time, end_time)))

	# === TOP 50 CNECs - FULL DETAIL (250 features) ===
	print(" - Top 50 CNECs (full detail)...")
	top50_data = cnec_data.filter(pl.col('cnec_id').is_in(self.top_cnecs))
	top50_outages = outage_data.filter(pl.col('cnec_id').is_in(self.top_cnecs))

	col_idx = 0
	for cnec_id in self.top_cnecs:
	cnec_series = top50_data.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
	cnec_outage = top50_outages.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')

	# 5 metrics per CNEC
	historical_features[:, col_idx] = cnec_series['ram_after'].to_numpy()
	historical_features[:, col_idx + 1] = (cnec_series['ram_after'] / cnec_series['fmax']).to_numpy()
	historical_features[:, col_idx + 2] = cnec_series['presolved'].cast(pl.Int8).to_numpy()
	historical_features[:, col_idx + 3] = cnec_series['shadow_price'].to_numpy()
	historical_features[:, col_idx + 4] = cnec_outage['outage_active'].to_numpy()

	col_idx += 5

	# === TIER-2 150 CNECs - SELECTIVE DETAIL (300 features) ===
	print(" - Tier-2 150 CNECs (presolved + outage)...")
	tier2_data = cnec_data.filter(pl.col('cnec_id').is_in(self.tier2_cnecs))
	tier2_outages = outage_data.filter(pl.col('cnec_id').is_in(self.tier2_cnecs))

	# Presolved flags (150 features)
	for cnec_id in self.tier2_cnecs:
	cnec_series = tier2_data.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
	historical_features[:, col_idx] = cnec_series['presolved'].cast(pl.Int8).to_numpy()
	col_idx += 1

	# Outage flags (150 features)
	for cnec_id in self.tier2_cnecs:
	cnec_outage = tier2_outages.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
	historical_features[:, col_idx] = cnec_outage['outage_active'].to_numpy()
	col_idx += 1

	# === TIER-2 AGGREGATED METRICS (30 features) ===
	print(" - Tier-2 aggregated metrics by border...")
	borders = ['DE_FR', 'DE_NL', 'DE_CZ', 'DE_BE', 'FR_BE', 'AT_CZ',
	'DE_AT', 'CZ_SK', 'AT_HU', 'PL_CZ']

	for border in borders:
	# Get tier-2 CNECs for this border
	border_cnecs = [c for c in self.tier2_cnecs if border.replace('_', '') in c]

	if not border_cnecs:
	historical_features[:, col_idx:col_idx + 3] = 0
	col_idx += 3
	continue

	border_data = tier2_data.filter(pl.col('cnec_id').is_in(border_cnecs))

	# Aggregate per timestamp
	border_agg = border_data.groupby('timestamp').agg([
	pl.col('ram_after').mean().alias('avg_ram'),
	(pl.col('ram_after') / pl.col('fmax')).mean().alias('avg_margin'),
	pl.col('shadow_price').sum().alias('total_shadow_price')
	]).sort('timestamp')

	historical_features[:, col_idx] = border_agg['avg_ram'].to_numpy()
	historical_features[:, col_idx + 1] = border_agg['avg_margin'].to_numpy()
	historical_features[:, col_idx + 2] = border_agg['total_shadow_price'].to_numpy()
	col_idx += 3

	# === PTDF PATTERNS (10 features) ===
	print(" - PTDF compression...")
	ptdf_subset = data['ptdfs'][start_time:end_time, :, :]
	ptdf_2d = ptdf_subset.reshape(len(ptdf_subset), -1)
	ptdf_scaled = self.scaler.transform(ptdf_2d)
	ptdf_features = self.pca_model.transform(ptdf_scaled) # (n_hours, 10)
	historical_features[:, col_idx:col_idx + 10] = ptdf_features
	col_idx += 10

	# === BORDER CAPACITY HISTORICAL (20 features) ===
	print(" - Border capacities...")
	capacity_historical = data['entsoe']['crossborder_flows'][start_time:end_time]
	historical_features[:, col_idx:col_idx + 20] = capacity_historical
	col_idx += 20

	# === SYSTEM-LEVEL AGGREGATES (20 features) ===
	print(" - System-level aggregates...")
	system_features = aggregate_ram_system_wide(cnec_data).to_numpy()
	historical_features[:, col_idx:col_idx + 20] = system_features
	col_idx += 20

	# === TEMPORAL FEATURES (10 features) ===
	print(" - Temporal features...")
	timestamps = pd.date_range(start_time, end_time, freq='H')
	temporal = create_temporal_features(timestamps)
	historical_features[:, col_idx:col_idx + 10] = temporal
	col_idx += 10

	# === WEATHER FEATURES (50 features) ===
	print(" - Weather features...")
	weather = data['weather'][start_time:end_time]
	historical_features[:, col_idx:col_idx + 50] = weather
	col_idx += 50

	# === INTERACTION FEATURES (40 features) ===
	print(" - Interaction features...")
	interactions = create_interaction_features(
	cnec_data, weather, timestamps
	)
	historical_features[:, col_idx:col_idx + 40] = interactions
	col_idx += 40

	print(f"ÃƒÂ¢Ã…â€œÃ¢â‚¬Å“ Historical features: {historical_features.shape}")
	assert col_idx == 730, f"Expected 730 features, got {col_idx}"

	# === FUTURE COVARIATES ===
	future_features = self._create_future_covariates(data, end_time)

	return historical_features, future_features

	def _create_future_covariates(self, data: dict, prediction_start: str) -> np.ndarray:
	"""Create future covariates for 14-day horizon"""

	prediction_end = pd.Timestamp(prediction_start) + pd.Timedelta(days=14)
	n_hours_future = 336 # 14 days ÃƒÆ’Ã¢â‚¬â€ 24 hours
	future_features = np.zeros((n_hours_future, 280))

	print("Engineering future covariates...")

	col_idx = 0

	# Top 50 CNEC planned outages (50 features)
	for cnec_id in self.top_cnecs:
	planned_outages = data['outages'].filter(
	(pl.col('cnec_id') == cnec_id) &
	(pl.col('outage_type') == 'planned') &
	(pl.col('timestamp') >= prediction_start) &
	(pl.col('timestamp') < prediction_end)
	).sort('timestamp')

	future_features[:, col_idx] = planned_outages['outage_active'].to_numpy()
	col_idx += 1

	# Tier-2 150 CNEC planned outages (150 features)
	for cnec_id in self.tier2_cnecs:
	planned_outages = data['outages'].filter(
	(pl.col('cnec_id') == cnec_id) &
	(pl.col('outage_type') == 'planned') &
	(pl.col('timestamp') >= prediction_start) &
	(pl.col('timestamp') < prediction_end)
	).sort('timestamp')

	future_features[:, col_idx] = planned_outages['outage_active'].to_numpy()
	col_idx += 1

	# Weather forecasts (50 features)
	weather_forecast = data['weather_forecast'][prediction_start:prediction_end]
	future_features[:, col_idx:col_idx + 50] = weather_forecast
	col_idx += 50

	# Temporal (10 features)
	future_timestamps = pd.date_range(prediction_start, prediction_end, freq='H', inclusive='left')
	temporal = create_temporal_features(future_timestamps)
	future_features[:, col_idx:col_idx + 10] = temporal
	col_idx += 10

	# Border capacity adjustments (20 features)
	planned_ntc = data['entsoe']['planned_ntc'][prediction_start:prediction_end]
	future_features[:, col_idx:col_idx + 20] = planned_ntc
	col_idx += 20

	print(f"ÃƒÂ¢Ã…â€œÃ¢â‚¬Å“ Future covariates: {future_features.shape}")
	assert col_idx == 280, f"Expected 280 future features, got {col_idx}"

	return future_features
	```

	### 8.2 Master Feature List (Option A Architecture)

	Total: ~1,060 features (730 historical + 280 future + 50 real-time aggregates)

	#### Historical Context Features (730 total)

	Category 1: Top 50 CNECs - Full Detail (250 features)
	For each of the 50 most impactful CNECs:
	- `ram_after_cnec_[ID]`: RAM value (MW)
	- `margin_ratio_cnec_[ID]`: RAM / Fmax (0-1 normalized)
	- `presolved_cnec_[ID]`: Binding status (1=binding, 0=not binding)
	- `shadow_price_cnec_[ID]`: Congestion cost (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW)
	- `outage_active_cnec_[ID]`: Outage status (1=under outage, 0=operational)

	Selection Criteria:
	```
	Impact Score = 0.25ÃƒÆ’Ã¢â‚¬â€(appearance rate) + 0.30ÃƒÆ’Ã¢â‚¬â€(binding frequency) +
	0.20ÃƒÆ’Ã¢â‚¬â€(economic impact) + 0.15ÃƒÆ’Ã¢â‚¬â€(RAM tightness) +
	0.10ÃƒÆ’Ã¢â‚¬â€(geographic importance)
	```

	Category 2: Tier-2 150 CNECs - Selective Detail (300 features)

	Individual Binary Indicators (300 features):
	- `presolved_cnec_[ID]`: Binding status for each tier-2 CNEC (150 features)
	- Preserves constraint activation patterns
	- Allows model to learn which CNECs bind under which conditions

	- `outage_active_cnec_[ID]`: Outage status for each tier-2 CNEC (150 features)
	- Preserves EIC matching benefit
	- Enables learning of outage ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ binding causality
	- Available as future covariate (planned outages known ahead)

	Aggregated Continuous Metrics (30 features):
	Grouped by border (10 borders ÃƒÆ’Ã¢â‚¬â€ 3 metrics):
	- `avg_ram_[BORDER]_tier2`: Average RAM for tier-2 CNECs on this border
	- `avg_margin_ratio_[BORDER]_tier2`: Average margin ratio
	- `total_shadow_price_[BORDER]_tier2`: Sum of shadow prices

	Borders: DE_FR, DE_NL, DE_CZ, DE_BE, FR_BE, AT_CZ, DE_AT, CZ_SK, AT_HU, PL_CZ

	Category 3: PTDF Patterns (10 features)
	- `ptdf_pc1` through `ptdf_pc10`: Principal components
	- Compressed from (200 CNECs ÃƒÆ’Ã¢â‚¬â€ 12 zones = 2,400 raw values)
	- Captures ~92% of variance in network sensitivities

	Category 4: Border Capacity Historical (20 features)
	- `capacity_hist_de_fr`, `capacity_hist_de_nl`, etc.
	- One feature per FBMC border (~20 borders)
	- Actual historical cross-border flow capacity from ENTSO-E

	Category 5: System-Level Aggregates (20 features)
	- `system_min_margin`: Tightest CNEC margin across all 200
	- `n_binding_cnecs_total`: Count of all binding CNECs
	- `n_binding_cnecs_top50`: Count of top-50 CNECs binding
	- `n_binding_cnecs_tier2`: Count of tier-2 CNECs binding
	- `margin_std`: Standard deviation of margins (uniformity indicator)
	- `total_congestion_cost`: Sum of all shadow prices (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬)
	- `max_shadow_price`: Highest shadow price (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW)
	- `avg_shadow_price_binding`: Average shadow price when CNECs bind
	- `total_outage_mw_fbmc`: Total transmission capacity under outage
	- `n_cnecs_with_outage`: Count of CNECs affected by outages
	- `forced_outage_count`: Count of unplanned outages
	- `outage_stress_index`: Composite metric (count ÃƒÆ’Ã¢â‚¬â€ duration ÃƒÆ’Ã¢â‚¬â€ forced_ratio)
	- `critical_line_outage_count`: High-voltage (380kV) outages
	- `outage_geographic_spread`: Number of unique borders affected
	- Additional 6 features: criticality scores, violation counts, etc.

	Category 6: Temporal Features (10 features)
	- `hour_of_day`: 0-23
	- `day_of_week`: 0-6 (Monday=0)
	- `month`: 1-12
	- `day_of_year`: 1-365
	- `is_weekend`: Binary (1=Sat/Sun, 0=weekday)
	- `is_peak_hour`: Binary (1=8am-8pm, 0=off-peak)
	- `is_holiday_de`, `is_holiday_fr`, `is_holiday_be`, `is_holiday_nl`: Country-specific holidays

	Category 7: Weather Features (50 features)
	Key grid points (10-12 strategic locations) ÃƒÆ’Ã¢â‚¬â€ 5 metrics:
	- Temperature (2m above ground, Ãƒâ€šÃ‚Â°C)
	- Wind speed at 10m and 100m (m/s)
	- Wind direction at 100m (degrees)
	- Solar radiation / GHI (W/mÃƒâ€šÃ‚Â²)
	- Cloud cover (%)

	Grid points cover: German hubs, French nuclear regions, North Sea wind, Alpine corridor, Baltic connections, etc.

	Category 8: Interaction Features (40 features)
	Cross-feature combinations capturing domain knowledge:
	- `high_wind_low_margin`: (wind > 20 m/s) ÃƒÆ’Ã¢â‚¬â€ (margin < 0.3)
	- `weekend_low_demand_pattern`: is_weekend ÃƒÆ’Ã¢â‚¬â€ (demand < p30)
	- `outage_binding_correlation`: Rolling correlation of outage presence with binding events
	- `nuclear_low_wind_pattern`: (FR nuclear < 40 GW) ÃƒÆ’Ã¢â‚¬â€ (wind < 10 m/s)
	- `solar_peak_congestion`: (solar > 40 GW) ÃƒÆ’Ã¢â‚¬â€ (hour = 12-14)
	- Various other combinations (weather ÃƒÆ’Ã¢â‚¬â€ capacity, temporal ÃƒÆ’Ã¢â‚¬â€ binding, etc.)

	#### Future Covariates (280 features)

	Category 9: Top 50 CNEC Planned Outages (50 features)
	- `planned_outage_cnec_[ID]_d[horizon]`: Binary indicator for D+1 to D+14
	- Known with certainty (scheduled maintenance published by TSOs)
	- From ENTSO-E A78 document type, filtered to status='scheduled'

	Category 10: Tier-2 150 CNEC Planned Outages (150 features)
	- `planned_outage_cnec_[ID]_d[horizon]`: Binary for each tier-2 CNEC
	- Preserves full EIC matching benefit for future horizon
	- High-value features (outages directly constrain capacity)

	Category 11: Weather Forecasts (50 features)
	- Same structure as historical weather (10-12 grid points ÃƒÆ’Ã¢â‚¬â€ 5 metrics)
	- OpenMeteo provides 14-day forecasts (updated daily)
	- Includes: temperature, wind, solar radiation, cloud cover

	Category 12: Future Temporal (10 features)
	- Same temporal features projected for D+1 to D+14
	- Known with certainty (calendar, holidays, hour/day patterns)

	Category 13: Border Capacity Adjustments (20 features)
	- `planned_ntc_[BORDER]_d1`: Day-ahead NTC publication per border
	- TSOs publish Net Transfer Capacity (NTC) values D-1 for D-day
	- Available via ENTSO-E Transparency Platform
	- Provides official capacity forecasts (complementary to our model)

	---

	### Feature Engineering Sequence

	Step 1: Load Raw Data (Day 1)
	- JAO CNECs, PTDFs, RAMs, shadow prices (24 months)
	- ENTSO-E outages via EIC matching
	- ENTSO-E actual generation, load, cross-border flows
	- OpenMeteo weather (historical + forecasts)

	Step 2: Preprocess (Day 1-2)
	- Clean missing values (field-specific strategies)
	- Detect and clip outliers
	- Align timestamps (CET/CEST ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ UTC)
	- Deduplicate records

	Step 3: CNEC Selection (Day 2)
	- Analyze all ~2,000 CNECs over 24 months
	- Rank by impact score
	- Select top 200 (50 + 150)

	Step 4: Feature Engineering (Day 2)
	- Top 50: Extract 5 metrics each
	- Tier-2: Extract presolved + outage_active individually
	- Tier-2: Aggregate RAM/shadow prices by border
	- PTDF: Extract individual features for top 50, aggregate for tier-2
	- Create temporal, weather, interaction features

	Step 5: Quality Checks (Day 2)
	- Validate feature distributions
	- Check for NaN/Inf values
	- Verify feature variance (no zero-variance)
	- Confirm temporal alignment

	Step 6: Save Feature Matrix (Day 2)
	```
	data/processed/
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ features_historical_730.parquet # (17520 hours ÃƒÆ’Ã¢â‚¬â€ 730 features)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ features_future_280.parquet # (17520 hours ÃƒÆ’Ã¢â‚¬â€ 280 future covariates)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ feature_names.json # Column name mapping
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ feature_metadata.json # Descriptions, ranges, types
	```

	---

	## 9. Quality Assurance

	### 9.1 Data Validation Checks

	```python
	def validate_jao_data(df: pl.DataFrame) -> dict:
	"""
	Run comprehensive validation checks on JAO data

	Returns: Dict of validation results
	"""

	results = {
	'passed': True,
	'warnings': [],
	'errors': []
	}

	# 1. Check for critical missing values
	critical_fields = ['timestamp', 'cnec_id', 'fmax', 'ram_after']
	for field in critical_fields:
	missing_pct = df[field].null_count() / len(df) * 100
	if missing_pct > 0:
	results['errors'].append(
	f"Critical field '{field}' has {missing_pct:.2f}% missing values"
	)
	results['passed'] = False

	# 2. Check timestamp continuity (should be hourly)
	time_diffs = df['timestamp'].diff().dt.total_hours()
	non_hourly = (time_diffs != 1).sum()
	if non_hourly > 0:
	results['warnings'].append(
	f"Found {non_hourly} non-hourly gaps in timestamps"
	)

	# 3. Check RAM physical constraints
	invalid_ram = (df['ram_after'] > df['fmax']).sum()
	if invalid_ram > 0:
	results['errors'].append(
	f"Found {invalid_ram} records where RAM > Fmax (physically impossible)"
	)
	results['passed'] = False

	negative_ram = (df['ram_after'] < 0).sum()
	if negative_ram > 0:
	results['warnings'].append(
	f"Found {negative_ram} records with negative RAM (clipped to 0)"
	)

	# 4. Check PTDF bounds
	if 'ptdf_value' in df.columns:
	out_of_bounds = ((df['ptdf_value'] < -1.5) \| (df['ptdf_value'] > 1.5)).sum()
	if out_of_bounds > 0:
	results['warnings'].append(
	f"Found {out_of_bounds} PTDF values outside [-1.5, 1.5] (clipped)"
	)

	# 5. Check shadow price reasonableness
	extreme_shadows = (df['shadow_price'] > 1000).sum()
	if extreme_shadows > 0:
	results['warnings'].append(
	f"Found {extreme_shadows} shadow prices > ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬1000/MW (check for outliers)"
	)

	# 6. Check for duplicate records
	duplicates = df.select(['timestamp', 'cnec_id']).is_duplicated().sum()
	if duplicates > 0:
	results['errors'].append(
	f"Found {duplicates} duplicate (timestamp, cnec_id) pairs"
	)
	results['passed'] = False

	# 7. Check data completeness by month
	monthly_counts = df.groupby(df['timestamp'].dt.month()).count()
	expected_hours_per_month = {1: 744, 2: 672, 3: 744, ...} # Account for leap years
	for month, count in monthly_counts.items():
	expected = expected_hours_per_month.get(month, 720)
	if count < expected * 0.95: # Allow 5% missing
	results['warnings'].append(
	f"Month {month} has only {count}/{expected} expected hours"
	)

	return results
	```

	### 9.2 Feature Validation

	```python
	def validate_features(features: np.ndarray,
	feature_names: list[str]) -> dict:
	"""Validate engineered feature matrix"""

	results = {'passed': True, 'warnings': [], 'errors': []}

	# 1. Check for NaN/Inf
	nan_cols = np.isnan(features).any(axis=0)
	if nan_cols.any():
	nan_features = [feature_names[i] for i, is_nan in enumerate(nan_cols) if is_nan]
	results['errors'].append(f"Features with NaN: {nan_features}")
	results['passed'] = False

	inf_cols = np.isinf(features).any(axis=0)
	if inf_cols.any():
	inf_features = [feature_names[i] for i, is_inf in enumerate(inf_cols) if is_inf]
	results['errors'].append(f"Features with Inf: {inf_features}")
	results['passed'] = False

	# 2. Check feature variance (avoid zero-variance features)
	variances = np.var(features, axis=0)
	zero_var = variances < 1e-8
	if zero_var.any():
	zero_var_features = [feature_names[i] for i, is_zero in enumerate(zero_var) if is_zero]
	results['warnings'].append(
	f"Features with near-zero variance: {zero_var_features}"
	)

	# 3. Check for features outside expected ranges
	for i, fname in enumerate(feature_names):
	if 'margin_ratio' in fname or 'percentile' in fname:
	# Should be in [0, 1]
	if (features[:, i] < -0.1).any() or (features[:, i] > 1.1).any():
	results['warnings'].append(
	f"Feature {fname} has values outside [0, 1]"
	)

	return results
	```

	### 9.3 Automated Testing

	```python
	def run_jao_data_pipeline_tests():
	"""Comprehensive test suite for JAO data pipeline"""

	import pytest

	class TestJAOPipeline:

	def test_data_download(self):
	"""Test JAOPuTo download completes successfully"""
	# Mock test or actual small date range
	pass

	def test_cnec_masking(self):
	"""Test CNEC masking creates correct structure"""
	# Create synthetic data with missing CNECs
	# Verify mask=0 for missing, mask=1 for present
	pass

	def test_ptdf_pca(self):
	"""Test PTDF dimensionality reduction"""
	# Verify variance explained >90%
	# Verify output shape (n_hours, 10)
	pass

	def test_ram_normalization(self):
	"""Test RAM normalization doesn't exceed bounds"""
	# Verify 0 <= margin_ratio <= 1
	pass

	def test_feature_engineering_shape(self):
	"""Test complete feature matrix has correct shape"""
	# Verify (512, 70) for historical context
	pass

	def test_no_data_leakage(self):
	"""Verify no future data leaks into historical features"""
	# Check that features at time T only use data up to time T
	pass

	pytest.main([__file__])
	```

	---

	## Summary: Data Collection Checklist

	### Day 1 Data Collection (9.5 hours - includes outage integration)

	Morning (4.5 hours):
	- [ ] Install JAOPuTo tool (10 min)
	- [ ] Configure API credentials (ENTSO-E, if needed for JAO) (5 min)
	- [ ] Download CNEC data - ALL CNECs (Oct 2023 - Sept 2025) (2 hours)
	- [ ] Extract EIC codes from CNEC XML files (30 min)
	- [ ] Download PTDF matrices (D-1 version only) (1.5 hours)
	- [ ] Initial data validation checks (15 min)

	Afternoon (5 hours):
	- [ ] Download RAM values (1.5 hours)
	- [ ] Download shadow prices (1 hour)
	- [ ] Download presolved flags (included with CNECs)
	- [ ] Collect ENTSO-E outage data via API (A78 document type) (20 min)
	- [ ] Execute EIC-based CNEC-to-outage matching (exact + fuzzy) (15 min)
	- [ ] Generate outage time series for all downloaded CNECs (15 min)
	- [ ] Run cleaning procedures on all datasets (1 hour)
	- [ ] Post-hoc analysis: Identify top 200 CNECs (50 + 150) (45 min)
	- [ ] Create CNEC master set with masking (30 min)
	- [ ] Engineer hybrid PTDF features (individual + aggregated) (20 min)

	End of Day 1 Deliverables:
	```
	data/raw/
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ jao/
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬Å¡ ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnecs_all_2024_2025.parquet (~2-3 GB, all ~2000 CNECs)
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬Å¡ ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ ptdfs_2024_2025.parquet (~800 MB, D-1 version)
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬Å¡ ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ rams_2024_2025.parquet (~400 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬Å¡ ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ shadow_prices_2024_2025.parquet (~300 MB)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ entsoe/
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬Å¡ ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ outages_12m.parquet (~150 MB, A78 transmission outages)

	data/processed/
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnec_eic_lookup.parquet (~5 MB, all CNECs with EIC codes)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnec_outage_matched.parquet (~5 MB, CNECÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢outage EIC mapping)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ outage_time_series_all_cnecs.parquet (~200 MB, hourly outage indicators)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ top_200_cnecs.json (List of selected CNECs: 50 + 150)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnec_impact_scores.parquet (Ranking criteria for all CNECs)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ jao_cnecs_cleaned.parquet (~500 MB, top 200 CNECs only)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ jao_ptdfs_compressed.parquet (~50 MB, 10 PCA components)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ jao_rams_normalized.parquet (~80 MB, top 200 CNECs only)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ jao_shadow_prices_cleaned.parquet (~60 MB, top 200 CNECs only)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnec_master_set_200.parquet (Complete time series, with masking)
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ ptdf_pca_model.pkl (Fitted PCA for future transforms)

	reports/
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ data_quality_report.json (Validation results)
	ÃƒÂ¢Ã¢â‚¬ÂÃ…â€œÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ eic_matching_report.json (Match rates, fuzzy stats)
	ÃƒÂ¢Ã¢â‚¬ÂÃ¢â‚¬ÂÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ÃƒÂ¢Ã¢â‚¬ÂÃ¢â€šÂ¬ cnec_selection_analysis.html (Top 200 selection justification)
	```

	Key Metrics to Report:
	- Total unique CNECs downloaded: ~2,000
	- Top 200 selected: 50 (full detail) + 150 (selective detail)
	- EIC exact match rate: ~85-95%
	- EIC total match rate (with fuzzy): >95%
	- Data completeness: >98% for critical fields
	- Outliers detected and cleaned: <1% of records

	---

	## Next Steps

	Day 2: Combine JAO features with:
	- ENTSO-E actual generation/load/cross-border flow data
	- OpenMeteo weather forecasts (52 grid points)
	- Create complete feature matrix:
	- 730 historical context features
	- 280 future covariate features
	- 50 real-time system aggregates

	Day 3: Feature validation and zero-shot inference preparation
	Day 4: Zero-shot inference with Chronos 2 (multivariate forecasting)
	Day 5: Evaluation, documentation, and handover

	---

	## 10. ENTSO-E Outage Data via EIC Matching

	### 10.1 EIC Code System Overview

	What are EIC Codes?
	- Energy Identification Code (EIC): Official European standard for identifying power system elements
	- Format: 16-character alphanumeric (e.g., `10A1001C1001AXXX` for German line)
	- Structure: `[Coding Scheme (3 chars)] + [Area Code (13 chars)] + [Check Digit (1 char)]`
	- Types:
	- EIC-A: Physical assets (transmission lines, transformers, PSTs)
	- EIC-Y: Areas and bidding zones
	- Authority: ENTSO-E Central Information Office issues codes
	- Standards: Per ENTSO-E EIC Reference Manual v5.5, JAO Handbook v2.2

	Why EIC Matching Works:
	- Both JAO CNECs and ENTSO-E outages use EIC codes as primary identifiers
	- JAO Handbook v2.2 (pp. 15-18): CNECs contain `<EIC_Code>` field
	- ENTSO-E Transparency Platform: Outages include "Outage Equipment EIC"
	- 85-95% exact match rate (per ENTSO-E implementation guides)
	- Superior to name-based matching (avoids TSO naming variations)

	### 10.2 JAO CNEC EIC Extraction

	Data Source: `Core_DA_CC_CNEC_[date].xml` (JAO daily publication)

	Fields to Extract:
	```xml
	<CNEC>
	<CNEC_ID>DE_CZ_TIE_001</CNEC_ID>
	<EIC_Code>10A1001C1001AXXX</EIC_Code> <!-- Primary identifier -->
	<BranchEIC>10T-DE-XXXXXXX</BranchEIC> <!-- For multi-branch -->
	<NamePerConvention>Line RÃƒÆ’Ã‚Â¶hrsdorf-Hradec 380kV N-1</NamePerConvention>
	<VoltageLevel>380</VoltageLevel>
	<TSO>50Hertz</TSO>
	<FromSubstation>
	<Name>RÃƒÆ’Ã‚Â¶hrsdorf</Name>
	<EIC>10YDE-XXXXXXX</EIC>
	</FromSubstation>
	<ToSubstation>
	<Name>Hradec</Name>
	<EIC>10YCZ-XXXXXXX</EIC>
	</ToSubstation>
	</CNEC>
	```

	Preprocessing Script:
	```python
	import xmltodict
	import polars as pl
	from pathlib import Path

	def extract_cnec_eic_codes(jao_data_dir: Path) -> pl.DataFrame:
	"""Extract EIC codes from JAO CNEC XML files"""

	cnec_eic_mapping = []

	# Process all CNEC files in date range
	for xml_file in jao_data_dir.glob('Core_DA_CC_CNEC_*.xml'):
	with open(xml_file, 'r') as f:
	data = xmltodict.parse(f.read())

	for cnec in data['CNECs']['CNEC']:
	cnec_eic_mapping.append({
	'cnec_id': cnec['@id'],
	'eic_code': cnec.get('EIC_Code'),
	'branch_eic': cnec.get('BranchEIC'),
	'line_name': cnec.get('NamePerConvention'),
	'voltage_level': int(cnec.get('VoltageLevel', 0)),
	'tso': cnec.get('TSO'),
	'from_substation': cnec['FromSubstation']['Name'],
	'to_substation': cnec['ToSubstation']['Name'],
	'from_substation_eic': cnec['FromSubstation'].get('EIC'),
	'to_substation_eic': cnec['ToSubstation'].get('EIC'),
	'contingency': cnec.get('Contingency', {}).get('Name')
	})

	# Create lookup table (deduplicate CNECs)
	cnec_eic_df = pl.DataFrame(cnec_eic_mapping).unique(subset=['cnec_id'])

	return cnec_eic_df

	# Execute
	cnec_eic_lookup = extract_cnec_eic_codes(Path('data/raw/jao/'))
	cnec_eic_lookup.write_parquet('data/processed/cnec_eic_lookup.parquet')
	print(f"Extracted EIC codes for {len(cnec_eic_lookup)} unique CNECs")
	```

	### 10.3 ENTSO-E Outage Data Collection

	Data Source: ENTSO-E Transparency Platform API
	- Document Type: A78 - "Unavailability of Transmission Infrastructure"
	- Coverage: Planned and forced outages for transmission elements (lines, transformers, HVDC)
	- Frequency: Hourly updates, historical data back to 2015
	- Domain: Core CCR region (EIC: `10Y1001A1001A83F`)

	API Collection:
	```python
	from entsoe import EntsoePandasClient
	import polars as pl
	import pandas as pd

	def collect_entsoe_outages(api_key: str, start_date: str, end_date: str) -> pl.DataFrame:
	"""Collect transmission outage data from ENTSO-E"""

	client = EntsoePandasClient(api_key=api_key)

	# Query Core CCR domain
	# Note: May need to query per country and aggregate
	outage_records = []

	core_countries = ['DE', 'FR', 'NL', 'BE', 'AT', 'CZ', 'PL', 'HU', 'RO', 'SK', 'SI', 'HR']

	for country in core_countries:
	print(f"Fetching outages for {country}...")

	try:
	# Query unavailability (A78 document type)
	outages = client.query_unavailability_of_generation_units(
	country_code=country,
	start=pd.Timestamp(start_date, tz='UTC'),
	end=pd.Timestamp(end_date, tz='UTC'),
	doctype='A78' # Transmission infrastructure
	)

	# Parse response
	for idx, outage in outages.iterrows():
	outage_records.append({
	'outage_eic': outage.get('affected_unit_eic'),
	'line_name': outage.get('affected_unit_name'),
	'voltage_kv': outage.get('nominal_power'), # For lines, often voltage
	'tso': outage.get('tso'),
	'country': country,
	'outage_type': outage.get('type'), # 'A53' (planned) or 'A54' (forced)
	'start_time': outage.get('start'),
	'end_time': outage.get('end'),
	'available_capacity_mw': outage.get('available_capacity'),
	'unavailable_capacity_mw': outage.get('unavailable_capacity'),
	'status': outage.get('status') # Active, scheduled, cancelled
	})
	except Exception as e:
	print(f"Warning: Could not fetch outages for {country}: {e}")
	continue

	outages_df = pl.DataFrame(outage_records)

	# Filter to transmission elements only (voltage >= 220 kV)
	outages_df = outages_df.filter(
	(pl.col('voltage_kv') >= 220) \| pl.col('voltage_kv').is_null()
	)

	return outages_df

	# Execute
	outages = collect_entsoe_outages(
	api_key='YOUR_ENTSOE_KEY',
	start_date='2023-10-01',
	end_date='2025-09-30'
	)
	outages.write_parquet('data/raw/entsoe_outages_12m.parquet')
	print(f"Collected {len(outages)} outage records")
	```

	### 10.4 EIC-Based Matching (Primary + Fuzzy Fallback)

	Matching Strategy:
	1. Primary: Exact EIC code matching (85-95% success rate)
	2. Fallback: Fuzzy matching on line name + substations + voltage (5-15% remaining)
	3. Result: >95% total match rate

	```python
	import polars as pl
	from fuzzywuzzy import fuzz, process

	def match_cnecs_to_outages(cnec_eic: pl.DataFrame,
	outages: pl.DataFrame) -> pl.DataFrame:
	"""
	Match CNECs to outages using EIC codes with fuzzy fallback

	Returns: DataFrame with CNEC-to-outage mappings
	"""

	# STEP 1: Exact EIC Match
	matched = cnec_eic.join(
	outages.select(['outage_eic', 'line_name', 'tso', 'voltage_kv',
	'start_time', 'end_time', 'outage_type']),
	left_on='eic_code',
	right_on='outage_eic',
	how='left'
	)

	# Check exact match rate
	exact_matched = matched.filter(pl.col('outage_eic').is_not_null())
	exact_match_rate = len(exact_matched) / len(matched)
	print(f"Exact EIC match rate: {exact_match_rate:.1%}")
	print(f" Matched: {len(exact_matched)} CNECs")
	print(f" Unmatched: {len(matched) - len(exact_matched)} CNECs")

	# STEP 2: Fuzzy Fallback for Unmatched
	unmatched = matched.filter(pl.col('outage_eic').is_null())

	if len(unmatched) > 0:
	print(f"\nApplying fuzzy matching to {len(unmatched)} unmatched CNECs...")

	# Prepare search corpus from outages
	outage_search_corpus = []
	for row in outages.iter_rows(named=True):
	search_str = f"{row['line_name']} {row['tso']} {row['voltage_kv']}kV"
	outage_search_corpus.append({
	'search_str': search_str,
	'outage_eic': row['outage_eic']
	})

	fuzzy_matches = []
	for row in unmatched.iter_rows(named=True):
	# Construct CNEC search string
	cnec_search = f"{row['line_name']} {row['from_substation']} {row['to_substation']} {row['tso']} {row['voltage_level']}kV"

	# Find best match
	best_match = process.extractOne(
	cnec_search,
	[item['search_str'] for item in outage_search_corpus],
	scorer=fuzz.token_set_ratio,
	score_cutoff=85 # 85% similarity threshold
	)

	if best_match:
	match_idx = [i for i, item in enumerate(outage_search_corpus)
	if item['search_str'] == best_match[0]][0]
	matched_eic = outage_search_corpus[match_idx]['outage_eic']

	fuzzy_matches.append({
	'cnec_id': row['cnec_id'],
	'matched_outage_eic': matched_eic,
	'match_score': best_match[1],
	'match_method': 'fuzzy'
	})

	print(f" Fuzzy matched: {len(fuzzy_matches)} additional CNECs")

	# Update matched dataframe with fuzzy results
	fuzzy_df = pl.DataFrame(fuzzy_matches)
	matched = matched.join(
	fuzzy_df,
	on='cnec_id',
	how='left'
	).with_columns([
	pl.when(pl.col('outage_eic').is_null())
	.then(pl.col('matched_outage_eic'))
	.otherwise(pl.col('outage_eic'))
	.alias('outage_eic')
	])

	# Final match statistics
	final_matched = matched.filter(pl.col('outage_eic').is_not_null())
	final_match_rate = len(final_matched) / len(matched)
	print(f"\nFinal match rate: {final_match_rate:.1%}")
	print(f" Total matched: {len(final_matched)} CNECs")
	print(f" Unmatched: {len(matched) - len(final_matched)} CNECs")

	return matched

	# Execute matching
	cnec_eic = pl.read_parquet('data/processed/cnec_eic_lookup.parquet')
	outages = pl.read_parquet('data/raw/entsoe_outages_12m.parquet')

	matched_cnecs = match_cnecs_to_outages(cnec_eic, outages)
	matched_cnecs.write_parquet('data/processed/cnec_outage_matched.parquet')
	```

	Expected Output:
	```
	Exact EIC match rate: 87.5%
	Matched: 175 CNECs
	Unmatched: 25 CNECs

	Applying fuzzy matching to 25 unmatched CNECs...
	Fuzzy matched: 18 additional CNECs

	Final match rate: 96.5%
	Total matched: 193 CNECs
	Unmatched: 7 CNECs
	```

	### 10.5 Outage Feature Engineering

	Outage Time Series Generation:
	```python
	def create_outage_time_series(matched_cnecs: pl.DataFrame,
	outages: pl.DataFrame,
	timestamps: pl.Series) -> pl.DataFrame:
	"""
	Create binary outage indicators for each CNEC over time

	For each timestamp and CNEC:
	- outage_active = 1 if element under outage
	- outage_active = 0 if element operational
	- outage_type = 'planned' or 'forced' if active

	Returns: (n_timestamps ÃƒÆ’Ã¢â‚¬â€ n_cnecs) DataFrame
	"""

	outage_series = []

	for cnec in matched_cnecs.iter_rows(named=True):
	cnec_id = cnec['cnec_id']
	outage_eic = cnec['outage_eic']

	if outage_eic is None:
	# No matching outage data - assume operational
	for ts in timestamps:
	outage_series.append({
	'timestamp': ts,
	'cnec_id': cnec_id,
	'outage_active': 0,
	'outage_type': None,
	'days_until_end': None
	})
	continue

	# Get all outages for this EIC
	cnec_outages = outages.filter(pl.col('outage_eic') == outage_eic)

	for ts in timestamps:
	# Check if any outage is active at this timestamp
	active_outages = cnec_outages.filter(
	(pl.col('start_time') <= ts) & (pl.col('end_time') >= ts)
	)

	if len(active_outages) > 0:
	# Take the outage with earliest end time (most immediate constraint)
	primary_outage = active_outages.sort('end_time').head(1)
	outage_row = primary_outage.row(0, named=True)

	days_remaining = (outage_row['end_time'] - ts).total_seconds() / 86400

	outage_series.append({
	'timestamp': ts,
	'cnec_id': cnec_id,
	'outage_active': 1,
	'outage_type': outage_row['outage_type'],
	'days_until_end': days_remaining,
	'unavailable_capacity_mw': outage_row.get('unavailable_capacity_mw')
	})
	else:
	# No active outage
	outage_series.append({
	'timestamp': ts,
	'cnec_id': cnec_id,
	'outage_active': 0,
	'outage_type': None,
	'days_until_end': None,
	'unavailable_capacity_mw': None
	})

	return pl.DataFrame(outage_series)

	# Generate outage time series for all 200 CNECs
	timestamps = pl.date_range(
	start='2023-10-01',
	end='2025-09-30',
	interval='1h'
	)

	outage_features = create_outage_time_series(
	matched_cnecs=pl.read_parquet('data/processed/cnec_outage_matched.parquet'),
	outages=pl.read_parquet('data/raw/entsoe_outages_12m.parquet'),
	timestamps=timestamps
	)

	outage_features.write_parquet('data/processed/outage_time_series_200cnecs.parquet')
	print(f"Generated outage features: {outage_features.shape}")
	```

	Outage Feature Categories:

	1. CNEC-Level Binary Indicators (200 features for all CNECs):
	- `outage_active_cnec_001` through `outage_active_cnec_200`
	- Value: 1 if element under outage, 0 if operational
	- Available as both historical context and future covariates (planned outages)

	2. Border-Level Aggregations (20 features):
	- `outage_count_de_cz`: Number of CNECs on DE-CZ border with active outages
	- `outage_mw_de_cz`: Total unavailable capacity on DE-CZ border
	- `forced_outage_ratio_de_cz`: Ratio of forced vs planned outages
	- Repeat for 10 major borders ÃƒÆ’Ã¢â‚¬â€ 2 metrics = 20 features

	3. System-Level Aggregations (8 features):
	- `total_outage_mw_fbmc`: Total transmission capacity unavailable
	- `n_cnecs_with_outage`: Count of CNECs affected by outages
	- `forced_outage_count`: Count of unplanned outages (stress indicator)
	- `max_outage_duration_remaining`: Days until longest outage ends
	- `avg_outage_duration`: Average remaining outage duration
	- `outage_stress_index`: Weighted measure (count ÃƒÆ’Ã¢â‚¬â€ avg_duration ÃƒÆ’Ã¢â‚¬â€ forced_ratio)
	- `critical_line_outage_count`: Count of high-voltage (380kV) outages
	- `outage_geographic_spread`: Number of unique borders affected

	4. Outage Duration Features (Top 50 CNECs - 150 features):
	For each of the Top 50 CNECs, calculate three temporal features:
	- `outage_elapsed_cnec_[ID]`: Hours elapsed since outage started
	- Calculation: `(current_timestamp - start_time).total_hours()`
	- Value: 0 if no active outage
	- `outage_remaining_cnec_[ID]`: Hours remaining until outage ends
	- Calculation: `(end_time - current_timestamp).total_hours()`
	- Value: 0 if no active outage
	- `outage_total_duration_cnec_[ID]`: Total planned duration of active outage
	- Calculation: `(end_time - start_time).total_hours()`
	- Value: 0 if no active outage

	Rationale:
	- Binary `outage_active` doesn't capture temporal stress accumulation
	- A 2-hour outage has different impact than 48-hour outage
	- Enables model to learn: "When outage_elapsed > 24h, constraint severity increases"
	- Direct causal chain: duration â†’ network stress â†’ RAM reduction â†’ Max BEX impact
	- Research indicates ~15% of BEX anomalies correlate with extended outage durations

	5. Outage Duration Aggregates (Tier-2 150 CNECs - 30 features):
	For tier-2 CNECs, aggregate duration metrics by border:
	- `avg_outage_duration_[BORDER]_tier2`: Average duration of active outages (hours)
	- `max_outage_duration_[BORDER]_tier2`: Longest active outage on border (hours)
	- `forced_outage_duration_ratio_[BORDER]_tier2`: (forced_duration / total_duration)
	- 10 major borders ÃƒÆ’Ã¢â‚¬" 3 metrics = 30 features

	Future Covariates (Planned Outages):
	- Same 200 CNEC-level binary indicators for D+1 to D+14 horizon
	- Only includes planned/scheduled outages (status = 'scheduled' in ENTSO-E)
	- These are KNOWN with certainty (gold for forecasting)

	### 10.6 Integration with Main Pipeline

	Day 1 Timeline Addition:
	- Morning (+30 min): Extract EIC codes from JAO CNEC files
	- Morning (+20 min): Collect ENTSO-E outage data via API
	- Afternoon (+15 min): Execute EIC-based matching (exact + fuzzy)
	- Afternoon (+15 min): Generate outage time series for 200 CNECs

	Total Additional Time: ~1.5 hours

	Data Flow:
	```
	JAO CNEC XML ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ Extract EIC codes ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ cnec_eic_lookup.parquet
	ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬Å“
	ENTSO-E API ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ Outage data ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ entsoe_outages_12m.parquet
	ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬Å“
	EIC Matching (exact + fuzzy)
	ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬Å“
	cnec_outage_matched.parquet
	ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬Å“
	Generate time series for each CNEC
	ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬Å“
	outage_time_series_200cnecs.parquet
	ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬Å“
	Feature Engineering Pipeline (Day 2)
	```

	---

	## 11. Final Feature Architecture (Option A with Tier-2 Binding)

	### 11.1 Feature Count Summary

	Total Features: ~1,735 (Hybrid PTDF + Max BEX + LTN + Net Positions + ATC + Outage Duration)
	- Historical Context: ~1,000 features
	- Future Covariates: ~380 features

	New Feature Categories Added:
	- Target History (Max BEX): +20 features (historical context)
	- LTN Allocations: +40 features (20 historical + 20 future covariates)
	- Net Position Features: +48 features (24 min + 24 max values)
	- Non-Core ATC: +28 features (14 borders Ãƒâ€” 2 directions)

	### 11.2 Historical Context Features (1,000 features)

	#### Top 50 CNECs - Full Detail + Individual PTDFs (1,000 features)
	For each of the 50 most impactful CNECs, 8 core metrics + 12 PTDF sensitivities:
	- `ram_after_cnec_[ID]`: RAM value (MW)
	- `margin_ratio_cnec_[ID]`: RAM / Fmax (normalized)
	- `presolved_cnec_[ID]`: Binding status (1 = binding, 0 = not binding)
	- `shadow_price_cnec_[ID]`: Congestion cost (ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬/MW)
	- `outage_active_cnec_[ID]`: Outage status (1 = element under outage, 0 = operational)
	- `outage_elapsed_cnec_[ID]`: Hours elapsed since outage started (0 if no outage)
	- `outage_remaining_cnec_[ID]`: Hours remaining until outage ends (0 if no outage)
	- `outage_total_duration_cnec_[ID]`: Total planned duration of active outage (0 if no outage)

	Total Top 50 CNEC Features: 50 CNECs ÃƒÆ’Ã¢â‚¬" (8 core metrics + 12 PTDF sensitivities) = 1,000 features

	Selection Criteria for Top 50:
	```python
	top_50_score = (
	0.25 ÃƒÆ’Ã¢â‚¬â€ (days_appeared / 365) + # Consistency
	0.30 ÃƒÆ’Ã¢â‚¬â€ (times_binding / times_appeared) + # Binding frequency
	0.20 ÃƒÆ’Ã¢â‚¬â€ (avg_shadow_price / 100) + # Economic impact
	0.15 ÃƒÆ’Ã¢â‚¬â€ (hours_ram_low / total_hours) + # Operational tightness
	0.10 ÃƒÆ’Ã¢â‚¬â€ geographic_importance # Border coverage
	)
	```

	#### Tier-2 150 CNECs - Selective Detail (300 features)

	Individual Binary Indicators (300 features):
	- `presolved_cnec_[ID]`: Binding status for each of 150 CNECs (150 features)
	- Preserves constraint activation patterns
	- Model learns which tier-2 CNECs bind under which conditions

	- `outage_active_cnec_[ID]`: Outage status for each of 150 CNECs (150 features)
	- Preserves EIC matching benefit
	- Future covariates: planned outages known ahead
	- Model learns outage ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ binding relationships

	Rationale for Preserving These Two:
	- Both are discrete/binary (low redundancy across CNECs)
	- Both are high-signal:
	- `presolved`: Indicates current constraint state
	- `outage_active`: Predicts future constraint likelihood
	- Allows learning cross-CNEC interactions: "When CNEC_X has outage, CNEC_Y binds"

	Aggregated Continuous Metrics (60 features):
	Grouped by border/region for remaining metrics:
	- `avg_ram_de_cz_tier2`: Average RAM for tier-2 DE-CZ CNECs
	- `avg_margin_ratio_de_cz_tier2`: Average margin ratio
	- `total_shadow_price_de_cz_tier2`: Sum of shadow prices
	- `ram_volatility_de_cz_tier2`: Standard deviation of RAM
	- `avg_outage_duration_de_cz_tier2`: Average duration of active outages (hours)
	- `max_outage_duration_de_cz_tier2`: Longest active outage on border (hours)
	- Repeat for 10 major borders ÃƒÆ’Ã¢â‚¬" 6 metrics = 60 features

	Why Aggregate These:
	- RAM, shadow prices are continuous and correlated within a region
	- Preserves regional capacity patterns without full redundancy
	- Reduces from 150 CNECs ÃƒÆ’Ã¢â‚¬â€ 2 metrics = 300 ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ 30 aggregate features

	#### PTDF Patterns (10 features)
	- `ptdf_pc1` through `ptdf_pc10`: Principal components
	- Compressed from (200 CNECs ÃƒÆ’Ã¢â‚¬â€ 12 zones = 2,400 values) ÃƒÂ¢Ã¢â‚¬Â Ã¢â‚¬â„¢ 10 components
	- Captures ~92% of PTDF variance

	#### Border Capacity Historical (20 features)
	- `capacity_hist_de_fr`, `capacity_hist_de_nl`, etc.
	- One feature per FBMC border
	- Actual historical cross-border flow capacity (from ENTSO-E)

	#### System-Level Aggregates (20 features)
	- `system_min_margin`: Tightest CNEC margin across all 200
	- `n_binding_cnecs_total`: Count of binding CNECs
	- `n_binding_cnecs_top50`: Count of top-50 CNECs binding
	- `n_binding_cnecs_tier2`: Count of tier-2 CNECs binding
	- `margin_std`: Standard deviation of margins
	- `total_congestion_cost`: Sum of all shadow prices
	- `max_shadow_price`: Highest shadow price
	- `avg_shadow_price_binding`: Average price when CNECs bind
	- `total_outage_mw_fbmc`: Total capacity under outage
	- `n_cnecs_with_outage`: Count of CNECs affected
	- `forced_outage_count`: Unplanned outages
	- `outage_stress_index`: Composite stress metric
	- Additional 8 features (criticality scores, violation counts, etc.)

	#### Temporal Features (10 features)
	- `hour_of_day`: 0-23
	- `day_of_week`: 0-6 (Monday=0)
	- `month`: 1-12
	- `day_of_year`: 1-365
	- `is_weekend`: Binary
	- `is_peak_hour`: Binary (8am-8pm)
	- `is_holiday_de`, `is_holiday_fr`, `is_holiday_be`, `is_holiday_nl`

	#### Weather Features (50 features)
	- Key grid points (10-12 strategic locations)
	- Per point: temperature, wind speed (10m, 100m), wind direction, solar radiation, cloud cover
	- ~5 metrics ÃƒÆ’Ã¢â‚¬â€ 10 points = 50 features

	#### Target History Features (20 features) - NEW
	- `max_bex_hist_[BORDER]`: Historical Max BEX per border (20 FBMC Core borders)
	- Used in context window (past 21 days)
	- Model learns patterns in capacity evolution
	- Example features: `max_bex_hist_de_fr`, `max_bex_hist_de_nl`, etc.

	#### LTN Allocation Features (20 features) - NEW
	- `ltn_allocated_[BORDER]`: Long-term capacity already committed per border
	- Values from yearly/monthly JAO auctions
	- Used in both historical context (what was allocated) and future covariates (known allocations ahead)
	- Example: If 500 MW LTN on DE-FR, Max BEX will be ~500 MW lower

	#### Net Position Features (48 features) - NEW
	- `net_pos_min_[ZONE]`: Minimum feasible net position per zone (12 features)
	- `net_pos_max_[ZONE]`: Maximum feasible net position per zone (12 features)
	- `net_pos_range_[ZONE]`: Degrees of freedom (max - min) per zone (12 features)
	- `net_pos_margin_[ZONE]`: Utilization ratio per zone (12 features)
	- Zones: DE_LU, FR, BE, NL, AT, CZ, PL, SK, HU, SI, HR, RO

	#### Non-Core ATC Features (28 features) - NEW
	- `atc_[NON_CORE_BORDER]_forward`: Forward direction capacity (14 features)
	- `atc_[NON_CORE_BORDER]_backward`: Backward direction capacity (14 features)
	- Key borders: FR-UK, FR-ES, FR-CH, DE-CH, DE-DK, AT-CH, AT-IT, PL-SE, PL-LT, etc.
	- These capture loop flow drivers that affect Core CNECs

	#### Interaction Features (40 features)
	- `high_wind_low_margin`: Interaction of wind > threshold & margin < threshold
	- `weekend_low_demand_pattern`: Weekend ÃƒÆ’Ã¢â‚¬â€ low demand indicator
	- `outage_binding_correlation`: Correlation of outage presence with binding events
	- Various cross-feature products and ratios

	### 11.3 Future Covariates (280 features)

	#### Top 50 CNEC Planned Outages (50 features)
	- `planned_outage_cnec_[ID]`: Binary indicator for D+1 to D+14
	- Known with certainty (scheduled maintenance)

	#### Tier-2 150 CNEC Planned Outages (150 features)
	- `planned_outage_cnec_[ID]`: Binary for each tier-2 CNEC
	- Preserves full EIC matching benefit for future horizon

	#### Weather Forecasts (50 features)
	- Same structure as historical weather
	- OpenMeteo provides 14-day forecasts

	#### Temporal (10 features)
	- Same temporal features projected for D+1 to D+14
	- Known with certainty

	#### LTN Future Allocations (20 features) - NEW
	- `ltn_allocated_[BORDER]_future`: Known LT capacity allocations for D+1 to D+14
	- Values from yearly/monthly auction results (KNOWN IN ADVANCE)
	- Yearly auction results known for entire year ahead
	- Monthly auction results known for month ahead
	- Gold standard future covariate - 100% certain values
	- Directly impacts Max BEX: higher LTN = lower available day-ahead capacity

	#### Border Capacity Adjustments (20 features)
	- `planned_ntc_de_fr_d1`: Day-ahead NTC publication per border
	- TSOs publish these D-1 for D-day

	### 11.4 Feature Engineering Workflow

	Input to Chronos 2:
	```python
	# Historical context window (21 days before prediction)
	historical_features: np.ndarray # Shape: (512 hours, 1000 features) - UPDATED WITH OUTAGE DURATION

	# Future covariates (14 days ahead)
	future_features: np.ndarray # Shape: (336 hours, 380 features) - NO CHANGE

	# Combine for inference
	chronos_input = {
	'context': historical_features, # What happened
	'future_covariates': future_features # What we know will happen
	}

	# Chronos 2 predicts Max BEX for each border
	forecast = pipeline.predict(
	context=chronos_input['context'],
	future_covariates=chronos_input['future_covariates'],
	prediction_length=336 # 14 days ÃƒÆ’Ã¢â‚¬â€ 24 hours
	)
	```

	---

	## Questions & Clarifications

	CRITICAL UPDATES FROM PREVIOUS CONVERSATION:
	1. Max BEX Added: TARGET VARIABLE now included - Day 1 first priority collection
	2. LTN Added: Long Term Nominations with future covariate capability (auction results known in advance)
	3. Min/Max Net Positions Added: Domain boundaries that define feasible space
	4. ATC Non-Core Added: Loop flow drivers from external borders

	Previous Decisions (Still Valid):
	5. D2CF Decision: Confirmed SKIP - not needed for forecasting
	6. CNEC Count: Top 200 total (50 full detail + 150 selective detail)
	7. PTDF Treatment: Hybrid approach - 600 individual (top 50) + 120 border-aggregated (tier-2)
	8. Historical Period: 24 months (Oct 2023 - Sept 2025)
	9. Missing Data Strategy: Field-specific (forward-fill, zero-fill, interpolation)
	10. Outage Matching: EIC code-based (85-95% exact) + fuzzy fallback (>95% total)
	11. Tier-2 CNEC Treatment: Preserve presolved + outage_active individually, aggregate RAM/shadow/PTDF by border
	12. Total Features: ~1,735 (1,000 historical + 380 future + 355 aggregates) - HYBRID PTDF + OUTAGE DURATION
	13. Outage Duration: Added 3 duration features per top-50 CNEC (elapsed, remaining, total) + border aggregates for tier-2

	Verification Status:
	- Ã¢Å“â€¦ Max BEX: Confirmed available in JAO Publication Tool ("Max Exchanges" page)
	- Ã¢Å“â€¦ LTN: Confirmed available and CAN be used as future covariate (auction results known ahead)
	- Ã¢Å“â€¦ Min/Max Net Positions: Confirmed published per hub on JAO
	- Ã¢Å“â€¦ ATC Non-Core: Confirmed published for external borders on JAO

	---

	Document Version: 5.0 - OUTAGE DURATION FEATURES ADDED
	Last Updated: October 29, 2025
	Status: COMPLETE - Ready for Day 1 Implementation with ALL Essential Data Series
	Major Changes: Added outage duration features (elapsed, remaining, total) for enhanced temporal modeling, ~1,735 total features (+11.6% from v4.0)
	Contact: Maintain as living document throughout MVP development