---
library_name: diffusers
tags:
- ltx
- video-generation
- audio-to-video
- video-conditioning
license: apache-2.0
---

# LTX-2 Audio-to-Video Pipeline with Video Conditioning

A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with **video conditioning** support.

## Features

- Audio-conditioned video generation (lip-sync)
- **Video conditioning** for motion/pose guidance
- Configurable conditioning strength and start frame
- Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)

## Installation

```bash
pip install diffusers transformers torch torchaudio av
```

## Usage

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load the pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Optional: load a LoRA (e.g., face swap)
# pipe.load_lora_weights(
#     "Alissonerdx/BFS-Best-Face-Swap-Video",
#     weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors",
# )
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                        # Frame 0 appearance
    video="reference_motion.mp4",       # Video for motion conditioning
    video_conditioning_strength=1.0,    # How strongly to follow the motion (0.0-1.0)
    video_conditioning_frame_idx=1,     # Start video conditioning at frame 1
    audio="audio.wav",                  # Audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `image` | `PIL.Image` | `None` | Input image for frame 0 conditioning |
| `video` | `str`/`List`/`Tensor` | `None` | Reference video for motion conditioning |
| `video_conditioning_strength` | `float` | `1.0` | Strength of video conditioning (0.0-1.0) |
| `video_conditioning_frame_idx` | `int` | `1` | Frame index where video conditioning starts |
| `audio` | `str`/`Tensor` | `None` | Audio input for lip-sync |

### Video Conditioning Frame Index

- `0`: Video conditioning replaces all frames
- `1` (default): Frame 0 = image, frames 1+ = video motion
- `N`: Frames 0 to N-1 = image/noise, frames N+ = video conditioning

## Distilled Model (8-step)

For faster generation with the distilled model:

```python
pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
```

## License

Apache 2.0
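To make the `video_conditioning_frame_idx` semantics described above concrete, here is a minimal illustrative sketch of which input drives each output frame. The helper name, its return values, and the `has_image` flag are hypothetical, invented for illustration; they are not part of the pipeline API:

```python
def conditioning_sources(num_frames: int, frame_idx: int, has_image: bool = True):
    """Illustrative only: map each output frame to the input that conditions it,
    for a given video_conditioning_frame_idx (not part of the pipeline API)."""
    sources = []
    for f in range(num_frames):
        if f >= frame_idx:
            sources.append("video")  # frames >= frame_idx follow the reference video
        elif f == 0 and has_image:
            sources.append("image")  # frame 0 takes its appearance from the input image
        else:
            sources.append("noise")  # remaining early frames are generated from noise
    return sources

print(conditioning_sources(5, 1))  # ['image', 'video', 'video', 'video', 'video']
print(conditioning_sources(5, 0))  # ['video', 'video', 'video', 'video', 'video']
print(conditioning_sources(5, 3))  # ['image', 'noise', 'noise', 'video', 'video']
```

With the default `frame_idx=1`, the image fixes frame 0 and the reference video steers every later frame, which is why that setting is the usual choice for combining an identity image with a motion clip.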