---
library_name: diffusers
tags:
- ltx
- video-generation
- audio-to-video
- video-conditioning
license: apache-2.0
---

# LTX-2 Audio-to-Video Pipeline with Video Conditioning

A custom diffusers pipeline for LTX-2 that extends audio-to-video generation with **video conditioning** support.

## Features

- Audio-conditioned video generation (lip-sync)
- **Video conditioning** for motion/pose guidance
- Configurable conditioning strength and start frame
- Compatible with LTX-2 LoRAs (face-swap, camera control, etc.)

## Installation

```bash
pip install diffusers transformers torch torchaudio av
```

## Usage

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load the pipeline with custom video conditioning support
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Optional: load a LoRA (e.g., face swap)
# pipe.load_lora_weights(
#     "Alissonerdx/BFS-Best-Face-Swap-Video",
#     weight_name="ltx-2/head_swap_v1_13500_first_frame.safetensors",
# )
# pipe.fuse_lora(lora_scale=1.1)

# Load inputs
image = load_image("input_face.png")

# Generate with video conditioning
video, audio = pipe(
    image=image,                        # Frame 0 appearance
    video="reference_motion.mp4",       # Video for motion conditioning
    video_conditioning_strength=1.0,    # How strongly to follow the motion (0.0-1.0)
    video_conditioning_frame_idx=1,     # Start video conditioning at frame 1
    audio="audio.wav",                  # Audio for lip-sync
    prompt="a person speaking naturally, smooth animation",
    negative_prompt="low quality, blurry, distorted",
    width=512,
    height=768,
    num_frames=121,
    frame_rate=24.0,
    num_inference_steps=40,
    guidance_scale=4.0,
    return_dict=False,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `image` | `PIL.Image` | `None` | Input image for frame 0 conditioning |
| `video` | `str`/`List`/`Tensor` | `None` | Reference video for motion conditioning |
| `video_conditioning_strength` | `float` | `1.0` | Strength of video conditioning (0.0-1.0) |
| `video_conditioning_frame_idx` | `int` | `1` | Frame index where video conditioning starts |
| `audio` | `str`/`Tensor` | `None` | Audio input for lip-sync |

### Video Conditioning Frame Index

- `0`: Video conditioning replaces all frames
- `1` (default): Frame 0 = image, frames 1+ = video motion
- `N`: Frames 0 to N-1 = image/noise, frames N+ = video conditioning

## Distilled Model (8-step)

For faster generation with the distilled model:

```python
pipe = DiffusionPipeline.from_pretrained(
    "rootonchair/LTX-2-19b-distilled",
    custom_pipeline="linoyts/ltx2-audio-video-conditioning",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

DISTILLED_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875]

video, audio = pipe(
    image=image,
    video="reference.mp4",
    audio="audio.wav",
    prompt="...",
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMAS,
    guidance_scale=1.0,
    return_dict=False,
)
```

## License

Apache 2.0
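To make the `video_conditioning_frame_idx` semantics described above concrete, here is a minimal illustrative sketch of which input drives each output frame. The helper name, its return values, and the `has_image` flag are hypothetical, invented for illustration; they are not part of the pipeline API:

```python
def conditioning_sources(num_frames: int, frame_idx: int, has_image: bool = True):
    """Illustrative only: map each output frame to the input that conditions it,
    for a given video_conditioning_frame_idx (not part of the pipeline API)."""
    sources = []
    for f in range(num_frames):
        if f >= frame_idx:
            sources.append("video")  # frames >= frame_idx follow the reference video
        elif f == 0 and has_image:
            sources.append("image")  # frame 0 takes its appearance from the input image
        else:
            sources.append("noise")  # remaining early frames are generated from noise
    return sources

print(conditioning_sources(5, 1))  # ['image', 'video', 'video', 'video', 'video']
print(conditioning_sources(5, 0))  # ['video', 'video', 'video', 'video', 'video']
print(conditioning_sources(5, 3))  # ['image', 'noise', 'noise', 'video', 'video']
```

With the default `frame_idx=1`, the image fixes frame 0 and the reference video steers every later frame, which is why that setting is the usual choice for combining an identity image with a motion clip.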