∞-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe1, Tuna Han Salih Meral1, Adil Kaan Akan2, Kaan Oktay2, Pinar Yanardag1
1Virginia Tech   2fal

Teaser

∞-RoPE demonstrates three core capabilities: infinite-length video generation enabled by Block-Relativistic RoPE, fine-grained action control through KV Flush, and cinematic multi-cut scene composition via RoPE Cut.

Ultra-long Video Generation

Generate videos of unlimited length beyond the base model's temporal horizon

Dynamic Scene Cuts

Cinematic multi-cut transitions within a single autoregressive rollout

Action Control

Dynamic prompt changes with instant responsiveness and smooth transitions

Motivations

In contrast to previous autoregressive methods, which primarily extend temporal horizons via new training procedures, distillation strategies, or large-scale infrastructure, ∞-RoPE explores what already-distilled models can achieve when temporal RoPE and KV caching are reparameterized at inference time. It can be applied in a plug-and-play fashion on top of existing Self-Forcing variants to enable effectively infinite-horizon, controllable video generation.

Self-Forcing

Self-Forcing baseline method

Self-Forcing + Block-Relativistic RoPE

Self-Forcing with Block-Relativistic RoPE
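
One property that makes inference-time reparameterization of temporal RoPE plausible is that rotary embeddings make attention scores depend only on the relative offset between positions, not on their absolute values. The NumPy sketch below illustrates this with a standard 1D RoPE formulation. The function name `rope_rotate`, the frequency base, and the toy dimensions are assumptions chosen for illustration and are not taken from the ∞-RoPE implementation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a standard 1D rotary position embedding to vector x at position pos.

    Hypothetical helper for illustration; dimension pairs are (i, i + d/2).
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per rotated pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (5), very different absolute positions.
score_near = rope_rotate(q, 20) @ rope_rotate(k, 15)
score_far = rope_rotate(q, 1020) @ rope_rotate(k, 1015)

# The two scores agree up to floating-point error.
shift_invariant = bool(np.isclose(score_near, score_far))
```

Because only the offset matters, shifting every position in a window by the same amount leaves the attention pattern inside that window unchanged.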

Qualitative Results

Qualitative results demonstrating the capabilities of ∞-RoPE.

Ultra Long Video Generation

Showcasing ultra-long autoregressive generations that sustain high fidelity over extended durations.

Ultra Long Video 1

Ultra Long Video 2

Single Subject Action Control

Fine-grained action control demonstrations for individual subjects.

Action Transition (x4)

Action Transition (x4)

Action Transition (x4)

Action Transition (x4)

Action Transition (x4)

Action Transition (x4)

Action Transition (x6)

Action Transition (x6)

Action Transition (x6)

Multiple Subject Action Control

Coordinated multi-character sequences showcasing simultaneous action conditioning, synchronized motion planning, and subject-specific temporal control.

Action Transition (x4)

Action Transition (x5)

Action Transition (x6)

Action Transition (x6)

Action Transition (x6)

Action Transition (x6)

Action Control + Character Introduction

Demonstrations of action control combined with character introduction capabilities.

Action Transition (x6) + Character Introduction (x3)

Action Transition (x4) + Character Introduction (x3)

Action Transition (x3) + Character Introduction (x2)

Action Control + Object Introduction

Demonstrations of action control combined with object introduction capabilities.

Video 1

Video 2

Video 3

Long Video Generation (Cache Size ≤ f_limit)

Demonstrations of long video generation when the KV cache size is at or below the RoPE horizon f_limit.

Video 1

Video 2

Video 3

Video 4

Video 5

Video 6

Long Video Generation (Cache Size > f_limit)

Demonstrations of long video generation when the KV cache size exceeds the RoPE horizon f_limit.

Video 1

Video 2

Video 3

Video 4

Video 5

Video 6

Dynamic Scene Cut

Demonstrations of dynamic scene transitions and cuts in video generation.

Harry Potter Trailer

Titanic Trailer

Game of Thrones Trailer

The Shawshank Redemption Trailer

Barbie Trailer

Interstellar Trailer

Ablations

Visual demonstrations of ablation studies showing the impact of different components in our method.

Onset Index Ablations (x6)

Ablation study on the effect of the onset index f0. All videos are generated with a KV cache size of 6. The results support the relativistic property of our approach: high-fidelity long-form generation is bounded not by a temporal RoPE range equal to the KV cache size, but by the pretrained model's natural generation horizon, which is 21 in our case.

Block-Relativistic RoPE [f0=6]

Block-Relativistic RoPE [f0=9]

Block-Relativistic RoPE [f0=12]

Block-Relativistic RoPE [f0=15]

Block-Relativistic RoPE [f0=18]

Block-Relativistic RoPE [f0=21]
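
To make the onset-index ablation above more concrete, the sketch below shows one way a rolling cache could be mapped to temporal RoPE positions that stay bounded no matter how many frames have been generated. The convention used here, that the cached window always ends at position f0 - 1, is an assumption for illustration and may differ from the paper's exact formulation. The function name `window_positions` is hypothetical.

```python
def window_positions(newest_frame, cache_size, f0):
    """Map absolute frame indices in a rolling cache to temporal RoPE positions.

    Assumed convention (not from the source): the window always ends at
    position f0 - 1, so positions depend only on a frame's place in the
    window and never on its absolute index.
    """
    if f0 < cache_size:
        raise ValueError("f0 must be at least cache_size under this convention")
    oldest = max(0, newest_frame - cache_size + 1)
    offset = (f0 - 1) - newest_frame           # shift applied to every cached frame
    return {t: t + offset for t in range(oldest, newest_frame + 1)}

# Early in the rollout and very late in the rollout, with cache size 6 and f0 = 21.
early = window_positions(newest_frame=5, cache_size=6, f0=21)
late = window_positions(newest_frame=10_000, cache_size=6, f0=21)

early_positions = list(early.values())   # [15, 16, 17, 18, 19, 20]
late_positions = list(late.values())     # [15, 16, 17, 18, 19, 20]
```

Under this assumed convention, a cache of 6 frames with f0 = 6 occupies positions 0 to 5, while f0 = 21 places the same window at positions 15 to 20, which still lies within a horizon of 21.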

KV Cache Size Ablations (x6)

Ablation study demonstrating the impact of the KV cache size. All experiments are conducted with f0 = 21. The qualitative observations are consistent with the quantitative results reported in Figure 7 of the main paper. As the KV cache size increases, the dynamic degree decreases and the preservation of subject identity becomes weaker.

Block-Relativistic RoPE [KV Cache=6]

Block-Relativistic RoPE [KV Cache=9]

Block-Relativistic RoPE [KV Cache=12]

Block-Relativistic RoPE [KV Cache=15]

Block-Relativistic RoPE [KV Cache=18]

Block-Relativistic RoPE [KV Cache=21]

Temporal RoPE Jump Index Ablations (x4)

Ablation study demonstrating the impact of the temporal RoPE jump index. As the jump index Δ increases, the scene undergoes more drastic changes. When f + Δ exceeds the RoPE horizon f_limit, a scene transition effect appears. A detailed discussion of this effect, along with quantitative results, is provided in the Appendix.

Block-Relativistic RoPE [Δ=6]

Block-Relativistic RoPE [Δ=21]

Block-Relativistic RoPE [Δ=45]

Block-Relativistic RoPE [Δ=90]
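
The condition described in the jump-index ablation above can be written as a small check. This is only a restatement of the stated rule (a scene transition appears when f + Δ exceeds the RoPE horizon). The function name `jump_effect`, the output labels, and the example values of f and the horizon are hypothetical and are not the settings used for the videos.

```python
def jump_effect(f, delta, f_limit):
    """Classify a temporal RoPE jump from position f by delta.

    Returns "scene transition" when the target position exceeds the RoPE
    horizon f_limit, and "in-scene change" otherwise.
    """
    target = f + delta
    return "scene transition" if target > f_limit else "in-scene change"

# Hypothetical values: current position 6, horizon 21.
small_jump = jump_effect(f=6, delta=6, f_limit=21)    # target 12, within the horizon
large_jump = jump_effect(f=6, delta=21, f_limit=21)   # target 27, beyond the horizon
```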

Qualitative Comparison

Comparison of ∞-RoPE with baseline methods across different video durations.

5 Seconds

Action Control Comparison

Comparison of different action control strategies and configurations.

Comparison 1