Hidir Yesiltepe¹, Tuna Han Salih Meral¹, Adil Kaan Akan², Kaan Oktay², Pinar Yanardag¹
¹Virginia Tech ²fal
In contrast to previous autoregressive methods, which primarily extend temporal horizons via new training procedures, distillation strategies, or large-scale infrastructure, ∞-RoPE explores what already-distilled models can achieve by reparameterizing temporal RoPE and KV caching at inference time. It can be applied in a plug-and-play fashion on top of existing Self-Forcing variants to enable effectively infinite-horizon, controllable video generation.
Self-Forcing baseline method
Self-Forcing with Block-Relativistic RoPE
Qualitative results demonstrating the capabilities of ∞-RoPE.
Showcasing ultra-long autoregressive generations that sustain high fidelity over extended durations.
Fine-grained action control demonstrations for individual subjects.
Coordinated multi-character sequences showcasing simultaneous action conditioning, synchronized motion planning, and subject-specific temporal control.
Demonstrations of action control combined with character introduction capabilities.
Demonstrations of action control combined with object introduction capabilities.
Demonstrations of long video generation when cache size is at or below the limit parameter.
Demonstrations of long video generation capabilities when cache size exceeds the limit parameter.
Demonstrations of dynamic scene transitions and cuts in video generation.
Visual demonstrations of ablation studies showing the impact of different components in our method.
Ablation study on the effect of the f0 parameter. All videos are generated with a KV cache size of 6. The results support the relativistic property of our approach: high-fidelity long-form generation is bounded not by the KV cache size (temporal RoPE = KV cache size), but by the pretrained model’s natural generation horizon, which is 21 in our case.
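To make the relativistic idea concrete, the following is a minimal sketch of one way temporal RoPE indices could be assigned relative to a moving window rather than to the absolute frame count. It is not the implementation used in the paper. The function name, its arguments, and the rule "place the newest cached frame at index f0 - 1 once generation passes the horizon" are assumptions made for illustration only.

```python
def relative_temporal_positions(num_generated, cache_size, f0):
    """Hypothetical helper: temporal RoPE indices for the frames held in the KV cache.

    Assumption (not from the paper's code): while fewer than f0 frames exist,
    frames keep their absolute indices; afterwards the window is re-anchored so
    the newest cached frame sits at index f0 - 1 and older frames sit just below it.
    """
    if cache_size < 1 or f0 < 1:
        raise ValueError("cache_size and f0 must be positive")
    if cache_size > f0:
        raise ValueError("this sketch assumes cache_size <= f0")

    n_cached = min(num_generated, cache_size)
    if num_generated <= f0:
        # Still inside the pretrained horizon: use absolute frame indices.
        return list(range(num_generated - n_cached, num_generated))
    # Past the horizon: indices depend only on the offset from the newest frame.
    return list(range(f0 - n_cached, f0))


# With the ablation settings (cache size 6, f0 = 21), the assigned indices
# stay inside the horizon regardless of how many frames have been generated.
print(relative_temporal_positions(3, 6, 21))
print(relative_temporal_positions(1000, 6, 21))
```

Under these assumptions the indices never exceed f0 - 1, which is one way to read the claim that the bound comes from the model's generation horizon and not from the cache size.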
Ablation study demonstrating the impact of the KV cache size. All experiments are conducted with f0 = 21. The qualitative observations are consistent with the quantitative results reported in Figure 7 of the main paper. As the KV cache size increases, the dynamic degree decreases and the preservation of subject identity becomes weaker.
Ablation study demonstrating the impact of the temporal RoPE jump index. As the jump index Δ increases, the scene undergoes more drastic changes. When f + Δ exceeds the RoPE horizon f_limit, a scene transition effect appears. A detailed discussion of this effect, along with quantitative results, is provided in the Appendix.
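The threshold described in this caption can be written as a one-line check. The sketch below is illustrative only: the function name and return labels are hypothetical, and it encodes nothing beyond the stated condition that a transition appears when f + Δ exceeds the RoPE horizon.

```python
def jump_effect(f, delta, f_limit):
    """Hypothetical helper: classify a temporal RoPE jump of size delta from index f.

    Encodes only the condition stated in the caption: a scene transition
    appears when f + delta exceeds the RoPE horizon f_limit.
    """
    if delta < 0:
        raise ValueError("this sketch assumes a non-negative jump index")
    if f + delta > f_limit:
        return "scene transition"
    return "in-scene change"


# Illustrative values only; the horizon of 21 is taken from the f0 ablation.
print(jump_effect(f=10, delta=5, f_limit=21))
print(jump_effect(f=10, delta=15, f_limit=21))
```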
Comparison of ∞-RoPE with baseline methods across different video durations.
Comparison of different action control strategies and configurations.