# Methods

## Backbone and adapters

All experiments fine-tune **CogVideoX-5B T2V** (`THUDM/CogVideoX-5b`) with a low-rank adapter; the diffusion transformer (5.57B params), its T5-XXL text encoder, and the 3D VAE remain frozen. Training runs in mixed-precision bf16 throughout (`train.py:206`, `train.py:240`); generation uses the `CogVideoXDPMScheduler` with `timestep_spacing='trailing'` (`src/generate.py:68-70`). At the training resolution the VAE (4× temporal, 8× spatial stride) produces latents of `T=13, H=60, W=90`, and the transformer's spatial patch size of 2 reduces these to a token grid of `T=13, H=30, W=45` (`logs/train_cogvideox_t2v_openvid_*.log`).
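
The shape arithmetic in the paragraph above can be checked with a small sketch. The temporal formula `(frames - 1) // 4 + 1` assumes the causal 3D VAE encodes the first frame on its own, which is how 49 frames map to 13. The function name and defaults are illustrative and are not taken from the repository.

```python
def cogvideox_shapes(frames=49, height=480, width=720,
                     t_stride=4, s_stride=8, patch=2):
    """Return (latent_thw, token_thw, n_tokens) for one clip."""
    # Assumption: the causal VAE keeps the first frame, then compresses 4x in time.
    t = (frames - 1) // t_stride + 1
    # 8x spatial stride of the VAE.
    lat_h, lat_w = height // s_stride, width // s_stride
    # Spatial patch size 2 inside the transformer.
    tok_h, tok_w = lat_h // patch, lat_w // patch
    return (t, lat_h, lat_w), (t, tok_h, tok_w), t * tok_h * tok_w


latent_thw, token_thw, n_tokens = cogvideox_shapes()
print(latent_thw, token_thw, n_tokens)
```

The token count is the sequence length each self-attention layer sees for the video stream, before any text tokens are appended.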

We compare four LoRA variants, all rank=16, α=16 (so the LoRA scale α/r=1), implemented in `src/lora_modules.py`:

1. **Standard LoRA** (`StandardLoRA`, line 86) — vanilla MLP (down/up linears, B-zero init), 16,515,072 trainable params.
2. **Transformer LoRA** (`TransformerLoRA`, line 126) — multi-head temporal self-attention over T pooled tokens (n_heads=2, head_dim=8); 16,644,096 params.
3. **GDN-CSSM LoRA k=1** (`GDNCSSMLoRA`, line 517) — Gated DeltaNet over 2-D spectral domain: down to rank=16, per-head Q/K/V at d_k=4, FFT2 over (H,W), L2-normalised keys, factored α/β decay gates, learned spectral kernel of size 1×1, delta-rule scan over T, iFFT2, RMSNorm, SiLU output gate, up-project. 14,508,816 params.
4. **GDN-CSSM LoRA k=5** — same as above with 5×5 spatial-init spectral kernel. 14,508,816 params.
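
The first two parameter counts can be reproduced by hand. The sketch below assumes 42 transformer blocks with hidden width 3072 (values not stated in this section), four adapted projections per block, bias-free down/up matrices, and, for the Transformer variant, bias-free Q/K/V maps on the 16-dimensional bottleneck with no output projection. These are inferences that happen to match the reported totals, not a reading of `src/lora_modules.py`.

```python
N_BLOCKS = 42   # assumed CogVideoX-5B depth
HIDDEN = 3072   # assumed hidden width of the attn1 projections
N_PROJ = 4      # attn1.to_q, attn1.to_k, attn1.to_v, attn1.to_out.0
RANK = 16


def standard_lora_params(n_blocks=N_BLOCKS, hidden=HIDDEN,
                         n_proj=N_PROJ, rank=RANK):
    """Down (hidden x rank) plus up (rank x hidden) per adapted projection."""
    per_module = hidden * rank + rank * hidden
    return n_blocks * n_proj * per_module


def transformer_lora_params(n_heads=2, head_dim=8):
    """Standard LoRA plus assumed bias-free Q/K/V maps on the bottleneck."""
    inner = n_heads * head_dim      # 16, equal to the LoRA rank
    attn_per_module = 3 * inner * inner
    return standard_lora_params() + N_BLOCKS * N_PROJ * attn_per_module


print(standard_lora_params(), transformer_lora_params())
```

Under these assumptions the temporal attention adds only 768 parameters per adapted projection, which is why the Standard and Transformer variants are within 1% of each other in size.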

Two further variants, `gdn_cssm_r16_k1_rope` and `gdn_cssm_r16_k5_rope`, add 2-D spatial RoPE to the internal Q/K before the FFT (`pos_embed='rope2d'`, `src/lora_modules.py:35-83`) as a position-encoding ablation.

LoRAs are injected only into the transformer's **self-attention** Q/K/V/O projections — `--target_modules attn1.to_q attn1.to_k attn1.to_v attn1.to_out.0` (`scripts/train_cogvideox_lora_openvid.sh:35`). Cross-attention (`attn2.*`) is untouched, by design: rank-16 LoRAs on cross-attention reproducibly degrade prompt following.
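
As a rough illustration of this selection, the sketch below filters dotted module names by suffix. It assumes the wrapper treats each `--target_modules` entry as a name suffix and that blocks are named `transformer_blocks.{i}.attn{1,2}.*`; both are assumptions about `src/wan_lora_wrapper.py` and the backbone, and `is_lora_target` is a hypothetical helper.

```python
TARGET_MODULES = ("attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0")


def is_lora_target(module_name, targets=TARGET_MODULES):
    """True if the dotted module name ends with one of the target suffixes."""
    return any(module_name == t or module_name.endswith("." + t)
               for t in targets)


# Hypothetical module names for two blocks, self- and cross-attention.
names = [
    f"transformer_blocks.{i}.{attn}.{proj}"
    for i in range(2)
    for attn in ("attn1", "attn2")
    for proj in ("to_q", "to_k", "to_v", "to_out.0")
]
selected = [n for n in names if is_lora_target(n)]
print(selected)
```

Only the `attn1` projections survive the filter, so every `attn2.*` module keeps its frozen weights.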

## Training data: OpenVid-1M curated subset

Earlier runs trained on `videophysics/videophy2_train` produced LoRAs that matched base within autograder noise, because that dataset is the autograder's training set: 80% of its clips are physics failures with `joint=0`. All numbers reported here use a curated **OpenVid-1M** subset (`nkp37/OpenVid-1M`, ICLR 2025) following VideoREPA's recipe (NeurIPS 2025). Filter: `motion_score ≥ 1`, `aesthetic_score ≥ 5`, `6.5 ≤ seconds ≤ 30` (the 6.5 s lower bound matters because CogVideoX trains on 49 frames at 8 fps, i.e. 49 / 8 = 6.125 s; shorter clips would be temporally oversampled by `precompute_cogvideox_latents.py:42`). Retaining only clips actually downloaded leaves **14,857 train + 742 val** clips at `/oscar/scratch/dlinsley/openvid_32k_latents/{train,val}` (5% random val split, seed 0; `scripts/launch_precompute_openvid.sh:39-43`).
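
The filter and split can be sketched as below. The argument names mirror the thresholds quoted above, but `keep_clip` and `train_val_split` are hypothetical helpers, and the actual split logic in `scripts/launch_precompute_openvid.sh` may differ (for example in when the split is drawn relative to download failures).

```python
import random

CLIP_SECONDS = 49 / 8  # 49 frames at 8 fps


def keep_clip(motion_score, aesthetic_score, seconds):
    """Apply the three curation thresholds to one clip's metadata."""
    return (motion_score >= 1
            and aesthetic_score >= 5
            and 6.5 <= seconds <= 30)


def train_val_split(clip_ids, val_frac=0.05, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    ids = sorted(clip_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    return ids[n_val:], ids[:n_val]


train_ids, val_ids = train_val_split(range(100))
print(CLIP_SECONDS, len(train_ids), len(val_ids))
```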

## Latent precompute

Training never instantiates T5-XXL or the VAE, since both are too memory-heavy alongside the transformer (`train.py:228-233` raises if `--latent_data` is missing for CogVideoX). We pre-encode each clip once with `scripts/precompute_cogvideox_latents.py`: load 49 uniformly-sampled frames, resize to 720×480, scale to [-1,1], VAE-encode (with VAE tiling and slicing enabled, `precompute_cogvideox_latents.py:99-100`), apply the VAE `scaling_factor`, and T5-encode the caption to a [226, 4096] embedding. The `--skip_image_latents` flag drops the first-frame I2V tensor (T2V doesn't use it; saves ~40% disk per file). Each `.pt` holds `latents [13, 16, 60, 90] bf16 + prompt_embeds [226, 4096] bf16`. Total disk: 57 GB of latents from 173 GB of raw video. Precompute is sharded across 4 GPUs in parallel (`scripts/launch_precompute_openvid.sh`) and takes ~5–6 h wall ≈ 22 GPU-h.
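
The first two preprocessing steps (uniform frame sampling and scaling to [-1,1]) might look like the sketch below. The rounding rule and the 8-bit pixel assumption are illustrative; the exact indexing in `precompute_cogvideox_latents.py` is not shown in this section, and both function names are hypothetical.

```python
def uniform_frame_indices(n_total, n_sample=49):
    """Pick n_sample indices spread evenly over a clip of n_total frames."""
    if n_total < 1:
        raise ValueError("clip has no frames")
    if n_sample == 1:
        return [0]
    # Evenly spaced positions from the first to the last frame, inclusive.
    return [round(i * (n_total - 1) / (n_sample - 1)) for i in range(n_sample)]


def to_unit_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0


idx = uniform_frame_indices(240)
print(idx[:5], idx[-1], to_unit_range(0), to_unit_range(255))
```

With this rule a clip shorter than 49 frames yields repeated indices, which is the oversampling case the 6.5 s duration filter is meant to avoid.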

## Training loop

Training runs through `train.py`, with the LoRA injection wrapper in `src/wan_lora_wrapper.py`. The following settings are fixed for all variants (`scripts/train_cogvideox_lora_openvid.sh`):

| Setting | Value |
| --- | --- |
| Optimiser | AdamW, β=(0.9, 0.999), wd=0.01 (`train.py:354-359`) |
| LR schedule | Linear warmup 200 steps → cosine to 0 (`train.py:150-160`) |
| Peak LR | 1e-4 |
| Batch size | 1 (single GPU, no grad accumulation) |
| Max steps | 5000 |
| Loss | Flow-matching, `target = noise − latents` (`train.py:734-735`) |
| Eval cadence | every 250 steps, 10 batches averaged |
| Early stopping | patience=5, min_delta=1e-4, on val loss (`train.py:803-815`) |
| Mixed precision | bf16 (no GradScaler in bf16 path) |
| Gradient checkpointing | enabled for CogVideoX (`train.py:327-330`) |
| Grad clip | max-norm 1.0 |
| Seed | 42 |
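
A minimal sketch of the learning-rate schedule in the table, assuming the warmup starts from zero and the cosine phase spans steps 200 to 5000. The exact form in `train.py:150-160` is not reproduced here, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak_lr=1e-4, warmup=200, max_steps=5000):
    """Linear warmup to peak_lr, then cosine decay to 0 at max_steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    progress = min(progress, 1.0)  # clamp if called past max_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 100, 200, 2500, 5000):
    print(s, lr_at(s))
```

Under this schedule a run that early-stops at step 2500 ends with a learning rate a little over half the peak, so none of the early-stopped runs reach the tail of the cosine.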

Per-variant trainable parameter counts and step times (measured from `tqdm` lines in `logs/train_cogvideox_t2v_openvid_*.log`):

| Variant | Trainable params | Step time | Stop step | Best val loss |
| --- | --- | --- | --- | --- |
| `standard_r16` | 16,515,072 | 3.09 s | 2500 | 0.2102 |
| `transformer_r16` | 16,644,096 | 3.26 s | 2500 | 0.2042 |
| `gdn_cssm_r16_k1` | 14,508,816 | 7.41 s | 4250 | **0.1761** |
| `gdn_cssm_r16_k5` | 14,508,816 | 7.40 s | 2500 | 0.1966 |
| `gdn_cssm_r16_k1_rope` | 14,508,816 | ~7.5 s | 2500 | 0.2169 |
| `gdn_cssm_r16_k5_rope` | 14,508,816 | ~7.4 s | 2500 | 0.2071 |
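
The stop steps in this table are consistent with the early-stopping rule from the settings table: with evaluation every 250 steps and patience 5, a best validation loss at step 1250 leads to a stop at step 2500. The sketch below assumes the counter tracks consecutive non-improving evaluations. The class is hypothetical and the loss curve is synthetic, chosen only to exercise the rule.

```python
class EarlyStopper:
    """Stop after `patience` consecutive evals without sufficient improvement."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss):
        """Record one eval; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Synthetic validation losses, one per eval (every 250 steps).
synthetic_val_losses = [0.30, 0.26, 0.23, 0.22, 0.2102,
                        0.215, 0.22, 0.212, 0.214, 0.216]
stopper = EarlyStopper()
stop_step = None
for i, loss in enumerate(synthetic_val_losses, start=1):
    if stopper.update(loss):
        stop_step = i * 250
        break
print(stop_step, stopper.best)
```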

The four core variants were launched concurrently, one per GPU (`scripts/launch_train_openvid_quad.sh`, 30 s stagger to avoid simultaneous model-load pressure). GDN-CSSM is ~2.3–2.4× slower per step than Standard/Transformer (7.4 s vs. 3.1–3.3 s) because of two FFT2 + iFFT2 calls and a sequential delta-rule scan over T=13.

## Inference

For evaluation we generate 591 videos per variant, one per unique caption that VideoPhy-2 used to score Wan2.2-14B (`scripts/videophy2_generate.py:25-61` joins `videophysics/videophy2_test` to `videophysics/videophy2_upsampled_prompts`). We follow the paper protocol: **generate with the Mistral-NeMo-12B upsampled long caption, score with the raw short caption** (`videophy2_generate.py:252-258`). Inference settings (`scripts/launch_generate_openvid.sh`):

```
49 frames × 720×480 × 8 fps × 50 DPM steps × CFG 6.0 × seed 42
```

These are CogVideoX-5B's native release settings. Our base Joint of 24.0% is within 1 pp of the paper's human-rated CogVideoX-5B Joint of 25.0%, which suggests the generation pipeline reproduces the published setup. Each variant takes ~22 h wall on one GPU for the 591 clips (start→finish timestamps in `logs/gen_openvid_gdn_cssm_r16_k1.log`: `2026-04-27 11:00:30` → `2026-04-28 09:18:38` = 22h18m, ~136 s/clip including pipeline load).
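
The wall-time and per-clip figures follow directly from the two log timestamps. A short check with the standard library:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
N_CLIPS = 591

start = datetime.strptime("2026-04-27 11:00:30", FMT)
end = datetime.strptime("2026-04-28 09:18:38", FMT)

elapsed = end - start
hours, rem = divmod(int(elapsed.total_seconds()), 3600)
minutes = rem // 60
seconds_per_clip = elapsed.total_seconds() / N_CLIPS

print(f"{hours}h{minutes:02d}m, {seconds_per_clip:.1f} s/clip")
```

The per-clip figure is an average over the whole job, so it includes the one-off pipeline load as well as the 50 sampling steps per video.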

## Autograder

We use the **official VideoPhy-2 mPLUG-Owl autograder** (`videophysics/videophy_2_auto`) via the upstream `~/VideoPhy/VIDEOPHY2/inference.py`, run inside a separate `worldscore` conda env pinned to `transformers==4.49`. Two earlier scoring iterations were broken: a state-leak in `scripts/score_videophy2.py`'s batch loop and silent mPLUG-Owl corruption when run under transformers 5.x with compatibility shims. Every number reported uses the upstream `inference.py` directly with `--task sa` and `--task pc`, deterministic decoding (`do_sample=False, top_k=1, temperature=0.001`), 32 frames per clip. Joint = `(SA ≥ 4) AND (PC ≥ 4)` per clip. Significance is tested with paired McNemar (Joint binary) and paired bootstrap CI (5000 resamples) over the 591 prompts (matched by `videopath`).
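
The Joint metric and the two significance tests can be sketched in plain Python. This assumes an exact (binomial) two-sided McNemar test and a percentile bootstrap over prompts; the section does not say which McNemar variant or CI construction was used, so treat both as illustrative choices. Function names are hypothetical and the usage data is synthetic.

```python
import random
from math import comb


def joint(sa, pc, threshold=4):
    """Per-clip Joint flag: both scores at or above the threshold."""
    return int(sa >= threshold and pc >= threshold)


def mcnemar_exact(base, variant):
    """Two-sided exact McNemar p-value on paired binary outcomes."""
    b = sum(1 for x, y in zip(base, variant) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(base, variant) if x == 0 and y == 1)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(base, variant, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile CI for mean(variant - base), resampling prompts jointly."""
    rng = random.Random(seed)
    n = len(base)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(variant[i] - base[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic paired Joint flags for seven prompts (not real results).
base_joint = [1, 0, 0, 0, 0, 0, 1]
variant_joint = [0, 1, 1, 1, 1, 1, 1]
p_value = mcnemar_exact(base_joint, variant_joint)
ci = paired_bootstrap_ci(base_joint, variant_joint, n_resamples=1000)
print(p_value, ci)
```

Resampling the same indices for both arms is what makes the bootstrap paired: each draw keeps a prompt's base and variant outcomes together, matching the `videopath` pairing described above.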

## Hardware and budget

All training, generation, and scoring run on **NVIDIA B200** GPUs (180 GB HBM3e) provisioned in groups of 1–2 per job. Memory is comfortable: the bf16 CogVideoX-5B transformer is ~11 GB, the frozen VAE/T5 are off-device during training (latents pre-encoded), per-LoRA optimiser state is small (two AdamW moments for at most 16.6M parameters, i.e. under ~135 MB even in fp32), and gradient checkpointing keeps activations bounded; peak training VRAM stays well under 80 GB even on the GDN-CSSM variants. Inference uses the same single-GPU `CogVideoXPipeline` and never gradient-tracks, so headroom is similar.

GPU-hour budget for this paper's full set of OpenVid experiments (6 LoRA variants + 1 base for inference/scoring):

| Phase | Wall | Concurrency | GPU-hours |
| --- | --- | --- | --- |
| OpenVid filter + .zip download | ~10 h | network-bound, 0 GPU | 0 |
| Latent precompute (15.6K clips) | ~5.5 h | 4 GPU | ~22 |
| LoRA training (6 variants, early-stopped) | ~9 h slowest | 4 GPU concurrent + 2 sequential | ~29 (k1=8.8 + k5=5.1 + k1_rope=5.2 + k5_rope=5.1 + std=2.1 + xfm=2.3) |
| Video generation (7 × 591 clips) | ~22 h per variant | up to 4 GPU sharded | ~154 |
| Autograding (7 × {SA,PC}) | ~50 min | 4 GPU in 3 waves | ~3 |
| **Total (this run)** | **~3 days end-to-end** | — | **~210 GPU-hours** |
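
The training and generation rows can be roughly cross-checked from the stop steps and step times reported earlier. This sketch counts optimiser steps only (no evaluation passes, checkpointing, or model loading, which is an assumption about what the table includes), so it should be read as a lower bound on the training row.

```python
# (stop step, seconds per step) from the per-variant training table.
runs = {
    "standard_r16": (2500, 3.09),
    "transformer_r16": (2500, 3.26),
    "gdn_cssm_r16_k1": (4250, 7.41),
    "gdn_cssm_r16_k5": (2500, 7.40),
    "gdn_cssm_r16_k1_rope": (2500, 7.5),
    "gdn_cssm_r16_k5_rope": (2500, 7.4),
}

train_hours = {name: steps * sec / 3600 for name, (steps, sec) in runs.items()}
train_total = sum(train_hours.values())

# Generation: 6 LoRA variants + 1 base, ~22 h each on one GPU.
gen_total = 7 * 22

for name, h in train_hours.items():
    print(f"{name}: {h:.1f} GPU-h")
print(f"training total: {train_total:.1f} GPU-h, generation: {gen_total} GPU-h")
```

Generation dominates the budget by a wide margin, so sharding it over more GPUs shortens wall time but does not change the GPU-hour total.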

Counting one earlier failed iteration on the wrong (`videophysics/videophy2_train`) data and two re-scoring rounds before settling on the upstream autograder, the full project consumed ≈ **1.5–2× this budget** (~315–420 B200-hours). All checkpoints live at `checkpoints_cogvideox_t2v_openvid/{variant}/lora_best.pt`, generated videos at `/oscar/scratch/dlinsley/videophy2_t2v_out_openvid/{variant}/videos/`, and per-variant `scores_{sa,pc}.csv` are merged into `eval_compare/openvid_videophy2_results.json` for plotting.

## Reproducibility

To re-run end-to-end:
1. `bash scripts/launch_precompute_openvid.sh` (4 GPUs, ~6 h)
2. `bash scripts/launch_train_openvid_quad.sh` (4 GPUs, up to 9 h)
3. `for V in standard_r16 transformer_r16 gdn_cssm_r16_k1 gdn_cssm_r16_k5 base; do CUDA_VISIBLE_DEVICES=0 bash scripts/launch_generate_openvid.sh $V; done` (or shard over 4 GPUs)
4. Score each variant with `~/VideoPhy/VIDEOPHY2/inference.py --task {sa,pc}` in the `worldscore` env.