Privileged On-Policy Distillation
Beyond Absolute Imitation: Anchored Residual Guidance
AR-OPD treats privileged information as a target-design problem. Instead of copying a full-view oracle distribution directly, it anchors supervision to a locally reachable partial view and transfers the remaining foresight as a controlled residual.
Method
Full-view oracle signals are useful, but unsafe as absolute targets.
A full privileged teacher can assign high probability to tokens justified by future information that the student cannot access at the current prefix. AR-OPD separates the target into a compatible anchor and a bounded full-minus-partial residual.
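The anchor-plus-residual split can be illustrated with a small sketch. This is not the paper's exact objective: the function name `anchored_residual_target`, the scale `alpha`, and the clipping bound `clip` are hypothetical choices made here for illustration. It assumes the residual is taken as the full-minus-partial log-probability difference, scaled and clipped per token before renormalizing.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def anchored_residual_target(partial_logits, full_logits, alpha=0.5, clip=1.0):
    """Hypothetical target: partial-view anchor plus a scaled, clipped residual.

    alpha and clip are illustrative assumptions, not values from the paper.
    """
    log_p_partial = log_softmax(partial_logits)
    log_p_full = log_softmax(full_logits)
    target_logits = []
    for lp, lf in zip(log_p_partial, log_p_full):
        residual = lf - lp  # full-minus-partial log-probability gap
        bounded = max(-clip, min(clip, alpha * residual))  # bound the residual
        target_logits.append(lp + bounded)  # stay anchored to the partial view
    # Renormalize so the target is a valid distribution.
    return [math.exp(x) for x in log_softmax(target_logits)]


# Toy example: the full view strongly prefers the last token, which the
# partial view considers unlikely.
partial = [2.0, 1.0, 0.0, -1.0]
full = [-1.0, 0.0, 1.0, 6.0]
target = anchored_residual_target(partial, full)
print([round(p, 3) for p in target])
```

Under these assumptions, the log-ratio between the target and the partial-view anchor can differ by at most `2 * clip` between any two tokens, so the full view can shift the target but cannot override the anchor outright. Setting `alpha=0` recovers the partial-view distribution.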
Main Results
Anchored residual guidance improves results across seven benchmarks.
Best average score across math, code, science, and medical QA benchmarks.
On MATH500 and AMC23, AR-OPD exceeds Full OPD by 3.6 and 2.5 points, respectively.
AR-OPD also improves on Partial OPD on average, showing that controlled full-view residuals add useful guidance.
Target Reliability
Mechanism evidence matches the performance pattern.
Diagnostics show that full-view targets create larger teacher-student disagreement near rollout tails, while AR-OPD reduces shortcut events, with the clearest benefit on longer trajectories.
- Support mismatch rises late in long rollouts.
- Full-view reliability degrades with privileged-context length.
- AR-OPD reduces shortcut-like privileged-information leakage.
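A diagnostic of the first kind can be sketched as follows. This is only an assumption about how such a measurement might look, not the repository's implementation: the helper names `kl_divergence` and `disagreement_by_position` are hypothetical, and KL(teacher || student) is used here as one possible disagreement measure, averaged over equal-width buckets of relative rollout position.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def disagreement_by_position(teacher_probs, student_probs, num_buckets=4):
    """Mean per-token KL(teacher || student) per relative-position bucket.

    Hypothetical diagnostic: bucket 0 covers the start of the rollout and
    the last bucket covers its tail.
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student rollouts must have equal length")
    n = len(teacher_probs)
    sums = [0.0] * num_buckets
    counts = [0] * num_buckets
    for t, (p, q) in enumerate(zip(teacher_probs, student_probs)):
        bucket = min(num_buckets - 1, t * num_buckets // n)
        sums[bucket] += kl_divergence(p, q)
        counts[bucket] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Toy rollout of 8 steps: the teacher agrees with the student early on
# and diverges in the second half.
student = [[0.5, 0.5]] * 8
teacher = [[0.5, 0.5]] * 4 + [[0.9, 0.1]] * 4
print(disagreement_by_position(teacher, student, num_buckets=2))
```

In this toy case the first bucket shows zero disagreement and the second a positive value, which is the shape a tail-heavy support mismatch would produce.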
Code and Data
Public code package with reproducible data layout.
Training and evaluation code is included directly in the AR-OPD repository. The release includes launch scripts, configuration files, evaluation data, data manifests, and reconstruction utilities. Large derived datasets, checkpoints, and report bundles are intentionally not distributed with the repository and stay on local storage.
Minimal Setup
git clone https://github.com/vanhowe/AR-OPD.git
cd AR-OPD/code
bash scripts/install_env.sh
bash scripts/prepare_data_layout.sh
# After raw training files or chunk bundles are available locally:
bash scripts/materialize_all_derived_datasets.sh
Resources
Paper, code, and citation.
BibTeX
@misc{aropd2026,
  title         = {Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation},
  author        = {Wenhao Zhang},
  year          = {2026},
  eprint        = {2606.10385},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2606.10385}
}