Privileged On-Policy Distillation

Beyond Absolute Imitation: Anchored Residual Guidance

AR-OPD treats privileged information as a target-design problem. Instead of copying a full-view oracle distribution directly, it anchors supervision to a locally reachable partial view and transfers the remaining foresight as a controlled residual.

+2.3 points over Full OPD

+12.1 points over base model

21.7% fewer shortcut events

+7.2 points on 768-1024 token rollouts

Method

Full-view oracle signals are useful but unsafe as absolute targets.

A full privileged teacher can assign high probability to tokens justified by future information that the student cannot access at the current prefix. AR-OPD separates the target into a compatible anchor and a bounded full-minus-partial residual.

Figure: AR-OPD architecture. AR-OPD evaluates partial and full teacher views on the same student state, then constructs an anchored residual target before distillation.
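The split into a compatible anchor and a bounded full-minus-partial residual can be illustrated with a small sketch. This page does not give the exact target formula, so the form below is an assumption for illustration only: the anchor is the partial-view teacher's log-probabilities, the residual is the full-minus-partial log-probability difference scaled by a hypothetical weight `beta` and clipped to a hypothetical bound `clip`, and the result is renormalized. The function name `anchored_residual_target` and both parameters are not taken from the AR-OPD code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    return [math.exp(x) for x in log_softmax(logits)]


def anchored_residual_target(partial_logits, full_logits, beta=1.0, clip=1.0):
    """Hypothetical anchored residual target (an assumption, not the paper's formula).

    anchor   = log-probs of the partial-view teacher
    residual = clip(beta * (full log-probs - partial log-probs), -clip, +clip)
    target   = softmax(anchor + residual)
    """
    anchor = log_softmax(partial_logits)
    full = log_softmax(full_logits)
    residual = [max(-clip, min(clip, beta * (f - a))) for f, a in zip(full, anchor)]
    return softmax([a + r for a, r in zip(anchor, residual)])


if __name__ == "__main__":
    # Toy 3-token vocabulary: the full-view teacher strongly prefers token 2,
    # which the partial-view teacher considers unlikely at this prefix.
    partial = [2.0, 1.0, 0.0]
    full = [0.0, 1.0, 6.0]
    print("partial anchor :", softmax(partial))
    print("full view      :", softmax(full))
    print("anchored target:", anchored_residual_target(partial, full, clip=1.0))
```

Under this assumed form, `clip=0` recovers the partial-view anchor exactly, while a very large `clip` with `beta=1` recovers the full-view distribution, so the bound controls how far the target can move away from what is reachable at the current prefix.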

Main Results

Anchored residual guidance improves results across seven benchmarks.

Average 70.3

Best average score across math, code, science, and medical QA benchmarks.

Math 74.6 / 57.5

Scores on MATH500 and AMC23, exceeding Full OPD by 3.6 and 2.5 points, respectively.

Residual value +5.1

Average gain over Partial OPD, showing that controlled full-view residuals add useful guidance.

Target Reliability

Mechanism evidence matches the performance pattern.

Diagnostics show that full-view targets create larger teacher-student disagreement near rollout tails, while AR-OPD reduces shortcut events (21.7% fewer) and helps most on longer trajectories (+7.2 points on 768-1024 token rollouts).

Support mismatch rises late in long rollouts.
Full-view reliability degrades with privileged-context length.
AR-OPD reduces shortcut-like privileged-information leakage.
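One way to picture the first diagnostic above is a per-position support gap, averaged over bins of relative rollout position. The definition here is an assumption made for illustration, not the metric used in the paper: the gap at a position is the teacher probability mass on tokens to which the student assigns less than a hypothetical threshold `eps`. The names `support_gap` and `mean_gap_by_bin` are likewise hypothetical.

```python
def support_gap(teacher_probs, student_probs, eps=1e-3):
    """Teacher mass on tokens the student gives probability below eps.

    Assumed definition for illustration; the paper's metric may differ.
    """
    return sum(t for t, s in zip(teacher_probs, student_probs) if s < eps)


def mean_gap_by_bin(teacher_seq, student_seq, num_bins=4, eps=1e-3):
    """Average the support gap within equal-width bins of rollout position."""
    n = len(teacher_seq)
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for i, (t, s) in enumerate(zip(teacher_seq, student_seq)):
        b = min(num_bins - 1, i * num_bins // n)  # bin index for position i
        sums[b] += support_gap(t, s, eps)
        counts[b] += 1
    return [total / c if c else 0.0 for total, c in zip(sums, counts)]


if __name__ == "__main__":
    # Toy rollout of 4 positions over a 2-token vocabulary. Late positions
    # put teacher mass on a token the student almost never produces.
    teacher = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]
    student = [[0.5, 0.5], [0.6, 0.4], [0.9995, 0.0005], [0.9999, 0.0001]]
    print(mean_gap_by_bin(teacher, student, num_bins=2))
```

A gap that grows from the early bins to the late bins would correspond to the pattern described above, where mismatch concentrates near rollout tails.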

Figure: Teacher reliability and support-gap diagnostics.

Figure: Training dynamics, showing validation accuracy, shortcut counts, and long-rollout accuracy.

Code and Data

Public code package with reproducible data layout.

Training and evaluation code is included directly in the AR-OPD repository. The release includes launch scripts, configuration files, evaluation data, data manifests, and reconstruction utilities. Large derived datasets, checkpoints, and report bundles are intentionally left out of the release and kept on the local machine.

Code package: AR-OPD/code
Data layout: Data README
Restore assets: prepare_data_layout.sh

Minimal Setup

git clone https://github.com/vanhowe/AR-OPD.git
cd AR-OPD/code
bash scripts/install_env.sh
bash scripts/prepare_data_layout.sh

# After raw training files or chunk bundles are available locally:
bash scripts/materialize_all_derived_datasets.sh

Resources

Paper, code, and citation.

Paper: PDF preprint
Code: AR-OPD/code
arXiv: 2606.10385

BibTeX

@misc{aropd2026,
  title = {Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation},
  author = {Wenhao Zhang},
  year = {2026},
  eprint = {2606.10385},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2606.10385}
}