Detecting RLVR Training Data via Structural Convergence of Reasoning

1Zhejiang University, 2Westlake University, 3Kuaishou Technology
*Work done during internship at Kuaishou.

Abstract


Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but its training data is often undisclosed, raising concerns about benchmark contamination. Unlike pretraining, which optimizes token-level likelihoods, RLVR fine-tunes models on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective.

We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-kNN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the k smallest nearest-neighbor edit distances.

Min-kNN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-kNN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.

Structural Convergence Visualization

RLVR compresses diverse reasoning paths into shared structural modes.

Analyzing Reasoning Patterns under RLVR


RLVR improves performance, but it also reshapes the reasoning space; we quantify this reshaping systematically. Our analysis of Qwen-2.5-7B trained with DAPO and GRPO reveals a structural divergence between seen and unseen data.

1. Generation Rigidity

We measure generation diversity using an edit-distance-based score (EAD), which quantifies the lexical variation among the responses generated for a prompt.

As shown in the figure, we observe a consistent, monotonic decline in diversity as RLVR training progresses. This confirms that RLVR does not merely optimize the final answer but systematically narrows the reasoning space, leading to more rigid and homogeneous outputs.

Diversity Drop

Evolution of generation diversity during RLVR training.
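
For concreteness, a minimal, self-contained sketch of such a diversity score is shown below: the mean pairwise normalized edit distance over completions sampled for one prompt. This is an illustrative proxy, not necessarily the exact metric plotted in the figure, and the Levenshtein helper is a plain dynamic-programming implementation rather than the authors' code.

# Illustrative diversity proxy: mean pairwise normalized edit distance
# over completions sampled for one prompt. Higher values = more diverse.
# Sketch only; not necessarily the exact metric reported in the figure.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def pairwise_diversity(completions: list[str]) -> float:
    """Mean normalized edit distance over all pairs of completions."""
    n = len(completions)
    assert n >= 2, "need at least two completions"
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(normalized_edit_distance(completions[i], completions[j])
               for i, j in pairs) / len(pairs)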


Symbolic Convergence Example Category Trends

Top: Visualization of repetitive n-grams.
Bottom: Growth of rigid 3-gram categories during RLVR training.

2. Convergence of Symbolic Segments


What drives the structural convergence in RLVR? We classified recurring text fragments into four functional categories:

  • Restatement: Verbatim or paraphrased repetitions of the problem statement.
  • Boilerplate: Generic reasoning templates and connective phrases that serve as structural fillers.
  • Symbolic Logic: Core reasoning steps involving algebraic manipulations, formulas, or standardized mathematical transformations.
  • Other: Remaining tokens that do not fit the above categories.

Our analysis reveals that Symbolic Logic fragments exhibit the most rapid increase during RLVR training. Unlike generic filler, these logic segments represent the essential reasoning structure that the model increasingly compresses into rigid templates.

This suggests that RLVR creates a distinctive behavioral signature by driving models toward a limited set of standardized reasoning modes.
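
The classification procedure itself is not reproduced on this page. Purely as an illustration, a hypothetical rule-based tagger over recurring fragments might look like the sketch below; the boilerplate phrase list and the symbolic-content regex are assumptions, not the authors' rules.

import re

# Hypothetical rule-based tagger for recurring reasoning fragments.
# The phrase list and regex are illustrative assumptions, not the
# classification rules used in the paper.

BOILERPLATE_PHRASES = (
    "let's think step by step", "we need to find",
    "therefore, the answer is", "now we can", "so we have",
)
SYMBOLIC_PATTERN = re.compile(r"[=+\-*/^<>]|\\frac|\\sqrt|\b\d+[a-z]\b")

def classify_fragment(fragment: str, problem_text: str) -> str:
    """Assign a recurring fragment to one of four functional categories."""
    frag = fragment.lower().strip()
    if frag and frag in problem_text.lower():
        return "Restatement"      # repeats part of the problem statement
    if any(phrase in frag for phrase in BOILERPLATE_PHRASES):
        return "Boilerplate"      # generic connective / template phrase
    if SYMBOLIC_PATTERN.search(frag):
        return "Symbolic Logic"   # algebraic or formulaic content
    return "Other"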


3. Structural Modes: Seen vs. Unseen

Does RLVR affect seen and unseen data differently? We find a clear behavioral signature in the structural space.

  • Seen Prompts: Exhibit stronger rigidity, typically collapsing into only 2-4 reasoning clusters.
  • Unseen Prompts: Maintain significantly higher structural diversity and more reasoning paths.

This quantifiable gap in "clustering tendency" is what allows Min-kNN Distance to reliably identify training data.

Distribution of reasoning structure clusters

Distribution of reasoning structure clusters across prompts.
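
The clustering procedure behind this figure is not detailed here. One plausible illustration: group the sampled completions into structural modes via agglomerative clustering on pairwise normalized edit distances, as in the sketch below. It assumes the rapidfuzz and scipy libraries, and the 0.35 merge threshold is an arbitrary illustrative choice.

import numpy as np
from rapidfuzz.distance import Levenshtein
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def count_structural_modes(completions: list[str], threshold: float = 0.35) -> int:
    """Cluster completions by normalized edit distance and count the clusters.

    The average-linkage scheme and the merge threshold are illustrative
    assumptions, not values taken from the paper.
    """
    n = len(completions)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = Levenshtein.normalized_distance(completions[i], completions[j])
            dist[i, j] = dist[j, i] = d
    condensed = squareform(dist)   # condensed pairwise-distance vector
    labels = fcluster(linkage(condensed, method="average"),
                      t=threshold, criterion="distance")
    return len(set(labels))

Under this view, RL-seen prompts would collapse into a handful of clusters while unseen prompts spread across many.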

Method: Min-kNN Distance


Min-kNN Distance is a robust, black-box method for detecting RLVR training data exposure without requiring access to token probabilities or internal model parameters.

  • Sampling: Generate multiple completions for a given prompt from the model.
  • Edit Distance: Compute the pairwise normalized edit distance between all generated completions.
  • Min-kNN: For each completion, take the distance to its nearest neighbor, then average the k smallest of these distances.

Core Insight: RLVR induces a structural collapse on seen prompts, leading to highly repetitive outputs and significantly lower Min-kNN values compared to unseen prompts.

$$NN_{i} = \min_{j \neq i} d(c_i, c_j)$$

$$\text{Min-}k\text{NN} = \frac{1}{k} \sum_{i=1}^{k} NN_{(i)}$$

where $d$ is the normalized edit distance, $NN_{(1)} \le NN_{(2)} \le \dots$ are the per-completion nearest-neighbor distances sorted in ascending order, and $k=10$ by default.
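
Putting the two formulas together, a minimal sampling-only scorer might look like the sketch below. It assumes the rapidfuzz library for the normalized Levenshtein distance; any equivalent normalized edit distance (such as the pure-Python helper earlier on this page) works the same way.

from rapidfuzz.distance import Levenshtein

def min_knn_distance(completions: list[str], k: int = 10) -> float:
    """Min-kNN Distance for a set of completions sampled from one prompt.

    For every completion, take the normalized edit distance to its nearest
    neighbor, then average the k smallest of those values. Lower scores
    indicate structural collapse, i.e. a likely RL-seen prompt.
    """
    n = len(completions)
    assert n >= max(2, k), "need at least max(2, k) completions"
    nearest = [
        min(Levenshtein.normalized_distance(completions[i], completions[j])
            for j in range(n) if j != i)
        for i in range(n)
    ]
    return sum(sorted(nearest)[:k]) / k

Detection then reduces to ranking or thresholding prompts by this score; the number of sampled completions and the decision threshold are deployment choices not fixed here.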

Experimental Results


Main Results Table

AUC results for RLVR data detection across different open-source and benchmark models.

We evaluated Min-kNN Distance across various reasoning models and RLVR training setups. Our method consistently outperforms all existing membership inference and reinforcement learning contamination detection baselines:

  • State-of-the-Art Performance: Min-kNN achieves the highest AUC across all evaluated models, with an average score of 0.70.
  • Significant Gains: Our approach provides a 17% relative improvement over the strongest existing baseline.
  • Stability Across Scales: The detection signal remains robust for models ranging from 1.5B to 32B parameters and across various RL algorithms like GRPO and DAPO.
  • Fully Black-Box: Unlike probability-based methods, Min-kNN operates in a sampling-only setting, requiring no access to token log probabilities or internal model states.

BibTeX

@article{zhang2026detecting,
 title={Detecting RLVR Training Data via Structural Convergence of Reasoning},
 author={Zhang, Hongbo and Yue, Yang and Yan, Jianhao and Bao, Guangsheng and Zhang, Yue and Zhang, Yue},
 journal={Preprint},
 year={2026}
}