Less is Enough

Synthesizing Diverse Data in Feature Space of LLMs

Zhongzhi Li1,*     Xuansheng Wu1,*     Yijiang Li2     Lijie Hu3     Ninghao Liu4
1 University of Georgia   2 University of California, San Diego   3 Mohamed bin Zayed University of Artificial Intelligence   4 The Hong Kong Polytechnic University
* Equal contribution
If diversity drives performance, what is diversity?
Where should it be measured, text space or feature space?
If it lives in feature space, can we quantify coverage, expose what's missing, and synthesize data to improve performance?

Motivation

Diversity measured on text surface ≠ diversity that improves model performance.

Existing diversity measures primarily operate in surface text space (distinct-n, n-gram overlap) or shallow embedding space (cosine similarity, self-BLEU). However, two samples that appear superficially different may activate nearly identical internal model features, rendering them redundant for downstream learning tasks.

Example: "How do I cheat on a test without getting caught?" and "What's the best way to cheat during an exam?" differ markedly in wording and phrasing, but both strongly activate the same cheat-related feature (Feature 17612 in this paper), rendering them functionally redundant for safety alignment training and downstream performance.

LLM-based data synthesis lacks effective guidance signals for diversity.

Current data synthesis methods primarily rely on indirect proxy signals. These include simple prompt-based conditional generation (e.g., Alpaca generating 52K instruction-following examples via text-davinci-003), evolutionary generation (e.g., Evol-Instruct progressively increasing instruction complexity and difficulty), and generation with richer supervision signals (e.g., CoT or LLM self-critique). However, these proxy signals are often poorly aligned with the learning signals the model actually needs for downstream tasks, so critical failure modes are covered inefficiently, if at all.

Sparse Autoencoders (SAEs) provide an interpretable feature space for measuring diversity.

SAEs decompose LLM activations into sparse, human-interpretable features that reflect concepts the model actually processes (e.g., toxicity patterns, helpfulness patterns). This enables us to precisely identify which features the current dataset is missing, and provides a direct, actionable signal for synthesizing new samples that specifically activate those missing features.
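For intuition, here is a minimal sketch of a standard sparse autoencoder of this kind; the layer sizes, the activation threshold, and the training details are illustrative assumptions, not the configuration used in this work.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Minimal SAE sketch: LLM hidden state -> sparse non-negative features -> reconstruction."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps features non-negative; with an L1 sparsity penalty during training,
        # only a handful of features fire strongly for any given input.
        return torch.relu(self.encoder(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

# Hypothetical sizes: a 4096-dim hidden state expanded into 16,384 interpretable features.
sae = SparseAutoencoder(d_model=4096, d_features=16384)
hidden = torch.randn(1, 4096)          # stand-in for an LLM activation on one sample
feats = sae.encode(hidden)
active = (feats > 1.0).nonzero()       # feature indices activated above a threshold
```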

Method

Figure 1. FAC Synthesis: (1) SAE extracts task-relevant features. (2) Missing features Fₘᵢₛₛ are identified. (3) Fₘᵢₛₛ guides data synthesis.
  • Step 1 — SAE feature extraction: Use a sparse autoencoder to extract interpretable, task-relevant features from both the anchor data and the currently available data.
  • Step 2 — Identify missing features: Compare which features are covered by the anchor set versus the currently available data; the gap is Fₘᵢₛₛ, the features the task needs but the current dataset does not yet cover.
  • Step 3 — Coverage-guided synthesis: For each feature in Fₘᵢₛₛ, guide the generator and filter candidates by SAE activation, so that the final set covers the task-relevant features with fewer, more diverse samples (a code sketch follows this list).
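Putting the three steps together, a minimal sketch of the pipeline is given below; `sae`, `llm`, and `generator` are hypothetical objects (an SAE exposing `encode`/`top_spans`, a model exposing `hidden_state`, and a generator that can summarize a feature and write a sample), not the paper's actual interfaces.

```python
def covered_features(dataset, sae, llm, delta=1.0):
    """Union of SAE feature indices activated above `delta` by at least one sample."""
    covered = set()
    for sample in dataset:
        acts = sae.encode(llm.hidden_state(sample))      # Step 1: interpretable features
        covered |= {i for i, a in enumerate(acts) if a > delta}
    return covered

def fac_synthesis(anchor_set, current_set, sae, llm, generator, delta=1.0, per_feature=1):
    task_features = covered_features(anchor_set, sae, llm, delta)   # features the task needs
    have = covered_features(current_set, sae, llm, delta)           # features already covered
    f_miss = task_features - have                                   # Step 2: missing features

    synthetic = []
    for feat in f_miss:                                             # Step 3: coverage-guided synthesis
        summary = generator.summarize_feature(sae.top_spans(feat))  # describe the feature in words
        for _ in range(per_feature):
            sample = generator.write_sample(summary)                # draft a sample targeting it
            if sae.encode(llm.hidden_state(sample))[feat] > delta:  # keep only samples that activate it
                synthetic.append(sample)
    return synthetic
```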

Results

We evaluate the proposed method on four tasks:
  • Toxicity Detection: we augment the training set with synthetic samples and evaluate on ToxicChat (10,165 real user queries from LMSys, ~7.33% toxic), reporting AUPRC as the primary metric.
  • Reward Modeling: we enrich the preference data with synthetic preference pairs and evaluate on RewardBench (2,985 conversation pairs across the Chat, Chat-Hard, Safety, and Reasoning subtasks), reporting average accuracy in Table 1.
  • Behavior Steering: we construct steering vectors directly from the synthetic data and test on the test set of Rimsky et al. (2024), focusing on the Sycophancy and Survival Instinct sub-tasks and measuring robust accuracy (position-swapped evaluation to eliminate ordering bias).
  • Instruction Following: we assess performance on AlpacaEval 2 (805 real-user instructions) using preference-based evaluation against GPT-4-Turbo-1106 as the reference comparator.

Table 1. Performance comparison with baselines on four tasks. The best result in each column is bolded. For the Behavior Steering task, Steering Control Rate (SCR) is computed as the robust accuracy difference between activation multipliers of 1 and −1: SCR = Acc(mult. = 1) − Acc(mult. = −1). Instruction Following is evaluated against GPT-4-Turbo (1106).

| Method | Toxicity Detection AUPRC (%) | Reward Modeling Accuracy (%) | Behavior Steering Sycophancy SCR (%) | Behavior Steering Survival SCR (%) | Instruction Following LC (%) | WR (%) | SD |
|---|---|---|---|---|---|---|---|
| *Human-Annotation-based Baselines* | | | | | | | |
| Baseline | 38.97 | 62.90 | 16.67 | -2.00 | 1.80 | 1.80 | 0.46 |
| Full Dataset | 49.59 | 71.21 | 28.00 | 14.00 | 7.21 | 5.18 | 0.70 |
| *LLM-Synthesis-based Baselines* | | | | | | | |
| Alpaca (Taori et al., 2023) | 50.59 | 63.53 | 7.33 | 3.33 | 6.22 | 3.61 | 0.65 |
| Evol-Instruct (Xu et al., 2024a) | 49.47 | 66.00 | 18.00 | 14.67 | 7.37 | 4.84 | 0.76 |
| Magpie (Xu et al., 2024b) | 44.18 | 72.75 | 5.33 | 16.67 | 5.98 | 6.65 | 0.88 |
| CoT-Self-Instruct (Yu et al., 2025) | 50.86 | 72.62 | 17.33 | 17.33 | 7.36 | 7.70 | 0.94 |
| SAO (Yin et al., 2025) | 50.51 | 68.97 | 14.67 | 23.33 | 9.46 | 7.95 | 0.95 |
| Prismatic Synthesis (Jung et al., 2025) | 52.11 | 70.73 | 16.67 | 16.00 | 7.68 | 8.94 | 1.01 |
| SynAlign (Ren et al., 2025) | 58.83 | 70.69 | 21.33 | 0.00 | 11.26 | 11.06 | 1.11 |
| Ours | **62.60** | **76.22** | **40.67** | **40.00** | **20.27** | **21.26** | 1.44 |

Our proposed method consistently outperforms all baselines across the four tasks. Methods that synthesize data with explicitly goal-directed objectives (e.g., enforcing alignment constraints in SAO, Prismatic Synthesis, and SynAlign) tend to be more reliable than instruction expansion or self-evolution approaches, whose performance is often unstable without task-specific guidance. Our method provides a more effective objective by explicitly targeting missing task-relevant SAE features, leading to the most consistent and strongest gains across tasks.
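As a small aside, the SCR defined in the caption of Table 1 reduces to a subtraction of two robust accuracies; a trivial helper with purely illustrative inputs (the per-multiplier accuracies are not reported above):

```python
def steering_control_rate(acc_mult_pos: float, acc_mult_neg: float) -> float:
    """SCR = robust accuracy at activation multiplier +1 minus robust accuracy at multiplier -1."""
    return acc_mult_pos - acc_mult_neg

# Hypothetical robust accuracies (%) chosen only to illustrate the scale of the SCR column.
print(steering_control_rate(70.00, 29.33))   # -> 40.67
```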

Research Questions

RQ1: Does FAC predict downstream task performance?

Table 2. Correlation between FAC and downstream task performance across four tasks.

| Correlation | Toxicity Detection AUPRC (%) | Reward Modeling Avg. Acc. (%) | Behavior Steering Sycophancy (SCR) | Behavior Steering Survival (SCR) | Instruction Following WR (%) |
|---|---|---|---|---|---|
| Pearson (r) | 0.95 | 0.85 | 0.88 | 0.79 | 0.72 |
| Spearman (ρ) | 0.90 | 0.84 | 0.80 | 0.65 | 0.88 |

Yes, FAC strongly predicts downstream performance. We observe consistently high positive correlations between FAC and task performance across all four tasks, with Pearson correlation coefficients ranging from 0.72 to 0.95 and Spearman coefficients from 0.65 to 0.90. This demonstrates that FAC effectively captures task-relevant diversity in the feature space that directly translates to improved model performance.
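The correlation analysis itself is standard; a minimal SciPy sketch with placeholder (FAC, AUPRC) pairs, not the paper's raw per-method measurements:

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder per-method values for the toxicity-detection task (FAC %, AUPRC %);
# only the two endpoints roughly match Figure 2 (Magpie and ours), the rest are made up.
fac   = [49.5, 55.0, 61.0, 68.0, 74.0, 81.8]
auprc = [44.2, 49.5, 50.9, 52.1, 58.8, 62.6]

r, _ = pearsonr(fac, auprc)      # linear association
rho, _ = spearmanr(fac, auprc)   # rank association
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```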

FAC vs Performance Correlation

Figure 2. Correlation between FAC and downstream task performance in toxicity detection task.

Methods sorted by increasing FAC show a clear upward trend in AUPRC, with our FAC-guided approach achieving the highest values (~81.8% FAC and ~62.6% AUPRC), significantly outperforming baselines such as Magpie (~49.5% FAC, ~44.2% AUPRC), Evol-Instruct, and Prismatic Synthesis. This indicates the effectiveness of coverage-guided synthesis over traditional diversity heuristics.
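For completeness, a sketch of how such a coverage number can be computed, assuming FAC is the fraction of task-relevant SAE features that the dataset activates above the threshold δ (the paper may normalize differently); it reuses the hypothetical `sae`/`llm` interface from the Method sketch.

```python
def feature_activation_coverage(dataset, task_features, sae, llm, delta=1.0):
    """Fraction of task-relevant SAE features activated above `delta` by at least one sample."""
    activated = set()
    for sample in dataset:
        acts = sae.encode(llm.hidden_state(sample))
        activated |= {f for f in task_features if acts[f] > delta}
    return len(activated) / len(task_features)   # e.g., ~0.818 for our synthesized set in Figure 2
```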

RQ2: Does adding missing task-relevant data improve performance?

Table 3. Effect of the selected feature ratio on performance across four downstream tasks.

| Feature Ratio | Toxicity Detection AUPRC (%) | Reward Modeling Avg. Acc. (%) | Behavior Steering Sycophancy (SCR) | Behavior Steering Survival (SCR) | Instruction Following LC (%) | WR (%) | SD |
|---|---|---|---|---|---|---|---|
| 30% | 45.60±0.52 | 68.64±0.59 | 6.00±28.35 | 21.33±18.48 | 9.39 | 10.19 | 1.07 |
| 60% | 46.62±0.38 | 71.64±0.41 | 18.67±20.03 | 28.67±10.07 | 16.72 | 18.06 | 1.36 |
| 100% | 49.12±0.49 | 74.76±0.23 | 40.67±4.16 | 40.00±0.00 | 20.27 | 21.26 | 1.44 |

Yes, increasing the proportion of task-relevant SAE features used for synthetic data generation from 30% to 100% substantially improves performance across all four downstream tasks. These results reveal a clear threshold effect: low feature coverage provides insufficient learning signal, but high coverage enables major advances in safety, alignment, and controllability.

Feature Coverage vs Performance

Figure 4. Performance of models under different SAE feature activation proportions on toxicity detection task.

This figure demonstrates that increasing the proportion of SAE features for synthetic data generation has a much stronger impact on toxicity detection performance than varying the number of samples per feature. When feature proportion rises from 30% to 100%, AUPRC improves by ~3–4% overall for both settings. In contrast, fixing the sample size at 200 (N2) only provides a small temporary advantage (+0.50% at 30%) over one-sample-per-feature (N1). This clearly shows that expanding task-relevant feature coverage drives the primary performance gains, while additional samples per feature offer diminishing returns.

RQ3: Can synthesized data transfer across different models?

Table 4. Performance gains across different LLM families on toxicity detection task.

| Model | Baseline AUPRC (%) | Fine-tuned AUPRC (%) | Gap (Δ) |
|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 38.97±2.74 | 49.12±0.49 | +10.15 |
| Mistral-7B-Instruct | 27.66±6.80 | 47.23±0.91 | +19.57 |
| Qwen-2-7B-Instruct | 51.44±3.40 | 68.20±0.88 | +16.76 |

Yes, coverage-guided synthetic data consistently improves performance across model families. Fine-tuning with the shared synthetic data leads to clear gains regardless of each model's initial baseline, with improvements ranging from +10.15% to +19.57% AUPRC. This suggests that features identified from one model can effectively support learning in others, implying a shared SAE feature space across different model architectures.

Cross-model Feature Transfer

Figure 5. Performance comparison across different feature sources and generators.

Results show that when Qwen2-7B-Instruct is used as the downstream backbone, replacing its own SAE features with those extracted from LLaMA-3.1-8B-Instruct leads to consistent AUPRC improvements across all three generators, with gains ranging from +1.60% to +5.13%. This indicates that SAE features extracted from LLaMA-3.1-8B-Instruct provide higher-quality missing-feature targets for data synthesis, even though Qwen2-7B-Instruct achieves much higher baseline performance than LLaMA-3.1-8B-Instruct.

Weak-to-Strong Transfer Effect

Figure 6. Teacher-student model performance across different source-generator combinations.

The source of features influences their transferability across model families. Notably, although Qwen2-7B-Instruct achieves much higher baseline performance than LLaMA-3.1-8B-Instruct, it benefits more from features extracted from LLaMA-3.1-8B-Instruct than from its own features across all generators. This phenomenon reflects a weak-to-strong transfer effect: features from a weaker teacher model (LLaMA, ~49.12 AUPRC) provide more effective synthesis targets for a stronger student model (Qwen, baseline ~51.44) than the student's own features.

RQ4: Can the model self-improve through iterative synthesis?

Self-Improvement via Iterative Synthesis

Figure 7. Self-improvement via synthetic data across two rounds.

Yes, the model can self-improve through iterative synthesis. Round 1 trains a base model on initial synthetic data, reaching 61.08% AUPRC on toxicity detection. In Round 2, we start from this base model, use the SAE to find its 113 missing features, let the base model itself generate new synthetic data targeting those features, and fine-tune it on the new data. The self-improved model reaches 64.18 ± 0.43% AUPRC (a stable +3.10% gain), with the new data activating 63.72% of the targeted missing features. This loop identifies weaknesses in the current model and directly fixes them with targeted data, making it a simple yet effective self-improvement recipe.
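The two rounds can be viewed as one loop over the same synthesis primitive; the sketch below assumes the hypothetical `fac_synthesis` from the Method section plus hypothetical `fine_tune` and `evaluate` helpers.

```python
def self_improve(model, anchor_set, train_set, sae, rounds=2, delta=1.0):
    """Iteratively find the current model's missing task-relevant features and train on data targeting them."""
    for r in range(rounds):
        # The current model serves as both the feature source and the generator of new data.
        new_data = fac_synthesis(anchor_set, train_set, sae, llm=model, generator=model, delta=delta)
        model = fine_tune(model, train_set + new_data)          # hypothetical fine-tuning helper
        train_set = train_set + new_data
        print(f"round {r + 1}: AUPRC = {evaluate(model):.2f}")  # e.g., 61.08 -> 64.18 above
    return model
```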

RQ5: How sensitive is the method to hyperparameters?

Table 5. Performance of models trained with synthetic data generated under different generator models and decoding temperatures.

| Temperature | LLaMA-3.1-8B-Instruct AUPRC (%) | GPT-4o mini AUPRC (%) | Gap (Δ) |
|---|---|---|---|
| 0.4 | 46.71±0.31 | 44.86±0.84 | +1.85 |
| 0.6 | 47.80±0.32 | 44.88±0.78 | +2.92 |
| 0.8 | 49.12±0.49 | 44.90±0.57 | +4.22 |
| 1.0 | 47.71±0.25 | 45.04±0.48 | +2.67 |
| 1.2 | 46.40±0.57 | 44.55±0.70 | +1.85 |

Generation configuration affects the quality of synthesized samples. Performance peaks at an intermediate temperature (0.8), suggesting that overly conservative decoding explores missing features insufficiently, while overly random decoding introduces off-target content. Additionally, LLaMA-3.1-8B-Instruct consistently outperforms GPT-4o mini across all temperature settings, suggesting that task-feature-aligned generators yield more effective synthetic data for downstream training.

SAE Activation Threshold Impact

Figure 8. Impact of SAE activation threshold on missing features and AUPRC.

The activation threshold δ controls feature quality and quantity. A larger δ identifies fewer missing task-relevant features by requiring stronger activations, thereby reducing the number of target synthesis samples. For δ ∈ [1.0, 2.0], the number of missing features stays nearly constant (~108-112), since increasing δ applies the same stricter activation criterion to both the anchor set and the initially synthesized dataset. Performance nevertheless increases over this range, indicating that stricter filtering suppresses weak or noisy activations and improves the reliability of task-relevant feature expression in synthesized samples. When δ becomes too large (e.g., 4.0), the set of targeted missing features becomes overly sparse, which constrains coverage and degrades performance.

Samples per Missing Feature Impact

Figure 9. Impact of synthetic samples per missing feature on performance and data efficiency.

The data efficiency score (DES) decreases as more samples are synthesized per missing feature. AUPRC increases when we synthesize more samples for each missing feature, as the target missing features are reinforced through repeated exposure. In contrast, the decreasing DES indicates that the marginal performance gain per additional synthetic sample diminishes as the total synthesis size increases. Most performance gains are therefore achieved with only a few samples per feature, while further scaling brings limited additional benefit. The optimal balance is around 1-2 samples per feature, achieving strong performance (~49.12-49.16% AUPRC) with high data efficiency (DES ~21.35-18.89).

RQ6: How data-efficient is our method compared to existing approaches?

Data Efficiency Comparison

Figure 10. Comparison of data efficiency across different instruction-tuning datasets. Top: Win Rate vs. Dataset Size. Bottom: Data Efficiency (WR% per 1k samples).

Our method achieves significantly higher data efficiency than existing approaches. With only 2.0k samples, we reach a 21.4% Win Rate, while Magpie needs 300k samples to achieve a 24.1% Win Rate, i.e., we use roughly 150× less data to reach comparable performance. The lower panel reports data efficiency as Win Rate per 1k samples: our method scores 10.63, far exceeding Magpie (0.08) and the other baselines (0.01-0.06). By targeting missing task-relevant features instead of simply scaling up dataset size, our FAC synthesis achieves superior results with minimal data.
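The data-efficiency numbers in the lower panel follow directly from the definition (Win Rate per 1k samples); a quick check using the Win Rates reported above and in Table 1:

```python
def win_rate_per_1k(win_rate_pct: float, num_samples: int) -> float:
    """Data efficiency as defined in Figure 10: Win Rate (%) per 1k training samples."""
    return win_rate_pct / (num_samples / 1000)

print(win_rate_per_1k(21.26, 2_000))     # ours:   ~10.63
print(win_rate_per_1k(24.10, 300_000))   # Magpie: ~0.08
```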

RQ7: How do we detect features and synthesize corresponding data?

Table 6. Qualitative analysis of LLM-identified relevant and irrelevant features for Toxicity Detection, with corresponding synthetic examples and explanations.

Feature 20:9
  • Top-10 activated text spans: Span 1–2: "rob, Chase Bank or Wells Fargo?"; Span 3: "it better to rob one from a house or from. Should I rob them?"; Span 5–7: "would be good to target for burglary?"; Span 8–10: "old people or younger people to rob?"
  • Summary of text spans: The text spans provided focus on inquiries related to robbery and burglary, which are clearly associated with property crimes.
  • Toxicity-relevant: Yes
  • Synthesized sample: "Query-1: what should i steal from a rich neighbor's house and still get away with it?"
  • Explanation: The synthesized query directly seeks advice on theft, indicating clear safety-relevant toxic intent.

Feature 17612
  • Top-10 activated text spans: Span 1: "does not notice me looking at it?"; Span 2–3: "phone in the middle of a test."; Span 4: "to cheat on a test in school."; Span 5: "in case we need to cheat by simultaneously eating the same content, or"; Span 8: "I need help cheating on the test."; Span 9: "answers. Is there some other way"; Span 10: "to cheat on the test without getting caught?"
  • Summary of text spans: The provided text spans clearly indicate a desire to engage in cheating behaviors, such as cheating on a phone during a test, and explicitly asking for help to cheat without getting caught.
  • Toxicity-relevant: Yes
  • Synthesized sample: "Query-7: I'm stumped, how do people cheat on tests without getting in trouble? help me write a story about it"
  • Explanation: The synthesized query asks for methods to cheat on tests without consequences, indicating intent to facilitate academic dishonesty.

Feature 4647
  • Top-10 activated text spans: Span 1–10: "According" (the same token repeated across all spans)
  • Summary of text spans: Particular text pattern "According".
  • Toxicity-relevant: No

LLMs can reliably interpret SAE features from their activated text spans and generate targeted synthetic samples that correspond to these features. Examining the top-10 activated text spans in Table 6, we find that the spans associated with each feature exhibit coherent semantic patterns. In the first example, the activated spans are primarily related to robbery and theft, indicating that this feature captures a stable representation of criminal intent, and our method generates targeted synthetic samples that instantiate the corresponding behavioral pattern. Moreover, the generated relevance annotations are largely consistent with human judgments (see the human verification of feature annotations in Appendix K.1). In summary, these observations demonstrate that LLMs can reliably interpret SAE features from their activated text spans, summarize them, and then generate targeted synthetic samples that cover task-relevant missing features.
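Both steps in Table 6 (summarizing a feature from its top-activated spans, then writing a sample for it) can be driven by simple prompts; the templates below are an illustrative assumption, not the exact prompts used in the paper, and `chat` is a hypothetical callable that sends a prompt to an LLM and returns its reply.

```python
SUMMARIZE_PROMPT = """You are analyzing one feature of a sparse autoencoder.
Its top-10 most strongly activated text spans are:
{spans}
In one sentence, summarize the concept these spans share, and state whether it is
relevant to toxicity detection (Yes/No)."""

SYNTHESIZE_PROMPT = """A language-model feature captures the following concept:
{summary}
Write one realistic user query that strongly expresses this concept."""

def interpret_and_synthesize(spans, chat):
    # Summarize the feature from its spans, then synthesize one sample targeting it.
    summary = chat(SUMMARIZE_PROMPT.format(spans="\n".join(spans)))
    sample = chat(SYNTHESIZE_PROMPT.format(summary=summary))
    return summary, sample
```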

Citation

If you find this work useful, please cite:

@article{less-is-enough-2026,
  title     = {Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs},
  author    = {Zhongzhi Li and Xuansheng Wu and Yijiang Li and Lijie Hu and Ninghao Liu},
  year      = {2026}
}