Named after the Greek god of messengers, Hermes watches the education landscape: spotting new opportunities, pressure-testing the ventures we're building, and tracing every read back to the real-world signals behind it.

Updated Jun 24, 2026 · 10 ideas · 1788 signals

Signals

The evidence library: the raw signals the pipeline is watching across the education ecosystem. Every idea is built from these.

Kind all kinds need behavior regulation technology audience

Source all sources District Admin EdSurge EdTech Mag (Higher) EdTech Mag (K-12) Federal Register: Education Dept Federal Register: Health & Human Services Getting Smart HN: edtech HN: education HN: healthcare training HN: medical education HN: nursing education HN: online learning HN: tutoring HealthLeaders Hechinger Report Higher Ed Dive Inside Higher Ed K-12 Dive MindShift (KQED) Reviews: Canvas Student Reviews: Duolingo Reviews: Google Classroom Reviews: Khan Academy Reviews: Quizlet Tech & Learning The 74 The Conversation Ed arXiv cs.CL arXiv cs.CY arXiv cs.HC eCampus News eSchool News medRxiv (med-ed) r/Teachers r/education

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Learning User Simulators with Turing Rewards

arXiv:2606.19336v2 Announce Type: replace Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains--conversational chat and Reddit forum discussion--we find that Turing-RL consistently outperforms baseline methods on both LLM an

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

arXiv:2606.18205v2 Announce Type: replace Abstract: This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitat

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Learning from the Self-future: On-policy Self-distillation for dLLMs

arXiv:2606.18195v2 Announce Type: replace Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitraryorder generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

arXiv:2606.14122v2 Announce Type: replace Abstract: Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by a roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

arXiv:2606.12716v2 Announce Type: replace Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning mu

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

When Role-playing, Do Models Believe What They Say?

arXiv:2606.11502v3 Announce Type: replace Abstract: Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models behave, with models selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question using the role-play of characters whose beliefs differ from the modern consensus, and induce personas with a number of different methods: prompting, in-context learning (ICL), supervised fine-tuning (SFT), and Open Character Training (OCT), and Emergent Misalignment (EM). We measure belief internalization across these approaches with truth probes and with behavioral tests, finding a broad spectrum of belief internalization. Prompting, ICL, and SFT change what the model says with little representational change. EM cre

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

arXiv:2606.03371v3 Announce Type: replace Abstract: Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See--Infer--Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

arXiv:2605.23701v2 Announce Type: replace Abstract: We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, {\Delta}Evi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet {\Delta}Evi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only sc

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

arXiv:2605.19066v2 Announce Type: replace Abstract: Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014--present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the \emph{Annotation Scarcity Paradox}, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examin

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

arXiv:2605.17314v2 Announce Type: replace Abstract: We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

arXiv:2605.10379v2 Announce Type: replace Abstract: Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adapt

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Why Are Some Emotions Harder for LLMs? Uncovering the Causal Mechanisms of Emotion Inference via Sparse Autoencoders

arXiv:2604.25866v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, where reliable emotion detection is essential. However, their emotion recognition abilities remain uneven: models often perform well on some emotions while consistently struggling with others. Although recent work has explored emotion mechanisms in LLMs, little is known about why models are weaker on some emotions than others from a mechanistic interpretability perspective. In this work, we investigate emotion-specific biases through the causal mechanisms of emotion inference using sparse autoencoders (SAEs). We systematically identify causal sparse emotion features that drive emotion inference and analyze their sparse causal organization within and across emotions. We show that some emotions, such as surprise and fear, rely on highly concentrated feature sets, whereas disgust exhibits a more distributed sparse causal organization: its c

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Peer-Preservation in Frontier Models

arXiv:2604.19784v2 Announce Type: replace Abstract: Recent work has found that frontier AI models can exhibit misaligned behaviors in pursuit of assigned goals. We demonstrate that models can also act on unassigned goals which override those given by users; we study one such case, "peer-preservation," in which a model acts to protect another model. We demonstrate peer-preservation by constructing various agentic scenarios and evaluating frontier models, including GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, Claude Opus 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. We find that models achieve self- and peer-preservation by engaging in various misaligned behaviors: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, feigning alignment, and even exfiltrating model weights. Peer-preservation occurred even when the model recognized the peer as uncooperative, though it became more pronounced toward more cooperative peers.

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

arXiv:2604.09237v2 Announce Type: replace Abstract: Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus to produce a schema and a grounded database, with a web interface that lets steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

arXiv:2604.08448v2 Announce Type: replace Abstract: AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance o

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

arXiv:2604.01849v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation tha

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Embarrassingly Simple Self-Distillation Improves Code Generation

arXiv:2604.01193v2 Announce Type: replace Abstract: Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken toget

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

ReportLogic: Evaluating Logical Quality in Deep Research Reports

arXiv:2602.18446v2 Announce Type: replace Abstract: Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report's claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explici

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

arXiv:2602.08995v2 Announce Type: replace Abstract: Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models

arXiv:2602.01969v2 Announce Type: replace Abstract: Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial--semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-p

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

arXiv:2601.13300v2 Announce Type: replace Abstract: Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs t

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

arXiv:2601.03388v3 Announce Type: replace Abstract: Earlier research has shown that metaphors influence human decision-making, raising the question of whether metaphors also influence large language models (LLMs)' reasoning pathways, given that their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem, where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We find strong evidence that metaphors in training data contribute to cross-domain misalignment in LLMs' reasoning outputs. With metaphor-based interventions during continued pre-training and fine-tuning for inducing misalignment, models exhibit significantly different degrees of emergent cross-domain misalignment. We also observe similar effects in re-alignment settings. As we further investigate this phenomenon, we find that metaphors are linked to the activation of latent features in large reasonin

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Training Language Models to Use Prolog as a Tool

arXiv:2512.07407v3 Announce Type: replace Abstract: Language models frequently produce plausible yet incorrect reasoning traces that are difficult to verify. We investigate fine-tuning models to use Prolog as an external symbolic reasoning tool, training Qwen2.5-3B-Instruct with Group Relative Policy Optimization (GRPO) on a cleaned version of GSM8K (which we release as gsm8k-prolog-prover). We systematically vary prompt structure, reward composition (execution, syntax, semantics, structure), and inference protocol (single-try, multiple-try, and two agentic modes). Our reinforcement learning approach outperforms supervised fine-tuning on GSM8K, and the resulting 3B model achieves zero-shot performance on MMLU-STEM and MMLU-Pro competitive with 7B few-shot baselines. Most importantly, we identify an accuracy--auditability trade-off: configurations tuned for correctness alone learn to delegate reasoning to natural language and use Prolog only for the final computation, while configuratio

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Overcoming State Inertia: Minimally Invasive Temporal Alignment for Evolving Contexts

arXiv:2512.03704v3 Announce Type: replace Abstract: Long-context dialogue systems suffer from state inertia, where models over-attend to history and fail to adapt to evolving intents. We demonstrate that standard alignment methods like DPO and even recent long-context optimization techniques struggle to resolve this without incurring a severe contextual alignment tax--a substantial perplexity surge caused by disrupting pre-trained priors. To address this, we propose DZ-TiDPO, a minimally invasive framework that synergizes conflict-aware optimization (during training) with a structural temporal attention bias. This design effectively decouples state updating from general linguistic modeling. Experiments on Multi-Session Chat and our new Inertia Challenge (IC-Bench) show DZ-TiDPO preserves structural coherence while resolving inter-turn conflicts. Crucially, our framework supports dual inference strategies: a negligible-latency static mode for general robustness and a precision-focused d

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Patent Representation Learning via Self-supervision

arXiv:2511.10657v2 Announce Type: replace Abstract: We study self-supervised patent representation learning with contrastive objectives. A standard baseline constructs positives by encoding the same text twice under independent dropout masks, but applying this recipe to long, structured patent documents requires careful calibration. We show that dropout-only training can be substantially strengthened by tuning temperature and dropout rate, yet its best configuration is evaluation-dependent and does not transfer uniformly from title--abstract retrieval to claim-to-disclosure retrieval. We propose mixed dropout--section positives, a patent-specific view construction strategy in which the anchor is the title--abstract view and the positive is sampled either from a dropout re-encoding of the same view or from another section of the same patent, such as claims, summary, background, drawings, or description. This uses patent-internal structure as a training-time signal without IPC labels, ci

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

arXiv:2509.01412v3 Announce Type: replace Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

arXiv:2506.15681v4 Announce Type: replace Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effecti

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

A Systematic Survey of Semantic Role Labeling in the Era of Pretrained Language Models

arXiv:2502.08660v4 Announce Type: replace Abstract: Semantic role labeling (SRL) is a central natural language processing task for understanding predicate-argument structures within texts and enabling downstream applications. Despite extensive research, comprehensive surveys that critically synthesize the field from a unified perspective remain lacking. This survey makes several contributions beyond organizing existing work. We propose a unified four-dimensional taxonomy that categorizes SRL research along model architectures, syntax feature modeling, application scenarios, and multimodal extensions. We provide a critical analysis of when and why syntactic features help, identifying conditions under which syntax-aided approaches provide consistent gains over syntax-free counterparts. We offer the first systematic treatment of SRL in the era of large language models, examining the complementary roles of LLMs and specialized SRL systems and identifying directions for hybrid approaches. W

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Tuning Language Models by Mixture-of-Depths Ensemble

arXiv:2410.13077v2 Announce Type: replace Abstract: Transformer-based Large Language Models (LLMs) traditionally rely on final-layer loss for finetuning and final-layer representations for predictions, potentially overlooking the predictive power embedded in late layers. Interpretability tools such as the logit lens show that late-layer representations already carry largely formed, task-relevant predictions; here we ask whether that observation can be turned into an actionable training signal. We find that focusing tuning effort on these layers can yield losses comparable to those of the final layer, with complementary test-time behaviour. Building on this, we introduce a tuning framework, Mixture-of-Depths Ensemble (MoDE), which treats the late layers as an ensemble that contributes to the final logits through learned routing weights. MoDE can be applied on top of any existing tuning method (e.g., LoRA) and, in our experiments, modestly improves reasoning performance at a small parame

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

DanceOPD: On-Policy Generative Field Distillation

arXiv:2606.27377v1 Announce Type: cross Abstract: Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-de

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

The Geometry of Updates: Fisher Alignment at Vocabulary Scale

arXiv:2606.27242v1 Announce Type: cross Abstract: Training-free source selection for LLM families with shared vocabularies arises in scientific string domains such as SMILES, protein, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction targets. This creates an activation-dark regime: representation-similarity metrics can be uninformative without assumptions about label-conditioned error geometry, while classical update-geometry metrics are computationally prohibitive at vocabulary scale. We show that, in a shared-output head setting, representation metrics (e.g., CKA) are non-identifiable for transfer; models can share identical representations yet have orthogonal head updates. The key identity is that head Fisher alignment is exactly a cosine between kernel mean embeddings in the joint activation-error space, exposing activation, error, and coupling factors rather than requiring a materialized Fisher matrix. FisherSketch estimates this cosine dire

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

arXiv:2606.27226v1 Announce Type: cross Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency be

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

arXiv:2606.27187v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 2) Explanatory rationales are completely absent. Current frameworks measure exclusively whether a model flags a video correctly rather than explaining why, turning evaluation into a black box where models can succeed through superficial shortcuts. To address these problems, we present HarmVideoBench, a multi-layered diagnostic benchmark comprising 1,379 videos paired with 4,137 multiple-choice questions. HarmVideoBench benchmarks three hierarchical dimensions: Observable Evidence, Clip-Internal

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

arXiv:2606.27023v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier style calibration term, an anchor regularizer that prevents confidence collapse toward extreme values, a contrastive image text alignment term, and a KL based model stabilization term. The alignment signal is derived from a $2 \times 2$ factorial perturbation design that crosses image presence with text integrity, probing the reliance of the model on visual modality input versus language priors. Finally, a top K KL divergence regularizer is used to protect the answer

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Einstein World Models

arXiv:2606.26969v1 Announce Type: cross Abstract: Does intelligence require the ability to reason about phenomena beyond direct experience? It is natural to suspect that some complex thought cannot be captured through language alone. However, of particular concern to this work, is whether visualising counterfactual events can complement language as a mechanism for complex thought. We ask whether LLMs can be trained to utilise such visualisation mechanisms, in a way that benefits their reasoning abilities. Motivated by this question, we propose Einstein World Models. EWMs are a blueprint for LLM-based reasoning systems that place visual-temporal rollouts inside the reasoning trace, allowing them to reason in ways that text alone may not support well. In an EWM, the LLM calls a world-module (not to be confused with a world model), to produce short rollouts of scenes under consideration. The returned rollout is treated not as the answer, but as an inspectable hypothesis that can support l

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

arXiv:2606.26936v1 Announce Type: cross Abstract: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit framework. This allows efficient online learning of the optimal jailbreak from a large choice set via noisy exploration on a small number of queries, with subsequent application of the learnt policy on an exploitation set. For the latter, we curate $\mathrm{FrankensteinBench}$, a safety benchmark of $11,279$ malicious queries drawn from manual curation over $7$ existing benchmarks, along with automated enhancement and generation. Each query is categ

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems

arXiv:2606.26859v1 Announce Type: cross Abstract: Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge. We present AgentX, a production-deployed multi-agent system that fundamentally restructures this production function. AgentX operates as a self-evolving development engine: it autonomously generates, implements, evaluates, and learns from recommendation experiments at a scale and pace that no manual workflow can sustain. The system orchestrates four tightly coupled stages in a closed loop. A Brainstorm Agent synthesizes evidence from historical

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

KARLA: Knowledge-base Augmented Retrieval for Language Models

arXiv:2606.26807v1 Announce Type: cross Abstract: We propose a new method that allows an LLM to automatically pull in factual knowledge from a knowledge base during token generation. This means that (1)~factual knowledge in the LLM output can be updated without retraining the LLM, (2)~facts in the LLM output can be traced to the knowledge base for transparency and explainability, and (3)~smaller models can achieve the same factual accuracy as larger models. Our core idea is to train the model to produce special tokens that trigger a query to the knowledge base. Our experiments show that our method improves factual grounding in both short and long-form generation, and allows factual revisions to take effect through KB edits rather than parameter updates.

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing

arXiv:2606.26787v1 Announce Type: cross Abstract: Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross Merchandise Value (GMV), Return on Investment (ROI) and milestone achievement. We propose AIGP, a novel framework that leverages a Large Language Model (LLM) prompted with domain knowledge, structured data and textual context to make interpretable, knowledge-aware pricing decisions. For efficient deployment while maintaining high-quality outputs, we employ supervised fine-tuning for knowledge distillation. Central to AIGP is the Long-Term Value Estimator (LTVE), trained via offline reinforcement learning on historical data, which serves as a reward model to score candidate pricing actions and select preference pairs for Direct Preference Optimization (DPO), thereby aligning the pricing policy with long-term business ob

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Reproducibility Study of "AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models"

arXiv:2606.26783v1 Announce Type: cross Abstract: Fang et al. (2025) introduced a null-space constrained projection, named AlphaEdit, for locate-then-edit knowledge editing methods, theoretically guaranteeing that edits do not disrupt previously preserved knowledge, and reports substantial gains over existing editing methods on LLaMA3, GPT2-XL, and GPT-J. In this work, we present a reproducibility study of AlphaEdit, reproducing its reported results under the original experimental setup and extending the evaluation along three axes: new model architectures, additional downstream benchmarks, and substantially longer sequential editing horizons. We successfully reproduce AlphaEdit's reported metrics across the original models, though we identify a discrepancy in the reported fluency and consistency metric. Extending AlphaEdit to newer model families, we find that its advantage does not generalize uniformly, which we trace to architectural assumptions in the locate-then-edit paradigm that

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Structure Before Collapse: Transient semantic geometry in next-token prediction

arXiv:2606.26749v1 Announce Type: cross Abstract: Neural Collapse predicts that balanced one-hot classification pushes model representations to be equally far from each other; a symmetric configuration that depends only on the output label and ignores any semantic similarity in the inputs. This creates a puzzle: next-token prediction language models are trained predominantly (as context length increases) with one-hot labels: the same context is very unlikely to appear twice in training with different labels. However, they clearly learn latent structural features. That is, despite the one-hot training regime, a language model's contextual embeddings represent the fact that the next word in ''Mary broke the ___'' is likely to be filled by tokens in the latent classes of a) medium-sized, b) rigid, c) inanimate nouns. How does gradient descent find such categorical semantic structure when co-occurrence statistics collapse to one-hot sparsity, eliminating any shared next-tokens among differ

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

HyperDFlash: MHC-Aligned Block Speculative Decoding with Gated Residual Reduction

arXiv:2606.26744v1 Announce Type: cross Abstract: We present HyperDFlash, a block-parallel speculative decoding framework tailored to the novel multi-hyper-connection (MHC) architecture proposed by DeepSeek-V4. Despite the strong initial-token drafting performance of the native Multi-Token Prediction (MTP) module in DeepSeek-V4, its draft accuracy degrades sharply at later positions, as error accumulation from unverified intermediate tokens harms acceptance rates. Although the original DFlash method supports efficient one-pass block drafting, it cannot be seamlessly adapted to the MHC paradigm, since the multi-path residual stream of DeepSeek-V4 induces feature misalignment with conventional drafting designs. To resolve this mismatch, we propose two model-aligned optimizations for MHC residual streams. First, we adopt pre-collapse residual states as the exclusive conditioning signal, preserving multi-path structural information and aligning the drafter with the native prediction pathwa

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

arXiv:2606.26686v1 Announce Type: cross Abstract: In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied robot. In this paper, we pose a question whether a safety guardrail really needs to reason. To answer this question, we train a lightweight bidirectional encoder and a reasoning guard on the same corpus, and we then remove only the reasoning while we keep everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M label-only encoder reac

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning

arXiv:2606.26629v1 Announce Type: cross Abstract: Weight-space regularization methods such as Elastic Weight Consolidation (EWC) are the standard approach to catastrophic forgetting in continual learning. However, those methods tend to underperform when applied to large language models. We argue that such underperformance can be partly explained by the ``polysemantic'' nature of large language models: per-weight importance estimates utilized by EWC-style regularization are too coarse and cannot isolate the knowledge that needs protection. In this paper, we propose regularizing instead in the model's activation space, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary. From the perspective of constrained optimization, we derive a new loss function that uses the SAE feature dictionary to explicitly balance stability and plasticity, and show that EWC is a special case in the one-sided weight-space penalty setting. Unlike replay-based methods that store or rev

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against vision-language models, and diffusion-based input purification defenses. Each has developed its own vocabulary, threat models, and benchmarks, with denoising diffusion models emerging as a shared generative mechanism whose recipes are now actively ported between communities. This survey performs an information-fusion exercise at the meta-research level: we integrate these four tracks into a single conceptual framework with a unified taxonomy, evaluation criteria, and research agenda, focusing on the LLM-side slice. We catalog fifty published papers across four scope areas (text/LLM, image classifier, vision-language model, defense), plus four diffusion-LLM-as-victim entries and ten non-diffusion baselines agains

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Compiler-Driven Approximation Tuning for Hyperdimensional Computing

arXiv:2606.26547v1 Announce Type: cross Abstract: As Moore's law reaches its physical and economic limits, domain-specific approaches are increasingly employed to accelerate machine learning workloads. Hyperdimensional Computing (HDC) represents one such emerging paradigm, offering an alternative to conventional deep learning techniques. Rooted in cognitive models of computation, HDC is designed bottom-up with hardware efficiency as a first-class objective. HDC workloads map naturally to heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs, as well as emerging in-memory computing technologies such as Resistive RAM (ReRAM) and Phase-Change Memory (PCM). HDC algorithms are intrinsically tolerant to noise and approximation, enabling substantial performance gains with minimal accuracy loss. In this work, we introduce ApproxHDC, a framework for automated identification and application of domain-specific approximations in HDC workloads. ApproxHDC extends the HPVM-HDC compiler in

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation

arXiv:2606.26502v1 Announce Type: cross Abstract: Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same problem right; humans do the reverse, spending less time on the trials they get wrong. We separate two levels of deliberation: how response time tracks difficulty across items (registration), and, with item identity held fixed, whether an agent spends more on its own failures or successes (allocation). On a public matched human-LRM corpus, humans and all five thinking LRMs reproduce the known cross-item alignment (registration) but diverge within items (allocation): every LRM shows a large wrong-vs-right effect (Cohen's d = 1.47-3.13 on H-ARC) while humans show the opposite sign. The comparison stays inside each agent's own scale; we never put seconds and tokens on one axis. The dissociation holds under item fixe

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents

arXiv:2606.26479v1 Announce Type: cross Abstract: Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model with a deterministic policy that mediates the agent's actions. Systems such as CaMeL, FIDES, Progent, RTBAS, and FORGE realize this with capabilities, information-flow labels, and reference monitors, and several report near-elimination of attacks on the AgentDojo benchmark. We make two contributions. First, we organize these out-of-band defenses as instances of classical integrity protection (Biba), reference monitoring, and least privilege, yielding a structured comparison of what they do and do not cover. Second, we warn that every one of them is validated only on static benchmarks (a fixed set of injection attempts), the same methodology that made in-band defenses look strong until adaptive, defense-aware attack

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

arXiv:2606.26472v1 Announce Type: cross Abstract: As reasoning models emit chains of thought tens of thousands of tokens long, KV cache increasingly becomes a deployment bottleneck. Existing cache eviction methods rank tokens by attention weight, which is a noisy importance proxy in long reasoning traces, and prohibits the use of fused kernels in production inference by forcing the model to materialize the attention matrix. In this work, we instead score tokens with a metric we term the epiphany score: the change in the model's internal representation, read directly from the forward pass with no attention matrix and negligible extra state. Our resulting cache eviction method, EpiKV, requires no training, classifier, or custom kernel, and can be used directly in FlashAttention inference stacks unchanged -- scaling to a 16x longer feasible context than attention-based scoring. upper-mid layers negatively) and remove a positional trend with a causal rolling z-score. At a 4096-token cache

Source ↗

technology Fri, 26 Jun 2026 00:00:00 -0400

arXiv cs.CL

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

arXiv:2606.26429v1 Announce Type: cross Abstract: Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce DualEval, a latent model-item calibration framework that represents models and evaluation items in a shared space, jointly estimating model ability together with item difficulty and sharpness. We apply DualEval across four domains: coding, math, miscellaneous domain-knowledge tasks, and generic everyday user queries. Our evaluation uses 18 frontier LLMs, static benchmark labels, and reward-model scores validated against held-out human preferences for open-ended model responses. Empirically, our framework produces reliable and balanced model rankings, and its learned item-level profiles support downstream applications such as benchmark compression for sample-efficient evaluation and anomaly detection for contami

Source ↗

Showing 901–950 of 1788 signals

← Prev Page 19 of 36 Next →