EdTech Discovery
Hermes

An instrument for spotting the next edtech opportunity — generated ideas, each traced to the real-world signals behind it.

Updated Jun 24, 2026 · 10 ideas · 1304 signals

Signals

The evidence library — the raw signals the pipeline is watching across the education ecosystem. Every idea is built from these.

technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

A Hybrid, Multi-Layered Pipeline for Phishing and Threat Classification: Independently Validated URL and NLP Engines with a Calibrated Multi-Channel Fusion Stage

arXiv:2606.21690v2 Announce Type: replace-cross Abstract: Phishing is a multi-modal threat. We present a hybrid pipeline that scores each modality with its own engine and fuses the results. Three engines are built, deployed, and independently benchmarked: a four-stage URL stack (Domain Guard, lexical model, threat intelligence, and an asymmetric L2 fusion sidecar); a generalization-hardened DistilBERT NLP classifier whose held-out real-phishing recall rises from 0.8% to 87.3%; and a threat-intelligence synchronizer with end-to-end OpenTelemetry instrumentation confirming 1:1 message conservation. A decision-level fusion stage, characterized on a 10,677-email whole-system benchmark, reaches F1=0.914 with a calibrated probabilistic-OR over URL, header, and phishing-probability channels while reducing held-out real-spam false positives to 3.6%. Because that benchmark uses proxy URL and header channels and an operating point still needing recalibration, we present it as a preliminary integ

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Sexualised synthetic personas encode and amplify gendered power asymmetries through voice

arXiv:2606.21366v2 Announce Type: replace-cross Abstract: This work examines sexualised AI-generated English-speaking voices offered by a popular commercial platform. New technologies may enable sexual empowerment and greater diversity in gender expression, yet toxic masculinity, heteronormativity, and the abuse of women and LGBTQ+ people remain pervasive online. Drawing on a Feminist HCI perspective, we examine how commercial voice AI systems reproduce and circulate particular performances of gender. We conducted a listening experiment with a diverse group of listeners, combining quantitative adjective selection, qualitative free-text responses, and acoustic analysis. Participants evaluated male- and female-coded voices presented with either sexualised scripts or neutral text. Results reveal a narrow range of gender expression, largely binary and heteronormative. Female-coded voices are more frequently described using sexualised and submissive terms, while male-coded voices are more o

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

arXiv:2606.18941v3 Announce Type: replace-cross Abstract: PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD progra

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

arXiv:2606.17698v2 Announce Type: replace-cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every a

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Leveraging Social Media Data for COVID-19 Studies

arXiv:2606.10459v2 Announce Type: replace-cross Abstract: Nowadays, social media networks have become widely preferred sources of information. Especially during the time of the Coronavirus disease 2019 COVID 19 pandemic, social media has been one of the most used platforms to get the latest news and information related to COVID 19. Social media are popular because they offer free access to their registered users and allow them to do posting, disseminate information, and respond to others postings. With almost 4.6 billion social media users worldwide, it is not surprising the significant amount of information shared through these platforms could affect how people perceive and cope with the pandemic that we are facing right now. With decent use, social media can be a beneficial digital tool to spread reliable news and public awareness for patients, clinicians, and society. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. Thu

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking

arXiv:2605.00348v2 Announce Type: replace-cross Abstract: Recent multi-bit watermarking methods for large language models (LLMs) prioritize capacity over reliability, often conflating decoding with detection. Our analysis reveals that existing ECC-based extractors suffer from catastrophic false positive rates (FPR), and applying rejection thresholds merely collapses detection sensitivity (TPR) to random guessing. To resolve this structural limitation, we propose BREW (Block-wise Reliable Embedding for Watermarking), a framework shifting the paradigm to designated verification. BREW employs a two-stage mechanism: (i) blind message estimation via independent block voting, followed by (ii) window-shifting verification that rigorously validates the payload against local edits. Experiments demonstrate that BREW achieves a TPR of 0.965 with an FPR of 0.02 under 10% synonym substitution, demonstrating that the high-FPR issue is not an inherent trade-off of multi-bit watermarking, but a solvab

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production

arXiv:2603.04840v2 Announce Type: replace-cross Abstract: Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances. The sourc

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

The $\mathbf{P}$-Completeness of Inverted Index Traversal: On the Complexity of Evaluating Boolean Query DAGs

arXiv:2601.18747v2 Announce Type: replace-cross Abstract: Modern AI agents increasingly rely on search infrastructure to execute complex, neuro-symbolic reasoning workflows. These workflows often compile into deeply nested, non-monotonic Boolean queries over text fields. However, standard query evaluation strategies over inverted indices face severe theoretical limits when handling these structures. Stateful iterator models (Document-at-a-Time) are structurally bounded by $\text{NC}^1$ formula evaluation, suffering a worst-case $O(2^{|Q|})$ exponential blowup in query complexity when unrolling re-convergent logic. Conversely, recursive materialization models (Term-at-a-Time) incur an $\Omega(|U|)$ space complexity penalty (the Universal Scan) when evaluating logical negation over the document universe. In this paper, we establish the theoretical boundaries of executing complex logic natively over an inverted index. We formalize a retrieval language ($\mathcal{L}_R$) based on Directed A

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Bilevel Data Curation for LLM Fine-tuning: Offline Selection and Online Self-Refining Generation

arXiv:2511.21056v2 Announce Type: replace-cross Abstract: Supervised fine-tuning (SFT) datasets are critical to the downstream performance of large language models, yet they often contain low-quality or harmful question-response pairs. To improve SFT data quality, we develop a unified bilevel framework that combines offline data selection with the online self-refining generation. In the offline setting, bilevel data selection (BDS) selects question-response pairs from the offline SFT dataset to maximize the validation performance. We theoretically show that the optimal model given by BDS outperforms direct data mixing approach in useful data coverage. Moreover, we provide a global convergence analysis for gradient-based BDS approach for one-layer Transformer, showing that the epsilon-global optimum of offline BDS is achievable in finite time. Although efficient, offline BDS discards potentially harmful questions together with responses, thereby reducing question diversity. We address t

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Removing Noise, not Finding Gold: Quality Filtering for Large-Scale Pretraining

arXiv:2510.00866v3 Announce Type: replace-cross Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF ca

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

FALCON: Transforming Cyber Threat Intelligence into Deployable IDS Rules with Self-Reflection

arXiv:2508.18684v2 Announce Type: replace-cross Abstract: Signature-based Intrusion Detection Systems (IDS) detect malicious activity by matching network or host events against predefined rules. Security analysts manually develop these rules from Cyber Threat Intelligence (CTI). As threats evolve, this manual pipeline faces two bottlenecks. Before authoring a new rule, an analyst must reconcile the incoming CTI with the existing rule base and determine whether to create, update, or retire one. This process is challenging due to the representational differences between the CTI and Rule formats. This gap limits the effectiveness of keyword- and embedding-based search, making rule reconciliation cognitively demanding and, in turn, contributing to "rule bloat". Second, automated verification of a new rule is inherently difficult as zero-day threats lack ground truth from simulated testing. Hence, standard metrics cannot prove that a rule semantically adheres to the CTI, and the use of LLMs

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training

arXiv:2507.01752v4 Announce Type: replace-cross Abstract: Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, exposing gradients during training can leak sensitive information about the underlying data, raising privacy and security concerns such as susceptibility to data poisoning attacks. In contrast, black-box optimization methods, which treat the model as an opaque function, relying solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of information flow, we provide non-vacuous generalization bounds and strong theoretical guarantees for robustness to data poisoning attacks an

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

LangMAP: A Language-Adaptive Approach to Tokenization

arXiv:2606.23566v2 Announce Type: replace Abstract: Language-specific tokenizers improve tokenization quality and the downstream performance of models on those languages. However, using such a tokenizer comes at a cost: either a new model must be trained from scratch, or the vocabulary of an existing pretrained model must be adapted. We propose Language-adaptive Maximum a Posteriori (LangMAP) Tokenization, a tokenization scheme that extends the UnigramLM algorithm to the multilingual setting, producing language-specific tokenization from a single shared vocabulary. Notably, LangMAP can be used when training a multilingual language model from scratch or to adapt a pretrained model's tokenizer to individual languages without changing its vocabulary. While language labels are required at training time, a key feature of the algorithm is that it then performs language-specific tokenization at inference without knowledge of the input's language. Across 14 open-source tokenizers, 9 natural la

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge: A Controlled Study on Clinical Question Answering

arXiv:2606.22419v2 Announce Type: replace Abstract: A recent Nature Medicine study reports that general-purpose frontier LLMs outperform specialized retrieval-augmented clinical tools on medical benchmarks, and that retrieval can hurt strong models. We ask the natural follow-up: does structured knowledge-graph (KG) grounding change this, and when does grounding help at all? We contribute two results. First, a reproduction: the study's headline HealthBench score (~88) is the Consensus variant, not full HealthBench, where frontier models and ideal completions both score ~46-47 under a physician-calibrated grader (agreement 82.5%); we reproduce GPT-5.2 Consensus =90.9 and flag a score-deflating grader bug. Second, a knowledge-boundary result. Using a graph+vector engine (samyama-graph) over the public biomedical KG PrimeKG, neither naive triple retrieval nor an agentic natural-language-to-Cypher loop (82% successful queries) improves MedQA across a weak-to-strong model ladder (all |Delta|

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Precision Recall Controllable Radiology Report Generation via Hybrid Natural Language and Clinical Reward Learning

arXiv:2606.21447v2 Announce Type: replace Abstract: Automated radiology report generation (RRG) has gained increasing attention because it can reduce the heavy workload of clinical report writing. However, most existing methods mainly optimize for natural language generation (NLG) metrics that focus on language fluency, while providing little control over clinically important factors such as precision and recall. As consequence, generated reports may be fluent but not well aligned with different clinical needs. To address this challenge, we propose a reinforcement learning framework for precision recall controllable RRG, where a control parameter explicitly adjusts the trade-off between clinical precision and recall during inference. This design allows the model to flexibly generate reports according to different clinical requirements. To ensure clinical correctness, we introduce a clinical reward into the training objective, which helps improve clinical efficacy (CE) beyond standard l

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

arXiv:2606.19910v2 Announce Type: replace Abstract: Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit--DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

arXiv:2606.16845v2 Announce Type: replace Abstract: Small-scale Large Language Models (LLMs) natively default to literal semantic interpretations, making few-shot irony detection a persistent challenge in noisy, user-generated text. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set ($N=734$), RDS achieves $78.1\%$ accuracy and a Macro F1 of $0.777$, matching the absolute performance ceiling of a fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters $22.5\%$ of out-of-distribution hallucinations, yielding a few-shot Macro F1 of $0.6726$ and Ironic F1 of $0.4821$, outperforming multiple heavily supervised SemEval transformer ensembles. Statistical ablation confirms this structural synergy: while adding the symbolic prior to the neural baseline yields

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

arXiv:2606.13189v2 Announce Type: replace Abstract: Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target--text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval,

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Teaching Diffusion to Speculate Left-to-Right

arXiv:2606.11552v2 Announce Type: replace Abstract: Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

arXiv:2606.07560v2 Announce Type: replace Abstract: Function-vector (FV) heads are identified by the magnitude of their causal contribution to in-context rule tasks, and the resulting top set is treated as a single functional class. We show this hides a sign structure. Under a sign-preserving criterion (refined direct logit attribution, validated head by head with path patching) the FV population splits into two opposing groups: writers push the rule-correct logit up, cancellers push it down, and ablating both together moves the readout less than the sum of the two. The split is causal and reproducible. It holds in all but two of the fifteen (model, task) cells we test, spanning three architectures and six Pythia scales, and a sign-shuffle null rejects the single-class account in all but one of the six main cells. It is also invisible to magnitude-only ranking, which surfaces whichever group locally dominates and misses the other, so any function vector or ablation built that way silen

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override

arXiv:2606.07555v3 Announce Type: replace Abstract: Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the local rule does not overwrite the old meaning; the pretrained prior keeps operating underneath, and its strength still shows through. We test this with a Stroop-style paradigm: a remapping rule (doctor means forest) pitted against the query word's lexical-prior distractor (hospital), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five models then locates where the override is repaired internally. Restoring three source positions that carry the redefinition (the definition subject, its new target, and the query word) almost fully recovers the effect (aggregate $R \in [0.92

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

arXiv:2606.04883v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and the cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by 28.9% over a fixed-step baseline on

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

arXiv:2606.03867v2 Announce Type: replace Abstract: Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Speculative Pipeline Decoding: Higher-Accruacy and Zero-Bubble Speculation via Pipeline Parallelism

arXiv:2605.30852v2 Announce Type: replace Abstract: Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows LLM to process $n$ tokens within single sequence in parallel to accelerate decoding. To continuous fill the pipeline in single sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step, to realize bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves significantl

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

arXiv:2605.13076v2 Announce Type: replace Abstract: The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with adva

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

arXiv:2604.21481v2 Announce Type: replace Abstract: Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text to Speech(TTS) introduces high variance due to linguistic diversity and multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and analyze leaderboard reliability alongside model st

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

arXiv:2604.19678v2 Announce Type: replace Abstract: Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English$\to$X direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. We further find that the highest-gain tokens span multiple languages and that translation FVs across directions share most of their top-ranked heads, indicating that the FV encodes a largely language-agnostic translation signal rather than a language-pair-specific mapping.

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

arXiv:2604.08558v2 Announce Type: replace Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and leng

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

arXiv:2603.16131v2 Announce Type: replace Abstract: The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 6

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement

arXiv:2603.03742v2 Announce Type: replace Abstract: Despite the remarkable performance of large language models (LLMs) in text-to-SQL (SQL generation), correctly producing SQL queries remains challenging during initial generation. The SQL refinement task is subsequently introduced to correct syntactic and semantic errors in generated SQL queries. However, existing paradigms face two major limitations: (i) self-debugging becomes increasingly ineffective as modern LLMs rarely produce explicit execution errors that can trigger debugging signals; (ii) self-correction exhibits low detection precision due to the lack of explicit error modeling grounded in the question and schema, and suffers from severe hallucination that frequently corrupts correct SQLs. In this paper, we propose ErrorLLM, a framework that explicitly models text-to-SQL Errors within a dedicated LLM for text-to-SQL refinement. Specifically, we represent the user question and database schema as structural features, employ sta

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

arXiv:2601.22548v4 Announce Type: replace Abstract: Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which behaviors are explained by narcissism versus experimental confounds. Specifically, LLM evaluators may deliver self-preferring verdicts when comparing responses to questions they fail on; these verdicts may not depend on the identity of the author, but on evaluator quality. We correct this by directly comparing the judge's voting distribution in cases where it evaluates itself versus another model. This evaluator quality baseline reveals that only 51% of examples in previous findings retain statistical significance against this null hypothesis, covering 89.6% of total self-preference probability mass. Finally, we compare the entropy of voting distributions, suggesting uncertainty-driven overlap, and show tha

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

arXiv:2601.11957v4 Announce Type: replace Abstract: Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating this decision process is crucial yet challenging. Scheduling logistics can drain hours, and human delegation often fails at scale, which motivates us to ask: Can we trust large language models (LLMs) or language agents to manage time? To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. In CalConflictBench, conflicts are presented to agents round-by-round over a calendar year, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has an average error rate of 35%. To address this gap, we propose PEARL, a reinforce

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

arXiv:2601.06395v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present \texttt{AfriqueLLM}, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and eval

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Selective Rotary Position Embedding

arXiv:2511.17388v3 Announce Type: replace Abstract: Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (\textit{RoPE}) encode positions through \textit{fixed-angle} rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce \textit{Selective RoPE}, an \textit{input-dependent} rotary embedding mechanism, that generalizes \textit{RoPE}, and enables rotation in \textit{arbitrary angles} for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

arXiv:2511.07397v2 Announce Type: replace Abstract: Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller, real-time models meet the latency bar but cannot match foundation models on complex tasks, leaving current voice agents to trade away either responsiveness or capability. We introduce conversational infill, where a small talker model both immediately generates contextually grounded responses to hide the latency of an external reasoner model and fluently integrates streamed reasoner knowledge into its responses during inference. We curate a 290,571-example synthetic dataset spanning six domains and demonstrate that this task is learnable across seven widely used small language models ranging from 135M to 1.7B parameters. Our system implementation, ConvFill, sustains millisecond-level time-to-first-response while

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

arXiv:2510.24295v2 Announce Type: replace Abstract: As many benchmarks have become saturated, it has become increasingly important to create new datasets that evaluate the generalization capacity of current state-of-the-art models in reasoning. However, designing high-quality reasoning datasets is challenging, as their manual construction is costly, and their automatic generation is unreliable, often leading to synthetic data with limited scope. In this paper, we propose the Minimal Expression-Replacement GEneralization (MERGE) test that evaluates the robustness of reasoning models against non-adversarial variants of existing evaluation datasets. We automatically obtain high-quality variants from the original instances with Minimal Expression REplacement (MERE) generation, which uses Masked Language Models (MLMs) and safeguarding filters. We apply the MERGE test to Natural Language Inference (NLI), a popular task of reasoning. We generate new NLI datasets from two widely used existing

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

arXiv:2509.03647v2 Announce Type: replace Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that distinguishes self-preference bias into justified examples of self-preference and unjustified examples of self-preference, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97\%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitim

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

arXiv:2508.10971v2 Announce Type: replace Abstract: Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models' performance, we conduct a human evaluation of generated explanations on correctn

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

arXiv:2507.13563v3 Announce Type: replace Abstract: We introduce Balalaika, an open-source, data-centric pipeline for processing audio and producing prosody-aware annotations. It combines semantic VAD for context-preserving segmentation, multi-ASR ensembling with ROVER consensus decoding, while retaining optional word-level timestamps, followed by automatic quality and speaker-purity filtering. The text is further enriched with punctuation restoration, lexical stress and "\textipa{e}/\textipa{\H{e}}" normalization, and IPA phonemes. Using Balalaika, we build a 5.1k-hour multi-source Russian corpus with rich annotations, and show consistent gains under equalized training budgets for both speech denoising and TTS; ablations confirm complementary benefits of stress and punctuation and improved synthesis with stricter MOS filtering. The datasets are publicly available at \href{https://huggingface.co/collections/lab260/balalaika-dataset}{\underline{\textbf{HuggingFace}}}

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

arXiv:2506.00869v3 Announce Type: replace Abstract: Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs' causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further analysis suggests that this limitation stems from

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

arXiv:2505.18542v4 Announce Type: replace Abstract: Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional texts, such as recipes, it has insufficiently addressed the complex logical structures, including conditional branching and parallel execution, that are pervasive in real-world regulatory and administrative documents. Furthermore, existing benchmarks are limited by simplistic schemas and shallow logical dependencies, restricting progress toward logic-aware large language models.To bridge this Logic Gap, we introduce BREX, a carefully curated benchmark comprising 409 real-world business documents and 2,855 expert-annotated rules. Unlike prior datasets centered on narrow service scenarios, BREX spans over 30 vertical domains, covering scientific, industrial, administrative, and financial regulations. We further prop

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

arXiv:2504.17768v3 Announce Type: replace Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the largest-scale empirical analysis to date of training-free sparse attention, evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 (i.e., $1/20$ attention budget) on nine diverse tasks. We first organise the rapidly evolving landscape of sparse attention methods into a taxonomy along four design axes. Our analysis then yields actionable insights: 1) sparse attention is effective: larger sparse models outperform smaller dense ones at equivalent cost, improving the Pareto frontier; 2) for the training-free methods we study, fine-grained per-query importance estimation during prefilling remains impractical-due to both the cost of estimation an

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

arXiv:2503.13505v3 Announce Type: replace Abstract: Generative Pretrained Transformers (GPTs) are foundational Large Language Models (LLMs) for text generation. However, individual LLMs often produce inconsistent outputs and exhibit biases, limiting their representation of diverse language patterns. The closed-source nature of many powerful LLMs further restricts industry applications due to data privacy concerns. Inspired by successes in text generation, LLM ensemble techniques are now increasingly explored for code generation. This article reviews these emerging ensemble approaches to enhance understanding, encourage further research, and promote practical implementation in both text and code generation. We categorize LLM ensembles into seven main methods - weight merging, knowledge fusion, mixture-of-experts, reward ensemble, output ensemble, routing, and cascading - analyzing capabilities of those approaches. Our findings highlight key benefits such as improved diversity representa

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

arXiv:2501.11790v5 Announce Type: replace Abstract: Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an eva

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering

arXiv:2403.04890v4 Announce Type: replace Abstract: In this paper, we propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN, which contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers. Additionally, we implement a prompt driven by Chain of Thought (CoT) reasoning, CLINICR, to mirror the prospective process of incremental reasoning, reaching a correct response to medical questions. We empirically demonstrate how CLINICR outperforms the state-of-the-art 5-shot CoT-based prompt (Li\'evin et al., 2022). We also present an approach that mirrors real-life clinical practice by first exploring multiple differential diagnoses through MCQ-CLINICR and subsequently narrowing down to a final diagnosis using MCQ-ELIMINATIVE. Finally, emphasizing the importance of response verification in medical settings, we utilize a reward model mechanism, replacing the elimination process performed by MCQ-ELIMINATIVE.

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models

arXiv:2606.24841v1 Announce Type: cross Abstract: Prompt-based learning has emerged as a dominant paradigm in natural language processing. This study explores the impact of diverse pre-training objectives on the performance of encoder-decoder pre-trained language models across generation and question answering tasks, with a focus on commonsense knowledge retrieval and completion. We highlight the benefits of incorporating multiple objectives during both pre-training and fine-tuning stages. We introduce the Match Task to Objective (MTO) framework and methods for determining the appropriate objective for a given task. This framework offers automated methods to prepare task-related data for adaptation through unsupervised training, based on the identified objective. In the fine-tuning stage, we design novel templates that align with the objectives of the pre-training and adaptation stages. When aligned with task requirements, these strategies can achieve a performance gain of over 120\% c

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPairAudioBench, a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions: Style, Rate, Emphasis, Age, and Gender. Our experiments show that current LALM judges still lag behind human judgments by 32%p on average and exhibit severe calibration failures, particularly in Tie cases where the correct decision is to abstain. To further analyze lexical versus acoustic reliance, the benchmark includes both same-transcript and cross-transcript conditions. ParaPairAudioBench enables multi-dimensional, calibration-aware assessment of the reliability of LALM-as-a-Judge for paralinguistic speech evaluation.

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreem

Source ↗
technology Wed, 24 Jun 2026 00:00:00 -0400
arXiv cs.CL

A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

arXiv:2606.24510v1 Announce Type: cross Abstract: Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for rare disease diagnosis. RaDaR was trained with 49,170 publicly available free-text cases and 104,666 synthetic cases with reasoning-enhanced training. RaDaR showed the strongest performance among evaluated open-source models, including the 671B DeepSeek-R1, across public benchmarks and four external validation centers. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06 percent of cases, co

Source ↗
Showing 1–50 of 261 signals
← Prev Page 1 of 6 Next →