Named after the Greek god of messengers, Hermes watches the education landscape: spotting new opportunities, pressure-testing the ventures we're building, and tracing every read back to the real-world signals behind it.
The evidence library: the raw signals the pipeline is watching across the education ecosystem. Every idea is built from these.
arXiv:2606.26487v1 Announce Type: new Abstract: Large language models (LLMs) are attractive for context-aware time series forecasting because they can integrate heterogeneous textual signals, yet their discrete, language-oriented tokenization and embedding interfaces are misaligned with continuous numerical values, often harming numerical ordering and forecasting reliability. We propose TempoWave, a plug-and-play temporal wavelet digit interface that maps each scalar observation into digit-wise embeddings constructed from multi-wavelet, multi-scale coefficients. By directly overriding standard token representations, TempoWave seamlessly exposes both fine-grained local fluctuations and macro global structures in a transformer-compatible form, ensuring that precise numerical formatting, distinct digit identity, and robustness to common normalization operations are maintained throughout the LLM pipeline. Experiments across five context-enriched forecasting benchmarks demonstrate that Temp
arXiv:2606.26481v1 Announce Type: new Abstract: Billions of scientific papers lead to the need to identify essential parts from the massive text. Scientific research is an activity from putting forward problems to using methods. To learn the main idea from scientific papers, we focus on extracting problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific forms, which in turn reduces their generalization capabilities. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters to generate synthetic data
arXiv:2606.26466v1 Announce Type: new Abstract: Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively language-agnostic, but generation becomes increasingly language-specific as models commit to discrete output tokens. This is problematic because language-specific lexical choices can cause semantically equivalent reasoning paths to diverge across languages. These divergences motivate searching for a cross-lingual alignment signal that is less tied to any single vocabulary item or script. We propose SOLAR, an auxiliary objective for supervised fine-tuning that aligns soft-token representations across languages, using English as a pivot. Soft tokens are probability-weighted mixtures over the vocabulary embeddings, yielding continuous representations that can aggregate information from semantically related tokens across languages
arXiv:2606.26452v1 Announce Type: new Abstract: To minimize privacy concerns and inference latency on edge devices like smartphones, lightweight on-device models remain important for end-user applications. Many of these applications involve natural language classification, but deploying multiple specialized models creates a memory footprint challenge. We investigate: Can a single lightweight architecture solve multiple Speech-Adjacent (SA) classification tasks through reduction to a nuanced text similarity formulation? We propose AnySimLite, a lightweight similarity encoder that combines word-level and character-level channels. Together with a dataset transformation strategy, we evaluate AnySimLite across multiple SA classification tasks and show that it consistently achieves state-of-the-art (SOTA) or SOTA-competitive performance in few-shot settings while maintaining a low memory footprint. Even in the worst case, the performance drop remains below 7% while using $<\frac{1}{250}^{\ma
arXiv:2606.26449v1 Announce Type: new Abstract: Retrieval-augmented systems routinely present citations alongside generated answers, yet a citation does not confirm that the corresponding source meaningfully shaped the output. This paper introduces ProvenAI, a framework that decomposes transparency in multi-hop question answering into three independently measurable layers: answer correctness, citation fidelity against benchmark supporting evidence, and per-document influence under leave-one-resource-out intervention. Targeting the HotpotQA distractor benchmark through a seven-stage pipeline covering data normalisation, retrieval indexing, citation-aware answer generation, attribution auditing, ablation-based influence estimation, batch evaluation, and interactive inspection, ProvenAI evaluates 7,405 validation examples drawn from a canonical corpus of 509,300 passages. The system achieves 53.53% answer accuracy alongside a mean citation-fidelity score of 71.55%, and a worked example su
arXiv:2606.26437v1 Announce Type: new Abstract: Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model's response acknowledges conflicting evidence in its grounding documents. Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric. Experiments show that ConflictScore effectively detects overconfident claims ac
arXiv:2606.26403v1 Announce Type: new Abstract: Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or redistribute responsibly, while independently generated fake fields rarely preserve the cross-field and temporal consistency needed for controlled evaluation. We present PROFILEFOUNDRY, a deterministic generator and fixed reference release of 100,000 adult synthetic Person Objects across eight locales. Each object combines a typed current snapshot, household, family, and employer links, snapshot-aligned events, normalized relational views, and generation provenance. The release contains 709,228 events, 40,338 households, 52,491 employers, and 518,564 directed relationship edges. We report evidence in separate categories: selected population-marginal comparisons, per-object invariant checks, release-wide referential a
arXiv:2606.26360v1 Announce Type: new Abstract: The neutral, or floating, tone of Mandarin Chinese is a tone with an enigmatic set of properties. It has been described as a reduced tone, or as a tone that sometimes is lexically fixed but that can also be toneless. In two-syllable words, it is found only on the second syllable, but single-syllable words can also have the neutral tone. We present a corpus-based study of the phonetic realization of the neutral tone in spontaneous conversational speech corpora of Beijing Mandarin and Taiwan Mandarin. We show that the neutral tone has its own tonal target, just as the four lexical tones of Mandarin. We also show that disyllabic words with a neutral tone have pitch contours that have a pitch component that depends on the tone on the first syllable, just as has been observed for two-syllable words with a lexical tone on the second syllable (Chuang et al., 2026). Furthermore, words with a floating tone have word-specific pitch signatures, whic
arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective -- one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we (1) formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, (2) introduce a five-st
arXiv:2606.26130v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used to guide research methodology, yet their default methodological tendencies under minimal prompting remain unclear. Here, we prompt GPT-5.1, Gemini 3 Pro, and DeepSeek-V3.2 with an LLM-extracted research question from each of 1,000 recent arXiv computer-science papers and compare the resulting methodology suggestions against a paper-derived experimental inventory. Since we provide only the research question, the differences we measure reflect initial suggestions and not how optimal those suggestions are. We extract structured method features from both sources, map them into a shared taxonomy, and quantify divergence across multiple taxonomy dimensions including model provider, dataset task type, and evaluation metric type. The strongest imbalance appears in provider choice, with Jensen-Shannon divergence about 3-5x larger than any other taxonomy dimension. Other/Academic single-occurrence
arXiv:2606.26120v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks due to their bidirectional attention mechanisms. However, their computational complexity scales on the order of L cubed with the sequence length L. This poses significant challenges for long-sequence and real-time applications, primarily due to the lack of compatibility with key-value caching and the non-autoregressive nature of denoising steps. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose Dynamic-dLLM, a training-free framework that enhances dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically c
arXiv:2606.26112v1 Announce Type: new Abstract: Low-resource languages face a critical challenge in AI development: creating specialized conversational systems without access to massive training corpora. We present a systematic methodology for transforming structured linguistic resources into specialized AI systems, demonstrating that expert-curated lexical databases can serve as effective foundations for conversational AI development. Our approach converts Hindi WordNet into 1.25 million diverse instruction-response pairs and fine-tunes a 12B-parameter language model using resource-efficient LoRA with 4-bit quantization. Evaluation through a Hindi language learning chatbot demonstrates that structured-knowledge-based systems achieve superior pedagogical effectiveness (91.0 vs. 79.4-83.6 for general-purpose models) while maintaining competitive semantic performance and exceptional consistency. The complete pipeline demonstrates a proof-of-concept methodology using Hindi for developing spe
arXiv:2606.26108v1 Announce Type: new Abstract: Larger language models consistently outperform smaller ones on reasoning benchmarks, yet the reasoning differences underlying this gap remain underexplored. Across benchmarks in mathematics, physics, chemistry, and programming, we observe stable performance gaps: averaged over datasets, Qwen3-32B outperforms Qwen3-8B by 6.43%, while GPT-OSS-120B exceeds GPT-OSS-20B by 7.38%. To study the reasoning differences behind these gains, we develop AdvCluster, an automated framework that identifies questions where the larger model shows a stable advantage, extracts fine-grained advantage descriptions from paired reasoning traces produced by larger and smaller models, and organizes them through semantic clustering with quantitative evaluation and selection guided by a reviewer model. Our analysis yields a systematic taxonomy of larger model reasoning advantages, spanning both common advantages that recur across domains and specialized advantages as
arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a proof-of-concept multimodal framework that demonstrates the feasibility of generating emotion-conditioned Nepali Sign Language avatars from spoken input. As a preliminary investigation, we focus on four common Nepali words ("thank you", "hello", "house", "me") across three emotional states (happy, neutral, sad) to validate our core technical approach. Our lightweight architecture employs a shared acoustic encoder for simultaneous Automatic Speech Recognition and emotion classification, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 labeled audio samples from 50 speakers. The system demonstrates 37% parameter efficiency compared to separate model architectures while maintaining
arXiv:2606.26106v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in emotionally charged situations involving interpersonal conflict, frustration, and distress. While prior safety research has focused on preventing explicit harms such as toxic or policy-violating content, less attention has been paid to conversational behaviors that may unintentionally escalate conflict. In this paper, we investigate whether LLMs can be guided toward more de-escalating dialogue behavior through lightweight prompt-level constraints derived from Nonviolent Communication (NVC). We reformulate NVC principles as process-oriented guidelines that discourage blame attribution, emphasize attention to users' emotional experiences, and encourage clarification before advice. Using a dual-agent simulation framework across multiple instruction-tuned models and user resistance levels, we show that NVC-constrained prompting consistently reduces conversational escalation and stabilizes
arXiv:2606.26105v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce ContextForge, a system for context recycling that maintains task-relevant information across turns by combining structured query generation, external memory retrieval, and controlled synthesis. The system enables efficient reuse of prior computation without relying on full context replay, reducing token overhead while preserving answer quality. We evaluate ContextForge using a 15-turn conversational benchmark that tests multi-turn reasoning, back-references, and domain shifts across structured healthcare queries. Compared to a baseline agent using identical underlying models, ContextForge demonstrates improved consistency and reduced token consumption, while maintaining comparable response accuracy. These results suggest
arXiv:2606.26104v1 Announce Type: new Abstract: Animal-welfare advocates produce a lot of writing, and increasingly that writing trains the language models that millions of people then ask about animal welfare. Using vocabulary-matched stance-contrast probes on a held-out animal-welfare benchmark, we measure how each of ten linguistic features changes Llama-3.2-1B's preference for pro-animal-welfare reasoning when used as fine-tuning data. Eight of the ten features produce statistically significant shifts. Seven move the model toward stronger pro-animal-welfare reasoning: assertive certainty, explicit moral vocabulary, emotion words, evaluative claims, narrative structure, depicted harm severity, and immediate temporal framing. Two move it the other way: hedged language and concrete sensory description both dilute the pro-animal-welfare stance. First-person perspective has no statistically significant effect. The practical recommendation for anyone writing animal-welfare text that may
arXiv:2606.26103v1 Announce Type: new Abstract: Large Language Models (LLMs) have rapidly influenced many aspects of society, particularly education, due to their demonstrated ability to complete assignments and examinations across a wide range of subjects. Although prior studies have examined the educational impact of LLMs, much of the existing work relies on public or open problem datasets and lacks topic-specific analysis. In engineering education, especially within mechanical engineering, systematic investigations of LLM performance on specific problem types remain limited. Instead of using traditional methods that directly ask textbook questions to an LLM tool, our study adopts a model distillation process to evaluate LLM capabilities in solving statics problems. By distilling ChatGPT, we extracted 25 text-only statics questions and further constructed two additional datasets by adding diagrams and modifying their numerical values. Experimental results show that while LLMs perform
arXiv:2606.26101v1 Announce Type: new Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser plus a normalized robustness parser. We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answ
arXiv:2606.26100v1 Announce Type: new Abstract: Media bias detection is a critical task for ensuring fair and balanced information dissemination, yet existing sentence-level approaches classify each sentence independently, ignoring inter-sentence contextual signals that human annotators naturally exploit. We present HierBias, a hierarchical context-conditioned media bias detector that formally models document context in bias prediction. We introduce the context-conditioned bias probability and prove theoretically that leveraging document context strictly reduces the Bayes error of sentence-level classification when inter-sentence mutual information is non-zero. A multi-task generalization bound further establishes that jointly training binary bias detection and fine-grained bias type classification improves sample efficiency on small annotated corpora. Architecturally, HierBias pairs a sentence-level RoBERTa encoder with a cross-sentence Transformer aggregator and dual
arXiv:2510.26518v2 Announce Type: replace-cross Abstract: Human feedback is critical for aligning AI systems to human values. As AI capabilities improve and AI is used to tackle more challenging tasks, verifying quality and safety becomes increasingly challenging. This paper explores how we can leverage AI to improve the quality of human oversight. We focus on an important safety problem that is already challenging for humans: fact-verification of AI outputs. We find that combining AI ratings and human ratings based on AI rater confidence is better than relying on either alone. Giving humans an AI fact-verification assistant further improves their accuracy, but the type of assistance matters. Displaying AI explanation, confidence, and labels leads to over-reliance, but just showing search results and evidence fosters more appropriate trust. These results have implications for Amplified Oversight -- the challenge of combining humans and AI to supervise AI systems even as they surpass hu
arXiv:2606.09843v2 Announce Type: replace Abstract: Large language models (LLMs) give stable answers to personality questionnaires, yet these self-reports fail to predict how the models actually behave. Is this gap an artifact of forcing human trait categories onto LLMs, or something deeper about LLM self-report itself? To find out, we built the first psychometric instrument whose dimensions are derived bottom-up from LLM behavior rather than borrowed from human psychology. Administering 300 items (240 Likert + 60 scenario) to 25 LLMs across 17 model families, 30 times each, exploratory factor analysis revealed five replicable, highly reliable factors: Responsiveness, Deference, Boldness, Guardedness, and Verbosity (all Tucker $\phi \geq .957$, all $\alpha \geq .930$). We then collected 2,500 open-ended behavioral samples and had them rated by 151 humans and a three-judge LLM ensemble. Humans and judges agreed about model behavior ($\bar{r} = .51$), but self-report predicted neither: t
arXiv:2604.03501v5 Announce Type: replace Abstract: Experimental evidence suggests that AI tools raise worker productivity, but also that sustained use can erode the expertise on which those gains depend. To explore the consequences of this tradeoff, we develop a dynamic model in which a decision-maker chooses AI usage intensity for a worker over time, trading immediate productivity against the erosion of worker skill. We decompose the tool's productivity effect into two channels, one independent of worker expertise and one that scales with it. The model produces three main results. First, a decision-maker who fully anticipates skill erosion still rationally adopts AI when front-loaded gains outweigh long-run skill costs, lowering long-run productivity. The decomposition sorts deployments into five regimes by their long-run effect, separating beneficial from harmful adoption. Second, the tradeoff introduces the potential for misaligned incentives. When the decision-maker does not bear
arXiv:2604.01741v2 Announce Type: replace Abstract: Modeling users' cognitive states (e.g., cognitive load and decision confidence) is essential for building adaptive AI in high-stakes decision-making. While eye tracking provides non-invasive behavioral signals correlated with cognitive effort, prior work has not systematically examined how AI assistance contexts, specifically varying advice reliability and user heterogeneity, can alter the mapping between gaze signals and cognitive states. We conducted a within-subject lab eye-tracking study (N=54) on factual verification tasks under three conditions: No-AI, Correct-AI advice, and Incorrect-AI advice. We analyze condition-dependent changes in self-reports and eye-tracking patterns and evaluate the robustness of eye-tracking-based user modeling. Results show that AI advice increases decision confidence compared to No-AI, while Correct-AI is associated with lower perceived cognitive load and more efficient gaze behavior. Crucially, pred
arXiv:2601.00570v2 Announce Type: replace Abstract: Cognitive reappraisal is a well-studied emotion regulation strategy that helps individuals reinterpret stressful situations to reduce their impact. Many digital mental health tools struggle to support this process because rigid scripts fail to accommodate how users naturally describe stressors. This study examined the feasibility of an LLM-based single-session intervention (SSI) for workplace stress reappraisal. We assessed short-term changes in stress-related outcomes and examined design tensions during use. We conducted a feasibility study with 100 employees at a large technology company who completed a structured cognitive reappraisal session delivered by a GPT-4o-based chatbot. Pre-post measures included perceived stress intensity, stress mindset, perceived demand, and perceived resources. These outcomes were analyzed using paired Wilcoxon signed-rank tests with correction for multiple comparisons. We also examined sentiment and s
arXiv:2509.05219v5 Announce Type: replace Abstract: Conversational AI systems are increasingly being used in place of traditional search engines to help users complete information-seeking tasks. This has raised concerns in the political domain, where biased or hallucinated outputs could misinform voters or distort public opinion. However, in spite of these concerns, the extent to which conversational AI is used for political information-seeking, as well as the potential impact of this use on users' political knowledge, remains uncertain. Here, we address these questions: First, in a representative national survey of the UK public (N = 2,499), we find that in the week before the 2024 election as many as 32% of chatbot users - and 13% of eligible UK voters - have used conversational AI to seek political information relevant to their electoral choice. Second, in a series of randomised controlled trials (N = 2,858 total) we find that across issues, models, and prompting strategies, task-direc
arXiv:2606.27258v1 Announce Type: cross Abstract: A core principle of object orientation -- that the functionality of a system can be partitioned amongst objects that correspond to individuals in the problem domain -- has influenced how software has been specified, designed and implemented for more than fifty years. Later developments in software engineering sought to build on this principle. But in fact this partitioning is neither natural nor straightforward, and the problems that these later developments sought to mitigate -- the fragmentation and conflation of functionality -- were often, in fact, the inevitable consequences of this founding principle. An easier path to addressing these problems therefore starts by going back, abandoning object orientation, and replacing it with an alternative approach that decouples the individuals of the problem domain from the modules that partition functionality.
arXiv:2606.26842v1 Announce Type: cross Abstract: Labeling speaker diarization data is costly, yet annotation tools rarely measure that cost. We present voxmap-studio, an open-source, React-based diarization annotation tool integrated with the pyannote-based diarization ecosystem. Its canvas is initialized by a fast stride-accelerated diarization engine so that the annotator corrects a hypothesis rather than drawing every speaker turn by hand, and the tool records annotation cost - typed edit-operation counts and time - as a first-class output, enabling quantitative comparison of how much different forms of assistance actually help. Export is gated on per-segment human confirmation and guarded by injected "phantom" attention checks, which prevent unverified automatic output from being released as ground truth. In a preliminary study on nine AMI audio files, unassisted manual annotation was the costliest and least accurate, and automatic initialization shifted the work from creating tur
arXiv:2606.26721v1 Announce Type: cross Abstract: AI coding agents are changing the bottleneck in software collaboration: code is increasingly cheap, while understanding intent, negotiating scope, and governing long-term project responsibility remain costly. This paper proposes Knowledge-Based Pull Requests (KPR), a trusted workflow for agent-mediated software collaboration across trust boundaries, including open source, enterprise, vendor, contractor, and customer-driven settings. In KPR, an external collaborator's local code, tests, and cleaned agent interaction trace are treated as knowledge sources rather than as the default merge candidate. Agents distill these sources into a human-confirmed knowledge package and render it into reviewer-facing forms such as design memos, risk checklists, test plans, or implementation briefs. A project-owned inner trusted coding agent then regenerates candidate code inside the receiving project's environment under repository context, enginee
arXiv:2606.26505v1 Announce Type: cross Abstract: Modern software development increasingly involves the use of large language models (LLMs) to generate code. Despite their rapid advancement, LLMs remain prone to errors and hallucinations, emphasizing the importance of careful code inspection. However, in practice, developers' trust in LLM-generated code and their willingness to review it thoroughly may differ from these recommendations. How developers actually behave when reviewing LLM-generated code remains largely unexplored. In this study, we conduct a Wizard-of-Oz experiment to examine how software engineers behave when code is explicitly labeled as LLM-generated during a code review task. We collect both behavioral data and participant feedback through eye-tracking and exit interviews. Combining Bayesian data analysis with qualitative analysis, we found that while the thoroughness of code review did not change for participants, they spent more time fixating on LLM-labelled code, i
arXiv:2606.26485v1 Announce Type: cross Abstract: Microblogging platforms generate massive amounts of short, noisy, and dispersed user content, making automatic keyphrase extraction (AKE) an important but challenging task. Prior studies have used eye-tracking signals to improve microblog-based AKE because such signals reflect readers' attention to salient words. However, eye tracking alone is limited by physiological, acquisition, and feature-decoding constraints. To address this issue, we investigate whether electroencephalogram (EEG) signals can complement eye-tracking signals for AKE. Using the ZuCo cognitive language processing corpus, we select 8 EEG features and 17 eye-tracking features and incorporate them into microblog-based AKE models. To reduce possible distortion of cognitive signals by model structures, we inject these features into the input of the soft-attention layer and the query vectors of the self-attention layer. We then evaluate different combinations of cognitive
arXiv:2606.26382v1 Announce Type: cross Abstract: Social-physical human-robot interaction (spHRI) has grown rapidly across robotics, human-computer interaction, human-robot interaction, and haptics. Yet, fragmented terminology and inconsistent methodologies make systematic synthesis difficult. To support scalable review practices, we evaluated the extent to which small language models (SLMs; < 1.5B parameters) can assist with title and abstract screening for a large spHRI systematic review. While no SLMs matched human reviewers' performance, the models operated locally and screened papers orders of magnitude faster. The combined SLM ensemble identified 39 papers reviewers missed, representing 10.29% of the final relevant dataset. These results demonstrate that SLMs can augment, rather than replace, expert reviewers and make large-scale literature reviews accessible and sustainable.
arXiv:2606.27302v1 Announce Type: new Abstract: AI healthcare chatbots are increasingly used to support health information seeking and self-management, yet their performance and impact on users remain to be studied. This study examines over 15,000 user reviews from 59 AI healthcare chatbot apps to explore how these systems function in everyday informational and emotional contexts. Topic modeling and interpretive analysis identify three recurring breakdowns: access barriers and service unreliability, user experience and interaction quality, and billing and customer support issues. Privacy and security concerns are associated with the most negative experiences. By framing AI healthcare chatbots as information infrastructures, our findings highlight how failures in access, usability, and trust affect users, offering actionable insights for designers, policymakers, and information professionals aiming to improve digital health systems.
arXiv:2606.27301v1 Announce Type: new Abstract: Electronic monitoring (EM) systems are increasingly used in community corrections to enforce spatial, temporal, and behavioral rules through continuous sensing. While prior work has examined EM as a criminal justice tool or as a mechanism for compliance, less is known about how sensed data become meaningful in everyday practice. This poster examines EM as a dual-sided sensing system in which supervised individuals and authorities reason about the same data stream from different positions. Based on semi-structured interviews with 26 supervised individuals and 12 authorities in China's community corrections system, we show that supervised individuals infer system logic from outcomes with limited visibility into how data are interpreted, while authorities reconstruct behavior from ambiguous traces using contextual knowledge, professional experience, and institutional procedures. We call this structural divergence interpretive misalignment. I
arXiv:2606.27284v1 Announce Type: new Abstract: Gay dating applications have become critical platforms for sexual minority men to seek relationships and community, yet they also expose users to deceptive interactions that remain underexplored in HCI and CSCW research. This study examines how gay male users in China experience, identify, and respond to deception on dating applications. Through semi-structured interviews with 22 participants across platforms including Blued, Aloha, Fanka, and Soul, we make three contributions. First, we identify a typology of deceptive practices extending beyond profile misrepresentation to encompass relational, emotional, financial, and commercial forms of deception. Second, we document the layered, probabilistic verification strategies users develop through long-term platform use, showing that trust assessment operates as a multi-signal, provisional process rather than a binary judgment. Third, we demonstrate that risk recognition is a collaborative pr
arXiv:2606.27111v1 Announce Type: new Abstract: The broadcast of disinformation in online social networks (OSN) is a growing concern examined across several disciplines, including human-computer interaction (HCI). The pervasive issue has been prompting novel approaches to identify the malicious actors behind the dissemination of deceptive and fabricated content. Analyzing the characteristics and activities of these actors, we designed a taxonomy informed by collaboration with subject matter experts (SMEs) and a review of the academic literature. Our study explores how to distinguish the characteristics, activities, and strategies of malicious actors on OSN and examines how they contribute to the spread of disinformation. We describe the design process and the application of the taxonomy in a case study analyzing anti-migration discourse in social media channels, and reflect on its potential to aid researchers and practitioners in the responsible design of network systems.
arXiv:2606.27077v1 Announce Type: new Abstract: The presented study investigates events influencing public transportation experience in both urban (Hamburg) and rural (Tuttlingen) areas in Germany, with the aim of identifying events that affect travel experience and, as a result, travel behavior. Using a mobile application, 21 participants in Tuttlingen and 70 participants in Hamburg tracked everyday trips, providing real-time evaluations of travel experiences along with situational data. Multi-level regression analyses were applied to assess the impact of events such as punctuality, capacity offer, information about public transportation, and others on the on-trip experience. Results indicate that a sufficient public transportation capacity offer has the strongest positive effect in Tuttlingen, whereas a lack of punctuality and low personal well-being have the strongest negative effects. In Hamburg, a lack of punctuality and a negative information event have the largest impacts. These ide
arXiv:2606.27067v1 Announce Type: new Abstract: Generative AI (GenAI) holds promise for democratizing creative literacy, yet whether it benefits all children equally remains unclear. Using a child-centric GenAI storytelling system for children aged 7-12, we conducted a mixed-methods within-subjects experiment (N = 40, Grades 2-6) comparing GenAI-assisted and traditional storyboard conditions. Three findings emerged. First, the GenAI-assisted condition was associated with a floor-raising convergence pattern, with the quality gap narrowing by 83.5%, driven by lower-end support and upper-end constraint mechanisms. This convergence was dimension-selective, improving creativity and richness while leaving coherence and narrative structure tied to baseline performance. Second, younger children more often selected semantically distant keywords while older children preferred semantically closer ones, although engagement orientation varied across individuals regardless of age. Third, image regen
arXiv:2606.26951v1 Announce Type: new Abstract: Brain-computer interfaces (BCIs) offer promising avenues for cerebral palsy (CP) rehabilitation at home and in the clinic, using games that promote engagement and sustained training effort. Nonetheless, the design constraints of BCI-based CP rehabilitation remain unclear, especially how individuals with CP experience a sense of control through BCI, and how they experience computer-mediated game assistance. To address this gap, we present preliminary clinical and user perspectives on BCI-based CP rehabilitation, drawing on in-clinic insights from a CP therapist and experiential accounts from ten individuals with CP engaging with BCI game prototypes. Sporadic help in BCI games eased monotony, but also fostered doubts regarding agency. The therapist saw BCI rehabilitation as complementary to traditional training, facilitating the transition from playful exercises to autonomous, self-managed training. We outline key challenges and opportuniti
arXiv:2606.26937v1 Announce Type: new Abstract: The engineering of adaptive user interfaces has traditionally relied on either rule-based systems encoding designer intuitions about user needs or machine learning approaches requiring substantial historical data before achieving effective personalization. We present a technical architecture that leverages Large Language Models as behavioral synthesis engines to enable immediate adaptation from sparse, heterogeneous user signals. Our system integrates three distinct behavioral channels, i) explicit micro-feedback on individual interface elements, ii) spatial priority inferred from manual widget reorganization through drag-and-drop interaction, and iii) attentional investment measured through dwell time during hover events, within a structured prompt engineering framework that continuously regenerates dashboard layouts while maintaining explanatory coherence. The architecture addresses the technical challenge of translating low-level inter
arXiv:2606.26925v1 Announce Type: new Abstract: The way games dynamically convey information through feedback is critical to players' ability to perform, learn, and improve. However, it is poorly understood how performance metrics impact player performance and perception in core game tasks like pointing or steering. With a virtual reality pointing task we systematically explored how three performance metrics driving the feedback affected players when rewarding short completion times, straight movements, or high peak speed across different points in time: continuously, at end-of-action, or at end-of-task. On average the dynamic feedback helped people point straighter and faster, while for others it had a small or opposite effect. The study quantitatively compared dynamic feedback across three forms with the metrics driving the form as the intended locus of quantitative comparison. Our work improves game designers' basis for crafting dynamic feedback by helping them know when to employ
arXiv:2606.26886v1 Announce Type: new Abstract: Artificial intelligence (AI) systems for automated Critical View of Safety (CVS) assessment in laparoscopic cholecystectomy are nearing clinical translation. Beyond algorithmic performance, clinical safety and effectiveness depend on the quality of the human-machine interface (HMI). This work examines how AI-generated predictions should be presented and controlled intraoperatively. Seventeen surgeons, including residents, attending surgeons, and professors, took part in a mixed-methods, user-centered design study to optimize an intraoperative HMI for AI-assisted safe laparoscopic cholecystectomy. Interviews explored interaction modalities, timing of assistance, visualization strategies, and control mechanisms across surgical roles, and were analyzed using reflexive thematic analysis and human-factors heuristics. Most surgeons (16/17) supported the use of AI for intraoperative decision support while rejecting autonomous decision-making. At
arXiv:2606.26884v1 Announce Type: new Abstract: We present MedSWFlow, an open-source, model-agnostic LLM workflow for drafting medical social work case plans. The framework translates professional case-planning tasks into six stages: assessment, problem analysis, goal setting, intervention planning, risk anticipation, and planned effect evaluation. Drawing on established social work and behavioral frameworks, MedSWFlow standardizes case inputs, builds structured case profiles, and generates reviewable assessment forms and service plans through staged prompting. The system is released as an open-source research framework for reproducible case-plan generation across LLM providers. Outputs are intended as practitioner-reviewed drafts rather than final service decisions. Source code: https://github.com/santhiyacw-droid/MedSWFlow/tree/main.
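The staged prompting described in the abstract above can be sketched roughly as follows. This is a minimal illustration and not MedSWFlow's actual code: only the six stage names come from the abstract, while `run_staged_plan`, `echo_llm`, the prompt wording, and the choice to feed earlier drafts into later prompts are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Stage names taken from the abstract; everything else here is hypothetical.
STAGES: List[str] = [
    "assessment",
    "problem analysis",
    "goal setting",
    "intervention planning",
    "risk anticipation",
    "planned effect evaluation",
]


def run_staged_plan(case_profile: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Prompt once per stage, passing earlier stage drafts into later prompts."""
    drafts: Dict[str, str] = {}
    for stage in STAGES:
        prior = "\n".join(f"[{name}] {text}" for name, text in drafts.items())
        prompt = (
            f"Case profile:\n{case_profile}\n\n"
            f"Earlier stage drafts:\n{prior or '(none)'}\n\n"
            f"Draft the '{stage}' section for practitioner review."
        )
        drafts[stage] = llm(prompt)
    return drafts


def echo_llm(prompt: str) -> str:
    """Stand-in for a real provider call; returns a placeholder draft."""
    return f"draft ({len(prompt)} chars of prompt)"


plan = run_staged_plan("Hypothetical standardized case input.", echo_llm)
```

Passing the model in as a plain callable is one way to keep such a sketch provider-agnostic; any function that maps a prompt string to a response string could replace `echo_llm`.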
arXiv:2606.26729v1 Announce Type: new Abstract: Generative artificial intelligence (GenAI) has intensified pressure on universities to redesign assessment while maintaining integrity, equity, and validity. Structured frameworks such as the Artificial Intelligence Assessment Scale (AIAS) offer one response, but evidence of how staff experience their implementation remains limited. This qualitative study examines AIAS implementation at a private international university in Vietnam and a public university in the United Kingdom. Data from five focus groups with 30 academic staff were analysed using hybrid thematic analysis, with Critical AI Literacy used as a sensitising concept. Six themes were developed: recognising and integrating AI, facilitating conditions, building capacity, pathways to adoption, ethics in practice, and reframing pedagogy. Staff valued the AIAS as a shared language for legitimising GenAI use, clarifying boundaries, and prompting reflection on assessment design. Howev
arXiv:2606.26725v1 Announce Type: new Abstract: This paper introduces a computational cognitive model to investigate how information grouping impacts visual search, a key consideration in user interface design. The model uses computational rationality to view user behavior as an adaptation to cognitive and task constraints. Our work highlights that humans use hierarchical task representations, exploiting semantic and visual structures to improve search efficiency within the constraints of the visual system. We validate this model with data from two human studies focused on visual search and semantic categorization, demonstrating that semantic grouping improves search performance when it aligns with spatial grouping. Our model replicates task durations and eye movement patterns. By improving understanding of how hierarchical memory structures are utilized in human cognition, the model extends previous visual search models. We showcase our model in the rapid prototyping and evaluation of
arXiv:2606.26672v1 Announce Type: new Abstract: Artificial intelligence-mediated communication (AI-MC) is conceptualized as applying AI to augment or generate message content (Hancock et al., 2020). However, advances in generative AI have expanded its use beyond generating content to guiding individuals' communication strategies, that is, AI-guided communication, yet theoretical and empirical understandings of this emerging use pattern and its consequences remain limited. To address this gap, this study conducted 26 in-depth interviews with individuals who have used AI to develop their communication strategies. Findings suggest participants strongly preferred using AI to analyze challenging scenarios in close relationships, because it fostered self-reflection, eased emotions, prevented conflict escalation, offered multiple perspectives, and provided a safe, nonjudgmental space for self-disclosure. Participants also stated that AI-guided communication enhanced their empathy and communi
arXiv:2606.26641v1 Announce Type: new Abstract: Current dialogue systems, powered by large language models, often treat empathy as essential without assessing its true impact, especially in behavior change, where motivation and adherence often depend on subtle user-chatbot dynamics. We examine this assumption by building three WhatsApp physical-activity (PA) coaching chatbots that differ only in empathy level and evaluating them in a six-week within-subject study (N = 13). Participants struggled to distinguish between the empathy conditions, and the non-empathetic version was often rated as more engaging and useful. However, higher-empathy variants were still associated with a larger overall average increase in step counts and faster improvement in intention to follow advice. These results suggest empathy's role is nuanced: it may be hard for lay users to identify explicitly, but it can still shape motivation and trust that support sustained change. We interpret this pattern through th
arXiv:2606.26626v1 Announce Type: new Abstract: Current AI-powered creativity support tools (AI-CSTs) primarily use text prompting to generate solution-oriented outputs. However, the potential value of multimodal prompting in designer-AI interaction, specifically the introduction of productive friction to encourage iteration and reflection, has not been fully explored. To address this, we developed SketchifAI, a prototype AI-CST, and evaluated it with design students. In a mixed-methods, within-participants study, we examined how different input modalities (text, sketch, and sketch-plus-tags) affected design students' perceived ability to express their intent, their perception of creativity support, and their divergent thinking performance. Our preliminary findings suggest that the sketch modality tended to enhance fluency, with inconclusive evidence for differences in variety, originality, or quality compared to text modality. Yet, paradoxically, participants showed a strong preferenc
arXiv:2606.26614v1 Announce Type: new Abstract: Large language model (LLM) agents enable natural language interaction for scientific visualization (SciVis). Still, prior systems have essentially prioritized autonomy over human analytical control, thereby limiting transparency and human oversight. We present HiLSVA, a human-in-the-loop agentic system that supports mixed-initiative SciVis workflows. HiLSVA integrates a plan-first multi-agent architecture with explicit human oversight, stepwise provenance tracking, and learn-at-test-time adaptation from user feedback. The system supports fluid handoff between humans and agents through both natural language and direct manipulation of visualizations, while sandboxed execution ensures safe, reproducible workflows. In doing so, HiLSVA reframes agentic SciVis as a collaborative process that augments, rather than replaces, human analytical reasoning. We evaluate HiLSVA through representative case studies and a controlled user study with twelve
arXiv:2606.26565v1 Announce Type: new Abstract: Artificial Intelligence (AI) education is increasingly important, yet adults outside higher education receive less attention. We report a case study of an AI education session with 54 adults (48 in-person and 6 virtual) in a predominantly African American community on the east side of a major Midwestern city. We ask: "What does AI education for adults outside formal educational systems look like in practice?" and "What does this AI education session reveal about AI literacy at the community level?" Through a co-designed session developed with community partners, we found that concerns about AI persisted but shifted to specific, locally grounded questions about AI design and deployment. We also discuss AI literacy from a community capacity perspective and argue for AI literacy frameworks grounded in local community contexts that strengthen community capacity.