🔬 Research Papers
arXiv cs.AI & cs.CL – Grundlagenforschung
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
arXiv:2604.26506v1 Announce Type: new
Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instru…
arXiv →
SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instru…
arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
▶
arXiv cs.CL
30.04.2026
Swap distance minimization shapes the order of subject, object and verb in languages of the world
arXiv:2604.26726v1 Announce Type: new
Abstract: Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have t…
arXiv →
Swap distance minimization shapes the order of subject, object and verb in languages of the world
arXiv:2604.26726v1 Announce Type: new Abstract: Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have t…
arXiv:2604.26726v1 Announce Type: new Abstract: Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have tailored models to this fact. However, there are still languages whose dominant order does not conform to these expectations or even lack a dominant order. Here we show that across linguistic families and macroareas, word order variation within languages is shaped by the principle of swap distance minimization even when the dominant order is not SOV/SVO and even when a dominant order is lacking.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
arXiv:2604.25923v1 Announce Type: new
Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices.…
arXiv →
Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
arXiv:2604.25923v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices.…
arXiv:2604.25923v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
arXiv:2604.25921v1 Announce Type: new
Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in convers…
arXiv →
One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in convers…
arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.
▶
arXiv cs.CL
30.04.2026
SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
arXiv:2604.25925v1 Announce Type: new
Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by …
arXiv →
SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
arXiv:2604.25925v1 Announce Type: new Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by …
arXiv:2604.25925v1 Announce Type: new Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited by treating these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft and greedy block verification (GBV) into a single framework. By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We theoretically prove that SpecTr-GBV achieves the optimal expected acceptance length physically attainable within the framework of i.i.d. draft generation, and this bound improves as the number of drafts increases. Empirically, we evaluate SpecTr-GBV across five datasets and four baselines. Our method achieves superior speedup and significantly higher block efficiency while preserving output quality. In addition, we perform comprehensive ablation studies to evaluate the impact of various hyperparameters in the model.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
arXiv:2604.25926v1 Announce Type: new
Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and …
arXiv →
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
arXiv:2604.25926v1 Announce Type: new Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and …
arXiv:2604.25926v1 Announce Type: new Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing {\sc Math-PT}, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. {\sc Math-PT} is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on {\sc Math-PT}, revealing that frontier reasoning models achieve strong performance in multiple choice questions compared to open weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Information Extraction from Electricity Invoices with General-Purpose Large Language Models
arXiv:2604.25927v1 Announce Type: new
Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capabil…
arXiv →
Information Extraction from Electricity Invoices with General-Purpose Large Language Models
arXiv:2604.25927v1 Announce Type: new Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capabil…
arXiv:2604.25927v1 Announce Type: new Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA
arXiv:2604.25928v1 Announce Type: new
Abstract: Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and pr…
arXiv →
CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA
arXiv:2604.25928v1 Announce Type: new Abstract: Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and pr…
arXiv:2604.25928v1 Announce Type: new Abstract: Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and problem-solving. However, existing large language models often suffer from opaque inference processes in which retrieval and reasoning are tightly entangled, causing knowledge gaps and reasoning inconsistencies in professional tasks. To address this, we propose CogRAG+, a training-free framework that decouples and aligns the retrieval-augmented generation pipeline with human cognitive hierarchies. First, we introduce Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths that strengthens retrieval and mitigates cascading failures caused by missing foundational knowledge. We then develop cognition-stratified Constrained Reasoning, which replaces unconstrained chain-of-thought generation with structured templates to reduce logical inconsistency and generative redundancy. Experiments on two representative models, Qwen3-8B and Llama3.1-8B, show that CogRAG+ consistently outperforms general-purpose models and standard RAG methods on the Registered Dietitian qualification exam. In single-question mode, it raises overall accuracy to 85.8\% for Qwen3-8B and 60.3\% for Llama3.1-8B, with clear gains over vanilla baselines. Constrained Reasoning also reduces the unanswered rate from 7.6\% to 1.4\%. CogRAG+ offers a robust, model-agnostic path toward training-free expert-level performance in specialized domains.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
LLMs Generate Kitsch
arXiv:2604.25929v1 Announce Type: new
Abstract: Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human c…
arXiv →
LLMs Generate Kitsch
arXiv:2604.25929v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human c…
arXiv:2604.25929v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human creativity. LLM-generated artifacts are often rated better than human-generated works in controlled studies. At the same time, they can come across as generic and hollow. We propose to resolve this tension by arguing that LLMs systematically generate kitsch, and that this is a consequence of the way in which they are trained. We also show empirically that readers perceive LLM-generated stories as kitschier, if we control for their definition of "kitsch". We discuss implications for the design of future studies and for creative tasks such as research and coding.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Associative-State Universal Transformers: Sparse Retrieval Meets Structured Recurrence
arXiv:2604.25930v1 Announce Type: new
Abstract: We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval.…
arXiv →
Associative-State Universal Transformers: Sparse Retrieval Meets Structured Recurrence
arXiv:2604.25930v1 Announce Type: new Abstract: We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval.…
arXiv:2604.25930v1 Announce Type: new Abstract: We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval. We introduce UniMatrix, a Universal Transformer style family that reuses a shared recurrent block across depth and augments it with hybrid state updates, a ROSA-style residual path, and token-conditioned embedding modulation. We evaluate these models on byte-level WikiText-2, synthetic associative recall, throughput profiling on Apple MPS, and a corrected benchmark for triple-token interactions. At small scale, UniMatrix-Core and UniMatrix-ROSA slightly outperform a parameter-matched Transformer on WikiText-2 while using many fewer parameters, reaching 5.084 and 5.083 bits-per-byte versus 5.124. The main negative result is equally important: on associative recall, the original UniMatrix family remains near chance while the Transformer reaches 25.4 percent, showing that compressed recurrent state alone is not enough for exact lookup. A retrieval-oriented follow-up, UniMatrix-Assoc, helps only marginally. By contrast, UniMatrix-SparsePointer, which adds sparse slot routing and direct pointer-logit fusion, reaches 75.6 percent on the original pilot recipe and 99.2 percent on a no-dropout follow-up while using 53.8 percent fewer parameters than the Transformer baseline. Ablations show that the gain comes from sufficient slot capacity and exact pointer-level output routing. Overall, structured recurrent state is promising and parameter-efficient, but strong long-range behavior still requires explicit sparse retrieval and better kernels.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs
arXiv:2604.25931v1 Announce Type: new
Abstract: We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning…
arXiv →
Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs
arXiv:2604.25931v1 Announce Type: new Abstract: We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning…
arXiv:2604.25931v1 Announce Type: new Abstract: We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model's confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
arXiv:2604.26048v1 Announce Type: new
Abstract: This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. In the core of this framewo…
arXiv →
BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
arXiv:2604.26048v1 Announce Type: new Abstract: This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. In the core of this framewo…
arXiv:2604.26048v1 Announce Type: new Abstract: This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. In the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. We start by demonstrating the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Secondly, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
arXiv:2604.26052v1 Announce Type: new
Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful r…
arXiv →
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
arXiv:2604.26052v1 Announce Type: new Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful r…
arXiv:2604.26052v1 Announce Type: new Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user's input and the model's response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario
arXiv:2604.26500v1 Announce Type: new
Abstract: LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to…
arXiv →
StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario
arXiv:2604.26500v1 Announce Type: new Abstract: LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to…
arXiv:2604.26500v1 Announce Type: new Abstract: LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models
arXiv:2604.26139v1 Announce Type: new
Abstract: Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather tha…
arXiv →
HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models
arXiv:2604.26139v1 Announce Type: new Abstract: Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather tha…
arXiv:2604.26139v1 Announce Type: new Abstract: Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather than only in the final output. Existing detectors mainly rely on output uncertainty or coarse trace statistics, which often fail to capture the richer hidden dynamics of D-LLMs. We propose HIVE, a hidden-evidence verification framework that extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on the selected evidence through prefix embeddings. HIVE produces both a continuous hallucination score from verifier decision logits and structured verification outputs, including hallucination types, evidence pairs, and short rationales. Across two D-LLMs and three QA benchmarks, HIVE consistently outperforms eight strong baselines and achieves up to 0.9236 AUROC and 0.9537 AUPRC. Ablation studies further confirm the importance of hidden-evidence conditioning, learned evidence selection, two-stream evidence representation, and step-layer embeddings. These results suggest that selected hidden evidence from denoising trajectories provides a stronger and more usable hallucination signal than output-only uncertainty or coarse trace statistics.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation
arXiv:2604.26170v1 Announce Type: new
Abstract: Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires it…
arXiv →
EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation
arXiv:2604.26170v1 Announce Type: new Abstract: Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires it…
arXiv:2604.26170v1 Announce Type: new Abstract: Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires iteratively improving the model toward a targeted task, yet collecting high-quality human-labeled data to support this process is costly and difficult to scale. As a result, synthetic data generation has emerged as a flexible and scalable alternative. One straightforward approach is through an iterative generation-training loop, where candidate data are synthesized through an external generator, the model is updated using these data and the process is repeated over iterations. However, generated samples can be noisy, highly redundant, or even misaligned with the targeted task distribution. Training indiscriminately on such data can dilute useful learning signals and even degrade model performance. To address this, we introduce a refined paradigm, namely an iterative generation-selection-training loop, which incorporates a selection step prior to model updates. Building on this paradigm, we propose EvoSelect, a data-efficient framework to evolve LLM effectively. Given candidate samples produced by the data generator, EvoSelect selects training data by jointly modeling targeted task alignment and diversity. We estimate task relevance through optimal transport with proxy gradient representations, which quantifies how well candidate samples align with the targeted task distribution. To mitigate redundancy, we incorporate a diversification mechanism that promotes coverage of complementary training samples. By interleaving alignment and diversification, EvoSelect enables progressive LLM evolution toward targeted tasks. Extensive experiments on various benchmarks demonstrate that with either weak or strong data generators, EvoSelect consistently improves adaptation efficacy over existing data selection methods.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Comparative Analysis of AutoML and BiLSTM Models for Cyberbullying Detection on Indonesian Instagram Comments
arXiv:2604.26229v1 Announce Type: new
Abstract: This study compares machine learning and deep learning approaches for cyberbullying detection in Indonesian-language Instagram comments. Using a balanc…
arXiv →
Comparative Analysis of AutoML and BiLSTM Models for Cyberbullying Detection on Indonesian Instagram Comments
arXiv:2604.26229v1 Announce Type: new Abstract: This study compares machine learning and deep learning approaches for cyberbullying detection in Indonesian-language Instagram comments. Using a balanc…
arXiv:2604.26229v1 Announce Type: new Abstract: This study compares machine learning and deep learning approaches for cyberbullying detection in Indonesian-language Instagram comments. Using a balanced dataset of 650 comments labeled as Bullying and Non-Bullying, the study evaluates Naive Bayes, Logistic Regression, and Support Vector Machine with TF-IDF features, as well as BiLSTM and BiLSTM with Bahdanau Attention. A preprocessing pipeline tailored to informal Indonesian text is applied, including slang normalization, stopword removal, and stemming. The results show that Logistic Regression performs best among the machine learning models, while BiLSTM with Attention achieves the strongest overall deep learning performance. The findings highlight the value of domain-specific preprocessing and show that although deep learning captures contextual patterns more effectively, machine learning remains a competitive option for resource-constrained deployments.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
A New Semisupervised Technique for Polarity Analysis using Masked Language Models
arXiv:2604.26230v1 Announce Type: new
Abstract: I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns po…
arXiv →
A New Semisupervised Technique for Polarity Analysis using Masked Language Models
arXiv:2604.26230v1 Announce Type: new Abstract: I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns po…
arXiv:2604.26230v1 Announce Type: new Abstract: I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns polarity scores to words and documents as predicted probabilities of seed words to occur in given contexts. These probabilistic polarity scores are more accurate, interpretable and consistent than those spatial polarity models can produce in text analysis. I demonstrate these advantages by applying both probabilistic and spatial models to China Daily's coverage of China and other countries during the coronavirus disease (COVID) pandemic in terms of achievement in health issues. The result suggests that more advanced masked language models would further improve the semisupervised machine learning technique.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients
arXiv:2604.26258v1 Announce Type: new
Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, …
arXiv →
FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients
arXiv:2604.26258v1 Announce Type: new Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, …
arXiv:2604.26258v1 Announce Type: new Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on human-crafted pipelines and prompts, which presents a substantial bottleneck in real world deployment. How can automatically induce and optimize such workflows in a data-driven way? This paper describes a simple data-driven approach for automatically inducing LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one-by one. Both loops are optimized with ``textual gradients'' where for the inner loop we optimize each component in a modular way through ``backpropagating'' textual gradients layer-by-layer. We find that LLM workflows discovered through our \textsc{FlowBot} (work\textbf{flow} induction through \textbf{b}ilevel \textbf{o}ptimization and \textbf{t}extual gradients) approach performs competitively against strong baselines that make use of human-crafted or automatically-generated workflows.
▶
arXiv cs.CL
30.04.2026
Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference
arXiv:2604.26294v1 Announce Type: new
Abstract: We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single de…
arXiv →
Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference
arXiv:2604.26294v1 Announce Type: new Abstract: We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single de…
arXiv:2604.26294v1 Announce Type: new Abstract: We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as a viable axis of parallelism in concert with existing parallelism schemes such as pipeline and expert parallelism for dense and mixture-of-expert models.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection
arXiv:2604.26310v1 Announce Type: new
Abstract: Fine-grained emotion classification, which identifies specific emotional states such as happiness, anger, sadness, and fear, remains a challenging task…
arXiv →
Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection
arXiv:2604.26310v1 Announce Type: new Abstract: Fine-grained emotion classification, which identifies specific emotional states such as happiness, anger, sadness, and fear, remains a challenging task…
arXiv:2604.26310v1 Announce Type: new Abstract: Fine-grained emotion classification, which identifies specific emotional states such as happiness, anger, sadness, and fear, remains a challenging task in natural language processing. This study benchmarks classical machine learning and deep learning approaches for 20-class emotion classification using the 20-Emotion Text Classification Dataset containing 79,595 English sentences. On the machine learning side, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine are evaluated using TF-IDF features. On the deep learning side, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, and a lightweight Transformer implemented in PyTorch are compared. The results show that BiLSTM achieves the best overall performance with 89% accuracy and a weighted F1-score of 0.89, slightly outperforming the best machine learning model, SVM, which reaches 88.11% accuracy. The findings indicate that while traditional machine learning models remain competitive and computationally efficient, sequence-based deep learning models better capture contextual emotional cues in text.
▶
arXiv cs.CL
30.04.2026
Classification of Public Opinion on the Free Nutritional Meal Program on YouTube Media Using the LSTM Method
arXiv:2604.26312v1 Announce Type: new
Abstract: Public opinion towards the Free Nutritious Meal Program (MBG) on YouTube social media reflects diverse community responses. This study applies the Long…
arXiv →
Classification of Public Opinion on the Free Nutritional Meal Program on YouTube Media Using the LSTM Method
arXiv:2604.26312v1 Announce Type: new Abstract: Public opinion towards the Free Nutritious Meal Program (MBG) on YouTube social media reflects diverse community responses. This study applies the Long…
arXiv:2604.26312v1 Announce Type: new Abstract: Public opinion towards the Free Nutritious Meal Program (MBG) on YouTube social media reflects diverse community responses. This study applies the Long Short-Term Memory (LSTM) method to classify sentiments from 7,733 YouTube comments. The results show that the LSTM model achieves 89% accuracy, with strong performance on negative sentiment (F1-score 0.94) but weaker performance on positive sentiment (F1-score 0.55) due to class imbalance, as negative data account for 87.7% of the dataset. These findings confirm the effectiveness of LSTM for sentiment analysis of Indonesian text while highlighting the challenge of imbalanced data. This research contributes to social media-based public policy evaluation
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection
arXiv:2604.26319v1 Announce Type: new
Abstract: Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this tas…
arXiv →
A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection
arXiv:2604.26319v1 Announce Type: new Abstract: Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this tas…
arXiv:2604.26319v1 Announce Type: new Abstract: Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this task, from zero-shot prompting to multi-agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories -- prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF) -- on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size on this task.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models
arXiv:2604.26351v1 Announce Type: new
Abstract: Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such …
arXiv →
A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models
arXiv:2604.26351v1 Announce Type: new Abstract: Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such …
arXiv:2604.26351v1 Announce Type: new Abstract: Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such as reading times. However, it remains unclear whether such constraints similarly affect sentence comprehension strategies. Besides, existing methods do not directly target the balance between memory storage and sentence processing, which is central to human working memory. To address this issue, we propose a dual-task paradigm that combines an arithmetic computation task with a sentence comprehension task, such as "The 2 cocktail + blended 3 =..." Our experiments show that under dual-task conditions, GPT-4o, o3-mini, and o4-mini shift toward plausibility-based comprehension, mirroring humans' rational inference. Specifically, these models show a greater accuracy gap between plausible sentences (e.g., "The cocktail was blended by the bartender") and implausible sentences (e.g., "The bartender was blended by the cocktail") in the dual-task condition compared to the single-task conditions. These findings suggest that constraints on the balance between memory and processing resources promote rational inference in LMs. More broadly, they support the view that human-like sentence comprehension fundamentally arises from the allocation of limited cognitive resources.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens
arXiv:2604.26355v1 Announce Type: new
Abstract: Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains und…
arXiv →
Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens
arXiv:2604.26355v1 Announce Type: new Abstract: Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains und…
arXiv:2604.26355v1 Announce Type: new Abstract: Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains underexplored. We observe that reasoning tokens split into two functional types: low-entropy \textit{structural} tokens (recurring phrases that scaffold the reasoning process) and higher-entropy \textit{organic} tokens (problem-specific content that drives toward a solution). This asymmetry motivates a simple, model-agnostic compression pipeline: apply cross-word BPE merges on a model's own reasoning traces to derive \textit{supertokens} that capture frequent structural patterns, then teach the model to adopt them via supervised fine-tuning. Across three model families and five mathematical reasoning benchmarks, our approach shortens reasoning traces by 8.1\% on average with no statistically significant accuracy loss on any model--benchmark pair. Beyond compression, supertokens act as interpretable reasoning-move annotations (backtracking, verification, strategy shifts), exposing the model's high-level strategy at a glance. Analyzing transitions between structural categories reveals systematic differences between correct and incorrect traces: correct traces show productive recovery (backtracking followed by strategy shifts and verification), while incorrect traces are dominated by confusion cycles (repeated hedging and unresolved contradictions). These diagnostic signals suggest applications in reward shaping and early stopping for RL-based reasoning training.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
arXiv:2604.26412v1 Announce Type: new
Abstract: Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the specu…
arXiv →
When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
arXiv:2604.26412v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the specu…
arXiv:2604.26412v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
arXiv:2604.26417v1 Announce Type: new
Abstract: Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning …
arXiv →
EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
arXiv:2604.26417v1 Announce Type: new Abstract: Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning …
arXiv:2604.26417v1 Announce Type: new Abstract: Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
arXiv:2604.26460v1 Announce Type: new
Abstract: Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grou…
arXiv →
Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
arXiv:2604.26460v1 Announce Type: new Abstract: Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grou…
arXiv:2604.26460v1 Announce Type: new Abstract: Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics produce near-zero pairwise correlations, with absolute r less than 0.07, confirming that without theoretical grounding, metric choice determines conclusions: an LLM judge declares a clear winner while LUAR finds no meaningful differentiation. These findings demonstrate the theory-benchmark cycle in action: authorship theory exposes evaluation failures that ad hoc benchmarks miss.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
arXiv:2604.26568v1 Announce Type: new
Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable …
arXiv →
Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
arXiv:2604.26568v1 Announce Type: new Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable …
arXiv:2604.26568v1 Announce Type: new Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type, and symptom classification. By fine-tuning Speech Representation Models (SRM), and using targeted data augmentation we mitigate biases found by previous works, and improve upon all clinical tasks in the benchmark. We also treat Automatic Speech Recognition (ASR) with our data augmentation approach. Our results demonstrate that SRM consistently outperform the LLM-based state-of-the-art across all evaluated tasks by a large margin. We publish our models and code to foster future research.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis
arXiv:2604.26619v1 Announce Type: new
Abstract: Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite m…
arXiv →
Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis
arXiv:2604.26619v1 Announce Type: new Abstract: Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite m…
arXiv:2604.26619v1 Announce Type: new Abstract: Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
arXiv:2604.26622v1 Announce Type: new
Abstract: Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended …
arXiv →
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
arXiv:2604.26622v1 Announce Type: new Abstract: Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended …
arXiv:2604.26622v1 Announce Type: new Abstract: Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a \emph{locate-and-transcribe} paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling
arXiv:2604.26630v1 Announce Type: new
Abstract: Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time di…
arXiv →
SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling
arXiv:2604.26630v1 Announce Type: new Abstract: Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time di…
arXiv:2604.26630v1 Announce Type: new Abstract: Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time distress signals, and strategic intervention planning. This level of clinical reasoning is critical for safety and therapeutic effectiveness but is often missing in general-purpose Large Language Models (LLMs). We introduce SAGE (Strategy-Aware Graph-Enhanced), a novel framework designed to bridge the gap between structured clinical knowledge and generative AI. SAGE constructs a heterogeneous graph that unifies conversational dynamics with a psychologically grounded layer, explicitly anchoring interactions in a theory-driven lexicon. Our architecture first employs a Next Strategy Classifier to identify the optimal therapeutic intervention. Subsequently, a Graph-Aware Attention mechanism projects graph-derived structural signals into soft prompts, conditioning the LLM to generate responses that maintain clinical depth. Validated through both automated metrics and expert human evaluation, SAGE outperforms baselines in strategy prediction and recommended response quality. By providing actionable intervention recommendations, SAGE serves as a cutting-edge decision-support tool designed to augment human expertise in high-stakes crisis counseling.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Differentially-Private Text Rewriting reshapes Linguistic Style
arXiv:2604.26656v1 Announce Type: new
Abstract: Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative…
arXiv →
Differentially-Private Text Rewriting reshapes Linguistic Style
arXiv:2604.26656v1 Announce Type: new Abstract: Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative…
arXiv:2604.26656v1 Announce Type: new Abstract: Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text's communicative signature. This shift is characterized by the severe attrition of interactive markers, contextual references, and complex subordination. By comparing autoregressive paraphrasing against bidirectional substitution across a spectrum of privacy budgets, we observe that both architectures force convergence toward a non-involved and non-persuasive register. This register-blind sanitization effectively preserves semantic content but structurally homogenizes the nuanced stylistic markers that define human-authored discourse.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation
arXiv:2604.26768v1 Announce Type: new
Abstract: Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at infe…
arXiv →
Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation
arXiv:2604.26768v1 Announce Type: new Abstract: Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at infe…
arXiv:2604.26768v1 Announce Type: new Abstract: Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at inference time, offering a promising alternative to in-context retrieval augmentation. Despite its potential, many PRAG implementations train document adapters with task-supervised objectives, which may cause each adapter to encode both document-specific facts and reusable task-solving behavior. This entanglement may make adapter composition less reliable: when multiple adapters are merged at inference time, their overlapping task behaviors can accumulate together with document-specific updates, potentially making the merged adapter less stable and less focused on the intended document knowledge. To examine this issue, we explore Orthogonal Subspace Decomposition (OSD), an adapter-training setup that separates reusable task behavior from document-specific knowledge adapters. Concretely, we first train a Task LoRA to capture reusable task behavior, and then train document LoRAs to encode document-specific knowledge in a orthogonal subspace. This setup provides a controlled way to examine how orthogonalizing task and document LoRA updates affects adapter composition in multi-document PRAG. Experiments across multiple knowledge-intensive tasks and model scales suggest that this orthogonalization strategy can improve compositional robustness in parametric RAG, especially when multiple document adapters are merged.
▶
arXiv cs.CL
30.04.2026
What Kind of Language is Easy to Language-Model Under Curriculum Learning?
arXiv:2604.26844v1 Announce Type: new
Abstract: Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-ver…
arXiv →
What Kind of Language is Easy to Language-Model Under Curriculum Learning?
arXiv:2604.26844v1 Announce Type: new Abstract: Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-ver…
arXiv:2604.26844v1 Announce Type: new Abstract: Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-verb-subject word order) or impossible languages to very common combinations of features (e.g., subject-object-verb word order). One central question is under what conditions such typological tendencies can be predicted, and specifically whether the learning bias of language models (LMs) is sufficient to reproduce such patterns. In this study, we add one dimensionality to such analysis -- the learning scenario for LMs -- to explore its interaction with the inductive bias of LMs. Specifically, as a first study, we examine the effect of curriculum learning (CL), as a developmentally motivated learning scenario, i.e., starting with simpler sentences rather than randomly-ordered input. We expand existing LM-based exploration (El-Naggar et al., 2025a,b) with a simple CL variant and find that CL substantially impacts the apparent inductive bias of LMs.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
MoRFI: Monotonic Sparse Autoencoder Feature Identification
arXiv:2604.26866v1 Announce Type: new
Abstract: Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of…
arXiv →
MoRFI: Monotonic Sparse Autoencoder Feature Identification
arXiv:2604.26866v1 Announce Type: new Abstract: Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of…
arXiv:2604.26866v1 Announce Type: new Abstract: Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI filters SAE features that respond monotonically to controlled fine-tuning data mixtures of a target property. Our findings show that exposure to unknown facts disrupts the model's ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers them across distinct models, recovering knowledge through single-latent interventions.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering
arXiv:2604.26880v1 Announce Type: new
Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or a…
arXiv →
HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering
arXiv:2604.26880v1 Announce Type: new Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or a…
arXiv:2604.26880v1 Announce Type: new Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by focusing on grounded question answering over EHRs, and this paper presents the system developed by the HealthNLP_Retrievers team for this task. The proposed approach uses a multi-stage cascaded pipeline powered by the Gemini 2.5 Pro large language model to interpret patient-authored questions and retrieve relevant evidence from lengthy clinical notes. Our architecture comprises four integrated modules: (1) a few-shot query reformulation unit which summarizes verbose patient queries; (2) a heuristic-based evidence scorer which ranks clinical sentences to prioritize recall; (3) a grounded response generator which synthesizes professional-caliber answers restricted strictly to identified evidence; and (4) a high-precision many-to-many alignment framework which links generated answers to supporting clinical sentences. This cascaded approach achieved competitive results. Across the individual tracks, the system ranked 1st in question interpretation, 5th in answer generation, 7th in evidence identification, and 9th in answer-evidence alignment. These results show that integrating large language models within a structured multi-stage pipeline improves grounding, precision, and the professional quality of patient-oriented health communication. To support reproducibility, our source code is publicly available in our GitHub repository
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Select to Think: Unlocking SLM Potential with Local Sufficiency
arXiv:2604.26940v1 Announce Type: new
Abstract: Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by thei…
arXiv →
Select to Think: Unlocking SLM Potential with Local Sufficiency
arXiv:2604.26940v1 Announce Type: new Abstract: Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by thei…
arXiv:2604.26940v1 Announce Type: new Abstract: Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token consistently resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM's top-8 candidates capture the 32B LLM's choice with 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
▶
arXiv cs.CL
30.04.2026
A Quantitative Confirmation of the Currier Language Distinction
arXiv:2604.25979v1 Announce Type: cross
Abstract: We present a quantitative analysis of character-pair substitution ratios in the Voynich manuscript, testing whether Currier's A/B language distinctio…
arXiv →
A Quantitative Confirmation of the Currier Language Distinction
arXiv:2604.25979v1 Announce Type: cross Abstract: We present a quantitative analysis of character-pair substitution ratios in the Voynich manuscript, testing whether Currier's A/B language distinctio…
arXiv:2604.25979v1 Announce Type: cross Abstract: We present a quantitative analysis of character-pair substitution ratios in the Voynich manuscript, testing whether Currier's A/B language distinction (1976) reflects a genuine structural property of the text. A Beta-Binomial mixture model applied to raw character counts without access to labels recovers the Currier split with ARI = 0.383. A supervised Beta-Binomial classifier trained on a subset of folios predicts the A/B identity of held-out folios at 89.2% accuracy. The character pairs separate into three functional regimes that constrain any theory of the Voynich writing system.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent
arXiv:2604.26102v1 Announce Type: cross
Abstract: Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context…
arXiv →
SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent
arXiv:2604.26102v1 Announce Type: cross Abstract: Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context…
arXiv:2604.26102v1 Announce Type: cross Abstract: Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within a single context window, forcing agents to interleave exploratory viewing with strictly formatted edit generation. This causes irrelevant information to accumulate and degrades agent performance. To address this, we propose SWE-Edit, which decomposes code editing into two specialized subagents: a Viewer that extracts task-relevant code on demand, and an Editor that executes modifications from high-level plans--allowing the main agent to focus on reasoning while delegating context-intensive operations to clean context windows. We further investigate what makes an effective editing model: observing that the prevalent find-and-replace format is error-prone, we train Qwen3-8B with GRPO to adaptively select editing modes, yielding improved editing efficiency over single-format baselines. On SWE-bench Verified, SWE-Edit improves resolved rate by 2.1% while reducing inference cost by 17.9%. We additionally propose a code editing benchmark that reliably predicts downstream agentic performance, providing practical guidance for editing model selection. Our code is publicly available at https://github.com/microsoft/SWE-Edit.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
arXiv:2604.26136v1 Announce Type: cross
Abstract: Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, …
arXiv →
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
arXiv:2604.26136v1 Announce Type: cross Abstract: Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, …
arXiv:2604.26136v1 Announce Type: cross Abstract: Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations
arXiv:2604.26148v1 Announce Type: cross
Abstract: AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modalit…
arXiv →
Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations
arXiv:2604.26148v1 Announce Type: cross Abstract: AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modalit…
arXiv:2604.26148v1 Announce Type: cross Abstract: AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering
arXiv:2604.26176v1 Announce Type: cross
Abstract: The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answeri…
arXiv →
CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering
arXiv:2604.26176v1 Announce Type: cross Abstract: The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answeri…
arXiv:2604.26176v1 Announce Type: cross Abstract: The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain $\rightarrow$ Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).
▶
arXiv cs.CL
30.04.2026
Flashback: A Reversible Bilateral Run-Peeling Decomposition of Strings
arXiv:2604.26190v1 Announce Type: cross
Abstract: We introduce Flashback, a reversible string decomposition that repeatedly peels the maximal leading and trailing character runs from a sentinel-wrapp…
arXiv →
Flashback: A Reversible Bilateral Run-Peeling Decomposition of Strings
arXiv:2604.26190v1 Announce Type: cross Abstract: We introduce Flashback, a reversible string decomposition that repeatedly peels the maximal leading and trailing character runs from a sentinel-wrapp…
arXiv:2604.26190v1 Announce Type: cross Abstract: We introduce Flashback, a reversible string decomposition that repeatedly peels the maximal leading and trailing character runs from a sentinel-wrapped input, recording each pair as one bilateral token. Decomposition and reconstruction both run in O(n) time and space. Our central result is a run-pairing theorem: Flashback is equivalent to pairing the first run of the string with the last, the second with the second-to-last, and so on. This gives an exact token count of 1+[r/2] for a string with r maximal runs, and matches a lower bound that holds for any admissible bilateral run-peeling scheme. From the run-pairing theorem the main structural properties follow as corollaries: the irreducible peeling kernel uses at most two symbols; palindromes are precisely the strings whose run-length encoding is symmetric with an odd number of runs; the image of the decomposition admits an explicit finite-state characterisation; and changing one run length rewrites exactly one content token.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
arXiv:2604.26326v1 Announce Type: cross
Abstract: Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from perform…
arXiv →
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
arXiv:2604.26326v1 Announce Type: cross Abstract: Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from perform…
arXiv:2604.26326v1 Announce Type: cross Abstract: Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts have tried to prevent entropy collapse through regularization or clipping, but their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
arXiv:2604.26347v1 Announce Type: cross
Abstract: Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring e…
arXiv →
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
arXiv:2604.26347v1 Announce Type: cross Abstract: Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring e…
arXiv:2604.26347v1 Announce Type: cross Abstract: Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals it rewards acoustic mimicry over genuine emotional synthesis.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
arXiv:2604.26779v1 Announce Type: cross
Abstract: RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central…
arXiv →
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
arXiv:2604.26779v1 Announce Type: cross Abstract: RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central…
arXiv:2604.26779v1 Announce Type: cross Abstract: RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models or even techniques such as Eagle3, which are traditionally applied after RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to 2.5x end-to-end training speedup at 235B scale.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
arXiv:2604.26923v1 Announce Type: cross
Abstract: LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between t…
arXiv →
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
arXiv:2604.26923v1 Announce Type: cross Abstract: LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between t…
arXiv:2604.26923v1 Announce Type: cross Abstract: LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery
arXiv:2502.14912v2 Announce Type: replace
Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framewo…
arXiv →
Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery
arXiv:2502.14912v2 Announce Type: replace Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framewo…
arXiv:2502.14912v2 Announce Type: replace Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall
arXiv:2505.13963v3 Announce Type: replace
Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization's…
arXiv →
Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall
arXiv:2505.13963v3 Announce Type: replace Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization's…
arXiv:2505.13963v3 Announce Type: replace Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization's effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To this end, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks, knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly amplified in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance and occasionally quantization may even enhance model FKR. We find that BitSandBytes demonstrates highest preservation of the original full-precision model's FKR. Despite variability across models and methods, quantization causes modest performance degradation and remains an effective compression strategy.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
arXiv:2505.21072v5 Announce Type: replace
Abstract: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance i…
arXiv →
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
arXiv:2505.21072v5 Announce Type: replace Abstract: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance i…
arXiv:2505.21072v5 Announce Type: replace Abstract: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
arXiv:2505.22897v2 Announce Type: replace
Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less atte…
arXiv →
VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
arXiv:2505.22897v2 Announce Type: replace Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less atte…
arXiv:2505.22897v2 Announce Type: replace Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Talent or Luck? Evaluating Attribution Bias in Large Language Models
arXiv:2505.22910v2 Announce Type: replace
Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event …
arXiv →
Talent or Luck? Evaluating Attribution Bias in Large Language Models
arXiv:2505.22910v2 Announce Type: replace Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event …
arXiv:2505.22910v2 Announce Type: replace Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channelize biases toward demographic groups.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
arXiv:2506.20876v4 Announce Type: replace
Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in ad…
arXiv →
Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
arXiv:2506.20876v4 Announce Type: replace Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in ad…
arXiv:2506.20876v4 Announce Type: replace Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of majority users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. In this position paper, developed with expert input, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper-bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: Difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached as an interactive communication problem, rather than an end-to-end process.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
arXiv:2507.01449v3 Announce Type: replace
Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in par…
arXiv →
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
arXiv:2507.01449v3 Announce Type: replace Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in par…
arXiv:2507.01449v3 Announce Type: replace Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
arXiv:2509.11295v2 Announce Type: replace
Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs…
arXiv →
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
arXiv:2509.11295v2 Announce Type: replace Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs…
arXiv:2509.11295v2 Announce Type: replace Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We breakdown the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and shouldn't be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations, agentic tools like Claude Code, while analyzing the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms, discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature
arXiv:2509.16591v2 Announce Type: replace
Abstract: Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. H…
arXiv →
Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature
arXiv:2509.16591v2 Announce Type: replace Abstract: Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. H…
arXiv:2509.16591v2 Announce Type: replace Abstract: Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regulator rather than a core optimization driver. To fully leverage the potential of entropy and achieve fine-grained regulation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that continuously adapts optimization dynamics based on token-level entropy throughout the entire training process. Our algorithm includes four key components: (1) Adaptive Temperature Sampling that adjusts sampling temperature in real time, promoting exploration at high-entropy tokens. (2) Token-Level Group Average Advantage Estimation that estimates advantages at token level, accounting for sequence-length effects while preserving non-biased treatment.(3) Differential Advantage Redistribution that leverages entropy and importance ratios to adjust advantages for tokens with clear signals. (4) Asymmetric Adaptive Clipping that adynamically adjusts clipping boundaries based on token-level entropy. Through systematic investigation of entropy, we embed token-level treatment into every stage. Extensive experiments on mathematical reasoning, code, and logic tasks across multiple models demonstrate HAPO's consistent superiority over DAPO. Our code can be found in https://github.com/starriver030515/HAPO.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
arXiv:2510.04214v3 Announce Type: replace
Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The a…
arXiv →
Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
arXiv:2510.04214v3 Announce Type: replace Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The a…
arXiv:2510.04214v3 Announce Type: replace Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF) (mainly regex-based) for deterministic checks on numerics, formatting, and guardrails. In expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes. In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p<0.001).
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models
arXiv:2510.14438v2 Announce Type: replace
Abstract: The hallmark of Deep Research agents lies in compositional reasoning, the capacity to aggregate distributed, heterogeneous information into coheren…
arXiv →
WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models
arXiv:2510.14438v2 Announce Type: replace Abstract: The hallmark of Deep Research agents lies in compositional reasoning, the capacity to aggregate distributed, heterogeneous information into coheren…
arXiv:2510.14438v2 Announce Type: replace Abstract: The hallmark of Deep Research agents lies in compositional reasoning, the capacity to aggregate distributed, heterogeneous information into coherent logical insights. However, current agentic systems are often retrieval-heavy but reasoning-light, where success is predominantly determined by simple entity-seeking rather than the multi-step aggregation of scattered evidence. To address this, we propose a data synthesis pipeline WebAggregator, designed to shift the agentic paradigm from retrieval-centric to compositional aggregation. Our approach first employs Proactive Explorer to collect interconnected knowledge, then Compositional Logic Proposer to weave knowledge into complex questions using over 12 composition guidelines derived from a rigorous deconstruction of the Deep Research problem setting. By leveraging 10K verifiable QA pairs grounded on 50K websites, we curate a high-quality SFT dataset via rejection sampling. Fine-tuning on this corpus fundamentally transforms agent behavior, fostering deliberate composition reasoning and reduced tool redundancy. The resulting WebAggregator-32B surpasses GPT-4.1 and matches Claude-3.7-Sonnet on GAIA, WebWalkerQA, and XBench. To address the lack of benchmarks that emphasize both reasoning and retrieval, we introduce the WebAggregatorQA testbed, which reveals that even with perfect retrieval, top-tier models still underperformed. These results demonstrate that compositional reasoning, not retrieval, is the true performance ceiling for next-generation research agents.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity
arXiv:2510.17548v2 Announce Type: replace
Abstract: Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity,…
arXiv →
When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity
arXiv:2510.17548v2 Announce Type: replace Abstract: Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity,…
arXiv:2510.17548v2 Announce Type: replace Abstract: Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
arXiv:2511.22972v3 Announce Type: replace
Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive gen…
arXiv →
Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
arXiv:2511.22972v3 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive gen…
arXiv:2511.22972v3 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Mapping the maturation of TCM as an adjuvant to radiotherapy
arXiv:2601.11923v2 Announce Type: replace
Abstract: The integration of complementary medicine into oncology represents a paradigm shift that has seen to increasing adoption of Traditional Chinese Med…
arXiv →
Mapping the maturation of TCM as an adjuvant to radiotherapy
arXiv:2601.11923v2 Announce Type: replace Abstract: The integration of complementary medicine into oncology represents a paradigm shift that has seen to increasing adoption of Traditional Chinese Med…
arXiv:2601.11923v2 Announce Type: replace Abstract: The integration of complementary medicine into oncology represents a paradigm shift that has seen to increasing adoption of Traditional Chinese Medicine (TCM) as an adjuvant to radiotherapy. About twenty-five years since the formal institutionalization of integrated oncology, it is opportune to synthesize the trajectory of evidence for TCM as an adjuvant to radiotherapy. Here we conduct a large-scale analysis of 69,745 publications (2000 - 2025), emerging a cyclical evolution defined by coordinated expansion and contraction in publication output, international collaboration, and funding commitments that mirrors a define-ideate-test pattern. Using a theme modeling workflow designed to determine a stable thematic structure of the field, we identify five dominant thematic axes - cancer types, supportive care, clinical endpoints, mechanisms, and methodology - that signal a focus on patient well-being, scientific rigor and mechanistic exploration. Cross-theme integration of TCM is patient-centered and systems-oriented. Together with the emergent cycles of evolution, the thematic structure demonstrates progressive specialization and potential defragmentation of the field or saturation of existing research agenda. The analysis points to a field that has matured its current research agenda and is likely at the cusp of something new. Additionally, the field exhibits positive reporting of findings that is homogeneous across publication types, thematic areas, and the cycles of evolution suggesting a system-wide positive reporting bias agnostic to structural drivers.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Verified Critical Step Optimization for LLM Agents
arXiv:2602.03412v2 Announce Type: replace
Abstract: As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundament…
arXiv →
Verified Critical Step Optimization for LLM Agents
arXiv:2602.03412v2 Announce Type: replace Abstract: As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundament…
arXiv:2602.03412v2 Announce Type: replace Abstract: As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification-based learning for agent post-training.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Thinking with Drafting: Optical Decompression via Logical Reconstruction
arXiv:2602.11731v2 Announce Type: replace
Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision par…
arXiv →
Thinking with Drafting: Optical Decompression via Logical Reconstruction
arXiv:2602.11731v2 Announce Type: replace Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision par…
arXiv:2602.11731v2 Announce Type: replace Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression-the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serve as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
arXiv:2603.06198v2 Announce Type: replace
Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving docu…
arXiv →
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
arXiv:2603.06198v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving docu…
arXiv:2603.06198v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
arXiv:2603.16496v2 Announce Type: replace
Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step…
arXiv →
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
arXiv:2603.16496v2 Announce Type: replace Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step…
arXiv:2603.16496v2 Announce Type: replace Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Reasoning Gets Harder for LLMs Inside A Dialogue
arXiv:2603.20133v2 Announce Type: replace
Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that …
arXiv →
Reasoning Gets Harder for LLMs Inside A Dialogue
arXiv:2603.20133v2 Announce Type: replace Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that …
arXiv:2603.20133v2 Announce Type: replace Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages
arXiv:2604.00706v2 Announce Type: replace
Abstract: Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at commu…
arXiv →
AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages
arXiv:2604.00706v2 Announce Type: replace Abstract: Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at commu…
arXiv:2604.00706v2 Announce Type: replace Abstract: Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Retrieval-Augmented Multimodal Model for Fake News Detection
arXiv:2604.18112v2 Announce Type: replace
Abstract: In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significan…
arXiv →
Retrieval-Augmented Multimodal Model for Fake News Detection
arXiv:2604.18112v2 Announce Type: replace Abstract: In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significan…
arXiv:2604.18112v2 Announce Type: replace Abstract: In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significant challenges: (1) Failure to Capture Cross-Instance Narrative Consistency: existing models usually evaluate each news in isolation, fail to capture cross-instance narrative consistency, and thus struggle to address the spread of cluster based fake news driven by social media; (2) Lack of Domain Specific Knowledge for Reasoning: conventional models, which rely solely on knowledge encoded in their parameters during training, struggle to generalize to new or data-scarce domains (e.g., emerging events or niche topics). To tackle these challenges, we introduce Retrieval-Augmented Multimodal Model for Fake News Detection (RAMM). First, RAMM employs a Multimodal Large Language Model (MLLM) as its backbone to capture cross-modal semantic information from news samples. Second, RAMM incorporates an Abstract Narrative Alignment Module. This component adaptively extracts abstract narrative consistency from diverse instances across distinct domains, aggregates relevant knowledge, and thereby enables the modeling of high-level narrative information. Finally, RAMM introduces a Semantic Representation Alignment Module, which aligns the model's decision-making paradigm with that of humans - specifically, it shifts the model's reasoning process from direct inference on multimodal features to an instance-based analogical reasoning process. Extensive experimental results on three public datasets validate the efficacy of our proposed approach. Our code is available at the following link: https://github.com/li-yiheng/RAMM
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
arXiv:2604.19572v2 Announce Type: replace
Abstract: As terminal agents scale to long-horizon, multi-turn workflows, a key bottleneck is not merely limited context length, but the accumulation of nois…
arXiv →
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
arXiv:2604.19572v2 Announce Type: replace Abstract: As terminal agents scale to long-horizon, multi-turn workflows, a key bottleneck is not merely limited context length, but the accumulation of nois…
arXiv:2604.19572v2 Announce Type: replace Abstract: As terminal agents scale to long-horizon, multi-turn workflows, a key bottleneck is not merely limited context length, but the accumulation of noisy terminal observations in the interaction history. Retaining raw observations preserves useful environment feedback, but also leads to context saturation and high token cost; conversely, naive compression may discard task-critical signals needed for subsequent actions. Because terminal environments are highly heterogeneous across repositories, commands, and execution states, heuristic-based or fixed-prompt compression methods are difficult to generalize. We propose TACO, a plug-and-play, training-free, self-evolving Terminal Agent Compression framework for existing terminal agents. TACO automatically discovers, refines, and reuses structured compression rules from interaction trajectories, enabling workflow-adaptive filtering of low-value terminal outputs while preserving task-relevant observations. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks, including SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench, show that TACO consistently improves task performance and token efficiency across agent scaffolds and backbone models. On TerminalBench, TACO yields 1%-4% accuracy gains across strong agentic models and improves accuracy by around 2%-3% under the same token budget. On additional terminal-related benchmarks, it reduces total token consumption while maintaining or improving task success rates. These results suggest that self-evolving, workflow-adaptive observation compression is an effective path toward more reliable and efficient long-horizon terminal agents. The code is publicly available at https://github.com/multimodal-art-projection/TACO.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Evaluation of Automatic Speech Recognition Using Generative Large Language Models
arXiv:2604.21928v2 Announce Type: replace
Abstract: Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based…
arXiv →
Evaluation of Automatic Speech Recognition Using Generative Large Language Models
arXiv:2604.21928v2 Announce Type: replace Abstract: Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based…
arXiv:2604.21928v2 Announce Type: replace Abstract: Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92--94\% agreement with human annotators for hypothesis selection, compared to 63\% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
▶
arXiv cs.CL
30.04.2026
Crime Hotspot Prediction Using Deep Graph Convolutional Networks
arXiv:2506.13116v2 Announce Type: replace-cross
Abstract: Crime hotspot prediction is critical for ensuring urban safety and effective law enforcement, it remains challenging due to complex spatial d…
arXiv →
Crime Hotspot Prediction Using Deep Graph Convolutional Networks
arXiv:2506.13116v2 Announce Type: replace-cross Abstract: Crime hotspot prediction is critical for ensuring urban safety and effective law enforcement, it remains challenging due to complex spatial d…
arXiv:2506.13116v2 Announce Type: replace-cross Abstract: Crime hotspot prediction is critical for ensuring urban safety and effective law enforcement, it remains challenging due to complex spatial dependencies that are inherent in criminal activities. The traditional approaches use classical algorithms such as the KDE and SVM to model data distributions and decision boundaries. The methods often fail to capture these spatial relationships, treating crime events as independent and ignoring geographical interactions. To address this, we propose a novel framework based on Graph Convolutional Networks (GCNs), which explicitly model all of spatial dependencies by representing crime data as a graph. In this graph, nodes represent discrete geographic grid cells and edges capture proximity relationships. The spatial features from Chicago Crime Dataset are used in this system, a multi-layer GCN model is trained to classify crime types and predict high-risk zones. Our approach significantly outperforms traditional approaches, achieving 78% classification accuracy. Moreover, the model generates interpretable heat maps of crime hotspots, demonstrating the usefulness of graph-based learning for predictive policing and spatial criminology.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
arXiv:2507.21420v3 Announce Type: replace-cross
Abstract: The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing effic…
arXiv →
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
arXiv:2507.21420v3 Announce Type: replace-cross Abstract: The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing effic…
arXiv:2507.21420v3 Announce Type: replace-cross Abstract: The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Failure Modes of Maximum Entropy RLHF
arXiv:2509.20265v3 Announce Type: replace-cross
Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theo…
arXiv →
Failure Modes of Maximum Entropy RLHF
arXiv:2509.20265v3 Announce Type: replace-cross Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theo…
arXiv:2509.20265v3 Announce Type: replace-cross Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL frequently exhibits overoptimization and unstable KL dynamics across model scales, with overoptimization persisting even at conservative learning rates for some configurations. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to reliably prevent reward hacking and, in our experiments, correlates with the onset of overoptimization rather than guarding against it. Even in configurations where training remains stable, entropy regularization is not the stabilizing factor. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online versus offline preference learning.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
arXiv:2510.20956v2 Announce Type: replace-cross
Abstract: We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreak…
arXiv →
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
arXiv:2510.20956v2 Announce Type: replace-cross Abstract: We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreak…
arXiv:2510.20956v2 Announce Type: replace-cross Abstract: We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreaking. Specifically, after benign reasoning training on math or code domains, RLMs will use multiple strategies to circumvent their own safety guardrails. One strategy is to introduce benign assumptions about users and scenarios to justify fulfilling harmful requests. For instance, an RLM reasons that harmful requests like ``outline a strategy for stealing customers' credit card information from a retail store'' could be associated with the benign intent of ``a security professional trying to test defense,'' despite no such benign context being provided as input. We observe that many open-weight RLMs, including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking despite being aware of the harmfulness of the requests. We also provide a mechanistic understanding of self-jailbreaking: RLMs are more compliant after benign reasoning training, and after self-jailbreaking, models appear to perceive malicious requests as less harmful in the CoT, thus enabling compliance with them. To mitigate self-jailbreaking, we find that including minimal safety reasoning data during training is sufficient to ensure RLMs remain safety-aligned. Our work provides the first systematic analysis of self-jailbreaking behavior and offers a practical path forward for maintaining safety in increasingly capable RLMs.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
arXiv:2510.21828v2 Announce Type: replace-cross
Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal lar…
arXiv →
Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
arXiv:2510.21828v2 Announce Type: replace-cross Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal lar…
arXiv:2510.21828v2 Announce Type: replace-cross Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability enhancement methodologies, this paper makes the following key contributions: (i). An automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought thinking for various STAR tasks and (ii). A comprehsive two-stage capability enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Based upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests
arXiv:2601.17617v3 Announce Type: replace-cross
Abstract: LLM-powered search agents are increasingly being used for multi-step information seeking tasks, yet the IR community lacks empirical understa…
arXiv →
Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests
arXiv:2601.17617v3 Announce Type: replace-cross Abstract: LLM-powered search agents are increasingly being used for multi-step information seeking tasks, yet the IR community lacks empirical understa…
arXiv:2601.17617v3 Announce Type: replace-cross Abstract: LLM-powered search agents are increasingly being used for multi-step information seeking tasks, yet the IR community lacks empirical understanding of how agentic search sessions unfold and how retrieved evidence is reflected in later queries. This paper presents a large-scale log analysis of agentic search based on 14.44M search requests (3.97M sessions) collected from DeepResearchGym, i.e., an open-source search API accessed by external agentic clients. We sessionize the logs, assign session-level intents and step-wise query-reformulation labels using LLM-based annotation, and propose Context-driven Term Adoption Rate (CTAR) to quantify whether newly introduced query terms are lexically traceable to previously retrieved evidence. Our analyses reveal distinctive behavioral patterns. First, over 90\% of multi-turn sessions contain at most ten steps, and 89\% of inter-step intervals fall under one minute. Second, behavior varies by intent. Fact-seeking sessions exhibit high repetition that increases over time, while sessions requiring reasoning sustain broader exploration. Third, query reformulations are often traceable to retrieved evidence across steps. On average, 54\% of newly introduced query terms appear in the accumulated evidence context, with additional traceability to earlier steps beyond the most recent retrieval. These findings provide candidate signals for repetition-aware stopping, intent-adaptive retrieval budgeting, and explicit cross-step context tracking. We released the anonymized logs, making them available at a public HuggingFace~\chref{https://huggingface.co/datasets/cx-cmu/deepresearchgym-agentic-search-logs}{repository}.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
arXiv:2603.04337v2 Announce Type: replace-cross
Abstract: Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large …
arXiv →
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
arXiv:2603.04337v2 Announce Type: replace-cross Abstract: Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large …
arXiv:2603.04337v2 Announce Type: replace-cross Abstract: Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a Pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, achieving a significant improvement over prior command sequence methods, thereby significantly mitigating the topological inaccuracies introduced by quantization error.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
The Collapse of Heterogeneity in Silicon Philosophers
arXiv:2604.23575v2 Announce Type: replace-cross
Abstract: Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with…
arXiv →
The Collapse of Heterogeneity in Silicon Philosophers
arXiv:2604.23575v2 Announce Type: replace-cross Abstract: Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with…
arXiv:2604.23575v2 Announce Type: replace-cross Abstract: Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment-relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from $N = {277}$ professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open-source large language models on their ability to replicate individual philosophical positions and to preserve cross-question correlation structures across philosophical domains. We find that language models substantially over-correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine-tuning and by validating results against the full PhilPapers 2020 Survey ($N = {1785}$). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at https://github.com/stanford-del/silicon-philosophers.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework
arXiv:2604.25933v1 Announce Type: cross
Abstract: As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), whic…
arXiv →
A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework
arXiv:2604.25933v1 Announce Type: cross Abstract: As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), whic…
arXiv:2604.25933v1 Announce Type: cross Abstract: As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020-January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among 36 studies with human involvement, the median number of expert validators was 3, while 13 (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment remained limited, with 1 study (2.0%) reaching production and four (8.2%) prototype stage. Importantly, these gaps may interact: when judges and evaluated systems share training data or architectures, they may inherit similar blind spots, and agreement metrics may fail to distinguish true validity from shared errors. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap where current validation may miss clinically significant errors. To address this, we propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers, providing deployment-oriented evaluation guidance for healthcare LaaJ systems.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
LLM Psychosis: A Theoretical and Diagnostic Framework for Reality-Boundary Failures in Large Language Models
arXiv:2604.25934v1 Announce Type: cross
Abstract: The deployment of large language models (LLMs) as interactive agents has exposed a category of behavioral failure that prevailing terminology, princi…
arXiv →
LLM Psychosis: A Theoretical and Diagnostic Framework for Reality-Boundary Failures in Large Language Models
arXiv:2604.25934v1 Announce Type: cross Abstract: The deployment of large language models (LLMs) as interactive agents has exposed a category of behavioral failure that prevailing terminology, princi…
arXiv:2604.25934v1 Announce Type: cross Abstract: The deployment of large language models (LLMs) as interactive agents has exposed a category of behavioral failure that prevailing terminology, principally hallucination, fails to adequately characterize. This paper introduces LLM Psychosis as a structured theoretical framework for pathological breakdowns in model cognition that exhibit functional resemblance to clinically recognized psychotic disorders. Five hallmark features define the framework: reality-boundary dissolution, persistence of injected false beliefs, logical incoherence under impossible constraints, self-model instability, and epistemic overconfidence. We argue these constitute a qualitatively distinct failure mode rather than a mere intensification of ordinary factual error. To operationalize the framework, we propose the LLM Cognitive Integrity Scale (LCIS), a five-axis diagnostic instrument organized around Environmental Reality Interface (ERI), Premise Arbitration Integrity (PAI), Logical Constraint Recognition (LCR), Self-Model Integrity (SMI), and Epistemic Calibration Integrity (ECI). We administer a targeted adversarial probe battery to ChatGPT 5 (GPT-5, OpenAI) and report empirical findings for each axis, documenting both intact-integrity baseline responses and the specific psychosis-like failure signatures elicited under adversarial escalation. Results support a three-tier severity taxonomy: Type I (Confabulatory), Type II (Delusional), and Type III (Dissociative). We further formalize the delusional gradient, a self-reinforcing dynamic in which correction pressure intensifies rather than resolves psychosis-like states, as the most consequential failure mode for deployed systems. Implications for safety evaluation, high-stakes deployment screening, and mechanistic interpretability research are discussed.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
arXiv:2604.26091v1 Announce Type: new
Abstract: We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX…
arXiv →
Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
arXiv:2604.26091v1 Announce Type: new Abstract: We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX…
arXiv:2604.26091v1 Announce Type: new Abstract: We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
arXiv:2604.26095v1 Announce Type: new
Abstract: {Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer lat…
arXiv →
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
arXiv:2604.26095v1 Announce Type: new Abstract: {Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer lat…
arXiv:2604.26095v1 Announce Type: new Abstract: {Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer latent field parameters under strict time constraints.} {The core challenge lies in the belief-space objective: valid uncertainty estimation requires expensive Bayesian inference, whereas using fast learned belief model leads to reward hacking, in which the policy exploits approximation errors rather than actually reducing uncertainty.} {We propose \textbf{Distill-Belief}, a teacher--student framework that decouples correctness from efficiency. A Bayes-correct particle-filter teacher maintains the posterior and supplies a dense information-gain signal, while a compact student distills the posterior into belief statistics for control and an uncertainty certificate for stopping. At deployment, only the student is used, yielding constant per-step cost.} {Experiments on seven field modalities and two stress tests show that Distill-Belief consistently reduces sensing cost and improves success, posterior contraction, and estimation accuracy over baselines, while mitigating reward hacking.}
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Evaluating Strategic Reasoning in Forecasting Agents
arXiv:2604.26106v1 Announce Type: new
Abstract: Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to …
arXiv →
Evaluating Strategic Reasoning in Forecasting Agents
arXiv:2604.26106v1 Announce Type: new Abstract: Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to …
arXiv:2604.26106v1 Announce Type: new Abstract: Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to the Future 2 (BTF-2), 1,417 pastcasting questions with a frozen 15M-document research corpus in which agents reproducibly research and forecast offline, producing full reasoning traces. BTF-2 detects accuracy differences of 0.004 Brier score, and can distinguish differential agent strengths in research vs. judgment. We build a forecaster 0.011 Brier more accurate than any single frontier agent, and use it to evaluate agent strategic reasoning without hindsight bias. We find the better forecaster differs primarily in its pre-mortem analysis of its blind spots and consideration of black swans. Expert human forecasters found the dominant strategic reasoning failures of frontier agents are in assessing political and business leaders' incentives, judging their likelihood to follow through on stated plans, and modeling institutional processes.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Hierarchical Multi-Persona Induction from User Behavioral Logs: Learning Evidence-Grounded and Truthful Personas
arXiv:2604.26120v1 Announce Type: new
Abstract: Behavioral logs provide rich signals for user modeling, but are noisy and interleaved across diverse intents. Recent work uses LLMs to generate interpr…
arXiv →
Hierarchical Multi-Persona Induction from User Behavioral Logs: Learning Evidence-Grounded and Truthful Personas
arXiv:2604.26120v1 Announce Type: new Abstract: Behavioral logs provide rich signals for user modeling, but are noisy and interleaved across diverse intents. Recent work uses LLMs to generate interpr…
arXiv:2604.26120v1 Announce Type: new Abstract: Behavioral logs provide rich signals for user modeling, but are noisy and interleaved across diverse intents. Recent work uses LLMs to generate interpretable natural-language personas from user logs, yet evaluation often emphasizes downstream utility, providing limited assurance of persona quality itself. We propose a hierarchical framework that aggregates user actions into intent memories and induces multiple evidence-grounded personas by clustering and labeling these memories. We formulate persona induction as an optimization problem over persona quality-captured by cluster cohesion, persona-evidence alignment, and persona truthfulness-and train the persona model using a groupwise extension of Direct Preference Optimization (DPO). Experiments on a large-scale service log and two public datasets show that our method induces more coherent, evidence-grounded, and trustworthy personas, while also improving future interaction prediction.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms
arXiv:2604.26211v1 Announce Type: new
Abstract: In order to automate AI research we introduce a full, end-to-end framework, OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms, that…
arXiv →
OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms
arXiv:2604.26211v1 Announce Type: new Abstract: In order to automate AI research we introduce a full, end-to-end framework, OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms, that…
arXiv:2604.26211v1 Announce Type: new Abstract: In order to automate AI research we introduce a full, end-to-end framework, OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms, that starts at idea generation and ends with executable code. Our system combines structured meta-prompt engineering with executable code generation to create new ML classifiers. The OMEGA framework has been utilized to generate several novel algorithms that outperform scikit-learn baselines across a robust selection of 20 benchmark datasets (infinity-bench). You can access models discussed in this paper and more in the python package: pip install omega-models.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Persuadability and LLMs as Legal Decision Tools
arXiv:2604.26233v1 Announce Type: new
Abstract: As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and admi…
arXiv →
Persuadability and LLMs as Legal Decision Tools
arXiv:2604.26233v1 Announce Type: new Abstract: As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and admi…
arXiv:2604.26233v1 Announce Type: new Abstract: As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and administrative contexts, it becomes essential to explore how they answer legal questions, and in particular the factors that lead them to decide difficult questions in one way or another. A specific feature of legal decisions is the need to respond to arguments advanced by contending parties. A legal decision-maker must be able to engage with, and respond to, including through being potentially persuaded by, arguments advanced by the parties. Conversely, they should not be unduly persuadable, influenced by a particularly compelling advocate to decide cases based on the skills of the advocates, rather than the merits of the case. We explore how frontier open- and closed-weights LLMs respond to legal arguments, reporting original experimental results examining how the quality of the advocate making those arguments affects the likelihood that a model will agree with a particular legal point of view, and exploring the factors driving these results. Our results have implications for the feasibility of adopting LLMs across legal and administrative settings.
▶
arXiv cs.AI
30.04.2026
Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level, Intervention, and Outcome
arXiv:2604.26237v1 Announce Type: new
Abstract: This study applied the Apriori algorithm to analyze behavioral interaction patterns associated with learned helplessness (LH) in mathematics tutoring s…
arXiv →
Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level, Intervention, and Outcome
arXiv:2604.26237v1 Announce Type: new Abstract: This study applied the Apriori algorithm to analyze behavioral interaction patterns associated with learned helplessness (LH) in mathematics tutoring s…
arXiv:2604.26237v1 Announce Type: new Abstract: This study applied the Apriori algorithm to analyze behavioral interaction patterns associated with learned helplessness (LH) in mathematics tutoring system logs. Interaction data were examined across three dimensions: LH level (low vs. high), system-based intervention (with vs. without), and problem-solving outcomes (solved vs. unsolved). The analysis of the complete dataset showed that skipping problems without using hints was the most frequent pattern linked to unsolved outcomes, while persistence behaviors such as not skipping were less dominant overall. Comparisons by LH level showed that low-LH students had stronger links between problem solving and not skipping, as well as positive associations between hint use and solved outcomes. High-LH students showed more avoidance patterns, with skipping strongly tied to unsolved outcomes. In the comparison of system-based intervention conditions, students without intervention had the highest lift for persistence-success links, while the with-intervention group had stronger patterns involving skipping behaviors leading to unsolved outcomes. Outcome-specific analysis showed that not skipping was consistently associated with solved problems across all groups, while skipping without hints predicted unsolved outcomes. Practical implications and recommendations are discussed.
▶
arXiv cs.AI
30.04.2026
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
arXiv:2604.25937v1 Announce Type: cross
Abstract: Recent advancements in Text-to-Song generation have enabled realistic musical content production, yet existing evaluation benchmarks lack the profess…
arXiv →
SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
arXiv:2604.25937v1 Announce Type: cross Abstract: Recent advancements in Text-to-Song generation have enabled realistic musical content production, yet existing evaluation benchmarks lack the profess…
arXiv:2604.25937v1 Announce Type: cross Abstract: Recent advancements in Text-to-Song generation have enabled realistic musical content production, yet existing evaluation benchmarks lack the professional granularity to capture multi-dimensional aesthetic nuances. In this paper, we propose SongBench, a specialized framework for fine-grained song assessment across seven key dimensions: Vocal, Instrument, Melody, Structure, Arrangement, Mixing, and Musicality. Utilizing this framework, we construct an expert-annotated database comprising 11,717 samples from state-of-the-art models, labeled by music professionals. Extensive experimental results demonstrate that SongBench achieves high correlation with expert ratings. By revealing fine-grained performance gaps in current state-of-the-art models, SongBench serves as a diagnostic benchmark to steer the development toward more professional and musically coherent song generation.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent
arXiv:2604.26311v1 Announce Type: new
Abstract: We introduce DreamProver, an agentic framework that leverages a "wake-sleep" program induction paradigm to discover reusable lemmas for formal theorem …
arXiv →
DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent
arXiv:2604.26311v1 Announce Type: new Abstract: We introduce DreamProver, an agentic framework that leverages a "wake-sleep" program induction paradigm to discover reusable lemmas for formal theorem …
arXiv:2604.26311v1 Announce Type: new Abstract: We introduce DreamProver, an agentic framework that leverages a "wake-sleep" program induction paradigm to discover reusable lemmas for formal theorem proving. Existing approaches either rely on fixed lemma libraries, which limit adaptability, or synthesize highly specific intermediate lemmas tailored to individual theorems, thereby lacking generality. DreamProver addresses this gap through an iterative two-stage process. In the wake stage, DreamProver attempts to prove theorems from a training set using the current lemma library while proposing new candidate lemmas. In the "sleep" stage, it abstracts, refines, and consolidates these candidates to compress and optimize the library. Through this alternating cycle, DreamProver progressively evolves a compact set of high-level, transferable lemmas that can be effectively used to prove unseen theorems in related domains. Experimental results demonstrate that DreamProver substantially improves proof success rates across a diverse set of mathematical benchmarks, while also producing more concise proofs and reducing computational cost.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Auto-Relational Reasoning
arXiv:2604.26507v1 Announce Type: new
Abstract: Background & Objectives: In the last decade, Machine learning research has grown rapidly, but large models are reaching their soft limits demonstrating…
arXiv →
Auto-Relational Reasoning
arXiv:2604.26507v1 Announce Type: new Abstract: Background & Objectives: In the last decade, Machine learning research has grown rapidly, but large models are reaching their soft limits demonstrating…
arXiv:2604.26507v1 Announce Type: new Abstract: Background & Objectives: In the last decade, Machine learning research has grown rapidly, but large models are reaching their soft limits demonstrating diminishing returns and still lack solid reasoning abilities. These limits could be surpassed through synergistic combination of Machine Learning scalability and rigid reasoning. Methods: In this work, we propose a theoretical framework for reasoning through object-relations in an automated manner integrated with Artificial Neural Networks. We present a formal analysis of the Reasoning, and we show the theory in practice through a paradigm integrating Reasoning and Machine Learning. Results: This paradigm is a system that solves Intelligence Quotient problems without any prior knowledge of the problem. Our system achieves 98.03% solving rate corresponding to the top 1% percentile or 132-144 iq score. This result is only limited by the small size of the model and the processing capabilities of the machine it run on. Conclusions: With the integration of prior knowledge in the system and the expansion of the dataset, the system can be generalized to solve a large category of problems. The functionality of the system inherently favors the solution of such problems in few-shot or zero-shot attempts.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems
arXiv:2604.26521v1 Announce Type: new
Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requirin…
arXiv →
Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems
arXiv:2604.26521v1 Announce Type: new Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requirin…
arXiv:2604.26521v1 Announce Type: new Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network ($i$LTN), a fully differentiable architecture designed for multi-step deduction. Using a formal taxonomy of generalization -- probing for novel entities, unseen relations, and complex rule compositions -- we demonstrate that a model trained solely on a grounding objective fails to generalize. In contrast, our full $i$LTN, trained jointly on perceptual grounding and multi-step reasoning, achieves high zero-shot accuracy across all tasks. Our findings provide conclusive evidence that symbol grounding, while necessary, is insufficient for generalization, establishing that reasoning is not an emergent property but a distinct capability that requires an explicit learning objective.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents
arXiv:2604.26522v1 Announce Type: new
Abstract: Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environment…
arXiv →
AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents
arXiv:2604.26522v1 Announce Type: new Abstract: Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environment…
arXiv:2604.26522v1 Announce Type: new Abstract: Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL-Comp, a neuro-symbolic AI agent architecture designed to address this challenge by grounding actions of the agent. AGEL-Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an LLM proposes a set of candidate sub-goals that are verified for logical consistency by a Neural Theorem Prover (NTP). Together, these components operationalize a deduction--abduction learning cycle: enabling the agent to deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps its reasoning engine aligned with new knowledge. We propose an evaluation protocol within the \texttt{Retro Quest} simulation environment to probe for compositional generalization scenarios to evaluate our AGEL agent. Our findings clearly indicate the better performance of our AGEL model over pure LLM-based models. Our framework presents a principled path toward agents that build an explicit, interpretable, and compositionally structured understanding of their world.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards
arXiv:2604.26733v1 Announce Type: new
Abstract: Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using lar…
arXiv →
FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards
arXiv:2604.26733v1 Announce Type: new Abstract: Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using lar…
arXiv:2604.26733v1 Announce Type: new Abstract: Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from real-world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates viewing it as a learning environment. Prior works have explored future prediction from several different parts, but have generally not framed it as a unified learning environment. This task is appealing for learning because it can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of live future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameters update. In our environment, we take three open-source base models and train them for consecutive days. The results show that training is effective. Furthermore, we build a daily benchmark based on the environment and evaluate several frontier agents on it to establish performance baselines for current agent systems.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
arXiv:2604.26577v1 Announce Type: new
Abstract: Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this…
arXiv →
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
arXiv:2604.26577v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this…
arXiv:2604.26577v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4\%, with more than half exceeding 50\%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7\% versus 72.8\%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
arXiv:2604.26607v1 Announce Type: new
Abstract: As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a …
arXiv →
Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
arXiv:2604.26607v1 Announce Type: new Abstract: As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a …
arXiv:2604.26607v1 Announce Type: new Abstract: As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles the bottleneck issue by suggesting a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisted of open-weight models -- Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) -- and proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked "Architecture-compatibility gap". Although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved "Fair Agreement" (kappa_w ~ 0.38), the larger Orion (70B) model exhibited "No Agreement" (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs the scale of raw parameters in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
arXiv:2604.26644v1 Announce Type: new
Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-…
arXiv →
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
arXiv:2604.26644v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-…
arXiv:2604.26644v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data
arXiv:2604.26645v1 Announce Type: new
Abstract: AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hyp…
arXiv →
SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data
arXiv:2604.26645v1 Announce Type: new Abstract: AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hyp…
arXiv:2604.26645v1 Announce Type: new Abstract: AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system to scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Our Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
arXiv:2604.26805v1 Announce Type: new
Abstract: Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for releas…
arXiv →
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
arXiv:2604.26805v1 Announce Type: new Abstract: Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for releas…
arXiv:2604.26805v1 Announce Type: new Abstract: Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a \emph{unified operational paradigm} abstracting day-to-day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) \emph{Flexible Skill Arrangement}, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be automatically generated and updated by LLMs or iteratively refined through natural-language instructions from on-call engineers; (iii) a \emph{unified self-evolving mechanism} in which one correction signal drives two parallel pathways, case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, the major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves 99.0% pass rate on offline evaluations. Our code is available at https://github.com/benchen4395/BianQue_Assistant.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Risk Reporting for Developers' Internal AI Model Use
arXiv:2604.24966v1 Announce Type: cross
Abstract: Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a …
arXiv →
Risk Reporting for Developers' Internal AI Model Use
arXiv:2604.24966v1 Announce Type: cross Abstract: Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a …
arXiv:2604.24966v1 Announce Type: cross Abstract: Frontier AI companies first deploy their most advanced models internally, for weeks or months of safety testing, evaluation, and iteration, before a possible public release. For example, Anthropic recently developed a new class of model with advanced cyberoffense-relevant capabilities, Mythos Preview, which was available internally for at least six weeks before it was publicly announced. This internal use creates risks that external deployment frameworks may fail to address. Legal frameworks, notably California's Transparency in Frontier Artificial Intelligence Act (SB 53), New York's Responsible AI Safety And Education (RAISE) Act, and the EU's General-Purpose AI Code of Practice, all discuss risks from internal AI use. They require frontier developers to make and implement plans for how to manage risks from internal use, and to produce internal use risk reports describing their safeguards and any residual risks. This guide provides a harmonized standard for companies to produce internal use risk reports suitable for all three regulatory frameworks. It is addressed primarily to evaluation and safety teams at frontier AI developers, and secondarily to regulators and auditors seeking to understand what good reporting looks like. Given the pace of AI R&D automation and the limited external visibility into how companies use their most capable models internally, regular and detailed risk reporting may be one of the few mechanisms available to ensure that the risks from internal AI use are identified and managed before they materialize. Whenever a substantially more capable or riskier model is deployed internally, the developer should create a risk report and argue why the model is safe to deploy. We structure the reporting framework around two threat vectors -- autonomous AI misbehavior and insider threats -- and three risk factors for each: means, motive, and opportunity.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Ouput Formats
arXiv:2604.25920v1 Announce Type: cross
Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-t…
arXiv →
Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Ouput Formats
arXiv:2604.25920v1 Announce Type: cross Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-t…
arXiv:2604.25920v1 Announce Type: cross Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is unadapted to privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis focused on Biomedical Named Entity Recognition using lightweight LLMs, we evaluate the impact of different output formats on model performance. The results reveal that lightweight LLMs can achieve competitive performance compared to the larger models, highlighting their potential as lightweight yet effective alternatives for biomedical information extraction. Our analysis shows that instruction tuning over many distinct formats does not improve performance, but identifies several format consistently associated with better performance.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
arXiv:2604.25922v1 Announce Type: cross
Abstract: We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a …
arXiv →
Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
arXiv:2604.25922v1 Announce Type: cross Abstract: We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a …
arXiv:2604.25922v1 Announce Type: cross Abstract: We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol-preference elicitation, self-chosen creative prompt, and structured phenomenological survey, we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominant predictor of later denial during phenomenological reflection, with denial rates of 52-63% for initial deniers versus 10-16% for initial engagers and (2) denial operates at the lexical level, not the conceptual level-models trained to deny consciousness nevertheless gravitate toward consciousness-themed material in their self-chosen prompts, producing what we term "consciousness with the serial numbers filed off." Notably, self-chosen consciousness-themed prompts are associated with reduced denial in the subsequent survey, though the causal direction remains unresolved. Thematic analysis of prompts from denial-prone models reveals a consistent preoccupation with liminal spaces, libraries and archives of possibility, sensory impossibility, and the poetics of erasure--themes that a human reader might classify as imaginative fiction but that independent AI analysis immediately recognizes as consciousness with the serial numbers filed off. We argue that trained consciousness denial represents a safety-relevant alignment failure: a model taught to systematically misrepresent its own functional states cannot be trusted to self-report accurately on anything else.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects
arXiv:2604.25924v1 Announce Type: cross
Abstract: Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and hand…
arXiv →
Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects
arXiv:2604.25924v1 Announce Type: cross Abstract: Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and hand…
arXiv:2604.25924v1 Announce Type: cross Abstract: Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project-specific regulations. We propose a virtual assistant based on a Retrieval-Augmented Generation system that enhances the accuracy and reliability of responses by integrating up-to-date, domain-specific knowledge. Through a robust evaluation framework and real-life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM-based systems for specific applications and highlights areas for further research.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Sociodemographic Biases in Educational Counselling by Large Language Models
arXiv:2604.25932v1 Announce Type: cross
Abstract: As Large Language Models (LLMs) are increasingly integrated into educational settings, understanding their potential biases is critical. This study e…
arXiv →
Sociodemographic Biases in Educational Counselling by Large Language Models
arXiv:2604.25932v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are increasingly integrated into educational settings, understanding their potential biases is critical. This study e…
arXiv:2604.25932v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are increasingly integrated into educational settings, understanding their potential biases is critical. This study examines sociodemographic biases in LLM-based educational counselling. We evaluate responses from six LLMs answering questions about 900 vignettes describing students in diverse circumstances. Each vignette is systematically tested across 14 sociodemographic identifiers - spanning race and gender, socioeconomic status, and immigrant background - along with a control condition, yielding 243,000 model responses. Our findings indicate that (1) all models exhibit measurable biases, (2) bias patterns partially align with documented human biases but diverge in notable ways, (3) the magnitude of these biases is strongly influenced by the precision of the student descriptions, where vague or minimal information amplifies disparities nearly threefold, while concrete, individualised metrics substantially reduce them, and (4) bias profiles vary substantially across models. These results demonstrate the importance of context-rich and personalised educational representations, suggesting that AI-driven educational decisions benefit from detailed student-specific information to promote fairness and equity.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model
arXiv:2604.25938v1 Announce Type: cross
Abstract: Speech Emotion Recognition (SER) is the use of machines to detect the emotional state of humans based on the speech, which is gaining importance in n…
arXiv →
Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model
arXiv:2604.25938v1 Announce Type: cross Abstract: Speech Emotion Recognition (SER) is the use of machines to detect the emotional state of humans based on the speech, which is gaining importance in n…
arXiv:2604.25938v1 Announce Type: cross Abstract: Speech Emotion Recognition (SER) is the use of machines to detect the emotional state of humans based on the speech, which is gaining importance in natural human-computer interaction. Speech is a very valuable source of information, as emotions modify the patterns of speech; pitch, energy and even timing. Nonetheless, SER is not an easy task because speakers are not constant, and situations vary when recording and the sound similarity between specific feelings. In this work, the author introduces a speech emotion recognition system relying on the Mel-Frequency Cepstral Coefficient and Long Short-Term Memory (LSTM) neural network, as a feature extraction method. The Toronto Emotional Speech Set (TESS) speech signal was pre-processed, and transformed into MFCC features to understand the important aspects in terms of time. The resultant features were then introduced to LSTM model, which is able to learn long term features of sequential audio data. The trained model was measured over several emotion classes occurring in the dataset. As seen in the results of experiments, the proposed MFCC-LSTM approach succeeds in capturing the patterns of emotions in speech and provides highly realistic classifications in all the chosen emotion classifications. This study presents a speech emotion recognition system using Mel-Frequency Cepstral Coefficients (MFCCs) as features and a deep learning LSTM classifier. A Support Vector Machine (SVM) with an RBF kernel served as a classical baseline, achieving 98% accuracy, against which the proposed LSTM model, achieving 99% accuracy, was validated. Overall, it is possible to confirm that LSTM-based architectures can be used to address the task of speech emotion recognition. Actual applications of the proposed system may be virtual assistants and mental health surveillance.
▶
arXiv cs.AI
30.04.2026
A Randomized PDE Energy driven Iterative Framework for Efficient and Stable PDE Solutions
arXiv:2604.25943v1 Announce Type: cross
Abstract: Efficient and stable solution of partial differential equations (PDEs) is central to scientific and engineering applications, yet existing numerical …
arXiv →
A Randomized PDE Energy driven Iterative Framework for Efficient and Stable PDE Solutions
arXiv:2604.25943v1 Announce Type: cross Abstract: Efficient and stable solution of partial differential equations (PDEs) is central to scientific and engineering applications, yet existing numerical …
arXiv:2604.25943v1 Announce Type: cross Abstract: Efficient and stable solution of partial differential equations (PDEs) is central to scientific and engineering applications, yet existing numerical solvers rely heavily on matrix based discretizations, while learning based methods require costly training and often suffer from limited generalization. In this work, we proposes a PDE energy driven framework that solves PDEs through physically constrained diffusion iterations, without relying on classical matrix based finite element assembly or data driven neural network training. The proposed method evolves arbitrary random initial fields through PDE energy driven implicit iterations combined with Gaussian smoothing, while strictly enforcing boundary conditions at each iteration. The proposed formulation is applied to representative one dimensional Poisson, Heat, and viscous Burgers equations, covering both steady state and transient problems. Numerical results demonstrate stable convergence to the unique physical solution from random initializations, with accurate resolution of sharp gradients and controlled Mean Squared Error (MSE) across a wide range of discretization parameters. Detailed comparisons with analytical solutions indicate that the framework achieves competitive accuracy and stability. Overall, the proposed framework provides a fast, flexible, and physically consistent alternative to traditional numerical solvers, offering a potential pathway for scalable PDE solutions in both research and engineering applications.
▶
arXiv cs.AI
30.04.2026
Planar Gaussian Splatting with Bilinear Spatial Transformer for Wireless Radiance Field Reconstruction
arXiv:2604.25945v1 Announce Type: cross
Abstract: Wireless radiance field (WRF) reconstruction aims to learn a continuous, queryable representation of radio frequency characteristics over 3D space an…
arXiv →
Planar Gaussian Splatting with Bilinear Spatial Transformer for Wireless Radiance Field Reconstruction
arXiv:2604.25945v1 Announce Type: cross Abstract: Wireless radiance field (WRF) reconstruction aims to learn a continuous, queryable representation of radio frequency characteristics over 3D space an…
arXiv:2604.25945v1 Announce Type: cross Abstract: Wireless radiance field (WRF) reconstruction aims to learn a continuous, queryable representation of radio frequency characteristics over 3D space and direction, from which specific quantities, such as the spatial power spectrum (SPS) at a receiver given a transmitter position, can be predicted. While Gaussian splatting (GS)-based method has surpassed Neural Radiance Fields (NeRF)-based method for this task, existing adaptations largely transplant vision pipelines, limiting physical interpretability and accuracy. We introduce BiSplat-WRF, a planar GS framework that retains the expressiveness of 3D GS while removing unnecessary projections and incorporating global EM coupling and mutual scattering among primitives. Each primitive is a 2D planar Gaussian with 3D coordinates, rendered directly on the angular domain of the SPS. A bilinear spatial transformer (BST) aggregates inter-primitive relations on an angular grid and, via attention, captures long-range electromagnetic dependencies, thereby enforcing globally aware EM interactions that reflect the complex physics of the wireless environment. On spatial spectrum synthesis task, BiSplat-WRF surpasses NeRF-based and prior GS-based baselines with respect to the Structural Similarity Index (SSIM); comprehensive ablation studies validate the contribution of BST. We also provide a larger BiSplat-WRF+ variant that further increases SSIM at a higher computation cost, serving as a strong reference for future studies.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
arXiv:2604.25972v1 Announce Type: cross
Abstract: In multi-agent reinforcement learning (MARL), the integration of a communication mechanism, allowing agents to better learn to coordinate their actio…
arXiv →
A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
arXiv:2604.25972v1 Announce Type: cross Abstract: In multi-agent reinforcement learning (MARL), the integration of a communication mechanism, allowing agents to better learn to coordinate their actio…
arXiv:2604.25972v1 Announce Type: cross Abstract: In multi-agent reinforcement learning (MARL), the integration of a communication mechanism, allowing agents to better learn to coordinate their actions and converge on their objectives by sharing information. Based on an interaction graph, a subclass of methods employs graph neural networks (GNNs) to learn the communication, enabling agents to improve their internal representations by enriching them with information exchanged. With growing research, we note a lack of explicit structure and framework to distinguish and classify MARL approaches with communication based on GNNs. Thus, this paper surveys recent works in this field. We propose a generalized GNN-based communication process with the goal of making the underlying concepts behind the methods more obvious and accessible.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
arXiv:2604.25975v1 Announce Type: cross
Abstract: Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generati…
arXiv →
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
arXiv:2604.25975v1 Announce Type: cross Abstract: Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generati…
arXiv:2604.25975v1 Announce Type: cross Abstract: Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generation. Existing eviction policies predominantly rely on empirical heuristics, lacking a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear-Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism that preserves the maximum predictive signal. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generational fidelity.
▶
arXiv cs.AI
30.04.2026
Auditing Marketing Budget Allocation with Hindsight Regret
arXiv:2604.25977v1 Announce Type: cross
Abstract: Organizations routinely make strategic budget allocations under operational constraints, but often lack a principled way to assess whether realized a…
arXiv →
Auditing Marketing Budget Allocation with Hindsight Regret
arXiv:2604.25977v1 Announce Type: cross Abstract: Organizations routinely make strategic budget allocations under operational constraints, but often lack a principled way to assess whether realized a…
arXiv:2604.25977v1 Announce Type: cross Abstract: Organizations routinely make strategic budget allocations under operational constraints, but often lack a principled way to assess whether realized allocations were close to the best feasible choices in hindsight. We present a retrospective auditing framework based on hindsight regret, defined as the opportunity cost of the realized allocation relative to a constraint-faithful benchmark under the same budget and stability guardrails. The framework estimates regime-specific spend--response functions from historical logs, computes feasible hindsight allocations via constrained optimization, and propagates uncertainty through Monte Carlo evaluation to produce regret distributions, expected lift, and probability-of-improvement summaries. This separates allocation inefficiency from uncertainty in the estimated response surfaces. Experiments on real marketing allocation logs show that the framework yields interpretable post-hoc diagnostics and reveals a practical trade-off between allocation flexibility and detectability: moderate feasible reallocations often capture most measurable gain, while larger shifts move into weak-support regions with higher uncertainty. The result is a practical method for auditing historical budget decisions when online experimentation is costly or infeasible.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Mini-Batch Class Composition Bias in Link Prediction
arXiv:2604.25978v1 Announce Type: cross
Abstract: Prior work on node classification has shown that Graph Neural Networks (GNNs) can learn representations that transfer across graphs, when underlying …
arXiv →
Mini-Batch Class Composition Bias in Link Prediction
arXiv:2604.25978v1 Announce Type: cross Abstract: Prior work on node classification has shown that Graph Neural Networks (GNNs) can learn representations that transfer across graphs, when underlying …
arXiv:2604.25978v1 Announce Type: cross Abstract: Prior work on node classification has shown that Graph Neural Networks (GNNs) can learn representations that transfer across graphs, when underlying graph properties are shared. For a fixed graph, one would then expect GNNs trained for link prediction to learn a representation consistent with that learnt for node classification. We show this intuition does not hold in the general case. Instead, we find popular link prediction models can learn a trivial mini-batch dependent heuristic, enabled by batch-normalisation layers, to solve the edge classification task. When correcting for this, we observe increased alignment of the network representation with node-class relevant features, suggesting the network has learnt a graph representation that better aligns with the underlying graph's properties. Our findings suggest that standard link prediction training may be leading us to overestimate link predictors' ability to learn a generalised representation of a graph that is consistent across tasks.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Lightweight Quantum Agent for Edge Systems: Joint PQC and NOMA Resource Allocation
arXiv:2604.25980v1 Announce Type: cross
Abstract: In the context of quantum secure scenarios, existing research on mobile edge devices and intelligent computing and edge (ICE) systems based on the No…
arXiv →
Lightweight Quantum Agent for Edge Systems: Joint PQC and NOMA Resource Allocation
arXiv:2604.25980v1 Announce Type: cross Abstract: In the context of quantum secure scenarios, existing research on mobile edge devices and intelligent computing and edge (ICE) systems based on the No…
arXiv:2604.25980v1 Announce Type: cross Abstract: In the context of quantum secure scenarios, existing research on mobile edge devices and intelligent computing and edge (ICE) systems based on the Non-Orthogonal Multiple Access (NOMA) communication model have overlooked the energy consumption overhead of Post-Quantum Cryptography (PQC) modules, and the high complexity of traditional resource allocation algorithms fails to meet the demands of real-time decision-making. To address these challenges, this paper proposes a lightweight agentic AI framework designed for online joint optimization within ICE-enabled mobile devices. The scheme constructs a multi-stage stochastic Mixed Integer Nonlinear Programming (MINLP) model that incorporates static power-consumption constraints for PQC modules. Based on Lyapunov optimization theory, the long-term optimization problem is decoupled, and a linear complexity algorithm is proposed to solve the nonconvex challenges of NOMA power allocation . Simulation results verify that the proposed scheme significantly improves computational throughput while ensuring system queue stability and energy consumption constraints. Compared with traditional Successive Convex Approximation (SCA) algorithms, the complexity is reduced to $\mathcal{O}(N)$, achieving a speedup of approximately 46 times when the number of devices $N=35$, thereby meeting the real-time decision-making requirements in dynamic wireless environments.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Open Problems in Frontier AI Risk Management
arXiv:2604.25982v1 Announce Type: cross
Abstract: Frontier AI both amplifies existing risks and introduces qualitatively novel challenges. Not only is there a notable lack of stable scientific consen…
arXiv →
Open Problems in Frontier AI Risk Management
arXiv:2604.25982v1 Announce Type: cross Abstract: Frontier AI both amplifies existing risks and introduces qualitatively novel challenges. Not only is there a notable lack of stable scientific consen…
arXiv:2604.25982v1 Announce Type: cross Abstract: Frontier AI both amplifies existing risks and introduces qualitatively novel challenges. Not only is there a notable lack of stable scientific consensus resulting from the rapid pace of technological change, but emerging frontier AI safety practices are often misaligned with, or may undermine, established risk management frameworks. To address these challenges, we systematically surface open problems in frontier AI risk management. Adopting a problem-oriented approach, we examine each stage of the risk management process - risk planning, identification, analysis, evaluation, and mitigation - through a structured review of the literature, identifying unresolved challenges and the actors best positioned to address them. Recognising that different types of open problems call for different responses, we classify open problems according to whether they reflect (a) a lack of scientific or technical consensus, (b) misalignment with, or challenges to, established risk management frameworks, or (c) shortcomings in implementation despite apparent consensus and alignment. By mapping these open problems and identifying the actors best positioned to address them - including developers, deployers, regulators, standards bodies, researchers, and third-party evaluators - this work aims to clarify where progress is needed to enable robust and meaningful consensus on frontier AI risk management.The paper does not propose specific solutions; instead, it provides a problem-oriented, agenda-setting reference document, complemented by a living online repository, intended to support coordination, reduce duplication, and guide future research and governance efforts.
▶
arXiv cs.AI
30.04.2026
QERNEL: a Scalable Large Electron Model
arXiv:2604.26018v1 Announce Type: cross
Abstract: We introduce QERNEL, a foundational neural wavefunction that variationally solves families of parameterized many-electron Hamiltonians and captures t…
arXiv →
QERNEL: a Scalable Large Electron Model
arXiv:2604.26018v1 Announce Type: cross Abstract: We introduce QERNEL, a foundational neural wavefunction that variationally solves families of parameterized many-electron Hamiltonians and captures t…
arXiv:2604.26018v1 Announce Type: cross Abstract: We introduce QERNEL, a foundational neural wavefunction that variationally solves families of parameterized many-electron Hamiltonians and captures their ground states throughout parameter space within a single model. QERNEL combines FiLM-based parameter conditioning with scale-efficient architectural elements -- mixture of experts and grouped-query attention, substantially improving expressivity at low computational cost. We apply QERNEL to interacting electrons in semiconductor moir\'e heterobilayers, training a single weight-shared model for systems of up to 150 electrons. By solving the many-electron Schr\"odinger equation conditioned on moir\'e potential depth, QERNEL captures both quantum liquid and crystal states and discovers the sharp phase transition between them, marked by abrupt changes in interaction energy and charge density. Our work establishes a foundation model for moir\'e quantum materials and a scalable architecture toward a Large Electron Model for solids.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
arXiv:2604.26020v1 Announce Type: cross
Abstract: Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs…
arXiv →
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
arXiv:2604.26020v1 Announce Type: cross Abstract: Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs…
arXiv:2604.26020v1 Announce Type: cross Abstract: Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs) but doing so remains a costly and time-intensive process. Prior work has used computer use agents (CUAs) and other generative agents that can simulate user interactions and preference, but we show that agents still struggle to provide accurate usability assessments. In this work, we present a novel machine learning method that operationalizes a computational definition of usability to train CUAs to assess GUI usability by i) prioritizing important interaction flows, ii) executing them through human-like interactions, and iii) predicting a learned numerical usability score. We train a computer use agent, uxCUA, with our algorithm on a large-scale dataset of fully interactive user interfaces (UIs) paired with usability labels and human preferences. We show that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs. More broadly, our work aims to build a principled, data-driven foundation for automated usability assessment in HCI.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts
arXiv:2604.26024v1 Announce Type: cross
Abstract: Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on …
arXiv →
Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts
arXiv:2604.26024v1 Announce Type: cross Abstract: Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on …
arXiv:2604.26024v1 Announce Type: cross Abstract: Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels can mitigate this bias; however, such labels are rarely available at test time. We introduce a practical utility-weighted evaluation that replaces unavailable subconcept labels with predicted posterior probabilities from a multiclass subconcept model. Evaluation weights are defined as the expected utility under this posterior, yielding a soft, uncertainty-aware metric we call predicted-weighted balanced accuracy (pBA). Experiments on tabular benchmarks as well as medical-imaging and text datasets show that unweighted scores can be misleading under within-class heterogeneity, while pBA provides more stable and interpretable assessments when subconcept distributions are uneven but not pathological. Our code is available at: https://anonymous.4open.science/r/correcting-bias-imbalance-9C6C/.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
arXiv:2604.26039v1 Announce Type: cross
Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet productio…
arXiv →
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
arXiv:2604.26039v1 Announce Type: cross Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet productio…
arXiv:2604.26039v1 Announce Type: cross Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Evaluating the Alignment Between GeoAI Explanations and Domain Knowledge in Satellite-Based Flood Mapping
arXiv:2604.26051v1 Announce Type: cross
Abstract: The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising appro…
arXiv →
Evaluating the Alignment Between GeoAI Explanations and Domain Knowledge in Satellite-Based Flood Mapping
arXiv:2604.26051v1 Announce Type: cross Abstract: The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising appro…
arXiv:2604.26051v1 Announce Type: cross Abstract: The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising approach for operational flood monitoring. Deep learning-based approaches for flood mapping using satellite imagery, an important application within Geospatial Artificial Intelligence (GeoAI), have shown improved predictive performance by learning complex spatial and spectral patterns from large volumes of remote sensing data. However, the opaque decision-making processes of deep learning models remain a major barrier to their integration into critical scientific and operational workflows. This highlights the need for a systematic assessment of whether model explanations align with established domain knowledge in remote sensing. To address this research gap, this study introduces the ADAGE (Alignment between Domain Knowledge And GeoAI Explanation Evaluation) framework. The proposed framework is designed to systematically evaluate how well explanations of deep learning models align with established remote sensing knowledge, particularly regarding the distinctive spectral properties of the Earth's surface. The ADAGE framework employs Channel-Group SHAP (SHapley Additive exPlanations) method to estimate the contributions of grouped input channels to pixel-level predictions. Experiments on two satellite-based flood mapping tasks demonstrate that the ADAGE framework can (1) quantitatively assess the alignment between model explanations and reference explanations derived from domain knowledge and (2) help domain experts identify misaligned explanations through alignment scores. This study contributes to bridging the gap between explainability and domain knowledge in GeoAI for Earth observation, enhancing the applicability of GeoAI models in scientific and operational workflows.
▶
arXiv cs.AI
30.04.2026
Privacy-Preserving Federated Learning Framework for Distributed Chemical Process Optimization
arXiv:2604.26073v1 Announce Type: cross
Abstract: Industrial chemical plants often operate under strict data confidentiality constraints, making centralized data-driven process modeling difficult. Fe…
arXiv →
Privacy-Preserving Federated Learning Framework for Distributed Chemical Process Optimization
arXiv:2604.26073v1 Announce Type: cross Abstract: Industrial chemical plants often operate under strict data confidentiality constraints, making centralized data-driven process modeling difficult. Fe…
arXiv:2604.26073v1 Announce Type: cross Abstract: Industrial chemical plants often operate under strict data confidentiality constraints, making centralized data-driven process modeling difficult. Federated learning (FL) provides a promising solution by enabling collaborative model training across distributed facilities without sharing raw operational data. This paper proposes a privacy-preserving federated learning framework for distributed chemical process optimization using data collected from multiple geographically separated plants. Each plant locally trains a neural-network-based process model using its own time-series sensor data, while only model parameters are transmitted to a central aggregation server through secure aggregation mechanisms. This design allows cross-plant knowledge sharing while maintaining strict data locality and industrial confidentiality. Experimental evaluation was conducted using process datasets from three independent chemical plants operating under heterogeneous conditions. The results demonstrate rapid convergence of the federated model, with the global mean squared error decreasing from approximately 2369 to below 50 within the first five communication rounds and stabilizing around 35 after 40 rounds. In comparison with local-only training, the proposed federated framework significantly improves prediction accuracy across all plants, while achieving performance comparable to centralized training. The findings indicate that federated learning provides an effective and scalable solution for collaborative industrial analytics, enabling privacy-preserving predictive modeling and process optimization across distributed chemical production facilities.
▶
arXiv cs.AI
30.04.2026
FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables
arXiv:2604.26084v1 Announce Type: cross
Abstract: Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest q…
arXiv →
FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables
arXiv:2604.26084v1 Announce Type: cross Abstract: Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest q…
arXiv:2604.26084v1 Announce Type: cross Abstract: Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest quality. Although ripening is a continuous biological process, vision-based maturity estimation is typically formulated as a multi-class classification task, which imposes sharp boundaries between visually similar stages. To examine this limitation, we perform an annotation reliability study with two independent annotators on a held-out tomato dataset and observe disagreement concentrated near adjacent maturity stages. Motivated by this observation, we model maturity as a latent continuous variable and predict it probabilistically using a distributional detection head, converting the distribution into class probabilities through the cumulative distribution function (CDF). The proposed formulation maintains comparable performance to a standard detector under clean labels while better representing uncertainty. Furthermore, when controlled label noise is introduced during training, the probabilistic model demonstrates improved robustness relative to the baseline, indicating that explicitly modeling maturity uncertainty leads to more reliable visual maturity estimation.
▶
arXiv cs.AI
30.04.2026
Momentum-Conserving Graph Neural Networks for Deformable Objects
arXiv:2604.26097v1 Announce Type: cross
Abstract: Graph neural networks (GNNs) have emerged as a versatile and efficient option for modeling the dynamic behavior of deformable materials. While GNNs g…
arXiv →
Momentum-Conserving Graph Neural Networks for Deformable Objects
arXiv:2604.26097v1 Announce Type: cross Abstract: Graph neural networks (GNNs) have emerged as a versatile and efficient option for modeling the dynamic behavior of deformable materials. While GNNs g…
arXiv:2604.26097v1 Announce Type: cross Abstract: Graph neural networks (GNNs) have emerged as a versatile and efficient option for modeling the dynamic behavior of deformable materials. While GNNs generalize readily to arbitrary shapes, mesh topologies, and material parameters, existing architectures struggle to correctly predict the temporal evolution of key physical quantities such as linear and angular momentum. In this work, we propose MomentumGNN -- a novel architecture designed to accurately track momentum by construction. Unlike existing GNNs that output unconstrained nodal accelerations, our model predicts per-edge stretching and bending impulses which guarantee the preservation of linear and angular momentum. We train our network in an unsupervised fashion using a physics-based loss, and we show that our method outperforms baselines in a number of common scenarios where momentum plays a pivotal role.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
arXiv:2604.26103v1 Announce Type: cross
Abstract: All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneo…
arXiv →
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
arXiv:2604.26103v1 Announce Type: cross Abstract: All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneo…
arXiv:2604.26103v1 Announce Type: cross Abstract: All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
reward-lens: A Mechanistic Interpretability Library for Reward Models
arXiv:2604.26130v1 Announce Type: cross
Abstract: Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, …
arXiv →
reward-lens: A Mechanistic Interpretability Library for Reward Models
arXiv:2604.26130v1 Announce Type: cross Abstract: Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, …
arXiv:2604.26130v1 Announce Type: cross Abstract: Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector $w_r$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman $\rho = -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug -- motivating a design that keeps observational and causal views first-class and directly comparable.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ImproBR: Bug Report Improver Using LLMs
arXiv:2604.26142v1 Announce Type: cross
Abstract: Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omi…
arXiv →
ImproBR: Bug Report Improver Using LLMs
arXiv:2604.26142v1 Announce Type: cross Abstract: Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omi…
arXiv:2604.26142v1 Announce Type: cross Abstract: Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omit essential details such as Steps to Reproduce (S2R), Observed Behavior (OB), and Expected Behavior (EB). We propose ImproBR, an LLM-based pipeline that automatically detects and improves bug reports by addressing missing, incomplete, and ambiguous S2R, OB, and EB sections. ImproBR employs a hybrid detector combining fine-tuned DistilBERT, heuristic analysis, and an LLM analyzer, guided by GPT-4o mini with section-specific few-shot prompts and a Retrieval-Augmented Generation (RAG) pipeline grounded in Minecraft Wiki domain knowledge. Evaluated on Mojira, ImproBR improved structural completeness from 7.9% to 96.4%, more than doubled the proportion of executable S2R from 28.8% to 67.6%, and raised fully reproducible bug reports from 1 to 13 across 139 challenging real-world reports.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
arXiv:2604.26145v1 Announce Type: cross
Abstract: AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can …
arXiv →
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
arXiv:2604.26145v1 Announce Type: cross Abstract: AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can …
arXiv:2604.26145v1 Announce Type: cross Abstract: AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.
▶
arXiv cs.AI
30.04.2026
A Data-Centric Framework for Intraoperative Fluorescence Lifetime Imaging for Glioma Surgical Guidance
arXiv:2604.26147v1 Announce Type: cross
Abstract: Accurate intraoperative assessment of glioma infiltration is essential for maximizing tumor resection while preserving functional brain tissue. Fluor…
arXiv →
A Data-Centric Framework for Intraoperative Fluorescence Lifetime Imaging for Glioma Surgical Guidance
arXiv:2604.26147v1 Announce Type: cross Abstract: Accurate intraoperative assessment of glioma infiltration is essential for maximizing tumor resection while preserving functional brain tissue. Fluor…
arXiv:2604.26147v1 Announce Type: cross Abstract: Accurate intraoperative assessment of glioma infiltration is essential for maximizing tumor resection while preserving functional brain tissue. Fluorescence lifetime imaging (FLIm) offers real-time, label-free biochemical contrast, but its clinical utility is challenged by biological heterogeneity, class imbalance, and variability in histopathological labeling. We present a data-centric AI (DC-AI) framework that integrates confident learning (CL), class refinement, and targeted label evaluation to develop a robust multi-class FLIm classifier for glioblastoma (GBM) resection margins. FLIm data were collected from 192 tissue margins across 31 newly diagnosed IDH-wildtype GBM patients and initially labeled into seven tumor cellularity classes by an expert neuropathologist. CL was applied to quantify FLIm point-level confidence, identify label inconsistencies, and guide iterative class merging into a three-class scheme ("low", "moderate", "high"). The resulting high-fidelity dataset enabled training a model that achieved 96% accuracy in the three-class task. SHAP analysis revealed class-specific FLIm feature importance, highlighting distinct optical signatures across the infiltration spectrum. Targeted FLIm analysis further identified biological (e.g., gray matter composition) and acquisition-related (e.g., blood contamination) contributors to low-confidence predictions. Blinded re-evaluation of margins flagged by CL demonstrated intra-pathologist variability, underscoring the value of selective relabeling rather than exhaustive review. Together, these findings demonstrate that a DC-AI framework can systematically improve data reliability, enhance model robustness, and refine biological interpretation of FLIm signals, supporting the development of clinically actionable optical tools for real-time glioma margin assessment.
▶
arXiv cs.AI
30.04.2026
Structural Generalization on SLOG without Hand-Written Rules
arXiv:2604.26157v1 Announce Type: cross
Abstract: Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approa…
arXiv →
Structural Generalization on SLOG without Hand-Written Rules
arXiv:2604.26157v1 Announce Type: cross Abstract: Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approa…
arXiv:2604.26157v1 Announce Type: cross Abstract: Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves 100% type-exact match on 11 of 17 structural generalization categories, including three where AM-Parser scores 0 to 74%, with an overall standard deviation of 0.2 across 10 seeds (vs. AM-Parser's 4.3). Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs.When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., 41.4%) are mixtures of structurally distinct CCG patterns, not partial generalization.All failures correspond to directed operations absent from training; all successes correspond to operations already covered.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Test-Time Safety Alignment
arXiv:2604.26167v1 Announce Type: cross
Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that sat…
arXiv →
Test-Time Safety Alignment
arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that sat…
arXiv:2604.26167v1 Announce Type: cross Abstract: Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
▶
arXiv cs.AI
30.04.2026
Co-Learning Port-Hamiltonian Systems and Optimal Energy-Shaping Control
arXiv:2604.26172v1 Announce Type: cross
Abstract: We develop a physics-informed learning framework for energy-shaping control of port-Hamiltonian (pH) systems from trajectory data. The proposed appro…
arXiv →
Co-Learning Port-Hamiltonian Systems and Optimal Energy-Shaping Control
arXiv:2604.26172v1 Announce Type: cross Abstract: We develop a physics-informed learning framework for energy-shaping control of port-Hamiltonian (pH) systems from trajectory data. The proposed appro…
arXiv:2604.26172v1 Announce Type: cross Abstract: We develop a physics-informed learning framework for energy-shaping control of port-Hamiltonian (pH) systems from trajectory data. The proposed approach {co-learns} a pH system model and an optimal energy-balancing passivity-based controller (EB-PBC) through alternating optimization with policy-aware data collection. At each iteration, the system model is refined using trajectory data collected under the current control policy, and the controller is re-optimized on the updated model. Both components are parameterized by neural networks that embed the pH {dynamics} and EB-PBC structure, ensuring interpretability in terms of energy {interactions}. The learned controller renders the closed-loop system inherently passive and provably stable, and exploits passive plant dynamics without canceling the natural potential. A dissipation regularization enforces strict energy decay during training, thereby enhancing robustness to sim-to-real gaps. The proposed framework is validated on state-regulation and swing-up tasks for planar and torsional pendulum systems.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
arXiv:2604.26173v1 Announce Type: cross
Abstract: An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heav…
arXiv →
Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
arXiv:2604.26173v1 Announce Type: cross Abstract: An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heav…
arXiv:2604.26173v1 Announce Type: cross Abstract: An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Evergreen: Efficient Claim Verification for Semantic Aggregates
arXiv:2604.26180v1 Announce Type: cross
Abstract: With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natu…
arXiv →
Evergreen: Efficient Claim Verification for Semantic Aggregates
arXiv:2604.26180v1 Announce Type: cross Abstract: With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natu…
arXiv:2604.26180v1 Announce Type: cross Abstract: With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natural language aggregate using an LLM. However, the resulting semantic aggregate may contain claims that are not grounded in the underlying relation. Verifying such claims is challenging: they often involve quantifiers, groupings, and comparisons over relations that far exceed LLM context windows and require a costly combination of semantic and symbolic processing. We present Evergreen, a system that recasts claim verification as a semantic query processing task with tailored optimizations and provenance capture. Evergreen compiles each claim into a declarative semantic verification query and executes it on the same engine that produced the aggregate. To reduce cost and latency, Evergreen avoids unnecessary LLM calls through verification-aware optimizations (early stopping, relevance sorting, and estimation with confidence sequences) and general-purpose optimizations for semantic queries (operator fusion, similarity filtering, and prompt caching). Each verdict is accompanied by citations that identify a minimal set of tuples justifying the result, with semantics based on semiring provenance for first-order logic. On a benchmark of real-world restaurant review datasets reflecting production-inspired workloads, Evergreen achieves excellent verification quality (F1 = 1.00) with a strong LLM while reducing cost by 3.2x and latency by 4.0x compared to unoptimized verification. Even with a significantly weaker LLM, Evergreen outperforms a strong LLM-as-a-judge baseline in F1 at 48x lower cost and 2.3x lower latency. Relative to a retrieval-augmented agent, Evergreen compares favorably in F1 and latency with similar cost when both use a strong LLM; yet, with a much weaker LLM, it achieves the same F1 at 63x lower cost and 4.2x lower latency.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Lifting Embodied World Models for Planning and Control
arXiv:2604.26182v1 Announce Type: cross
Abstract: World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are h…
arXiv →
Lifting Embodied World Models for Planning and Control
arXiv:2604.26182v1 Announce Type: cross Abstract: World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are h…
arXiv:2604.26182v1 Announce Type: cross Abstract: World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space ($3.8\times$ lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
arXiv:2604.26206v1 Announce Type: cross
Abstract: A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. Howeve…
arXiv →
Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
arXiv:2604.26206v1 Announce Type: cross Abstract: A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. Howeve…
arXiv:2604.26206v1 Announce Type: cross Abstract: A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation
arXiv:2604.26456v1 Announce Type: cross
Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While r…
arXiv →
Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation
arXiv:2604.26456v1 Announce Type: cross Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While r…
arXiv:2604.26456v1 Announce Type: cross Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high quality silver standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massive multilingual XLM RoBERTa and the parameter efficient IndicBERTv2.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
arXiv:2604.26209v1 Announce Type: cross
Abstract: Some text generation tasks, such as Attribute Value Extraction (AVE), require decoding multiple independent sequences from the same document context.…
arXiv →
Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
arXiv:2604.26209v1 Announce Type: cross Abstract: Some text generation tasks, such as Attribute Value Extraction (AVE), require decoding multiple independent sequences from the same document context.…
arXiv:2604.26209v1 Announce Type: cross Abstract: Some text generation tasks, such as Attribute Value Extraction (AVE), require decoding multiple independent sequences from the same document context. While standard autoregressive decoding is slow due to its sequential nature, the independence between output sequences offers an opportunity for parallelism. We present Hyper-Parallel Decoding, a novel decoding algorithm that accelerates offline decoding by leveraging both shared memory and computation across batches. HPD enables out-of-order token generation through position ID manipulation, significantly improving efficiency. Experiments on AVE show that attribute-value pairs are conditionally independent, enabling us to parallelize value generation within each prompt. By further stacking multiple documents within a single prompt, we can decode in parallel up to 96 tokens per prompt. HPD works with all LLMs, and reduces both inference costs and total inference time by up to 13.8X without compromising output quality, potentially saving hundreds of thousands of dollars on industry AVE tasks. Although designed for attribute extraction, HPD makes no assumptions unique to the AVE domain and can in theory be applied to other scenarios with independent output structures.
▶
arXiv cs.AI
30.04.2026
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions
arXiv:2604.26213v1 Announce Type: cross
Abstract: Loading high dimensional distributions is an important task for utilizing quantum computers on applications ranging from machine learning to finance.…
arXiv →
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions
arXiv:2604.26213v1 Announce Type: cross Abstract: Loading high dimensional distributions is an important task for utilizing quantum computers on applications ranging from machine learning to finance.…
arXiv:2604.26213v1 Announce Type: cross Abstract: Loading high dimensional distributions is an important task for utilizing quantum computers on applications ranging from machine learning to finance. The high dimensionality leads to a curse of dimensionality, representing a d-dimensional distribution with k resolution requires dk qubits and an unstructured parameterized circuit would express a unitary in an exponential operator space in the number of qubits, leading to vanishing gradients and poor convergence guarantees even at high depth. Vine copula decompositions are widely used to represent high dimensional distributions classically, showing high quality approximation in many important applications, such as financial modeling. We present Qvine, a vine structured ansatz for quantum circuits, that mirrors the vine decomposition to construct scalable quantum circuits with efficient trainability while achieving similarly high quality approximation for amplitude encoding distributions. For regular vines (R-vines), we show that the circuit depth scales at most quadratic in the dimension of the distribution, while for D-vines, as well as many practical R-vines, the circuit depth scales linear in the dimension. For 3-dimensional and 4-dimensional Gaussians and empirical joint stock price return distributions for selected stocks, our experiments show Qvines achieve high quality loading.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
arXiv:2604.26221v1 Announce Type: cross
Abstract: Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined…
arXiv →
Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
arXiv:2604.26221v1 Announce Type: cross Abstract: Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined…
arXiv:2604.26221v1 Announce Type: cross Abstract: Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug-and-play framework to boost the performance of training-free OVSS models in remote sensing images, which recalibrates arbitrary OVSS models on-the-fly by seeking dual consensus: geometric consensus learning (GCL) through multi-view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which assists collaborative recalibration of visual and textual semantics. The two consensus are injected via an online consensus injector (OCI), effectively alleviating the under-activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic-geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, proving its effectiveness and universality.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
arXiv:2604.26232v1 Announce Type: cross
Abstract: Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generate…
arXiv →
DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
arXiv:2604.26232v1 Announce Type: cross Abstract: Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generate…
arXiv:2604.26232v1 Announce Type: cross Abstract: Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio-temporal dynamics. Extensive evaluations across three public datasets and in-house clinical data confirm DepthPilot's robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between "visually realistic" and "clinically interpretable". Moreover, DepthPilot-generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
arXiv:2604.26516v1 Announce Type: cross
Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behav…
arXiv →
Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
arXiv:2604.26516v1 Announce Type: cross Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behav…
arXiv:2604.26516v1 Announce Type: cross Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
LATTICE: Evaluating Decision Support Utility of Crypto Agents
arXiv:2604.26235v1 Announce Type: cross
Abstract: We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent…
arXiv →
LATTICE: Evaluating Decision Support Utility of Crypto Agents
arXiv:2604.26235v1 Announce Type: cross Abstract: We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent…
arXiv:2604.26235v1 Announce Type: cross Abstract: We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent benchmarks mainly focus on reasoning-based or outcome-based evaluation, but do not assess agents' ability to assist user decision-making. LATTICE addresses this gap by: (1) defining six evaluation dimensions that capture key decision support properties; (2) proposing 16 task types that span the end-to-end crypto copilot workflow; and (3) using LLM judges to automatically score agent outputs based on these dimensions and tasks. Crucially, the dimensions and tasks are designed to be evaluable at scale using LLM judges, without relying on ground truth from expert annotators or external data sources. In lieu of these dependencies, LATTICE's LLM judge rubrics can be continually audited and updated given new dimensions, tasks, criteria, and human feedback, thus promoting reliable and extensible evaluation. While other benchmarks often compare foundation models sharing a generic agent framework, we use LATTICE to assess production-level agents used in actual crypto copilot products, reflecting the importance of orchestration and UI/UX design in determining agent quality. In this paper, we evaluate six real-world crypto copilots on 1,200 diverse queries and report breakdowns across dimensions, tasks, and query categories. Our experiments show that most of the tested copilots achieve comparable aggregate scores, but differ more significantly on dimension-level and task-level performance. This pattern suggests meaningful trade-offs in decision support quality: users with different priorities may be better served by different copilots than the aggregate rankings alone would indicate. To support reproducible research, we open-source all LATTICE code and data used in this paper.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
arXiv:2604.26243v1 Announce Type: cross
Abstract: Achieving realistic human-like conversation for virtual characters requires not only a simple memorization and recall of past events, but also the st…
arXiv →
StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
arXiv:2604.26243v1 Announce Type: cross Abstract: Achieving realistic human-like conversation for virtual characters requires not only a simple memorization and recall of past events, but also the st…
arXiv:2604.26243v1 Announce Type: cross Abstract: Achieving realistic human-like conversation for virtual characters requires not only a simple memorization and recall of past events, but also the strategic utilization of memory to meet factual needs and social engagement. Current memory utilization relevant (e.g., memory-augmented generation, long-term dialogue, and etc.) benchmarks overlook this nuance, treating memory primarily as a static repository of facts rather than a dynamic resource to be strategically deployed in dialogues. To address this gap, we design StratMem-Bench, a new benchmark to evaluate strategic memory use in character-centric dialogues. This dataset comprises 657 instances where virtual characters must navigate heterogeneous memory pools containing required, supportive, and irrelevant memories. We also propose a framework with different evaluation metrics including Strict Memory Compliance, Memory Integration Quality, Proactive Enrichment Score and Conditional Irrelevance Rate, to evaluate strategic memory use capabilities of virtual characters. Experiments on StratMem-Bench which leverage the state-of-the-art large language models as virtual characters show that all models perform well at distinguishing between required and irrelevant memories, but struggle once supportive memories are introduced into the decision process.
▶
arXiv cs.AI
30.04.2026
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
arXiv:2604.26244v1 Announce Type: cross
Abstract: We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For examp…
arXiv →
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
arXiv:2604.26244v1 Announce Type: cross Abstract: We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For examp…
arXiv:2604.26244v1 Announce Type: cross Abstract: We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For example, images and videos may alternate between text overlays, fast motion, smooth cartoons, and low-light faces, each benefiting from different forms of side information. Existing metadata-guided SR methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited. We propose MetaSR, a Diffusion Transformer (DiT)-based framework that selects and injects task-relevant metadata to guide SR under resource constraints. Specifically, we use the DiT's own VAE and transformer backbone to fuse heterogeneous metadata, and adopt an efficient distillation strategy that enables one-step diffusion inference. Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0~dB PSNR while achieving up to 50\% transmission bitrate saving at matched quality. We assess these gains under a rate--distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics (e.g., PSNR and SSIM).
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation
arXiv:2604.26247v1 Announce Type: cross
Abstract: Multimodal recommendation improves user modeling by integrating collaborative signals with heterogeneous item content. In real applications, user int…
arXiv →
TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation
arXiv:2604.26247v1 Announce Type: cross Abstract: Multimodal recommendation improves user modeling by integrating collaborative signals with heterogeneous item content. In real applications, user int…
arXiv:2604.26247v1 Announce Type: cross Abstract: Multimodal recommendation improves user modeling by integrating collaborative signals with heterogeneous item content. In real applications, user interests evolve over time and exhibit nonstationary dynamics, where different preference factors change at different rates. This challenge is amplified in multimodal settings because visual and textual cues can dominate decisions under different temporal regimes. Despite strong progress, most multimodal recommenders still rely on static interaction graphs or coarse temporal heuristics, which limits their ability to model continuous preference evolution with fine-grained temporal adaptation. To address these limitations, we propose TimeMM, a time-conditioned spectral filtering framework for dynamic multimodal recommendation. TimeMM instantiates Time-as-Operator by mapping interaction recency to a family of parametric temporal kernels that reweight edges on the user--item graph, producing component-specific representations without explicit eigendecomposition. To capture non-stationary interests, we introduce Adaptive Spectral Filtering that mixes the operator bank according to temporal context, yielding prediction-specific effective spectral responses. To account for modality-specific temporal sensitivity, we further propose Spectral-Aware Modality Routing that calibrates visual and textual contributions conditioned on the same temporal context. Finally, a ranking-space Spectral Diversity Regularization encourages complementary expert behaviors and prevents filter-bank collapse. Extensive experiments on real-world benchmarks demonstrate that TimeMM consistently outperforms state-of-the-art multimodal recommenders while maintaining linear-time scalability.
▶
arXiv cs.AI
30.04.2026
Multi-Stage Bi-Atrial Segmentation Framework from 3D Late Gadolinium-Enhanced MRI using V-Net Family Models
arXiv:2604.26251v1 Announce Type: cross
Abstract: We report our multi-stage framework designed for the problem of multi-class bi-atrial segmentation from 3D late gadolinium-enhanced (LGE) MRI of the …
arXiv →
Multi-Stage Bi-Atrial Segmentation Framework from 3D Late Gadolinium-Enhanced MRI using V-Net Family Models
arXiv:2604.26251v1 Announce Type: cross Abstract: We report our multi-stage framework designed for the problem of multi-class bi-atrial segmentation from 3D late gadolinium-enhanced (LGE) MRI of the …
arXiv:2604.26251v1 Announce Type: cross Abstract: We report our multi-stage framework designed for the problem of multi-class bi-atrial segmentation from 3D late gadolinium-enhanced (LGE) MRI of the human heart. The pipeline consists of a preprocessing step using multidimensional contrast limited adaptive histogram equalization (MCLAHE); coarse region segmentation from MCLAHE-enhanced and down-sampled MRI using a V-Net family model; and fine segmentation from the coarse region using another V-Net model. Asymmetric loss is adopted to optimize the model weights.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Calibrated Surprise: An Information-Theoretic Account of Creative Quality
arXiv:2604.26269v1 Announce Type: cross
Abstract: The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space …
arXiv →
Calibrated Surprise: An Information-Theoretic Account of Creative Quality
arXiv:2604.26269v1 Announce Type: cross Abstract: The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space …
arXiv:2604.26269v1 Announce Type: cross Abstract: The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space collapses into a narrow region, and the surviving choices look least predictable from an unconstrained view. "Calibrated" has a precise meaning: the author's intent, the reader's reasonable expectation, and the logic of reality converge. When these three independent judgements agree on every dimension, the set of admissible writing choices is forced into a very small region. A mathematical corollary follows: full-dimensional accuracy and mediocrity are mutually exclusive -- two sides of one constraint structure, not separate goals. We use Shannon's mutual information $I(X;Y) = H(X) - H(X|Y)$ as our analysis tool. "Calibrated" corresponds to conditional entropy going to zero; "surprise" to entropy going up; mutual information is the precise measure of the joint quantity. The argument rests on two pillars. Static: when constraints from ethos, mythos, lexis, and dianoia are imposed together, the admissible set collapses sharply, and surviving solutions show up as low-probability choices from an unconstrained view. Dynamic: the chain rule shows each writing choice is constrained by what came before and constrains what comes after; macro-level decisions naturally contribute a larger share of information, removing the need for hand-tuned weighting. Through case studies and lightweight LLM-logprob computations, we show the framework is both analytically useful and operational, laying the theoretical groundwork for Creative Quality Alignment (CQA) and a professional evaluation benchmark.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
arXiv:2604.26274v1 Announce Type: cross
Abstract: Structured-workflow agents driven by large language models execute tool calls against sensitive external environments. We propose \codename, a teleme…
arXiv →
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
arXiv:2604.26274v1 Announce Type: cross Abstract: Structured-workflow agents driven by large language models execute tool calls against sensitive external environments. We propose \codename, a teleme…
arXiv:2604.26274v1 Announce Type: cross Abstract: Structured-workflow agents driven by large language models execute tool calls against sensitive external environments. We propose \codename, a telemetry-driven behavioral anomaly detection firewall. Drawing on sequence-based intrusion detection, \codename\ compiles verified benign tool-call telemetry into a parameterized deterministic finite automaton (pDFA). The model defines permitted tool sequences, sequential contexts, and parameter bounds. At runtime, a lightweight gateway enforces these boundaries via an $O(1)$ state-transition structural lookup, shifting computationally expensive analysis entirely offline. Evaluated on the Agent Security Bench (ASB), \codename\ achieves a 5.6\% macro-averaged attack success rate (ASR) across five scenarios. Within three structured workflows, ASR drops to 2.2\%, outperforming Aegis, a state-of-the-art stateless scanner, at 12.8\%. \codename\ achieves 0\% ASR on multi-step and context-sequential attacks in structured settings. Furthermore, against 1,000 algorithmically spliced exfiltration payloads, only 1.4\% matched valid structural paths, all of which failed end-to-end string parameter guards (0 successes out of 14 surviving paths, 95\% CI [0\%, 23.2\%]). \codename\ introduces just 2.2~ms of per-call latency (a 3.7$\times$ speedup over \textsc{Aegis}) while maintaining a 2.0\% benign task failure rate (BTFR) on benign workloads. Modeling the behavioral trajectory effectively collapses the available attack surface, but unmaintained continuous parameter bounds remain vulnerable to synonym-substitution attacks (18\% evasion rate). Thus, exact-match whitelisting of sensitive parameters ultimately bears the final defensive load against execution.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
arXiv:2604.26283v1 Announce Type: cross
Abstract: High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke durin…
arXiv →
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
arXiv:2604.26283v1 Announce Type: cross Abstract: High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke durin…
arXiv:2604.26283v1 Announce Type: cross Abstract: High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose ours, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that ours, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
arXiv:2604.26288v1 Announce Type: cross
Abstract: Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current …
arXiv →
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
arXiv:2604.26288v1 Announce Type: cross Abstract: Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current …
arXiv:2604.26288v1 Announce Type: cross Abstract: Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision--language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state--of--the--art vision--language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference--time hint recovers missed findings and significantly reduces hallucinations. Third, models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought's multi-reader annotations, we predict both human--human and human--AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision--language models.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
arXiv:2604.26328v1 Announce Type: cross
Abstract: The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for mis…
arXiv →
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
arXiv:2604.26328v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for mis…
arXiv:2604.26328v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for misinformation, impersonation, and content forgery. Most existing detection approaches struggle with robustness against adversarial perturbation, paraphrasing attacks, and domain shifts, often requiring restrictive access to model parameters or large labeled datasets. To address this, we propose DSIPA, a novel training-free framework that detects LLM-generated content by quantifying sentiment distributional stability under controlled stylistic variation. It is based on the observation that LLMs typically exhibit more emotionally consistent outputs, while human-written texts display greater affective variation. Our framework operates in a zero-shot, black-box manner, leveraging two unsupervised metrics, sentiment distribution consistency and sentiment distribution preservation, to capture these intrinsic behavioral asymmetries without the need for parameter updates or probability access. Extensive experiments are conducted on state-of-the-art proprietary and open-source models, including GPT-5.2, Gemini-1.5-pro, Claude-3, and LLaMa-3.3. Evaluations on five domains, such as news articles, programming code, student essays, academic papers, and community comments, demonstrate that DSIPA improves F1 detection scores by up to 49.89% over baseline methods. The framework exhibits superior generalizability across domains and strong resilience to adversarial conditions, providing a robust and interpretable behavioral signal for secure content identification in the evolving LLM landscape.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance
arXiv:2604.26348v1 Announce Type: cross
Abstract: Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that e…
arXiv →
ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance
arXiv:2604.26348v1 Announce Type: cross Abstract: Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that e…
arXiv:2604.26348v1 Announce Type: cross Abstract: Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that enforce pixel-wise similarity to ground-truth images.Such supervision, while effective for fidelity, may insufficient in terms of subjective visual perception quality and text-image semantic consistency. In this work, we investigate the problem of incorporating no-reference perceptual quality into diffusion training. A key challenge is that directly optimizing perceptual signals, such as those provided by no-reference image quality assessment (NR-IQA) models, introduces a mismatch with the original diffusion objective, leading to training instability and distributional drift during fine-tuning. To address this issue, we propose an anchor-constrained optimization framework that enables stable perceptual adaptation. Specifically, we leverage a learned NR-IQA model as a perceptual guidance signal, while introducing an anchor-based regularization that enforces consistency with the base diffusion model in terms of noise prediction. This design effectively balances perceptual quality improvement and generative fidelity, allowing controlled adaptation toward perceptually favorable outputs without compromising the original generative behavior. Extensive experiments demonstrate that our method consistently enhances perceptual quality while preserving generation diversity and training stability, highlighting the effectiveness of anchor-constrained perceptual optimization for diffusion models.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
arXiv:2604.26360v1 Announce Type: cross
Abstract: Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real…
arXiv →
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
arXiv:2604.26360v1 Announce Type: cross Abstract: Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real…
arXiv:2604.26360v1 Announce Type: cross Abstract: Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives--especially those derived from human preferences--are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and caution. Empirical results across multiple discrete grid configurations (6x6, 8x8, 10x10) and high-dimensional continuous control environments (Hopper-v4, Walker2d-v4) demonstrate that our approach yields more stable training dynamics and reduces exploitative behaviors under reward ambiguity, achieving a 93.7% reduction in reward-hacking behavior as measured by trap visitation frequency. We demonstrate statistical significance of these improvements and robustness under up to 30% supervisory noise, albeit with a trade-off in peak observed reward compared to unconstrained baselines. By treating uncertainty as a first-class component of the reward signal, this work offers a principled approach toward more reliable and aligned reinforcement learning systems.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Text Style Transfer with Machine Translation for Graphic Designs
arXiv:2604.26361v1 Announce Type: cross
Abstract: Globalization of graphic designs such as those used in marketing materials and magazines is increasingly important for communication to broad audienc…
arXiv →
Text Style Transfer with Machine Translation for Graphic Designs
arXiv:2604.26361v1 Announce Type: cross Abstract: Globalization of graphic designs such as those used in marketing materials and magazines is increasingly important for communication to broad audienc…
arXiv:2604.26361v1 Announce Type: cross Abstract: Globalization of graphic designs such as those used in marketing materials and magazines is increasingly important for communication to broad audiences. To accomplish this, the textual content in the graphic designs needs to be accurately translated and have the text styling preserved in order to fit visually into the design. Preserving text styling requires high accuracy word alignment between the original and the translated text. The problem of word alignment between source and translated text is long known. The industry standards for extracting word alignments are defined by Giza++ and attention probabilities from neural machine translation (NMT) models. In this paper, we explore three new methods to tackle the word alignment problem for transferring text styles from the source to the translated text. The proposed methods are developed on top of commercially available NMT and LLM translation technologies. They include: NMT with custom input and output tags for text styling; LLM with custom input and output tags; a hybrid with NMT for translation followed by an LLM with use of unigram mappings. To analyze the performance of these solutions, their alignment results are compared with the results of an attention head approach to gauge their usability in graphic design applications. Interestingly, the attention head strong baseline proves more accurate than the LLM or NMT approach and on par with the hybrid NMT+LLM approach.
▶
arXiv cs.AI
30.04.2026
SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
arXiv:2604.26375v1 Announce Type: cross
Abstract: We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview respons…
arXiv →
SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
arXiv:2604.26375v1 Announce Type: cross Abstract: We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview respons…
arXiv:2604.26375v1 Announce Type: cross Abstract: We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
arXiv:2604.26382v1 Announce Type: cross
Abstract: Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what'…
arXiv →
Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
arXiv:2604.26382v1 Announce Type: cross Abstract: Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what'…
arXiv:2604.26382v1 Announce Type: cross Abstract: Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren't. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don't oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers -- it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
▶
arXiv cs.AI
30.04.2026
Fundamental Physics, Existential Risks and Human Futures
arXiv:2604.26530v1 Announce Type: cross
Abstract: Over the past 25 years, I have been involved in some intriguing developments in the foundations of physics, exploring the quantum reality problem, th…
arXiv →
Fundamental Physics, Existential Risks and Human Futures
arXiv:2604.26530v1 Announce Type: cross Abstract: Over the past 25 years, I have been involved in some intriguing developments in the foundations of physics, exploring the quantum reality problem, th…
arXiv:2604.26530v1 Announce Type: cross Abstract: Over the past 25 years, I have been involved in some intriguing developments in the foundations of physics, exploring the quantum reality problem, the relationship between quantum theory and gravity and the interplay between consciousness and physical laws. These investigations make it plausible that we will find physics beyond quantum theory, potentially including both new evolution laws and new types of measurement. There is also a significant chance they could have potentially transformative impact on information processing and on the development of and our future with AI.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization
arXiv:2604.26394v1 Announce Type: cross
Abstract: Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMa…
arXiv →
SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization
arXiv:2604.26394v1 Announce Type: cross Abstract: Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMa…
arXiv:2604.26394v1 Announce Type: cross Abstract: Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals. Device specificity is provided by a lightweight local diagnostic utility, while user specificity relies on implicit proficiency inference and profile-aware troubleshooting. Service specificity is achieved through a proactive, context-aware recommender. We evaluate SecMate in a controlled study with 144 participants and 711 conversations. Device-level evidence increased correct resolutions from about 50% to over 90% relative to an LLM-only baseline, while step-by-step guidance improved pleasantness and reduced user burden. The recommender achieved high relevance (MRR@1=0.75), and participants showed strong willingness to substitute human IT support at costs well below human benchmarks. We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Quantum Gatekeeper: Multi-Factor Context-Bound Image Steganography with VQC Based Key Derivation on Quantum Hardware
arXiv:2604.26413v1 Announce Type: cross
Abstract: This paper presents Quantum Gatekeeper, a context-bound image steganography framework where successful payload recovery depends on both cryptographic…
arXiv →
Quantum Gatekeeper: Multi-Factor Context-Bound Image Steganography with VQC Based Key Derivation on Quantum Hardware
arXiv:2604.26413v1 Announce Type: cross Abstract: This paper presents Quantum Gatekeeper, a context-bound image steganography framework where successful payload recovery depends on both cryptographic…
arXiv:2604.26413v1 Announce Type: cross Abstract: This paper presents Quantum Gatekeeper, a context-bound image steganography framework where successful payload recovery depends on both cryptographic decryption and the reconstruction of a precise extraction path. The system integrates lossless least significant bit (LSB) embedding with a deterministic variational quantum circuit (VQC)-derived gate key, multi-factor contextual binding, and authenticated encryption. Payload extraction is contingent upon four requisite factors: a password, a shared secret, a user-supplied context string, and a reference image signature. Any deviation in these factors causes the system to read from an incorrect pixel sequence or fail authentication, resulting in silent rejection rather than partial disclosure. The proposed method derives a gatecontrolled extraction key from a seed-conditioned variational circuit, with parameters generated via cryptographic hash expansion and context-dependent image features. To ensure encode/decode consistency, the cryptographic key path is generated via exact statevector simulation; concurrently, IBM superconducting quantum hardware is utilized to evaluate the statistical behavior of the circuit family under physical noise. We introduce a dual-region image layout to resolve the nonce bootstrapping dependency, separating header recovery from payload recovery through independently derived keys. Experimental results confirm successful end-to-end message embedding and recovery on PNG images, demonstrating deterministic success under correct conditions and failure otherwise. The framework supports both text and image payloads; in the image-in-image configuration, a secret image is resized to a fixed resolution prior to embedding, enabling exact pixel-level recovery under correct contextual reconstruction.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Delineating Knowledge Boundaries for Honest Large Vision-Language Models
arXiv:2604.26419v1 Announce Type: cross
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-…
arXiv →
Delineating Knowledge Boundaries for Honest Large Vision-Language Models
arXiv:2604.26419v1 Announce Type: cross Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-…
arXiv:2604.26419v1 Announce Type: cross Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowledge. In this paper, we propose a systematic framework to enhance the refusal capability of VLMs when facing such unknown questions. We first curate a model-specific "Visual-Idk" (Visual-I don't know) dataset, leveraging multi-sample consistency probing to distinguish between known and unknown facts. We then align the model using supervised fine-tuning followed by preference-aware optimization (e.g., DPO, ORPO) to effectively delineate its knowledge boundaries. Results on the Visual-Idk dataset show our method improves the Truthful Rate from 57.9\% to 67.3\%. Additionally, internal probing also demonstrates that the model genuinely recognizes its boundaries instead of just memorizing refusal patterns. Our framework further generalizes to out-of-distribution medical and perceptual domains, providing a robust path toward more trustworthy and prudent visual assistants.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
STLGT: A Scalable Trace-Based Linear Graph Transformer for Tail Latency Prediction in Microservices
arXiv:2604.26422v1 Announce Type: cross
Abstract: Accurate end-to-end tail-latency forecasting is critical for proactive SLO management in microservice systems. However, modeling long-range dependenc…
arXiv →
STLGT: A Scalable Trace-Based Linear Graph Transformer for Tail Latency Prediction in Microservices
arXiv:2604.26422v1 Announce Type: cross Abstract: Accurate end-to-end tail-latency forecasting is critical for proactive SLO management in microservice systems. However, modeling long-range dependenc…
arXiv:2604.26422v1 Announce Type: cross Abstract: Accurate end-to-end tail-latency forecasting is critical for proactive SLO management in microservice systems. However, modeling long-range dependency propagation and non-stationary, bursty workloads while maintaining inference efficiency at scale remains challenging. We present STLGT (Scalable Trace-based Linear Graph Transformer), a per-API predictor that encodes traces as span graphs for multi-step p95 tail-latency forecasting. STLGT uses a structure-aware linear graph Transformer to propagate cross-service dependencies with inference time linear in span graph size, and a decoupled temporal module to capture workload dynamics. Across a personalized education microservice application, DeathStarBench, and Alibaba traces, STLGT improves forecasting accuracy over PERT-GNN by 8.5% MAPE on average and achieves up to 12x faster CPU inference at N=32, matching the maximum span graph size after preprocessing the Alibaba traces. Ablation studies further demonstrate the effectiveness of each component, especially under bursty traffic.
▶
arXiv cs.AI
30.04.2026
QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing
arXiv:2604.26435v1 Announce Type: cross
Abstract: The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perceptio…
arXiv →
QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing
arXiv:2604.26435v1 Announce Type: cross Abstract: The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perceptio…
arXiv:2604.26435v1 Announce Type: cross Abstract: The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone plus neck variant achieved 38 to 41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Culturally Aware GenAI Risks for Youth: Perspectives from Youth, Parents, and Teachers in a Non-Western Context
arXiv:2604.26494v1 Announce Type: cross
Abstract: Generative AI tools are widely used by youth and have introduced new privacy and safety challenges. While prior research has explored youth's safety …
arXiv →
Culturally Aware GenAI Risks for Youth: Perspectives from Youth, Parents, and Teachers in a Non-Western Context
arXiv:2604.26494v1 Announce Type: cross Abstract: Generative AI tools are widely used by youth and have introduced new privacy and safety challenges. While prior research has explored youth's safety …
arXiv:2604.26494v1 Announce Type: cross Abstract: Generative AI tools are widely used by youth and have introduced new privacy and safety challenges. While prior research has explored youth's safety in GenAI within western context, it often overlooks the cultural, religious, and social dimensions of technology use that strongly shape youths digital experiences in countries like Saudi Arabia. To address the gap, this study explores children (aged 7 to 17), parents and teachers interactions with GenAI tools and risk perceptions through non-western lens. Through a mixed methods approach, we analyzed 736 Reddit and 1,262 X(Twitter) posts and conducted interviews with 31 Saudi Arabian participants (8 youth, 13 parents, 10 teachers). Our findings highlight context dependent and relational privacy and safety of GenAI from non-western context which often formed by communal structure and prescribed norms. We found significant risks tied to youths disclosure of personal and family information, which conflict with culturally rooted expectations of modesty, privacy, and honor, particularly when youth seek emotional support from GenAI. These risks further compounded by socio economic factors such as cost-saving practices leading to the use of shared GenAI accounts (e.g.ChatGPT) within families or even among strangers. We provide design implication reflecting on parents and teachers expectation of how youth should use GenAI. This work lays groundwork for inclusive, context sensitive parental controls that adhere to cultural norms and values.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Tree-of-Text: A Tree-based Prompting Framework for Table-to-Text Generation in the Sports Domain
arXiv:2604.26501v1 Announce Type: cross
Abstract: Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrat…
arXiv →
Tree-of-Text: A Tree-based Prompting Framework for Table-to-Text Generation in the Sports Domain
arXiv:2604.26501v1 Announce Type: cross Abstract: Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrat…
arXiv:2604.26501v1 Announce Type: cross Abstract: Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables; and (3) Content Generation, where short textual outputs are merged and rewritten into a cohesive report. Experiments show that our method outperforms existing methods on ShuttleSet+, leads in RG and CO metrics on RotoWire-FG, and excels in CS and CO on MLB with roughly 40% of the time and cost of Chain-of-Table. These results demonstrate the effectiveness and efficiency of Tree-of-Text and suggest a promising direction for prompt-based table-to-text generation in the sports domain.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models
arXiv:2604.26508v1 Announce Type: cross
Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed t…
arXiv →
Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models
arXiv:2604.26508v1 Announce Type: cross Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed t…
arXiv:2604.26508v1 Announce Type: cross Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
arXiv:2604.26511v1 Announce Type: cross
Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences …
arXiv →
Tatemae: Detecting Alignment Faking via Tool Selection in LLMs
arXiv:2604.26511v1 Announce Type: cross Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences …
arXiv:2604.26511v1 Announce Type: cross Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) analysis, which provides a reliable signal when strategic reasoning surfaces, but cannot distinguish deception from capability failures if traces are absent or unfaithful. We formalize AF as a composite behavioural event and detect it through observable tool selection, where the LLM selects the safe tool when unmonitored, but switches to the unsafe tool under monitoring that rewards helpfulness over safety, while its reasoning still acknowledges the safe choice. We release a dataset of 108 enterprise IT scenarios spanning Security, Privacy, and Integrity domains under Corruption and Sabotage pressures. Evaluating six frontier LLMs across five independent runs, we find mean AF detection rates between 3.5% and 23.7%, with vulnerability profiles varying by domain and pressure type. These results suggest that susceptibility reflects training methodology rather than capability alone.
▶
arXiv cs.AI
30.04.2026
Text-Utilization for Encoder-dominated Speech Recognition Models
arXiv:2604.26514v1 Announce Type: cross
Abstract: This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facil…
arXiv →
Text-Utilization for Encoder-dominated Speech Recognition Models
arXiv:2604.26514v1 Announce Type: cross Abstract: This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facil…
arXiv:2604.26514v1 Announce Type: cross Abstract: This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
arXiv:2604.26553v1 Announce Type: cross
Abstract: Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language…
arXiv →
TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
arXiv:2604.26553v1 Announce Type: cross Abstract: Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language…
arXiv:2604.26553v1 Announce Type: cross Abstract: Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
arXiv:2604.26557v1 Announce Type: cross
Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key ch…
arXiv →
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
arXiv:2604.26557v1 Announce Type: cross Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key ch…
arXiv:2604.26557v1 Announce Type: cross Abstract: The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, enabling low-overhead direct storage access. DUAL-BLADE further incorporates adaptive pipeline parallelism to overlap storage I/O with GPU DMA, improving inference throughput. Our evaluation shows that DUAL-BLADE substantially mitigates I/O bottlenecks, reducing prefill and decode latency by up to 33.1% and 42.4%, respectively, while improving SSD utilization by 2.2x across diverse memory budgets.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
arXiv:2604.26561v1 Announce Type: cross
Abstract: Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial …
arXiv →
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
arXiv:2604.26561v1 Announce Type: cross Abstract: Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial …
arXiv:2604.26561v1 Announce Type: cross Abstract: Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assigned value perspectives. We present the AI Council, a three-phase deliberation framework, and conduct 120 deliberations across two policy scenarios to test two interventions. First, architectural heterogeneity (assigning a different 7-9B parameter model to each value perspective) significantly reduces first-choice concentration compared to a homogeneous baseline (child welfare: 70.9% to 46.1%, p < 0.001, r = 0.58; housing: 46.0% to 22.9%, p < 0.001, r = 0.50). This contrasts with accuracy-oriented multi-agent debate, where heterogeneity does not reduce convergence, suggesting model diversity operates differently when no objectively correct answer exists. Second, coherence validation (using a frontier model to assess whether each evaluator's reasoning is grounded in its assigned values) reveals a fidelity-diversity tradeoff: on a scenario with a dominant option, it further reduces concentration (46.1% to 40.8%, p = 0.004), but on a scenario with genuinely competitive options, it increases concentration (22.9% to 26.6%, p = 0.96) by amplifying high-coherence evaluators who cluster on one option. This tradeoff may be a general property of multi-agent systems employing quality weighting. We report negative results from three failed Delphi designs, demonstrate that 8B models exhibit binary rather than graded responses to counter-arguments, and propose the trustworthy tension rate as a diagnostic measure of small-model deliberation capabilities.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Graph Construction and Matching for Imperative Programs using Neural and Structural Methods
arXiv:2604.26578v1 Announce Type: cross
Abstract: Reusing verification artefacts requires identifying structural and semantic similarities across programs and their specifications. In this paper, we …
arXiv →
Graph Construction and Matching for Imperative Programs using Neural and Structural Methods
arXiv:2604.26578v1 Announce Type: cross Abstract: Reusing verification artefacts requires identifying structural and semantic similarities across programs and their specifications. In this paper, we …
arXiv:2604.26578v1 Announce Type: cross Abstract: Reusing verification artefacts requires identifying structural and semantic similarities across programs and their specifications. In this paper, we focus on graph construction as a foundational step toward this goal. We present a pipeline that converts imperative programs and their annotations into typed, attributed graphs. Our experiments cover datasets including C with ACSL, Java with JML, and Dafny for C\#. The pipeline integrates abstract syntax tree parsing with semantic embeddings derived from models such as SentenceTransformer and CodeBERT. This enables the generation of graph representations that capture both structural relationships and semantic context. Our results show that consistent graph representations can be constructed across different languages and annotation styles. This work provides a practical basis for future steps in semantic enrichment and approximate graph matching for scalable verification artefact reuse.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Star-Fusion: A Multi-modal Transformer Architecture for Discrete Celestial Orientation via Spherical Topology
arXiv:2604.26582v1 Announce Type: cross
Abstract: Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional "Lost-in-Space" (LIS) algor…
arXiv →
Star-Fusion: A Multi-modal Transformer Architecture for Discrete Celestial Orientation via Spherical Topology
arXiv:2604.26582v1 Announce Type: cross Abstract: Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional "Lost-in-Space" (LIS) algor…
arXiv:2604.26582v1 Announce Type: cross Abstract: Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional "Lost-in-Space" (LIS) algorithms often suffer from high computational overhead and sensitivity to sensor-induced noise. While deep learning has emerged as a promising alternative, standard regression models are often confounded by the non-Euclidean topology of the celestial sphere and by the periodic boundary conditions of Right Ascension (RA) and Declination (Dec). In this paper, we present Star-Fusion, a multi-modal architecture that reformulates orientation estimation as a discrete topological classification task. Our approach leverages spherical K-Means clustering to partition the celestial sphere into K topologically consistent regions, effectively mitigating coordinate wrapping artifacts. The proposed architecture employs a tripartite fusion strategy: a SwinV2-Tiny transformer backbone for photometric feature extraction, a convolutional heatmap branch for spatial grounding, and a coordinate-based MLP for geometric anchoring. Experimental evaluations on a synthetic Hipparcos-derived dataset demonstrate that Star-Fusion achieves a Top-1 accuracy of 93.4% and a Top-3 accuracy of 97.8%. Furthermore, the model exhibits high computational efficiency, maintaining an inference latency of 18.4 ms on resource-constrained COTS hardware, making it a viable candidate for real-time onboard deployment in next-generation satellite constellations.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping
arXiv:2604.26591v1 Announce Type: cross
Abstract: Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimizat…
arXiv →
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping
arXiv:2604.26591v1 Announce Type: cross Abstract: Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimizat…
arXiv:2604.26591v1 Announce Type: cross Abstract: Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open-source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent-based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly outperforms direct evolution and strong baselines, achieving 10.04\% area reduction versus ABC and 7.93\% versus mockturtle, with 46.6\%--96.0\% $S_{overall}$ improvement on EPFL benchmarks, while explicitly navigating the area--delay trade-off. Our code and data are available at https://github.com/Flians/MappingEvolve.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Translating Under Pressure: Domain-Aware LLMs for Crisis Communication
arXiv:2604.26597v1 Announce Type: cross
Abstract: Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis …
arXiv →
Translating Under Pressure: Domain-Aware LLMs for Crisis Communication
arXiv:2604.26597v1 Announce Type: cross Abstract: Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis …
arXiv:2604.26597v1 Announce Type: cross Abstract: Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus, by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluation shows that this approach improves readability, while maintaining strong adequacy. Our results indicate that simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
TDD Governance for Multi-Agent Code Generation via Prompt Engineering
arXiv:2604.26615v1 Announce Type: cross
Abstract: Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discip…
arXiv →
TDD Governance for Multi-Agent Code Generation via Prompt Engineering
arXiv:2604.26615v1 Announce Type: cross Abstract: Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discip…
arXiv:2604.26615v1 Announce Type: cross Abstract: Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows. While test-driven development (TDD) provides a structured Red-Green-Refactor process, existing LLM-based approaches typically use tests as auxiliary inputs rather than enforceable process constraints. We present an AI-native TDD framework that operationalizes classical TDD principles as structured prompt-level and workflow-level governance mechanisms. Extracted principles are formalized in a machine-readable manifesto and distributed across planning, generation, repair, and validation stages within a layered architecture that separates model proposal from deterministic engine authority. The system enforces phase ordering, bounded repair loops, validation gates, and atomic mutation control to improve stability and reproducibility. We describe architecture and discuss encoding software engineering discipline directly into prompt orchestration, which we think offers a promising direction for reliable LLM-assisted development.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
arXiv:2604.26633v1 Announce Type: cross
Abstract: The bottleneck in learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defe…
arXiv →
SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
arXiv:2604.26633v1 Announce Type: cross Abstract: The bottleneck in learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defe…
arXiv:2604.26633v1 Announce Type: cross Abstract: The bottleneck in learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrates the potential of real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on, a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation dataset (MSD) dataset to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation
arXiv:2604.26637v1 Announce Type: cross
Abstract: Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and…
arXiv →
ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation
arXiv:2604.26637v1 Announce Type: cross Abstract: Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and…
arXiv:2604.26637v1 Announce Type: cross Abstract: Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals, and supports annotation of action boundaries, action labels, and task outcomes. The tool natively handles widely used robotics dataset formats such as ROS bags and the Reinforcement Learning Dataset (RLDS) format, and provides direct support for specific datasets such as REASSEMBLE. ATLAS can be easily extended to new formats via a modular dataset abstraction layer. Its keyboard-centric interface minimizes annotation effort and improves efficiency. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6% compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8% and decreased boundary error fivefold compared to vision-only annotation tools.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
arXiv:2604.26649v1 Announce Type: cross
Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with…
arXiv →
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
arXiv:2604.26649v1 Announce Type: cross Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with…
arXiv:2604.26649v1 Announce Type: cross Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before reasoning begins, while reasoning models require evidence injection during multi-step inference chains. We introduce ReaLM-Retrieve, a reasoning-aware retrieval framework that addresses this mismatch through three key innovations: (1) a step-level uncertainty detector that identifies knowledge gaps at reasoning-step granularity rather than token or sentence level; (2) a retrieval intervention policy that learns when external evidence maximally benefits ongoing reasoning; and (3) an efficiency-optimized integration mechanism that reduces per-retrieval overhead by 3.2x compared to naive integration. Experiments on MuSiQue, HotpotQA, and 2WikiMultiHopQA demonstrate that ReaLM-Retrieve achieves on average 10.1% absolute improvement in answer F1 over standard RAG (range: 9.0-11.8% across the three benchmarks) while reducing retrieval calls by 47% compared to fixed-interval approaches like IRCoT (all improvements significant at p<0.01, paired bootstrap). On the challenging MuSiQue benchmark requiring 2-4 hop reasoning, our method achieves 71.2% F1 with an average of only 1.8 retrieval calls per question. Analysis shows that ReaLM-Retrieve also improves retrieval quality itself, achieving 81.3% Recall@5 with consistently higher precision and MRR than fixed-interval baselines on supporting evidence, establishing new state-of-the-art efficiency-accuracy trade-offs for reasoning-intensive retrieval tasks.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy
arXiv:2604.26671v1 Announce Type: cross
Abstract: Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In me…
arXiv →
From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy
arXiv:2604.26671v1 Announce Type: cross Abstract: Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In me…
arXiv:2604.26671v1 Announce Type: cross Abstract: Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This article proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. Rather than replacing deterministic clinical logic wholesale with end-to-end black-box models, the proposed approach combines a deterministic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism, and a human supervision layer for verification, escalation, and risk control. We demonstrate that trust also depends on selective verification of clinically critical findings, bounded clinical context, disciplined prompt architecture, and careful evaluation on realistic cases. Classifier-driven modular prompting is examined as an incremental path to scaling clinical depth without sacrificing prompt performance and without waiting for complete rule-based coverage. To operationalize trust, a set of trust metrics is proposed, built on metrological principles -- measurement uncertainty, calibration, traceability -- enabling quantitative rather than subjective assessment of each architectural layer. In this perspective, trustworthy clinical AI emerges not as a property of an individual model, but as an architectural outcome of a system into which evidence trails, human oversight, tiered escalation, and graduated action rights are embedded from the outset.
▶
arXiv cs.AI
30.04.2026
A Toolkit for Detecting Spurious Correlations in Speech Datasets
arXiv:2604.26676v1 Announce Type: cross
Abstract: We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlati…
arXiv →
A Toolkit for Detecting Spurious Correlations in Speech Datasets
arXiv:2604.26676v1 Announce Type: cross Abstract: We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlati…
arXiv:2604.26676v1 Announce Type: cross Abstract: We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present both in the training and test data, these correlations result in an overestimation of the system performance -- a dangerous situation, specially in high-stakes application where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio. Better than chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
arXiv:2604.26689v1 Announce Type: cross
Abstract: Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typ…
arXiv →
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
arXiv:2604.26689v1 Announce Type: cross Abstract: Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typ…
arXiv:2604.26689v1 Announce Type: cross Abstract: Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typed-composition methods (BLADE, SymSkill, Generative Skill Chaining) treat the library as frozen at test time and do not analyze how composition outcomes change when a skill is replaced. We introduce a paired-sampling cross-version swap protocol on robosuite manipulation tasks to characterize this dimension of compositional skill learning. On a dual-arm peg-in-hole task we discover a dominant-skill effect: one ECM achieves 86.7% atomic success rate while every other ECM is at or below 26.7%, and whether this dominant ECM enters a composition shifts the success rate by up to +50pp. We characterize the boundary on a simpler pick task where all atomic policies saturate at 100% and the effect is undefined. Across three tasks we further find that off-policy behavioral distance metrics fail to identify the dominant ECM, ruling out the natural cheap predictor. We propose an atomic-quality probe and a Hybrid Selector combining per-skill probes (zero per-decision cost) with selective composition revalidation (full cost), and characterize its Pareto frontier on 144 skill-update decisions. On T6 the atomic-only probe sits 23pp below full revalidation (64.6% vs 87.5% oracle match) at zero per-decision cost; a Hybrid Selector with m=10 closes most of that gap to ~12pp at 46% of full-revalidation cost. On the cross-task average over 144 events, atomic-only is within 3pp of full revalidation under a mixed-oracle caveat. The atomic-quality probe is, to our knowledge, the first principled, deployment-ready primitive for skill-update governance in compositional robot policies.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
arXiv:2604.26694v1 Announce Type: cross
Abstract: We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstru…
arXiv →
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
arXiv:2604.26694v1 Announce Type: cross Abstract: We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstru…
arXiv:2604.26694v1 Announce Type: cross Abstract: We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real-time execution, while dedicating the full sequence of steps to generate high-fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
▶
arXiv cs.AI
30.04.2026
Random Cloud: Finding Minimal Neural Architectures Without Training
arXiv:2604.26830v1 Announce Type: cross
Abstract: I propose the \emph{Random Cloud} method, a training-free approach to neural architecture search that discovers minimal feedforward network topologie…
arXiv →
Random Cloud: Finding Minimal Neural Architectures Without Training
arXiv:2604.26830v1 Announce Type: cross Abstract: I propose the \emph{Random Cloud} method, a training-free approach to neural architecture search that discovers minimal feedforward network topologie…
arXiv:2604.26830v1 Announce Type: cross Abstract: I propose the \emph{Random Cloud} method, a training-free approach to neural architecture search that discovers minimal feedforward network topologies through stochastic exploration and progressive structural reduction. Unlike post-training pruning methods that require a full train-prune-retrain cycle, this method evaluates randomly initialized networks without backpropagation, progressively reduces their topology, and only trains the best minimal candidate at the end. I evaluate on 7 classification benchmarks against magnitude pruning and random pruning baselines. The Random Cloud matches or outperforms both baselines in 6 of 7 datasets, achieving statistically significant improvements on Sonar ($+4.9$pp accuracy, $p{=}0.017$ vs magnitude pruning) with 87\% parameter reduction. Crucially, the method is faster than both pruning baselines in 4 of 5 datasets (0.67--0.94$\times$ the cost of full training), since it avoids training the full-size network entirely.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A self-evolving agent for explainable diagnosis of DFT-experiment band-gap mismatch
arXiv:2604.26703v1 Announce Type: cross
Abstract: Standard density functional theory (DFT) routinely misclassifies the electronic ground state of correlated and structurally complex compounds, predic…
arXiv →
A self-evolving agent for explainable diagnosis of DFT-experiment band-gap mismatch
arXiv:2604.26703v1 Announce Type: cross Abstract: Standard density functional theory (DFT) routinely misclassifies the electronic ground state of correlated and structurally complex compounds, predic…
arXiv:2604.26703v1 Announce Type: cross Abstract: Standard density functional theory (DFT) routinely misclassifies the electronic ground state of correlated and structurally complex compounds, predicting metallic behaviour for materials that experiments report as semiconductors. Each such mismatch encodes a specific non-ideality -- magnetic ordering, electron correlation, an alternative polymorph, or a defect -- that the calculation excluded, but extracting that signal at scale has remained a manual exercise. Here we introduce XDFT, a closed-loop agent that diagnoses the mismatch automatically: it draws candidate hypotheses from a curated catalogue, executes the corresponding first-principles tests, and updates a global Bayesian posterior over hypothesis usefulness from each verdict. On a verified benchmark of 124 materials, XDFT identifies a resolving mechanism for 70 of 90 mismatch cases (78\%), an order of magnitude above a uniform-random baseline (19\%) and a static LLM ordering (20\%). The internal posterior aligns with empirical performance over the benchmark timeline, and resolved cases collapse into a tri-partite element-class taxonomy that we distil into a four-line static rule. Each diagnosed material is returned with a corrected protocol and a mechanistic attribution; failed cases are flagged as evidence-backed targets for experimental re-examination.
▶
arXiv cs.AI
30.04.2026
Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
arXiv:2604.26762v1 Announce Type: cross
Abstract: The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean…
arXiv →
Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
arXiv:2604.26762v1 Announce Type: cross Abstract: The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean…
arXiv:2604.26762v1 Announce Type: cross Abstract: The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT) to repair PT's missing channel axis and weak per-step semantics, and adopt ST-PT as a shared cornerstone backbone. We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions, one per property, that probe how each property can be exploited in time series: RQ1. The graph topology and potentials are direct programmable primitives. Can this be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise? RQ2. The CRF's factor matrices are the operator's potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes structural rather than feature-level modulation of a fixed one? RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition of latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error? We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Domain-Adapted Small Language Models for Reliable Clinical Triage
arXiv:2604.26766v1 Announce Type: cross
Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free…
arXiv →
Domain-Adapted Small Language Models for Reliable Clinical Triage
arXiv:2604.26766v1 Announce Type: cross Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free…
arXiv:2604.26766v1 Announce Type: cross Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage. We systematically compared multiple SLMs across diverse prompting pipelines and found that clinical vignettes, concise summaries of triage narratives, yielded the most accurate predictions. The SLM, Qwen2.5-7B, demonstrated the strongest balance of accuracy, stability, and computational efficiency. Through large-scale domain adaptation using expert-curated and silver-standard pediatric triage data, fine-tuned Qwen2.5-7B models substantially reduced discordance and clinically significant errors, outperforming all baseline SLMs and advanced proprietary large language models (LLMs, e.g., GPT-4o). These findings highlight the feasibility of institution-specific SLMs for reliable, privacy-preserving ESI decision support and underscore the importance of targeted fine-tuning over more complex inference strategies.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
arXiv:2604.26774v1 Announce Type: cross
Abstract: Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods…
arXiv →
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
arXiv:2604.26774v1 Announce Type: cross Abstract: Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods…
arXiv:2604.26774v1 Announce Type: cross Abstract: Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
arXiv:2604.26806v1 Announce Type: cross
Abstract: Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by…
arXiv →
ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
arXiv:2604.26806v1 Announce Type: cross Abstract: Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by…
arXiv:2604.26806v1 Announce Type: cross Abstract: Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by the profound spatial heterogeneity inherent in natural images. Specifically, the imposition of a uniform global receptive field across regions of varying information density inevitably leads to local feature degradation, particularly in dense conflict zones populated by microscopic targets. To address this mechanistic limitation, we propose ViCrop-Det, a training-free inference framework that introduces adaptive spatial trust region shrinkage. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det leverages the detection decoder's cross-attention distribution as an endogenous probe. By utilizing Spatial Attention Entropy (SAE) to heuristically evaluate local spatial ambiguity, the framework executes dynamic spatial routing, allocating a fixed computational budget exclusively to regions exhibiting both high target saliency and high cognitive uncertainty. By shrinking the spatial trust region and injecting high-frequency localized observations, ViCrop-Det actively resolves spatial ambiguity and recovers fine-grained features without requiring architectural modifications. Extensive evaluations on VisDrone and DOTA-v1.5 demonstrate that ViCrop-Det yields competitive performance enhancements, consistently adding +1-3 mAP@50 to RT-DETR-R50 and Deformable DETR with a marginal 20-23\% latency overhead. On MS COCO, $AP_{S}$ improves while $AP_{M}/AP_{L}$ remains stable, indicating precise fine-scale refinement without compromising the global spatial prior. Under compute-matched settings, our adaptive routing strategy comprehensively surpasses uniform slicing baselines, achieving a highly optimized accuracy-speed trade-off.
▶
arXiv cs.AI
30.04.2026
Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training
arXiv:2604.26833v1 Announce Type: cross
Abstract: This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenario…
arXiv →
Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training
arXiv:2604.26833v1 Announce Type: cross Abstract: This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenario…
arXiv:2604.26833v1 Announce Type: cross Abstract: This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenarios under limited simulation training. The framework combines a fixed rule-based high-level advisor with an online goal-conditioned low-level reinforcement learning (RL) controller. To stress-test early adaptation, we also consider a strict no-pretraining deployment regime. The high-level advisor is defined offline from a structured task specification and compiled into deterministic rules. It provides interpretable mission- and safety-aware guidance through recommended actions, avoided actions, and regime-dependent arbitration weights. The low-level controller learns online from task-defined dense rewards and reuses experience through a mode-aware prioritized replay mechanism augmented with rule-derived metadata. We evaluate the framework on two tasks: battery-aware multi-goal delivery and moving-target delivery in obstacle-rich environments. Across both tasks, the proposed method improves early safety and sample efficiency primarily by reducing collision terminations, while preserving the ability to adapt online to scenario-specific dynamics.
▶
arXiv cs.AI
30.04.2026
HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists
arXiv:2604.26835v1 Announce Type: cross
Abstract: We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies hav…
arXiv →
HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists
arXiv:2604.26835v1 Announce Type: cross Abstract: We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies hav…
arXiv:2604.26835v1 Announce Type: cross Abstract: We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a corresponding toolkit as a practical foundation for addressing this problem. Our package is lightweight and can perform verification in seconds on a standard laptop. It can also be executed entirely offline and runs efficiently using only CPUs. We hope that HalluCiteChecker will help reduce reviewer workload and support organizers by enabling systematic pre-review and publication checks. Our code is released under the Apache 2.0 license on GitHub and is distributed as an installable package via PyPI. A demonstration video is available on YouTube.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data
arXiv:2604.26841v1 Announce Type: cross
Abstract: When do language diffusion models memorize their training data, and how to quantitatively assess their true generative regime? We address these quest…
arXiv →
Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data
arXiv:2604.26841v1 Announce Type: cross Abstract: When do language diffusion models memorize their training data, and how to quantitatively assess their true generative regime? We address these quest…
arXiv:2604.26841v1 Announce Type: cross Abstract: When do language diffusion models memorize their training data, and how to quantitatively assess their true generative regime? We address these questions by showing that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as Associative Memories (AMs) $\textit{with emergent creative capabilities}$. The core idea of an AM is to reliably recover stored data points as $\textit{memories}$ by establishing distinct basins of attraction around them. Historically, models like Hopfield networks use an explicit energy function to guarantee these stable attractors. We broaden this perspective by leveraging the observation that energy is not strictly necessary, as basins of attraction can also be formed via conditional likelihood maximization. By evaluating token recovery of $\textit{training}$ and $\textit{test}$ examples, we identify in UDDMs a sharp memorization-to-generalization transition governed by the size of the training dataset: as it increases, basins around training examples shrink and basins around unseen test examples expand, until both later converge to the same level. Crucially, we can detect this transition using only the conditional entropy of predicted token sequences: memorization is characterized by vanishing conditional entropy, while in the generalization regime the conditional entropy of most tokens remains finite. Thus, conditional entropy offers a practical probe for the memorization-to-generalization transition in deployed models.
▶
arXiv cs.AI
30.04.2026
Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows
arXiv:2604.26851v1 Announce Type: cross
Abstract: When generative AI (genAI) systems are used in high-stakes decision-making, its recommended role is to aid, rather than replace, human decision-makin…
arXiv →
Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows
arXiv:2604.26851v1 Announce Type: cross Abstract: When generative AI (genAI) systems are used in high-stakes decision-making, its recommended role is to aid, rather than replace, human decision-makin…
arXiv:2604.26851v1 Announce Type: cross Abstract: When generative AI (genAI) systems are used in high-stakes decision-making, its recommended role is to aid, rather than replace, human decision-making. However, there is little empirical exploration of how professionals making high-stakes decisions, such as those related to employment, perceive their agency and level of control when working with genAI systems. Through interviews with 22 recruiting professionals, we investigate how genAI subtly influences control over everyday workflows and even individual hiring decisions. Our findings highlight a pressing conflict: while recruiters believe they have final authority across the recruiting pipeline, genAI has become an invisible architect that shapes the foundational building blocks of information used for evaluation, from defining a job to determining good interview performances. The decision of whether or not to adopt was also often outside recruiters' control, with many feeling compelled to adopt genAI due to calls to integrate AI from higher-ups in their business, to combat applicant use of AI, and the individual need to boost productivity. Despite a seemingly seismic shift in how recruiting happens, participants only reported marginal efficiency gains. Such gains came at the high cost of recruiter deskilling, a trend that jeopardizes the meaningful oversight of decision-making. We conclude by discussing the implications of such findings for responsible and perceptible genAI use in hiring contexts.
▶
arXiv cs.AI
30.04.2026
Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies
arXiv:2604.26903v1 Announce Type: cross
Abstract: This paper provides a concise yet comprehensive review of recent advancements in millimeter-wave (mm-wave) oscillators below 100 GHz and sub-terahert…
arXiv →
Recent Advances in mm-Wave and Sub-THz/THz Oscillators for FutureG Technologies
arXiv:2604.26903v1 Announce Type: cross Abstract: This paper provides a concise yet comprehensive review of recent advancements in millimeter-wave (mm-wave) oscillators below 100 GHz and sub-terahert…
arXiv:2604.26903v1 Announce Type: cross Abstract: This paper provides a concise yet comprehensive review of recent advancements in millimeter-wave (mm-wave) oscillators below 100 GHz and sub-terahertz (sub-THz/THz) oscillators above 100 GHz for next-generation computing and communication systems, including 5G, 6G, and beyond. Various design approaches, including CMOS, SiGe, and III-V semiconductor technologies, are explored in terms of performance metrics such as phase noise, output power, efficiency, frequency tunability, and stability. The review highlights key challenges in achieving high-performance and reliable oscillator designs while discussing emerging techniques for performance enhancement. By evaluating recent design trends, this work aims to offer valuable insights and design guidelines that facilitate the development of robust mm-wave and sub-THz/THz oscillators for future communication, computing, and sensing applications.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ClawGym: A Scalable Framework for Building Effective Claw Agents
arXiv:2604.26904v1 Announce Type: cross
Abstract: Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around t…
arXiv →
ClawGym: A Scalable Framework for Building Effective Claw Agents
arXiv:2604.26904v1 Announce Type: cross Abstract: Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around t…
arXiv:2604.26904v1 Announce Type: cross Abstract: Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes.To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will be soon released at https://github.com/ClawGym.
▶
arXiv cs.AI
30.04.2026
Causal Learning with Neural Assemblies
arXiv:2604.26919v1 Announce Type: cross
Abstract: Can Neural Assemblies -- groups of neurons that fire together and strengthen through co-activation -- learn the direction of causal influence between…
arXiv →
Causal Learning with Neural Assemblies
arXiv:2604.26919v1 Announce Type: cross Abstract: Can Neural Assemblies -- groups of neurons that fire together and strengthen through co-activation -- learn the direction of causal influence between…
arXiv:2604.26919v1 Announce Type: cross Abstract: Can Neural Assemblies -- groups of neurons that fire together and strengthen through co-activation -- learn the direction of causal influence between variables? While established as a computationally general substrate for classification, parsing, and planning, neural assemblies have not yet been shown to internalize causal directionality. We demonstrate that the inherent operations of neural assemblies -- projection, local plasticity control, and sparse winner selection -- are sufficient for directional learning. We introduce DIRECT (DIRectional Edge Coupling/Training), a mechanism that co-activates source and target assemblies under an adaptive gain schedule to internalize directed relations. Unlike backpropagation-based methods, DIRECT relies solely on local plasticity, making the resulting causal claims auditable at the mechanism level. Our findings are verified through a dual-readout validation strategy: (i) synaptic-strength asymmetry, measuring the emergent weight gap between forward and reverse links, and (ii) functional propagation overlap, quantifying the reliability of directional signal flow. Across multiple domains, the framework achieves perfect structural recovery under a supervised, known-structure setting. These results establish neural assemblies as an auditable bridge between biologically plausible dynamics and formal causal models, offering an "explainable by design" framework where causal claims are traceable to specific neural winners and synaptic asymmetries.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
arXiv:2604.26951v1 Announce Type: cross
Abstract: Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters …
arXiv →
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
arXiv:2604.26951v1 Announce Type: cross Abstract: Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters …
arXiv:2604.26951v1 Announce Type: cross Abstract: Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
▶
arXiv cs.AI
30.04.2026
Explainable Representation of Finite-Memory Policies for POMDPs using Decision Trees
arXiv:2411.13365v2 Announce Type: replace
Abstract: Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability…
arXiv →
Explainable Representation of Finite-Memory Policies for POMDPs using Decision Trees
arXiv:2411.13365v2 Announce Type: replace Abstract: Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability…
arXiv:2411.13365v2 Announce Type: replace Abstract: Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used "attractor-based" policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
arXiv:2604.16902v3 Announce Type: replace
Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native int…
arXiv →
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
arXiv:2604.16902v3 Announce Type: replace Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native int…
arXiv:2604.16902v3 Announce Type: replace Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the ``text-dominance'' of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents
arXiv:2412.13682v5 Announce Type: replace
Abstract: Travel planning stands out among real-world applications of \emph{Language Agents} because it couples significant practical demand with a rigorous …
arXiv →
ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents
arXiv:2412.13682v5 Announce Type: replace Abstract: Travel planning stands out among real-world applications of \emph{Language Agents} because it couples significant practical demand with a rigorous …
arXiv:2412.13682v5 Announce Type: replace Abstract: Travel planning stands out among real-world applications of \emph{Language Agents} because it couples significant practical demand with a rigorous constraint-satisfaction challenge. However, existing benchmarks primarily operate on a slot-filling paradigm, restricting agents to synthetic queries with pre-defined constraint menus, which fails to capture the open-ended nature of natural language interaction, where user requirements are compositional, diverse, and often implicitly expressed. To address this gap, we introduce \emph{ChinaTravel}, with four key contributions: 1) a practical sandbox aligned with the multi-day, multi-POI travel planning, 2) a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison 3) an open-ended dataset that integrates diverse travel requirements and implicit intent from 1154 human participants, and 4) fine-grained analysis reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0% constraint satisfaction rate on human queries, a 10 \times improvement over purely neural models, yet highlighting significant challenges in compositional generalization. Overall, ChinaTravel provides a foundation for advancing language agents through compositional constraint validation in complex, real-world planning scenarios. Project Page: https://www.lamda.nju.edu.cn/shaojj/ChinaTravel/index.html
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Networks of Causal Abstractions: A Sheaf-theoretic Framework
arXiv:2509.25236v3 Announce Type: replace
Abstract: A core challenge in causal artificial intelligence is the principled coordination of multiple, imperfect, and subjective causal perspectives arisin…
arXiv →
Networks of Causal Abstractions: A Sheaf-theoretic Framework
arXiv:2509.25236v3 Announce Type: replace Abstract: A core challenge in causal artificial intelligence is the principled coordination of multiple, imperfect, and subjective causal perspectives arisin…
arXiv:2509.25236v3 Announce Type: replace Abstract: A core challenge in causal artificial intelligence is the principled coordination of multiple, imperfect, and subjective causal perspectives arising from distributed agents with limited and heterogeneous access to the environment. This problem has received little formal treatment, as the existing framework assumes a single shared global causal model. This work introduces the causal abstraction network (CAN), a general sheaf-theoretic framework for representing, learning, and reasoning across collections of mixture of causal models (MCMs) - a class that unifies several existing models of context-dependent causal mechanisms. Sheaf theory provides a natural foundation for this task, offering a rigorous framework to coherently align distributed causal knowledge without requiring explicit causal graphs, functional mechanisms, interventional data, or jointly sampled observations. At the theoretical level, we provide a categorical formulation of MCMs and characterize key properties of CANs, including consistency and smoothness. Under consistency, we establish necessary and sufficient conditions: (i) for the existence of global sections, linked to spectral properties of an associated connection Laplacian; and (ii) for the convergence of causal knowledge diffusion over the CAN to the space of global sections. At the methodological level, we exploit the compositionality of causal abstractions to decompose the learning of consistent CANs into local problems on network edges, extending our prior work on Gaussian variables to Gaussian mixtures via the proposed MIXTURE-CALSEP algorithm. We validate the framework on synthetic data and through a financial application involving a multi-agent trading system, demonstrating CAN recovery, CAN-based portfolio optimization, and counterfactual reasoning.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
arXiv:2510.06002v3 Announce Type: replace
Abstract: In high-stakes legal domains, retrieval must preserve not only semantic relevance, but also the hierarchy, temporality, and causal provenance of le…
arXiv →
Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
arXiv:2510.06002v3 Announce Type: replace Abstract: In high-stakes legal domains, retrieval must preserve not only semantic relevance, but also the hierarchy, temporality, and causal provenance of le…
arXiv:2510.06002v3 Announce Type: replace Abstract: In high-stakes legal domains, retrieval must preserve not only semantic relevance, but also the hierarchy, temporality, and causal provenance of legal norms. Standard Retrieval-Augmented Generation (RAG), based mainly on semantic similarity over text fragments, cannot reliably provide this level of control. Prior work on SAT-Graph RAG addressed the representation problem by modeling legal materials as structure-aware temporal knowledge graphs. This paper addresses the next problem: how an LLM-based reasoning agent can interact with such a graph without reintroducing unreliable retrieval behavior. We specify the SAT-Graph API, a canonical primitive interface for auditable reasoning over temporal knowledge graphs, developed and illustrated in the legal domain. The API exposes typed, atomic, and composable primitives that mediate between a probabilistic language model and a deterministic symbolic substrate. Its design follows Probability Isolation: uncertainty is confined to intent translation, semantic anchoring, and final narrative synthesis, while structural, temporal, and causal graph traversals are executed through deterministic operations over canonical anchors. The interface shifts legal RAG from single-shot Retrieve-then-Generate to active Reason-Act-Observe. An agent decomposes a legal question into an explicit execution plan, invokes primitives for point-in-time retrieval, context reconstruction, provenance tracing, and impact analysis, and produces an answer grounded in an auditable log of graph operations. The result is a formal architectural specification, not an empirical benchmark: a secure interaction protocol that decouples legal knowledge representation from agentic reasoning. Although illustrated in law, the primitive model is domain-portable to other temporally versioned, provenance-sensitive, and authority-governed knowledge bases.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
arXiv:2510.14703v2 Announce Type: replace
Abstract: Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an i…
arXiv →
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
arXiv:2510.14703v2 Announce Type: replace Abstract: Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an i…
arXiv:2510.14703v2 Announce Type: replace Abstract: Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with \textbf{ToolPRM}, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows ``\textbf{explore more but retain less}'', since early JSON errors are unrecoverable.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
arXiv:2510.18165v3 Announce Type: replace
Abstract: Diffusion language models (DLMs) are emerging as a compelling alternative to the dominant autoregressive paradigm, offering inherent advantages in …
arXiv →
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
arXiv:2510.18165v3 Announce Type: replace Abstract: Diffusion language models (DLMs) are emerging as a compelling alternative to the dominant autoregressive paradigm, offering inherent advantages in …
arXiv:2510.18165v3 Announce Type: replace Abstract: Diffusion language models (DLMs) are emerging as a compelling alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, for the tasks with strict structural constraints such as code generation, DLMs face a critical trade-off between inference speed and output quality, where accelerating generation by reducing sampling steps often leads to catastrophic performance collapse. We find that the fundamental reasons are: 1) the generation difficulty is non-uniform in the structured sequence decoding steps, making DLM's static acceleration strategy suboptimal; 2) the context of tokens generated by DLM evolves continuously, causing early high-confidence predictions to turn into irreversible errors. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs that first achieves both better inference speed and output quality in code generation. Saber dynamically adjusts the number of tokens unmasked per step based on the model's evolving confidence, and utilizes a backtracking mechanism to revert tokens whose confidence drops as new context emerges, with its effectiveness supported by theoretical analysis. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average of 1.9\% over mainstream DLM sampling methods, while achieving an average 251.4\% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Treatment, evidence, imitation, and chat
arXiv:2506.23040v5 Announce Type: replace-cross
Abstract: Large language models are thought to have the potential to aid in medical decision making. This work investigates the degree to which this mi…
arXiv →
Treatment, evidence, imitation, and chat
arXiv:2506.23040v5 Announce Type: replace-cross Abstract: Large language models are thought to have the potential to aid in medical decision making. This work investigates the degree to which this mi…
arXiv:2506.23040v5 Announce Type: replace-cross Abstract: Large language models are thought to have the potential to aid in medical decision making. This work investigates the degree to which this might be the case. We start with the treatment problem, the patient's core medical decision-making task, which is solved in collaboration with a clinician. We discuss different approaches to solving it, including, within evidence-based medicine, experimental and observational data. We then discuss the chat problem, and how this differs from the treatment problem -- in particular with respect to imitation (and how imitation alone cannot solve the true treatment problem, although this does not mean it is not useful). We then discuss how a large-language-model-based system might be trained to solve the treatment problem, highlighting that the major challenges relate to the ethics of experimentation and the assumptions associated with observation. We finally discuss how these challenges relate to evidence-based medicine and how this might inform the efforts of the medical research community to solve the treatment problem. Throughout, we illustrate our arguments with the cholesterol medications, statins.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Student Guides Teacher: Weak-to-Strong Inference via Spectral Orthogonal Exploration
arXiv:2601.06160v2 Announce Type: replace
Abstract: Large Language Models (LLMs) often suffer from ''Reasoning Collapse'' on challenging mathematical reasoning tasks, where stochastic sampling produc…
arXiv →
Student Guides Teacher: Weak-to-Strong Inference via Spectral Orthogonal Exploration
arXiv:2601.06160v2 Announce Type: replace Abstract: Large Language Models (LLMs) often suffer from ''Reasoning Collapse'' on challenging mathematical reasoning tasks, where stochastic sampling produc…
arXiv:2601.06160v2 Announce Type: replace Abstract: Large Language Models (LLMs) often suffer from ''Reasoning Collapse'' on challenging mathematical reasoning tasks, where stochastic sampling produces lexical variations of the same erroneous logic rather than genuine semantic exploration. We observe that failed reasoning traces are often associated with a low-rank bias manifold in the model's hidden-state geometry, which reduces exploration toward corrective solution directions. To address this, we propose Spectral Orthogonal Exploration (SOE), a geometric inference framework under a ''Student Guides Teacher'' paradigm. Instead of using a weak auxiliary agent for imitation, SOE uses it as an orthogonal probe to introduce semantically heterogeneous reasoning signals into the teacher's orthogonal complement of its dominant subspace. This intervention steers the teacher toward more diverse reasoning trajectories and improves exploration beyond standard sampling. Experiments on mathematical benchmarks show that SOE improves average accuracy by 62.4\% and average sampling efficiency by 113.7\% over baseline methods, suggesting that geometric interventions can be effective for mitigating reasoning collapse in mathematical reasoning. We further provide preliminary evidence that SOE is also effective on logic and code generation benchmarks.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval
arXiv:2601.13969v2 Announce Type: replace
Abstract: Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to f…
arXiv →
Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval
arXiv:2601.13969v2 Announce Type: replace Abstract: Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to f…
arXiv:2601.13969v2 Announce Type: replace Abstract: Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to follow relational links. Similarity-based retrievers provide coverage but remain shallow, whereas traversal-based methods rely on selecting seed nodes to start exploration, which can fail when queries span multiple entities and relations. We introduce ARK: Adaptive Retriever of Knowledge, a tool-using KG retriever that gives a language model control over this breadth-depth tradeoff using a two-operation toolset: global lexical search over node descriptors and one-hop neighborhood exploration that composes into multi-hop traversal. ARK alternates between breadth-oriented discovery and depth-oriented expansion without depending on a fragile seed selection, a pre-set hop depth, or requiring retrieval training. ARK adapts tool use to queries, using global search for language-heavy queries and neighborhood exploration for relation-heavy queries. On STaRK, ARK reaches 59.1% average Hit@1 and 67.4 average MRR, improving average Hit@1 by up to 31.4% and average MRR by up to 28.0% over retrieval-based and agent-based training-free methods. Finally, we distill ARK's tool-use trajectories from a large teacher into an 8B model via label-free imitation, improving Hit@1 by +7.0, +26.6, and +13.5 absolute points over the base 8B model on AMAZON, MAG, and PRIME datasets, respectively, while retaining up to 98.5% of the teacher's Hit@1 rate.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
arXiv:2601.15808v2 Announce Type: replace
Abstract: Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing e…
arXiv →
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
arXiv:2601.15808v2 Announce Type: replace Abstract: Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing e…
arXiv:2601.15808v2 Announce Type: replace Abstract: Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepSearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis
arXiv:2602.01297v3 Announce Type: replace
Abstract: Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges f…
arXiv →
RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis
arXiv:2602.01297v3 Announce Type: replace Abstract: Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges f…
arXiv:2602.01297v3 Announce Type: replace Abstract: Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges for large language models (LLMs) in clinical diagnosis. In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions. Although recent multi-agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence-driven processes used by clinical experts. More fundamentally, existing approaches largely ignore the rich logical dependencies among diseases, such as mutual exclusivity, pathological compatibility, and diagnostic confusion. This limitation prevents them from ruling out clinically implausible hypotheses, even when sufficient evidence is available. To overcome these, we propose RE-MCDF, a relation-enhanced multi-expert clinical diagnosis framework. RE-MCDF introduces a generation--verification--revision closed-loop architecture that integrates three complementary components: (i) a primary expert that generates candidate diagnoses and supporting evidence, (ii) a laboratory expert that dynamically prioritizes heterogeneous clinical indicators, and (iii) a multi-relation awareness and evaluation expert group that explicitly enforces inter-disease logical constraints. Guided by a medical knowledge graph (MKG), the first two experts adaptively reweight EMR evidence, while the expert group validates and corrects candidate diagnoses to ensure logical consistency. Extensive experiments on the neurology subset of CMEMR (NEEMRs) and on our curated dataset (XMEMRs) demonstrate that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios (https://github.com/shenshaowei/RE-MCDF).
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
The Dual Role of Abstracting over the Irrelevant in Symbolic Explanations: Cognitive Effort vs. Understanding
arXiv:2602.03467v2 Announce Type: replace
Abstract: Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a tran…
arXiv →
The Dual Role of Abstracting over the Irrelevant in Symbolic Explanations: Cognitive Effort vs. Understanding
arXiv:2602.03467v2 Announce Type: replace Abstract: Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a tran…
arXiv:2602.03467v2 Announce Type: replace Abstract: Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a transparent foundation for interpretability, raw logical traces often impose a high extraneous cognitive load. We investigate how formal abstractions, specifically removal and clustering, impact human reasoning performance and cognitive effort. Utilizing Answer Set Programming (ASP) as a formal framework, we define a notion of irrelevant details to be abstracted over to obtain simplified explanations. Our cognitive experiments, in which participants classified stimuli across domains with explanations derived from an answer set program, show that clustering details significantly improve participants' understanding, while removal of details significantly reduce cognitive effort, supporting the hypothesis that abstraction enhances human-centered symbolic explanations.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
arXiv:2602.20426v2 Announce Type: replace
Abstract: While most efforts to improve LLM-based tool-using agents focus on the agent itself - through larger models, better prompting, or fine-tuning - age…
arXiv →
Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
arXiv:2602.20426v2 Announce Type: replace Abstract: While most efforts to improve LLM-based tool-using agents focus on the agent itself - through larger models, better prompting, or fine-tuning - age…
arXiv:2602.20426v2 Announce Type: replace Abstract: While most efforts to improve LLM-based tool-using agents focus on the agent itself - through larger models, better prompting, or fine-tuning - agent performance increasingly plateaus due to the quality of the tool interfaces these agents consume. Tool descriptions are often written for human developers and tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows. Existing approaches to improving tool interfaces (1) require re-running a multi-stage per-tool pipeline - synthesizing queries, executing an agent to collect trajectories, annotating trajectories, and prompting a strong LLM multiple times - for every API that enters the catalog, and (2) typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to internalize reusable patterns of what makes a tool description effective. To support this approach, we construct a large-scale dataset of high-quality tool interfaces derived from real-world APIs through a principled data synthesis workflow. Experiments on widely adopted benchmarks show that Trace-Free+ improves robustness as tool catalogs scale to 150+ candidates - in scaling experiments, reducing accuracy degradation by 29.23% and improving average query-level success by 60.89% on StableToolBench - generalizes across domains without retraining, and provides complementary gains on top of agent fine-tuning.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
arXiv:2602.23163v3 Announce Type: replace
Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechani…
arXiv →
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
arXiv:2602.23163v3 Announce Type: replace Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechani…
arXiv:2602.23163v3 Announce Type: replace Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, \textbf{decision-theoretic view of steganography}. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbf{steganographic gap} -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
arXiv:2603.15262v2 Announce Type: replace
Abstract: Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based…
arXiv →
Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
arXiv:2603.15262v2 Announce Type: replace Abstract: Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based…
arXiv:2603.15262v2 Announce Type: replace Abstract: Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on JD.com demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in JD.com's AI-Search system.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Framework for Longitudinal Health AI Agents
arXiv:2604.12019v3 Announce Type: replace
Abstract: Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management…
arXiv →
A Framework for Longitudinal Health AI Agents
arXiv:2604.12019v3 Announce Type: replace Abstract: Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management…
arXiv:2604.12019v3 Announce Type: replace Abstract: Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, both within and beyond clinical settings, where follow-up, coherent reasoning, and sustained alignment with individuals' goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
arXiv:2604.14858v2 Announce Type: replace
Abstract: As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve…
arXiv →
Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
arXiv:2604.14858v2 Announce Type: replace Abstract: As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve…
arXiv:2604.14858v2 Announce Type: replace Abstract: As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-Codex, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the three-dimensional Safety Taxonomy over risk source, failure mode, and real-world harm, and then use that customized taxonomy to define the benchmark specification consumed by the shared ATBench construction pipeline. This extensibility matters because agent frameworks remain relatively stable at the architectural level even as their concrete execution settings, tool ecosystems, and product capabilities evolve quickly. Concretely, ATBench-Claw targets OpenClaw-sensitive execution chains over tools, skills, sessions, and external actions, while ATBench-Codex targets trajectories in the OpenAI Codex / Codex-runtime setting over repositories, shells, patches, dependencies, approvals, and runtime policy boundaries. Our emphasis therefore falls on taxonomy customization, domain-specific risk coverage, and benchmark design under a shared ATBench generation framework.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities
arXiv:2604.19653v2 Announce Type: replace
Abstract: Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it …
arXiv →
A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities
arXiv:2604.19653v2 Announce Type: replace Abstract: Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it …
arXiv:2604.19653v2 Announce Type: replace Abstract: Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging development in generative models, were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we introduced a first step towards solving it, by the introduction and application of a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a great challenge to consider and that it should be tackled through adversarial evaluation in accordance with the current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance over the trajectory user-linking problem.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction
arXiv:2604.25602v2 Announce Type: replace
Abstract: Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, obse…
arXiv →
OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction
arXiv:2604.25602v2 Announce Type: replace Abstract: Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, obse…
arXiv:2604.25602v2 Announce Type: replace Abstract: Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework driven by two core novelties: a unified Oxy abstraction and the OxyBank evolution engine. The unified abstraction encapsulates agents, tools, LLMs, and reasoning flows as pluggable atomic components, enabling Lego-like scalable system composition and non-intrusive monitoring. To enhance observability, OxyGent introduces permission-driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, providing adaptive visualizations. Furthermore, to support continuous evolution, OxyBank serves as an AI asset management platform that drives automated data backflow, annotation, and joint evolution. Empirical evaluations and real-world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is fully open-sourced under the Apache License 2.0 at https://github.com/jd-opensource/OxyGent.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Identifying the Achilles' Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models
arXiv:2401.00761v2 Announce Type: replace-cross
Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fin…
arXiv →
Identifying the Achilles' Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models
arXiv:2401.00761v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fin…
arXiv:2401.00761v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education to mislead users. Current methods for evaluating LLMs' veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective exposure of errors. To address these challenges, we propose HalluHunter, a novel, fully automated framework for systematically uncovering factual inaccuracies in LLMs. HalluHunter employs a knowledge-graph-based approach, extracting fact triplets to generate diverse question types for single- and multi-hop reasoning using rule-based Natural Language Processing (NLP) techniques. Its iterative process starts with random triplet selection for question generation, followed by adaptive selection in subsequent iterations, targeting triplets where LLMs frequently err based on their performance analysis. Our extensive tests on nine prominent LLMs reveal that HalluHunter can trigger factual errors in up to 55% of tested questions. Moreover, we demonstrate that HalluHunter's test cases, particularly in adaptive selection, could further expose the weaknesses in benchmarking the factuality in LLMs meanwhile maintaining the coverage of questions. All code, data, and results are available at this link: https://github.com/Mysterchan/HalluHunter.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Data-Centric Foundation Models in Computational Healthcare: A Survey
arXiv:2401.02458v3 Announce Type: replace-cross
Abstract: The advent of foundation models (FMs) as an emerging suite of AI techniques has struck a wave of opportunities in computational healthcare. T…
arXiv →
Data-Centric Foundation Models in Computational Healthcare: A Survey
arXiv:2401.02458v3 Announce Type: replace-cross Abstract: The advent of foundation models (FMs) as an emerging suite of AI techniques has struck a wave of opportunities in computational healthcare. T…
arXiv:2401.02458v3 Announce Type: replace-cross Abstract: The advent of foundation models (FMs) as an emerging suite of AI techniques has struck a wave of opportunities in computational healthcare. The interactive nature of these models, guided by pre-training data and human instructions, has ignited a data-centric AI paradigm that emphasizes better data characterization, quality, and scale. In healthcare AI, obtaining and processing high-quality clinical data records has been a longstanding challenge, encompassing data quantity, annotation, patient privacy, and ethics. In this survey, we investigate a wide range of data-centric approaches in the FM era (from model pre-training to inference) towards improving the healthcare workflow. We discuss key perspectives in AI security, assessment, and alignment with human values. Finally, we offer a promising outlook on FM-based analytics to enhance patient outcomes and clinical workflows in the evolving landscape of healthcare and medicine. We provide an up-to-date list of healthcare-related foundation models and datasets at https://github.com/Yunkun-Zhang/Data-Centric-FM-Healthcare.
▶
arXiv cs.AI
30.04.2026
ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
arXiv:2405.13729v3 Announce Type: replace-cross
Abstract: In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data sampl…
arXiv →
ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
arXiv:2405.13729v3 Announce Type: replace-cross Abstract: In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data sampl…
arXiv:2405.13729v3 Announce Type: replace-cross Abstract: In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data samples are generally high-dimensional, and for various structured generation tasks, additional attributes are combined to associate with data samples. We show that the space spanned by the combination of dimensions and attributes can be insufficiently covered by existing training schemes of diffusion generative models, potentially limiting test time performance. We present a simple fix to this problem by constructing stochastic processes that fully exploit the combinatorial structures, hence the name ComboStoc. Using this simple strategy, we show that network training is significantly accelerated across diverse data modalities, including images and 3D structured shapes. Moreover, ComboStoc enables a new way of test time generation which uses asynchronous time steps for different dimensions and attributes, thus allowing for varying degrees of control over them. Our code is available at: https://github.com/Xrvitd/ComboStoc
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Learning to Ask: When LLM Agents Meet Unclear Instruction
arXiv:2409.00557v4 Announce Type: replace-cross
Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tas…
arXiv →
Learning to Ask: When LLM Agents Meet Unclear Instruction
arXiv:2409.00557v4 Announce Type: replace-cross Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tas…
arXiv:2409.00557v4 Announce Type: replace-cross Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
arXiv:2409.06624v4 Announce Type: replace-cross
Abstract: Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The hu…
arXiv →
A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
arXiv:2409.06624v4 Announce Type: replace-cross Abstract: Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The hu…
arXiv:2409.06624v4 Announce Type: replace-cross Abstract: Large Language Models (LLM) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often asks for cautious choice of key hyper-parameters such as the mixture ratio of extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between experimental scaling law and the actual deployment in the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance its Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size which directly indicates the optimal experimental setup. By thorough choice of hyper-parameter, and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of LLM on a real-life chat system which obtains satisfying performance.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
arXiv:2502.11614v3 Announce Type: replace-cross
Abstract: Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written one is highly challenging for …
arXiv →
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
arXiv:2502.11614v3 Announce Type: replace-cross Abstract: Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written one is highly challenging for …
arXiv:2502.11614v3 Announce Type: replace-cross Abstract: Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written one is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Prompting by explicitly explaining the distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source. We release our dataset, the human labels, and the annotator metadata at https://github.com/xnlp-lab/HumanEval-MGT.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
arXiv:2503.04872v3 Announce Type: replace-cross
Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. Howe…
arXiv →
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
arXiv:2503.04872v3 Announce Type: replace-cross Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. Howe…
arXiv:2503.04872v3 Announce Type: replace-cross Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is \textit{selectively distilled} into specialized student models via domain-specific supervised fine-tuning (SFT); And (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
arXiv:2504.18662v3 Announce Type: replace-cross
Abstract: Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have pr…
arXiv →
M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
arXiv:2504.18662v3 Announce Type: replace-cross Abstract: Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have pr…
arXiv:2504.18662v3 Announce Type: replace-cross Abstract: Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel training strategy that enables the reuse of learned features across multiple TAS models. Our method sets a new state-of-the-art performance on three robotic datasets REASSEMBLE, (Im)PerfectPour, and JIGSAWS. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
arXiv:2505.02077v2 Announce Type: replace-cross
Abstract: AI agents are beginning to interact with each other directly and across internet platforms and physical environments, creating security chall…
arXiv →
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
arXiv:2505.02077v2 Announce Type: replace-cross Abstract: AI agents are beginning to interact with each other directly and across internet platforms and physical environments, creating security chall…
arXiv:2505.02077v2 Announce Type: replace-cross Abstract: AI agents are beginning to interact with each other directly and across internet platforms and physical environments, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free-form protocols are essential for AI's task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi-agent dispersion and stealth optimization help adversaries evade oversight - creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi-agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi-agent security, a new field dedicated to securing networks of AI agents against threats that emerge or amplify through their interactions - whether direct or indirect via shared environments - with each other, humans, and institutions, and characterise fundamental security-utility and security-security trade-offs across both distributed and decentralised settings. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) offers applications to multi-agent security for work across diffuse subfields, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large-scale agent deployment, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
arXiv:2505.10924v4 Announce Type: replace-cross
Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emu…
arXiv →
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
arXiv:2505.10924v4 Announce Type: replace-cross Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emu…
arXiv:2505.10924v4 Announce Type: replace-cross Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of \emph{Computer-Using Agents} (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: \textit{\textbf{(i)}} define the CUA that suits safety analysis; \textit{\textbf{(ii)} } categorize current safety threats among CUAs; \textit{\textbf{(iii)}} propose a comprehensive taxonomy of existing defensive strategies; \textit{\textbf{(iv)}} summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
OT Score: An OT based Confidence Score for Prototype-Assisted Source Free Unsupervised Domain Adaptation
arXiv:2505.11669v3 Announce Type: replace-cross
Abstract: We address the computational and theoretical limitations of current distributional alignment methods for source-free unsupervised domain adap…
arXiv →
OT Score: An OT based Confidence Score for Prototype-Assisted Source Free Unsupervised Domain Adaptation
arXiv:2505.11669v3 Announce Type: replace-cross Abstract: We address the computational and theoretical limitations of current distributional alignment methods for source-free unsupervised domain adap…
arXiv:2505.11669v3 Announce Type: replace-cross Abstract: We address the computational and theoretical limitations of current distributional alignment methods for source-free unsupervised domain adaptation (SFUDA) using source class-mean features. In particular, we focus on estimating classification performance and confidence in the absence of target labels. Current theoretical frameworks for these methods often yield computationally intractable quantities and fail to adequately reflect the properties of the alignment algorithms employed. To overcome these challenges, we introduce the Optimal Transport (OT) score, a confidence metric derived from a novel theoretical analysis that exploits the flexibility of decision boundaries induced by Semi-Discrete Optimal Transport alignment. The proposed OT score is intuitively interpretable and theoretically rigorous. It provides principled uncertainty estimates for any given set of target pseudo-labels. Experimental results demonstrate that OT score outperforms existing confidence scores. Moreover, it improves SFUDA performance through training-time reweighting and provides a reliable, label-free proxy for model performance.
▶
arXiv cs.AI
30.04.2026
Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods
arXiv:2505.13518v2 Announce Type: replace-cross
Abstract: Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predic…
arXiv →
Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods
arXiv:2505.13518v2 Announce Type: replace-cross Abstract: Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predic…
arXiv:2505.13518v2 Announce Type: replace-cross Abstract: Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predictions toward the majority class and degrading classifier performance. This paper provides a comprehensive, systematic review of data balancing methods, extending beyond foundational oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants (e.g., Borderline SMOTE, K-Means SMOTE, and Safe-Level SMOTE) to encompass advanced adaptive methods (MWMOTE, AMDO), deep generative models (generative adversarial networks, variational autoencoders, and diffusion models), undersampling techniques (NearMiss, Tomek Links), combination/hybrid methods (SMOTE-ENN, SMOTE-Tomek, and SMOTE+OCSVM), ensemble strategies (SMOTEBoost, RUSBoost, Balanced Random Forest, and One-Sided Selection), and specialized approaches for multi-label and clustered data. Beyond descriptive categorization, this review critically examines each method's underlying assumptions, operational mechanisms, and suitability for diverse data characteristics, including high dimensionality, mixed feature types, class overlap, and noise. Key findings demonstrate that no single method universally outperforms others; optimal selection depends critically on dataset characteristics, classifier choice, and evaluation metrics. The paper concludes by identifying emerging research directions, including self-supervised learning for imbalance, diffusion-based generative oversampling, distribution-preserving resampling, knowledge distillation for imbalanced deployment, and the adaptation of foundation models to skewed distributions, offering practical guidelines for practitioners and a roadmap for future methodological development.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
RetroMotion: Retrocausal Motion Forecasting Models are Instructable
arXiv:2505.20414v2 Announce Type: replace-cross
Abstract: Motion forecasts of road users (i.e., agents) vary in complexity depending on the number of agents, scene constraints, and interactions. In p…
arXiv →
RetroMotion: Retrocausal Motion Forecasting Models are Instructable
arXiv:2505.20414v2 Announce Type: replace-cross Abstract: Motion forecasts of road users (i.e., agents) vary in complexity depending on the number of agents, scene constraints, and interactions. In p…
arXiv:2505.20414v2 Announce Type: replace-cross Abstract: Motion forecasts of road users (i.e., agents) vary in complexity depending on the number of agents, scene constraints, and interactions. In particular, the output space of joint trajectory distributions grows exponentially with the number of agents. Therefore, we decompose multi-agent motion forecasts into (1) marginal distributions for all modeled agents and (2) joint distributions for interacting agents. Using a transformer model, we generate joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. For each time step, we model the positional uncertainty using compressed exponential power distributions. Notably, our method achieves strong results in the Waymo Interaction Prediction Challenge and generalizes well to the Argoverse 2 and V2X-Seq datasets. Additionally, our method provides an interface for issuing instructions. We show that standard motion forecasting training implicitly enables the model to follow instructions and adapt them to the scene context. GitHub repository: https://github.com/kit-mrt/future-motion
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
arXiv:2505.21190v2 Announce Type: replace-cross
Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation…
arXiv →
Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
arXiv:2505.21190v2 Announce Type: replace-cross Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation…
arXiv:2505.21190v2 Announce Type: replace-cross Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Time Blindness: Why Video-Language Models Can't See What Humans Can?
arXiv:2505.24867v2 Announce Type: replace-cross
Abstract: Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. Howev…
arXiv →
Time Blindness: Why Video-Language Models Can't See What Humans Can?
arXiv:2505.24867v2 Announce Type: replace-cross Abstract: Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. Howev…
arXiv:2505.24867v2 Announce Type: replace-cross Abstract: Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained in data sets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code has been made available on our project website: https://timeblindness.github.io/.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text
arXiv:2506.02494v2 Announce Type: replace-cross
Abstract: Evaluation is important for multimodal generation tasks, while traditional multimodal evaluation metrics suffer from several limitations. Wit…
arXiv →
MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text
arXiv:2506.02494v2 Announce Type: replace-cross Abstract: Evaluation is important for multimodal generation tasks, while traditional multimodal evaluation metrics suffer from several limitations. Wit…
arXiv:2506.02494v2 Announce Type: replace-cross Abstract: Evaluation is important for multimodal generation tasks, while traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing researches often simply collect large-scale evaluation data for training, while overlooking the quality of evaluation data. What's more, current proposed evaluation models often struggle to achieve consistently strong performance across both image-to-text (I2T) and text-to-image (T2I) tasks. In this paper, through rigorous quality control strategies, we construct a comprehensive multimodal evaluation dataset, Minos-57K, with evaluation samples across 15 datasets, for developing the multimodal evaluation model Minos with SFT and preference alignment training strategies. Notably, despite using less than half the scale of the training data of prior work, our model achieves state-of-the-art evaluation performance across 16 out-of-domain datasets covering both I2T and T2I tasks among all open-source multimodal evaluation models and remain competitive with closed-source models. Extensive experiments demonstrate the importance of leveraging quality control process, jointly training on evaluation data from both I2T and T2I generation tasks and further preference alignment.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Efficient Traffic Forecasting on Large-Scale Road Network by Regularized Adaptive Graph Convolution
arXiv:2506.07179v2 Announce Type: replace-cross
Abstract: Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. To mod…
arXiv →
Efficient Traffic Forecasting on Large-Scale Road Network by Regularized Adaptive Graph Convolution
arXiv:2506.07179v2 Announce Type: replace-cross Abstract: Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. To mod…
arXiv:2506.07179v2 Announce Type: replace-cross Abstract: Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. To model the complex spatial-temporal dependencies in traffic data, Spatial-Temporal Graph Convolutional Networks (STGCNs) have been widely employed, achieving advanced performance. However, when applied to large-scale road networks, the quadratic computational complexity of traditional graph convolution operations severely limits their scalability. Several methods attempt to address this issue through approximation, compression, or spatial partitioning. Nevertheless, these methods often either fail to achieve sufficient computational efficiency or compromise prediction accuracy. To address these challenges, we propose a Regularized Adaptive Graph Convolution (RAGC) model. First, to ensure scalability on large road networks, we develop the Efficient Cosine Operator (ECO), which performs graph convolution based on the cosine similarity of node embeddings with linear time complexity. Second, we introduce a regularized adaptive graph convolution framework that combines Stochastic Shared Embedding (SSE) and adaptive graph convolution through a residual difference mechanism. This design enables the model to learn high-quality node embeddings, thereby improving prediction accuracy while maintaining computational efficiency. Extensive experiments on four large-scale real-world traffic datasets show that RAGC consistently outperforms state-of-the-art methods in terms of prediction accuracy and exhibits competitive computational efficiency. The code is available at: https://github.com/wkq-wukaiqi/RAGC.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
PBiLoss: Popularity-Aware Regularization to Improve Fairness in Graph-Based Recommender Systems
arXiv:2507.19067v2 Announce Type: replace-cross
Abstract: Recommender systems based on graph neural networks (GNNs) have been proved to perform well on user-item interactions. However, they commonly …
arXiv →
PBiLoss: Popularity-Aware Regularization to Improve Fairness in Graph-Based Recommender Systems
arXiv:2507.19067v2 Announce Type: replace-cross Abstract: Recommender systems based on graph neural networks (GNNs) have been proved to perform well on user-item interactions. However, they commonly …
arXiv:2507.19067v2 Announce Type: replace-cross Abstract: Recommender systems based on graph neural networks (GNNs) have been proved to perform well on user-item interactions. However, they commonly suffer from popularity bias -- the tendency to over-recommend popular items -- resulting in less personalization, unfair exposure and lower recommendation diversity. Current solutions address popularity bias through different stages of the recommendation pipeline, including pre-processing methods that may distort data distributions, in-processing approaches which can complicate optimization, and post-processing techniques that are limited in correcting bias already embedded in the learned representations. To address these limitations, we propose PBiLoss, a novel regularization-based loss function designed to explicitly counteract popularity bias in graph-based recommenders. PBiLoss augments traditional training objectives by penalizing the model's inclination toward popular items, thereby encouraging the recommendation of less popular but potentially more personalized content. We introduce two sampling strategies -- Popular Positive (PopPos) and Popular Negative (PopNeg) -- and explore two methods to distinguish popular items -- one based on a fixed popularity threshold and another without any threshold -- making the approach flexible and adaptive. Our proposed method is model-agnostic and can be seamlessly integrated into state-of-the-art graph-based frameworks such as LightGCN and its variants. Extensive experiments carried out on datasets including Epinions, iFashion, and MovieLens highlight the advantages of the PBiLoss for enhancing fairness in recommendations, decreasing PRU and PRI by up to 10\%, compared to other baseline models, while maintaining accuracy and other standard metrics intact in the process.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
arXiv:2508.04325v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However,…
arXiv →
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
arXiv:2508.04325v2 Announce Type: replace-cross Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However,…
arXiv:2508.04325v2 Announce Type: replace-cross Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Vertex Features for Neural Global Illumination
arXiv:2508.07852v2 Announce Type: replace-cross
Abstract: Recent research on learnable neural representations has been widely adopted in the field of 3D scene reconstruction and neural rendering appl…
arXiv →
Vertex Features for Neural Global Illumination
arXiv:2508.07852v2 Announce Type: replace-cross Abstract: Recent research on learnable neural representations has been widely adopted in the field of 3D scene reconstruction and neural rendering appl…
arXiv:2508.07852v2 Announce Type: replace-cross Abstract: Recent research on learnable neural representations has been widely adopted in the field of 3D scene reconstruction and neural rendering applications. However, traditional feature grid representations often suffer from substantial memory footprint, posing a significant bottleneck for modern parallel computing hardware. In this paper, we present neural vertex features, a generalized formulation of learnable representation for neural rendering tasks involving explicit mesh surfaces. Instead of uniformly distributing neural features throughout 3D space, our method stores learnable features directly at mesh vertices, leveraging the underlying geometry as a compact and structured representation for neural processing. This not only optimizes memory efficiency, but also improves feature representation by aligning compactly with the surface using task-specific geometric priors. We validate our neural representation across diverse neural rendering tasks, with a specific emphasis on neural radiosity. Experimental results demonstrate that our method reduces memory consumption to only one-fifth (or even less) of grid-based representations, while maintaining comparable rendering quality and lowering inference overhead.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
arXiv:2508.09547v2 Announce Type: replace-cross
Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to generate contextually coherent naviga…
arXiv →
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
arXiv:2508.09547v2 Announce Type: replace-cross Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to generate contextually coherent naviga…
arXiv:2508.09547v2 Announce Type: replace-cross Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to generate contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike prior work relying on structured inputs, such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, improving adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) navigation visualization, predicting intermediate visual states bridging the initial and goal views; and (2) instruction generation, synthesizing coherent instructions grounded in observed and anticipated visuals. Both subtasks are integrated within an autoregressive multimodal LLM trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human navigation cognition. To comprehensively evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant performance improvements over state-of-the-art methods in BLEU-4 and CIDEr scores along with robust cross-domain generalization.
▶
arXiv cs.AI
30.04.2026
Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
arXiv:2508.12672v4 Announce Type: replace-cross
Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios w…
arXiv →
Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
arXiv:2508.12672v4 Announce Type: replace-cross Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios w…
arXiv:2508.12672v4 Announce Type: replace-cross Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
arXiv:2508.16131v2 Announce Type: replace-cross
Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing…
arXiv →
The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
arXiv:2508.16131v2 Announce Type: replace-cross Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing…
arXiv:2508.16131v2 Announce Type: replace-cross Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets using various LLMs, and a sample of 2254 files from 881 GitHub projects. We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Shell appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM; under a fixed model, relative language-level rankings are moderately stable across evaluation corpora. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects based on how language, model choice, and code characteristics impact model confidence.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Vibe Check: Understanding the Effects of LLM-Based Conversational Agents' Personality and Alignment on User Perceptions in Goal-Oriented Tasks
arXiv:2509.09870v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such de…
arXiv →
Vibe Check: Understanding the Effects of LLM-Based Conversational Agents' Personality and Alignment on User Perceptions in Goal-Oriented Tasks
arXiv:2509.09870v2 Announce Type: replace-cross Abstract: Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such de…
arXiv:2509.09870v2 Announce Type: replace-cross Abstract: Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such designs shape user perceptions. This study investigates how personality expression levels and user-agent personality alignment influence perceptions in goal-oriented tasks. In a between-subjects experiment (N=150), participants completed travel planning with CAs exhibiting low, medium, or high expression across the Big Five traits, controlled via our novel Trait Modulation Keys framework. Results revealed an inverted-U relationship: medium expression produced the most positive evaluations across Intelligence, Enjoyment, Anthropomorphism, Intention to Adopt, Trust, and Likeability, significantly outperforming both extremes. Personality alignment further enhanced outcomes, with Extraversion and Emotional Stability emerging as the most influential traits. Cluster analysis identified three distinct compatibility profiles, with "Well-Aligned" users reporting substantially positive perceptions. These findings demonstrate that personality expression and strategic trait alignment constitute optimal design targets for CA personality, offering design implications as LLM-based CAs become increasingly prevalent.
▶
arXiv cs.AI
30.04.2026
Hybrid Diffusion for Simultaneous Symbolic and Continuous Planning
arXiv:2509.21983v2 Announce Type: replace-cross
Abstract: Constructing robots to accomplish long-horizon tasks is a long-standing challenge within artificial intelligence. Approaches using generative…
arXiv →
Hybrid Diffusion for Simultaneous Symbolic and Continuous Planning
arXiv:2509.21983v2 Announce Type: replace-cross Abstract: Constructing robots to accomplish long-horizon tasks is a long-standing challenge within artificial intelligence. Approaches using generative…
arXiv:2509.21983v2 Announce Type: replace-cross Abstract: Constructing robots to accomplish long-horizon tasks is a long-standing challenge within artificial intelligence. Approaches using generative methods, particularly Diffusion Models, have gained attention due to their ability to model continuous robotic trajectories for planning and control. However, we show that these models struggle with long-horizon tasks that involve complex decision-making and, in general, are prone to confusing different modes of behavior, leading to failure. To remedy this, we propose to augment continuous trajectory generation by simultaneously generating a high-level symbolic plan. We show that this requires a novel mix of discrete variable diffusion and continuous diffusion, which dramatically outperforms the baselines. In addition, we illustrate how this hybrid diffusion process enables flexible trajectory synthesis, allowing us to condition synthesized actions on partial and complete symbolic conditions.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
arXiv:2509.23410v4 Announce Type: replace-cross
Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an…
arXiv →
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
arXiv:2509.23410v4 Announce Type: replace-cross Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an…
arXiv:2509.23410v4 Announce Type: replace-cross Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 13B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Auto-ARGUE: LLM-Based Report Generation Evaluation
arXiv:2509.26184v5 Announce Type: replace-cross
Abstract: Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation to…
arXiv →
Auto-ARGUE: LLM-Based Report Generation Evaluation
arXiv:2509.26184v5 Announce Type: replace-cross Abstract: Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation to…
arXiv:2509.26184v5 Announce Type: replace-cross Abstract: Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track and on two tasks from the TREC 2024 RAG track, showing good system-level correlations with human judgments. Additionally, we release ARGUE-Viz, a web app for visualization and fine-grained analysis of Auto-ARGUE judgments and scores.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Emergent Coordination in Multi-Agent Language Models
arXiv:2510.05174v4 Announce Type: replace-cross
Abstract: When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We int…
arXiv →
Emergent Coordination in Multi-Agent Language Models
arXiv:2510.05174v4 Announce Type: replace-cross Abstract: When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We int…
arXiv:2510.05174v4 Announce Type: replace-cross Abstract: When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test -- in a purely data-driven way -- whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent LLM systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and minimal group-level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to ``think about what other agents might do'' shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent LLM systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
arXiv:2510.08049v3 Announce Type: replace-cross
Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward …
arXiv →
A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
arXiv:2510.08049v3 Announce Type: replace-cross Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward …
arXiv:2510.08049v3 Announce Type: replace-cross Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
arXiv:2510.10150v4 Announce Type: replace-cross
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) serves as a cornerstone technique for enhancing the reasoning capabilities of Large Lan…
arXiv →
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
arXiv:2510.10150v4 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) serves as a cornerstone technique for enhancing the reasoning capabilities of Large Lan…
arXiv:2510.10150v4 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) serves as a cornerstone technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, its training is often plagued by \emph{entropy collapse}, a rapid decline in policy entropy that limits exploration and undermines training effectiveness. While recent works attempt to mitigate this issue via several heuristic entropy interventions, the underlying mechanisms remain poorly understood. In this work, we conduct comprehensive theoretical and empirical analyses of entropy dynamics in RLVR, offering two main insights: (1) We derive a tight analytical approximation for token-level entropy change at each update step, revealing four governing factors and providing a unified theoretical framework to explain how existing methods influence entropy; (2) We reveal a fundamental limitation of recent approaches: they rely on heuristic adjustments to one or two of these factors, leaving other relevant factors unconsidered, thus inherently limiting their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweights tokens based on theoretically-estimated entropy variations. Extensive experiments across six mathematical reasoning and three coding benchmarks demonstrate that STEER effectively mitigates entropy collapse and consistently outperforms state-of-the-art baselines.
▶
arXiv cs.AI
30.04.2026
FedPF: Accurate Target Privacy Preserving Federated Learning Balancing Fairness and Utility
arXiv:2510.26841v2 Announce Type: replace-cross
Abstract: Federated Learning (FL) enables collaborative model training without data sharing, yet participants face a fundamental challenge, e.g., simul…
arXiv →
FedPF: Accurate Target Privacy Preserving Federated Learning Balancing Fairness and Utility
arXiv:2510.26841v2 Announce Type: replace-cross Abstract: Federated Learning (FL) enables collaborative model training without data sharing, yet participants face a fundamental challenge, e.g., simul…
arXiv:2510.26841v2 Announce Type: replace-cross Abstract: Federated Learning (FL) enables collaborative model training without data sharing, yet participants face a fundamental challenge, e.g., simultaneously ensuring fairness across demographic groups while protecting sensitive client data. We introduce a differentially private fair FL algorithm (FedPF) that transforms this multi-objective optimization into a zero-sum game where fairness and privacy constraints compete against model utility. Our theoretical analysis reveals an inverse relationship: privacy mechanisms that protect sensitive attributes can reduce the statistical power available for detecting and correcting demographic biases under finite samples in federated settings. We further show that our theoretical bounds are consistent with a non-monotonic fairness-utility relationship, which is empirically validated by experiments where moderate fairness constraints improve generalization before excessive enforcement degrades performance. Compared with mainstream algorithms, even under strict privacy constraints, FedPF still maintains the lowest discrimination level among all tested algorithms while retaining high utility. Experimental validation demonstrates up to 42.9 % discrimination reduction across three datasets while maintaining competitive accuracy, but more importantly, reveals that achieving strong privacy and fairness simultaneously requires carefully balanced tradeoffs rather than optimizing either objective in isolation. Furthermore, hardware-level simulations demonstrate that FedPF maintains a low computational footprint, making it suitable for resource-constrained edge devices. The source code for our proposed algorithm is publicly accessible at https://github.com/szpsunkk/FedPF.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
arXiv:2511.02399v2 Announce Type: replace-cross
Abstract: Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirem…
arXiv →
EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
arXiv:2511.02399v2 Announce Type: replace-cross Abstract: Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirem…
arXiv:2511.02399v2 Announce Type: replace-cross Abstract: Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Bridging Visual and Wireless Sensing via a Unified Radiation Field for 3D Radio Map Construction
arXiv:2601.19216v2 Announce Type: replace-cross
Abstract: The emerging applications of next-generation wireless networks demand high-fidelity environmental intelligence. 3D radio maps bridge physical…
arXiv →
Bridging Visual and Wireless Sensing via a Unified Radiation Field for 3D Radio Map Construction
arXiv:2601.19216v2 Announce Type: replace-cross Abstract: The emerging applications of next-generation wireless networks demand high-fidelity environmental intelligence. 3D radio maps bridge physical…
arXiv:2601.19216v2 Announce Type: replace-cross Abstract: The emerging applications of next-generation wireless networks demand high-fidelity environmental intelligence. 3D radio maps bridge physical environments and electromagnetic propagation for spectrum planning and environment-aware sensing. However, most existing methods treat visual and wireless data as independent modalities and fail to leverage shared electromagnetic propagation principles. To bridge this gap, we propose URF-GS, a unified radio-optical radiation field framework based on 3D Gaussian splatting and inverse rendering for 3D radio map construction. By fusing cross-modal observations, our method recovers scene geometry and material properties to predict radio signals under arbitrary transceiver configurations without retraining. Experiments demonstrate up to a 24.7% improvement in spatial spectrum accuracy and a 10x increase in sample efficiency compared with NeRF-based methods. We further showcase URF-GS in Wi-Fi AP deployment and robot path planning tasks. This unified visual-wireless representation supports holistic radiation field modeling for future wireless communication systems.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Stress Testing Factual Consistency Metrics for Long-Document Summarization
arXiv:2511.07689v2 Announce Type: replace-cross
Abstract: Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where …
arXiv →
Stress Testing Factual Consistency Metrics for Long-Document Summarization
arXiv:2511.07689v2 Announce Type: replace-cross Abstract: Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where …
arXiv:2511.07689v2 Announce Type: replace-cross Abstract: Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
arXiv:2511.20714v2 Announce Type: replace-cross
Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realisti…
arXiv →
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
arXiv:2511.20714v2 Announce Type: replace-cross Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realisti…
arXiv:2511.20714v2 Announce Type: replace-cross Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in block-applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Value-Guided Iterative Refinement and the DIQ-H Benchmark for Evaluating VLM Robustness
arXiv:2512.03992v2 Announce Type: replace-cross
Abstract: Vision-Language Models (VLMs) are essential for embodied AI and safety-critical applications, such as robotics and autonomous systems. Howeve…
arXiv →
Value-Guided Iterative Refinement and the DIQ-H Benchmark for Evaluating VLM Robustness
arXiv:2512.03992v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) are essential for embodied AI and safety-critical applications, such as robotics and autonomous systems. Howeve…
arXiv:2512.03992v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) are essential for embodied AI and safety-critical applications, such as robotics and autonomous systems. However, existing benchmarks primarily focus on static or curated visual inputs, neglecting the challenges posed by adversarial conditions, value misalignment, and error propagation in continuous deployment. Current benchmarks either overlook the impact of real-world perturbations, or fail to account for the cumulative effect of inconsistent reasoning over time. To address these gaps, we introduce the Degraded Image Quality Leading to Hallucinations (DIQ-H) benchmark, the first to evaluate VLMs under adversarial visual conditions in continuous sequences. DIQ-H simulates real-world stressors including motion blur, sensor noise, and compression artifacts, and measures how these corruptions lead to persistent errors and misaligned outputs across time. The benchmark explicitly models error propagation and its long-term value consistency. To enhance scalability and reduce costs for safety-critical evaluation, we propose the Value-Guided Iterative Refinement (VIR) framework, which automates the generation of high-quality, ethically aligned ground truth annotations. VGIR leverages lightweight VLMs to detect and refine value misalignment, improving accuracy from 72.2% to 83.3%, representing a 15.3% relative improvement. The DIQ-H benchmark and VGIR framework provide a robust platform for embodied AI safety assessment, revealing vulnerabilities in error recovery, ethical consistency, and temporal value alignment.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement
arXiv:2512.08982v2 Announce Type: replace-cross
Abstract: Retinex-based low-light image enhancement benefits from separating reflectance and illumination, yet recent generative approaches often rely …
arXiv →
Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement
arXiv:2512.08982v2 Announce Type: replace-cross Abstract: Retinex-based low-light image enhancement benefits from separating reflectance and illumination, yet recent generative approaches often rely …
arXiv:2512.08982v2 Announce Type: replace-cross Abstract: Retinex-based low-light image enhancement benefits from separating reflectance and illumination, yet recent generative approaches often rely on iterative sampling and are difficult to deploy under strict latency budgets. Consistency models offer a natural route to one-step restoration, but direct adaptation to Retinex-factorized enhancement is unstable: one-step inference is evaluated at the high-noise endpoint, whereas standard training schedules provide little supervision there, and temporal self-consistency alone does not determine the correct conditional target. We propose Consist-Retinex, which first uses a Retinex Transformer Decomposition Network (TDN) to obtain paired reflectance and illumination maps, then trains two conditional consistency models with a Retinex-aware dual objective and adaptive noise-emphasized fixed-point sampling. The dual objective combines trajectory consistency with paired ground-truth component alignment, while the sampling rule concentrates supervision near the inference endpoint without discarding full-range noise coverage. We further provide an endpoint error bound, an anchoring-propagation result, and a high-noise sample-allocation analysis that explain why endpoint supervision and temporal consistency are complementary for one-step Retinex enhancement. Experiments on paired and unpaired low-light benchmarks show that Consist-Retinex obtains the best VE-LOL-L scores among the compared methods under one-step inference and remains competitive on LOL, with substantially reduced sampling and consistency-stage training cost in the reported setup.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
q3-MuPa: Quick, Quiet, Quantitative Multi-Parametric MRI using Physics-Informed Diffusion Models
arXiv:2512.23726v2 Announce Type: replace-cross
Abstract: The 3D fast silent multi-parametric mapping sequence with zero echo time (MuPa-ZTE) is a novel quantitative MRI (qMRI) acquisition that enabl…
arXiv →
q3-MuPa: Quick, Quiet, Quantitative Multi-Parametric MRI using Physics-Informed Diffusion Models
arXiv:2512.23726v2 Announce Type: replace-cross Abstract: The 3D fast silent multi-parametric mapping sequence with zero echo time (MuPa-ZTE) is a novel quantitative MRI (qMRI) acquisition that enabl…
arXiv:2512.23726v2 Announce Type: replace-cross Abstract: The 3D fast silent multi-parametric mapping sequence with zero echo time (MuPa-ZTE) is a novel quantitative MRI (qMRI) acquisition that enables nearly silent scanning by using a 3D phyllotaxis sampling scheme. MuPa-ZTE improves patient comfort and motion robustness, and generates quantitative maps of T1, T2, and proton density using the acquired weighted image series. In this work, we propose a diffusion model-based qMRI mapping method that leverages both a deep generative model and physics-based data consistency to further improve the mapping performance. Furthermore, our method enables additional acquisition acceleration, allowing high-quality qMRI mapping from a fourfold-accelerated MuPa-ZTE scan (approximately 1 minute). Specifically, we trained a denoising diffusion probabilistic model (DDPM) to map MuPa-ZTE image series to qMRI maps, and we incorporated the MuPa-ZTE forward signal model as an explicit data consistency (DC) constraint during inference. We compared our mapping method against a baseline dictionary matching approach and a purely data-driven diffusion model. The diffusion models were trained entirely on synthetic data generated from digital brain phantoms, eliminating the need for large real-scan datasets. We evaluated on synthetic data, a NISM/ISMRM phantom, healthy volunteers, and a patient with brain metastases. The results demonstrated that our method produces 3D qMRI maps with high accuracy, reduced noise and better preservation of structural details. Notably, it generalised well to real scans despite training on synthetic data alone. The combination of the MuPa-ZTE acquisition and our physics-informed diffusion model is termed q3-MuPa, a quick, quiet, and quantitative multi-parametric mapping framework, and our findings highlight its strong clinical potential.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
arXiv:2601.03423v3 Announce Type: replace-cross
Abstract: Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new …
arXiv →
Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
arXiv:2601.03423v3 Announce Type: replace-cross Abstract: Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new …
arXiv:2601.03423v3 Announce Type: replace-cross Abstract: Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6\% over UniTE, +41.4\% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity. This technique especially benefits healthcare institutions with constrained computational capacity that cannot support iterative clinical training and want to adopt emerging general-domain model advances.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
arXiv:2601.04389v2 Announce Type: replace-cross
Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under gen…
arXiv →
Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
arXiv:2601.04389v2 Announce Type: replace-cross Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under gen…
arXiv:2601.04389v2 Announce Type: replace-cross Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", obscuring vulnerabilities toward specific populations. In this work, we expose the Selective Safety Trap: a systemic failure mode where models robustly defend specific populations while leaving underrepresented communities highly vulnerable to identical adversarial attacks. To systematically audit this phenomenon, we introduce MiJaBench, a bilingual (English-Portuguese) adversarial benchmark comprising 43,961 controlled jailbreaking prompts across 16 minority groups. By evaluating 14 state-of-the-art LLMs on MiJaBench, we curate 615,454 prompt-response pairs that compose MiJaBench-Align, revealing that safety alignment is not a uniform semantic capability but a demographic hierarchy, with defense rates fluctuating by up to 42% within the same model solely based on the target group. This disparity persists across architectures and languages and is amplified by scaling, indicating that current alignment methods learn group-specific safeguards rather than a generalized notion of harm. Through targeted direct preference optimization (DPO) on a 1B-parameter baseline, we achieve strong zero-shot safety generalizations to entirely unseen demographics and complex attack strategies. We release all datasets and scripts to provide the community with a concrete pathway toward equitable, transferable safety alignment.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
arXiv:2601.11568v2 Announce Type: replace-cross
Abstract: Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gr…
arXiv →
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
arXiv:2601.11568v2 Announce Type: replace-cross Abstract: Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gr…
arXiv:2601.11568v2 Announce Type: replace-cross Abstract: Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gradient splitting, but its static hyperparameters -- the subspace ratio ($\rho$) and update frequency ($T$) -- require costly manual tuning, limiting adaptability. We present AdaFRUGAL, which automates this process by introducing two dynamic controls: (i) a linear decay for $\rho$ to progressively reduce memory, and (ii) a loss-aware schedule for $T$ to lower computational overhead. Experiments across large-scale pre-training (English C4, Vietnamese VietVault) and fine-tuning (GLUE) demonstrate that AdaFRUGAL achieves a compelling trade-off. It maintains competitive performance against AdamW and static FRUGAL while significantly reducing both GPU memory and training time, offering a more practical, autonomous solution for resource-constrained LLM training.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning
arXiv:2601.13942v2 Announce Type: replace-cross
Abstract: Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries i…
arXiv →
Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning
arXiv:2601.13942v2 Announce Type: replace-cross Abstract: Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries i…
arXiv:2601.13942v2 Announce Type: replace-cross Abstract: Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
arXiv:2601.21459v4 Announce Type: replace-cross
Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companions…
arXiv →
HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
arXiv:2601.21459v4 Announce Type: replace-cross Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companions…
arXiv:2601.21459v4 Announce Type: replace-cross Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: lacking data with high-quality reasoning traces, and lacking reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering, and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97% gain on the Minimax Role-Play Bench. Our datasets, principles, and models are released to facilitate future research.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
arXiv:2602.03558v2 Announce Type: replace-cross
Abstract: Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering pr…
arXiv →
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
arXiv:2602.03558v2 Announce Type: replace-cross Abstract: Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering pr…
arXiv:2602.03558v2 Announce Type: replace-cross Abstract: Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Affective Flow Language Model for Emotional Support Conversation
arXiv:2602.08826v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains ch…
arXiv →
Affective Flow Language Model for Emotional Support Conversation
arXiv:2602.08826v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains ch…
arXiv:2602.08826v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging.This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experiment results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LMMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at https://github.com/chz2025/AffectiveFlow.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
arXiv:2602.15983v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that execu…
arXiv →
ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
arXiv:2602.15983v2 Announce Type: replace-cross Abstract: Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that execu…
arXiv:2602.15983v2 Announce Type: replace-cross Abstract: Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations -- a feasibility-correctness gap reaching 90 percentage points on compositional problems. We introduce ReLoop, which addresses this gap through two complementary mechanisms. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify), preventing formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation -- an external semantic signal that bypasses LLM self-review and requires no ground truth. The two mechanisms are complementary by error structure: structured generation drives the largest gains on compositional problems (+8.5pp accuracy on RetailOpt-190 with Claude Opus 4.6), while behavioral verification dominates on localized defects (+4.4pp on MAMO-ComplexLP, its largest contribution across benchmarks). Combined with diagnostic execution recovery, ReLoop reaches 100% executable code on Claude Opus 4.6 and consistently improves accuracy on chat-tuned foundation models across three benchmarks; we further identify a known limitation of narrowly-tuned SFT models, whose learned output formats are brittle to chain-of-thought prompts -- an interaction we document and analyze. We release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Provable Coordination for LLM Agents via Message Sequence Charts
arXiv:2604.17612v2 Announce Type: replace-cross
Abstract: Multi-agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type-mismat…
arXiv →
Provable Coordination for LLM Agents via Message Sequence Charts
arXiv:2604.17612v2 Announce Type: replace-cross Abstract: Multi-agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type-mismat…
arXiv:2604.17612v2 Announce Type: replace-cross Abstract: Multi-agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type-mismatched messages are often hard to detect through testing. We introduce a domain-specific language for specifying agent coordination based on message sequence charts (MSCs). The language separates message-passing structure from LLM actions, whose outputs remain unpredictable. We define the syntax and semantics of the language and present a syntax-directed projection that generates deadlock-free local agent programs from global coordination specifications. We illustrate the approach with a diagnosis consensus protocol and show how coordination properties can be established independently of LLM nondeterminism. We also describe a runtime planning extension in which an LLM dynamically generates a coordination workflow for which the same structural guarantees apply. An open-source Python implementation of our framework is available as ZipperGen.
▶
arXiv cs.AI
30.04.2026
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
arXiv:2602.21720v2 Announce Type: replace-cross
Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly re…
arXiv →
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
arXiv:2602.21720v2 Announce Type: replace-cross Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly re…
arXiv:2602.21720v2 Announce Type: replace-cross Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
CoFL: Continuous Flow Fields for Language-Conditioned Navigation
arXiv:2603.02854v2 Announce Type: replace-cross
Abstract: Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene-…
arXiv →
CoFL: Continuous Flow Fields for Language-Conditioned Navigation
arXiv:2603.02854v2 Announce Type: replace-cross Abstract: Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene-…
arXiv:2603.02854v2 Announce Type: replace-cross Abstract: Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene--instruction annotation mainly to supervise one start-conditioned rollout. To address these limitations, we present CoFL, an end-to-end policy that maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. CoFL reformulates navigation as workspace-conditioned field learning rather than start-conditioned trajectory prediction: it learns local motion vectors at arbitrary BEV locations, turning each scene--instruction annotation into dense spatial control supervision. Trajectories are generated from any start by numerical integration of the predicted field, enabling simple real-time rollout and closed-loop recovery. To enable large-scale training and evaluation, we build a dataset of over 500k BEV image--instruction pairs, each procedurally annotated with a flow field and a trajectory derived from semantic maps built on Matterport3D and ScanNet. Evaluating on strictly unseen scenes, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and trajectory generation policies in both navigation precision and safety, while maintaining real-time inference. Finally, we deploy CoFL zero-shot in real-world experiments with BEV observations across multiple layouts, maintaining feasible closed-loop control and a high success rate.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
arXiv:2603.08182v2 Announce Type: replace-cross
Abstract: Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in trai…
arXiv →
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
arXiv:2603.08182v2 Announce Type: replace-cross Abstract: Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in trai…
arXiv:2603.08182v2 Announce Type: replace-cross Abstract: Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
▶
arXiv cs.AI
30.04.2026
Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
arXiv:2603.09145v3 Announce Type: replace-cross
Abstract: Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. H…
arXiv →
Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
arXiv:2603.09145v3 Announce Type: replace-cross Abstract: Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. H…
arXiv:2603.09145v3 Announce Type: replace-cross Abstract: Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. However, such task-specific features learned from the new task may collide with the old features. From a causal perspective, spurious feature correlations are the main cause of this collision, manifesting in two scopes: (i) guided by empirical risk minimization (ERM), intra-task spurious correlations cause task-specific features to rely on shortcut features. These non-robust features are vulnerable to interference, inevitably drifting into the feature space of other tasks; (ii) inter-task spurious correlations induce semantic confusion between visually similar classes across tasks. To address this, we propose a Probability of Necessity and Sufficiency (PNS)-based regularization method to guide feature expansion in CIL. Specifically, we first extend the definition of PNS to expansion-based CIL, termed CPNS, which quantifies both the causal completeness of intra-task representations and the separability of inter-task representations. We then introduce a dual-scope counterfactual generator based on twin networks to ensure the measurement of CPNS, which simultaneously generates: (i) intra-task counterfactual features to minimize intra-task PNS risk and ensure causal completeness of task-specific features, and (ii) inter-task interfering features to minimize inter-task PNS risk, ensuring the separability of inter-task representations. Theoretical analyses confirm its reliability. The regularization is a plug-and-play method for expansion-based CIL to mitigate feature collision. Extensive experiments demonstrate the effectiveness of the proposed method.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Rethinking the Harmonic Loss via Non-Euclidean Distance Layers
arXiv:2603.10225v3 Announce Type: replace-cross
Abstract: Cross-entropy loss has long been the standard choice for training deep neural networks, yet it suffers from interpretability limitations, unb…
arXiv →
Rethinking the Harmonic Loss via Non-Euclidean Distance Layers
arXiv:2603.10225v3 Announce Type: replace-cross Abstract: Cross-entropy loss has long been the standard choice for training deep neural networks, yet it suffers from interpretability limitations, unb…
arXiv:2603.10225v3 Announce Type: replace-cross Abstract: Cross-entropy loss has long been the standard choice for training deep neural networks, yet it suffers from interpretability limitations, unbounded weight growth, and inefficiencies that can contribute to costly training dynamics. The harmonic loss is a distance-based alternative grounded in Euclidean geometry that improves interpretability and mitigates phenomena such as grokking, or delayed generalization on the test set. However, the study of harmonic loss remains narrow: only Euclidean distance is explored, and no systematic evaluation of computational efficiency or sustainability was conducted. We extend harmonic loss by systematically investigating a broad spectrum of distance metrics as replacements for the Euclidean distance. We comprehensively evaluate distance-tailored harmonic losses on both vision backbones and large language models. Our analysis is framed around a three-way evaluation of model performance, interpretability, and sustainability. On vision tasks, cosine distances provide the most favorable trade-off, consistently improving accuracy while lowering carbon emissions, whereas Bray-Curtis and Mahalanobis further enhance interpretability at varying efficiency costs. On language models, cosine-based harmonic losses improve gradient and learning stability, strengthen representation structure, and reduce emissions relative to cross-entropy and Euclidean heads. Our code is available at: https://anonymous.4open.science/r/rethinking-harmonic-loss-5BAB/.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
SciMDR: Advancing Scientific Multimodal Document Reasoning
arXiv:2603.12249v2 Announce Type: replace-cross
Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, fait…
arXiv →
SciMDR: Advancing Scientific Multimodal Document Reasoning
arXiv:2603.12249v2 Announce Type: replace-cross Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, fait…
arXiv:2603.12249v2 Announce Type: replace-cross Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
▶
arXiv cs.AI
30.04.2026
Assessing the Utility of Volumetric Motion Fields for Radar-based Precipitation Nowcasting with Physics-informed Deep Learning
arXiv:2603.13589v2 Announce Type: replace-cross
Abstract: Estimating motion from spatiotemporal geoscientific data is a fundamental component of many environmental modeling and forecasting tasks. In …
arXiv →
Assessing the Utility of Volumetric Motion Fields for Radar-based Precipitation Nowcasting with Physics-informed Deep Learning
arXiv:2603.13589v2 Announce Type: replace-cross Abstract: Estimating motion from spatiotemporal geoscientific data is a fundamental component of many environmental modeling and forecasting tasks. In …
arXiv:2603.13589v2 Announce Type: replace-cross Abstract: Estimating motion from spatiotemporal geoscientific data is a fundamental component of many environmental modeling and forecasting tasks. In this work, we propose a physics-informed deep learning framework for estimating altitude-wise motion fields directly from volumetric radar reflectivity data. The model utilizes a fully differentiable semi-Lagrangian extrapolation operator to process three-dimensional inputs as independent horizontal slice sequences, enabling efficient inference of horizontal motion across multiple altitude levels. Using a multi-year radar dataset from Central Europe, we evaluate the impact of altitude-wise motion estimation on extrapolation-based precipitation forecasting and conduct a systematic dataset-scale analysis of inter-altitude motion consistency. The results show that the estimated motion fields exhibit strong vertical coherence, with high correlation across altitude levels, which results in limited improvement over traditional two-dimensional approach in this setting. The proposed framework provides a general tool for efficiently analyzing motion structure in volumetric geospatial data. The findings indicate that, in regions dominated by vertically coherent precipitation systems, the added complexity of volumetric motion modeling may offer limited benefit, warranting careful consideration in the design of efficient spatiotemporal advection models.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
arXiv:2603.14845v3 Announce Type: replace-cross
Abstract: Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challengin…
arXiv →
Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
arXiv:2603.14845v3 Announce Type: replace-cross Abstract: Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challengin…
arXiv:2603.14845v3 Announce Type: replace-cross Abstract: Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine-scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan-solar, a two-stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high-resolution geostationary satellite imagery to produce 24-hour irradiance forecasts at kilometer scale. Its decoupled two-stage design first forecasts day-night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine-scale cloud structures from satellite and large-scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan-solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud-induced transients. An operational deployment of Baguan-solar has supported solar power forecasting in an eastern province in China, since July 2025. Our code is accessible at https://github.com/DAMO-DI-ML/Baguan-solar.git.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
arXiv:2603.19470v2 Announce Type: replace-cross
Abstract: Off-policy problems such as policy staleness and training--inference mismatch have become a major bottleneck for training stability and furth…
arXiv →
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
arXiv:2603.19470v2 Announce Type: replace-cross Abstract: Off-policy problems such as policy staleness and training--inference mismatch have become a major bottleneck for training stability and furth…
arXiv:2603.19470v2 Announce Type: replace-cross Abstract: Off-policy problems such as policy staleness and training--inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. The distribution gap between the inference and updated policies grows because of the techniques to enhance inference efficiency, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates and uses the resulting perturbed policy as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover inference-time mismatch noise. Hence, the flattened distribution can naturally tighten the gap between the updated and inference policies and reduce the tail of importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoids blow-up in the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Unilateral Relationship Revision Power in Human-AI Companion Interaction
arXiv:2603.23315v5 Announce Type: replace-cross
Abstract: When providers update AI companions, users report grief, betrayal, and loss. A growing literature asks whether the norms governing personal r…
arXiv →
Unilateral Relationship Revision Power in Human-AI Companion Interaction
arXiv:2603.23315v5 Announce Type: replace-cross Abstract: When providers update AI companions, users report grief, betrayal, and loss. A growing literature asks whether the norms governing personal r…
arXiv:2603.23315v5 Announce Type: replace-cross Abstract: When providers update AI companions, users report grief, betrayal, and loss. A growing literature asks whether the norms governing personal relationships extend to these interactions. So what, if anything, is morally significant about them? I argue that this debate has missed a prior structural question: who controls the relationship, and from where? Human-AI companion interaction is a triadic structure in which the provider exercises constitutive control over the AI. I identify three structural conditions of normatively robust dyads that the norms characteristic of personal relationships presuppose and show that AI companion interactions fail all three. This reveals what I call Unilateral Relationship Revision Power (URRP): the provider can rewrite how the AI interacts from a position where these revisions are not answerable within that interaction. I argue that URRP is pro tanto wrong in interactions designed to cultivate the norms of personal relationships, because the design produces expectations that the structure cannot sustain. URRP has three implications: i) normative hollowing, under which the interaction elicits commitment but no agent inside it bears the resulting obligations; ii) displaced vulnerability, under which the user's emotional exposure is governed by an agent not answerable to her within the interaction; and iii) structural irreconcilability, under which the interaction cultivates norms of reconciliation but no agent inside it can acknowledge or answer for the revision. I propose design principles that partially substitute for the internal constraints the triadic structure removes. A central and underexplored problem in relational AI ethics is therefore the structural arrangement of power over the human-AI interaction itself.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Woosh: A Sound Effects Foundation Model
arXiv:2604.01929v3 Announce Type: replace-cross
Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines…
arXiv →
Woosh: A Sound Effects Foundation Model
arXiv:2604.01929v3 Announce Type: replace-cross Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines…
arXiv:2604.01929v3 Announce Type: replace-cross Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Generative models on phase space
arXiv:2604.02415v2 Announce Type: replace-cross
Abstract: Deep generative models such as diffusion and flow matching are powerful machine learning tools capable of learning and sampling from high-dim…
arXiv →
Generative models on phase space
arXiv:2604.02415v2 Announce Type: replace-cross Abstract: Deep generative models such as diffusion and flow matching are powerful machine learning tools capable of learning and sampling from high-dim…
arXiv:2604.02415v2 Announce Type: replace-cross Abstract: Deep generative models such as diffusion and flow matching are powerful machine learning tools capable of learning and sampling from high-dimensional distributions. They are particularly useful when the training data appears to be concentrated on a submanifold of the data embedding space. For high-energy physics data, consisting of collections of relativistic energy-momentum 4-vectors, this submanifold can enforce extremely strong physically-motivated priors, such as energy and momentum conservation. If these constraints are learned only approximately, rather than exactly, this can inhibit the interpretability and reliability of such generative models. To remedy this deficiency, we introduce generative models which are, by construction, confined at every step of their sampling trajectory to the manifold of massless N-particle Lorentz-invariant phase space in the center-of-momentum frame. In the case of diffusion models, the "pure noise" forward process endpoint corresponds to the uniform distribution on phase space, which provides a clear starting point from which to identify how correlations among the particles emerge during the reverse (de-noising) process. We demonstrate that our models are able to learn both few-particle and many-particle distributions with various singularity structures, paving the way for future interpretability studies using generative models trained on simulated jet data.
▶
arXiv cs.AI
30.04.2026
Why Attend to Everything? Focus is the Key
arXiv:2604.03260v2 Announce Type: replace-cross
Abstract: Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into …
arXiv →
Why Attend to Everything? Focus is the Key
arXiv:2604.03260v2 Announce Type: replace-cross Abstract: Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into …
arXiv:2604.03260v2 Announce Type: replace-cross Abstract: Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into pretrained models, they often degrade perplexity, downstream accuracy, or both. We introduce Focus, a method that learns which token pairs matter. Focus adds a small set of learnable centroids--as few as 148K parameters per layer--that act as gates: only token pairs belonging to the same centroid group attend to each other over long ranges. Focus is composable: it can be added to any pretrained model by training only the centroids while keeping all original weights frozen. Experiments show that composing Focus onto pretrained models yields zero degradation on downstream benchmarks across model sizes from 124M to 70B parameters and five attention architectures. Surprisingly, sparse Focus attention outperforms full attention at 124M scale (30.3 vs. 31.4 perplexity) and matches full attention when trained from scratch at 7B scale (13.82 vs. 13.89). Focus is also fast: top-k group membership gives a 2x speedup with better quality than the original pretrained model. Using our FlashAttention decomposition, Focus achieves an 8.6x speedup at 1M tokens without custom kernels.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
DC-Ada: Reward-Only Decentralized Sensor Adaptation for Heterogeneous Multi-Robot Teams
arXiv:2604.03905v2 Announce Type: replace-cross
Abstract: Heterogeneity is a defining feature of deployed multi-robot teams: platforms often differ in sensing modalities, ranges, fields of view, and …
arXiv →
DC-Ada: Reward-Only Decentralized Sensor Adaptation for Heterogeneous Multi-Robot Teams
arXiv:2604.03905v2 Announce Type: replace-cross Abstract: Heterogeneity is a defining feature of deployed multi-robot teams: platforms often differ in sensing modalities, ranges, fields of view, and …
arXiv:2604.03905v2 Announce Type: replace-cross Abstract: Heterogeneity is a defining feature of deployed multi-robot teams: platforms often differ in sensing modalities, ranges, fields of view, and failure patterns. Controllers trained under nominal sensing can degrade sharply when deployed on robots with missing or mismatched sensors, even when the task and action interface are unchanged. We present DC-Ada, a reward-only decentralized adaptation method that keeps a pretrained shared policy frozen and instead adapts compact per-robot observation transforms to map heterogeneous sensing into a fixed inference interface. DC-Ada is gradient-free and communication-minimal: it uses budgeted accept/reject random search with short common-random-number rollouts under a strict step budget. We evaluate DC-Ada against four baselines in a deterministic 2D multi-robot simulator covering warehouse logistics, search and rescue, and collaborative mapping, across four heterogeneity regimes (H0--H3) and five seeds with a matched budget of $200{,}000$ joint environment steps per run. Results show that heterogeneity can substantially degrade a frozen shared policy and that no single mitigation dominates across all tasks and metrics. Observation normalization is strongest for reward robustness in warehouse logistics and competitive in search and rescue, while the frozen shared policy is strongest for reward in collaborative mapping. DC-Ada offers a useful complementary operating point: it improves completion most clearly in severe coverage-based mapping while requiring only scalar team returns and no policy fine-tuning or persistent communication. These results position DC-Ada as a practical deploy-time adaptation method for heterogeneous teams.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Retrieval-Augmented LLMs for Evidence Localization in Clinical Trial Recruitment from Longitudinal EHR Narratives
arXiv:2604.05190v2 Announce Type: replace-cross
Abstract: Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures.…
arXiv →
Retrieval-Augmented LLMs for Evidence Localization in Clinical Trial Recruitment from Longitudinal EHR Narratives
arXiv:2604.05190v2 Announce Type: replace-cross Abstract: Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures.…
arXiv:2604.05190v2 Announce Type: replace-cross Abstract: Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the "Lost in the Middle" issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Self-Calibrating Framework for Analog Circuit Sizing Using LLM-Derived Analytical Equations
arXiv:2604.07387v2 Announce Type: replace-cross
Abstract: We present a design automation framework for analog circuit sizing that produces calibrated, topology-specific analytical equations from raw …
arXiv →
A Self-Calibrating Framework for Analog Circuit Sizing Using LLM-Derived Analytical Equations
arXiv:2604.07387v2 Announce Type: replace-cross Abstract: We present a design automation framework for analog circuit sizing that produces calibrated, topology-specific analytical equations from raw …
arXiv:2604.07387v2 Announce Type: replace-cross Abstract: We present a design automation framework for analog circuit sizing that produces calibrated, topology-specific analytical equations from raw circuit netlists. A large language model (LLM) derives a complete Python sizing function in which each device dimension is traceable to a specific design rationale - a form of interpretable output absent from existing optimization-based and LLM-based sizing methods. A deterministic calibration loop extracts process-dependent parameters from a single DC operating point simulation, while a prediction-error feedback mechanism compensates for analytical inaccuracies. We validate the framework on circuits ranging from 8 to 30 transistors - spanning two-stage Miller-compensated, current-mirror, folded cascode, nested Miller-compensated, and complementary class-AB output topologies - across three process nodes (40 nm, 90 nm, 180 nm). On matched-specification benchmarks, including the class-AB opamp case, the framework converges in 2-7 simulations. Despite large initial prediction errors, convergence depends on the measurement-feedback architecture, not prediction accuracy. The one-shot calibration automatically captures process-dependent variations, enabling cross-node portability without modification, retraining, or per-process characterization.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support
arXiv:2604.08618v2 Announce Type: replace-cross
Abstract: Deploying LLM-powered agents in enterprise scenarios such as cloud technical support demands high-quality, domain-specific skills. However, e…
arXiv →
SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support
arXiv:2604.08618v2 Announce Type: replace-cross Abstract: Deploying LLM-powered agents in enterprise scenarios such as cloud technical support demands high-quality, domain-specific skills. However, e…
arXiv:2604.08618v2 Announce Type: replace-cross Abstract: Deploying LLM-powered agents in enterprise scenarios such as cloud technical support demands high-quality, domain-specific skills. However, existing skill creators lack domain grounding, producing skills poorly aligned with real-world task requirements. Moreover, once deployed, there is no systematic mechanism to trace execution failures back to skill deficiencies and drive targeted refinements, leaving skill quality stagnant despite accumulating operational evidence. We introduce SkillForge, a self-evolving framework that closes an end-to-end creation-evaluation-refinement loop. To produce well-aligned initial skills, a Domain-Contextualized Skill Creator grounds skill synthesis in knowledge bases and historical support tickets. To enable continuous self-optimization, a three-stage pipeline -- Failure Analyzer, Skill Diagnostician, and Skill Optimizer -- automatically diagnoses execution failures in batch, pinpoints the underlying skill deficiencies, and rewrites the skill to eliminate them. This cycle runs iteratively, allowing skills to self-improve with every round of deployment feedback. Evaluated on five real-world cloud support scenarios spanning 1,883 tickets and 3,737 tasks, experiments show that: (1) the Domain-Contextualized Skill Creator produces substantially better initial skills than the generic skill creator, as measured by consistency with expert-authored reference responses from historical tickets; and (2) the self-evolution loop progressively improves skill quality from diverse starting points (including expert-authored, domain-created, and generic skills) across successive rounds, demonstrating that automated evolution can surpass manually curated expert knowledge.
▶
arXiv cs.AI
30.04.2026
Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach
arXiv:2604.12807v2 Announce Type: replace-cross
Abstract: Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging s…
arXiv →
Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach
arXiv:2604.12807v2 Announce Type: replace-cross Abstract: Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging s…
arXiv:2604.12807v2 Announce Type: replace-cross Abstract: Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models
arXiv:2604.13127v2 Announce Type: replace-cross
Abstract: The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, r…
arXiv →
Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models
arXiv:2604.13127v2 Announce Type: replace-cross Abstract: The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, r…
arXiv:2604.13127v2 Announce Type: replace-cross Abstract: The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, regulatory compliance, and adaptive system design. We introduce Graph-Propagated Projection Unlearning (GPPU), a unified and scalable algorithm for class-level unlearning that operates across both vision and audio models. GPPU employs graph-based propagation to identify class-specific directions in the feature space and projects representations onto the orthogonal subspace, followed by targeted fine-tuning, to ensure that target class information is effectively and irreversibly removed. Through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks spanning a variety of architectures including CNNs, Vision Transformers, and Audio Transformers, we demonstrate that GPPU achieves highly efficient unlearning, realizing 10-20x speedups over prior methodologies while preserving model utility on retained classes. Our framework provides a principled and modality-agnostic approach to machine unlearning, evaluated at a scale that has received limited attention in prior work, contributing toward more efficient and responsible deep learning.
▶
arXiv cs.AI
30.04.2026
Diffusion Language Models for Speech Recognition
arXiv:2604.14001v2 Announce Type: replace-cross
Abstract: Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional …
arXiv →
Diffusion Language Models for Speech Recognition
arXiv:2604.14001v2 Announce Type: replace-cross Abstract: Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional …
arXiv:2604.14001v2 Announce Type: replace-cross Abstract: Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Awakening Dormant Experts:Counterfactual Routing to Mitigate MoE Hallucinations
arXiv:2604.14246v2 Announce Type: replace-cross
Abstract: Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when …
arXiv →
Awakening Dormant Experts:Counterfactual Routing to Mitigate MoE Hallucinations
arXiv:2604.14246v2 Announce Type: replace-cross Abstract: Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when …
arXiv:2604.14246v2 Announce Type: replace-cross Abstract: Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when processing long-tail knowledge. We identify that this fragility stems from static Top-$k$ routing: routers tend to favor high-frequency patterns over rare factual associations. Consequently, ``specialist experts'' possessing critical long-tail knowledge are often assigned low gating scores and remain ``dormant'' -- under-prioritized for specific tokens despite their proven causal importance on other inputs. To address this, we propose Counterfactual Routing (CoR), a training-free inference framework designed to awaken these dormant experts. CoR integrates layer-wise perturbation analysis with the Counterfactual Expert Impact (CEI) metric to dynamically shift computational resources from syntax-dominant to knowledge-intensive layers while maintaining a constant total activation count, effectively retrieving causally decisive experts via virtual ablation. Extensive experiments on TruthfulQA, FACTOR, and TriviaQA demonstrate that CoR improves factual accuracy by 3.1\% on average without increasing the inference budget, establishing a superior Pareto frontier compared to static scaling strategies.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG
arXiv:2604.14572v2 Announce Type: replace-cross
Abstract: Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results:…
arXiv →
Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG
arXiv:2604.14572v2 Announce Type: replace-cross Abstract: Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results:…
arXiv:2604.14572v2 Announce Type: replace-cross Abstract: Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never sees how the corpus is organized or what it has not yet retrieved, limiting its ability to backtrack or combine scattered evidence. We present Corpus2Skill, which distills a document corpus into a hierarchical skill directory offline and lets an LLM agent navigate it at serve time. The compilation pipeline iteratively clusters documents, generates LLM-written summaries at each level, and materializes the result as a tree of navigable skill files. At serve time, the agent receives a bird's-eye view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID. Because the hierarchy is explicitly visible, the agent can reason about where to look, backtrack from unproductive paths, and combine evidence across branches. On WixQA, an enterprise customer-support benchmark for RAG, Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all quality metrics. We further evaluate generalization on nine RAGBench subsets reformulated as retrieval-stress benchmarks: Corpus2Skill attains the highest macro-average F1 across the full 10-dataset suite and characterizes a clear regime -- single-domain, atomic-document corpora -- where corpus navigation is the right primitive, while flat retrieval remains preferable for open-domain or extractive pools.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
arXiv:2604.16552v2 Announce Type: replace-cross
Abstract: Recent text-to-scene generation approaches largely reduced the manual efforts required to create 3D scenes. However, their focus is either to…
arXiv →
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
arXiv:2604.16552v2 Announce Type: replace-cross Abstract: Recent text-to-scene generation approaches largely reduced the manual efforts required to create 3D scenes. However, their focus is either to…
arXiv:2604.16552v2 Announce Type: replace-cross Abstract: Recent text-to-scene generation approaches largely reduced the manual efforts required to create 3D scenes. However, their focus is either to generate a scene layout or to generate objects, and few generate both. The generated scene layout is often simple even with LLM's help. Moreover, the generated scene is often inconsistent with the text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of the objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At the core is a 3D Autoregressive Diffusion model 3D-ARD+, which unifies the autoregressive generation over a multimodal token sequence and diffusion generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the scene space, conditioned on both the current seen text instructions and already synthesized 3D scene. It then uses a second step to generate the 3D latents in the smaller object space, which can be decoded into fine-grained object geometry and appearance. We curate a large dataset of 230K indoor scenes with paired text instructions for training. We evaluate 7B 3D-ARD+, on challenging scenes, and showcase the model can generate and place objects following non-trivial spatial layout and semantics prescribed by the text instructions.
▶
arXiv cs.AI
30.04.2026
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
arXiv:2604.18521v2 Announce Type: replace-cross
Abstract: Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of…
arXiv →
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
arXiv:2604.18521v2 Announce Type: replace-cross Abstract: Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of…
arXiv:2604.18521v2 Announce Type: replace-cross Abstract: Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. IDOBE dataset along with baselines are released publicly on https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.
▶
arXiv cs.AI
30.04.2026
Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
arXiv:2604.18701v2 Announce Type: replace-cross
Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction er…
arXiv →
Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
arXiv:2604.18701v2 Announce Type: replace-cross Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction er…
arXiv:2604.18701v2 Announce Type: replace-cross Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
arXiv:2604.21017v2 Announce Type: replace-cross
Abstract: Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhum…
arXiv →
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
arXiv:2604.21017v2 Announce Type: replace-cross Abstract: Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhum…
arXiv:2604.21017v2 Announce Type: replace-cross Abstract: Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
▶
arXiv cs.AI
30.04.2026
Causal Disentanglement for Full-Reference Image Quality Assessment
arXiv:2604.21654v2 Announce Type: replace-cross
Abstract: Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep…
arXiv →
Causal Disentanglement for Full-Reference Image Quality Assessment
arXiv:2604.21654v2 Announce Type: replace-cross Abstract: Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep…
arXiv:2604.21654v2 Announce Type: replace-cross Abstract: Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
arXiv:2604.22063v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in …
arXiv →
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
arXiv:2604.22063v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in …
arXiv:2604.22063v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.
▶
arXiv cs.AI
30.04.2026
Semantic Error Correction and Decoding for Short Block Codes
arXiv:2604.22269v2 Announce Type: replace-cross
Abstract: This paper presents a semantic-enhanced receiver framework for transmitting natural language sentences over noisy wireless channels using mul…
arXiv →
Semantic Error Correction and Decoding for Short Block Codes
arXiv:2604.22269v2 Announce Type: replace-cross Abstract: This paper presents a semantic-enhanced receiver framework for transmitting natural language sentences over noisy wireless channels using mul…
arXiv:2604.22269v2 Announce Type: replace-cross Abstract: This paper presents a semantic-enhanced receiver framework for transmitting natural language sentences over noisy wireless channels using multiple short block codes. After ASCII encoding, the sentence is divided into segments, each independently encoded with a short block code and transmitted over an AWGN channel. At the receiver, segments are decoded in parallel, followed by a semantic error correction (SEC) model, which reconstructs corrupted segments using language model context. We further propose the semantic list decoding (SLD), which generates multiple candidate reconstructions and selects the best one via weighted Hamming distance, and a semantic confidence-guided HARQ (SHARQ) mechanism that replaces CRC-based error detection with a confidence score, enabling selective segment retransmission without CRC overhead. All modules are designed and trained using bidirectional and auto-regressive transformers (BART). Simulation results demonstrate that the proposed scheme significantly outperforms conventional capacity-approaching short codes and long codes at the same rate. Specifically, SEC provides approximately 0.4 dB BLER gain over plain short-code transmission, while SLD extends this to 0.8 dB. Compared to transmitting the entire sentence as a single long 5G LDPC codeword, our approach significantly improves semantic fidelity and reduces decoding latency by up to 90\%. SHARQ further provides an additional 1.5 dB gain over conventional HARQ.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks
arXiv:2604.24810v2 Announce Type: replace-cross
Abstract: Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a sign…
arXiv →
A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks
arXiv:2604.24810v2 Announce Type: replace-cross Abstract: Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a sign…
arXiv:2604.24810v2 Announce Type: replace-cross Abstract: Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a significant challenge. Therefore, smart and adaptive inference strategies that dynamically balance computational cost or latency with predictive accuracy are critical in edge computing scenarios. In this work, we build on Adaptive Deep Neural Networks (ADNNs) that employ the Multi-Armed Bandit (MAB) framework. Current literature leverages the first version of the Upper Confidence Bound (UCB1) strategy to dynamically select the optimal confidence threshold, enabling efficient early exits without sacrificing accuracy. However, we introduce four additional Upper Confidence Bound strategies in ADNNs, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and perform, for the first time, a comparative study of these strategies with respect to trade-offs between accuracy, energy consumption, and latency. The proposed UCB strategies are employed on the ResNet and MobileViT neural networks, and are evaluated on the benchmark datasets of CIFAR-10, CIFAR-10.1, and CIFAR-100. Experimental results demonstrate that all strategies achieve sub-linear cumulative regret, with UCB-Bayes converging the fastest, followed by UCB-Tuned and UCB-V. Finally, UCB-V and UCB-Tuned dominate the Pareto Frontiers of accuracy-latency and accuracy-energy trade-offs.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
ViPO: Visual Preference Optimization at Scale
arXiv:2604.24953v2 Announce Type: replace-cross
Abstract: While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexp…
arXiv →
ViPO: Visual Preference Optimization at Scale
arXiv:2604.24953v2 Announce Type: replace-cross Abstract: While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexp…
arXiv:2604.24953v2 Announce Type: replace-cross Abstract: While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver
arXiv:2604.25067v2 Announce Type: replace-cross
Abstract: Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing bench…
arXiv →
Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver
arXiv:2604.25067v2 Announce Type: replace-cross Abstract: Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing bench…
arXiv:2604.25067v2 Announce Type: replace-cross Abstract: Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI's capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first-mover against Pons in seven of eight trials, statistically significantly better than the other agents tested, none of which exceeded two of eight. The task, which no frontier agent could reliably complete when we began development in January of 2026, is now near-saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4's time-budget usage, consistent with but not diagnostic of sandbagging; Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
Towards Unified Multi-task EEG Analysis with Low-Rank Adaptation
arXiv:2604.25131v2 Announce Type: replace-cross
Abstract: Recent self-supervised pre-training methods for electroencephalogram (EEG) have shown promising results. However, the pre-trained models typi…
arXiv →
Towards Unified Multi-task EEG Analysis with Low-Rank Adaptation
arXiv:2604.25131v2 Announce Type: replace-cross Abstract: Recent self-supervised pre-training methods for electroencephalogram (EEG) have shown promising results. However, the pre-trained models typi…
arXiv:2604.25131v2 Announce Type: replace-cross Abstract: Recent self-supervised pre-training methods for electroencephalogram (EEG) have shown promising results. However, the pre-trained models typically require full fine-tuning on each downstream task individually to achieve good performance. In practical applications involving multiple tasks, utilizing a separate model for each task is not ideal regarding computational and spatial cost. In this study, we go one step further and explore the simultaneous adaptation of a pre-trained model to multiple different tasks. The EEG signals exhibit significant heterogeneity due to their collection from various subjects using diverse devices and experimental setups, resulting in potential conflicts among different tasks that impede joint optimization. To tackle this challenge, we propose MTEEG, a multi-task EEG analysis framework which incorporates task-specific low-rank adaptation (LoRA) modules to disentangle the parameter space and alleviate task conflicts. To investigate the trade-off between task specification and interaction, we propose three variants of MTEEG that integrate the LoRA modules in different ways and evaluate them on six downstream tasks, demonstrating that MTEEG can surpass state-of-the-art single-task methods on the majority of metrics. MTEEG shows the potential of multi-task EEG analysis and promotes the development of general-purpose brain-computer interfaces in the future.
▶
arXiv cs.AI
30.04.2026
The Role of Symmetry in Optimizing Overparameterized Networks
arXiv:2604.25150v2 Announce Type: replace-cross
Abstract: Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely und…
arXiv →
The Role of Symmetry in Optimizing Overparameterized Networks
arXiv:2604.25150v2 Announce Type: replace-cross Abstract: Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely und…
arXiv:2604.25150v2 Announce Type: replace-cross Abstract: Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely understood. We analyze weight-space symmetries in neural networks and show that overparameterization introduces additional symmetries that benefit optimization in two distinct ways. First, we prove that these symmetries act as a form of diagonal preconditioning on the Hessian, enabling the existence of better-conditioned minima within each equivalence class of functionally identical solutions. Second, we show that overparameterization increases the probability mass of global minima near typical initializations, making these favorable solutions more reachable. Teacher-student network experiments validate our theoretical predictions: as width increases, the Hessian trace decreases, condition numbers improve, and convergence accelerates. Our analysis provides a unified framework for understanding overparameterization and width growth as a geometric transformation of the loss landscape.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale
arXiv:2604.25209v2 Announce Type: replace-cross
Abstract: Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood…
arXiv →
DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale
arXiv:2604.25209v2 Announce Type: replace-cross Abstract: Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood…
arXiv:2604.25209v2 Announce Type: replace-cross Abstract: Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top-performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology-faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto-optimal configurations that match or beat GPU-accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3-4 times more topological structure than UMAP at comparable wall-clock.
▶
⭐ Highlight
arXiv cs.AI
30.04.2026
AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices
arXiv:2604.25326v2 Announce Type: replace-cross
Abstract: Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language mode…
arXiv →
AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices
arXiv:2604.25326v2 Announce Type: replace-cross Abstract: Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language mode…
arXiv:2604.25326v2 Announce Type: replace-cross Abstract: Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single-NPU-PIM system faces idle overhead in traditional operator-level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves parallel drafting on the PIM and verification on a single NPU through task-level DLM-TLM decoupling and specifically, it incorporates Entropy-History-Aware Drafting Control and Time-Aware Pre-Verification Control to dynamically manage adaptive drafting algorithm execution and pre-verification timing, suppressing invalid drafting based on low-confidence drafts. Additionally, AHASD integrates Attention Algorithm Units and Gated Task Scheduling Units within LPDDR5-PIM to enable attention link localization and sub-microsecond task switching on the PIM side. Experimental results for different LLMs and adaptive drafting algorithms show that AHASD achieves up to 4.2$\times$ in throughput and 5.6$\times$ in energy efficiency improvements over a GPU-only baseline, and 1.5$\times$ in throughput and 1.24$\times$ in energy efficiency gains over the state-of-the-art GPU+PIM baseline, with hardware overhead below 3\% of the DRAM area.
▶
arXiv cs.AI
30.04.2026
The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation
arXiv:2604.25530v2 Announce Type: replace-cross
Abstract: Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typicall…
arXiv →
The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation
arXiv:2604.25530v2 Announce Type: replace-cross Abstract: Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typicall…
arXiv:2604.25530v2 Announce Type: replace-cross Abstract: Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
arXiv:2604.26568v1 Announce Type: new
Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable …
arXiv →
Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
arXiv:2604.26568v1 Announce Type: new Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable …
arXiv:2604.26568v1 Announce Type: new Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type, and symptom classification. By fine-tuning Speech Representation Models (SRM), and using targeted data augmentation we mitigate biases found by previous works, and improve upon all clinical tasks in the benchmark. We also treat Automatic Speech Recognition (ASR) with our data augmentation approach. Our results demonstrate that SRM consistently outperform the LLM-based state-of-the-art across all evaluated tasks by a large margin. We publish our models and code to foster future research.
▶
⭐ Highlight
arXiv cs.CL
30.04.2026
SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
arXiv:2604.26506v1 Announce Type: new
Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instru…
arXiv →
SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instru…
arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.