Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-16

Faithful Action-unit Causal Reasoning for Counterfactually Faithful Emotion Explanations

Multimodal models can name the action units (AUs) behind a facial emotion, but their AU->emotion rationales are typically plausible rather than faithful: nothing forces the AUs a model invokes to be the AUs that actually drive its prediction. We cast AU->emotion reasoning as a counterfactual-consistency problem between the rationale, the label, and a structural AU->emotion causal graph G, and propose FACR, which grounds the reasoner in an independently induced, polarity-aware G and trains a counterfactual-faithfulness objective: a do-intervention on an AU that G marks causal for a class must move the prediction, while one it marks irrelevant must leave it unchanged. Faithfulness is thereby both trainable and measurable through a matching interventional metric, which we evaluate against a known causal structure, the PSPI pain-AU composition, as no existing affective-reasoning benchmark allows. We are explicit that this metric tests fidelity to the supplied structure rather than its rediscovery: it asks whether the trained reasoner invokes the AUs the structure marks causal, on held-out subjects and a second dataset. Under subject-independent evaluation on UNBC-PAIN, the objective raises the agreement between the invoked AUs and the PSPI composition from a no-objective baseline of 0.08 to 0.57, at a small detection cost; an unfaithfulness control attributes the gain to the objective. On a cross-dataset emotion transfer, the objective likewise raises fidelity to G on a seven-class task (0.50 to 0.84). Finally, we attach a language verbalizer and extend the audit to the generated text: biasing each action unit's emission by its latent activation makes the rationale faithful by construction, so that ablating an AU removes it from the explanation, a property that transfers to a second language-model backbone, whereas a freely generated rationale is unfaithful.

02.
arXiv (math.PR) 2026-06-15

Boltzmann-Like Occupation of Nonequilibrium Steady States on Dense Networks

arXiv:2606.14542v1 Announce Type: cross Abstract: A central problem in statistical physics is to extend the Boltzmann distribution to nonequilibrium steady states (NESS). We prove that NESS on large dense networks have Boltzmann-like occupation despite extensive entropy production. We further show that the active-matter heuristic of "low rattling" is asymptotically exact. Intuitively, these NESS spend a greater fraction of their time in states they leave more slowly. This explanation extends to the broader class of "equiaccessible" steady states, which play a role in our analysis akin to that of equilibrium in linear response.

03.
Nature (Science) 2026-06-17

Probing picometre-scale interlayer deformations via hyperbolic polaritons

作者:

The resilience of van der Waals (vdW) materials to large strain fields makes them an ideal platform for tuning electronic, optical and magnetic properties1–4. Although in-plane strain is readily mapped, non-invasive and quantitative characterization of out-of-plane strain remains a formidable challenge, particularly for picometre-scale deformations buried at interfaces. Here we demonstrate a polaritonic optical method that uses the mid-infrared out-of-plane hyperbolic polaritons (oHPs) mode to detect interlayer deformations in prototypical vdW polar insulator–hexagonal boron nitride (hBN). This method uses the softening mechanism of out-of-plane transverse optical (oTO) phonons induced by interlayer strain, enabling highly sensitive detection of picometre-scale deformations. Although these oTO phonon modes are typically spectroscopically ‘dark’, their strain response is activated through the oHPs, achieving an atomic displacement sensitivity of about 10 pm (about 8 × 10−7 times the probing wavelength), enabling ultradeep-subwavelength mechanical interlayer deformation detection. This is experimentally validated in both planar hBN and at the buried interface of quantum dot–hBN nanotube heterostructures. This polariton-based picometrology bridges nanomechanics and photonics, providing a non-destructive lens to visualize hidden stress landscapes with atomic precision. A new polaritonic optical method that uses the mid-infrared out-of-plane hyperbolic polaritons mode is described and experimentally validated to allow the examination of picometre-scale interlayer deformations, providing a bridge between nanomechanics and photonics.

04.
medRxiv (Medicine) 2026-06-16

Development and reliability and validity test of the Questionnaire on Knowledge, Attitude and Practice of ICU Nurses on Blood Oxygen Saturation Management in Mechanically Ventilated Patients

Objective: A questionnaire on the knowledge, attitude and practice of ICU nurses regarding the management of blood oxygen saturation in patients with mechanical ventilation was compiled, and its reliability and validity were tested. Method: Drawing upon the knowledge-attitude-practice theory, the initial questionnaire draft was developed through literature review and consultation with Delphi experts. Employing convenience sampling, 32 nurses from the General ICU of Wuxi Second People's Hospital were surveyed between 1 August 2025 and 27 September 2025, enabling item screening and assessment of reliability and validity.The full version of the developed questionnaire is provided as Supporting Information (S1 File). All items are published under a CC BY 4.0 license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Result: A questionnaire on the knowledge, attitude and practice of ICU nurses regarding the management of blood oxygen saturation in mechanically ventilated patients was finalised, comprising 26 items: 11 in the knowledge dimension, 6 in the attitude dimension and 9 in the behaviour dimension. The overall Cronbach's coefficient for the questionnaire was 0.88, with dimension-specific coefficients of 0.787, 0.722, and 0.781 respectively. The Spearman-Brown coefficient for the entire questionnaire was 0.967, while dimension-specific coefficients were 0.796, 0.666, and 0.728 respectively. The content validity index at the questionnaire level (S-CVI) was 0.886, and the item-level content validity index (I-CVI) ranged from 0.913 to 0.967. 0.728. The questionnaire's level content validity index (S-CVI) was 0.886, and the item level content validity index (I-CVI) ranged from 0.913 to 1.00. Conclusion: The questionnaire on knowledge, attitude and practice of blood oxygen saturation management in mechanically ventilated patients demonstrates good reliability and validity. It may serve as an assessment tool for intensive care unit nurses regarding their knowledge, attitude, and practices concerning blood oxygen saturation management in mechanically ventilated patients, thereby establishing a foundation for developing targeted intervention strategies in future practice.

05.
arXiv (CS.CL) 2026-06-16

RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.

06.
arXiv (CS.CV) 2026-06-11

Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Our analysis demonstrates that visual confounders, particularly head pose and face resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, performance disparities across gender and race vanish. However, we identify a statistically significant age-related bias, with higher localization errors for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

07.
arXiv (quant-ph) 2026-06-15

Certifying Macroscopic Quantum Mechanics via Hypothesis Testing with Finite Data

arXiv:2506.22092v2 Announce Type: replace Abstract: We address the challenge of certifying quantum behavior with single macroscopic massive particles, subject to decoherence and finite data. We propose a hypothesis testing framework that distinguishes between classical and quantum mechanics based on position measurements. While interference pattern visibility in single-particle quantum superposition experiments has been commonly used as a sufficient criterion to falsify classical mechanics, we show that, from a hypothesis testing perspective, it is neither necessary nor efficient. Focusing on recent proposals to prepare macroscopic superposition states of levitated nanoparticles, we show that the likelihood ratio test – which leverages differences across the entire probability distribution – provides an exponential reduction in measurements needed to reach a given confidence level. These results generalize to a broad class of quantum states, and offer a principled, efficient method to falsify classical mechanics in interference experiments, relaxing the experimental constraints faced by current efforts to test quantum mechanics at the macroscopic scale.

08.
arXiv (CS.CL) 2026-06-16

Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($\sigma = E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $\kappa = 0.88$) while false-presupposition scoring is judge-sensitive ($\kappa = 0.36$) – a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.

09.
arXiv (quant-ph) 2026-06-12

Interference of critical dynamics associated with zero modes

arXiv:2606.13200v1 Announce Type: new Abstract: We study the interference of critical dynamics associated with zero modes (ICDZM) in the generalized Creutz ladders using closed quench paths that pass through two critical points successively. By reading out the final zero-mode transfer probability, we find rich ICDZM interference patterns dependent on the quench path. In particular, when the closed path links two topologically nontrivial phases, the ICDZM pattern may either vanish or exhibit period doubling. Within the framework of WKB analysis, this phenomenon is well clarified by the interference phase accumulated in the quench procedure. We also demonstrate that the zero-mode transfer probability can be detected by the deviation of the boundary particle number from its initial fractional value, which arises from the blending of bulk modes in the critical dynamics. As an edge defect, the zero-mode transfer probability captures both the ICDZM oscillation and the known anomalous defect production in a non-closed quench path. These results identify ICDZM and the corresponding edge defect as probes for critical dynamics associated with topological zero modes.

11.
arXiv (CS.LG) 2026-06-12

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

作者:

arXiv:2606.07334v2 Announce Type: replace-cross Abstract: This report treats chord-symbol sequences as an interpretable, controllable time series for genre-local harmonic modeling. The frozen Music Transformer base - released as a pop-jazz fine-tune endpoint but verified in this revision weight-identical to the pop-only Phase-0 baseline, so all gains are measured over a pure-pop prior (see Changes in v2) - is extended to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction (macro gains +2.89 to +3.61 percentage points); LoRA and IA3 score highest, but pairwise Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: at a common corpus size IA3 stays on top while LoRA drops to last, so the small method gaps are partly data-driven rather than representational. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting the adaptation effect is largely lightweight conditioning over a reusable harmonic base rather than genre-specific adapter memory. Further diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation that v2 reinterprets as a same-weights control, chord-only genre classification, output-distribution statistics, real-song evaluation, duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. Perceived genre authenticity and musical quality are left to controlled listener evaluation.

12.
arXiv (CS.CL) 2026-06-16

Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.

13.
arXiv (CS.CL) 2026-06-12

A Unifying Lens on Reward Uncertainty in RLHF

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.

14.
arXiv (CS.LG) 2026-06-18

Provable quantum speedups for computing persistence in topological data analysis

arXiv:2410.21258v2 Announce Type: replace-cross Abstract: Topological data analysis (TDA) aims to extract noise-robust features from a data set by examining the number and persistence of holes in its topology. We provide an efficient quantum algorithm for a computational problem closely related to a core task in TDA – determining whether a given hole persists across different length scales. Further, we prove the problem itself is $\mathsf{BQP}_1$-hard, implying that a classical solution is extremely unlikely; this stands in contrast to all previous quantum approaches to TDA, where the problems were also intractable for quantum computers, or where a rigorous proof of classical hardness still remains open. This result implies an {exponential} quantum speedup for this problem under standard complexity-theoretic assumptions. Our approach relies on encoding the persistence of a hole in a variant of the guided sparse Hamiltonian problem, where the guiding state is constructed from a harmonic representative of the hole.

15.
arXiv (CS.CV) 2026-06-15

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.

16.
Nature (Science) 2026-06-10

Light slows down carbon nanotubes in water

Water-suspended carbon nanotubes move more slowly in green light, suggesting that excited electrons in the tubes couple to the water through ‘quantum friction’. Water-suspended carbon nanotubes move more slowly in green light, suggesting that excited electrons in the tubes couple to the water through ‘quantum friction’.

17.
arXiv (math.PR) 2026-06-16

Delayed acceptance sampling with Hamiltonian proposal subchains for random field materials inference

arXiv:2606.14743v1 Announce Type: cross Abstract: This paper focuses on accelerating Markov chain Monte Carlo sampling in Bayesian inverse problems in which forward model evaluations dominate the computational cost. It builds on several established ingredients previously used in related scenarios: delayed acceptance, neural network surrogate models, Hamiltonian proposals, and proposal subchains. The main framework is the delayed-acceptance Metropolis-Hastings algorithm of Christen and Fox (2005). The first-stage proposal distribution is constructed from a subchain of Hamiltonian trajectories targeting the surrogate posterior. For each fixed surrogate model, the Hamiltonian subchain and delayed-acceptance correction define a kernel invariant with respect to the exact posterior. In the present work, the surrogate is updated only during a burn-in phase, after which the production run uses a fixed surrogate model. The sampling framework is implemented in Python using parallel processes. Several chains are generated in parallel and share a single surrogate model trained during burn-in on all collected data. The forward model is treated as a black box; therefore, the application area is broad. However, the main motivation is efficient solution of geotechnical inverse problems with material properties represented by Gaussian random fields. In this study, the sampling framework is applied to a geotechnical inverse problem in which hydraulic conductivity and porosity are modeled as non-stationary Gaussian random fields approximated using truncated Karhunen-Loeve expansions. Based on a precomputation, the truncation dimensions are chosen separately for hydraulic conductivity and porosity. The forward model outputs are pore pressure values at control points and selected observation times. These are compared with in situ pore pressure measurements collected over one year during the Tunnel Sealing Experiment in an underground laboratory in Canada.

18.
arXiv (CS.AI) 2026-06-12

Agents-K1: Towards Agent-native Knowledge Orchestration

arXiv:2606.13669v1 Announce Type: new Abstract: Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat \texttt{cites} edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

19.
medRxiv (Medicine) 2026-06-17

Targeted Proteomic Profiling of Nasal Fluid from the Brain-Nose Interface

The brain-nose interface is an anatomical junction where olfactory neurons from the olfactory bulb traverse the cribriform plate into the nasal mucosa, providing minimally invasive access to the central nervous system (CNS). We hypothesized that nasal fluid from this region could enable detection of neurology-relevant proteins using targeted multiplex assays. Using nosecollect, a targeted nasal sampling device, nasal fluid proximal to brain-nose interface was collected from cognitively impaired patients, alongside matched cerebrospinal fluid (CSF) and plasma. After nasal sample-specific dilution optimization and intra-assay precision evaluation, all matrices were profiled with the Olink Target 96 Neurology and NUcleic acid Linked Immuno-Sandwich Assay CNS disease 120 (NULISAseq CNS Disease 120) panels. Nasal fluid showed technically repeatable detection (intra-assay coefficient of variation

20.
arXiv (CS.AI) 2026-06-15

Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

arXiv:2606.14507v1 Announce Type: new Abstract: Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter F1@0.5 (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint (F1@0.3 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.

21.
arXiv (CS.AI) 2026-06-16

CoAgent: Concurrency Control for Multi-Agent Systems

arXiv:2606.15376v1 Announce Type: cross Abstract: Multi-agent LLM systems – coding agents, devops agents, document agents – now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has studied for decades, but classical mechanisms fit LLM agents poorly. A single agent transaction spans minutes of inference, read sets are broad and opaque rather than statically inferable, and the live state agents act on admits neither fork nor buffer, so writes take effect the moment they execute. Locks block long inference intervals; OCC abort-and-retry discards minutes of work on every conflict. This paper builds concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan, and can repair exactly the operations that depended on it. Control therefore turns advisory: the runtime informs, the agent repairs. Our protocol, MTPO (Monotonic Trajectory Pre-Order), fixes a serialization order at launch, serves each read the order-filtered value, and applies writes speculatively in place; a one-way notification asks an affected reader to re-judge and patch its plan, while the framework mechanically undoes and reorders misplaced writes through the saga-style inverse each tool registers in advance. At quiescence the run is serializable in the pre-decided order. We realize MTPO as CoAgent, toolcall middleware whose privileged ToolSmith grows footprint-declared, undoable tools online. On ten contended workloads, CoAgent stays within 5\% of serial correctness at a $1.4\times$ speedup and near-serial token cost, where 2PL and OCC surrender nearly all concurrency gains; on a bash-only target system, it grows a 25-tool library online and lifts the task pass rate from 45/71 to 63/71 at $0.80\times$ the time and $0.86\times$ the cost.

22.
arXiv (CS.CL) 2026-06-11

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.

23.
arXiv (CS.CV) 2026-06-19

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token ; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and no segmentation decoder or bounding-box head.

24.
arXiv (CS.LG) 2026-06-18

Complementary Attention Head Pruning for Efficient Transformers

arXiv:2606.19150v1 Announce Type: new Abstract: The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured pruning offers a pathway to compression, existing state-of-the-art methods often rely on gradient-based importance ranking or stochastic gating, which suffer from instability, structural degeneration, and the need for extensive manual hyperparameter tuning. In this paper, we introduce CAHP (Complementary Attention Head Pruning), a novel post-hoc framework that redefines head selection as a global graph-theoretical problem. Rather than evaluating heads in isolation, CAHP utilizes graph-based clustering combined with information-theoretic distance measures to identify and preserve a topologically diverse subset of complementary attention heads. Without requiring a predefined sparsity level or pruning ratio, the framework automatically determines the number of selected attention heads across layers by identifying a diminishing marginal performance curve, where pruning additional heads leads to a sharp degradation in performance, as determined by the chosen polynomial degree. Extensive evaluations on the SST-5 and MNLI benchmarks, across different Transformer model scales, demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression regimes. Furthermore, our structural analysis shows that CAHP avoids the "proximity bias" of gradient-based pruning methods, which tend to preserve heads mainly in layers close to the output, and instead retains a functionally critical set of attention heads in the model's intermediate layers.

25.
arXiv (CS.CL) 2026-06-19

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

Information extraction from pathology reports is essential for cancer staging, tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved Micro-F1of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.