×

Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

Authors: Qu ×
Shuffle
01.
arXiv (CS.CV) 2026-06-18

Bridging Single Distortion Artifacts and Mmultifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming and suffers from a class imbalance where low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, there are $6\%$ images with PI-QUAL scores lower than 4, $87\%$ of DWI issues are due to distortion. Many of the other clinical quality issues are under-represented. To address this common dual-scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.

02.
arXiv (CS.CL) 2026-06-16

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user's emotion, but also improve the user's emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator's per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.

03.
arXiv (CS.AI) 2026-06-12

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

Authors:

arXiv:2606.12835v1 Announce Type: cross Abstract: The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

04.
arXiv (CS.CL) 2026-06-11

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods.We have released our code at https://github.com/bebr2/SKIM .

05.
arXiv (CS.CL) 2026-06-19

Where Does Social Reasoning Come From? Capability Provenance in Language Models

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.

06.
arXiv (quant-ph) 2026-06-24

The most discriminable quantum states in the multicopy regime

arXiv:2604.26927v2 Announce Type: replace Abstract: This work investigates which sets of quantum states give rise to the highest achievable success probability in minimum-error state discrimination if multiple copies of the unknown state are given. Specifically, we consider uniformly distributed ensembles of the form $\left\{\frac{1}{N},\rho_i^{\otimes k}\right\}_{i=1}^N$, where $N$ states in dimension $d$ are provided in $k$ identical copies, and derive universal limits in this scenario. For pure state ensembles, we prove that whenever $N$ is large enough to support a state $k$-design, these designs will exactly give rise to the maximally discriminable sets. We further show that when $N$ exceeds the size required for a $k$-design, mixed states can outperform all pure state ensembles. We then recognise that the problem of most discriminable classical states in the multi-copy regime is in one-to-one correspondence to the concept of the multiplicative Bayes capacity of independent uses of classical channels, a concept that emerges naturally in the context of classical information leakage. This connection allows us to completely solve the classical analogue of our problem when $N\geq \binom{d + k - 1}{k}$, and to prove that quantum systems offer a quadratic advantage (in number of copies $k$) over classical ones. Then, we prove that this classical over quantum advantage is strongly reduced when one is restricted to real quantum states, more precisely, when $N \geq k + 1$, pure real qubits only offer a constant advantage over classical bits. Finally, we introduce computational techniques to find sets of most discriminable ensembles and to obtain rigorous universal upper bounds on the maximal success probability for multi-copy state discrimination in cases that are analytically intractable.

07.
arXiv (CS.AI) 2026-06-16

LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

arXiv:2606.16802v1 Announce Type: new Abstract: Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces, and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.

08.
arXiv (quant-ph) 2026-06-15

Physics-Informed Variational Quantum Classifier for Phase Detection in Strongly Correlated Matter

arXiv:2606.14489v1 Announce Type: new Abstract: The characterisation of quantum phases in strongly correlated systems is a crucial milestone for the deployment of quantum sensors. In this work, we present a Physics-Informed Variational Quantum Classifier (VQC) designed to detect the topological phase transition between the Fermi polaron quasiparticle and the molecular bound state. Unlike conventional Machine Learning approaches, our quantum architecture is constructed via the Trotterised time-evolution of an effective Hamiltonian, ensuring that the learnable parameters correspond to interpretable physical quantities. We show that the VQC efficiently discovers the optimal interferometric protocol, specifically the evolution time and effective bath interactions required to maximise the visibility of Ramsey fringes, thereby clearly distinguishing the Bose-Einstein Condensate (BEC) and Bardeen-Cooper-Schrieffer (BCS) regimes. Furthermore, we report the validation of this classifier on the QRed superconducting quantum processor (BSC-CNS). Despite the intrinsic hardware noise and decoherence, the VQC preserves the relative ordering of the topological phases. We demonstrate that the physics-informed architecture achieves a linear gate complexity $\mathcal{O}(N)$, bypassing the exponential memory wall of classical simulation and ensuring scalability to many-body regimes.

09.
medRxiv (Medicine) 2026-06-18

Antimicrobial-resistant E. coli in human, animal and environmental reservoirs in rural Bangladeshi households with young children

In low-income countries, ESBL-producing Escherichia coli (ESBL-EC) is frequently detected in humans, animals and household environments, indicating widespread exposure to antimicrobial resistance (AMR). Established risk factors such as antibiotic use do not explain the high community carriage of AMR in all settings; identifying the dominant exposure pathways can inform interventions against AMR. We aimed to investigate (i) animal-human-environment sharing of AMR by assessing associations between the abundance of ESBL-EC in the household environment, domestic animal feces and young children's stool and (ii) household factors associated with ESBL-EC abundance in these reservoirs. We enrolled 112 households from the CRADLE trial in rural Bangladesh. We enumerated ESBL-EC in drinking water, food, child hand rinses, outdoor soil, indoor floor swabs, chicken and cow feces, and stool from children aged 6 months. We recorded indicators of sanitation, animal ownership/management, human and animal antibiotic use, and child exposure behaviors using structured questionnaires and spot checks. The highest prevalence of ESBL-EC was in child stool (95.6%) and animal feces (82.3-96.9%), followed by soil (48.2%) and floors (36.6%); < 10% of food, child hands and drinking water harbored ESBL-EC. The abundance of ESBL-EC in child stool was not associated with its abundance in any sampled matrix; the abundance in chicken but not cow feces showed positive correlations with soil, floors, child hands, and drinking water (correlation coefficients: 0.19-0.39, p-values < 0.05). Higher-quality latrines (improved, pour-flush, with slab) were associated with lower ESBL-EC abundance across matrices; unsafe animal management (animals roaming or spending the night inside the home) was associated with higher abundance. Child antibiotic use and exposure behaviors (soil ingestion, time spent on floor) were not associated with ESBL-EC abundance in child stool. We observed high AMR colonization among young children and domestic animals in rural Bangladesh not explained by traditional fecal-oral exposure pathways. Future studies should explore additional pathways and assess whether sanitation and animal management improvements can reduce AMR.

10.
arXiv (CS.CV) 2026-06-25

Efficient Real-World Dehazing via Physics-Inspired Global-Local Decoupling

Real-world single image dehazing is highly ill-posed due to spatially and spectrally varying scattering, while practical deployment demands lightweight and low-latency models. Existing approaches either rely on fragile physical inversion under simplified assumptions or adopt heavy blind architectures unsuitable for edge deployment. To overcome these limitations, we propose PGL-Net (Physics-Inspired Global-Local Decoupling Network), a lightweight framework that incorporates physical inductive biases via operator-level emulation, avoiding explicit parameter estimation. It decouples dehazing into global distribution rectification and local structural refinement. A Physics-Inspired Affine Fusion (PAF) module performs globally conditioned alignment across hierarchical skip connections to compensate for haze-induced bias, while a compact Degradation-Aware Modulation (DAM) block adaptively restores spatially and spectrally variant details through dynamic feature modulation. Extensive experiments on multiple real-world benchmarks demonstrate that PGL-Net achieves state-of-the-art restoration quality with significantly reduced complexity. Compared with the recent SOTA SGDN, the Tiny variant (PGL-Net-T) improves PSNR by up to 2.6dB and consistently enhances downstream object detection accuracy, while achieving over a 10x reduction in inference latency. Code is publicly available at: https://github.com/sc-30-bit/PGL-Net.

11.
arXiv (CS.AI) 2026-06-16

Agentomics: Economic Foundations for the Valuation, Attribution, and Pricing of AI Agents in Human-AI Workflows

Authors:

arXiv:2606.14769v1 Announce Type: cross Abstract: Agentic AI systems are increasingly being deployed as productive resources in organizational workflows, yet existing evaluation methods primarily measure isolated technical performance rather than economic contribution. This paper introduces Agentomics, a workflow-based framework for valuing, attributing, and pricing human and artificial agents. The framework models a workflow as a configuration of heterogeneous agents whose collective performance determines gross value, deployment cost, reliability, and expected failure loss. Workflow value is treated as a team-level quantity that may include complementarities, substitution effects, bottlenecks, and nonlinear production; additive stage-level value is only a special case. Building on this workflow model, the paper formulates AI deployment as a coalition-formation problem and defines coalition value as the incremental net surplus generated relative to a benchmark human workflow. The Shapley value is then used to attribute economic surplus among participating AI agents, yielding a principled connection among valuation, accountability, and market pricing. The resulting Shapley pricing equilibrium provides a normative benchmark for assessing whether agent prices reflect expected marginal contribution. A security-operations case study illustrates how the framework accounts for productivity gains, deployment costs, reliability losses, and coalition-level complementarities in hybrid human–AI workflows.

12.
arXiv (CS.CL) 2026-06-17

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

13.
arXiv (CS.CV) 2026-06-16

BioAutoML-NAS: An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data

Insect classification is important for agricultural management and ecological research, as it directly affects crop health and production. However, this task remains challenging due to the complex characteristics of insects, class imbalance, and large-scale datasets. To address these issues, we propose BioAutoML-NAS, the first BioAutoML model using multimodal data, including images, and metadata, which applies neural architecture search (NAS) for images to automatically learn the best operations for each connection within each cell. Multiple cells are stacked to form the full network, each extracting detailed image feature representations. A multimodal fusion module combines image embeddings with metadata, allowing the model to use both visual and categorical biological information to classify insects. An alternating bi-level optimization training strategy jointly updates network weights and architecture parameters, while zero operations remove less important connections, producing sparse, efficient, and high-performing architectures. Extensive evaluation on the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8% respectively. Further validation on the Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides accurate, confident insect classification that supports modern sustainable farming.

14.
arXiv (quant-ph) 2026-06-11

Controlled ion-ion interactions and cavity-enhanced emission of a coherent dinuclear Eu$^{3+}$ complex

arXiv:2606.11947v1 Announce Type: new Abstract: Molecular rare-earth-ion complexes offer unique opportunities for quantum technologies by combining the intrinsic coherence properties of rare-earth ions with chemically tunable molecular environments. A crucial capability is the realization of multi-qubit architectures with defined qubit couplings to enable two-qubit quantum gates. Here, we investigate the optical coherence properties and excitation-induced interactions of two Eu$^{3+}$-based molecular complexes, comparing a mononuclear reference system with a dinuclear analogue in which two Eu$^{3+}$ ions are positioned at a well-defined intramolecular distance of about 7 Angstrom. Using cryogenic ensemble spectroscopy, including spectral hole burning, free-induction decay, and photon echo measurements at temperatures down to 100 mK, we demonstrate long optical coherence times $T_{2,o}$ of up to 9 $\mu$s. As a key step toward scalable multi-qubit architectures, a control-target sequence was implemented to probe conditional ion-ion interactions, revealing a stronger interaction-induced dephasing in the dinuclear complex. Finally, we show the integration of the dinuclear complex into a fiber-based optical microcavity, and observe an 380-fold emission enhancement of the $\mathrm{}^5\mathrm{D}_0\rightarrow\mathrm{}^7\mathrm{F}_0$ transition. Together, these results position molecular rare-earth complexes as versatile and chemically tunable building blocks for scalable quantum technologies.

15.
arXiv (CS.AI) 2026-06-24

AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models

Authors:

arXiv:2606.24616v1 Announce Type: new Abstract: Tokens have become the practical accounting unit for modern foundation model services, linking information processing, computation, memory use, energy expenditure, pricing, and economic value. This paper develops a framework for AI tokenomics: the study of how tokens are generated, consumed, priced, allocated, and optimized across AI systems. We connect token-level technical costs to workflow-level production functions, enterprise resource allocation, measurement and instrumentation methods, and emerging market-design questions. The framework shows that token expenditure and economic value are distinct: value depends on marginal productivity, workflow position, hidden reasoning activity, risk, and downstream propagation effects. The paper concludes by identifying open research directions in hidden-token measurement, empirical calibration, token productivity, dynamic allocation, and token-based markets.

16.
arXiv (quant-ph) 2026-06-16

Neural network inverse design of nanophotonic scintillators

arXiv:2606.16309v1 Announce Type: cross Abstract: Scintillators are materials converting high-energy radiation into optical light, essential in a range of technologies such as medical imaging systems and security scanners. Scintillator development and optimization have remained limited by the complexity of their underlying physics, involving stochastic cascades of electron-electron, electron-phonon, and electron-photon interactions. Such processes are typically modeled by non-differentiable Monte Carlo simulations, limiting the applicability of machine learning for scintillator development. Here we present a physics-informed neural network that learns the scintillation cascade process from the incident high-energy particle to photon emission, substantially accelerating scintillator design and optimization. Combining this neural network with photonic simulations enables end-to-end differentiable optimization of the scintillator geometry. This allows us to optimize for arbitrary figures of merit, such as specific target emission patterns.. We demonstrate the concept and characterize it relative to previous approaches by inverse design of nanophotonic scintillators for X-ray imaging.

17.
arXiv (CS.AI) 2026-06-16

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

arXiv:2512.22420v5 Announce Type: replace-cross Abstract: Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively disables speculative decoding when the MAB planner determines that speculation is no longer beneficial, and during the disabled phase, offloads the draft model to the CPU only under GPU memory pressure. This reclaims memory for the KV cache, thereby facilitating larger batch sizes and maximizing overall system throughput. Experiments show that Nightjar achieves up to 14.76% higher throughput than standard speculative decoding and up to 20.18% lower latency in the main benchmark suite under dynamic request arrival rates for real-time LLM serving scenarios.

18.
arXiv (CS.CV) 2026-06-11

Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting

While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.

19.
arXiv (CS.LG) 2026-06-16

Inference-Time Decision Calibration for Temporal Classification

arXiv:2606.16034v1 Announce Type: new Abstract: Temporal classification errors are often treated as representation failures, but they can also arise from how available evidence is converted into decisions. This paper proposes a representation–calibration decomposition for temporal classification. We keep a trained native classifier frozen and separate two inference-time interventions: a conservative residual multi-scale branch that adds auxiliary logits to the native prediction, and a post-hoc branch-aware calibrator that recombines native and residual evidence at decision time. This design distinguishes missing temporal evidence from underused decision-level evidence without retraining the backbone. Across FI-2010, PTB-XL, UCI-HAR, MHEALTH, and HARTH, we find that gains are strongly regime-dependent. Residual multi-scale evidence is most useful in noisy or representation-limited settings, especially short-horizon FI-2010 and weaker recurrent backbones, while branch-aware calibration helps when native and auxiliary logits contain complementary evidence not fully exploited by the raw decision rule. Near-saturated settings show limited gains from either intervention. These results suggest that temporal classification should be understood not only as representation learning, but also as the problem of trusting, combining, and calibrating evidence from multiple views.

20.
arXiv (CS.AI) 2026-06-11

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

arXiv:2606.11830v1 Announce Type: new Abstract: Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95\% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95\% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.

21.
arXiv (CS.LG) 2026-06-12

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

arXiv:2606.12552v1 Announce Type: new Abstract: Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation - exacerbated by the stochastic nature of many algorithms - often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation improves markedly confidence when evaluating and comparing learning algorithm performances. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds if subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.

22.
arXiv (CS.AI) 2026-06-19

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

arXiv:2606.20246v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.

23.
arXiv (CS.LG) 2026-06-24

Efficient reduction of stellar contamination and noise in planetary transmission spectra using neural networks

arXiv:2602.10330v3 Announce Type: replace-cross Abstract: Context: The characterization of exoplanetary atmospheres has been transformed by the James Webb Space Telescope (JWST), whose infrared sensitivity enables transmission spectroscopy at unprecedented precision. However, stellar heterogeneities (e.g., spots and faculae) remain a dominant source of contamination that can bias atmospheric retrievals if not properly corrected. Aims: We present a methodology for reducing stellar contamination and instrument-specific noise from exoplanet transmission spectra using neural networks, in particular the so-called Denoising AutoEncoders (DAE). Our goals are to enable fast, accurate corrections that improve the reliability of atmospheric parameter retrievals and to promote the use of unsupervised algorithms for efficient data processing. Methods: We designed and trained DAE architectures using large synthetic datasets of terrestrial (TRAPPIST-1e analogues) and sub-Neptune (K2-18b analogues) planets. Atmospheric retrieval experiments were then performed on contaminated spectra in order to compare our deep-learning approach against standard correction methods in terms of accuracy and computational cost. Results: Our autoencoders successfully reconstruct uncontaminated spectra, preserving essential molecular features even in low-S/N regimes. In retrieval tests, the denoising autoencoder pre-processing reduces bias in retrieved abundance parameters compared to uncorrected observations. Notably, our method matches the accuracy of simultaneous stellar-contamination fitting while maintaining a much lower computational cost, typically one order of magnitude smaller. Conclusions: These results demonstrate that DAEs outperform conventional correction methods in computational efficiency while maintaining high accuracy, paving the way for their integration into future atmospheric characterization pipelines for both rocky and giant exoplanets.

24.
arXiv (CS.CV) 2026-06-25

ESTANet: Efficient Online Error Detection in Procedural Videos via Prediction Inconsistency

An efficient and accurate system for detecting errors in procedural tasks is crucial for supporting human needs in daily life, as it can provide instant notifications and guide people to correct mistakes. In this work, we study real-time online error detection in procedural videos from a simple but overlooked perspective: the prediction behavior of action detectors themselves. Instead of designing complex architectures or specialized supervision, we observe that action detectors naturally exhibit different prediction characteristics depending on their sensitivity to input dynamics and temporal context. We therefore propose ESTANet (Error-Sensitive and Temporally-vArying Network), a lightweight framework that detects errors by exploiting inconsistencies among action predictions produced by a small set of action detectors. We construct standard and error-sensitive action detectors that behave similarly on correct executions but respond differently when errors occur. Meanwhile, detectors operating with different temporal contexts further amplify prediction inconsistencies when the procedure deviates from the intended sequence. During inference, we detect errors by aggregating mismatches between standard and error-sensitive predictions through majority voting to flag frames that contain errors. Extensive experiments on EgoPER, Assembly-101-O, and EPIC-Tent-O demonstrate that ESTANet achieves state-of-the-art performance in online error detection while maintaining real-time efficiency with a lightweight architecture. Our results highlight that leveraging the intrinsic properties of action detectors can yield a powerful and practical solution for online error detection without increasing architectural design complexity.

25.
arXiv (CS.CL) 2026-06-17

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.