Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-16

Posterior Twins: Distributional Behavioral Simulation for Enterprise Decisions

Authors:

arXiv:2606.16415v1 Announce Type: new Abstract: Enterprise behavioral simulation requires more than producing a plausible response. Many decisions depend on the shape of a population under a proposed action: which segments accept, defect, hesitate, or move into risk-sensitive states. This paper introduces Posterior Twins, a memory-grounded digital-twin approach that represents likely behavior as an updated distribution under a specific decision context. We evaluate a family of Twinning Labs behavioral-model operating points on a 226-example held-out behavioral-response benchmark and report both modal accuracy and Wasserstein-1 distance. The results show that modal accuracy and distributional fidelity identify different operating regimes. TL-Twin Alpha achieves the lowest observed Wasserstein-1 distance in the reported result set ($W_1 = 1.16$), while TL-Twin Delta and TL-Twin Gamma provide balanced operating points near the modal-accuracy frontier. The paper frames these results as a systems result: governed memory, behavioral model routing, scenario orchestration, distributional aggregation, and auditability are necessary for turning simulated behavior into reusable enterprise decision evidence.

02.
arXiv (CS.CV) 2026-06-12

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.

03.
arXiv (CS.AI) 2026-06-16

Virtual Sensing to Enable Real-Time Monitoring of Inaccessible Locations & Unmeasurable Parameters

arXiv:2412.00107v2 Announce Type: replace-cross Abstract: Real-time monitoring of safety-critical interior states remains an open problem in energy systems where physical instrumentation is infeasible. Existing approaches rely on explicit governing equations, finite-dimensional state vectors, or per-instance retraining, which prevents mesh-independent, field-level inference at arbitrary interior coordinates under real-time constraints. We introduce operator-based virtual sensing for nuclear-grade thermal-fluid systems: we use the neural-operator framework to learn solution operators that map sparse boundary measurements to coupled internal fields in physically inaccessible regions, framing the problem class explicitly to distinguish it from classical state estimation and pointwise soft sensing. We instantiate this framework with MIMONet, a branch-trunk operator extended with three practical choices: multi-modal branch encoders for heterogeneous (scalar and function-valued) inputs; multiplicative branch fusion to preserve the bilinear PDE coupling structure; and shared-latent multi-field decoding with per-channel basis projections at the trunk's final layer. Evaluated across escalating complexity, from canonical lid-driven cavity flow to pressurized water reactor subchannels to fully coupled heat exchangers, MIMONet achieves below 5% relative errors and sub-millisecond inference on data-center accelerators (0.35 ms / 46 mJ per heat-exchanger inference on an NVIDIA H200, and sub-millisecond across the A40-H200-GH200 range), while remaining stable under 50% sensor noise. By staying accurate as geometric confinement and physics coupling intensify, MIMONet shows that operator-based virtual sensing can restore observability where physical instrumentation fails, establishing simulation-based feasibility within the evaluated operating envelopes as a step toward future experimental and cross-solver validation for safety-critical energy systems.

04.
arXiv (CS.AI) 2026-06-12

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

arXiv:2606.13148v1 Announce Type: new Abstract: Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.

05.
arXiv (CS.AI) 2026-06-16

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

arXiv:2606.16210v1 Announce Type: new Abstract: Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

06.
bioRxiv (Bioinfo) 2026-06-12

PeptiDIA: A Machine Learning Framework for Enhanced Peptide Identification in Fast-Gradient Data-Independent Acquisition Proteomics

Data-independent acquisition (DIA) mass spectrometry has become increasingly prevalent in proteomics as advances in instrumentation, chromatography, and computational analysis have enabled robust proteome identification across complex biological samples. However, analytical depth achieved with fast chromatographic gradients remains lower than that obtained using long-gradients, reflecting a throughput-depth trade-off. Here, we present PeptiDIA, a machine learning framework that enhances peptide identification in fast-gradient DIA data by leveraging paired fast and long-gradient acquisitions from identical samples. PeptiDIA processes DIA-NN outputs generated at relaxed false discovery rate thresholds to obtain expanded candidate peptide pools and trains gradient-boosted decision tree models using long-gradient identifications as reference labels. The model integrates DIA-NN features with engineered peptide descriptors and applies isotonic regression to calibrate probabilities, enabling controlled peptide recovery relative to the long-gradient reference. Applied to human and murine datasets spanning six tissues acquired on an Orbitrap Exploris 480, PeptiDIA increased peptide identifications by 25-34% at 1% target reference-discordance rate (RDR) and increased the number of protein groups containing at least one rescued peptide by 15-17%. Overall, PeptiDIA improves the identification depth of fast-gradient DIA-NN workflows without altering acquisition strategies. The framework is available as a web application and command-line tool at https://github.com/Jordano700/PeptiDIA.

07.
medRxiv (Medicine) 2026-06-17

LLM-Driven Extraction of NI-RADS and Imaging Tumor Characteristics to Enhance Oropharyngeal Cancer Survivorship Surveillance

Abstract Purpose Radiologic surveillance is essential for oropharyngeal cancer (OPC) survivors, guiding recurrence detection and follow-up strategies. The Neck Imaging Reporting and Data System provides a standardized framework for post-treatment risk reporting at both the primary tumor site (pNI-RADs) and cervical lymph nodes (nNI-RADS). Comprehensive surveillance additionally requires assessment of disease status, including the primary tumor, nodal involvement, and distant metastases. These clinical results are often embedded as unstructured data within free-text radiology reports. We hypothesized that a large language model (LLM) can reliably extract NI-RADS score criteria and summarize key imaging features from unstructured radiology text, achieving high concordance with expert review. Methods Previously untreated OPC patients who received definitive cancer therapy were identified. Eligible imaging reports included post-treatment head and neck CT, MRI, or FDG PET/CT scans containing narrative and impression text. Examinations lacking narrative or impression text, containing pre-existing NI-RADS annotations, or involving non-surveillance imaging modalities were excluded. A total of 200 reports were randomly selected from 7,076 eligible examinations for manual abstraction using a three-reviewer consensus framework to establish a reference dataset. Using the Palantir Foundry Pipeline Builder, a GPT-5-based LLM was deployed to extract pNI-RADS and nNI-RADS scores, and key imaging features of disease status from these reports. Performance was evaluated using exact agreement and F1-based metrics. Results Agreement for no evidence of disease (score of 1) was 93.3% (126/135; F1 = 0.94) and 90.3% (130/144; F1 = 0.93) for pNI-RADS and nNI-RADS, respectively. For NI-RADS [≥]2, exact category agreement was 73.1% (38/52; macro-F1 = 0.75) for pNI-RADS and 64.3% (27/42; macro-F1 = 0.56) for nNI-RADS. Quadratic weighted {kappa} was 0.81 and 0.59, respectively. For post-treatment disease surveillance variables, agreement was 94.9% (149/157; F1 = 0.87) for primary tumor presence, 89.1% (164/184; F1 = 0.87) for nodal disease presence, and 94.7% (126/133; F1 = 0.70) for distant metastasis detection. Specificity was high across disease-status variables (0.95-0.99), with negative predictive values of 0.95 for primary tumor, 0.87 for nodal disease, and 0.99 for distant metastasis. Conclusions Our LLM-based information retrieval and classification approach for radiographic treatment response from unstructured, multidimensional imaging reports achieved high performance for disease exclusion and moderate performance for detecting suspected residual and/or new disease. This pipeline supports scalable and standardized surveillance data capture for longitudinal monitoring, clinical analytics, and survivorship research in head and neck oncology.

08.
arXiv (math.PR) 2026-06-17

Analysis of the asymmetric shelf shuffle

arXiv:2606.18047v1 Announce Type: new Abstract: In an asymmetric shelf shuffle, a deck of $n$ cards is dealt sequentially from the bottom and assigned one of the $m$ shelves uniformly at random. The card is placed at the top of the assigned shelf with probability $p$, and at the bottom of the assigned shelf with probability $(1-p)$. Analysis of the shelf shuffle has gained much attention recently, and the case $p=1/2$ was first treated by Diaconis–Fulman–Holmes [Ann. Appl. Prob. 23 (2013), no. 4, 1692–1720]. In this paper, we extend the analysis of the shelf shuffle to general $p\in (0, 1)$. In particular, we study the distribution of cycles, cycle lengths, number of descents, number of valleys, number of inversions, and the RSK shape of a permutation obtained from an asymmetric shelf shuffle. Our results extend the analysis of Diaconis–Fulman–Holmes to arbitrary $p$. Furthermore, our analysis of the distribution of descents and inversions is new even for $p=1/2$.

09.
arXiv (CS.AI) 2026-06-15

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

arXiv:2606.13720v1 Announce Type: new Abstract: Arditi et al. (2024) has shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) – nullspace projection and counterfactual flipping – on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite – an intriguing distinction that warrants further investigation in future work.

10.
arXiv (CS.LG) 2026-06-16

Federated Foundation Language Model Post-Training Should Focus on Open-Source Models

arXiv:2505.23593v4 Announce Type: replace Abstract: Post-training of foundation language models has emerged as a promising research domain in federated learning (FL) with the goal to enable privacy-preserving model improvements and adaptations to user's downstream tasks. Recent advances in this area adopt centralized post-training approaches that build upon black-box foundation language models where there is no access to model weights and architecture details. Although the use of black-box models has been successful in centralized post-training, their blind replication in FL raises several concerns. Our opinion is that using black-box models in FL contradicts the core principles of federation such as data privacy and autonomy. In this paper, we critically analyze the usage of black-box models in federated post-training, and provide a detailed account of various aspects of openness and their implications for FL.

11.
arXiv (CS.CL) 2026-06-16

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage point. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.

12.
arXiv (CS.LG) 2026-06-15

Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability

arXiv:2606.14245v1 Announce Type: new Abstract: Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability audit of BridgeDPI architecture on three different datasets including Gao, Human, and C.elegans. This study combines gradient-based attributions – integrated gradients, saliency, layer-wise relevance propagation, SmoothGrad, and SmoothGrad-IG – with feature-wise occlusion ablation and strict intersection consensus across methods to reduce single-explainer bias. We summarize sensitivity and signed effects at raw inputs, at the bridge similarity scaffold, and through the graph convolution, including edge-level sensitivities and targeted edge removals. The results show that explainability is most informative when treated as model criticism: it reveals modality dominance, padding and special-token artifacts, dataset-dependent cooperative versus suppressive effects across layers, and chemistry-consistent fragment and composition motifs where methods agree. These analyses do not substitute for structural or experimental ground truth, yet they can provide testable hypotheses for downstream validation in computational drug discovery pipelines. More broadly, applying modern XAI to contemporary DTI/DTA models is still an early pass over the rich structure implicit in trained weights and data – yet even this first layer of scrutiny already helps researchers relate predictions to drug- and target-side representations and to prioritize external validation.

13.
arXiv (quant-ph) 2026-06-12

Fibonacci Steady-States and Persistent Oscillations in an Ordered Multimode Dicke Model

arXiv:2606.13072v1 Announce Type: new Abstract: Ultracold atoms in multimode optical cavities provide a rich testbed for many-body phenomena enabled by light-mediated interactions. Recent experiments include realizations of spin glasses and associative memories, as described by multimode Dicke models with disordered couplings. However, the properties of multimode Dicke models with ordered coupling geometries remain largely unexplored. In this work, we investigate the stable steady-states of the multimode Dicke model with an ordered nearest-neighbor coupling geometry, where $n_c$ atomic clusters are coupled via $n_c-1$ cavity modes. We show that the number of mean-field stable steady-states in the superradiant phase exhibits Fibonacci scaling with the number of atomic clusters, and that a subset of these steady-states exhibit persistent oscillations. Using both the truncated Wigner approximation and the numerically-exact hierarchy of pure states, we further demonstrate that these features of the stable steady-state solutions persist for finite cluster sizes. Ordered multimode Dicke models, such as the nearest-neighbor coupling geometry considered here, are accessible with current experimental technologies and point toward a broader class of strongly interacting dissipative systems with similarly rich behavior.

14.
arXiv (CS.AI) 2026-06-16

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

arXiv:2606.14817v1 Announce Type: cross Abstract: This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

15.
arXiv (CS.CV) 2026-06-18

SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

Conventional production workflow of high-precision mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized 3D artists/modelers. The recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometry fidelity of existing lower-quality 3D meshes (instead of image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through deformable distance field. Experiments demonstrate that our SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool to both upgrade historical low-quality 3D assets and reduce the workload of sculpting high-poly meshes.

16.
arXiv (CS.CV) 2026-06-17

Query-Efficient Video Adversarial Attack with Stylized Logo on Service Computing

In service computing, video classification has become fundamental to many intelligent applications. While Deep Neural Networks (DNNs) have demonstrated excellent performance in recognizing video content, recent studies have shown that DNNs are highly vulnerable to adversarial examples. Thus, understanding adversarial attacks can better respond to emergency situations. In order to improve attack performance, many style-transfer-based attacks and patch-based attacks have been proposed. However, the global perturbation of the former will bring unnatural global colors, while the latter is difficult to achieve success in targeted attacks due to the limited perturbation space. Moreover, compared to a plethora of methods targeting image classifiers, video adversarial attacks remain relatively underexplored. Therefore, to generate adversarial examples with a low budget and to provide them with a higher verisimilitude, we propose a novel black-box video attack framework, called Stylized Logo Attack (SLA). SLA is conducted through three stages. The first stage involves building a style reference set for logos, which can not only make the generated examples more natural, but also carry more target class features in targeted attacks. Then, Reinforcement Learning is employed to determine the style reference and position parameters of the logo within the video, which ensures that the stylized logo is placed in the video with optimal attributes. Finally, perturbations are optimized in a step-by-step manner so as to improve the fooling rate. Experimental results indicate that SLA can achieve better performance than state-of-the-art methods and still maintain good deception effects when facing various defense methods. We believe SLA can raise awareness among the security community about the reliability and security of video classification systems and serve as a memorandum of possible attack methods.

17.
arXiv (CS.LG) 2026-06-11

FlexiBrain: Resolution-Agnostic Voxel-Level Encoding for Native fMRI

arXiv:2606.11500v1 Announce Type: cross Abstract: The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolutions. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs a dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at https://github.com/OneMore1/FlexiBrain.

18.
arXiv (CS.LG) 2026-06-17

Conformalized Quantum DeepONet Ensembles for Scalable Operator Learning with Distribution-Free Uncertainty

arXiv:2605.00330v2 Announce Type: replace Abstract: Operator learning enables fast surrogate modeling of high-dimensional dynamical systems, but existing approaches face two fundamental limitations: quadratic inference complexity and unreliable uncertainty quantification in safety-critical settings. We propose Conformalized Quantum DeepONet Ensembles, a framework that addresses both challenges simultaneously. By leveraging Quantum Orthogonal Neural Networks (QOrthoNNs), we reduce operator inference complexity from O(n^2) to O(n), enabling scalable evaluation over fine discretizations. To provide rigorous uncertainty quantification, we combine ensemble-based epistemic modeling with adaptive conformal prediction, yielding distribution-free coverage guarantees. A key challenge in ensembling is that naive parallelism scales hardware resources linearly with the number of models. We resolve this by using Superposed Parameterized Quantum Circuits (SPQCs), which compress multiple ensemble members into a single circuit and enable simultaneous multi-model execution. Experiments on synthetic partial differential equations and real-world power system dynamics demonstrate that our approach achieves accurate predictions while maintaining calibrated uncertainty under realistic quantum noise. These results establish a practical pathway toward scalable, uncertainty-aware operator learning in quantum machine learning.

19.
arXiv (CS.CV) 2026-06-15

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

Authors:

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.

20.
arXiv (CS.AI) 2026-06-17

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

arXiv:2510.01359v2 Announce Type: replace-cross Abstract: Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, raising "jailbreak" stakes beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents compile and run malicious programs. We present JAWS-Bench (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes mirroring attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, to measure deployable harm. Across seven LLM backends from five families, prompt-only attacks in JAWS-0 achieve 61% compliance; 58% are harmful, 52% parse, and 27% run end-to-end. In JAWS-1, compliance reaches ~100% for stronger models with a mean ASR (Attack Success Rate) ~71%; JAWS-M raises mean ASR to ~75%, with 32% runnable attack code. Wrapping an LLM in an agent increases ASR by 1.6$\times$, by overturning initial refusals during planning and tool use. Similar trends hold for OpenHands, SWE-Agent, and OpenAI Codex, suggesting our JAWS-Bench is agent-agnostic. Category analyses identify which attack classes are most vulnerable and deployable, motivating execution-aware defenses and refusal-preserving agent designs.

21.
PLOS Medicine 2026-05-06

Point-of-care early infant HIV diagnosis at birth in a pragmatic cluster-randomized trial in Mozambique and Tanzania: A comparative cost and cost-effectiveness study

Authors:

by Kira Elsbernd, Issa Sabi, Ilesh V. Jani, Chishamiso Mudenyanga, Siriel Boniface, Arlete Mahumane, Joaquim Lequechane, Falume Chale, Bindiya Meggi, Kassia Pereira, Raphael Edom, Anange F. Lwilla, W. Chris Buck, Nyanda Elias Ntinyinya, Michael Hoelscher, Till Baernighausen, Arne Kroidl, Stefan Kohler, the LIFE Study Consortium Background Timely access to early infant diagnosis (EID) is crucial for newborns with HIV, as late diagnosis can delay lifesaving antiretroviral treatment (ART). We assessed the comparative cost and cost-effectiveness of integrating point-of-care EID at birth into routine care in primary healthcare settings. Methods and findings This pre-specified secondary analysis was nested in the cluster-randomized LIFE study conducted at 28 primary healthcare facilities in Mozambique and Tanzania from October 2019 to September 2021. We estimated the health system cost of point-of-care birth plus 4–8-week HIV testing (very early infant diagnosis; VEID) compared to standard-of-care (SoC) testing at 4–8 weeks only, both with immediate ART initiation. We assessed the cost-effectiveness of VEID relative to SoC with respect to ART initiation within one week of life using Bayesian hierarchical models. As this is an intermediate outcome, incremental cost-effectiveness ratios (ICERs) cannot be directly compared to available life-year-based cost-effectiveness thresholds. To contextualize results, we derived the minimum life-years gained per early ART initiation required for VEID to meet standard thresholds in a break-even analysis.VEID was associated with a higher cost and resulted in earlier ART initiation than SoC in both countries. In Mozambique, VEID increased the proportion of infants initiating ART within one week of life by 90.0 (95% CrI [67.5, 98.5]) percentage points at an incremental cost of $2,632 (95% CrI [$2,249, $3,062]) per infant with HIV. In Tanzania, VEID increased early ART initiation by 59.9 (95% CrI [20.9, 89.5]) percentage points at an incremental cost of $6,263 (95% CrI [$5,394, $7,243]) per infant with HIV. The ICER was $2,924 and $10,458 in Mozambique and Tanzania, respectively and was sensitive to intrauterine transmission rate. These findings were limited by the lack of long-term health outcome data and reliance on an intermediate outcome. Based on the break-even analysis, we estimated that VEID would need to yield 6–32 life-years gained per additional early ART initiation to meet standard thresholds. Conclusions Adding birth testing improved early ART initiation but was unlikely to be cost-effective relative to standard thresholds given current prices, vertical transmission rates, and knowledge of long-term health benefits. Cost-effectiveness could be achieved at current costs if early ART translates to substantial long-term health benefits or if targeted to infants at high risk of vertical transmission.

22.
arXiv (quant-ph) 2026-06-19

Effective Faraday interaction between light and Helium-3 nuclear spins in a multi-pass cell

arXiv:2606.20328v1 Announce Type: new Abstract: Helium-3 nuclear spins form an exceptionally stable quantum system with extremely long coherence time, offering exciting opportunities for quantum technologies. In particular, nuclear spin-squeezed states promise enhanced precision for sensing tasks and tests of new physics. A central challenge for all these applications is the realization of a controllable light-nuclear spin interface. Here we experimentally demonstrate such an interface by exploiting metastability-exchange collisions in a low-pressure helium-3 gas cell at room temperature. A radio-frequency discharge produces a small population of metastable atoms that both enables efficient optical pumping and mediates an effective Faraday interaction between the collective nuclear spin and an optical probe. We quantitatively characterize the strength of this interaction as a function of the nuclear polarization, applied magnetic field, and probe-beam parameters. Moreover, we show that using a multi-pass cell enhances this interaction by effectively increasing the optical depth. Extrapolating to a tenfold increase of the probe power used in the present experiment, we project a measurement-induced squeezing rate of 0.52 s$^{-1}$. Our results provide a practical pathway for optical access to helium-3 nuclear spins and open prospects for generating long-lived, macroscopic nuclear spin-squeezed states for quantum metrology.

23.
arXiv (CS.AI) 2026-06-16

Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

arXiv:2604.06173v2 Announce Type: replace-cross Abstract: Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.

24.
bioRxiv (Bioinfo) 2026-06-20

Systematic Evaluation of Feature Representations for Cancer-Associated sORF Prediction in Non-coding RNA

Short open reading frames (sORFs) within non-coding RNAs (ncRNAs) have arisen as a hidden layer of gene regulation, encoding small peptides that represent a new class of cancer regulators with diagnostic and therapeutic potential. However, inferring associations between sORFs to specific cancer types remains challenging and requires computational approaches for accurate prediction. Recently, the CoraL framework introduced the first computational approach for predicting cancer-associated peptides, focusing primarily on model architecture while overlooking how feature extraction strategies influence predictive accuracy. We present a systematic evaluation of machine learning models and feature extraction approaches to predict cancer-associated sORFs across 15 cancer types. We benchmarked seven traditional machine learning algorithms combined with three feature extraction methods: k-mer frequency, Word2Vec embeddings, and genomic language model (gLM)-based embeddings. To our knowledge, this is the first study applying gLM-derived embeddings to the prediction of cancer-associated sORFs in ncRNA. Our results show that traditional machine learning models with appropriate feature extraction outperform the CoraL baseline across all cancer types, achieving up to 10% higher accuracy in some of the 15 evaluated datasets. Interestingly, k-mer features consistently outperformed gLM embeddings without fine-tuning, suggesting that local sequence composition may provide more discriminative information for this task and that pre-trained genomic representations may require task-specific adaptation to fully capture these patterns. Additionally, we observed that the way sequences are tokenized, such as the k-mer length, can affect performance: longer fragments (e.g., k=7) sometimes reduced accuracy for Random Forest but had a smaller effect on MLP. Our findings suggest that appropriate feature engineering can provide greater improvements than increasing model complexity.

25.
arXiv (CS.AI) 2026-06-19

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

arXiv:2606.20074v1 Announce Type: cross Abstract: Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assessing clinically whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding respectively, supporting FMs for scalable EEG monitoring in ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy with respect to frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretraining FM representations when adapted to burst detection with limited labeled data.