Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (quant-ph) 2026-06-19

Quantum models with the Yang-Lee phase transition

arXiv:2606.19732v1 Announce Type: cross Abstract: In this article, we present four different $1+1$D quantum models that realize the Yang-Lee (YL) phase transition under a deformation that preserves $PT$ symmetry. These are the antiferromagnetic Ising spin chain in transverse and longitudinal magnetic fields, the massive Schwinger model, the Blume-Capel model, and the three-state quantum clock model. Using the state-operator correspondence, we identify the YL critical point, compute the scaling dimensions of the lowest operators in each model, and find perfect agreement with the exact results for the YL criticality in two dimensions. Using bosonization for the Schwinger model and the Polyakov-Hubbard transformation for the other models, we show that in all of these quantum models the YL critical point is described, as expected, by a massless bosonic field with an $i \phi^3$ interaction. In the quantum clock model, this critical field interacts with a massive bosonic field, and we identify the massless and massive states in the Hamiltonian spectrum. In addition, we numerically compute the two-point function of $\phi$ at the Yang-Lee critical point and show that it grows with distance, in agreement with theoretical expectations.

02.
arXiv (quant-ph) 2026-06-15

Experimental violation of a Bell-like inequality for causal order

arXiv:2506.20516v2 Announce Type: replace Abstract: Quantum mechanics is compatible with scenarios where physical processes happen in an indefinite order. In theory, this feature could be detected through violations of inequalities on the observed correlations, analogous to Bell inequalities. However, experimental demonstrations of such violations have been missing until recently due to the complexity of the required setup. Here we report an experimental violation of a Bell-like inequality involving the correlations of four parties, one of which is spacelike separated from the others. Our demonstration employs 3 km fiber spools to simulate spacelike separation, and achieves high-speed operations in photonic time-bin encoding, nanosecond synchronization, and accurate temperature stabilization. These experimental advances enable a violation by 5.7 standard deviations and open a path towards a certification of indefinite order in conditions that guarantee spacelike separation with existing state-of-the-art devices. However, the certification is not device-independent, as it relies on knowledge about the setup to exclude bidirectional signaling–a loophole inherent to implementations in classical acyclic spacetimes, which may be resolved in future quantum-spacetime tests.

03.
arXiv (CS.CV) 2026-06-16

HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

04.
arXiv (CS.CL) 2026-06-16

RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.

05.
arXiv (CS.CL) 2026-06-16

SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG

Fixed-length chunking in Retrieval-Augmented Generation (RAG) often leads to boundary fragmentation, where critical evidence is split across segments, degrading retrieval recall. While static windowing and parent retrieval improve recall, they introduce significant token overhead. We propose SCAR (Semantic Continuity-Aware Retrieval), an adaptive retrieval policy that selectively expands neighboring chunks by weighing query-neighbor relevance against a structural continuity penalty. SCAR uses a relative expansion threshold tied to each retrieved chunk's own query-relevance, yielding an approximately scale-invariant decision rule that transfers across embedding models without recalibration. Across four diverse corpora (RFC, GDPR, a 10-K report, and a Merger agreement; N=320 queries; 160 boundary-fragmented), SCAR achieves 92.8% recall on boundary-fragmented queries with only 7.84 chunks, a 22.9% reduction compared to static windowing (10.16 chunks). Paired bootstrap tests (B=10,000) confirm the chunk reduction is highly significant (p

06.
medRxiv (Medicine) 2026-06-11

Computer Vision for Real-Time Anatomical Navigation in Neurosurgery: First-in-Human Clinical Evaluation and Iterative Development (IDEAL Stage 1)

Introduction: Precise anatomical navigation is fundamental to safe endoscopic pituitary surgery, a high-stakes procedure characterised by a challenging learning curve. While traditional navigation systems often rely on workflow-disrupting probes or static preoperative imaging, advancements in computer vision AI (CVAI) now enable dynamic, real-time anatomical segmentation directly from live surgical video1-3. Our group has previously conducted a series of preclinical human-computer interaction studies to refine the system's design, alongside digital and high-fidelity physical simulations demonstrating the benefit of AI assistance in improving overall performance, training, and safety4-8. Building on this foundation, the current study represents a first-in-human application of real-time CVAI assistance in the neurosurgical operating room, serving to assess feasibility and safety, and to iteratively improve the system. Method: Guided by DECIDE-AI and IDEAL frameworks, this single-centre evaluation comprises an initial proof-of-concept phase (n=6) for endoscopic transsphenoidal pituitary surgeries. The AI model utilised a DINOv3-derived vision transformer architecture, deployed via a high-performance edge computing unit to achieve low-latency, real-time inference without reliance on cloud infrastructure2. Given the high-risk nature of the procedure and the early stage of clinical AI integration, the system was initially deployed as an educational adjunct on a secondary monitor, ensuring the primary surgical feed remains uncompromised. Functionality and safety were assessed via structured questionnaire, prospective observation, and blinded retrospective review of the recordings of the endoscopic surgical video feed and wider operating room environment. Continuous multi-stakeholder feedback through validated human factors surveys drove iterative technical refinements between cases. Results: Six patients with pituitary adenomas were enrolled. The CVAI system was successfully deployed in four cases, demonstrating acceptable real-time sella segmentation accuracy. Deployment failed pre-operatively in two cases owing to a single recurring system reboot bug. Iterative refinement between cases were driven by our experience and surgical team feedback. This resulted in the integration of additional anatomical structure segmentations (e.g., carotid arteries), enhanced model accuracy via training dataset expansion, and hardware firmware upgrades. Multi-stakeholder surveys demonstrated satisfactory system feasibility, usability, and acceptability among the surgical team. Both prospective observation and retrospective video review confirmed the absence of adverse events, including no significant distraction to the primary surgeon, and there were no AI-related clinical complications. Conclusion: This first-in-human early clinical evaluation demonstrates the feasibility, safety and iterative development of real-time, CVAI-based anatomical navigation during high-stakes neurosurgery. Future work will include a larger single-centre case series (IDEAL Stage 2a) with more surgical teams to further iterate the system and explore its impact on training and workflow. As the underpinning technology improves, deployment will transition to direct intra-operative decision support and integration with other intra-operative navigational technologies.

07.
arXiv (quant-ph) 2026-06-16

Real-space spectral functions of three-dimensional billion-size topological non-Hermitian matter with tensor networks

arXiv:2606.16424v1 Announce Type: cross Abstract: Non-Hermitian systems host a wide range of unconventional topological phenomena while large-scale simulations in finite three dimensional systems remain challenging because of the rapidly growing number of sites. In particular, higher-order topological corner modes are often studied only in small lattices, where strong finite-size effects can mask their intrinsic behavior. Here, we develop a tensor-network framework that combines quantics tensor cross interpolation with the kernel polynomial method, enabling compact representations of large non-Hermitian tight-binding Hamiltonians and direct calculations of real-space spectral functions for systems exceeding one billion lattice sites. Using this approach, we investigate three-dimensional non-Hermitian higher-order topological insulators with with structured real-space geometries. The unprecedented system size enables direct access to the macroscopic regime and allows corner-mode spectral responses to be resolved in genuinely three-dimensional systems.By tuning the loss strength, we identify distinct in-gap corner modes across weak- and strong-loss regimes.Our results establish tensor-network algorithms as a powerful strategy to perform real-space spectral calculations in exceptionally large non-Hermitian systems.

08.
arXiv (CS.CL) 2026-06-17

Non-negative Elastic Net Decoding for Information Retrieval

Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the entire corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To this end, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure which optimizes the embeddings for NNN decoding, producing significant performance gains surpassing in all metrics and benchmarks compared to dense retrieval. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.

09.
arXiv (CS.AI) 2026-06-12

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

arXiv:2606.12921v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.

10.
arXiv (quant-ph) 2026-06-16

Diagonal-Budgeted Trotterization for Efficient Quantum Hamiltonian Simulation

arXiv:2606.16959v1 Announce Type: new Abstract: Efficient classical simulation of quantum Hamiltonian dynamics is often bottlenecked by exponential state growth and the overhead of generic sparse linear algebra. We introduce diagonal-budgeted Trotterization, a structure-aware strategy that decomposes Hamiltonians into factors preserving diagonal sparsity while tightly controlling fidelity loss. Our implementation, HamSim, utilizes a compact diagonal-sparse data layout and specialized C++/CUDA kernels to bypass the overheads of generic formats like CSR. By leveraging SIMD vectorization, multithreading, and GPU acceleration, HamSim achieves high performance across heterogeneous architectures. Benchmarks on the HamLib suite show that HamSim significantly outperforms Qiskit-Aer. On CPUs, HamSim attains speedups of $182$–$1,269\times$ on optimization instances (TSP, MaxCut) and $4.8$–$841\times$ on physical models (TFIM, Heisenberg). On GPUs, it achieves up to $178\times$ speedup for $12$–$16$ qubit problems. Unlike traditional Trotterization, HamSim maintains near-perfect fidelity without requiring exponential steps. This demonstrates that diagonal-aware numerical kernels provide a scalable foundation for high-fidelity classical Hamiltonian simulation.

11.
arXiv (CS.AI) 2026-06-11

Runtime Enforcement of Hybrid System Properties

arXiv:2606.12022v1 Announce Type: cross Abstract: Runtime enforcement has emerged as a promising approach for ensuring the safety of autonomous and cyber-physical systems operating in uncertain and dynamic environments. Unlike traditional runtime verification, runtime enforcement actively intervenes during execution to prevent property violations by modifying unsafe system behaviors. Existing enforcement frameworks primarily focus on untimed or discrete-time specifications and are often limited to delaying or suppressing events, making them inadequate for reactive systems exhibiting complex continuous dynamics. In this paper, we propose a runtime enforcement framework where safety requirements are modeled using Hybrid Automata (HA). The framework combines discrete-event editing with continuous-time monitoring to support enforcement actions such as suppression, delay, and insertion of events at arbitrary time instants. Upon observing environmental inputs, the automaton is initialized, and runtime reachability analysis is used to synthesize safe corrective actions. We formally define the enforcement problem for safety hybrid automata, establish enforceability conditions, and present an online enforcement algorithm for reactive systems. A detailed case study on an Adaptive Cruise Control (ACC) system demonstrates the effectiveness of the proposed approach in maintaining safety properties under unsafe controller behaviors. Experimental results show that the framework introduces minimal computational overhead while ensuring continuous compliance with safety requirements in real time.

12.
arXiv (CS.LG) 2026-06-16

Schattor: Schatten-family methods for deep learning optimization

arXiv:2606.15702v1 Announce Type: cross Abstract: Modern deep learning optimization features heterogeneous parameter structures, noisy gradients, and highly nonconvex landscapes, posing significant challenges for both algorithm design and theoretical analysis. Motivated by the limitations of SGD and the success of adaptive optimizers, we propose {\it Schattor}, a family of adaptive first-order methods based on Schatten norms. Schattor unifies SGD and the recently proposed matrix-variate adaptive optimizer Muon within a single Schatten-norm-based framework. We establish dimension-free stationarity guarantees for methods in the Schattor family for stochastic matrix optimization problems via a novel matrix martingale moment bound. We also develop multi-block extensions that adaptively balance block-wise optimization progress and prove dimension-free stationarity guarantees in this more general setting.

13.
medRxiv (Medicine) 2026-06-16

Usability testing with a prototype user interface of an Artificial Intelligence driven air-Safety Tool (AISaT)

Involving end-users in the development of an AI tool is an important facilitator to its implementation. Usability testing was therefore conducted with a prototype user interface of an Artificial Intelligence driven air-Safety Tool (AISaT) to capture the perspectives and user experiences of AISaT from 10 staff members across two hospitals working within estates, infection prevention and control, and clinical areas, to inform the development of next iterations of AISaT. The perspectives shared could be grouped under improvements to the understand-ability; content; navigation; visibility; usability; workflow; ownership; and frequency of use of the tool. There were key areas that can and will be easily improved within AISaT, however there were areas that required a deeper level of critical reflection, such as incorporating data on more existing variables in a room (i.e., existing ventilation) and whether all patients should be assumed as infectious and breathing heavily. The research team must consider if the target audience of end users and recommended frequency of AISaT use will be pre-defined by the tool developers, or whether this level of detail should be left to each individual hospital to decide.

14.
arXiv (CS.CV) 2026-06-18

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

15.
arXiv (CS.LG) 2026-06-17

When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

arXiv:2606.17508v1 Announce Type: new Abstract: Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic. A model trained against a single label is learning to guess one outcome of a random process. We turn this around and use the nondeterminism as a training signal. We run each program many times, aggregate the observed next events into an empirical distribution, and fine-tune a 7B model to match that distribution with a KL objective. On 798 held-out predictions drawn from real production Go bugs (CockroachDB, Kubernetes, gRPC, etcd), fine-tuning on fewer than a thousand traces reaches 36.2% accuracy, ahead of Gemini 3.5 Flash used zero-shot (34.8%) and the same model without fine-tuning (28.6%). Distribution training matches cross-entropy on accuracy (35.8% vs. 36.2%) while reducing Expected Calibration Error from 0.205 to 0.169. We also derive a formal goroutine-leak signature for a class of select-blocked goroutines where P(GoUnblock)=0 holds by scheduler semantics, not by learning. We release the dataset, trained adapters, and all tooling.

16.
arXiv (CS.AI) 2026-06-16

Discrete optimal transport is a strong audio adversarial attack

arXiv:2509.14959v3 Announce Type: replace-cross Abstract: In this paper, we investigate discrete optimal transport (DOT) as a black-box attack against modern automatic speaker verification (ASV) and anti-spoofing countermeasure (CM) systems. Our attack operates as a post-processing distribution-alignment step. Frame-level WavLM embeddings of generated speech (or another person speech) are aligned to an unpaired bona fide speech pool using entropic optimal transport and a top-k barycentric projection, followed by neural vocoding. Unlike gradient-based attacks, the proposed method requires no access to model parameters, gradients, or training data. Experiments on ASVspoof2019 and ASVspoof5 demonstrate that DOT attack substantially increases CM EER and substantially degrades ASV performance across multiple spoofing attacks. The attack transfers across datasets and remains effective after CM fine-tuning. Analysis using speaker similarity, Fréchet Audio Distance, and visualization of embedding distributions suggests that DOT succeeds by shifting source speech toward bona fide regions of the representation space rather than by maximizing speaker similarity. These results indicate that optimal-transport-based distribution alignment represents a previously underexplored attack vector for contemporary ASV and anti-spoofing systems.

17.
arXiv (CS.AI) 2026-06-18

Towards Understanding What State Space Models Learn About Code

arXiv:2602.06774v2 Announce Type: replace Abstract: State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn along with the direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forgets certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code model by upto +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.

18.
arXiv (CS.LG) 2026-06-15

FreshRetailNet-LT: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail

arXiv:2505.16319v4 Announce Type: replace Abstract: Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K) and code (https://github.com/Dingdong-Inc/frn-50k-baseline}) are openly released.

19.
arXiv (CS.AI) 2026-06-15

LEPO: Latent Reasoning Policy Optimization for Large Language Models

arXiv:2604.17892v4 Announce Type: replace-cross Abstract: Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose \underline{L}atent R\underline{e}asoning \underline{P}olicy \underline{O}ptimization~(LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.

20.
arXiv (CS.LG) 2026-06-15

The Program Is Still There: A Conservation Law for Program Discovery

arXiv:2606.13799v1 Announce Type: cross Abstract: Finding the shortest program that generates a sequence is uncomputable, and for six decades that fact has been mistaken for a wall around finding any generating program. It is not a wall but a price, and this paper measures it. For every algorithm that learns about a candidate program only through its score, a class spanning Levin search, evolutionary methods, simulated annealing, and the cross-entropy method, we define the coupling width of a search problem and prove an unconditional worst-case lower bound, exponential in that width with base one less than the domain size. From it follows a conservation law: structural knowledge injected into a search trades one for one against the search it removes, and their sum can never fall below the length of the program sought. Levin's 1973 upper bound and the lower bound proved here are the two ends of one conserved quantity, closing on each other as the instruction set grows. The only escape is to read a candidate's structure rather than its score, and its price, which we prove for generic targets, is incompleteness. A deterministic engine built on this theory recovers a generating program, certified by compressing its data and predicting an unseen continuation, for 2,383 of 3,914 sequences across four independent populations, including 244 of the 256 elementary cellular automata, with measured discovery cost rising along program length more than an order of magnitude inside the score-oracle worst case.

21.
arXiv (CS.CL) 2026-06-16

Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?

Safe deployment of clinical vision-language models (VLMs) requires reliable uncertainty estimation (UE): a signal indicating when predictions should be trusted or escalated to a clinician. We test whether current UE methods actually deliver this signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering (VQA), we find that UE quality is not an intrinsic property of the UE method: it tracks model accuracy, degrading precisely where the model performance is weakest, and therefore where reliability is most needed. When we stress-test models by hiding the correct option among the multiple-choice answers (NOTA perturbations), accuracy collapses while uncertainty barely changes, leaving models systematically miscalibrated. Yet, we find that uncertainty on the unperturbed input reliably anticipates which predictions will collapse under NOTA, indicating that UE in current VLMs carries diagnostic information about model fragility. Our results position UE as a diagnostic tool for identifying fragile predictions and motivate perturbation-based evaluation as a path toward safe clinical deployment.

22.
arXiv (CS.AI) 2026-06-17

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

arXiv:2606.17642v1 Announce Type: new Abstract: Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer verification.Across four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model over finance-specialized models and approaches leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at https://anonymous.4open.science/r/FinAcumen

23.
arXiv (CS.LG) 2026-06-18

Shrinkage priors for Bayesian Substitute Confounders

arXiv:2606.18535v1 Announce Type: cross Abstract: Multi-cause observational studies contain information about unmeasured confounding through the dependence structure among causes. However, literal imputation of the unobserved confounder is often more complex than learning a lower-dimensional substitute score that preserves the shared assignment variation needed for stable causal adjustment. The deconfounder (Wang and Blei, 2019) and related substitute confounder methods exploit this idea, but flexible assignment models can fit the joint distribution of the causes while producing scores that over-encode the treatment vector, collapse overlap, or capture single-cause variation. We develop a Bayesian factor assignment framework for learning sparse substitute confounders that retain coarse multi-cause dependence with shrinkage priors. The theory is stated at the level of posterior concentration, factor score contraction, and overlap-preserving assignment geometry and therefore does not rely on a particular shrinkage prior. Under these conditions, the proposed regression-adjusted estimators are consistent for mean potential outcomes when the corresponding latent variable identification assumptions hold. Shrinkage priors provide a natural tool for latent structural learning: they favour low-dimensional factors supported by multiple causes, discourage effectively single-cause factors, and induce an ordering of the latent factors through progressive shrinkage. Synthetic experiments illustrate the roles of signal strength, outcome validity, and geometry-aware regularization. In an Alzheimer's Disease Neuroimaging Initiative (ADNI) baseline analysis, sparse substitute scores recover much of the adjustment obtained by directly conditioning on invasive cerebrospinal-fluid biomarkers, while collapse diagnostics identify when fitted factors reduce to individual observed measurements.

24.
medRxiv (Medicine) 2026-06-10

Transcriptomic Architecture of Type 2 Diabetes in Human Pancreatic Islets:An Integrative Meta-Analysis and Machine Learning Framework for Biomarker Discovery

作者:

Background. Type 2 diabetes mellitus (T2D) is defined by progressive pancreatic {beta}-cell dysfunction whose molecular underpinnings remain incompletely understood. Single-cohort transcriptomic analyses of donor islets have yielded heterogeneous gene lists of limited cross-study reproducibility, constraining both mechanistic interpretation and biomarker development. Methods. We combined two complementary analytical strategies applied to four public human islet transcriptomic cohorts (GSE25724, GSE20966, GSE38642, and GSE164416; n = 7-57 donors per contrast). For the integrative arm, three microarray datasets and one bulk RNA-seq dataset were processed independently and unified through gene-level random-effects meta-analysis, hallmark pathway scoring (GSVA/MSigDB), and iterative module refinement, yielding a two-axis disease framework. For the diagnostic arm, a consensus multi-method machine learning pipeline, combining LASSO penalized logistic regression, Support Vector Machine Recursive Feature Elimination (SVM-RFE), and Random Forest importance scoring, was applied to 184 differentially expressed genes from the RNA-seq cohort, with all normalization steps performed within leave-one-out cross-validation (LOOCV) folds to prevent data leakage. Machine learning classification of the RNA-seq cohort was additionally subjected to external transportability testing in the independent bulk human islet RNA-seq cohort GSE50244 using an overlap-restricted reduced score and a threshold fixed in the discovery cohort. Results. Meta-analysis across all four cohorts identified 337 high-confidence T2D-associated genes (96.1% directional concordance in beta-cell-enriched tissue). These were distilled into two refined 14-gene modules: ImmuneStress (MICB, HLA-DRA, HLA-DPA1, IL1R2, and others) and BetaCellIdentitySecretion (RASGRP1, PPP1R1A, SLC2A2, and others), whose composite IsletDysfunctionScore provided the most stable cross-platform separation of non-diabetic from T2D islets (Hedges' g = 1.80, p = 9.83 x $10^-17$, $text{I}^2$= 0%). Consistent with progressive disease, IsletDysfunctionScore increased monotonically from non-diabetic to impaired glucose tolerance to T2D. Separately, the machine learning pipeline derived a 10-gene diagnostic panel: GABRA2, SLC2A2, ARG2, DKK3, PRIMA1, TAFA4, HHATL, PARVG, RNU1-70P, and the novel lncRNA ENSG00000284653, that achieved perfect discrimination in LOOCV (AUC = 1.000, sensitivity = 1.000, specificity = 1.000, zero misclassifications across all 57 donors). A leakage-verification experiment confirmed that this performance reflected genuine biological signal: global quantile normalization prior to cross-validation collapsed AUC to 0.380. External testing showed that 8 of the 10 panel genes were measurable in GSE50244. The frozen 8-gene reduced score retained strong discrimination (external AUC = 0.907), with 6 of 8 genes preserving directional concordance, but the discovery-derived threshold did not transfer because the external score distribution was shifted upward and compressed, yielding complete sensitivity but zero specificity at the frozen cutoff Conclusions. Integrating pathway-level meta-analysis with machine learning classification, we present a coherent two-axis model: immune/stress activation and loss of beta-cell identity/secretory competence, together with a compact, biologically interpretable 10-gene diagnostic signature. Panel genes converge on GABA signaling, glucose transport, arginine metabolism, WNT pathway inhibition, and a novel lncRNA, providing both mechanistic hypotheses and high-priority targets for external validation. These findings offer a reproducible transcriptomic scaffold for future mechanistic, biomarker, and clinical translation studies of human islet dysfunction. They also support external transportability of the core biological signal, while indicating that absolute operating thresholds are cohort-dependent and would require recalibration before deployment in independent datasets.

25.
arXiv (CS.AI) 2026-06-19

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

arXiv:2606.19787v1 Announce Type: new Abstract: Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.