Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-12

Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

arXiv:2501.08425v3 Announce Type: replace Abstract: In this paper we analyze the behaviour of the stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via a minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even if Fokker-Planck equations have a long history and a extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds of the MET. Finally, we address the asymptotic convergence of SGD, for a non-convex cost function and a degenerate diffusion matrix, that do not allow to use the standard approaches, and require new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in the Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?

02.
arXiv (CS.LG) 2026-06-18

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

arXiv:2606.19212v1 Announce Type: cross Abstract: Recent empirical work shows that semantically equivalent paraphrases can fool financial sentiment classifiers: although a paraphrase remains close to the original under a strong reference embedding, it may shift the target model's representation enough to change the predicted class. Existing robustness theory either assumes a single-model threat model or focuses mainly on empirical attack algorithms. We develop a continuous local model of semantic paraphrase perturbations that captures this two-model structure. We show that the worst-case local displacement of the target representation, subject to a proxy-model budget, is governed by the largest generalised eigenvalue of a matrix pencil $(A,B)$ constructed from the Jacobians of the two embedding maps. The resulting attackability index $\lambda^*(x)$ is intrinsic to the local paraphrase geometry and the chosen embedders, yields a closed-form prediction-flip condition for affine readouts, and supports conservative population and finite-sample attackability certificates. For uniform control over classes of affine readouts, we derive a distribution-free VC bound for binary attackability indicators and a scale-sensitive margin bound based on an attackability-adjusted margin that subtracts a local geometric penalty from the standard classifier margin. We also connect the continuous theory to discrete paraphrase search, identify an asymmetry between successful and unsuccessful finite searches, and give a covering condition under which the discrete and continuous settings agree. Finally, we propose an empirical verification framework using soft-token relaxations and generated paraphrase sets to assess the local eigenvalue geometry, prediction-flip condition, and finite-search approximation on a deployed financial-text classifier.

03.
arXiv (CS.LG) 2026-06-18

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule may signal life on one planet and abiotic chemistry on another. Global climate models (GCMs) provide this understanding, but individual runs can require up to millions of core-hours and substantial domain expert time. Machine-learning emulators could remove this bottleneck, but progress has been limited by the absence of a curated, multi-model exoclimate dataset. We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves. We evaluate seven baselines spanning simple methods, deep learning, and Gaussian processes. GP-based methods perform best, suggesting that ThousandWorlds exposes a regime where off-the-shelf deep learning does not yet succeed. Data: https://doi.org/10.57967/hf/8695. Code: https://github.com/edstevenson/ThousandWorlds.

04.
arXiv (CS.CV) 2026-06-16

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

05.
bioRxiv (Bioinfo) 2026-06-15

Maternal BMI and Placental Transcriptomic Changes: A Meta-Analysis of Gene Expression at the Maternal-Fetal Interface

Objective: Maternal body mass index (BMI) is often used as a measure of metabolic status and increased or decreased maternal BMI is associated with a heightened risk of cardiometabolic diseases across generations. The placenta mediates these maternal metabolic cues; however, its genome wide transcriptional adaptations in response to maternal BMI remain incompletely defined. Methods: To delineate placental genes, pathways, and interaction clusters whose transcript abundance varies with maternal prepregnancy BMI through a genome wide meta analysis of human placental RNA sequencing datasets. Placental RNA seq reads from four publicly available cohorts (n=146) were mapped to the GRCh38 reference genome and differentially expressed genes were identified. An independent microarray cohort (n=19) was reanalysed separately to facilitate cross platform comparison. Functional enrichment employed GO, KEGG, and STRING protein interaction resources. Results: Meta-analysis of 146 RNA seq samples identified eight genes with genome-wide significance in placentae from underweight pregnancies including inflammatory signaling gene MAP4K1 and metabolic enzyme PSPH, while overweight and obese categories revealed nominally significant differential expression. KEGG analysis demonstrated significant downregulation of oxidative phosphorylation with increasing maternal BMI, and protein-protein interaction networks revealed inflammatory mediators as central nodes in overweight and obese groups. Independent microarray validation corroborated key findings, including consistent downregulation of oxidative phosphorylation in obesity. Conclusion: Maternal BMI is associated with placental transcriptomic signatures involving inflammatory, metabolic, and hormonal pathways, with consistent downregulation of oxidative phosphorylation across platforms. This genome-wide meta-analysis provides a reproducible catalogue of BMI-responsive placental transcripts that may contribute to developmental programming of offspring health.

06.
arXiv (CS.CL) 2026-06-11

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

07.
arXiv (CS.AI) 2026-06-12

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

arXiv:2606.12817v1 Announce Type: new Abstract: Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

08.
arXiv (CS.AI) 2026-06-11

DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation

arXiv:2606.12245v1 Announce Type: cross Abstract: Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex ``behavioral manifold" shaped by rich interaction signals, whereas cold item embeddings are constrained to a ``semantic manifold" derived solely from auxiliary content. Existing methods often force a rigid mapping between these inconsistent spaces, causing the model to sacrifice the precision of warm representations to accommodate cold ones. To address this, we propose DiffCold, a diffusion-based generative model that unifies warm and cold representations. Unlike GANs or VAEs, DiffCold leverages conditional diffusion to reconstruct warm item embeddings from content, preserving the underlying manifold structure without degradation. We further tailor this paradigm with two specific designs: a Retrieval-enhanced Aggregator that initializes generation using semantically similar warm items to bypass inefficient noise, and a Simulation-based Representation Alignment module that enforces distribution consistency between generated and real embeddings via contrastive learning. Experiments on three benchmarks confirm that DiffCold resolves the seesaw dilemma, consistently outperforming state-of-the-art methods across all metrics.

09.
arXiv (quant-ph) 2026-06-16

Classical Explanations in (and of) General Probabilistic Theories

arXiv:2603.05627v2 Announce Type: replace Abstract: We introduce a notion of the ``explanation" of one (generalized) probabilistic model by another as particular kind of span in the category $\Prob$ of probabilistic models and morphisms. We show that explanations compose under a standard pullback construction (notwithstanding that $\Prob$ does not support arbitrary pullbacks). We then show that every locally-finite probabilistic model has a canonical, sharp classical explanation. The construction is functorial, so every locally-finite probabilistic theory has a canonical, sharp classical (though of course, usually non-local) representation.

10.
arXiv (quant-ph) 2026-06-19

Propagating Collective Spin-valley Modes in Twisted WSe2

arXiv:2507.18770v2 Announce Type: replace-cross Abstract: The emergence of neutral collective modes is a hallmark of correlated quantum phases but is often challenging to probe experimentally. In two-dimensional flatband systems, charge responses have been intensively investigated yet neutral excitations remain largely unexplored. In particular, intervalley coherent state (IVC) features a neutral Goldstone mode due to spontaneously broken valley U(1) symmetry. While IVC state has been proposed as a unifying theme across graphene and semiconductor based systems, its defining feature, the neutral Goldstone mode, remains elusive in experiment. Here we investigate space and time resolved transport of neutral modes in twisted WSe2 moire superlattices through a novel ultrafast imaging technique. We uncover two new propagating collective modes with very different velocities, which emerge near the van Hove singularity (VHS) in both intermediate (3.5 to 4 degree) and large (around 5 degree) angle twisted WSe2. The fast-propagating mode has a large speed of about 3 km/s and is consistent with a Goldstone mode for an IVC state, while the slow-moving mode is likely a gapped amplitude mode. They can be understood as the spin-valley analogues of collective modes of a superfluid, whose propagation is imaged for the first time in a condensed matter system. Our study demonstrates a powerful new approach for probing charge-neutral modes in quantum materials and offers key insights into the interplay between charge and spin-valley physics in moire superlattices.

11.
arXiv (CS.CL) 2026-06-18

Improve Large Language Model Systems with User Logs

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .

12.
arXiv (CS.AI) 2026-06-11

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

arXiv:2606.11780v1 Announce Type: cross Abstract: We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, independently of $N$. We theoretically prove that this corpus-independent bound is specific to infinite precision. With $B$ bits per coordinate, perfect top-$k$ retrieval requires $Bd = \Omega(k \ln N)$; thus, at any fixed precision, the dimension must grow at least logarithmically with $N$. Specializing to a $\ell_2$-normalized $B$-bit uniform scalar quantization model, we also identify a threshold on the precision $B^{*} = O(\ln \ln N)$ below which no dimension suffices, together with two further regimes that bound the feasible $(B, d)$ pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.

13.
arXiv (quant-ph) 2026-06-12

Scalar Quantum Fields: Theory Space and its Geometry

arXiv:2606.12580v1 Announce Type: cross Abstract: Scalar fields provide perhaps the simplest playground in which to develop our understanding of quantum field theory. In this lecture, we consider what it means to write down a scalar quantum field theory and how we can give geometrical interpretations to the space of such theories: the theory space.

14.
arXiv (quant-ph) 2026-06-12

Coupling-Grouped XY-QAOA for Joint Anomaly-Feature Selection

Authors:

arXiv:2606.13244v1 Announce Type: new Abstract: Selecting anomalous samples and explanatory features under fixed budgets defines a coupled constrained-optimization problem. Sequential feature-first selection ranks features before choosing samples, which can overlook features whose utility depends on which samples are selected, especially when scores are calibrated from reference data that may be limited, noisy, or drifting. We instead formulate the task as joint sample-feature selection under the same fixed counts. In the analyzed formal model, calibration-error sensitivity grows linearly with the number of samples for feature-first ordering but stays constant for joint selection. We introduce Coupling-Grouped XY-QAOA, a constraint-preserving grouped-angle variant for the resulting optimization problem. On matched sparse IBM Heron R3 benchmarks, a hardware-aware implementation reduces circuit depth by 45.9%-61.3% and two-qubit gates by 2.6%-5.2% relative to Qiskit optimization level 3 on the CZ-basis target. It enables, to our knowledge, the largest reported width-depth configurations for constraint-preserving bipartite-selection QAOA hardware executions with feasible-sector retention: 64 qubits at p=2 and 36 qubits at p=3. The 20-qubit p=5 runs retain 63% valid samples. Across 36-64 qubits, fixed-angle runs yield lower-energy feasible samples than matched random-feasible sampling. Warm starts reduce the gap to strict-feasible classical references by 57.5%-80.5%, and near-budget repair matches the sparse classical reference at 36 qubits. Benchmarks show gains in balanced fixed-budget regimes, and noiseless simulations show that problem-structured angle grouping improves over same-depth XY-QAOA and matched-parameter, type-preserving randomization controls. Overall, the results support calibrated joint selection and hardware-realizable constrained-mixer execution in the tested regimes.

15.
arXiv (CS.CV) 2026-06-11

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy–compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.

16.
arXiv (CS.CV) 2026-06-12

Person Identification from Contextual Motion

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by the surveillance and authentication applications. We introduce a novel, interactive, scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.

17.
arXiv (CS.AI) 2026-06-17

Patients With Personality: Realistic Patient Simulation through Controlled Diversity and Selective Disclosure

arXiv:2606.17441v1 Announce Type: cross Abstract: Simulating realistic patient interactions is a key requirement to testing clinical applications of LLMs at scale without time-consuming and expensive user studies. However, existing approaches often lack realism and controllability, often oversharing information unprompted, and failing to capture the wide variability of patient behavior. Here, we introduce PatientsWithPersonality (PWP), a patient simulation framework that generates realistic yet diverse virtual patient responses through explicit personality parametrization over a latent patient state. Grounded in HEXACO, a six-dimensional personality space used to quantify and parameterize human behavioral traits, our approach enables fine-grained control over conversational style, cooperativeness, and information disclosure within a unified framework. In a clinician evaluation, PWP is judged nearly as realistic as recorded human actors and clearly ahead of prior simulators, while being flagged as "too informative" far less often. Conditioning on HEXACO axes yields personas whose configured traits are recoverable by both clinicians and an autorater, span a substantially wider behavioral footprint than the closest baseline, and prevent oversharing. Altogether, our framework paves the way for more accurate and informative LLM benchmarking through our realistic and steerable patient simulator.

18.
medRxiv (Medicine) 2026-06-18

Predicting Motor Recovery After Stroke: Utility and Limits of Corticospinal Tract Biomarkers

Background: Corticospinal tract (CST) damage is a major cause of post-stroke motor deficits. However, it remains unclear which estimates of CST damage best predict motor recovery, especially regarding different aspects of motor control. While conventional CST-lesion metrics offer superior feasibility, data-driven machine learning (ML) approaches may better capture patients propensity for task-specific recovery with important implication for their use as future clinical biomarkers. Methods: Providing the first direct longitudinal comparison of these approaches based exclusively on CST-lesion patterns, we evaluated six conventional CST-lesion metrics and a voxel-wise ML approach using clinical MRI data from 127 acute ischemic stroke patients. Acute impairment and outcome (>3 months post-stroke) were assessed for basal and complex motor functions. Conventional CST-lesion metrics and ML were used to predict task-specific motor impairment and outcome. Results: All conventional CST-lesion metrics correlated significantly with both acute impairment and motor outcome across motor domains, with metrics weighted for CST narrowing and tract probability performing best. However, predictive performance for unseen patients was low. ML outperformed conventional markers in predicting acute impairment across motor domains and basal motor outcome, but failed to predict complex motor outcome. Topographically, predictive voxels clustered within and above the posterior limb of the internal capsule, with distinct CST subregions associated with basal versus complex motor impairment, consistent with a task-specific somatotopic organization. Conclusions: The predictive utility of CST biomarkers was task- and timepoint-dependent. While ML may improve predictive performance, complex motor outcome remained difficult to predict, likely reflecting greater reliance on distributed cortical reorganization beyond the CST. By revealing task-specific CST subregions, voxel-wise ML provides an anatomically informed foundation for future predictive models. Such future models should combine CST biomarkers with measures of broader motor network integrity to enable individualized prognosis tailored to specific motor domains and recovery stages.

19.
arXiv (quant-ph) 2026-06-11

Super-Heisenberg Non-Equilibrium Quantum Sensing with Waveguide-Coupled Emitters

arXiv:2606.11975v1 Announce Type: new Abstract: We explore an array of quantum emitters as non-equilibrium probes, coupled to a one-dimensional photonic waveguide, aiming to estimate its properties such as wave number which encodes the waveguide frequency and dispersive characteristics. By considering transient dynamics following initial excitation, we show that the quantum Fisher information (QFI) can be significantly enhanced through careful emitter positioning. For two-emitter probes, optimal spacing stabilizes populations and coherences in the single-excitation subspace, suppressing super radiant decay and extending both the magnitude and longevity of QFI. Randomized emitter configurations also reveal that vanishing waveguide-mediated cross decay maximizes both achievable sensitivity and the temporal duration over which information about the parameter remains accessible. Extending to multipartite probes, we demonstrate that the maximum QFI and its temporal integral scale with system size, exceeding the Heisenberg limit for all positioning strategies. Our results highlight the potential of waveguide-coupled emitter arrays as versatile quantum sensors, where collective radiative dynamics can be harnessed to achieve tunable, long-lived, and enhanced precision.

20.
arXiv (CS.AI) 2026-06-18

Explaining Attention with Program Synthesis

arXiv:2606.19317v1 Announce Type: cross Abstract: A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.

21.
bioRxiv (Bioinfo) 2026-06-13

Testing the reliability of AI-generated protein structures

Authors:

Although AlphaFold2 and its competitors have demonstrated remarkable abilities to predict protein structure, more work is needed to explore the limitations of these methods. Here we investigated the reliability of AlphaFold2 and ColabFold by creating a set of realistic but false protein sequences, using ColabFold to predict their structure, and then asking how often the program produces a high-scoring structure for a sequence that does not represent a protein. We determined that AlphaFold2 has a very small but non-zero false positive rate, estimated here at approximately 1 in 435 if one uses a threshold pLDDT score of 70 to define positive predictions. We also discovered, serendipitously, that some high-scoring sequences in the human genome were not false positives, but instead were previously unknown and un-annotated pseudogenes. These latter findings indicate that some well-established human annotations of protein-coding genes may have incorrectly extended the 5-prime untranslated regions too far. They also suggest that the false positive rate of AlphaFold2 is low enough that almost any high-scoring structure, even in a noncoding region, is worthy of further investigation.

22.
arXiv (CS.CL) 2026-06-18

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7–10B) against every reply each metric paired with a random-derangement noise floor we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity–discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0

23.
arXiv (CS.LG) 2026-06-16

Data-Centric Benchmarking of Exploit Generation in LLMs: Understanding the Impact of Fine-Tuning

arXiv:2606.15123v1 Announce Type: cross Abstract: We study the task of CVE-conditioned exploit generation, where a model drafts proof-of-concept (PoC) exploits given software vulnerability context. We adopt a data-centric approach, constructing a high-quality dataset via multi-stage preprocessing and introducing a scalable evaluation framework with LLM-as-judge and fine-grained rubrics. Under this unified setup, we benchmark 17 large language models across 8 evaluation criteria, providing systematic insights into their zero-shot capabilities. We further show that a compact 8B open-weight model, when fine-tuned on curated data, achieves over 42.5% improvement in exploit quality and rivals some proprietary models when combined with simple test-time rejection strategies. Our results highlight the importance of data quality, structured supervision, and evaluation design for reliable exploit generation, suggesting that these factors can be as critical as model scale in adapting LLMs to cybersecurity tasks.

24.
arXiv (CS.AI) 2026-06-18

Analysing drivers and interdependencies in European electricity markets using XAI

arXiv:2606.19118v1 Announce Type: new Abstract: Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions and apply and extend SSHAP, an aggregation framework to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.

25.
arXiv (CS.AI) 2026-06-11

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

arXiv:2605.00545v2 Announce Type: replace-cross Abstract: Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.