Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-15

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19–0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise–pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.

02.
arXiv (CS.AI) 2026-06-12

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

arXiv:2606.12821v1 Announce Type: new Abstract: Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.

03.
Nature (Science) 2026-06-09

Scientists have a bad case of AI FOMO, <i>Nature</i> poll reveals

作者:

Almost half of the scientists who responded said that they feel broadly negative towards artificial intelligence, but they think that some tools are better than others. Almost half of the scientists who responded said that they feel broadly negative towards artificial intelligence, but they think that some tools are better than others.

04.
arXiv (CS.LG) 2026-06-11

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

arXiv:2606.12280v1 Announce Type: new Abstract: Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion transformer (DiT), shipped as two separate-weight copies of a single-stream 34-layer backbone for classifier-free guidance and conditioned by a Qwen3-VL-8B encoder - for Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by $+1.9$ CLIP (95% CI $[+1.21,+2.64]$, excluding zero). A per-category OCR analysis, to our knowledge unreported for this model class, confirms text legibility is preserved, and an ablation isolates protection of the FFN down-projections as the dominant quality lever. Our GGUF Q4_K quantization beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8's weights match FP8's footprint rather than shrink it, so a speed gain on Ampere awaits a fused INT8 kernel.

05.
arXiv (CS.LG) 2026-06-18

UST-GNN: A Unified Spatial–Topological Graph Neural Network Framework for Urban Analytics–Demonstrated through a Case Study on Urban Health Prediction

arXiv:2504.04739v3 Announce Type: replace Abstract: Understanding how social, demographic, environmental, and spatial factors jointly shape urban outcomes is essential for sustainable urban development and evidence-based policy. Traditional statistical approaches often struggle to capture complex non-linear relationships, while many machine learning methods overlook the joint roles of spatial autocorrelation and network topology in urban systems. Recent advances in GeoAI have addressed these challenges only partially, often treating spatial effects, graph structure, evaluation, and interpretability separately. We present UST-GNN, a unified spatial–topological graph neural network framework that integrates neighbourhood connectivity, heterogeneous urban features, and positional/locational embeddings into a single representation. Using the MedSAT dataset, which contains over 150 environmental and socio-demographic variables and six prescription outcomes across 4,835 neighbourhoods in Greater London, UST-GNN outperforms strong statistical, geographically enhanced, and graph Machine Learning baselines, improving out-of-sample $R^2$ by 8.4–13.2\% under strict spatial cross-validation. We further introduce a lightweight principal-component module to interpret learned node embeddings geographically and relate them to policy-relevant covariates. The resulting analyses recover established patterns, offer new perspectives on debated associations, and reveal novel predictors warranting further causal investigation. Together, these findings demonstrate the value of graph-based spatial machine learning for urban health analytics, environmental inequality assessment, and evidence-based urban policy. Beyond predictive gains, UST-GNN provides a unified GeoAI analytical pipeline that can be embedded into urban digital twin workflows for scenario testing, monitoring, and data-informed decision-making for healthier, more sustainable cities.

06.
arXiv (CS.CL) 2026-06-16

A Mechanistic Understanding of Pronoun Fidelity in LLMs

Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behavioural approaches, which may not reflect a model's internal workings. Therefore, we provide a mechanistic, model-internal perspective on pronoun fidelity, testing whether three mechanisms – group entity binding (G), recency bias (R), and stereotypical bias (S) – are causally implemented across several SOTA language models. Using Boundless Distributed Alignment Search, we find all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but a combination of the three consistently accounts for 91-99.5%. An attention head analysis further reveals two competing copying routes; group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency uses a distributed token-level route that repeats surface forms. In sum, pronoun fidelity arises from competition between simultaneously active causal subspaces.

07.
arXiv (math.PR) 2026-06-12

Characterizing metric-space-valued processes: separating classes and weak invariance principles for measure-theoretic inference

arXiv:2606.13084v1 Announce Type: cross Abstract: This article investigates stochastic processes taking values in metric spaces that lack a topological vector space structure, a regime characterized by intricate interplay between topological, geometric, and temporal dependence structures. It is formally established that spaces admitting an isometric Hilbertian embedding constitute a strict subclass within the much broader class of metric spaces possessing the ball property. While traditional kernel methods are susceptible to geometric distortion when the underlying space cannot be isometrically embedded into a Hilbert space, we bypass such limitations by exploiting a fundamental structural property inherent to this broader class; namely, that Borel probability measures are uniquely determined by their values on balls. These separating classes provide the foundation for the subsequently introduced measure-theoretic inference methodology. We derive uniform convergence of a family of time-dependent random measures, alongside weak invariance principles for the corresponding nonstationary random fields. This framework explicitly exposes how dependence and geometric complexity influence sample path regularity. Furthermore, because the rapid decay of small-ball probabilities can prohibit the existence of limiting distributions for supremum-based discrepancy measures, we develop $L^p$-based alternatives. By directly leveraging the introduced convergence results, this approach circumvents the need for higher-order $U$-process formulations. Finally, for spaces that do admit an isometric Hilbertian embedding, and where $U$-processes naturally arise, we establish limit theory for both degenerate and nondegenerate multi-parameter $U$-processes, and demonstrate that local discrepancy tests maintain asymptotic stability under dynamic parameter regimes.

08.
arXiv (CS.CV) 2026-06-15

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://github.com/LpyNow/MMA-82.

09.
arXiv (CS.CV) 2026-06-11

Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions

Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries; KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite it, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.

10.
arXiv (quant-ph) 2026-06-16

Scheme for Transport-based Global Entanglement Distribution using Quantum Processors

arXiv:2606.15421v1 Announce Type: new Abstract: We propose a scheme for distributing entanglement over global distances in a heralded manner by using satellites to physically transport entangled processor nodes with rare-earth-ion qubits. A full analysis of channel losses, errors and background light is performed to determine the fidelity and number of entangled pairs that can be distributed between two ground stations. We show that the scheme works already with a single satellite and can distribute close to the theoretical maximum number of entangled pairs that can be generated in a satellite overpass. In addition, we argue that in theory transportation-based schemes outperform other satellite-based schemes and can be scaled up to a constellation without additional channel losses. Daytime operation seems feasible as long as the sky is clear, with an EPR pair fidelity ranging from 99.3% at shorter network lengths to 93.9% with global coverage and can be further improved by active error correction or entanglement purification.

11.
arXiv (CS.CV) 2026-06-16

Through-Foliage Surface-Temperature Reconstruction for Early Wildfire Detection

We present a method to reconstruct surface temperatures through forest vegetation by combining signal processing and machine learning, enabling fully automated aerial wildfire monitoring with drones for early fire detection. Synthetic aperture (SA) sensing reduces canopy occlusion but introduces thermal blur. To overcome this, we train a visual state space model to recover subtle thermal signals of partially occluded soil and fire hotspots from blurred data. To address limited real-world training data, we generate realistic surface temperature simulations using a latent diffusion model, temperature augmentation, and procedural thermal forest modeling. On simulated datasets, our method reduces RMSE by 2-2.5 versus conventional thermal and uncorrected SA imaging; in field experiments on hotspots, RMSE improved by 12.8-fold and 2.6-fold, respectively. Our approach also generalizes to other thermal signals, including human signatures, capturing morphology and extent – critical where simple thresholding fails – while conventional imaging struggles with partial occlusion.

12.
arXiv (math.PR) 2026-06-16

Sharp connectivity bounds for the vacant set of random interlacements

arXiv:2504.02777v2 Announce Type: replace Abstract: We consider percolation of the vacant set of random interlacements at intensity $u$ in dimensions three and higher, and derive lower bounds on the truncated two-point function for all values of $u>0$. These bounds are sharp up to principal exponential order for all $u$ in dimension three and all $u \neq u_\ast$ in higher dimensions, where $u_*$ refers to the critical parameter of the model, and they match the upper bounds derived in the article arXiv:2503.14497. In dimension three, our results further imply that the truncated two-point function grows at large distances $x$ at a rate that depends on $x$ only through its Euclidean norm, which offers a glimpse of the expected (Euclidean) invariance of the scaling limit at criticality. The rate function is atypical, it incurs a logarithmic correction and comes with an explicit pre-factor that converges to $0$ as the parameter $u$ approaches the critical point $u_*$ from either side. A particular challenge stems from the combined effects of lack of monotonicity due to the truncation in the super-critical phase, and the precise (rotationally invariant) controls we seek, that measure the effects of a certain "harmonic humpback" function. Among others, their derivation relies on rather fine estimates for hitting probabilities of the random walk in arbitrary direction $e$, which witness this invariance at the discrete level, and preclude straightforward applications of projection arguments.

13.
arXiv (CS.LG) 2026-06-19

Shifting-based Optimizable Linear Relaxations for General Activation Functions

arXiv:2606.20292v1 Announce Type: new Abstract: The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of activation functions. However, existing techniques depend on hand-crafted relaxations for each activation function. Extension to state-of-the-art activation functions therefore requires substantial manual effort. In contrast, our approach SLiR (Shifting-based Linear Relaxations) is broadly applicable, requiring only a Lipschitz constant or a set of critical points. SLiR parameterizes relaxations by their slope and computes the corresponding offset via a shifting procedure that ensures sound upper and lower bounds over the input domain, enabling efficient optimization while maintaining correctness. Our experiments show that SLiR produces tight relaxations across a wide range of practical activation functions and enables verification of up to 7.8x more properties compared to state-of-the-art methods.

14.
arXiv (CS.LG) 2026-06-19

Critical Percolation as a Synthetic Data Model for Interpretability

arXiv:2606.20347v1 Announce Type: new Abstract: Neural networks learn features that reflect the hierarchical, multi-scale structure of natural data. Synthetic datasets used to evaluate interpretability methods typically lack this structure, limiting their value as realistic toy models. To close this gap, we introduce a family of synthetic datasets consisting of hierarchical functions defined on critical mean-field percolation clusters embedded in a high-dimensional data space. The percolation data consists of sparse, low-dimensional fractal clusters with a power-law size distribution. Latent variables modeling a taxonomic hierarchy generate each data point's target value. The data model is analytically tractable with known critical exponents that fix its properties without requiring hyperparameter tuning. We leverage a mapping between percolation clusters, random trees, and additive coalescence to propose an almost linear-time algorithm to jointly sample a random tree and its hierarchical latent decomposition, enabling data generation at arbitrary scale. Using probing experiments, we find that the model's ground-truth latent variables can be linearly decoded from neural network activations. Together, sparsity, self-similarity, power-law statistics, and analytical tractability make critical percolation a principled testbed for interpretability research.

15.
arXiv (CS.CV) 2026-06-17

Vision-language models for chest radiography do not always need the image

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

16.
arXiv (quant-ph) 2026-06-12

Fibonacci Steady-States and Persistent Oscillations in an Ordered Multimode Dicke Model

arXiv:2606.13072v1 Announce Type: new Abstract: Ultracold atoms in multimode optical cavities provide a rich testbed for many-body phenomena enabled by light-mediated interactions. Recent experiments include realizations of spin glasses and associative memories, as described by multimode Dicke models with disordered couplings. However, the properties of multimode Dicke models with ordered coupling geometries remain largely unexplored. In this work, we investigate the stable steady-states of the multimode Dicke model with an ordered nearest-neighbor coupling geometry, where $n_c$ atomic clusters are coupled via $n_c-1$ cavity modes. We show that the number of mean-field stable steady-states in the superradiant phase exhibits Fibonacci scaling with the number of atomic clusters, and that a subset of these steady-states exhibit persistent oscillations. Using both the truncated Wigner approximation and the numerically-exact hierarchy of pure states, we further demonstrate that these features of the stable steady-state solutions persist for finite cluster sizes. Ordered multimode Dicke models, such as the nearest-neighbor coupling geometry considered here, are accessible with current experimental technologies and point toward a broader class of strongly interacting dissipative systems with similarly rich behavior.

17.
medRxiv (Medicine) 2026-06-17

Real-World Effectiveness and Safety of Avacopan in ANCA-Associated Vasculitis: A Systematic Literature Review and Meta-analysis

Background: The efficacy and safety of avacopan in ANCA-associated vasculitis (AAV) has been established in randomized trials of of avacopan as a glucocorticoid (GC) sparing therapy. However, real world evidence (RWE) has an important role in confirming effectiveness and evaluating safety in more generalizable settings. This study aimed to synthesize RWE on the effectiveness and safety of avacopan in adults with AAV. Methods: A systematic literature review and meta analysis of non interventional real world studies was conducted in accordance with Preferred Reporting Items for Systematic Reviews and Meta Analyses (PRISMA) guidelines. Eligible studies included adults with AAV treated with avacopan in routine clinical practice. Pooled estimates of effectiveness and safety outcomes were calculated using random effects meta-analyses. Primary outcomes included remission at 6 and 12 months and sustained remission at 12 months. Secondary outcomes included relapse, GC use and dosing, hepatotoxicity, infections, and treatment discontinuation. Exploratory outcomes included changes in estimated glomerular filtration rate (eGFR) and dialysis related endpoints. Results: A total of 71 studies were included and contributed to quantitative analyses. Pooled remission for patients on avacopan was 87% (95% CI: 75%-94%) at 6 months and 93% (95% CI: 86%-97%) at 12 months, and sustained remission was 86% (95% CI: 74%-93%) at 12 months. Relapse at 12 months was low (7%; 95% CI: 4%-11%). GC use was 36% at both 6 and 12 months. Improvements in eGFR were observed at 6 months (18 mL/min/1.73 m2) and 12 months (18 mL/min/1.73 m2), and dialysis liberation was 66% in a limited subset. Among avacopan patients, 11% experienced any hepatotoxicity, including 7% with serious (defined as directly reported or requiring hospitalization) hepatotoxicity, while 7% experienced serious (defined as directly reported or requiring hospitalization) infection. Conclusions: In real world clinical practice, avacopan is associated with high remission rates, low relapse rates, and a consistent GC sparing effect, with effectiveness comparable to standard of care regimens. Findings support its clinical use with appropriate safety monitoring; however, the observed heterogeneity in hepatotoxicity and the limited comparative effectiveness evidence highlight areas requiring further investigation.

18.
arXiv (CS.CV) 2026-06-15

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.

19.
arXiv (CS.AI) 2026-06-19

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

arXiv:2606.20074v1 Announce Type: cross Abstract: Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assessing clinically whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding respectively, supporting FMs for scalable EEG monitoring in ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy with respect to frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretraining FM representations when adapted to burst detection with limited labeled data.

20.
arXiv (CS.LG) 2026-06-16

Towards Data-Efficient Cross-Device Generalization of Grad-Shafranov Equilibria via Transfer Learning Neural Operator

arXiv:2606.15512v1 Announce Type: new Abstract: Real-time reconstruction of magnetohydrodynamic equilibria is essential for plasma shaping, stability assessment and feedback control in magnetic confinement fusion. However, Grad-Shafranov equilibrium calculations remain largely device-specific and iterative, limiting their use in latency-constrained control settings. Existing neural approaches can accelerate individual equilibrium predictions, but they do not generally provide reusable models across changing plasma boundaries or tokamak geometries. Here we show that equilibrium reconstruction can be recast as a cross-device operator learning problem. We develop a domain-specific neural operator framework that maps geometry and profile parameters directly to the poloidal flux field, replacing repeated solve-on-demand computation with amortized operator inference. Using the analytically tractable Solov'ev family as a controlled Grad-Shafranov testbed, we generate equilibria across eight geometrically distinct tokamak-like configurations and benchmark five neural operator architectures under four transfer-learning strategies. Single-geometry pretraining gives poor transfer to unseen devices, whereas multi-geometry pretraining enables data-efficient adaptation. The Wavelet Neural Operator gives the strongest cross-geometry performance, reaching mean relative L2 errors below 4% with 100 labelled target equilibria and below 2% with full fine-tuning. The predicted magnetic fields satisfy the divergence-free constraint to numerical precision, and four architectures achieve millisecond or sub-millisecond inference. These results identify neural operator pretraining as a route towards reusable, real-time equilibrium inference across fusion device configurations.

21.
arXiv (CS.LG) 2026-06-11

Adjoint Method versus Physics-Informed Neural Networks in PDE-Constrained Inverse Problems

arXiv:2606.12337v1 Announce Type: cross Abstract: Inverse problems governed by partial differential equations (PDEs) are central to computational mechanics and are commonly solved by adjoint-based optimization, while physics-informed neural networks (PINNs) have emerged as a flexible alternative. Their relative performance remains difficult to assess because the two approaches are often compared under different formulations, parameterizations, optimizers, and regularization choices. We present a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems. From a common abstract formulation, we instantiate both methods on identical domains, governing equations, observation models, and regularization terms, while matching the optimizer, unknown parameterization, and arithmetic precision wherever applicable. The benchmarks include unsteady Burgers, noisy Darcy permeability inversion, three-dimensional Allen–Cahn reaction identification, and unsteady Navier–Stokes viscosity identification. The results show that the representation of the unknown largely determines the preferred method: grid-based fields favor the discrete adjoint, whereas neural representations are native to PINNs and relevant for closure and constitutive modeling. For time-dependent problems, adjoint inversion can be dominated by trajectory storage and differentiation, while PINNs provide satisfactory reconstructions at lower cost. A PINN-warm-started adjoint strategy then recovers adjoint-level accuracy at substantially reduced cost.

22.
arXiv (CS.CV) 2026-06-15

Boundary-Centric Clip-Budgeted Active Learning for Temporal Action Segmentation

Temporal action segmentation (TAS) in untrimmed videos requires dense temporal supervision. However, most of the annotation cost is spent identifying action transitions where segmentation errors concentrate and small temporal shifts can disproportionately degrade segment-level metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these error-prone boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score. The boundary score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics to reveal the underlying importance of each frame. Importantly, our annotation protocol requests labels only at the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets. Gains are largest on datasets where performance is highly sensitive to boundary placement, as measured by edit and overlap-based F1 metrics.

23.
arXiv (CS.LG) 2026-06-11

Trajectory Geometry of Transformer Representations Across Layers

arXiv:2606.09287v2 Announce Type: replace Abstract: Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41–0.58, p

24.
arXiv (quant-ph) 2026-06-19

Space-time duality approach to (inhomogeneous) integrable quenches

arXiv:2606.20445v1 Announce Type: cross Abstract: Characterising the universal aspects of non-equilibrium quantum many-body dynamics is one of the key goals of this century's physics research. Progress, however, is hindered by the lack of general theoretical frameworks for studying interacting quantum matter far from equilibrium. A recent breakthrough has been the realization that several key non-equilibrium quantities, such as the rate of growth of entanglement or the fluctuations of conserved charges within finite subsystems, can be related to equilibrium properties through a space-time duality that effectively exchanges the roles of space and time. This observation effectively enables the study of non-equilibrium phenomena using tools and concepts borrowed from equilibrium statistical mechanics and thermodynamics. A first proof of principle of this framework, dubbed space-time duality approach (SDA), was provided by interacting integrable systems, where thermodynamic properties can often be characterized exactly, while dynamical quantities typically remain beyond analytical reach. Subsequent developments, however, revealed that the SDA suffered from an intrinsic ambiguity, restricting its applicability to homogeneous quenches and to charge fluctuations arising from symmetric initial states. Here we resolve this ambiguity from first principles and derive closed-form predictions for entanglement growth and charge fluctuations after general quantum quenches. We benchmark our results against the exact analytical solution of the Rule 54 quantum cellular automaton and extensive TEBD simulations of the XXZ chain. Moreover we show that, when specialised to the entanglement entropy, our framework naturally reproduces the predictions of the quasiparticle picture.

25.
arXiv (quant-ph) 2026-06-11

A Pfaffian quantum Hall state of ultracold bosons

arXiv:2606.12409v1 Announce Type: cross Abstract: Fractional quantum Hall states are a cornerstone of topological physics, hosting fractionally charged quasiparticles with exotic statistics that promise to enable topologically protected quantum information processing. Among these, the Pfaffian state introduced by Moore and Read implements a p-wave pairing structure that supports excitations with non-Abelian exchange statistics. Despite extensive study in electronic systems, direct access to its pairing structure has remained limited. Here we realize a three-particle bosonic Pfaffian state of ultracold $^{87}\mathrm{Rb}$ atoms in an optical lattice subject to a Floquet-engineered synthetic magnetic field. Using a Bayesian-optimized adiabatic protocol, we prepare a state exhibiting Pfaffian pairing correlations. Site-resolved measurements of multi-point density correlations reveal a pronounced suppression of short-range three-body coincidences, reflecting the underlying pairing structure. We further probe the state's transport response through Hall drift measurements. Our results establish a bottom-up approach to engineering non-Abelian topological order and lay the groundwork for future explorations of anyonic braiding in synthetic matter.