Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-16

TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

arXiv:2605.13909v2 Announce Type: replace-cross Abstract: Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

02.
arXiv (CS.LG) 2026-06-18

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule may signal life on one planet and abiotic chemistry on another. Global climate models (GCMs) provide this understanding, but individual runs can require up to millions of core-hours and substantial domain expert time. Machine-learning emulators could remove this bottleneck, but progress has been limited by the absence of a curated, multi-model exoclimate dataset. We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves. We evaluate seven baselines spanning simple methods, deep learning, and Gaussian processes. GP-based methods perform best, suggesting that ThousandWorlds exposes a regime where off-the-shelf deep learning does not yet succeed. Data: https://doi.org/10.57967/hf/8695. Code: https://github.com/edstevenson/ThousandWorlds.

03.
arXiv (math.PR) 2026-06-11

Hilbert space embeddings of independence tests and interaction measures of several variables

arXiv:2411.08653v2 Announce Type: replace-cross Abstract: We present a unified theoretical framework for kernel-based measures of dependence on product spaces. Building on the ideas underlying distance covariance, distance multivariance, and the Hilbert-Schmidt Independence Criterion (HSIC), we define a new family of kernels on an $n$-fold Cartesian product, termed positive definite independent of order $k$ (PDI$_{k}$ kernels). These kernels extend the concepts of positive definite and conditionally negative definite kernels to higher orders and provide the foundation for generalized independence and interaction tests, such as the generalized Lancaster interaction of order $k$ ($\Lambda_{k}^{n}$), and the Streitberg interaction ($\Sigma$). Our analysis focuses on the continuous setting, where we prove a Kernel Mean Embedding Theorem for PDI$_{k}$ kernels and establish the corresponding integrability restrictions. Based on these results, we characterize how the Kronecker products of PDI kernels behave.

04.
arXiv (CS.CL) 2026-06-15

Implicit Reasoning for Large Language Model-based Generative Recommendation

Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

05.
arXiv (CS.CV) 2026-06-16

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

06.
arXiv (CS.CV) 2026-06-19

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.

07.
arXiv (CS.CL) 2026-06-18

Trust Region On-Policy Distillation

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.

08.
arXiv (CS.AI) 2026-06-12

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

arXiv:2606.13385v1 Announce Type: cross Abstract: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.

09.
arXiv (CS.AI) 2026-06-18

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

arXiv:2606.18293v1 Announce Type: cross Abstract: Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed `vibe coding.' It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

10.
arXiv (CS.CV) 2026-06-15

HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation

We present an unsupervised single image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF) like image from a High-Field (HF) magnitude image and vice-versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates a HF to ULF transformation by estimating the tissue-type Signal-to-Noise ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data from generated from standard 3T T$_1$-weighted images for qualitative assessments and paired 3T-64mT T$_1$-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise and initial seeding.

11.
arXiv (CS.AI) 2026-06-17

How Inference Compute Shapes Frontier LLM Evaluation

arXiv:2606.17930v1 Announce Type: new Abstract: AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.

12.
medRxiv (Medicine) 2026-06-18

Multicluster measles outbreak with a substantial proportion of modified cases in Tokyo, Japan, January-May 2026

Tokyo experienced a measles outbreak (260 cases) in early 2026 despite elimination status. Adults aged 20-39 years were most affected, and 38% of cases were modified measles, increasing with prior vaccination. Although incidence rose until April, the effective reproduction number; R(t) fell below 1, consistent with outbreak control. Multiple clusters were identified, but many cases lacked epidemiological links, suggesting that modified measles is less likely to be considered in differential diagnosis. Intensive contact tracing and surveillance contributed to limiting transmission.

14.
arXiv (CS.CV) 2026-06-15

Scratched Lenses, Shifted Depth: Passive Camera-Side Optical Attacks

Physical adversarial attacks on vision systems are typically studied through scene manipulation, such as adversarial patches or projections, where the adversary controls what the camera observes. Camera-side attacks using stickers or auxiliary optics have also been explored, but they treat attacks as image-space perturbations from designed patterns. This misses how physical imperfections interact with scene-dependent lighting and optics. We identify a threat: passive lens-side damage that is persistent yet trigger-conditioned, producing optical artifacts that bias geometric inference under particular visual conditions. We instantiate this threat through Scratch-induced Lens Adversarial Streak Hijacking SLASH, a physical-world attack caused by small scratches on a camera lens or protective cover. Scratches interact with bright light sources and specular reflections to create structured streak artifacts that distort depth cues. Since the perturbation is fixed in the optical path but triggered by the scene, it is both persistent and selective. We formulate the attack in optical space, model the scratch pattern as a trigger-conditioned optical channel, and optimize one fixed configuration across diverse viewing conditions. We evaluate SLASH on monocular depth estimation and monocular 3D object detection in digital and real-world settings. Under the fixed-scratch constraint, directional depth shifts reach up to 32% relative error for monocular depth estimation, with consistent effects on monocular 3D object detection. Physical experiments confirm transfer to real camera recordings, inducing depth shifts above the model's natural prediction baseline. These findings reveal an attack surface where benign-looking hardware imperfections act as latent, scene-triggered adversarial mechanisms, challenging assumptions about physical robustness and motivating defenses for secure vision systems.

15.
arXiv (CS.CV) 2026-06-11

Cross-Domain Multi-Person Human Activity Recognition via Near-Field Wi-Fi Sensing

Wi-Fi-based human activity recognition (HAR) provides substantial convenience and has emerged as a thriving research field, yet the coarse spatial resolution inherent to Wi-Fi significantly hinders its ability to distinguish multiple subjects. By exploiting the near-field domination effect, establishing a dedicated sensing link for each subject through their personal Wi-Fi device offers a promising solution for multi-person HAR under native traffic. However, due to the subject-specific characteristics and irregular patterns of near-field signals, HAR neural network models require fine-tuning (FT) for cross-domain adaptation, which becomes particularly challenging with certain categories unavailable. In this paper, we propose WiAnchor, a novel training framework for efficient cross-domain adaptation in the presence of incomplete activity categories. This framework processes Wi-Fi signals embedded with irregular time information in three steps: during pre-training, we enlarge inter-class feature margins to enhance the separability of activities; in the FT stage, we innovate an anchor matching mechanism for cross-domain adaptation, filtering subject-specific interference informed by incomplete activity categories, rather than attempting to extract complete features from them; finally, the recognition of input samples is further improved based on their feature-level similarity with anchors. We construct a comprehensive dataset to thoroughly evaluate WiAnchor, achieving over 90% cross-domain accuracy with absent activity categories.

16.
arXiv (CS.AI) 2026-06-16

Recurrent Reasoning on Symbolic Puzzles with Sequence Models

arXiv:2606.15686v1 Announce Type: new Abstract: Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current reasoning benchmarks is that many primarily test whether a model can produce a valid answer, while paying less attention to whether the solution is minimal, robust, and stable under controlled difficulty scaling. We introduce RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World, and Checkers Jumping) with BFS-optimal trajectories and a single interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling 10{,}817 unique puzzles and 285{,}933 moves. We benchmark two Transformer families, an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style), under consistent data splits and evaluation criteria, training on $N{=}1$ to $7$ and evaluating on both held-out in-distribution instances and harder out-of-distribution instances at $N{=}8$ to $10$. Fine-tuned pre-trained T5 achieves 97.27\% validation and 81.00\% OOD accuracy on Block World; all models score 0.00\% on River Crossing under all conditions. Failure mode analysis reveals that architecture is a stronger determinant of success than scale. Pre-training transfers only to puzzles with locally structured transition functions. Our code and dataset will be open-sourced upon acceptance.

17.
arXiv (quant-ph) 2026-06-16

Electronic Band Structure of Silicon Determined via a Variational Adiabatic Eigensolver: Theory and Experiment

arXiv:2606.16604v1 Announce Type: new Abstract: This work addresses the critical challenge of excited-state preparation for semiconductor band structure calculations. We introduce a variational adiabatic eigensolver (VAE) protocol that combines adiabatic evolution with variational optimization to prepare high-fidelity eigenstates on noisy intermediate-scale quantum (NISQ) devices. Applying a momentum-space truncation, we accurately compute the electronic band structure of silicon – an idealized infinite periodic system – using only a modest number of qubits. Our approach employs multi-qubit parameterized circuits and a phase-based loss function, overcoming limitations of conventional methods. These limitations include the circuit-construction difficulty in traditional adiabatic approaches and the reduced accuracy of variational quantum eigensolvers for excited states. Through rigorous numerical simulation and experimental implementation on a superconducting quantum processor, we successfully prepare silicon's valence-band and conduction-band eigenstates. Single-shot readout yields state fidelities exceeding 96%, and the measured energy expectations agree with theoretical band energies within 0.5 eV. Further refinement via single-frequency oscillation fitting reduces the energy deviation to below 0.01 eV. This framework provides a robust and practical pathway for precisely determining electronic structures in quantum materials.

18.
arXiv (CS.CV) 2026-06-17

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality: Stage 1: Uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 um/pixel). Stage 2: Uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 um/pixel). Stage 3: Uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 um/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.

19.
arXiv (CS.AI) 2026-06-16

Fine-Tuning a 7B Advisor on Free-Tier GPUs: An Adapter-Handoff Recipe and a Synthetic-Data Reliability Caution

arXiv:2504.15610v4 Announce Type: replace Abstract: Fine-tuning a 7B language model for specialized advising is attractive in resource-constrained settings, but multi-epoch runs routinely exceed the wall-clock limits of the free-tier GPUs (Kaggle, Colab) such users rely on. We report two things. First, a practical recipe: a three-epoch QLoRA fine-tune of Mistral-7B-Instruct-v0.3 (4-bit NF4, LoRA rank 16, via Unsloth) completed across two free-tier 16 GB GPUs (Tesla P100 then T4) by checkpointing only the small LoRA adapter (41.9M parameters) and resuming on the second machine. Adapter-only handoff is sufficient – optimizer and scheduler state need not be transferred – so the binding constraint is per-step VRAM and per-session wall-clock, not aggregate compute. Second, and more importantly, an honest evaluation that returns a cautionary result. On a blind held-out comparison against the un-fine-tuned base model, the fine-tuned model scored higher on similarity to the synthetic training distribution (BERTScore F1 +0.063, a fidelity not quality signal) but lower on advising quality: a blind LLM-as-judge preferred the base model on 46% of prompts versus 18%, and a source-verified factuality audit found four confident errors from the fine-tuned model on policy-sensitive topics against zero for the base. Auditing the training data with the same method, we find this is not a fine-tuning artifact: each audited error is already present in the Gemini-generated training answers, and a random-sample audit finds verifiable errors in a sizable fraction of responses (28-40%; single-judge, n=40). The data is therefore sufficient to account for the errors, which we attribute to the synthetic-data pipeline rather than the adapter-handoff method. We release the dataset, adapter, cross-GPU notebooks, and full evaluation harness so every result reproduces on a single 16 GB GPU.

20.
arXiv (CS.AI) 2026-06-12

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

arXiv:2606.13311v1 Announce Type: cross Abstract: Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can be critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1–False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.

21.
arXiv (CS.CL) 2026-06-16

Depth-Attention: Cross-Layer Value Mixing for Language Models

Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference–a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache–the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.

22.
arXiv (quant-ph) 2026-06-15

Quantum-Classical Hierarchical Equations of Motion

Authors:

arXiv:2606.14363v1 Announce Type: new Abstract: We develop a quantum-classical hierarchical equations of motion (QC-HEOM) approach for simulating non-Markovian open quantum systems. The method combines the ensemble-averaged classical path reference of the quantum-classical path integral formalism with a hierarchy of auxiliary quantum influence functionals. By incorporating thermal fluctuations through an ensemble average over reference trajectories, the hierarchy is required to represent only the residual quantum memory associated with the imaginary part of the bath response function. Consequently, unlike conventional hierarchical equations of motion, QC-HEOM does not require Matsubara or Padé expansions of the thermal kernel and exhibits only weak temperature dependence of the hierarchy size. Furthermore, because thermal fluctuations are supplied through reference classical trajectories, the framework naturally extends beyond harmonic baths and enables the incorporation of anharmonic and molecular environments through externally generated trajectories. We derive the formalism and demonstrate its exactness for a harmonic bath. Applications to an asymmetric spin-boson model and the seven-site Fenna–Matthews–Olson complex illustrate the accuracy of QC-HEOM. It reproduces benchmark quasi-adiabatic path integral and hierarchical equations of motion results while requiring substantially fewer auxiliary objects, particularly at low temperatures. These results establish QC-HEOM as an efficient framework for treating residual quantum memory in quantum-classical descriptions of open-system dynamics. The separation of thermal fluctuations from residual quantum memory through the use of Wigner trajectories provides an approximate route toward hierarchical treatments of complex anharmonic environments that are inaccessible to conventional HEOM approaches.

23.
arXiv (CS.LG) 2026-06-15

Ensembling Sparse Autoencoders

arXiv:2505.16077v2 Announce Type: replace Abstract: Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we introduce and formalize SAE ensembles. Furthermore, we propose to ensemble multiple SAEs through naive bagging and boosting. In naive bagging, SAEs trained with different weight initializations are ensembled, whereas in boosting SAEs sequentially trained to minimize the residual error are ensembled. Theoretically, naive bagging and boosting are justified as approaches to reduce reconstruction error. Empirically, we evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that, compared to an expanded SAE that matches the number of features in the ensemble, ensembling SAEs improves the reconstruction of language model activations along with SAE stability. Additionally, on downstream tasks such as concept detection and spurious correlation removal, SAE ensembles achieve better performance, showing improved practical utility.

24.
arXiv (CS.CL) 2026-06-15

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

25.
arXiv (CS.CV) 2026-06-12

Comparing Commercial Depth Sensor Accuracy for Medical Applications

Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.