Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

02.
arXiv (CS.CV) 2026-06-16

Multimodal LLM-Empowered Re-Ranking for Generalizable Person Re-Identification

Domain Generalizable (DG) person re-identification (Re-ID) has attracted growing research interest due to its potential for deployment in unseen real-world scenarios. Most existing approaches address DG Re-ID by focusing on training domain-generalizable encoders but ignore the possible refinements in inference stage. In contrast, this work explores an alternative direction which improves inference re-ranking to enhance DG Re-ID. Conventional re-ranking methods typically rely on neighborhood-based distances to refine the initial ranking list, inherently depending on features produced by the Re-ID encoder. However, they deteriorate on target domains since the encoder lacks sufficient generalizability to produce reliable feature distances on unseen scenarios. Inspired by the remarkable generalization capabilities of recent Multimodal Large Language Models (MLLMs), we propose an MLLM-empowered distance metric to improve re-ranking in DG Re-ID. Specifically, we first adapt an MLLM to Re-ID data through supervised fine-tuning, which incorporates a domain-agnostic prompt and a query-candidate hard mining scheme. Then, the adapted MLLM is employed to compute a $\mu$-distance during inference, which is robust to domain gap and significantly enhances subsequent re-ranking performance. Our approach is model-agnostic and can be seamlessly integrated into previous re-ranking frameworks. Extensive experiments demonstrate that our approach consistently yields substantial performance improvements across multiple DG Re-ID benchmarks. The code of this work will be released at https://github.com/RikoLi/MUSE soon.

03.
arXiv (quant-ph) 2026-06-19

Mitigating Trotter Errors via Post-Processed Symmetry Restoration

arXiv:2606.20242v1 Announce Type: new Abstract: Quantum simulation is a powerful tool for exploring complex quantum many-body systems such as condensed matter physics and gauge theories. Trotterization, which approximates the ideal time evolution operator by decomposing it into a sequence of local gate operations, is one of the most widely used quantum simulation algorithms. However, such Trotterized implementations generally fail to preserve the symmetries of the target Hamiltonian during compilation. As a result, they can drive quantum states out of symmetrically allowed subspaces, leading to unphysical dynamics and symmetry-violating algorithmic errors. In this work, we propose a symmetry-based Trotter error mitigation protocol using classical post-processing. By applying symmetry transformations to the initial state or interleaving them between discrete Trotter layers, and then averaging an ensemble of the resulting measurement outcomes via classical post-processing, our method systematically projects out the symmetry-violating components of the Trotter error while leaving the ideal dynamics unchanged. Importantly, this framework naturally accommodates non-local spatial symmetries and anti-unitary operations such as time reversal, which are difficult or impossible to implement directly with hardware-native quantum gates. We benchmark our protocol on the one-dimensional XY model and the one-dimensional Schwinger model. In the XY model, enforcing reflection symmetry suppresses the leading-order Trotter error, whereas in the Schwinger model, interleaving gauge transformations between Trotter layers enables gauge-twirling effectively to reduce unphysical violations of local Gauss's law. These results demonstrate that symmetry-based post-processing provides a depth-preserving route to substantially improving the fidelity of Trotterized quantum simulations on near-term devices.

04.
arXiv (CS.CL) 2026-06-12

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation, by asking whether benchmark performance reflects many independent abilities, or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks \((60\times44)\) revealing an intrinsically low-rank structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like, dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i)~identifying redundant tasks, (ii)~profiling new models using a small subset of tasks, and (iii)~selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de-facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.

05.
arXiv (CS.AI) 2026-06-15

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

arXiv:2605.29640v3 Announce Type: replace Abstract: Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that lead to incomplete memories or use rigid, single-purpose memory extraction prompts tailored to a single use case, such as chatbots. Consequently, they lack generalizability and perform poorly across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a novel data management paradigm for managing the persistent state of long-term interactions. It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, where memory content is progressively summarized, corrected, and temporally weighted to prioritize recent interactions; and a generalizable abstraction paradigm designed for robust transferability across diverse applications, including education, recommendation, and agent memory. Building on this foundation, we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem materializes this paradigm through interconnected event and entity abstractions. It features event-centric memory extraction to selectively handle complex information streams, while entities are dynamically updated by events to achieve stateful evolution. Using temporal compression via a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluations on long-term memory benchmarks demonstrate that VikingMem outperformes baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency essential for interactive applications.

06.
arXiv (CS.LG) 2026-06-15

Robin-Neumann Coupling of PINN and FEM Solvers: A Steklov-Poincaré View, with Application to Fluid-Structure Interaction with Contact

arXiv:2606.14181v1 Announce Type: cross Abstract: Physics-informed neural networks (PINNs) are meshless and carry moving geometry and topology change through resampling of collocation points; the finite-element method (FEM) is the workhorse for boundary-fitted discretisations. Coupling the two across a shared interface promises the best of both, yet existing PINN-FEM schemes are validated only empirically. We put the coupling on a domain-decomposition footing: viewing each solver as a Steklov-Poincaré (trace-to-flux) operator, we transfer the classical Dirichlet-Neumann (DN) divergence diagnosis and its Robin-Neumann (RN) cure, including a closed-form, sweep-free interface impedance, and prove a PINN-specific contraction theorem: a trained network realises only a perturbed Steklov operator with a per-step training residual, and RN still contracts, with no shared-eigenbasis hypothesis, to a floor set by the achieved training loss. Because a PINN has no stiffness matrix, we introduce a Fourier-mode interface probe that recovers the network's resolvable Steklov eigenvalues to within 0.5% and doubles as a diagnostic of the network's spectral cap. The theory predicts measured PINN-FEM contraction rates to within 7% on 1D and 2D Poisson couplings, and a two-slab analogue of the large-added-mass regime shows RN's per-mode impedance matching winning decisively where tuned scalar relaxation saturates. We demonstrate the framework on a Stokes/rigid-disc problem with Alart-Curnier contact: the meshless PINN fluid absorbs the topology change at contact by collocation exclusion alone, no remeshing and no cut cells, and the static-equilibrium contact reaction matches the submerged weight to 0.4% under mesh refinement. We quantify remaining limitations: the warm-started PINN drifts off the Stokes manifold over long horizons, and matched FEM-FEM benchmarks attribute pre-impact squeeze-film signatures to PINN under-resolution.

07.
arXiv (quant-ph) 2026-06-12

Kubo-Martin-Schwinger conditions for non-Hermitian systems

arXiv:2606.13251v1 Announce Type: new Abstract: We investigate the extension of the Kubo–Martin–Schwinger (KMS) thermal equilibrium condition to non-Hermitian Hamiltonians with real spectra and biorthogonal eigensystems, providing a systematic analysis through three complementary routes. Our central result is a thermodynamic characterisation of quasi-Hermiticity: for $H \in M_d(\mathbb{C})$ diagonalisable with real spectrum, the biorthogonal Gibbs functional $\omega_{\rm{bi}}(A) = Z_{\rm{bi}}^{-1} \sum_n e^{-\beta E_n}\langle\phi_n|A|\psi_n\rangle$ satisfies $\omega_{\rm{bi}}(A^\dag A) \geq 0$ for all $A$ if and only if $H$ is quasi-Hermitian. The proof constructs the metric $\eta$ directly from the eigenprojectors of $\omega_{\rm{bi}}$ via the Riesz representation theorem, with no prior choice of $\eta$, providing a metric-free certificate of quasi-Hermiticity outside the Mostafazadeh–Scholtz framework. Under the full quasi-Hermitian hypothesis, we prove that the $\eta$-Gibbs state $\omega_\eta(A) = Z_\eta^{-1}\, \rm{Tr}[\eta e^{-\beta H}A]$ satisfies all three analytic KMS conditions, using the Hadamard three-line theorem and Bari's theorem on Riesz bases. The result is non-trivial: the transported state $\hat\omega(X) = \rm{Tr}[e^{-\beta h}X\eta]/Z_\eta$ differs from the Gibbs state of the isospectral Hermitian partner $h = \eta^{1/2}H\eta^{-1/2}$ whenever $[\eta,h]\neq 0$, so the KMS property cannot be deduced from the Hermitian theory by similarity. The gap between this result and the full Haag–Hugenholtz–Winnink $C^*$-algebraic framework is identified. Failure modes at exceptional points and for complex spectra are analysed, and the relation to the Fagnola–Umanità quantum detailed balance condition for open systems is discussed.

08.
arXiv (CS.CV) 2026-06-11

Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

In modern vehicular systems, robust performance under harsh conditions has become a critical problem of autonomous driving. Our study delivers a comprehensive evaluation of the newest iteration of the YOLO series, which is YOLOv11 Nano architecture benchmarked against the widely adopted YOLOv8 Nano as a baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and Berkeley Deep Drive Dataset (BDD100K) [2]. We have analyzed the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, the proposed model exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.

09.
arXiv (CS.CL) 2026-06-11

Toward Preference-aligned Large Language Models via Residual-based Model Steering

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to models aligned with DPO and SimPO, they perform better with great time-savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.

10.
arXiv (CS.LG) 2026-06-16

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

arXiv:2605.30837v2 Announce Type: replace-cross Abstract: Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock by 40% relative to an always-on GPT-4o judge, at a 5.1-point benign-utility drop. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier.

11.
arXiv (CS.AI) 2026-06-16

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

arXiv:2604.18827v2 Announce Type: replace-cross Abstract: Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling – even in the mouse visual cortex, a relatively simple system – models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma-brain/omnimouse.

12.
arXiv (CS.AI) 2026-06-12

A Mathematical Theory of Value: a synthesis on goal-directed agency under resource constraints

Authors:

arXiv:2606.12502v1 Announce Type: cross Abstract: We propose that value – the quantity goal-directed agents create, destroy, and exchange – is a lawful structural quantity in the same category as information. Following Shannon's method, we make one ruthless abstraction: value is the rate at which an agent converts a resource into goal-progress, relative to a frame fixed by its goal. A scale-invariance axiom forces a logarithmic measure, $V=\sum_i k_i \ln e_i$; compounding of a reinvested resource forces the same form via the ergodicity argument of Peters (2019). The two routes are kin rather than independent; their agreement is a consistency check, not an over-determination. We derive a coding theorem of value: $\Delta G \le I(X;Y)$, achieved by Bayes-proportional allocation; realized value decomposes as $G=D(q\|r)-D(q\|p)$, identifying misalignment with measurable waste. For populations, value is frame-relative while price is frame-independent; a fleet that pools its resource and fuses its perception inherits the ceiling $G_{\mathrm{fleet}} \le I(X;Y_{1:m}) \le H(X)$ (a corollary; an earlier sum-form claim was wrong and is corrected in v5). A dynamical layer yields an is/ought asymmetry from which alignment emerges as a control-stability condition with a closed-form residual. We test the single-frame laws on live language models in a pre-registered scale-up: perception mutual information tracks realized capability rather than parameter count (Spearman $\rho = 0.977$ pooled over 30 model$\times$domain points), out-of-sample $\Delta G$ tracks $I(X;Y)$, and over-confidence is measurable dissipation; a further pre-registered test shows the bridge is shape-invariant across four task shapes ($n=42$, slope 0.953). None of the mechanisms is individually new – generalized Kelly, Armstrong & Mindermann (2018), classical control; the contribution is their unification and the governance mapping (incentive design over oversight) that follows.

13.
arXiv (CS.AI) 2026-06-16

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

arXiv:2605.22664v3 Announce Type: replace Abstract: LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

14.
arXiv (CS.LG) 2026-06-16

DAL: A Practical Prior-Free Black-Box Framework for Piecewise Stationary Bandits

arXiv:2501.19401v5 Announce Type: replace Abstract: We introduce a practical, black-box framework termed Detection Augmented Learning (DAL) for the problem of piecewise stationary bandits without knowledge of the underlying non-stationarity. DAL accepts any stationary bandit algorithm with order-optimal regret as input and augments it with a change detector, enabling applicability to all common bandit variants. Extensive experimentation demonstrates that DAL consistently surpasses all state-of-the-art methods across diverse non-stationary scenarios, including synthetic benchmarks and real-world datasets, underscoring its versatility and scalability. We provide theoretical insights into DAL's strong empirical performance, complemented by thorough empirical validation.

15.
arXiv (CS.CV) 2026-06-16

Navigating Distribution Shifts in Medical Image Analysis: A Survey

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on others from varying hospitals, or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints – such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols – with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained, and further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.

16.
arXiv (CS.AI) 2026-06-16

Artificial Intelligence Index Report 2026

arXiv:2606.15708v1 Announce Type: new Abstract: Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's impact are struggling to match the pace of the technology itself. That gap between what AI can do and how prepared we are to manage it runs through every chapter of this year's report. New in this edition, the report tracks how AI is being tested more ambitiously across reasoning, safety, and real-world task execution, and why those measurements are increasingly difficult to rely on. It also features new estimates of generative AI's economic value alongside emerging evidence of its labor market effects, an analytical framework on AI sovereignty, and a science chapter developed in collaboration with Schmidt Sciences. For the first time, the report features standalone chapters on AI in science and AI in medicine, reflecting AI's growing impact across these two domains.

17.
arXiv (CS.CV) 2026-06-16

HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

Authors:

Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

18.
arXiv (CS.AI) 2026-06-16

A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

arXiv:2606.16933v1 Announce Type: cross Abstract: Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and Out-of-Distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these views remains unclear, and existing work mainly focuses on mitigation rather than the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a Partially Observable Markov Decision Process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal, agent-driven, and external, environment-driven, distributional shifts. The shifted-time boundary perspective further characterizes explicit, implicit, and hybrid shifts. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework for measuring shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.

19.
arXiv (CS.LG) 2026-06-17

Stable and Steerable Sparse Autoencoders with Weight Regularization

arXiv:2603.04198v2 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we studied weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, it dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increased the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.

20.
arXiv (CS.CV) 2026-06-16

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

Authors:

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 – a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.

21.
arXiv (CS.AI) 2026-06-18

Surrogate Benchmarks for Model Merging Optimization

arXiv:2509.02555v2 Announce Type: replace-cross Abstract: Model merging techniques aim to integrate the abilities of multiple models into a single model. Most model merging techniques have hyperparameters, and their setting affects the performance of the merged model. Because several existing works show that tuning hyperparameters in model merging can enhance the merging outcome, developing hyperparameter optimization algorithms for model merging is a promising direction. However, its optimization process is computationally expensive, particularly in merging LLMs. In this work, we develop surrogate benchmarks for optimization of the merging hyperparameters to realize algorithm development and performance comparison at low cost. We define two search spaces and collect data samples to construct surrogate models to predict the performance of a merged model from a hyperparameter. We demonstrate that our benchmarks can predict the performance of merged models well and simulate optimization algorithm behaviors.

22.
arXiv (CS.AI) 2026-06-16

SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

arXiv:2606.16332v1 Announce Type: cross Abstract: Modern CPUs increasingly integrate matrix extensions, such as Arm Scalable Matrix Extension (SME), that provide high-throughput matrix execution within the CPU. For LLM inference, however, these units are not a universal replacement for conventional CPU cores: prefill, decode, attention, and KV-cache operations expose different arithmetic intensities, vector behavior, and layout requirements, while SME units and CPU cores still compete for shared memory bandwidth. This paper studies this mismatch through a roofline-based characterization of SME-enabled CPUs and uses the resulting model to guide operator-level execution choices. We present SMEPilot, an LLM inference engine that selects CPU-only, SME-only, or cooperative SME+CPU execution for each operator shape. SMEPilot partitions matrix work across SME and CPU cores at tile granularity, overlaps SME-suitable matrix stages with CPU-suitable vector stages in attention, and maintains layout state so packed tensor representations are reused rather than repeatedly rebuilt on critical paths. Across Llama-3.2-3B, Qwen3-4B, and Qwen3-30BA3B on phone, PC, and server platforms, SMEPilot improves end-to-end inference performance by up to 3.94$\times$.

23.
arXiv (CS.AI) 2026-06-11

SpikeDecoder: Realizing the GPT Architecture with Spiking Neural Networks

arXiv:2606.12287v1 Announce Type: cross Abstract: The Transformer architecture is widely regarded as the most powerful tool for natural language processing, but due to a high number of complex operations, it inherently faces the issue of high energy consumption. To address this issue, we consider Spiking Neural Networks (SNNs), which are an energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their naturally event-driven approach to processing information. However, this inherently makes them difficult to train. Often, many SNN-based models circumvent this issue by converting pre-trained ANNs. More recently, attempts have been made to design directly trainable SNN-based adaptations of the Transformer model structure. Although the results showed great promise, the application field was computer vision. Moreover, the proposed model incorporates only encoder blocks. In this paper, we propose SpikeDecoder, a fully SNN-based implementation of the Transformer decoder block, for applications in natural language processing. In a series of experiments, we analyze the impact of exchanging different blocks of the ANN model with spike-based alternatives to identify trade-offs and significant sources of performance loss. We further investigate the role of residual connections and the selection of SNN-compatible normalization techniques. Besides the work on the model architecture, we formulate and compare different embedding methods to project text data into spikes. Finally, we demonstrate that our proposed SNN-based decoder block reduces the theoretical energy consumption by 87% to 93% compared to the ANN baseline.

24.
arXiv (CS.LG) 2026-06-18

Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference

arXiv:2605.20726v2 Announce Type: replace-cross Abstract: Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p-values fall below a threshold. The quality of such methods is often measured by the false discovery proportion (FDP), defined as the fraction of incorrect selections. Existing approaches typically control the expected value of the FDP, using methods such as the Benjamini-Hochberg procedure. This approach fails to provide high-probability bounds on the realized false discovery proportion and invalidates statistical guarantees if the rejection threshold is selected after inspecting the data. This paper establishes finite-sample, distribution-free upper bounds on the FDP that hold simultaneously over all possible rejection thresholds, enabling arbitrary post hoc selection of the threshold. Simultaneous validity is achieved by constructing a high-probability envelope for the empirical distribution function of null conformal p-values by sampling from their joint distribution. Furthermore, our framework allows practitioners to modulate the envelope's shape, thereby producing tight bounds in rejection regions of primary interest. We use this flexible approach to derive simultaneous FDP upper bounds for both outlier detection and conformal selection. We demonstrate through synthetic and real-data experiments that the resulting bounds are both valid and substantially less conservative than those derived from existing approaches.

25.
arXiv (CS.CL) 2026-06-11

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.