Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-11

Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.

02.
arXiv (CS.CV) 2026-06-17

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.

03.
bioRxiv (Bioinfo) 2026-06-11

Calibrated Uncertainty Quantification for Patient-Level AML Drug Sensitivity Prediction Using Split Conformal Prediction

Accurate prediction of ex vivo drug sensitivity in acute myeloid leukemia (AML) patients from transcriptomic data is a critical challenge for precision oncology. Existing computational approaches have explored uncertainty quantification in cancer drug response prediction primarily using cell line data, while patient-level AML models typically rely on heuristic confidence measures rather than statistically calibrated uncertainty estimates. Here, we present a framework applying split conformal prediction to patient-level AML drug response modeling using the BeatAML 2.0 cohort. We trained Elastic Net and XGBoost regressors on bulk RNA-seq gene expression profiles from 318 AML patients, analyzing 34,764 patient-drug observations across 122 compounds. Baseline models achieved median Pearson R values of 0.291 (Elastic Net) and 0.281 (XGBoost) across 122 drugs. Wrapping these models with split conformal prediction yielded well-calibrated prediction intervals across three confidence levels: empirical coverages of 81.4%, 90.7%, and 95.5% against nominal targets of 80%, 90%, and 95%, respectively. Analysis of prediction interval widths revealed substantial drug-class-specific uncertainty patterns, with HDAC and BCL-2 inhibitors exhibiting markedly higher uncertainty than MDM2 inhibitors, suggesting a potential association between transcriptomic predictability and drug mechanism of action, although several drug classes were represented by only a small number of compounds. Predictive uncertainty was not significantly associated with ELN2017 molecular risk classification (Kruskal-Wallis p=0.395) or NPM1 mutation status (p=0.788). These results demonstrate that statistically valid uncertainty quantification can be achieved for patient-level AML drug response prediction despite substantial biological heterogeneity. to the best of our knowledge, no published study has applied split conformal prediction to patient-level ex vivo drug sensitivity prediction in the BeatAML cohort, providing a principled alternative to heuristic confidence scoring approaches. Keywords: Acute myeloid leukemia (AML); Ex vivo drug sensitivity; Conformal prediction; Uncertainty quantification; Precision oncology; BeatAML; Transcriptomic biomarkers; Machine learning.

04.
arXiv (CS.CL) 2026-06-17

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.

05.
Nature (Science) 2026-06-10

Light slows down carbon nanotubes in water

Water-suspended carbon nanotubes move more slowly in green light, suggesting that excited electrons in the tubes couple to the water through ‘quantum friction’. Water-suspended carbon nanotubes move more slowly in green light, suggesting that excited electrons in the tubes couple to the water through ‘quantum friction’.

06.
arXiv (CS.CL) 2026-06-17

In-Context Environments Induce Evaluation-Awareness in Language Models

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent evaluation awareness. This raises concerns that models could strategically underperform, or sandbag, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent – execution gap reveals a monotonic resistance ordering: Arithmetic $

07.
arXiv (CS.CL) 2026-06-11

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability – typically introduced in post-training – to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic model evaluation. By optimizing only 10 soft-prompt vectors (roughly 0.0006% parameters for a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models – trained with various pre-training recipes – on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples) making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, that (c) even post-trained models can benefit from soft-prompts to maximize format compliance, and that (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics which disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol of LLM knowledge, and (3) a cost- and memory-effective recipe to identify optimal pre-training strategies early in LLM development.

08.
arXiv (math.PR) 2026-06-19

Critical parameters of germ-monotone families of branching random walks

arXiv:2602.21062v2 Announce Type: replace Abstract: We introduce a broad class of families of branching random walks on a countable set $X$, which we refer to as germ-monotone branching random walks (GMBRWs). The processes in each family are parametrized by a positive parameter $\lambda>0$, which controls the overall reproductive speed, and they are monotonically increasing in $\lambda$ with respect to the germ order, a notion that extends classical stochastic domination. This framework encompasses a wide range of models, including classical continuous-time branching random walks, as well as discrete-time counterparts of certain non-Markovian processes such as ageing branching random walks. We define a general notion of critical parameter $\lambda(A)$ associated with each subset $A \subseteq X$, which serves as a threshold separating almost sure extinction in $A$ from positive probability of survival in $A$. This unifies and extends the classical global and local critical parameters $\lambda_w$ and $\lambda_s$, which can be recovered as special cases. We then investigate how modifications of the reproduction laws, either on a finite set or on a more general subset of $X$, affect these critical parameters. Our results extend earlier contributions in the literature.

09.
arXiv (CS.AI) 2026-06-12

Automated reproducibility assessments in the social and behavioral sciences using large language models

arXiv:2606.13670v1 Announce Type: new Abstract: Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.

10.
arXiv (CS.CV) 2026-06-17

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency

11.
bioRxiv (Bioinfo) 2026-06-17

DesignMaster: A Multi-Conditional Diffusion Framework for Rational PROTAC Design

Motivation: Proteolysis-targeting chimeras (PROTACs) enable targeted protein degradation through ternary complex formation with E3 ubiquitin ligase. However, the rational design of PROTACs remains highly challenging due to limited structure-activity relationship data and the vast conformational diversity of linkers. Existing computational approaches can be broadly divided into structure-based ternary modelling methods and fragment-based linker generation models. Although these approaches have advanced PROTAC design, they typically neglect key physicochemical constraints and linker-length control during the generation process, causing the generated PROTACs to lack balanced structural properties required for effective ternary complex formation with drug-like characteristics. Results: To address these limitations, we propose DesignMaster, a diffusion-based generative framework that explicitly incorporates linker length and physicochemical properties as controllable conditioning signals. DesignMaster employs an E(3)-equivariant graph Transformer with a gated multi-condition fusion module to inject linker length and physicochemical constraints throughout the diffusion process, enabling fine-grained and constraint-aware molecular generation. Experiments on PROTAC-DB 2.0 and 3.0 demonstrate that DesignMaster outperforms state-of-the-art baselines, with a 3.2% improvement in validity and a 34.4% improvement in recovery. The Case study shows DesignMaster achieves a 51.78% reduction in RMSD when predicting the linker of PROTAC BCPyr targeting 6W7O, highlighting its potential for practical structure-guided PROTAC design. Availability: The source code and datasets are available at https://github.com/ABILiLab/DesignMaster.

12.
arXiv (CS.LG) 2026-06-16

Towards Data-Efficient Cross-Device Generalization of Grad-Shafranov Equilibria via Transfer Learning Neural Operator

arXiv:2606.15512v1 Announce Type: new Abstract: Real-time reconstruction of magnetohydrodynamic equilibria is essential for plasma shaping, stability assessment and feedback control in magnetic confinement fusion. However, Grad-Shafranov equilibrium calculations remain largely device-specific and iterative, limiting their use in latency-constrained control settings. Existing neural approaches can accelerate individual equilibrium predictions, but they do not generally provide reusable models across changing plasma boundaries or tokamak geometries. Here we show that equilibrium reconstruction can be recast as a cross-device operator learning problem. We develop a domain-specific neural operator framework that maps geometry and profile parameters directly to the poloidal flux field, replacing repeated solve-on-demand computation with amortized operator inference. Using the analytically tractable Solov'ev family as a controlled Grad-Shafranov testbed, we generate equilibria across eight geometrically distinct tokamak-like configurations and benchmark five neural operator architectures under four transfer-learning strategies. Single-geometry pretraining gives poor transfer to unseen devices, whereas multi-geometry pretraining enables data-efficient adaptation. The Wavelet Neural Operator gives the strongest cross-geometry performance, reaching mean relative L2 errors below 4% with 100 labelled target equilibria and below 2% with full fine-tuning. The predicted magnetic fields satisfy the divergence-free constraint to numerical precision, and four architectures achieve millisecond or sub-millisecond inference. These results identify neural operator pretraining as a route towards reusable, real-time equilibrium inference across fusion device configurations.

13.
arXiv (CS.AI) 2026-06-11

Irresponsible AI: big tech's influence on AI research and associated impacts

arXiv:2512.03077v2 Announce Type: replace-cross Abstract: The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we invite AI researchers to counter big tech's influence in irresponsible AI development through strategies that build on the responsibility of implicated actors and collective action.

14.
arXiv (quant-ph) 2026-06-15

Quantitative and Optimal Device-Independent Lower Bounds on Detection Efficiency

arXiv:2511.19302v2 Announce Type: replace Abstract: This paper examines a quantitative and optimal lower bound on the detector efficiency in a (2,2,2) Bell experiment within a fully device-independent framework, whereby the detectors used in the experiment are uncharacterized. We provide a tight lower bound on the minimum efficiency required to observe a desired Bell-CHSH violation using the Navascués-Pironio-Acín (NPA) hierarchy, confirming tightness up to four decimal places with numerical optimization over explicit quantum realizations. We then introduce the effect of dark counts and demonstrate how to quantify the minimum required efficiency to observe a desired CHSH violation with an increasing dark count error. Finally, to obtain an analytical closed-form expression of the minimum efficiency, we consider the set of no-signaling behaviors that satisfy the Tsirelson bound, which are easier to characterize than the quantum set. Using such behaviors, we find a simple closed-form expression for a lower bound on the minimum efficiency which is monotonically increasing with the CHSH violation, though the analytically obtained lower bounds are meaningfully below the numerically tight lower bound.

15.
arXiv (CS.AI) 2026-06-19

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion

arXiv:2606.19633v1 Announce Type: cross Abstract: Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: https://cts-moe.github.io/ .

16.
arXiv (CS.AI) 2026-06-17

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

arXiv:2606.18101v1 Announce Type: new Abstract: Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signal. To mitigate this, We propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.

17.
arXiv (CS.AI) 2026-06-16

An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture

arXiv:2602.08597v3 Announce Type: replace Abstract: Robust multimodal systems must remain effective when some modalities are noisy, degraded, or unreliable. Existing multimodal fusion methods often learn modality selection jointly with representation learning, making it difficult to determine whether robustness comes from the selector itself or from full end-to-end co-adaptation. Motivated by Global Workspace Theory (GWT), we study this question using a lightweight top-down modality selector operating on top of a frozen multimodal global workspace. We evaluate our method on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0, under structured modality corruptions. The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines, and the learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. Beyond explicit corruption settings, on the MM-IMDb 1.0 benchmark, we show that the same mechanism improves the global workspace over its no-attention counterpart and yields decent benchmark performance.

18.
arXiv (CS.LG) 2026-06-12

Data-driven Lake Water Quality Forecasting for Time Series with Missing Data using Machine Learning

arXiv:2601.15503v2 Announce Type: replace Abstract: Volunteer-led lake monitoring yields irregular, seasonal time series with many gaps arising from ice cover, weather-related access constraints, and occasional human errors, complicating forecasting and early warning of harmful algal blooms. We study Secchi Disk Depth (SDD) forecasting on a 30-lake, data-rich subset drawn from three decades of in-situ records collected across Maine lakes. Missingness is handled via Multiple Imputation by Chained Equations (MICE), and we evaluate performance with a normalized Mean Absolute Error (nMAE) metric for cross-lake comparability. Among six candidates, ridge regression provides the best mean test performance. Using ridge regression, we then quantify the minimal sample size, showing that under a backward, recent-history protocol, the model reaches within 5% of full-history accuracy with approximately 176 training samples per lake on average. We also identify a minimal feature set, where a compact four-feature subset matches the thirteen-feature baseline within the same 5% tolerance. Bringing these results together, we introduce a joint feasibility function that identifies the minimal training history and fewest predictors sufficient to achieve the target of staying within 5% of the complete-history, full-feature baseline. In our study, meeting the 5% accuracy target required about 64 recent samples and just one predictor per lake, highlighting the practicality of targeted monitoring. Hence, our joint feasibility strategy unifies recent-history length and feature choice under a fixed accuracy target, yielding a simple, efficient rule for setting sampling effort and measurement priorities for lake researchers.

19.
arXiv (CS.CV) 2026-06-15

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

20.
arXiv (quant-ph) 2026-06-12

Influence-solvability: a systematic theory of $(1+1)D$ solvability and its application to brickwork circuits

arXiv:2606.12538v1 Announce Type: cross Abstract: `Solvable' circuits, such as dual unitaries and its generalisations, have arisen as paradigmatic examples of tractable chaotic non-equilibrium dynamics, both in classical and quantum systems. However, while increasingly more complicated sufficient conditions have been proposed, a systematic theory classifying and understanding general features of solvable circuits is missing. We develop such a theory by introducing influence-solvable circuits, a class of $(1+1)D$ circuits whose influence matrix, which represents the `bath' generated by its own evolution, is given by a uniform MPS with finite bond-dimension $\chi$. This property allows for efficient computation of subsystem dynamics and essentially contains all known examples of solvable circuits. We derive a set of necessary and sufficient local conditions by using a version of the fundamental theorem of MPS for open boundary conditions. Next we apply our theory to brickwork circuits with $\chi=1$ influence-solvability and perform a systematic classification of classical brickwork circuits with local dimension up to $d=3$ and quantum brickwork circuits with $d=2$. Our search reveals new solvable circuits that are not captured by known solvability conditions.

21.
arXiv (CS.CL) 2026-06-17

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet remain underexplored issue : editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text–image query pairs), however, it often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

22.
arXiv (CS.AI) 2026-06-11

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

arXiv:2602.08986v2 Announce Type: replace-cross Abstract: In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.

23.
arXiv (CS.CV) 2026-06-15

Context-Guided Semantic Alignment for Feature Fusion Networks

Feature fusion networks are fundamental components in modern object detectors, aggregating multi-scale features to detect objects of varying sizes. However, directly fusing features from different pyramid levels often introduces semantic inconsistency due to their heterogeneous representations. In this paper, we propose Feature Interaction NEtwork (FINE), a lightweight semantic alignment module that refines low-level features via high-level contextual guidance using cross-level attention prior to fusion. To bridge the structural gap and ensure computational efficiency, we introduce an Alignment-Aware Token Sampling that aligns corresponding spatial regions across scales, reducing the attention complexity by an order of magnitude. The resulting attention weights generate a spatial-channel modulation map that is upsampled and applied to the low-level features via residual element-wise modulation. This mechanism ensures that the network selectively enhances semantically relevant pixels while preserving the sub-pixel localization accuracy necessary for dense prediction tasks. FINE is generally applicable to various detectors and consistently improves detection accuracy without compromising efficiency.

24.
arXiv (CS.CL) 2026-06-15

Multi-component Causal Tracing in Large Language Models

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

25.
arXiv (CS.CL) 2026-06-17

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.