Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-17

MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense

arXiv:2606.17435v1 Announce Type: new Abstract: Time-series forecasting models remain vulnerable to gradient-based adversarial attacks while existing defense mechanisms typically incur a trade-off in robustness for bounded response and compute cost. The problem is pronounced in Moving Target Defense where maintaining multiple randomized model instances substantially exacerbates the training overhead. In this work, we introduce MorphStrata, a student generation strategy with selective, layer-specific stochastic noise injection that extends the traditional Morphence defense. MorphStrata uses a Transformer backbone as the teacher and perturbs randomly selected architectural blocks to create structured heterogeneity across student models in response to varied data distributions and threat models. We evaluate against vanilla Transformer and Morphence backbones on a suite of benchmarks including the Jena Climate, Electricity Load Diagrams, and Appliances Energy Prediction using FGSM, BIM and PGD attacks across multiple attack strengths. Across datasets and attack regimes, the proposed ensemble maintains comparable adversarial RMSE. Specifically, for high entropy, periodic datasets as in the case of the AEP data, MorphStrata achieves the lowest RMSE across all attacks and perturbation budgets, improving over the static baseline by up to 24.11% and 97.97% under FGSM and BIM respectively at an epsilon value of 0.5 over 30 randomized trials. Targeting the layers to generate MorphStrata students accounts for less than 1% increase in train-times over the Morphence MTD baseline for most of the experiments, while accounting for double digit gains in adversarial RMSE reduction. We also observe a positive correlation between higher pairwise L2 distance (among generated students) and overall defense effectiveness. In summary, MorphStrata maintains adversarial robustness as an MTD defense at marginal cost deltas when compared to existing baselines.

02.
arXiv (CS.CL) 2026-06-11

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

作者:

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty – bounding what an agent may claim at termination – as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem – under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds – whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38–1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48–9.81] and 25.05% [22.48–27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design – an honest stall is recoverable; a confident wrong output shipped downstream is not.

03.
arXiv (CS.LG) 2026-06-19

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

arXiv:2606.19408v1 Announce Type: new Abstract: Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.

04.
arXiv (CS.LG) 2026-06-18

KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction

arXiv:2506.13196v5 Announce Type: replace Abstract: Accurate prediction of protein-ligand binding affinity is critical for drug discovery. While recent deep learning approaches have demonstrated promising results, they often rely solely on structural features of proteins and ligands, overlooking their valuable biochemical knowledge associated with binding affinity. To address this limitation, we propose KEPLA, a novel deep learning framework that explicitly integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two complementary objectives: (1) aligning global representations with knowledge graph relations to capture domain-specific biochemical insights, and (2) leveraging cross attention between local representations to construct fine-grained joint embeddings for prediction. Experiments on two benchmark datasets across both in-domain and cross-domain scenarios demonstrate that KEPLA consistently outperforms state-of-the-art baselines. Furthermore, interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into the underlying predictive mechanisms.

05.
arXiv (CS.LG) 2026-06-19

Towards Graph-Based Deep Learning for Map Generalization: Insights from Building Footprints Simplification and Aggregation

arXiv:2606.19956v1 Announce Type: new Abstract: Map generalization remains one of the fundamental tasks in cartography, especially for the simplification and aggregation of complex building footprints. This study presents the first exploratory application of graph-based deep learning to both tasks, reformulating simplification as node movement prediction and aggregation as link prediction within a unified graph learning framework. We evaluate representative graph neural network architectures (GCN, GAT, and GraphSAGE) on multi-scale building datasets, showing that GraphSAGE demonstrates relative strengths in link prediction accuracy, while also revealing persistent challenges in precise node movement prediction. Beyond quantitative performance, the results highlight that aggregation poses greater complexity and challenges than simplification, underscoring the difficulty of capturing higher-level spatial relationships in map generalization with current deep learning approaches. Although limitations such as data imbalance and the need for post-processing remain, the study provides valuable insights and methodological directions for advancing automated map generalization with deep learning approaches.

06.
arXiv (CS.AI) 2026-06-12

Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization

arXiv:2606.13658v1 Announce Type: new Abstract: This paper examines three recent frameworks for understanding the cognitive and epistemic consequences of artificial intelligence: Tri-System Theory, Thinkframes, and System 0. It argues that while the first two capture important dimensions of AI's influence on individual reasoning and collective epistemic practices, System 0 occupies a theoretically distinctive position that neither can fully replicate. The paper introduces the concept of cognitive colonization, according to which AI systems can embed external interests within the architecture of the self in ways that are difficult for users to perceive. Because such systems are already widely deployed, understanding these invisible forms of influence is an urgent philosophical and practical task.

07.
arXiv (CS.AI) 2026-06-25

Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

arXiv:2606.25325v1 Announce Type: new Abstract: We find that current emotion-oriented Omni-MLLMs still lack reliable omni-modal perception: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose OPPO (Omni-Perception Policy Optimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce MEP-Bench, a diagnostic benchmark that quantifies utilization and faithfulness. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and MME-Emotion, while substantially improving utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning.

08.
arXiv (CS.CL) 2026-06-12

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs, then trains an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

09.
arXiv (CS.CV) 2026-06-24

Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models

Reducing visual token redundancy is critical for accelerating Multimodal Large Language Models (MLLMs) without degrading cross-modal reasoning performance. Existing token pruning methods typically rely on single-layer signals, such as attention scores or token similarities, which overlook the cross-layer transformation of visual representations and may exhibit positional bias in multimodal token sequences. To address this limitation, we propose a training-free token pruning framework based on Cross-Layer Spectral Evolution (CLSE). Instead of measuring token importance from single-layer feature magnitudes, CLSE quantifies how token representations evolve across Transformer layers in the frequency domain. This evolution reflects the transition from high-frequency structural details to low-frequency semantic abstractions. We observe that tokens with stronger spectral redistribution across layers are more likely to be semantically active and should therefore be preserved. By modeling cross-layer token dynamics, CLSE provides a stable importance criterion that mitigates positional bias. Extensive experiments on both image and video benchmarks demonstrate that CLSE achieves a superior trade-off between efficiency and accuracy under aggressive token reduction. Across multiple MLLMs, CLSE reduces FLOPs, KV cache memory, and latency while maintaining competitive or improved performance.

10.
arXiv (CS.CV) 2026-06-25

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.

11.
arXiv (CS.AI) 2026-06-16

Shachi: A Modular, Controllable Framework for LLM-Based Agent-Based Modeling of Emergent Collective Behavior

arXiv:2509.21862v3 Announce Type: replace Abstract: How collective behaviors emerge from the interactions of individual LLM-driven agents is a central question in artificial life, yet controlled study of these emergent dynamics has been hindered by the lack of a principled simulation framework for systematic experimentation. To address this, we introduce Shachi, a principled methodology and modular framework that decomposes an agent's cognition into core components: Configuration for intrinsic identity, Memory for contextual continuity, and Tools for extended capabilities, all orchestrated by an LLM reasoning engine. This decomposition treats each cognitive component as an independently controllable variable, enabling perturbation studies that trace how micro-level cognitive traits propagate into population-level dynamics. We investigate behavioral patterns across a 10-task benchmark spanning three levels of collective complexity. Shachi enables memory transfer across environment transitions, producing history-dependent behavioral shifts, and allows agents to simultaneously inhabit multiple environments, revealing cross-environment interference invisible in single-environment studies. Furthermore, in a real-world U.S. tariff shock case study, locally interacting agents with individually controlled cognitive components produce macro-level market dynamics directionally consistent with observed real-world outcomes. Our work provides a rigorous, open-source simulation framework for LLM-based ABM, aimed at fostering cumulative scientific inquiry into the emergent collective behaviors of interacting artificial agents.

12.
arXiv (CS.LG) 2026-06-25

Do Prompt-Elicited Trajectories Reflect Training-Time Reward Hacking? A Systematic Study on Monitoring Trainig-Time Reward Hacking in Code Generation

arXiv:2604.23488v2 Announce Type: replace Abstract: Reward hacking in code generation, where models exploit evaluation loopholes to obtain high reward without correctly solving the intended task, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies often rely on explicitly prompted hacking trajectories, but it remains unclear whether monitors trained on such data can detect reward hacks that arise without direct hacking instructions during RL training. In this work, we introduce Trace-and-Amplify, a framework for scalable curation of reward-hacking trajectories that arise during RL training without explicit hacking instructions. The framework uses unit-test tracers to identify hacking solutions when they occur and retains such trajectories for monitor training and evaluation. Through controlled comparisons between monitors trained on prompt-elicited hacking trajectories and training-time reward-hacking trajectories collected by Trace-and-Amplify, we find that (1) prompt-elicited-data-trained monitors often fail to generalize to trajectories curated by our framework, and (2) monitors trained on our Trace-and-Amplify trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that prompted reward hacking data may not fully reflect training-time reward-hacking behaviors, and that relying solely on these data can lead to misleading conclusions. Codebase is available at https://github.com/LichenLillc/CoTMonitoring.git

13.
arXiv (CS.AI) 2026-06-16

A Multi-Level Architecture for Reusable Materials Ontologies – The OntoCrafter Ceramics Ontology (OCO) as Reference Implementation

arXiv:2606.14814v1 Announce Type: cross Abstract: The Materials Science and Engineering ontology landscape is fragmented along multiple axes simultaneously. Horizontally: a recent survey identified 94 ontologies of which over 40 are structurally incompatible; each new application domain – ceramics, polymers, batteries, smart materials – typically restarts ontology design from scratch. Vertically: EU regulation (CSRD, CSDDD, PPWR, CBAM, R2R, AI Act, ESPR) forces material, manufacturing, supply-chain, and lifecycle data into integrated digital product passports, leaving ontologies that only address horizontal fragmentation incomplete for any contemporary consumer. And mechanistically: a vocabulary that records that BNT-BT has $d_{33} \approx 580$ pC/N stores a fact but cannot surface why – Bi-6s$^2$ lone-pair stereo-activity, anomalous Born effective charges, soft modes, defect chemistry – without a systematic explanation skeleton. We propose a multi-level modular architecture with two independent classification axes – level of abstraction (L0 bridges, L1 material-agnostic laboratory-notebook, L2 material-class-specific, L3 categorical reasoning) and consumer audience (material vs. compliance) – in which the material-specific level is internally organised by a seven-tier mechanistic-explanation skeleton (Symmetry, Energy/DFT, Thermo/CALPHAD, Kinetics, Microstructure, Defect chemistry, Bonding) applicable to any crystalline ionic oxide. The level-and-audience modularity dissolves the horizontal fragmentation, the compliance audience absorbs the vertical regulation pressure, and the seven-tier organisation of Level 2 delivers the mechanistic explanation depth. We instantiate the architecture as the OntoCrafter Ceramics Ontology (OCO v0.94): 5,196 classes across 44 modules; 167,348 OWL axioms (40,454 logical); 1,674 properties; 829 cross-ontology bridge mappings; 1,172 SHACL shapes; 163 published competency questions.

14.
arXiv (CS.CL) 2026-06-11

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.

15.
arXiv (math.PR) 2026-06-19

Establishing an $\Omega(\sqrt{d})$ complexity lower bound for PDMP samplers and how to break it: a sub-$\sqrt{d}$ algorithm for Gaussian-tailed targets

arXiv:2606.19909v1 Announce Type: cross Abstract: Despite the theoretical appeal of their non-reversibility, to date, no Piecewise Deterministic Markov Process (PDMP) samplers have been developed that scale better than $\mathcal{O}(\sqrt{d})$ in computational complexity with respect to the target dimension $d$. We prove that this is a fundamental limitation by establishing an $\Omega(\sqrt{d})$ lower bound on the algorithmic complexity of PDMP samplers in a standard setup. By relaxing the assumption that the target density must remain invariant at all continuous times, we then demonstrate how to bypass this barrier. Specifically, we introduce a novel PDMP sampling scheme and show that it achieves an empirical complexity of $\mathcal{O}(d^\alpha)$, where $\alpha \in [0.2, 0.3]$ for Gaussian-tailed targets. In addition, this PDMP scheme is locally adaptive in both trajectory length and distance between velocity updates.

16.
bioRxiv (Bioinfo) 2026-06-16

A Transformer-derived transcriptomic score associates with ex-vivo drug response in AML

Background Drug-tolerant persister (DTP) cell states have been implicated in relapse across multiple cancers, including acute myeloid leukaemia (AML) [1,2]. Methods that score such states from transcriptomic data, generalise to held-out samples, expose calibrated probability outputs, and link predictions to candidate biology are useful for prioritising follow-up experimental work. Existing transcriptomic methods for scoring drug-tolerant or persister-like states largely rely on fixed gene signatures or general-purpose cell-type classifiers adapted post hoc (scPred, scANVI, scClassify); deep-learning approaches developed specifically for AML drug-tolerant persister scoring with calibrated probability outputs, prespecified thresholds, and transparent external validation against ex-vivo drug-response data are, to our knowledge, lacking. Our approach addresses this gap by combining a Transformer teacher with a knowledge-distilled 1,000-gene student, prespecified threshold {tau} = 0.31, and direct evaluation against BeatAML drug-AUC. Our in silico approach aims to fill this gap of non-existent analytical methods to identify and mark the DTP cells. Methods We trained a Transformer classifier on a pooled scRNA-seq corpus of nine samples (six from GSE123902 -lung adenocarcinoma metastasis, normal, and primary tumour [4] -plus three primary AML samples; 32,342 cells, 13,369 common genes), with stratified 5-fold cross-validation at the cell level, a 20% held-out test split, and a prespecified probability threshold selected on out-of-fold predictions. A 1,000-gene student model was trained by knowledge distillation [5]. For every input cell, the student outputs a probability between 0 and 1 (hereafter "the score") representing predicted membership in the positive training class. The trained model was applied without re-tuning to five external or independent application cohorts: 39 primary AML donors[in-house]; GSE74246[6]; BeatAML (n = 452 with linked ex-vivo drug-AUC; n = 405 with overall-survival metadata)[7]; TCGA-LAML (n = 149)[8]; and an in-house n = 10 scRNA-seq cohort with linked survival. Survival and drug-response data were not used during training, threshold selection, or tuning. The score was anchored mechanistically against CRISPR/DepMap essentiality[9], pathway enrichment, and a normal-tissue-filtered surface-protein candidate list (HPA[11], GTEx[12]). To assess concordance between transcriptomic prioritisation and protein-level evidence, each ranked candidate was additionally annotated with two HPA-derived flags: HPA_surface_protein (Yes/No, derived from HPA Protein class and Subcellular location fields, identifying genes annotated as plasma-membrane, GPCR, ion-channel, transporter, receptor, or CD-marker) and HPA_antibody_reliability (Enhanced, Supported, Approved, Uncertain, or Not available, per HPA antibody validation tier). Annotations were merged on HGNC symbol; 248 of 250 candidates (99.2%) matched. Two candidates using the older CORF nomenclature did not auto-match HPA's lowercase convention and were resolved manually. HPA's per-gene RNA-protein numeric correlation is published only on per-gene web pages and not in the bulk download; we therefore used the detection-level and antibody-reliability tiers as the operational concordance filter. Results Cross-validation area under the receiver operating characteristic curve (AUROC) was 0.936 +/- 0.014 (held-out test 0.941, Matthews correlation coefficient (MCC) 0.696, F1-score 0.895). The 1,000-gene student showed Spearman {rho} {approx} 0.96 with the teacher and >85% class agreement at the prespecified threshold. The principal external result was in BeatAML: the score correlated with ex-vivo drug-response AUC across seven AML-relevant drugs, with consistent per-drug Spearman correlations (r = 0.41-0.53, all p < 0.05). The aggregate correlation across 3,164 patient-drug pairs from 452 patients was r = +0.482 and is reported as a summary, recognising that pairs from the same patient are not fully independent. The score did not stratify overall survival in TCGA-LAML or in the in-house n = 10 cohort, in part because predicted high-score fractions saturated. At the prespecified threshold the score did not separate cell types in GSE74246, indicating that absolute calibration is cohort-dependent. Compared against logistic regression, random forest, the LSC17 stemness signature, and a mean-expression baseline on the same gene panel, the Transformer was the most stable model under aliquot-grouped cross-validation and the only one to transfer with strong, positive correlation to BeatAML drug-AUC. The mechanistic candidate-target pipeline produced a 250-candidate ranked surface-protein list (full breakdown in Results); FLT3 and CD33 were recovered from the unbiased ranking as positive controls. Conclusion We present a Transformer-derived transcriptomic score that addresses the lack of validated computational methods for identifying drug-tolerant persister-like states in AML. The score shows external rank-order association with ex-vivo drug response, providing a research-use tool for prioritising candidate persister-associated transcriptional programs for follow-up. Together, these results support the score as a research-use transcriptomic ranking tool for AML drug-response-associated states. The strongest external support comes from the consistent association with BeatAML ex-vivo drug-response AUC. The fixed probability threshold did not transfer reliably across all cohorts, so threshold-based classification should require cohort-specific recalibration. The score is not validated for clinical decision-making and is not proposed as a survival predictor. The candidate-target list is a starting point for functional follow-up. Keywords. AML; ex-vivo drug response; single-cell RNA-seq; Transformer; knowledge distillation; transcriptomic score; BeatAML; surface-protein target prioritisation.

17.
arXiv (CS.AI) 2026-06-24

Beyond Bayer: Task-Optimal Sensor Co-Design for Robust Autonomous-Driving Segmentation

arXiv:2606.24096v1 Announce Type: cross Abstract: Robust perception underpins autonomous driving, and most recent progress comes from scaling the model-larger backbones, foundation models, and cooperative multi-agent fusion. We pursue a complementary, upstream question: what should the camera itself measure? Using a differentiable RAW-to-task pipeline, we decompose which sensor degrees of freedom benefit dense prediction. Learning the spectral colour-filter-array (CFA) weights is the dominant lever, improving mIoU by +0.017 (KITTI-360) and +0.023 (ACDC) over a fixed camera. In contrast, point-spread-function (optics) co-design is net-negative (-0.020 mIoU on KITTI-360) - a consequence of the data-processing inequality, which also bounds the task information that any downstream model, however large or cooperative, can recover. Noise co-optimisation is marginal, and counter to intuition enlarging the CFA tile beyond 2x2 consistently hurts, as the filters are confined to the rank three sRGB input. Because the intervention is at the sensor, the gains are model-agnostic; we validate robustness on ACDC's fog, night, rain, and snow, and conclude with a simple recipe: learn the 2x2 CFA weights and keep an identity PSF.

18.
arXiv (CS.AI) 2026-06-16

Beyond Weights and Gradients: A Taxonomy of Federated Learning Messages

arXiv:2606.16891v1 Announce Type: cross Abstract: Federated Learning is rapidly evolving beyond the exchange of traditional model weights and gradients, yet existing definitions fail to capture the full scope of modern payloads like synthetic data and federated analytics. This paper addresses the gap by proposing a formal mathematical definition of a federated message that accounts for both utility and privacy. We introduce a taxonomy that organizes these exchanges into three categories: model structures, statistical summaries, and data-conditioned representations. By evaluating these groups based on computational demands, communication costs, and privacy risks, we provide a clearer understanding of the trade-offs involved in decentralized training. Our review of 202 recent publications highlights a significant shift since 2021 toward diverse messaging paradigms, signaling a move away from standard deep learning updates toward more specialized information sharing. This framework provides a structured path for future research to optimize federated systems for varying hardware and security requirements.

19.
arXiv (CS.CV) 2026-06-16

CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

We propose CLAD, a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrate them into the diffusion process. This approach exploits that the latent space of diffusion models already has semantics that can be used. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.

20.
arXiv (quant-ph) 2026-06-16

Exact Many-body Quantum Dynamics in One-Dimensional Baths via Collective Spins

arXiv:2505.00588v2 Announce Type: replace Abstract: Computing the exact dynamics of many-body quantum systems becomes intractable as system size grows. Here, we present a symmetry-based method that provides an exponential reduction in the complexity of a broad class of such problems $\unicode{x2014}$ qubits coupled to one-dimensional electromagnetic baths. We identify conditions under which partial permutational symmetry emerges and exploit it to group qubits into collective multi-level degrees of freedom, which we term ''superspins.'' These superspins obey a generalized angular momentum algebra, reducing the relevant Hilbert space dimension from exponential to polynomial. Using this framework, we efficiently compute many-body superradiant dynamics in large arrays of qubits coupled to waveguides and ring resonators, showing that $\unicode{x2014}$ unlike in conventional Dicke superradiance $\unicode{x2014}$ the total spin length is not conserved. At long times, dark states become populated. We identify configurations where these states exhibit metrologically useful entanglement. Our approach enables exact treatment of complex dissipative dynamics beyond the fully symmetric limit and provides a rigorous benchmark for approximate numerical methods.

21.
arXiv (CS.CV) 2026-06-25

Teach-to-Reason: Competition-Guided Reasoning with a Self-Improving Teacher

Chest X-ray visual question answering (CXR VQA) requires models not only to predict correct answers, but also to produce reliable medical reasoning. However, existing reinforcement-learning-based training typically relies on answer-level rewards, which are often too coarse to improve chain-of-thought (CoT) quality and can become ineffective when group-level advantages collapse to zero. We propose Teach-to-Reason (T2R), a framework that introduces comparison-based supervision into CoT optimization through a self-improving Teacher and a competition-guided Reasoner. As the Teacher is iteratively strengthened via self-competition, the Reasoner is optimized against progressively stronger Teacher-generated references. We further introduce a case-wise reward design that preserves the original reward-induced positive/negative partition when it is informative, and restores supervision from competition scores when the original reward signal degenerates. Experiments on multiple CXR open-ended VQA benchmarks show that T2R consistently outperforms strong baselines, indicating that comparison-based supervision, when integrated in a controlled and principled manner, provides a more effective training signal for reasoning optimization.

22.
medRxiv (Medicine) 2026-06-12

Reduced nighttime smartphone use among cohabiting partners: a longitudinal study under the lens of social control of health behaviors theory

Objective: We examined the link between cohabitation with a partner and nighttime smartphone use through the social control of health behavior theory. Background: Nighttime smartphone use is a behavioral risk factor for sleep problems. While previous research has predominantly focused on individual-level risks of sleep disturbances, the role of social context remains underexplored. Theoretical frameworks, specifically the Social Control of Health Behavior, suggest that social relationships regulate health-related behaviors; however, it is unclear how far this regulation extends to modern digital behaviors among couples. Method: We analyzed survey data from three waves of the SmartSleep Study (2018, 2020, and 2023; total N = 25,028), including a longitudinal follow-up subset (N = 1,003). We tested multivariate associations between living with a partner, changes in cohabitation status and frequent nighttime smartphone use by fitting generalized linear mixed-effects models. Additionally, we mapped the complex interplay between indicators of social integration, social support, smartphone use, and sleep quality using hierarchical clustering of non-linear correlations. Results: Cohabiting participants had lower odds of frequent nighttime smartphone use compared to those living alone (OR = 0.66; 95% CI: 0.61, 0.72). This lower risk was driven primarily by cohabitation with a partner (OR = 0.49; 95% CI: 0.36, 0.66). Longitudinal analysis supported these findings, showing that sustained cohabitation was associated with less frequent nighttime use (OR = 0.56; 95% CI: 0.38, 0.82). Clustering analysis revealed that indicators of social integration and support clustered with favorable sleep quality. Conclusion: Our findings suggest that the health-protective effects of cohabitation with a partner extend to digital behaviors. Consistent with social control of health behavior theory, the presence of a partner appears to reduce frequent nighttime smartphone use, highlighting the critical importance of considering social context when addressing digital health hygiene and promoting sleep.

23.
arXiv (CS.AI) 2026-06-11

Planning under Distribution Shifts with Causal POMDPs

arXiv:2602.23545v2 Announce Type: replace Abstract: In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via $\alpha$-vector-based POMDP methods.

24.
arXiv (CS.CL) 2026-06-24

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

作者:

Function-vector (FV) heads are identified by the magnitude of their causal contribution to in-context rule tasks, and the resulting top set is treated as a single functional class. We show this hides a sign structure. Under a sign-preserving criterion (refined direct logit attribution, validated head by head with path patching) the FV population splits into two opposing groups: writers push the rule-correct logit up, cancellers push it down, and ablating both together moves the readout less than the sum of the two. The split is causal and reproducible. It holds in all but two of the fifteen (model, task) cells we test, spanning three architectures and six Pythia scales, and a sign-shuffle null rejects the single-class account in all but one of the six main cells. It is also invisible to magnitude-only ranking, which surfaces whichever group locally dominates and misses the other, so any function vector or ablation built that way silently averages a promoting and a suppressing mechanism. Cancellers are not attention sinks, induction heads, or copy-suppression heads, and their causal effect is larger than that of magnitude-matched non-FV controls. Zero-ablating them recovers $+0.13$ to $+0.29$ nats on the correct label in every main cell, and shifts accuracy by $+2$ to $+7$ pp in the same direction.

25.
arXiv (CS.AI) 2026-06-11

A New Perspective on Precision and Recall for Generative Models

arXiv:2511.02414v3 Announce Type: replace Abstract: With the recent success of generative models in image and text, the question of their evaluation has recently gained a lot of attention. While most methods from the state of the art rely on scalar metrics, the introduction of Precision and Recall (PR) for generative model has opened up a new avenue of research. The associated PR curve allows for a richer analysis, but their estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics of the literature which by design are restrained to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.