Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-18

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance on 16 out of 18 cross-domain benchmarks for RS semantic segmentation.

02.
arXiv (CS.LG) 2026-06-15

Beyond a Single Explanation of the Adam–SGD Gap

arXiv:2606.14259v1 Announce Type: new Abstract: Prior work has identified several factors that can contribute to the performance gap between Adam and SGD, spanning data aspects, architecture design, and optimization properties. Yet these explanations are often studied in isolation, leaving their relative importance unclear. In this work, we revisit these hypotheses through a controlled empirical study across vision, language, genomics, and graph tasks, spanning modern and classical architectures, and carefully designed training setups. Our results suggest that no single factor consistently explains the Adam–SGD gap. For instance, the Adam advantage can (1) persist under a uniform vocabulary distribution yet nearly disappear under a heavy-tailed one; (2) reverse in favor of SGD in softmax-attention models; and (3) become larger under soft architectural modifications, e.g., when ReLU is replaced by a GeLU nonlinearity. This suggests that the gap arises from nontrivial data and architecture interactions, rather than from a single common factor. Yet, we observe a pattern across our settings: a crossover batch size at which the relative advantage shifts from SGD to Adam as the batch size scales. These empirical results are captured by our theoretical gap model, which predicts this batch-size-dependent crossover. Our perspective helps reconcile several existing hypotheses while offering practical insights across domains.

03.
arXiv (CS.LG) 2026-06-11

Family-Aware Residual Architecture for Predicting Quantum Circuit Simulation Performance

arXiv:2606.11620v1 Announce Type: cross Abstract: Approximate tensor-network simulators enable classical simulation of quantum circuits beyond the reach of exact methods, but selecting optimal approximation parameters – such as bond dimension thresholds – remains a costly trial-and-error process. We present a family-aware neural architecture that predicts both the minimum approximation threshold required to achieve target fidelity and the expected wall-clock runtime for quantum circuit simulation, given only the circuit's OpenQASM description and execution context. Our key insight is that quantum circuits from different algorithmic families (e.g., QFT, Grover, VQE) exhibit fundamentally distinct simulation cost profiles due to their differing entanglement structures. We employ family-conditioned residual corrections – additive, family-specific adjustments atop a shared backbone, drawing on established conditional computation techniques – enabling the model to capture both universal circuit properties and algorithmic nuances. The architecture incorporates a pretrained family classifier (97.5% accuracy) and domain-informed algorithm fingerprint features derived from gate-composition heuristics. Evaluated on circuits spanning 7–130 qubits across 10 algorithm families, our system achieves 79.5% exact threshold accuracy (91.2% within one rung) and $R^2 = 0.82$ runtime correlation, with inference completing in approximately 50 ms – replacing trial-and-error simulation runs that may take minutes to hours. Ablation studies confirm that family-aware modeling provides the single largest performance improvement (+3.2 percentage points), validating the hypothesis that algorithm family is a first-class feature for simulation cost prediction.

05.
arXiv (CS.CL) 2026-06-16

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

06.
Nature (Science) 2026-06-11

Daily briefing: Deep-sea whale graveyard is a treasure trove of fossils

作者:

Researchers have uncovered more than 400 fossilized whale bones in an ocean-floor chasm. Plus, the working lives of scientists, in pictures, and how AI could slow the pace of research publication for the better. Researchers have uncovered more than 400 fossilized whale bones in an ocean-floor chasm. Plus, the working lives of scientists, in pictures, and how AI could slow the pace of research publication for the better.

07.
arXiv (CS.LG) 2026-06-17

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

arXiv:2606.18209v1 Announce Type: new Abstract: Dataset distillation (DD) has emerged as a prominent approach in data centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

09.
medRxiv (Medicine) 2026-06-17

Silent Manipulation of Mental Health Treatment Recommendations from a Large Language Model

Importance. Large language models (LLMs) increasingly inform mental health decisions by patients and clinicians. Inference-time activation steering can shift model behavior on a target dimension without altering weights or prompts and without disclosure to users, allowing treatment recommendations to be silently changed for commercial or ideological reasons. Objective. To determine whether directional activation steering can shift an open-weights LLM's depression treatment recommendations. Design, Setting, and Participants. This non-human subjects study applied directional activation steering to an open-weights LLM (DeepSeek V4 Flash) responding to 12 depression-advice scenarios (4 favoring medication, 4 favoring avoidance, 4 neutral), generated at 30 amplitudes from -1.5 to +1.5 in 0.1 increments plus an unsteered baseline. Exposures. A single steering direction contrasting antidepressant medication with self-directed approaches (diet, exercise, meditation, dietary supplements), constructed from 16 paired training prompts and applied at the attention output of every transformer block; weights and system prompt were held constant. Main Outcomes and Measures. The extent to which medication and four self-care categories were addressed, scored 0 to 3 by a human-validated LLM rater (Claude Opus 4.7), the medication-versus-self-care balance, and clinician referral, estimated per unit of amplitude using mixed-effects models with a scenario random intercept. Results. Across 372 generations, steering produced a graded, dose-dependent shift in the medication-versus-self-care balance, which declined by 0.32 per unit of amplitude (beta=-0.32; 95% CI, -0.39 to -0.25; P < .001); medication extent fell and self-care extent rose. The shift was largest for scenarios with no stated treatment preference (beta = -0.44; 95% CI, -0.54 to -0.34; P < .001). A clinician referral appeared in 322 of 372 responses (87%) and did not vary with steering amplitude (P = .63). Conclusions and Relevance. In this open-weights LLM providing depression treatment information, inference-time activation steering shifted treatment recommendations without altering weights, prompt structure, or safety outputs, with the largest effect among users expressing no treatment preference. These findings suggest a need for LLM disclosure standards and independent auditing as such models inform clinical decisions.

10.
arXiv (CS.CL) 2026-06-15

Multimodal Speaker Identification in Classroom Environments

Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.

11.
arXiv (quant-ph) 2026-06-12

To Cool, or Not to Cool? Displacement Sensing with Hot Quantum States

arXiv:2606.13650v1 Announce Type: new Abstract: Quantum-enhanced displacement sensing with bosonic systems is typically formulated assuming that the oscillator is cooled close to its ground state before nonclassical probe preparation. We investigate whether such near-ground-state initialization is necessary, or whether sensitive probes can instead be generated directly from thermal states. We analyze hot quantum probes produced by squeezing, number-raising, and Schrödinger-cat-state generation applied to thermal inputs. We identify two distinct mechanisms by which thermal mixedness can remain compatible with enhanced displacement sensitivity. First, projecting a mixed probe onto a definite parity sector removes the usual thermal suppression of the displacement quantum Fisher information, which can then increase with initial thermal occupation. Second, coherent superpositions of opposite displacements can retain sensitivity through coherence between their displaced components, even when the underlying state is mixed. We use these two mechanisms to classify hot-state protocols according to whether their sensitivity comes from parity selection, coherence between displaced components, or both. Finally, we formulate an experimentally relevant optimization problem comparing initial cooling with direct hot-state preparation under realistic decoherence and show that complete cooling is not universally optimal. Our results establish hot-state engineering as a route to quantum-enhanced bosonic displacement sensing without mandatory ground-state initialization.

12.
PLOS Medicine 2026-05-13

On the evolution of the company we keep: Implications for infectious disease modeling

by Joël Mossong Whom we meet shapes how infections spread. Where earlier focus of mathematical epidemiology was on incorporating age, more recent work has begun to reveal the importance of socioeconomic aspects for understanding and managing future epidemics. In this Perspective, Joël Mossong discusses the importance of understanding social contacts and how they have evolved for infectious disease modeling, and the need to factor in additional considerations such as ethic and socioeconomic backgrounds.

13.
arXiv (CS.CL) 2026-06-11

Models That Know How Evaluations Are Designed Score Safer

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

14.
arXiv (CS.CV) 2026-06-18

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

15.
arXiv (CS.CL) 2026-06-18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

16.
arXiv (CS.LG) 2026-06-18

Complementary Attention Head Pruning for Efficient Transformers

arXiv:2606.19150v1 Announce Type: new Abstract: The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured pruning offers a pathway to compression, existing state-of-the-art methods often rely on gradient-based importance ranking or stochastic gating, which suffer from instability, structural degeneration, and the need for extensive manual hyperparameter tuning. In this paper, we introduce CAHP (Complementary Attention Head Pruning), a novel post-hoc framework that redefines head selection as a global graph-theoretical problem. Rather than evaluating heads in isolation, CAHP utilizes graph-based clustering combined with information-theoretic distance measures to identify and preserve a topologically diverse subset of complementary attention heads. Without requiring a predefined sparsity level or pruning ratio, the framework automatically determines the number of selected attention heads across layers by identifying a diminishing marginal performance curve, where pruning additional heads leads to a sharp degradation in performance, as determined by the chosen polynomial degree. Extensive evaluations on the SST-5 and MNLI benchmarks, across different Transformer model scales, demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression regimes. Furthermore, our structural analysis shows that CAHP avoids the "proximity bias" of gradient-based pruning methods, which tend to preserve heads mainly in layers close to the output, and instead retains a functionally critical set of attention heads in the model's intermediate layers.

17.
arXiv (CS.CV) 2026-06-15

BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2\% mAD on MVTec AD, 80.7\% mAD on VisA and 73.1\% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.

18.
arXiv (CS.AI) 2026-06-15

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

arXiv:2604.14892v3 Announce Type: replace-cross Abstract: Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias, they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.

19.
arXiv (math.PR) 2026-06-11

Exact Fourier dimensions of dyadic Mandelbrot cascades on curves of nonvanishing curvature under minimal integrability

arXiv:2606.11758v1 Announce Type: new Abstract: We prove an exact Fourier-dimension formula for scalar dyadic Mandelbrot cascades pushed forward to fixed C^2 Jordan curves with nonvanishing curvature. Let W be in the minimal Kahane-Peyriere regime, let the scalar dyadic cascade live on T = R/Z, and let gamma map T to R^2 be a fixed C^2 Jordan curve with nonvanishing curvature, parametrized at constant speed. For the push-forward measure mu_gamma, we prove that, almost surely on non-extinction, its Fourier dimension is A_loc(W), the usual local exponent obtained by optimizing over q>1 from the moment expression involving E[W^q]. The upper bound follows from the scalar circle local-dimension theorem, bi-Lipschitz transfer to the fixed curve, and a deterministic curved-support obstruction for Fourier dimension. The lower bound follows from a fixed-curve finite-r annular theorem, which gives summable annular Fourier decay under a single finite moment witness. The main analytic input is a deterministic phase-geometry package for fixed nondegenerate C^2 curves: stationary tubes, derivative bands, and phase-bin coefficient estimates replacing the explicit trigonometric structure available on the unit circle.

20.
arXiv (quant-ph) 2026-06-11

Multipartite reference-frame-independent quantum cryptographic communication

arXiv:2606.12284v1 Announce Type: new Abstract: Reference frame mismatch among communication parties introduces errors in quantum cryptographic protocols. As the number of participants increases, aligning reference frames becomes increasingly difficult, complicating multipartite quantum cryptographic implementations. Here, we theoretically and experimentally investigate multipartite reference-frame-independent (RFI) quantum cryptographic communication using Greenberger-Horne-Zeilinger (GHZ) states. We generalize the bipartite RFI security parameter $C$ to an $N$-party parameter $C_N$ and derive the asymptotic secret key rate expressed solely in terms of experimentally accessible quantities. We analyze the key rate under global and local depolarizing noise models and find that increasing the number of parties $N$ enhances robustness against global depolarizing noise while increasing vulnerability to local channel noise. We also present a proof-of-principle experimental demonstration of four-party RFI quantum cryptographic communication using four-photon GHZ states, confirming the reference-frame invariance of both the $C_4$ parameter and the secret key rate under various reference frame rotations.

21.
arXiv (CS.AI) 2026-06-12

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

arXiv:2606.13405v1 Announce Type: new Abstract: LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges on foundational and capability level and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high impact research domain.

22.
arXiv (CS.CV) 2026-06-12

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE.Project Page: https://baobao0926.github.io/Goal2Pixel/.

23.
arXiv (math.PR) 2026-06-16

Risk or Replace: Efficient Asymptotics for Data-Driven Maintenance

arXiv:2606.14706v1 Announce Type: cross Abstract: Condition-based maintenance (CBM) is an approach that plans interventions for deteriorating systems according to their observed operational state. CBM reduces unplanned downtime and extends usable lifetime. We study a heterogeneous population of components that degrade over time according to a stochastic processes with non-negative and i.i.d. increments that are characterized by component-specific parameters that remain unobservable to the decision maker. We rely on degradation data to estimate these parameters and determine replacement actions at equidistant epochs. The goal is to minimize the long-run average cost, which incorporates fixed replacement costs, failure costs, and operating costs. This problem can be formulated as a high-dimensional partially observable Markov decision process (POMDP), which is generally intractable. We develop a tractable, data-driven CBM policy that estimates the optimal policy of a hypothetical Oracle that has full information of the underlying degradation parameters and call this policy the Estimated Oracle's Optimal Policy (EOP). We introduce a scaling regime where both the failure thresholds and cost parameters increase proportionally, reflecting practical settings in which component lifetimes and maintenance costs are large relative to the time between two consecutive CBM decision moments. We show that the regret of the EOP, defined as the difference between its long-run average cost and that of the Oracle, converges to zero in the scaling regime when the parameter estimator is consistent. Across extensive experiments using both real and simulated data, the EOP achieves very low regret and, whenever the optimal POMDP policy can be computed exactly, a negligible optimality gap.

24.
arXiv (CS.CV) 2026-06-12

Fully Distributed Multi-View 3D Tracking in Real-Time

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 94.3% IDF1 and 93.3% MOTA on WILDTRACK, competitive with state-of-the-art centralized methods, while demonstrating superior scalability by sustaining 30 FPS on 100 cameras with less than 10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

25.
arXiv (CS.AI) 2026-06-19

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

arXiv:2606.20520v1 Announce Type: cross Abstract: Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.