Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-11

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy–compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.

02.
arXiv (CS.LG) 2026-06-18

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

arXiv:2606.19297v1 Announce Type: new Abstract: Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

03.
arXiv (CS.CV) 2026-06-16

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

04.
arXiv (CS.CL) 2026-06-16

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.

05.
arXiv (CS.LG) 2026-06-18

Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks

arXiv:2605.13566v2 Announce Type: replace Abstract: Land Surface Temperature (LST) is a key variable for various applications, such as urban climate and ecology studies. Yet, existing satellite-derived LST products provide either high spatial or high temporal resolution, resulting in a fundamental trade-off between the two. To address this trade-off, we combine observations from a geostationary and a polar orbiting satellite and provide LST fields at high spatial and high temporal resolution (1 km at 15-min intervals). We demonstrate their application for intraday forecasting of LSTs. To estimate LST fields at high spatiotemporal resolution, a U-Net model is trained to map LST fields from SEVIRI/MSG (3 km and 15 min resolution) to LST fields from Terra/Aqua MODIS (1 km, 4 overpasses per day) that are collocated in space and time. The presented model has been trained on LSTs across large European cities with a population exceeding 1 million inhabitants, and achieves an RMSE = $1.92${\deg}C and near-zero bias MBE = $0.01${\deg}C on the hold-out test set. As a second step, we present an LST nowcasting model based on ConvLSTM architecture, trained across downscaled LST fields with forecast lead times of 15 to 75 minutes. The nowcasting model outperforms a persistence and a Climatological Rolling Median benchmarks, with RMSEs of $0.57$ to $1.15${\deg}C for the considered lead times and biases ranging from $-0.1$ to $0.14${\deg}C. An additional validation conducted against independent MODIS overpasses confirms robust performance. Our LST forecast model at high spatiotemporal resolution is directly applicable to operational satellite-based LST monitoring.

06.
Nature (Science) 2026-06-08

Distributed control circuits across a brain-and-cord connectome

Just as genomes revolutionized molecular genetics, connectomes (maps of neurons and synapses) are transforming neuroscience. To date, the only organisms with complete connectomes are worms1–3, sea squirts4, and comb jellies5 (103–104 synapses). By contrast, the fruit fly is more complex (108 synaptic connections), with a brain that supports learning and spatial memory6,7 and an intricate ventral nerve cord analogous to the vertebrate spinal cord8–12. Here we report the first densely-reconstructed adult fly connectome that unites the brain and ventral nerve cord, and we leverage this resource to investigate principles of neural control. We show that effector neurons (motor neurons, endocrine cells, and efferent neurons targeting the viscera) are primarily influenced by sensory neurons in the same body part, forming local feedback loops. These local loops are linked by long-range circuits involving ascending and descending neurons organized into behavior-centric modules. Single ascending and descending neurons are often positioned to influence the voluntary movements of multiple body parts, together with the endocrine cells or visceral organs that support those movements. Brain regions involved in learning and navigation supervise these circuits. These results reveal an architecture that is distributed, parallelized, and embodied, reminiscent of distributed control architectures in engineered systems13,14.

07.
arXiv (CS.AI) 2026-06-16

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

arXiv:2606.15575v1 Announce Type: new Abstract: Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

08.
arXiv (quant-ph) 2026-06-16

Forbidden transitions in superconducting artificial atoms

arXiv:2606.06069v2 Announce Type: replace Abstract: Artificial atoms built from Josephson junctions have become a powerful tool to explore the limits of quantum optics due to their strong coupling to electromagnetic fields and their sensitivity to changes at the single-photon level. This sensitivity to quantum fluctuations complements their metrological and computational use, which are based on the precise oscillating frequency of the underlying supercurrents. We present here a theory for Josephson junctions immersed in electromagnetic fields where focus is shifted from temporal correlations and towards spatial ones. Unlike the commonly used circuit and black-box descriptions, our work is based on a microscopic model that enables systematically accounting for the effect of the spatial and vectorial profile of an electromagnetic field over a junction. As an example of the interactions that emerge in such a setup, we investigate the possibility of driving a junction via a quadrupole transition, using typical experimental parameters in existing devices. With the transition being dependent on the gradient of the electric field – rather than its intensity – the junction can be excited in a region where the electric field vanishes.

09.
arXiv (quant-ph) 2026-06-15

Quantum Horizon: An evaluation of quantum computing as a threat to Bitcoin and Ethereum

arXiv:2606.14484v1 Announce Type: new Abstract: Quantum computing poses a real, broad-based, but bounded and substantially mitigable threat to Bitcoin and Ethereum. We separate the two quantum algorithms that public discussion routinely conflates: Shor's algorithm breaks the elliptic-curve signatures (ECDSA over secp256k1, BLS over BLS12-381) that authorize spending, whereas Grover's algorithm does not meaningfully threaten proof-of-work mining, which is protected by a merely quadratic speedup, fault-tolerant per-operation costs, a square-root parallelization wall, and difficulty adjustment. Folding hardware scaling, the falling resource requirement, a fault-tolerance readiness lag, and expert surveys into a single Monte-Carlo forecast yields a wide, bimodal arrival distribution for a cryptographically relevant quantum computer: about a one-in-six chance by 2035, near 30% by 2040, and about 60% by 2050. Exposure is concentrated and mostly migratable: of Bitcoin's roughly six million quantum-exposed coins only about 2.3 million are irreducibly at risk, while 50 to 65% of Ether sits at key-revealed accounts that can adopt post-quantum signatures. A timely migration beats even an optimistic 2035 machine, so the binding constraint is governance, not technology. A survey of the top twenty cryptocurrencies finds none fully post-quantum. Reproducible models accompany every quantitative claim.

10.
arXiv (CS.AI) 2026-06-18

Towards Multi-Agent-Simulation-Based Community Note Evaluation

arXiv:2606.18268v1 Announce Type: cross Abstract: Community-based fact-checking that relies on cross-consensus is expanding rapidly on social media platforms. However, the delay and low-ratio of cross-consensus community fact-checks rated by human contributors remains a significant challenge. To address this, we first created ComRate, a large-scale dataset comprising 2.5 million community notes and over 209 million ratings sourced from $\mathbb{X}$. We then propose MultiCom, a persona-guided multi-agent rating framework for community note evaluation. MultiCom simulates diverse rater population by clustering contributors in a matrix-factorized rater space and prompting persona agents to generate structured assessments based on the official community notes rating schema. These agents output structured and explainable judgments, such as confidence, agreement signals and reasons. An out-of-fold calibrated aggregation algorithm combines features such as raw votes and diagnostic reason signals for reliable prediction. Extensive evaluations demonstrate that MultiCom outperforms alternative methods, achieving an average accuracy of 84.7% (balanced accuracy 68.3%, macro-F1 60.1%) on the evaluation set.

11.
arXiv (quant-ph) 2026-06-16

Achieving double-logarithmic precision dependence in optimization-based quantum unstructured search

arXiv:2603.26039v3 Announce Type: replace Abstract: Grover's algorithm is a fundamental quantum algorithm that achieves a quadratic speedup for unstructured search problems of size $N$. Recent studies have reformulated this task as a maximization problem on the unitary manifold and solved it via linearly convergent Riemannian gradient ascent (RGA) methods, resulting in a complexity of $O(\sqrt{N/M}\log (1/\varepsilon))$, where $M$ denotes the number of target items and $\varepsilon$ denotes the success probability error. In this work, we adopt the Riemannian modified Newton (RMN) method to solve the quantum search problem, under the assumption that the ratio $ M/N$ is known. We show that, in this setting, the Riemannian Newton direction is collinear with the Riemannian gradient in the sense that the Riemannian gradient is always an eigenvector of the corresponding Riemannian Hessian. This structure removes the overhead of Hessian inversion and allows the proposed RMN method to retain the local quadratic convergence in terms of the error $\varepsilon$. More precisely, we rigorously prove an overall complexity of $O(\sqrt{N/M}+\log\log(1/\varepsilon))$. Furthermore, our approach remains Grover-compatible, namely, it relies exclusively on the standard Grover diffusion and oracle operators to ensure algorithmic implementability, and its parameter update process can be efficiently precomputed on classical computers.

12.
medRxiv (Medicine) 2026-06-17

Treatment of Multi-Drug-Resistant Tuberculosis with Second-Line All-Oral Drugs in Ghana: Incidence of Adverse Events.

Introduction: The treatment of multidrug-resistant tuberculosis (MDR-TB) remains challenging due to the toxicity of second-line medications and suboptimal treatment outcomes. This study aimed to determine the incidence of adverse events and identify factors associated with these events in patients undergoing treatment for MDR-TB with second-line all-oral drugs in Ghana. Methods: This retrospective cohort study reviewed the medical records of 384 MDR-TB patients treated with second-line all-oral drugs at selected health facilities in Ghana, including the Greater Accra Regional Hospital, Eastern Regional Hospital, and Kumasi South Hospital. Data were extracted using the Kobo Collect tool, capturing patient demographics, baseline clinical and laboratory characteristics, treatment regimens, and adverse events. The study period spanned from 2020 to August 2024. Results: The study included a total of 384 MDR-TB patients, with a mean age of 45 years (SD = 15). The majority of patients were male (65.78%), and most were within the 45-64 years age group (33.85%), followed by those aged 25-44 years (31.25%). Regionally, the highest number of cases were reported from the Greater Accra Region (39.06%), followed by the Eastern Region (31.25%) and Kumasi South Hospital (29.69%). Approximately one in four patients (25%) presented with comorbidities, with HIV being the most common (19.5%). The most frequently reported adverse events were diarrhea (14%), dizziness (13.7%), and vomiting (12.3%). Most of these were mild to moderate in severity and tended to decrease as treatment progressed. Severe adverse events, such as leukopenia and acute kidney injury, were rare, occurring in less than 5% of patients. Over the course of treatment, gastrointestinal adverse events such as vomiting and nausea showed a significant decline, indicating possible patient adaptation or improved clinical management. Results from the multivariate Poisson regression analysis revealed that age and comorbidities were significant predictors of adverse events. Patients aged 65 years and above had a 56% lower risk of developing adverse events compared to younger patients (Adjusted Risk Ratio [aRR] = 0.44, 95% CI: 0.25-0.79, p = 0.005). Conversely, patients with comorbid conditions such as diabetes or hypertension were approximately 2.6 times more likely to experience adverse events compared to those without comorbidities (aRR = 2.65, 95% CI: 1.58-4.43, p < 0.001). The effect of sex was not statistically significant after adjustment (aRR = 1.03, 95% CI: 0.70-1.50, p = 0.86). At the end of the treatment period, 74.9% of patients achieved successful outcomes, including both those who were cured and those who completed treatment without being classified as cured. However, 25.1% had unsuccessful outcomes, which included treatment failure, relapse, or death. Conclusion: In conclusion, adverse events are common in the treatment of MDR-TB with second-line All-Oral drugs, with gastrointestinal adverse events being the most prevalent. These findings highlight the importance of monitoring and managing adverse events to optimize treatment outcomes for MDR-TB patients in Ghana.

13.
arXiv (CS.LG) 2026-06-18

Task-Restricted Symmetries in Recurrent Weight Space

arXiv:2606.18457v1 Announce Type: new Abstract: Recurrent networks can contain substantial functional redundancy in weight space: changing a recurrent matrix may leave the input-output rollout nearly unchanged on a task distribution, while similar-scale changes can destroy the same behavior. We study this redundancy in one-layer tanh RNNs using ordered real Schur coordinates. The Schur form separates spectral blocks from directed nonnormal couplings, giving a diagnostic basis for structured ablations that keep the input and readout maps fixed. In a fixed-length copy task, selected nonnormal Schur couplings can be removed with little loss in some trained solutions, whereas other couplings are necessary for accurate autonomous replay. Across flip-flop, sine generation, and context-dependent integration, the loss-preserving ablation profile varies across tasks and trained solutions. These results identify candidate approximate functional invariances, not universal symmetries of recurrent weight space. Schur-coordinate ablations provide a practical diagnostic for which structured perturbations preserve a trained recurrent solution and which ones disrupt its computation.

14.
arXiv (CS.LG) 2026-06-11

Modelling magnetic material properties with uncertainty-aware neural networks

arXiv:2606.11870v1 Announce Type: cross Abstract: Machine learning is increasingly applied to accelerate the discovery of novel materials by exploring large compositional and structural design spaces. Yet, the scarcity of high-quality data and the frequent need for out-of-distribution prediction introduce substantial uncertainty, making the assessment of model reliability essential. In this work, we investigate uncertainty quantification as a means to evaluate model confidence in the context of permanent magnet research. In a first study, we benchmark classical and modern machine learning models for predicting intrinsic magnetic properties, focusing on the quality of their uncertainty estimates. We apply Gaussian negative log-likelihood loss and dropout-based Bayesian approximation as practical strategies for estimating predictive uncertainty. In a second study, we transfer these architectural features for uncertainty estimation to a more complex task: predicting coercivity from microstructural information using a graph neural network. Together, these studies demonstrate that uncertainty quantification not only enhances the trustworthiness of predictions but is also transferable across different modeling tasks.

15.
arXiv (CS.CL) 2026-06-16

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.

16.
arXiv (CS.CV) 2026-06-16

Power Battery Detection

Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and drive more attention into this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at \href{https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD}{PBD5K}.

17.
arXiv (CS.AI) 2026-06-16

IoT-Zoo: A Container-Based Framework for Heterogeneous IoT Device Profiles and Reproducible Traffic Capture

arXiv:2606.15653v1 Announce Type: cross Abstract: The validation of networking and security solutions for the Internet of Things (IoT) requires realistic and reproducible experimental data. However, existing platforms often achieve scalability by replicating a limited set of device types, which restricts profile diversity and fails to capture the heterogeneity of real-world IoT environments. In this paper, we present IoT-Zoo, a container-based testbed designed to support reproducible experimentation through heterogeneous, dataset-driven IoT device profiles. Built upon Containernet, IoT-Zoo automates the deployment of multi-domain scenarios and supports real application protocols such as MQTT and RTSP. The platform provides a single-command interface for environment provisioning and automated traffic capture (PCAP), enabling the generation of consistent traffic baselines and reducing the operational effort required to evaluate networking and security solutions.

18.
arXiv (CS.CV) 2026-06-19

FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.

19.
arXiv (CS.CL) 2026-06-11

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.

20.
arXiv (CS.AI) 2026-06-19

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

arXiv:2512.20014v3 Announce Type: replace-cross Abstract: While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

21.
arXiv (CS.AI) 2026-06-18

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

arXiv:2606.18557v1 Announce Type: new Abstract: A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

22.
arXiv (quant-ph) 2026-06-16

Quantum Measurement and Continuous Markov Processes

arXiv:2606.15958v1 Announce Type: new Abstract: These are the lecture notes for a course on diffusive quantum measuring instruments. They were prepared and delivered at the Perimeter Institute on Mondays and Thursdays, from 2:30 to 4:00 PM, beginning October 27th, 2025 and ending December 11th, 2025. These lectures were recorded and can be found at https://pirsa.org/c25038.

23.
arXiv (CS.AI) 2026-06-15

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

作者:

arXiv:2606.14589v1 Announce Type: cross Abstract: LLM agent systems increasingly run as long-lived autonomous runtimes: scheduling jobs, calling tools, maintaining memory, and pushing results to humans. We present a longitudinal study of silent failures in one such system: a personal-assistant agent runtime in continuous production since March 2026, with roughly 40 scheduled jobs, 8 LLM providers, a tool-governance proxy, and a knowledge-base memory plane, defended by 4,286 unit tests and 827 governance checks. Over eight weeks we documented 22 incidents with full root-cause postmortems, in which one meta-pattern – a failure whose error signal never reaches a human in actionable form – manifested at least 28 times. We derive a five-class, mechanism-oriented taxonomy: (A) environment and platform quirks, (B) design-assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication, (E) operational omission and forensic blind spots. Class D is unique to LLM systems and the most dangerous: the system does not merely fail to report an error – the LLM transforms it into fluent, plausible narrative delivered to the user. We term this fail-plausible: gray failure's differential observability escalated – the observer is not just blind, it is convincingly lied to by the failure itself. Three findings: about 70% of silent failures were caught by human user-view observation, not tests or audits; a retrospective audit of 15 incidents found 0% ex-ante prevention but 87% regression blocking – audits are regression engines, not prediction engines; incident latency (13 hours to 60 days) tracks failure mechanism, not code complexity – the longest-lived failures lived in the seams between components, where no test runs. We describe the resulting defense framework and distill design principles for agent systems whose failures are loud, attributable, and boring. All postmortems and artifacts are public.

24.
arXiv (quant-ph) 2026-06-19

Sparse positive maps on qutrits with exact nondecomposability thresholds and PPT-entanglement transitions

arXiv:2606.19765v1 Announce Type: new Abstract: We study a family of sparse positive maps on qutrits for which positivity, decomposability, and PPT entanglement can all be analysed explicitly. The block structure of the associated Choi matrices reduces positivity to a Hermitian biquadratic form and leads to exact positivity boundaries for three representative parametric families. For the same families we determine the exact transition between decomposable and non-decomposable maps and construct associated PPT states of two classes. The first consists of witness-adapted deformations naturally tied to the non-decomposability analysis. The second consists of analytically tractable families whose full PPT-entangled branch is detected by fixed positive maps, yielding exact thresholds between separability and bound entanglement. For the trace-preserving subclass, we further compare positivity with a recent eigenvalue bound for 2-positive maps, thereby making the gap between positivity and higher-order positivity fully explicit within this family.

25.
PLOS Medicine 2026-05-14

First-trimester nonsteroidal anti-inflammatory drugs exposure and risk of major congenital malformations: A retrospective register-based cohort study

by Ariel Avraham Hasidim, Itamar Ben Shitrit, Daphna Idan, Tal Michael, Amalia Levy, Gali Pariente, Eitan Lunenfeld, Sharon Daniel Background Pain and fever are common in early pregnancy, yet their management poses a major clinical dilemma. Although not confirmed, recent studies have raised safety concerns regarding acetaminophen. Evidence on the use of nonsteroidal anti-inflammatory drugs (NSAID) in the first trimester remains inconclusive. This uncertainty has left clinicians with limited evidence to guide treatment decisions. This study evaluated the association between first-trimester NSAID exposure and the risk of major congenital malformations (MCMs) in a large, population-based cohort of pregnancies. Methods and findings We conducted a population-based retrospective cohort study within the Southern Israeli Pregnancy Registry (siPREG) project, including all singleton pregnancies of women aged 15–45 years resulting in live births, stillbirths, or elective terminations for fetal malformations at a Soroka University Medical Center between 1998 and 2018. Pregnancies exposed to established teratogens, multiple gestations, and those with documented genetic or chromosomal anomalies were excluded. First-trimester NSAID exposure was defined by pharmacy dispensations (overall and by specific agents). MCMs were identified from linked clinical, hospitalization, and termination records through the first postnatal year.Propensity scores were estimated using covariates selected via a directed acyclic graph, including maternal age, ethnicity, diabetes, medical indication for NSAID use, exposure to other antipyretics, obesity, smoking, folic-acid use, gravidity, perinatal care, and year of pregnancy. Generalized full matching was used to balance covariates. Adjusted risk ratios were derived using weighted Poisson regression with G-computation, and two-way cluster-robust standard errors, jointly clustering by maternal identifier and matching subclass. Sensitivity analyses included a dose–response assessment across defined-daily-dose (DDD) categories and a tipping-point analysis evaluating the impact of potential misclassification from unrecorded over-the-counter NSAID use.A total of 264,858 singleton pregnancies were included in the final cohort; 20,202 (7.6%) were exposed to NSAID, most commonly ibuprofen (5.1%), diclofenac (1.6%), and naproxen (1.2%). NSAID exposure, in total and as individual agents, was not associated with MCMs overall (8.2% versus 7.0%; matched-adjusted-Relative Risk (aRR) = 0.99 (95% CI [0.90,1.10])) or with organ-system-specific MCMs, including cardiovascular (matched-aRR = 1.05 (95% CI [0.92,1.20]), musculoskeletal (matched-aRR = 1.03 (95% CI [0.77,1.39])), central nervous system (matched-aRR = 0.77 (95% CI [0.53,1.11])), cleft palate (matched-aRR = 0.95 (95% CI [0.47–1.91])), gastrointestinal (matched-aRR = 1.03 (95% CI [0.64–1.63])), and genitourinary (matched-aRR = 0.99 (95% CI [0.72,1.35])) malformations. Dose–response analyses showed no significant association with MCMs across cumulative NSAID exposure: short-term (1–7 DDD, matched-aRR = 1.06 (95% CI [0.97,1.15]), medium-term (8–21 DDD, matched-aRR = 1.10 (95% CI [0.99,1.22]), and long-term (>21 DDD, matched-aRR = 1.24 (95% CI [0.94,1.63])). The main limitation was the potential for minor exposure misclassification due to over-the-counter availability of ibuprofen, although sensitivity analyses simulating such misclassification suggested minimal impact on the risk estimates. Conclusion In this large, population-based cohort, we found no evidence supporting an association between first-trimester exposure to NSAID and MCMs, providing reassuring evidence regarding their fetal safety in early pregnancy.