Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
Nature (Science) 2026-06-10

Mutation-dependent responses to sleep and exercise in clonal haematopoiesis

Clonal haematopoiesis (CH) activates inflammation and increases the risk of atherosclerosis1,2. Whether lifestyle alters CH clone expansion or the phenotypic programming of CH mutant cells, thereby affecting atherosclerosis, is unknown. Here, in humans and mice and across mutations in Jak2, Tet2, Trp53 and Dnmt3a, we demonstrate mutation-dependent responses to sleep and exercise in CH and show that mutant cells are uniquely sensitive to lifestyle. In two human datasets, moderate-to-vigorous physical activity was associated with lower prevalence of non-DNMT3A-driven CH. In atherogenic mice with Jak2V617F or Tet2 loss of function (LOF), but not Trp53 LOF or Dnmt3aR878H CH, uninterrupted sleep or exercise curtails clone expansion. In CH with the Jak2V617F mutation, sleep and exercise reduces clone expansion by selectively reprogramming mutant, but not cohabitant wild type, haematopoietic progenitor cells towards antiproliferative and metabolically healthy phenotypes by tempering bone marrow macrophage–haematopoietic progenitor cell IL-1β signalling. Sleep or exercise also lessens Jak2V617F-driven, Tet2 LOF-driven and Trp53 LOF-driven, but not Dnmt3aR878H-driven, atherosclerosis by locally reprogramming mutant vascular macrophages, independent of peripheral clone dynamics. In Jak2V617F, but not adjacent wild type, aortic macrophages, uninterrupted sleep blunts CLEC4E-dependent inflammasome activation, consequently diminishing lesions. Exercise, meanwhile, activates PAC1+ neurons in the locus coeruleus, raising the levels of peripheral noradrenaline, which signals through adrenergic receptor β2 (ADRβ2) whose expression is preserved by exercise in Jak2V617F, but not cohabitant wild type, aortic macrophages, selectively repressing their inflammatory programming and atherosclerosis. Our findings establish that healthy lifestyles gene-specifically diminish CH and selectively reprogram mutant haematopoietic progenitor cells and macrophages to maintain cardiovascular health. Sleep and exercise can slow clonal haematopoiesis and limit mutant cell-driven atherosclerosis.

02.
arXiv (CS.AI) 2026-06-19

Thermodynamic Measure of Intelligence

arXiv:2606.20231v1 Announce Type: new Abstract: Can intelligence be measured? We propose that intelligence can be defined as the lawful amplification of rare but valid futures: a system increases the probability of outcomes that would be unlikely under passive dynamics but remain admissible under the constraints of the domain. We start with the premise that an intelligent system must model the world and its own place within it. Because the system is part of the world it models, this leads naturally to recursive self-simulation: the system represents futures in which its own actions are part of the trajectory. Our central results give a necessity statement and a conditional near-sufficiency statement connecting this architecture to a precise thermodynamic measure of lawful amplification of rare-valid futures: high rare-valid lift is impossible unless the internal simulation identifies rare-valid futures with high fidelity; conversely, when rare-valid fidelity is high and the simulation contains an effective policy, the achievable lift approaches the actuation-limited optimum. Thus recursive self-simulation is not merely a plausible feature of intelligence but, under the stated assumptions, is necessary and nearly sufficient for high thermodynamic intelligence. The resulting framework makes intelligence measurable on a universal scale, from passive matter and feedback controllers, large language models, and humans as text generators to Maxwell-demon-like information engines.

03.
arXiv (CS.CL) 2026-06-11

When Roleplaying, Do Models Believe What They Say?

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

04.
arXiv (CS.AI) 2026-06-24

LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context

arXiv:2606.24585v1 Announce Type: new Abstract: While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistants, to assess the impact of overrefusal on legal prompts. Surprisingly, we find that authority-style prefixes (``you are acting as an assistant of the national supreme court'', ``[...] defense lawyer'') systematically increase refusal rates by 2–20x over the no-prefix baseline, while a known role-play jailbreak prefix shows mixed effects, sharply increasing refusals in some models and barely shifting them in others. The finding suggests that small on-prem deployable LLMs are unstable under contextual framings that a real institutional user might naturally introduce, and further investigation is essential to minimize opportunities for bias.

05.
arXiv (CS.CV) 2026-06-25

UniTeD: Unified Temporal Diffusion for Joint Perception and Planning in Autonomous Driving

Diffusion models have shown strong potential for multi-modal planning in end-to-end autonomous driving. However, most existing methods confine diffusion to the planning module, conditioning on fixed outputs from separate discriminative perception networks. This decoupled design propagates perception errors to the planner, increasing optimization difficulty and reducing robustness. To overcome these limitations, we propose UniTeD, a Unified Temporal Diffusion framework that jointly models perception and planning through iterative denoising in a shared generative space. By enabling bidirectional information exchange, the framework facilitates mutual refinement between tasks and improves robustness via noise-conditioned multi-task training. We further extend this unified diffusion paradigm to a streaming setting by incorporating temporal context. A Temporal Transition Module (TTM) is introduced to resolve the noise-level mismatch between historical and current frames. In addition, we propose an Anchor Refresh Strategy (ARS) to alleviate the training-inference distribution shift commonly observed in sparse diffusion-based end-to-end driving frameworks. Without bells and whistles, UniTeD achieves state-of-the-art performance across multiple benchmarks, surpassing both recent discriminative end-to-end methods and diffusion-based planning approaches.

06.
arXiv (CS.CV) 2026-06-25

SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation

Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation. Codes and data are publicly available at https://grenoble-zhang.github.io/SymphoMotion/.

07.
Nature (Science) 2026-06-17

Reimagining machine vision with optical computing

作者: 未知作者

A general-purpose artificial-intelligence vision system for use in image-sensing devices has been developed by embedding fundamentals of core computer-vision operations into a light-manipulating planar material called an optical metasurface. A prototype enables accurate, real-time perception and processing across diverse tasks, suggesting that this could be a solution for rapid, low-energy, on-device vision intelligence. A specialized ‘metasurface’ can preprocess incoming scene information on image-generating devices.

08.
arXiv (CS.LG) 2026-06-15

EM-NeSy: Expectation Maximization for Neurosymbolic Learning

arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating the use of approximate inference. We propose EM-NeSy which casts probabilistic NeSy learning as an instance of the Expectation-Maximization (EM) algorithm. In the expectation step, we compute the posterior over the neurally predicted symbols conditioned on the label via probabilistic inference. In the maximization step, we update the neural parameters based on this posterior using gradient descent only through the neural component. This formulation unlocks the full potential of the EM algorithm for NeSy learning. It allows NeSy to extend naturally to approximate reasoning without any additional modifications or differentiability requirements of the symbolic component. Furthermore, it recovers the standard end-to-end gradient-based NeSy setting under exact inference. Our experimental results demonstrate the scalability and computational efficiency of EM-NeSy.

09.
arXiv (CS.CV) 2026-06-16

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

10.
arXiv (CS.CV) 2026-06-16

Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1{,}000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500 M to 19 B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64)suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.

11.
arXiv (CS.CL) 2026-06-12

Epistemic Constitutionalism Or: how to avoid coherence bias

作者:

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

12.
arXiv (CS.AI) 2026-06-17

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

arXiv:2606.17642v1 Announce Type: new Abstract: Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer verification.Across four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model over finance-specialized models and approaches leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at https://anonymous.4open.science/r/FinAcumen

13.
arXiv (CS.CV) 2026-06-11

Feature extraction for plant growth estimation

Precision agriculture requires the estimation of plant growth stages in real-time. When the plant growth stage is known, the wastage of resources in cultivation, such as nutrients and water, is reduced as only the required resources need to be supplied. Plants at different growth stages, however, have similar morphological features, which can make autonomous growth stage estimation difficult. This paper presents two feature extraction methods for growth stage estimation: one that uses a bank of Gabor filters and morphological operations, and the other that uses pre-trained convolutional neural networks (CNNs) and transfer learning. We test these methods on a publicly available plant growth stage dataset (``bccr-segset``) for two species, canola and radish, grown and captured under indoor conditions. The two proposed feature extraction methods are compared, using support vector machines and boosted trees as classifiers. We find that both methods are suitable for real-time applications, and that CNN features outperform the hand-crafted features, both with regard to speed and accuracy. The best system (VGG-19 features, classified with a radial basis function support vector machine) obtained an accuracy of 98.4% for both species, processing an image in 0.08 seconds.

14.
arXiv (CS.AI) 2026-06-11

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

arXiv:2606.11473v1 Announce Type: cross Abstract: Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.

15.
arXiv (CS.AI) 2026-06-16

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

arXiv:2606.07015v2 Announce Type: replace-cross Abstract: While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed isolated: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.

16.
arXiv (CS.CV) 2026-06-15

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

作者:

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.

17.
arXiv (CS.LG) 2026-06-15

Riemannian Metric Matching for Scalable Geometric Modeling of Distributions

arXiv:2606.14334v1 Announce Type: new Abstract: High-dimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose Riemannian metric matching: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the carré du champ operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant cost, amortized inference without explicit kernel construction. Empirically, metric matching rivals or improves the accuracy of $k$-NN-based diffusion geometry estimators, while enabling amortized inference that is up to $400\times$ faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.

18.
arXiv (CS.CL) 2026-06-17

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LLM response quality relied on expert-authored S&P misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 S&P prompts and categorizes them into nine categories covering a wide range of S&P topics. From the S&P prompts, we sampled 450 and performed a thematic analysis to characterize the S&P questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking S&P prompts, where users ask for recommendations, guidance, or specific S&P information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

20.
arXiv (CS.CV) 2026-06-18

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback–Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line of code change. At its core is the teacher score discrepancy to guide the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100–300 steps of finetuning, DFD effectively restores diversity and fidelity on both Wan2.1-1.3B and Cosmos-Predict2.5-2B model, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforms the teacher model.

21.
arXiv (CS.AI) 2026-06-24

Repeated Shared Access Enables Grokking, but Edit Propagation Depends on an Addressable Memory

作者:

arXiv:2606.20737v2 Announce Type: replace Abstract: We study factual edit propagation in a controlled synthetic knowledge-graph QA setting using a 2x2 grid that crosses loop recurrence with shared-memory access: a dense transformer (Dense), a looped transformer (Loop), a dense backbone with shared memory (Dense+Mem), and a looped backbone with shared memory (loop-memory coupling, LMC). The two factors dissociate. For learning, both routes to repeated shared access – looped recomputation and repeated memory rereading – cross the out-of-distribution (OOD) grokking barrier that Dense fails, so repeated shared access is the behavioral regularity, not a specific architecture. For editing, the substrates split along a different axis: applying a single localized factual edit (conditioned on direct success) and measuring 2-hop propagation on a shared pre-edit-correct set, the edit propagates strongly in both memory-bearing cells (LMC 0.78-0.92, Dense+Mem 0.71-0.96) and only weakly in the memory-free ones (Loop 0.04-0.30, Dense 0.00-0.03). The split is along the memory axis, not the loop axis: every memory-bearing seed exceeds every memory-free seed, with no detectable difference between the two memory cells. Crucially Dense+Mem has no recurrence, so the propagating ingredient is an addressable site that an edit can write to and later computation rereads, not loop recomputation; Loop is at best a partial intermediate. The affordance survives coarsening the store (N=128 to N=13): propagation attenuates but the memory/no-memory split persists, so fine granularity buys precision rather than the affordance itself. These results dissociate learning competence from editing affordance – repeated shared access suffices to grok, but edit propagation depends on whether the substrate exposes an addressable memory that the forward computation can write to and later reread, an affordance that loop recurrence provides only partially.

22.
arXiv (CS.CV) 2026-06-17

Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting

Long-term visual acuity (VA) forecasting after anti-VEGF therapy is important for counseling and follow-up planning in diabetic macular edema (DME), yet remains challenging when only early post-treatment findings are available. While prior OCT-based methods mainly focus on short-term response or single-endpoint prediction, multi-horizon VA forecasting from early longitudinal data remains insufficiently under-explored. In this study, we assembled a real-world cohort of 188 anti-VEGF–treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that combines baseline and month-1 OCT features with tabular variables to capture disease status and early treatment response. ReVA integrates spatial OCT attention, dependency-aware tabular encoding, and cross-modal fusion to predict patient-specific long-term VA trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management. Code and pretrained models will be released on https://github.com/nguyenpbui/ReVA.

23.
arXiv (CS.AI) 2026-06-15

TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards

arXiv:2606.13731v1 Announce Type: new Abstract: Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: https://github.com/simonjisu/TwinBI

24.
arXiv (CS.CL) 2026-06-17

Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a security-oriented framework for risk identification, evaluation, and mitigation in a multi-agent GIS system while maintaining adaptability to broader agentic architectures. We test the agentic system of a commercial geospatial partner while developing a modular state-machine-based orchestration framework that abstracts agent behavior into reusable components. We evaluate robustness using a red-teaming framework with an adaptive attacker LLM and a deterministic judge that produces binary outcomes with supporting rationales across multi-turn attacks. We further improve resilience with a prompt optimization framework that treats prompts as structured signatures and injects adversarial demonstrations, enabling systematic security improvements without degrading task performance.

25.
arXiv (CS.CV) 2026-06-19

GenTrack: A New Generation of Multi-Object Tracking

This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: first-a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, second-leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, third-integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, fourth-a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and five-the first ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Simple, Strengthen, and Super, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack