Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-11

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

arXiv:2606.11662v1 Announce Type: new Abstract: Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time framework for controlled trial-and-error in deep search. TreeSeeker organizes search as branch-and-return search over tree-structured states, where each branch is a tentative direction for a sub-goal. At each round, TreeSearch reads all sub-goal trees, identifies active goals, and uses textual UCB signals of value, uncertainty, and risk to select among exploiting a promising branch, exploring an uncertain alternative, or pruning an unproductive continuation and returning to an earlier branch point. TreeMem supports this control loop by keeping evidence, uncertainty, conflicts, progress, and failure cues attached to the branches that produced them, so trial outcomes can guide later decisions. Experiments on XBench-DeepSearch, BrowseComp, and BrowseComp-ZH show that TreeSeeker consistently outperforms strong open-source baselines, suggesting that explicit branch-and-return control complements stronger reasoning and tool execution.

02.
arXiv (CS.CV) 2026-06-16

Temporal Difference Learning for Diffusion Models

Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.

03.
arXiv (CS.LG) 2026-06-16

Privacy from Symmetry: Orthogonally Equivariant Transformers for LLM Inference

arXiv:2606.16461v1 Announce Type: new Abstract: Running large language models locally is often impractical, pushing inference on sensitive text to third-party providers. Split inference partially mitigates this by keeping tokens on the client and sending only hidden representations, but these representations can still be recovered via nearest-neighbor search against the public embedding table. We propose an orthogonal obfuscation procedure in which the client multiplies embeddings by a secret orthogonal matrix before transmission. To enable correct inference under arbitrary rotations, we introduce ConjFormer, a transformer variant that is exactly $\mathrm{O}(d)$-equivariant via a lightweight normalization change (scalar RMSNorm) together with blockwise orthogonal conjugation of all linear weights. As a result, the server performs the full forward pass entirely in the rotated basis and never observes unrotated hidden states. Experiments on GPT-2 and Llama 3.2 1B models fine-tuned on PubMed show that orthogonal obfuscation eliminates direct cosine nearest-neighbor inversion and reduces token recovery from over 35% top-10 to at most 1.3%, while increasing perplexity by only 0.4% after fine-tuning. These results indicate that enforcing symmetry at the architectural level can provide a practical defense for privacy-preserving LLM inference without noise injection or heavy cryptographic machinery.

04.
arXiv (CS.CL) 2026-06-18

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.

05.
arXiv (math.PR) 2026-06-11

Second-order PACF asymptotics and discrimination between fractional Gaussian noise and $\operatorname{FARIMA}(0,d,0)$

作者:

arXiv:2605.31416v2 Announce Type: replace-cross Abstract: Fractional Gaussian noise and $\operatorname{FARIMA}(0,d,0)$ have the same long-memory pole $|\theta|^{-2d}$ and hence the same leading PACF law $\alpha(n)\sim d/n$. We show that this agreement breaks at the first non-universal order. For $0

06.
arXiv (CS.AI) 2026-06-19

Towards Engineering Scaling Laws with Pretraining Data Composition

arXiv:2606.19781v1 Announce Type: cross Abstract: Neural scaling laws describe how model performance improves as a power law in compute, model size, and dataset size. While well-established for large language models, these relationships are emerging for large models in particle physics. As with language, empirical studies show that the performance scales as a power law. However, unlike natural language or image domains, fundamental physics has high-fidelity simulators that produce synthetic data cheaply. This favors scaling regimes where additional data is cheaper than additional parameters, and allows the pretraining dataset itself to be engineered to influence the scaling. For the task of classifying hadronic jets produced in collisions of high-energy particle beams, we show that the scaling behavior can be engineered towards requiring more data rather than larger models by inclusion of pretraining data which is more diverse and better aligned with the downstream classification task.

07.
arXiv (CS.CV) 2026-06-19

World Engine: Towards the Era of Post-Training for Autonomous Driving

Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical ``long-tail'' events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.

08.
medRxiv (Medicine) 2026-06-12

The Clinical Characteristics and mortality outcomes of Atrial fibrillation complicating Heart failure with reduced ejection fraction: A prospective study from South Africa

Background: A growing burden of cardiovascular risk factors has raised cardiovascular disease-related mortality in Sub-Saharan Africa (SSA), driving higher prevalence of heart failure with reduced ejection fraction (HFrEF) and its complication with atrial fibrillation (AF). No prospective study has examined AF's clinical impact on HFrEF in SSA. Aim: To determine AF prevalence in HFrEF, describe HFrEF-AF clinical characteristics, and determine AF's impact on mortality. Methods: In this prospective observational study at a tertiary hospital in Johannesburg, 136 HFrEF patients were enrolled and categorised as HFrEF- SR (sinus rhythm) or HFrEF-AF. Baseline clinical characteristics and biochemistry were recorded. Comprehensive echocardiography including left atrial strain by 2D speckle-tracking was performed. Median follow-up was 30.6 months. Results: AF was present in 28 patients (21%). The mean age was 58.7 {+/-} 14.9 years (52.9% male) and differed between groups (p < 0.001). Hypertensive heart disease was the leading cause of HFrEF (36%). Compared with SR, HFrEF-AF patients had poorer health status (KCCQ 27 [16-43] vs 45 [32-60], p < 0.001) and lower left atrial strain (26.2 {+/-} 11.3%, p < 0.001). Guideline-directed medical therapy was suboptimal in the AF group: anticoagulation use was higher than SR (60% vs 9.5%, p < 0.001) but overall inadequate; HFrEF-AF patients received lower median doses of carvedilol (15.6 mg vs 25 mg, p = 0.002) and enalapril (10 mg vs 20 mg, p = 0.004), and fewer received spironolactone (50% vs 75.3%, p = 0.013). Survival was significantly lower in HFrEF-AF (0.41 [0.22-0.61]) versus SR (0.73 [0.61-0.82], p < 0.001). Independent predictors of mortality included prior stroke, lower TAPSE and KCCQ, and higher E/e' and heart rate. Conclusion: AF is common among HFrEF patients in this SSA cohort (though lower than in high-income countries) and associates with worse clinical status, suboptimal therapy, and higher mortality.

09.
arXiv (CS.CL) 2026-06-19

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

10.
arXiv (CS.AI) 2026-06-15

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

arXiv:2606.13802v1 Announce Type: cross Abstract: Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context.

11.
arXiv (quant-ph) 2026-06-11

Collective neutrino oscillations: Many-body non-forward effects and non-classicality

arXiv:2606.12404v1 Announce Type: cross Abstract: Neutrino evolution in dense astrophysical environments is typically described either within a quantum kinetic framework, which neglects the build-up of multi-body correlations, or through simplified many-body calculations that allow significant entanglement to develop. In this work, we compare these two approaches in a simple neutrino-gas configuration, with particular emphasis on the role of non-forward scattering processes. These effects are incorporated either through a collision term in the kinetic description, or by considering the full neutrino-neutrino many-body Hamiltonian. We highlight differences between the two descriptions in both their characteristic timescales and asymptotic behavior. Motivated by the natural suitability of quantum computing for many-body calculations, we further investigate the non-classicality of neutrino evolution, discussing Trotter error scaling, along with the associated costs of constructing quantum circuits in terms of entangling gates and non-Clifford gates. We find that the resources needed for neutrino many-body evolution are on the low end of typical high-energy physics problems and on the mid to high end with respect to quantum chemistry problems. For the full Hamiltonian, resource requirements increase relative to the truncated version. We emphasize the importance of efficient fermion-to-qubit encodings, which are essential for reducing the substantial computational resources required for such simulations.

12.
arXiv (CS.CL) 2026-06-18

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

13.
medRxiv (Medicine) 2026-06-22

Characteristics and Outcomes of Gene-Elusive Dilated Cardiomyopathy

Background and Aims Genetic testing in dilated cardiomyopathy (DCM) guides risk stratification and family screening. Likely pathogenic or pathogenic (LP/P) variants are identified in approximately one-third of patients, leaving many without a genetic diagnosis. Cohort studies suggest that "gene-elusive" patients have a lower risk of adverse events. This study aims to better characterise this group and identify factors associated with adverse outcomes. Methods Consecutive and unrelated DCM patients undergoing genetic testing and returning no LP/P variants were retrospectively recruited and compared to two control cohorts of DCM patients carrying LP/P variants in LMNA and TTN for a primary composite endpoint of end-stage heart failure (ESHF) or malignant ventricular arrhythmia (MVA). Results Among patients without prior MVA, the composite endpoint occurred in 36/423 (8.5%) gene-elusive, 14/39 (35.9%) LMNA and 11/100 (11%) TTN cardiomyopathy patients (log-rank p

14.
arXiv (quant-ph) 2026-06-15

Merged amplitude encoding for Chebyshev quantum Kolmogorov–Arnold networks: trading qubits for circuit executions

arXiv:2603.02818v3 Announce Type: replace Abstract: Quantum Kolmogorov–Arnold networks based on Chebyshev polynomials (CCQKAN) evaluate each edge activation function as a quantum inner product, creating a trade-off between qubit count and the number of circuit executions per forward pass. We introduce merged amplitude encoding, a technique that packs the element-wise products of all $n$ input-edge vectors for a given output node into a single amplitude state, reducing circuit executions by a factor of $n$ at a cost of only 1–2 additional qubits relative to the sequential baseline. The merged and original circuits compute the same mathematical quantity exactly; the open question is whether they remain equally trainable within a gradient-based optimization loop. We address this question through numerical experiments on 10 network configurations under ideal, finite-shot, and noisy simulation conditions, comparing original, parameter-transferred, and independently initialized merged circuits over 16 random seeds. Wilcoxon signed-rank tests show no significant difference between the independently initialized merged circuit and the original ($p > 0.05$ in 28 of 30 comparisons), while parameter transfer yields significantly lower loss under ideal conditions ($p < 0.001$ in 9 of 10 configurations). On 10-class digit classification with the $8\times8$ MNIST dataset using a one-vs-all strategy, original and merged circuits achieve comparable test accuracies of 53–78\% with no significant difference in any configuration. These results provide empirical evidence that merged amplitude encoding preserves trainability under the simulation conditions tested.

15.
arXiv (quant-ph) 2026-06-11

Exact Dynamics of Topological Order Across a CDW–SPT Transition

arXiv:2606.11303v1 Announce Type: cross Abstract: We investigate the nonequilibrium dynamics of a one-dimensional interacting system across a transition from a charge-density-wave (CDW) phase to a symmetry-protected topological (SPT) phase. Starting from a CDW initial state, we study both sudden quenches and slow ramps into the SPT regime. While the CDW order melts under both protocols, the fate of topological order is sharply different. Following a sudden quench, long-range SPT order does not emerge because the post-quench state contains a finite density of excitations above the topological ground state. In contrast, slow ramps allow the system to follow the instantaneous ground state away from the critical region, enabling the buildup of SPT order with deviations governed by Kibble-Zurek defect production. The dynamics is solvable via a unitary mapping to a quadratic fermionic Hamiltonian, allowing us to compute the Loschmidt echo, correlation functions, and string correlator. The Loschmidt rate function exhibits cusps signaling dynamical quantum phase transitions, while the correlation dynamics reveal the contrasting mechanisms governing quenches and ramps across the transition. These results demonstrate that entering the topological regime is not sufficient for the emergence of topological order; the decisive factor is the suppression of excitation production during the evolution.

16.
arXiv (CS.AI) 2026-06-19

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

arXiv:2501.17015v2 Announce Type: replace Abstract: Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

17.
arXiv (CS.CL) 2026-06-11

LifeSentence: Language models can encode human life course trajectories from longitudinal panel data

Forecasting human life outcomes is important to gain insights into how individuals attain long and healthy lives. Conventional statistical approaches yield limited accuracy, potentially due to discarding the sequential structure of the life course. Modern methods such as transformer architectures require large scale training data that most longitudinal panel studies lack. Here we introduce LifeSentence, a model for life-course reasoning that bridges large language models with longitudinal panel data. By representing each life event as a structured natural-language record and instruction-tuning a pretrained 24-billion-parameter language model across an 18-task evaluation taxonomy spanning prediction, robustness and reasoning, LifeSentence supplements panel data with distributional knowledge already encoded during pretraining. Trained on approximately 65,000 individuals from the German Socio-Economic Panel - roughly 45 times fewer than prior transformer-based approaches - LifeSentence outperforms classical and deep learning baselines across all task families, achieving a threefold improvement in joint event-and-timing prediction from best baselines and 91.2% Kendall's tau when reconstructing chronological order from timestamp-stripped event sets. Without explicit supervision, the model recovers documented patterns of social stratification, including the education premium, the gender wage gap and the motherhood penalty, from discrete event sequences alone. A natural-language interface further enables qualitatively new research queries, such as connecting an early-life history to a specified late-life endpoint, establishing LifeSentence as both a predictive tool and a probe for counterfactual exploration of human biographies.

18.
arXiv (CS.CV) 2026-06-17

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions.The project is publicly available at: https://github.com/RSIIPAC/CloudLULC

19.
arXiv (CS.CL) 2026-06-11

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

A social highlighter's most useful signal – which passages a crowd of readers marks – exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies – it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.

20.
arXiv (CS.AI) 2026-06-11

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

arXiv:2509.23248v3 Announce Type: replace Abstract: The rapid advancement of large language models (LLMs) has enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. This integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we systematically review enhancement methods to identify mechanisms suitable for edge adaptation. Subsequently, we present a distributed framework that synergizes reasoning enhancement via adaptive CoT prompting with scalable deployment through a distributed MoE architecture. An important innovation of this approach involves modeling reasoning depth as a dynamic network resource variable, which is optimized jointly with expert activation and transmission power. This mechanism allows the system to dynamically regulate expert networks and reasoning complexity according to task requirements and device capabilities. Experimental evaluations in mobile edge environments demonstrate that the proposed framework effectively balances reasoning quality and resource efficiency. The results show that with less than one second of additional inference time, both accuracy and latency satisfaction rate can reach 90\%, validating the practical viability of deploying sophisticated LLM reasoning in resource-constrained MEGI systems.

21.
medRxiv (Medicine) 2026-06-11

Genetic Susceptibility to Incisional Hernia: Evaluation of Hernia Polygenic Risk Scores

Objectives: Incisional hernia (IH) affects 13-30% of people after abdominal surgery, resulting in substantial morbidity and costs. While clinical risk factors have been studied extensively, genomic risk for IH is incompletely understood. We aimed to evaluate the impact of polygenic risk scores (PRS) on IH risk prediction. Methods] We created and evaluated three PRS for abdominal hernia, ventral hernia and latent hernia susceptibility for prediction of IH in an institutional biobank. The primary outcome was defined as the diagnosis or repair of an IH based on ICD-9/10-CM/PCS and CPT codes. Clinical covariates included age, sex, body mass index (BMI), smoking status, index procedure type, and perioperative surgical site infection. A phenome-wide association study (PheWAS) was performed to assess clinical associations with increased PRS. We then tested the ability of the PRS to improve prediction for IH by modeling clinical covariates with and without PRS in patients who underwent abdominal surgery. Model performance was assessed using 10 iterations of 5-fold cross-validation to estimate Brier scores and area under the receiver operating characteristic curve (AUROC), which were compared using cross-model Bayesian analysis of variance. Results: In 55,809 subjects, assessed PRS was significantly associated with incisional, umbilical, and ventral hernia on PheWAS, with 1.19 greater odds of developing IH per 1-SD increase in PRS (95% CI: 1.13-1.25, P < 0.001). Of 9,909 subjects who underwent qualifying abdominal surgery, 706 developed IH. In this cohort, the latent hernia susceptibility PRS was associated with a 16% increased hazard of developing IH per 1-SD increase (HR 1.16; 95% CI: 1.07-1.26; P < 0.001). Compared to a predictive model using clinical covariates (Brier score = 0.047, 95% CI: 0.046-0.048; AUROC = 0.660, 95% CI: 0.653-0.666), addition of the PRS showed similar Brier score and AUROC estimates (Brier score = 0.047, 95% CI: 0.046-0.048; AUROC: 0.667, 95% CI: 0.661-0.673) at five years. Cross-model Bayesian analysis demonstrated >99% probability of practical equivalence when trying to detect a difference of [&ge;] 0.02. Conclusion: All three PRS for hernia were independently associated with IH, suggesting that genomic factors contribute significantly to IH development. However, none of the three PRS meaningfully improved clinical IH risk prediction in patients who underwent abdominal surgery. This suggests that clinical comorbidities and surgical techniques may be equally as important as genomic architecture.

22.
arXiv (CS.CL) 2026-06-16

HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.

23.
arXiv (CS.CL) 2026-06-15

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained in them we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that with our synthetic data generation approach we can significantly increase the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings in which models must be locally deployed we can ensure data-sovereignty without sacrificing accuracy and without costly annotation campaigns.

24.
arXiv (CS.CL) 2026-06-16

Can LLM Coding Agents Reason About Time Series?

Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.

25.
arXiv (CS.CV) 2026-06-19

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .