Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-22

Midlife Measures of General Cognitive Performance in the National Longitudinal Study of Adolescent to Adult Health (Add Health)

Objective: The Add Health Cognitive Assessment, Physical, and Sensory Function Protocol (Add CAPS) was developed to assess cognitive, physical, and sensory function in early midlife in a nationally representative sample in the United States. Using Add CAPS, we developed two general cognitive performance measures. Methods: The sample included 2,525 participants from Add Health Wave VI who completed an in- home assessment of cognitive performance. Confirmatory factor analysis (CFA) was used to derive two general cognitive performance (GCP) scores: (1) a five-domain score based on originally designed cognitive domains (Add CAPS GCP), and (2) a modified score aligned with the Harmonized Cognitive Assessment Protocol (HCAP) framework (Add CAPS GCP-H). We evaluated model fit using Root Mean Square Error of Approximation (RMSEA), Standardized Root Mean Square Residual (SRMR), and Comparative Fit Index (CFI) and tested factor scores for criterion validity. Results: Both models showed good fit (Add CAPS GCP: RMSEA = 0.025, SRMR = 0.031, CFI = 0.968; Add CAPS GCP-H: RMSEA = 0.027, SRMR = 0.033, CFI = 0.962), indicating that they adequately represent the underlying GCP construct. Discussion: The Add CAPS cognitive battery captures a robust, hierarchical structure of GCP across alternative domain specifications. The derived factor scores provide a valuable method for characterizing a person's cognitive baseline during midlife. Importantly, the Add CAPS GCP-H enhances comparability with the HCAP network, supporting cross-cohort analyses of cognitive aging.

02.
arXiv (math.PR) 2026-06-18

Stability of Khintchine-type inequalities via log-monotonicity

arXiv:2606.19313v1 Announce Type: new Abstract: We investigate Khintchine-type inequalities for the weighted sums $S=\sum_ka_kX_k$ of independent copies of a symmetric random variable $X$. We show how log-monotonicity of the sequence $r_k(X)=k! \mathbb{E}[X^{2k}]/(2k)!$ implies sharp comparisons between the $L_p$ and $L_2$ norms of $S$ for every even integer $p\geq 2$, extending classic Khintchine-type inequalities and yielding new results in the log-convex setting. We also investigate the stability of our inequalities. Our first stability inequality sharpens the classic inequality by a deviation of the coefficient vector from the coordinate extremizers, while the second quantifies deviation from the Gaussian limit. Our results recover recent stability inequalities for random signs and apply to a broad class of distributions, including type-$\mathscr{L}$ random variables, ultra sub-Gaussian random variables and Gaussian mixtures.

03.
arXiv (quant-ph) 2026-06-19

Accelerated Rydberg electromagnetically induced transparency quantum memory via shortcuts to adiabaticity

arXiv:2603.18399v2 Announce Type: replace Abstract: Electromagnetically induced transparency (EIT) enables coherent light-matter storage, forming the basis of photonic quantum memories that are essential for scalable quantum networks and distributed quantum computing. However, accelerating the storage process violates the adiabatic condition, resulting in the excitation of the lossy intermediate state and a reduction in writing efficiency. We propose and numerically investigate a high-speed, high-fidelity quantum storage scheme by incorporating a shortcut-to-adiabaticity (STA) technique based on counter-diabatic (CD) driving. By introducing a precisely engineered auxiliary field into a conventional EIT system, our protocol significantly shortens the writing time beyond the conventional adiabatic limit while effectively suppressing the transient population of the lossy intermediate state. Furthermore, our scheme demonstrates strong flexibility in pulse design, remaining effective across different temporal profiles of both the control and signal fields. It also exhibits robustness against imperfections in the CD drive. Even with imperfect single-photon writing and non-ideal Rydberg blockade, the scheme retains clear advantages, maintaining high storage performance and overcoming the intrinsic speed-fidelity trade-off of traditional EIT protocols. These features pave the way for fast and robust quantum devices suitable for high-throughput quantum repeaters and advanced quantum information processing.

04.
arXiv (CS.AI) 2026-06-12

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

arXiv:2606.12921v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.

06.
arXiv (CS.CL) 2026-06-16

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.

07.
arXiv (quant-ph) 2026-06-11

An iterative Ising decoder for quantum error correction codes

arXiv:2606.12301v1 Announce Type: new Abstract: The Ising framework maps the decoding problem in quantum error correction onto ground-state optimization of a classical Hamiltonian, in which $X$-$Z$ error correlations enter as cross terms. Under phenomenological depolarizing noise, the exact joint formulation contains up to 8-body interactions for the toric code and 10-body for the $6.6.6$ color code. These high-order terms degrade solver convergence, inflate runtime, and raise the auxiliary spin overhead when embedding into native 2-body Ising hardware. In this work, we propose the iterative low-order decoding (ILOD) algorithm, which alternates between $X$- and $Z$-type sub-Hamiltonians, approximating cross-type correlations through Bayesian priors that reweight each type's couplings using the other type's inferred error configuration. This halves the maximum body count of interaction terms in the Hamiltonian, accelerating the solver, restoring convergence at larger code distances, and reducing the total spin count for 2-body embedding by a factor of $2.5$. For the toric code, ILOD attains a threshold of $4.73%$ versus $4.83%$ for the joint formulation, with the empirical runtime ratio scaling as $(0.81)^d$. For the $6.6.6$ color code, their thresholds agree within statistical uncertainty for small code distances, and ILOD remains convergent for larger distances where the joint formulation fails to converge despite a larger annealing budget.

08.
arXiv (math.PR) 2026-06-16

The existence of invariant sublinear expectations for $G$-SDEs

arXiv:2606.15203v1 Announce Type: new Abstract: In this paper, we study the existence of invariant sublinear expectations of Markovian semigroups on sublinear expectation spaces. To achieve this, we establish a complete metric space of sublinear expectations, on which we extend Harris' method to the nonlinear setting on the convergence of sublinear semigroups. We then explore two cases of $G-$diffusions by studying the Lyapunov function and the local Doeblin condition. One is the $G-$Brownian motion on the unit circle which is the case studied in Feng and Zhao [Zhaonon], but with the new method. Another is the multidimensional $G-$SDEs on the whole space $\mathbb{R}^d$. We establish, for the first time in the literature, the existence of the invariant sublinear expectation for $G-$SDEs under the non-degenerate and weakly dissipative assumption. For this, we prove that for a class of $G-$SDEs, the $G-$expectation can be represented as the supremum of the semigroup of a family of SDEs, of which the regularity is obtained by considering the Bismut-Elworthy-Li formula and the Denis-Hu-Peng representation for the distribution of $G-$Brownian motions.

09.
medRxiv (Medicine) 2026-06-15

Cost-Performance Evaluation of Large Language Models for Aspect-Based Sentiment Analysis of HCAHPS Patient Comments: A Validation Study

Background: Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) free-text comments contain actionable feedback, but timely, scalable, and affordable sentiment analysis remains challenging for health systems that rely on third-party vendors. Objectives: To evaluate cost-performance tradeoffs between a cost-optimized and a flagship large language model (LLM) for aspect-based sentiment analysis of HCAHPS comments, using human inter-rater agreement as a reproducibility benchmark. Methods: We analyzed 512 free-text HCAHPS comments collected from two community hospitals in calendar year 2023. Six trained reviewers (medical students, recent medical graduates, and practicing internists) independently assigned positive, negative, or neutral labels to each comment-aspect pair; the majority label among three reviewers formed the consensus reference standard. Two OpenAI models - GPT-5-nano (cost-optimized) and GPT-5 (flagship) - were prompted in a zero-shot setting via the OpenAI API. We calculated pairwise Cohen's {kappa} to establish a human inter-rater baseline, then compared each model's labels to the consensus using Cohen's {kappa}, accuracy, weighted F1, and per-call cost and latency. Results: Mean human inter-rater agreement was {kappa} = 0.79 (substantial). Both LLMs exceeded this baseline (cost-optimized {kappa} = 0.85; flagship {kappa} = 0.85) with nearly identical accuracy (0.92) and weighted F1 (0.93 vs. 0.93). Performance was strong on positive (F1 ~ 0.97) and negative (F1 ~ 0.90) classes but poor on the underrepresented neutral class (F1

10.
medRxiv (Medicine) 2026-06-11

Validity and Limitations of the Empatica E4 Wristband for Autonomic and Thermoregulatory Sleep Monitoring Against Concurrent Polysomnography: A Wearanize+ Dataset Study

The Empatica E4 wristband provides continuous multi-modal physiological monitoring including blood volume pulse (BVP), electrodermal activity (EDA) and skin temperature (TEMP) but its validity for sleep-stage-specific autonomic and thermoregulatory monitoring has not been systematically evaluated against concurrent polysomnography (PSG). Using the Wearanize+ dataset which provides synchronised PSG, Empatica E4, and Zmax EEG recordings from 100 home-recorded participants; a systematic validation of Empatica E4 physiological signals against PSG ground truth across five sleep stages was conducted. Of 100 participants, 92 had Empatica data; 69 met Zmax EEG signal quality criteria and formed the analysis sample. Heart rate (HR) from the pre-computed Empatica HR channel showed valid stage-specific patterns (Wake: 70.9 bpm, N3: 61.2 bpm) and moderate inter-device MeanNN correspondence with PSG ECG (Spearman r=0.35-0.42 across stages). Skin temperature showed the expected thermoregulatory pattern (Wake: 33.92C, N3: 35.48C) and is recommended for downstream analyses. Tonic EDA showed an inverted stage pattern attributable to wrist sweat accumulation during deep sleep, representing a known confound for wrist-worn EDA during sleep. Phasic EDA showed plausible patterns and may be used with caution. These findings establish a validated feature set for Empatica E4 sleep research and directly inform multimodal psychiatric biomarker studies using the Wearanize+ dataset.

11.
medRxiv (Medicine) 2026-06-22

Leishmaniasis on YouTube: a critical appraisal of the quality, reliability, and transparency of educational content

Background: Leishmaniasis is a neglected tropical disease of significant global public health importance, for which accurate information is essential to support prevention and early care-seeking, particularly in endemic, resource-limited settings. YouTube is a widely used source of health information, but the quality and reliability of leishmaniasis-related content have not been evaluated. We aimed to assess the quality, reliability, and transparency of English-language YouTube videos on leishmaniasis. Methods: We conducted a cross-sectional analysis of YouTube videos retrieved via the YouTube Data API on 15 June 2026 using the terms "leishmaniasis," "cutaneous leishmaniasis," and "visceral leishmaniasis." After applying eligibility criteria and screening the 150 most-viewed eligible videos, 48 videos were included. Two reviewers independently assessed each video using the modified DISCERN (mDISCERN) tool, the Global Quality Score (GQS), and the JAMA benchmark criteria, with disagreements resolved by consensus. Inter-rater agreement was assessed using the intraclass correlation coefficient (ICC), and associations were examined using Spearman's rank correlation. Results: Of 402 videos retrieved, 48 met the inclusion criteria. The median GQS was 3.00 (IQR 2.00-4.00) and median mDISCERN was 3.00 (IQR 2.38-4.50), indicating moderate quality and reliability, while the median JAMA score was 2.00 (IQR 1.00-2.00), reflecting limited transparency; no video met all four JAMA criteria. The overwhelming majority of videos (47/48, 97.9%) were of professional or institutional origin. Inter-rater agreement was good to excellent (ICC 0.883 for GQS, 0.896 for mDISCERN, 1.000 for JAMA). The instruments were strongly inter-correlated (mDISCERN-GQS rho = 0.841, p < 0.001). Quality scores did not correlate positively with views, likes, or video duration; comments correlated weakly and negatively with mDISCERN (rho = -0.337, p = 0.031) and JAMA (rho = -0.381, p = 0.014). Conclusions: YouTube videos on leishmaniasis are of moderate quality and reliability but limited transparency, and are produced almost exclusively by professional sources. Video popularity, length, and age were not indicators of quality. There is a need for experts and institutions to produce clearly authored, well-sourced, and transparent educational content on this neglected tropical disease.

12.
PLOS Computational Biology 2026-06-18

Ten simple rules for turning your qualifying exam into an NIH-style fellowship proposal: A guide for graduate students

by Courtney Peña-Lima, Cameron S. Bader, Brendan K. Ball, Troy C. Dildine, Mekhala V. Dissanayake, Iris van ‘t Erve, Albina Ibrayeva, Amy Nippert, M.K. Quinn, Chelse Spinner, Samuel Thompson, Antonio Tomasso, Crystal M. Botham Qualifying exams, often referred to as “quals” or candidacy exams, are an important milestone in doctoral programs. Although the style of quals varies greatly by program and institution, it is usually a proposal that requires students to develop research ideas as well as their scientific writing skills. Many quals are modeled after funding mechanisms that graduate students can apply to and on a topic that the student will pursue in their dissertation. This paper offers graduate students a step-by-step guide on how to turn their quals into a fellowship-style research proposal, using National Institutes of Health (NIH) mechanisms as a benchmark, as this is the norm within US research institutions. This paper will be most useful for students who have completed or are in the process of completing proposal-based qualifying exams, usually in the second year of a doctoral program.

13.
arXiv (CS.AI) 2026-06-12

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

作者:

arXiv:2606.12969v1 Announce Type: new Abstract: The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions – such as querying knowledge bases or generating work orders – to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

14.
arXiv (CS.AI) 2026-06-16

Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

arXiv:2606.16902v1 Announce Type: cross Abstract: This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. It creates a need for open-source based Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting that has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets with the anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source codes and datasets are publicly available at https://github.com/ndb796/BinaryTracking

15.
arXiv (CS.AI) 2026-06-12

MP3: Multi-Period Pattern Pre-training forSpatio-Temporal Forecasting

arXiv:2606.13119v1 Announce Type: cross Abstract: Spatio-Temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirage: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason lies in the short-window inputs that have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi- Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) The multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck project and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. The experiment on five STGNN baselines across five real-world datasets (including a large-scale dataset CA) verify the effectiveness, superior scalability and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces the MAE 4.7% and the RMSE 5.0%. The code can be available at https://github.com/YAN-outlook/MP3.

16.
arXiv (CS.LG) 2026-06-18

Knockoffs-based False Discovery Rate Control and Simplification for Deep Neural Networks

arXiv:2606.04404v2 Announce Type: replace-cross Abstract: The deep neural network is a widely used framework in machine learning that has been widely applied in various fields. However, deep neural networks often involve a large number of parameters and inputs, many of which may be irrelevant to the goal or true output. These parameters and input variables not only increase computational complexity, but also contribute to additional computational cost. One solution to this problem is knockoff methods, which have proven successful in controlling false discovery rates in high-dimensional regression. Building on the knockoff methods and using the regularised neural network, this paper proposes three variable screening methods under the condition of controlling false discovery rates: one layer filter, multiple layers filter, and variable weight aggregation filter. In comparison with existing algorithms, we find that our algorithms show satisfactory performance.

17.
arXiv (CS.LG) 2026-06-12

Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

arXiv:2606.13532v1 Announce Type: cross Abstract: Cloud-computing relies on large-scale networks which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

18.
arXiv (CS.AI) 2026-06-11

Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

arXiv:2606.11379v1 Announce Type: new Abstract: Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated mediator for human negotiation, implemented as a structured pipeline of LLM modules, that supports pre-mediation in integrative negotiation settings. The pipeline decomposes preparation into specialized modules for dialogue, preference prediction, response-level critique, and structured summarization, separating inference, generation, and evaluation to address limitations of monolithic single-prompt approaches. We use the term "agent" for each module following common LLM-systems terminology, but the components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence. We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario. On short-term self-reported measures, the automated mediator achieves preparation outcomes broadly comparable to human mediators, including trust in the mediator and confidence in reaching mutually beneficial agreements, while achieving substantially lower error on the preference-inference task under our scenario and prompts (36% lower RMSE). A second study shows that targeted prompt refinements reduce excessive affirmation patterns from 36.6% to 16.8%, matching human mediator baselines. Our findings suggest that structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes. The pipeline's single-party design mirrors how human mediators run pre-mediation today and enables parallel deployment across all parties to a dispute, supporting scalability.

19.
arXiv (CS.CL) 2026-06-12

Reward Modeling for Multi-Agent Orchestration

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.

20.
arXiv (CS.LG) 2026-06-15

An Attention-based Model for Robust Forecasting with Missing Modality

arXiv:2606.13970v1 Announce Type: cross Abstract: Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.

21.
arXiv (CS.CL) 2026-06-11

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

22.
arXiv (CS.CL) 2026-06-19

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

24.
arXiv (CS.CL) 2026-06-19

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.

25.
medRxiv (Medicine) 2026-06-15

Quantitative Gait Categorization in Parkinson's Disease with and without Freezing of Gait

Background: Freezing of gait (FOG) is a disabling and often underrecognized feature of Parkinsons disease (PD). Objective gait analysis may improve characterization of this motor symptom. Objective: To compare quantitative 3D gait parameters in PD with FOG (PDF) and PD without FOG (PDNF) in a routine clinical cohort. Methods: We retrospectively analyzed a sequential sample of 180 patients with PD referred for motion analysis between 2020 and 2024. All patients underwent 3D motion capture in the off-medication state. Eighteen gait outcomes spanning pace, rhythm, postural control, variability, and asymmetry domains were derived from steady-state walking tasks. FOG status was determined using physician documentation and Movement Disorder Society Unified Parkinsons Disease Rating Scale (MDS-UPDRS) items. Group differences between PDF (n=99) and PDNF (n=81) were evaluated using independent samples t-tests, with outcomes adjusted for disease duration and corrected for multiple comparisons. A secondary analysis among PDF compared those in Hoehn and Yahr (H&Y) stage [&ge;]III to those in H&Y [&le;]II. Results: PDF had longer disease duration, higher OFF MDS-UPDRS III scores, and higher Hoehn and Yahr stage than PDNF but were similar in age and sex. After adjusting for disease duration and multiplicity, PDF demonstrated reduced step length, stride length, and forward velocity, and greater cadence variability, while most postural control, and asymmetry measures were comparable between groups. Among PDF, advanced H&Y stage was associated with impaired pace and rhythm, similar to previous reports among PD in general. Conclusion: In this large, sequential, clinically referred cohort, FOG was associated with more advanced PD and specific impairments in pace and gait variability. These findings support comprehensive 3D gait analysis as an objective tool to better delineate FOG-related gait abnormalities and identify features that may predict FOG, informing targeted interventions.