Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (math.PR) 2026-06-15

A random approach to the multibonacci sequence

arXiv:2606.14294v1 Announce Type: cross Abstract: This paper presents a random approach to the multibonacci sequence. We generalise the model introduced by Benjamin, Levin, Mahlburg, and Quinn, which is based on a random tiling method using dominoes and squares that leads to the Fibonacci sequence, and which was extended to the tribonacci case in a previous work by the authors. Our approach employs tiling with linear $k$-ominoes, $k=1,\ldots,s$, combined with specific colouring, to generate a weighted multibonacci sequence. For a natural random variable~$X$ defined by this model, we establish the distribution of $X$ in terms of multibonacci numbers and compute $\mathbb{E}[X] = 2^{s+1}-3$.

02.
arXiv (CS.LG) 2026-06-12

Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

arXiv:2606.12740v1 Announce Type: new Abstract: The convex Latent Optimal Partition (LOP)-l2/l1 approach enables block-sparse signal recovery with unknown partitions but relies on manual hyperparameter tuning. Additionally, numerical instability in differentiating its proximal operator prevents its automatic parameter tuning via Deep Unfolding (DU). To address these limitations, we propose two architectures: a stable framework utilizing implicit differentiation and a flexible variant leveraging Deep Weight Factorization (DWF). The DWF-based approach also supports nonconvex smooth data fidelity terms. Numerical experiments demonstrate that DU-LOP-l2/l1 yields competitive performance and high resilience against impulsive noise.

03.
arXiv (CS.CV) 2026-06-18

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

04.
arXiv (CS.CV) 2026-06-11

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.

05.
arXiv (math.PR) 2026-06-16

Cluster sizes in subcritical soft Boolean models

arXiv:2404.13730v2 Announce Type: replace Abstract: We consider the soft Boolean model, a model that interpolates between the Boolean model and long-range percolation, where vertices are given via a stationary Poisson point process. Each vertex carries an independent Pareto-distributed radius and each pair of vertices is assigned another independent Pareto weight with a potentially different tail exponent. Two vertices are now connected if they are within distance of the larger radius multiplied by the edge weight. We determine the tail behaviour of the Euclidean diameter and the number of points of a typical maximally connected component in a subcritical percolation phase. For this, we present a sharp criterion in terms of the tail exponents of the edge-weight and radius distributions that distinguish a regime where the tail behaviour is controlled only by the edge exponent from a regime in which both exponents are relevant. Our proofs rely on fine path-counting arguments identifying the precise order of decay of the probability that far-away vertices are connected.

06.
arXiv (CS.CV) 2026-06-16

Dual-branch Prompting for Multimodal Machine Translation

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. Our code is publicly available at https://github.com/MentaY/DDP.

07.
arXiv (CS.LG) 2026-06-16

Q-Learning with Fine-Grained Gap-Dependent Regret

arXiv:2510.06647v2 Announce Type: replace-cross Abstract: We study fine-grained gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps. We address this limitation by establishing fine-grained gap-dependent regret bounds for both UCB-based and non-UCB-based algorithms. In the UCB-based setting, we develop a novel analytical framework that explicitly separates the analysis of optimal and suboptimal state-action pairs, yielding the first fine-grained regret upper bound for UCB-Hoeffding (Jin et al., 2018). To highlight the generality of this framework, we introduce ULCB-Hoeffding, a new UCB-based algorithm inspired by AMB (Xu et al.,2021) but with a simplified structure, which enjoys fine-grained regret guarantees and empirically outperforms AMB. In the non-UCB-based setting, we revisit the only known algorithm AMB, and identify two key issues in its algorithm design and analysis: improper truncation in the $Q$-updates and violation of the martingale difference condition in its concentration argument. We propose a refined version of AMB that addresses these issues, establishing the first rigorous fine-grained gap-dependent regret for a non-UCB-based method, with experiments demonstrating improved performance over AMB.

08.
arXiv (CS.CL) 2026-06-12

ProPlay: Procedural World Models for Self-Evolving LLM Agents

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released in https://github.com/antman9914/proplay.

09.
arXiv (CS.CL) 2026-06-11

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop ``Read–Write–Assess–Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

10.
arXiv (CS.AI) 2026-06-15

Hyperdimensional computing for structured querying on tabular data embeddings

arXiv:2606.13871v1 Announce Type: new Abstract: Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution; schema matching; column type detection; and table search, among others. Existing approaches embed rows, columns, or entire tables into a vector space and rely on nearest-neighbor search to retrieve candidate matches. A fundamental limitation of current embedding methods is the lack of interpretable similarity scores: the concrete similarity value between a query and its nearest neighbour carries no intrinsic meaning, making it impossible to determine whether that neighbour is a true match or simply the least-dissimilar item in a corpus that contains no valid answer. This inability to set principled thresholds for retrieval undermines practical deployment, particularly for zero-match detection. We investigate the use of HyperDimensional Computing (HDC), specifically the Holographic Reduced Representations (HRR) model, as a framework for tabular row embeddings when the retrieval task corresponds to answering structured select-project queries in vector space. Exploiting the algebraic properties of HDC operations, we derive closed-form expected similarity values for both equality and non-equality retrieval predicates, which converge to interpretable values as dimensionality increases, and use these to identify suitable retrieval thresholds. We evaluate HDC against EmbDI, a graph-based baseline, on two real-world datasets across varying table sizes and predicate lengths. Our results show that HDC matches or outperforms EmbDI for row retrieval across all configurations, handles non-equality predicates more robustly, and achieves perfect attribute projection accuracy at sufficient dimensionality – while uniquely enabling reliable identification of zero-match predicates through its principled thresholds.

11.
arXiv (CS.CL) 2026-06-11

Toward Preference-aligned Large Language Models via Residual-based Model Steering

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to models aligned with DPO and SimPO, they perform better with great time-savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.

12.
arXiv (CS.AI) 2026-06-17

Handling Feature Heterogeneity with Learnable Graph Patches

arXiv:2606.17667v1 Announce Type: cross Abstract: In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

13.
medRxiv (Medicine) 2026-06-15

Recruitment, Retention Approaches and Community Engagement in the THRIVE pilot Trial: Lessons Learned from a Food is Medicine Trial

Background: Recruitment of underrepresented populations, including Black and Hispanic populations, for Food is Medicine (FIM) and cardiovascular trials, may pose significant challenges. Methods: We implemented a multi-component recruitment approach for the THRIVE (AdapTive personalized dietitian coacHing and messaging with pRoduce prescrIptions to improVE healthy dietary behaviors) pilot trial to engage primarily Black and Hispanic adults in a Food is Medicine for hypertension intervention. The recruitment approaches included community engagement at approximately 40 community events (cultural festivals and neighborhood gatherings); partnerships with 8 community and faith-based service hubs and food distribution sites; recruitment through safety net primary care clinics, digital outreach via the study website, and social media campaigns; and direct recruitment at places of worship. We report lessons learned from the community engagement process, recruitment efficiency, representativeness, and retention outcomes. Results: Within 6 months, the enrollment target was exceeded by 40%, with an accrual index of 1.04. Over 1,000 individuals were reached through the direct-to-community engagement process, while faith-based partnerships engaged about 900 adults. There were 2,673 visits to the study webpage, and social media achieved 12,259 impressions with 399 clicks. About 95% of participants resided within 10 miles of the faith-based recruitment sites. Face-to-face engagement at the food distribution sites within faith-based organizations or community service hubs outperformed digital methods. Faith leader endorsements and follow-up in-person meetings (following unsuccessful email outreach) dramatically increased recruitment. Regarding retention, pre-randomization attrition was 6%, and 82% of participants completed the study. Conclusion: Culturally tailored, community-engaged recruitment grounded in faith-based and local community partnerships, was highly effective in engaging Black and Hispanic populations in this FIM cardiovascular trial. This provides a replicable model for implementing equitable and sustainable cardiovascular health interventions.

14.
arXiv (CS.AI) 2026-06-19

SoftSkill: Behavioral Compression for Contextual Adaptation

arXiv:2606.20333v1 Announce Type: new Abstract: Agent skills are commonly deployed as natural-language Markdown files that encode answer policies, evidence-use habits, and task procedures. These files are readable and portable, but they are consumed indirectly: for each task instance, a frozen language model must translate a long textual artifact into generation-time behavior. This paper asks whether a natural-language skill can instead initialize a compact continuous context object, refined by a trainable soft delta while the base model remains frozen. We propose SoftSkill, a frozen-backbone method that tunes such soft skills with next-token prediction and deploys them as latent behavioral priors at inference time. In our main single-round setting, a length-32 SoftSkill prefix on Qwen3.5-4B improves over no-skill prompting by 8.3 points on SearchQA, 42.1 points on LiveMath, and 1.3 points on DocVQA. Relative to SkillOpt, SoftSkill improves accuracy by 5.2 points on SearchQA and 12.5 points on LiveMath, while replacing hundreds to thousands of Markdown skill tokens with a few virtual tokens. We further study agentic execution as a harder boundary case, where sparse trajectory imitation provides useful signal but does not yet robustly compress long-horizon procedural behavior. More broadly, the results suggest that some task skills are better treated not as additional Markdown to be reinterpreted at inference time, but as compact latent controls over how a frozen model enters the task.

15.
arXiv (CS.AI) 2026-06-12

Multiagent Protocols with Aggregated Confidence Signals

arXiv:2606.13591v1 Announce Type: new Abstract: Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.

16.
medRxiv (Medicine) 2026-06-18

MOSAIC: Methylation-Oriented Site Analysis and Information Classifier for Robust Epigenomic Classification of Acute Leukemia in Clinical Cohorts with Variable Tumor Purity

DNA methylation-based classification offers a rapid diagnostic complement to conventional molecular workflows in acute leukemia. Existing classifiers are trained on array-derived reference cohorts whose construction favors specimens with adequate tumor content, leaving clinically relevant low-purity specimens underrepresented and classifier robustness in this regime uncharacterized. On held-out low-purity specimens, existing classifiers were concordant with expert pathology in only 7 of 10 (MARLIN) and 5 of 10 (ALMA) cases, motivating a classifier built to maintain accuracy at low tumor purity. We developed MOSAIC (Methylation-Oriented Site Analysis and Information Classifier), a neural network classifier built to maintain accuracy across the full range of tumor purities encountered in clinical practice. MOSAIC is a neural network trained on publicly available array-based methylation data augmented with native methylation calls from Oxford Nanopore sequencing. MOSAIC was evaluated on low-purity specimens held out entirely from training. On these held-out low-blast leukemia specimens, all below 25% blasts and including a case at 1.4%, MOSAIC was concordant with expert pathology in every case, recovering the correct subtype where diluted disease signal would otherwise be mistaken for normal or unrelated tissue. Gradient-based saliency analysis showed that the network relies on a partially distinct set of discriminative CpG probes when classifying low-blast specimens. MOSAIC demonstrates that augmenting training with clinically representative clinical specimens yields methylation-based leukemia classification that maintains effectiveness under the variable tumor purity of real clinical cohorts.

17.
arXiv (CS.AI) 2026-06-16

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

arXiv:2606.01365v2 Announce Type: replace Abstract: Failure-aware observability diagnoses wasted computation in multi-agent LLM systems before final-answer evaluation can explain what went wrong. We propose a trace-based framework for a three-agent architecture – orchestrator, search agent, and execution agent – that converts structured events into online signals for loops, budget pressure, low information gain, and tool instability, then adds offline semantic grounding metrics and selective LLM-as-judge evaluation. On 165 GAIA validation traces under identical caps, 98 runs produce usable final answers and 67 fail or stop without one. Among warned failed runs, 58.1% of tokens are spent after the first warning on average, indicating substantial opportunity for intervention. A 10-task Level-2 pilot uses warnings to diversify search or require evidence, reducing post-warning token fraction from 0.638 in the baseline to 0.304. The results support a layered design: cheap online signals help the orchestrator redirect or halt redundant behavior, while deeper semantic checks identify whether completed answers are grounded enough to trust.

18.
arXiv (CS.AI) 2026-06-17

Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era

arXiv:2605.29179v2 Announce Type: replace-cross Abstract: Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisely engineered to capture and release water in arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advancements such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity, while preserving stability and crystallinity. Furthermore, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery process through predictive synthesis, inverse design, and elucidating synthesis-structure-property relationships for the next generation of MOF water harvesters.

19.
PLOS Computational Biology 2026-06-05

Heuristic multi-site optimization for protein sequence design using Masked Protein Language Models

作者:

by Lijuan Wang, Yuze Wang, Chen Qiu, Liwei Xiao, Xianliang Liu, Junjie Chen Protein sequence design for tailored functional properties is a fundamental task in protein engineering, with critical applications in drug discovery and therapeutic development. Efficient navigation of the combinatorial vastness of protein sequence space to identify functional variants remains a formidable challenge. Conventional approaches, which predominantly rely on template-based local search or single-residue mutagenesis, are constrained by their susceptibility to local optima and their potential risk of destabilizing native structural stability. In this study, we introduce ProtHMSO, a heuristic multi-site optimization framework leveraging masked protein language models (ProtLMs) for context-aware sequence exploration. ProtHMSO mimics natural evolutionary mechanisms by employing ProtLM-derived substitution probabilities to guide heuristic searches for synergistic mutations, thereby constraining combinatorial search spaces through evolutionary and biophysical priors. ProtHMSO is further applied to replace the exploration strategies in genetic algorithms (GAs) and Monte Carlo tree search (MCTS) for improving their convergence efficiency. Benchmark experiments demonstrate that protein sequences generated by ProtHMSO exhibit superior functional performance and closer alignment with natural sequence distribution, compared with state-of-the-art methods. These advancements highlight that ProtHMSO has strong potential and compatibility to accelerate functional protein discovery, offering a robust framework for efficient and context-aware exploration of protein sequence space.

20.
arXiv (CS.CL) 2026-06-15

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

21.
arXiv (CS.AI) 2026-06-17

First, do NOHARM: towards clinically safe large language models

arXiv:2512.01241v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI. Safety performance tracked general-intelligence and medical-knowledge benchmarks across the full range of models but decoupled at the frontier. Despite strong performance on existing evaluations, widely used AI models can produce medical advice with the potential for severe harm at non-trivial rates, highlighting the importance of explicit measurement of clinical safety.

22.
arXiv (CS.AI) 2026-06-12

A Tutorial on World Models and Physical AI

作者:

arXiv:2606.12783v1 Announce Type: new Abstract: World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

23.
PLOS Medicine 2026-06-23

Comparisons of core component delivery in cardiac rehabilitation programs by country income classification and decade based on the 2025 Global Audit Update: A survey study

by Gabriela Lima de Melo Ghisi, Rachael P. Carson, Karam Turk Adawi, Rongjing Ding, Warner M. Mampuya, Mariya P. Jiandani, Jimena Martinez, Monserrat Cruz Rivero, Claudia V. Anchique, Dinah L. van Schalkwijk, Jonathan Gallagher, Buket Akinci, Dion Candelaria, Jirapa Champaiboon, Daniel F. Quesada-Chaves, Tone M. Norekvål, Iwona Szadkowska, Borut Jug, Evangelia Kouidi, Marta Supervia, Won-Seok Kim, Chamila Mettananda, Lilian Mbau, Gulsim T. Aimakova, Sherry L. Grace, on behalf of the ICCPR Global Cardiac Rehabilitation Audit Update Investigators Background Cardiovascular disease (CVD) remains a leading global health burden. Cardiac rehabilitation (CR) is essential to reducing morbidity and improving patient outcomes. Since the COVID-19 pandemic, CR delivery worldwide has evolved, yet these changes have not been systematically charactemkjrized. The objective of this study was to characterize globally: (1) the delivery of core CR components, including risk factors assessed, patient education practices, and program resources; (2) differences in these elements by country income classification and relative to the initial 2016 Global CR Audit. Methods and findings A cross-sectional Audit update was conducted. Program-level data were collected from May 1st to September 1st 2025 using a REDCap survey adapted from previous Audits. Eligible respondents were leads of phase II/post-discharge CR programs providing at least an initial assessment, structured aerobic exercise, and ≥1 additional core component. ICCPR associations and local leaders supported program identification. Main outcomes were core components delivered (10 assessed), risk factors assessed (14 assessed), patient education dose (hours/patient/program), and program resources (17 assessed). Generalized linear mixed models (GLMM) tested differences by income classification and (when applicable) changes since 2016. Of 7,025 programs identified globally, 1,505 (62% median country response rate) initiated a survey from 90/113 (80%) countries with CR. The median number of core components offered was 8/program (p25, p75 = 6, 10), with upper-middle income countries offering significantly more components overall (median = 9), and also high-income countries offering more than low-income countries (8 versus 6, p 

24.
arXiv (CS.CL) 2026-06-12

Occupational Prompting Reveals Cultural Bias in Large Language Models

Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart–Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

25.
arXiv (CS.CL) 2026-06-17

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English–Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English–Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.