Academic Intelligence · Curated Daily

Explore the Frontiers of Global Research

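AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.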

01.
arXiv (CS.AI) 2026-06-16

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

arXiv:2606.08151v2 Announce Type: replace Abstract: Modern large language model (LLM) agents do not simply need longer contexts; they need decision-relevant evidence at the moment of action. We study decision-aware context selection: ranking retrieved files, tests, traces, rules, and memories by their expected effect on an agent's next action rather than by semantic similarity alone. We present the Counterfactual-Inspired Context Layer (CICL), which builds an instance context graph, estimates decision-oriented utility for candidate units, and compresses selected evidence into typed memory cards. The same schema can be instantiated with hosted LLM judges, local surrogates, or lightweight rankers, making the selection protocol auditable across model choices. On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-Plus reranking of BM25 top-50 candidates improves hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show that CICL identifies action-critical evidence: removing the top-utility semantic unit reduces F1 from 0.245 to 0.000. In selected-then-compressed mode, memory cards save 44.93 tokens per query while preserving selected evidence. CICL provides a practical layer for measuring, ranking, and compressing decision-critical context for tool-using agents. Code is available at https://github.com/stephen-guan-researcher/CICL.
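
For reference, the retrieval metrics quoted above (hit@1 and MRR@10) are commonly computed as in the minimal sketch below. It assumes exactly one gold file per query, which the abstract does not state; the function names and toy file names are hypothetical and are not taken from the CICL code.

```python
from typing import Sequence


def hit_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int = 1) -> float:
    """Fraction of queries whose gold item appears in the top-k ranked candidates."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mrr_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int = 10) -> float:
    """Mean reciprocal rank of the gold item, counting 0 if it is not in the top k."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        for position, item in enumerate(ranked[:k], start=1):
            if item == g:
                total += 1.0 / position
                break
    return total / len(gold)


# Toy data (hypothetical): three queries with one gold file each.
toy_rankings = [["a.py", "b.py"], ["c.py", "d.py"], ["e.py", "f.py"]]
toy_gold = ["a.py", "d.py", "z.py"]
toy_hit1 = hit_at_k(toy_rankings, toy_gold, k=1)
toy_mrr10 = mrr_at_k(toy_rankings, toy_gold, k=10)
```

On the toy data, one of three queries has its gold file at rank 1 and one at rank 2, so hit@1 is 1/3 and MRR@10 is (1 + 1/2 + 0) / 3 = 0.5.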

02.
arXiv (CS.AI) 2026-06-15

Distributional Biases in Post-Training: A Markovian Analysis of Reasoning Trajectories

arXiv:2511.07368v3 Announce Type: replace-cross Abstract: Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RL with verifiable rewards (RLVR) and test-time scaling (TTS). While recent work highlights the role of exploration in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing paths rather than expanding the reasoning scope, raising the question of why exploration helps if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering some symmetry) reasoning steps as high- versus low-probability Markov transitions. In this tractable model, pretraining corresponds to tree-graph discovery, while post-training corresponds to CoT reweighting. We prove that both RLVR and ORM/PRM heavily favor a few high-probability paths and thereby forget rare-but-crucial CoTs. Building on this, we further prove that exploration strategies such as rejecting easy instances and KL regularization help preserve rare CoTs. Empirical simulations corroborate our theoretical results.

03.
arXiv (CS.CV) 2026-06-12

HairPort: In-context 3D-aware Hair Import and Transfer for Images

Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

04.
arXiv (CS.AI) 2026-06-18

Quality Perceptions and Intended Engagement in Response to AI-Generated and AI-Assisted News

arXiv:2409.03500v4 Announce Type: replace-cross Abstract: The increasing use of artificial intelligence (AI) in news production raises important questions about how audiences perceive and respond to AI-generated journalism. This preregistered survey experiment (N = 599, German-speaking Switzerland) examines (i) perceptions of article quality (measured as credibility, readability, and expertise) across news excerpts that were human-written, AI-assisted, or fully AI-generated, and (ii) self-reported intentions to engage following disclosure of AI involvement. Participants rated two short news excerpts before learning how they had been produced. Articles across all conditions were evaluated similarly in perceived quality. After disclosure, participants in the AI-assisted and AI-generated conditions reported a higher willingness to continue reading their assigned articles compared to the control group, but future willingness to read AI-generated news did not differ across conditions. Overall, the findings suggest that readers assess AI-generated and human-written news comparably in quality, while disclosure of AI use can momentarily increase curiosity or interest without yet changing longer-term reading intentions.

05.
arXiv (math.PR) 2026-06-24

Typical geometry of self-repelling polymers in a constant force field

arXiv:2606.24352v1 Announce Type: cross Abstract: We study a general class of self-repelling polymers on $\mathbb Z^2$, including the simple random walk, the self-avoiding walk and the repulsive Domb-Joyce model, in the presence of a constant force field acting on each monomer. Conditioning the polymer to have fixed length and fixed endpoints, we identify the limiting free energy and prove that typical trajectories concentrate exponentially near a deterministic macroscopic shape. This shape is characterized as the unique minimizer of a variational problem and can be interpreted as a geodesic of a height-dependent Finsler metric. We also analyze two limiting regimes with universal features: for small field strength, in the symmetric case, the geodesic is close to a classical catenary, while for large field strength it converges to a universal polygonal shape governed by the nearest-neighbor lattice constraint.

06.
arXiv (CS.AI) 2026-06-12

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

arXiv:2606.12852v1 Announce Type: new Abstract: Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

07.
arXiv (CS.CL) 2026-06-24

When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs

Discrete diffusion language model (DLM) fine-tuning inherits inexpensive diagnostics from denoising-time confidence monitors, but their PEFT-training meaning is untested. We test top-1 argmax concentration as a collapse warning. Across 816 LoRA/PEFT configurations from three DLM families, the warning fires for every configuration while logs record 0/816 actual collapses at the 200-step horizon, giving zero precision. The cause is pre-equilibrium saturation: top-1 concentration is already high before optimization and quickly becomes insensitive to final training stability. We then evaluate max LoRA gradient norm, a parameter-side signal that samples gradient routing rather than token concentration. On a pooled held-out LLaDA-family split, a train-optimized threshold identifies top-decile final-loss configurations with precision 0.68 and F1 = 0.79, above the all-positive top-1 baseline even at the lower split-bootstrap confidence bound. Autoregressive controls and cross-family threshold failures bound the result to short-horizon DLM-LoRA inspection rather than a universal collapse detector. Workflow: drop top-1 as a PEFT alarm, log max-gradient early in training, and calibrate thresholds per DLM family before routing runs for inspection.

08.
arXiv (CS.CL) 2026-06-24

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that direct zero-shot large language model (LLM) generation is poorly suited to this task: clinical code systems are large, version-controlled, and not reliably memorized by language models. We study a stage-wise alternative in which candidate-pool construction is optimized for recall and a constrained LLM adjudicator is optimized for candidate selection. On the full 3,744-value-set RASC test split, Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval increases candidate-pool recall from the original RASC retrieval baseline of 0.553 to 0.730; on the held-out-publisher stratum, pool recall is 0.655. The higher-recall pool alone is not sufficient: applying the original SAPBert cross-encoder to this expanded pool gives full-test macro F1 of 0.287 and held-out-publisher macro F1 of 0.233. Replacing the stage-2 selector with blinded GPT-5 adjudication over the same pool increases full-test macro F1 to 0.549 and held-out-publisher macro F1 to 0.533. These results show that retrieval-constrained LLM adjudication can substantially improve value set completion while preserving the safety constraint that all returned codes must come from an auditable candidate pool.

09.
arXiv (CS.LG) 2026-06-15

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions, resulting in saturated performance on modern agents and failing to probe their limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios, focusing on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning), across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline that comprises an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from achieving human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on our GauntletBench, highlighting the limitations in these overlooked capabilities and generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing the substantial gap between current agent capabilities and those required for complex real-world scenarios.

10.
arXiv (CS.LG) 2026-06-19

Structure-Oriented Randomized Neural Networks for Poisson-Nernst-Planck and Poisson-Nernst-Planck-Navier-Stokes Systems

arXiv:2606.19912v1 Announce Type: cross Abstract: We develop a structure-oriented randomized neural network framework, termed SO-RaNN, for the Poisson-Nernst-Planck (PNP) system and the Poisson-Nernst-Planck-Navier-Stokes (PNP-NS) system. The decoupled linearized subproblems are solved iteratively by randomized neural networks in a space-time framework. For the concentration variables, a pointwise cut-off is used to enforce positivity at the value level, and discrete mass-scaling factors are computed at selected correction instants and interpolated in time, so as to ensure exact mass matching at those instants and to promote approximate mass preservation between them. To introduce an auxiliary discrete dissipation mechanism, we further employ an SAV-type post-processing correction, which yields monotonicity of the SAV auxiliary variable under the ideal SAV update. For the PNP-NS system, a structure-preserving randomized neural network (SP-RaNN) is used for the velocity field, so that the velocity approximation satisfies the incompressibility constraint pointwise by construction. On the theoretical side, we derive residual-based estimates for the raw, uncorrected RaNN solvers of the linearized subproblems, formulate a conditional local-in-time convergence result for the raw outer Picard iteration of the PNP system, and analyze the value-level positivity correction together with the mass-correction and SAV post-processing steps. For the PNP-NS system, we establish an approximation result for the SP-RaNN space and provide a conditional error statement for the corresponding linearized Oseen-type problem. Numerical experiments demonstrate approximation accuracy in the source-driven manufactured tests and illustrate the intended value-level positivity correction, selected-time mass matching, computed free-energy curves based on the final gauge-fixed potential, and divergence-free approximation in benchmark tests.

11.
arXiv (CS.AI) 2026-06-18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

arXiv:2606.19245v1 Announce Type: new Abstract: Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

12.
bioRxiv (Bioinfo) 2026-06-11

GermRL: Alleviating The Germline Bias In Autoregressive Antibody Language Models Through Reinforcement Learning

Antibodies are powerful therapeutics whose antigen specificity arises from sequence diversity shaped during development. Recently, language models trained on large antibody repertoire datasets have enabled the generation and screening of novel candidates, but these models retain a strong germline bias. As AI adoption increases in therapeutic workflows, it is crucial to develop models that harness the diversity of antibodies necessary for the discovery of mutations that encode desirable properties. Previous work explored the germline bias in masked antibody language models, yet the bias in generative autoregressive language models has not yet been addressed. Here, we present GermRL, a lightweight and modular reinforcement learning (RL) framework capable of alleviating the germline bias in pre-trained antibody autoregressive language models through group relative policy optimization (GRPO). GermRL achieves consistent one-shot generation of antibodies that satisfy specified mutation thresholds from germline while maintaining structural plausibility. Under the lowest and highest mutation thresholds tested (5 and 35 mutations from germline), GermRL scores 0.992 and 0.950 pass@1, respectively, compared to 0.398 and 0.034 for the pre-trained language model. Within GermRL, we introduce a key pair of modifications to GRPO that increase training efficiency by discouraging reward hacking under our antibody application. Furthermore, comparison of RL-generated and natural antibody sequences reveals how RL-based optimization can explore alternative evolutionary mutational patterns and residue compositional strategies while preserving key global properties of natural antibodies, including identifiable germline assignments, embedding-level similarity and comparable developability profiles.
Thus, RL-trained generative models optimized to promote antibody mutations through diversity from germline provide a promising framework for navigating the antibody sequence landscape, enabling exploration of novel yet biologically plausible candidates for therapeutic design.
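
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation used in its standard formulation: each sampled completion's reward is normalized against the mean and standard deviation of its own group. This is not GermRL's modified variant, whose details the abstract does not give; the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the statistics of its own sampling group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every reward in the group is identical.
    """
    mean = fmean(rewards)
    spread = pstdev(rewards)
    return [(r - mean) / (spread + eps) for r in rewards]


# Hypothetical group of four sampled sequences with binary rewards,
# e.g. 1.0 if a sequence meets a mutation threshold and 0.0 otherwise.
toy_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In the toy group, the rewarded samples receive an advantage close to +1 and the others close to -1, so the policy update pushes probability toward the samples that beat their group average.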

13.
arXiv (CS.AI) 2026-06-16

Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course

arXiv:2606.16842v1 Announce Type: cross Abstract: Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited. This paper reflects on the design and implementation of a project-based master's-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator's perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems.

14.
arXiv (CS.CV) 2026-06-16

GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.

15.
arXiv (CS.CV) 2026-06-12

Towards More General Control of Diffusion Models Using Jeffrey Guidance

A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
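
Jeffrey's rule of conditioning, on which the abstract builds, can be illustrated on a small discrete joint distribution: the marginal of one variable is replaced by a prescribed target while every conditional given that variable is left unchanged. The sketch below is only this textbook discrete rule, not the paper's diffusion guidance procedure; the function name and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def jeffrey_update(joint: Sequence[Sequence[float]], target_marginal: Sequence[float]) -> List[List[float]]:
    """Apply Jeffrey's rule: P'(x, y) = P(x | y) * Q(y).

    joint[i][j] holds P(X=i, Y=j); target_marginal[j] holds the new marginal Q(Y=j).
    """
    n_rows, n_cols = len(joint), len(joint[0])
    column_sums = [sum(joint[i][j] for i in range(n_rows)) for j in range(n_cols)]
    updated = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        if column_sums[j] == 0.0:
            # P(x | y) is undefined where P(y) = 0, so no mass can be moved there.
            if target_marginal[j] > 0.0:
                raise ValueError("target puts mass on a value with zero prior probability")
            continue
        for i in range(n_rows):
            updated[i][j] = joint[i][j] / column_sums[j] * target_marginal[j]
    return updated


# Toy joint over two binary variables; the Y marginal moves from (0.5, 0.5) to (0.8, 0.2).
toy_joint = [[0.3, 0.1], [0.2, 0.4]]
toy_updated = jeffrey_update(toy_joint, [0.8, 0.2])
```

After the update, the marginal of Y equals the target, while P(X=0 | Y=0) stays at 0.6, matching the abstract's description of preserving the conditional structure.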

16.
arXiv (CS.CV) 2026-06-24

Trimming the Long-Tail of Visual World Modeling Evaluation

Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.

17.
arXiv (CS.LG) 2026-06-18

Towards Anomaly Detection on Relational Data

arXiv:2606.18621v1 Announce Type: new Abstract: Relational databases are widely used for managing structured data in real-world systems. Detecting anomalies from such relational data is crucial for identifying fraud, risks, and abnormal behaviors, yet remains under-explored. The key challenges lie in the intrinsic complexity of relational data: multi-table attributes are high-dimensional and heterogeneous, making sparse abnormal clues easy to overwhelm by normal or irrelevant information; and anomalies may further manifest as abnormal connection patterns across different foreign-key relations, which existing tabular and graph anomaly detection methods are ill-suited to capture. To address them, we propose RelAD, a reconstruction-based framework that captures anomalies from both attribute and relational edge reconstruction. RelAD contains two core modules: conditional sparse-gated attribute reconstruction, which suppresses redundant multi-table attributes and emphasizes abnormal semantic blocks, and dual-view multi-relational edge reconstruction, which detects relation-specific abnormal connections from both intrinsic and behavioral entity profiles. The resulting attribute and relational signals are integrated through a lightweight fusion module to produce the final anomaly score. We further construct 6 benchmark datasets with systematic anomalies, on which extensive experiments show that RelAD consistently outperforms other baselines while achieving competitive efficiency.

18.
arXiv (CS.CV) 2026-06-15

ADAPT: An Autonomous Forklift for Construction Site Operation

Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of ADAPT (Autonomous Dynamic All-terrain Pallet Transporter), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its continuous performance against an experienced human operator across various weather conditions. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.

19.
arXiv (CS.LG) 2026-06-18

Knockoffs-based False Discovery Rate Control and Simplification for Deep Neural Networks

arXiv:2606.04404v2 Announce Type: replace-cross Abstract: Deep neural networks are widely used in machine learning and have been applied in many fields. However, they often involve a large number of parameters and inputs, many of which may be irrelevant to the goal or true output. These parameters and input variables increase computational complexity and cost. One solution to this problem is knockoff methods, which have proven successful in controlling false discovery rates in high-dimensional regression. Building on knockoff methods and regularised neural networks, this paper proposes three variable screening methods that control the false discovery rate: a one-layer filter, a multiple-layer filter, and a variable weight aggregation filter. In comparison with existing algorithms, we find that our algorithms show satisfactory performance.
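
As background on how knockoff methods control the false discovery rate, the sketch below implements the generic knockoff+ selection step from the regression setting: given one statistic W_j per variable (large positive values favor the real variable over its knockoff), it picks the smallest threshold whose estimated false discovery proportion is at most q. It is not one of the three filters proposed in the paper, and the statistics in the toy example are made up for illustration.

```python
import math
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q, else infinity."""
    candidates = sorted({abs(value) for value in w if value != 0})
    for t in candidates:
        negatives = sum(1 for value in w if value <= -t)
        positives = sum(1 for value in w if value >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return math.inf


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices of variables whose statistic clears the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, value in enumerate(w) if value >= t]


# Hypothetical statistics for eight input variables, target FDR level q = 0.25.
toy_w = [5.0, 4.0, 3.0, 2.5, -1.0, 0.5, -0.3, 2.0]
toy_selected = knockoff_select(toy_w, q=0.25)
```

On the toy statistics, the first threshold that satisfies the bound is 2.0, where no negative statistic is that large in magnitude and five positives remain, giving an estimate of 1/5.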

20.
medRxiv (Medicine) 2026-06-17

Sao Tome and Principe on the verge of eliminating lymphatic filariasis as a public health problem: evidence from IDA impact assessment surveys

Background: Accelerated efforts to eliminate lymphatic filariasis (LF) as a public health problem have been supported by the introduction of the triple-drug regimen of ivermectin, diethylcarbamazine and albendazole (IDA) in endemic settings. In Sao Tome and Principe, nationwide mass drug administration (MDA) with diethylcarbamazine and albendazole was implemented in 2018, followed by IDA in 2019 and 2020. This study assesses progress towards elimination using post-MDA impact assessment surveys conducted after cessation of treatment. Methods: Cross-sectional surveys were conducted among adults aged 20 years and older in 2022 and again between December 2024 and January 2025. Circulating filarial antigen (CFA) was detected using the filarial test strip (FTS). Individuals who tested positive were examined for microfilaremia using nocturnal calibrated thick blood smear microscopy. Additionally, programme data on MDA coverage and morbidity were obtained from national surveillance records. Results: Three rounds of nationwide MDA achieved high epidemiological coverage (86.4% in 2018, 74.2% in 2019 and 80.0% in 2020). The impact assessment surveys conducted in 2022 evaluated 14 132 adults, with 21 individuals (0.15%) testing positive for CFA, while the follow-up survey conducted between December 2024 and January 2025 assessed 14 653 adults and detected seven positive cases (0.05%). No microfilariae were detected among the 28 antigen-positive individuals examined using nocturnal calibrated thick blood smears. National morbidity records documented 190 cases of lymphoedema and nine cases of hydrocoele. Conclusions: Infection indicators remain well below WHO decision thresholds, suggesting that LF transmission is unlikely to be sustained. Sao Tome and Principe appears to be close to eliminating LF as a public health problem. However, strengthening morbidity management services will be essential to support the preparation of the national elimination dossier.

21.
arXiv (CS.LG) 2026-06-24

Machine Learning Classification and Portfolio Construction: Does the Loss Function Matter?

arXiv:2108.02283v3 Announce Type: replace-cross Abstract: Classification outperforms regression across matched machine learning models in portfolio construction. A stacking ensemble of gradient boosted tree, random forest, and neural network yields a value-weighted annualized Sharpe ratio of 1.83 for classification and 1.11 for regression. This outperformance persists in multiclass settings, across subsamples, and after transaction costs. Spanning tests show that classification retains economically large alphas after we control for regression, whereas regression alphas shrink substantially once we control for classification. These results indicate that classification extracts more return information than matched regression. Our diagnostics trace classification's advantage to sharper and more precise separation of return deciles.
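
The annualized Sharpe ratios quoted above follow the usual definition: mean excess return divided by its standard deviation, scaled by the square root of the number of periods per year. The sketch below assumes monthly returns and the sample standard deviation, neither of which the abstract states; the function name and the toy return series are hypothetical.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(excess_returns: Sequence[float], periods_per_year: int = 12) -> float:
    """Annualized Sharpe ratio from per-period excess returns.

    Assumption: returns are already in excess of the risk-free rate and
    roughly independent across periods, so scaling by sqrt(periods) applies.
    """
    mean = statistics.fmean(excess_returns)
    volatility = statistics.stdev(excess_returns)  # sample standard deviation
    return mean / volatility * math.sqrt(periods_per_year)


# Hypothetical monthly excess returns of a long-short portfolio.
toy_returns = [0.01, 0.03, -0.01, 0.02, 0.00]
toy_sharpe = annualized_sharpe(toy_returns)
```

Comparing a classification-based and a regression-based portfolio then amounts to calling the function on each return series over the same sample period.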

22.
arXiv (CS.AI) 2026-06-24

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

arXiv:2606.23238v2 Announce Type: replace Abstract: Logical reasoning is essential for reliable AI, yet existing benchmarks are largely first-order-logic-centric, focusing on object-level deduction over fixed predicates. This misses many realistic scenarios where models must reason over rules, predicates, functions, constraints, and decision procedures themselves. We introduce HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning), the first real-world benchmark for higher-order symbolic reasoning in LLMs, containing 1379 instances. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs still struggle on HOLMES, with an average accuracy of only 50.64% and the best model reaching 59.54%. Our analyses further reveal that high final-answer accuracy can mask shortcut reasoning in conflict-resolution settings, while performance drops sharply under scope-conditioned and compositional reasoning. These findings identify higher-order symbolic reasoning as a key bottleneck for building reliable and verifiable LLMs. The project code and dataset are publicly available at https://github.com/wuyucheng2002/HOLMES.

23.
medRxiv (Medicine) 2026-06-23

Multidimensional motivation in aging: a person-centred framework spanning goal-directed behaviour, social reward and pleasure

Motivational changes are determinants of healthy aging, social engagement, and functional independence, and may signal early neurodegenerative risk. Existing assessment approaches in aging typically treat motivation as a unitary construct. Here, we introduce MotDem, an age-appropriate measure of motivation co-designed with people living with dementia, carers, and clinicians. Across a broad adult lifespan sample (18-80 years), MotDem revealed a robust three-domain motivational architecture encompassing goal-directed behaviour, social reward, and pleasure, with a fourth satiety factor retained as exploratory. This structure was replicated in an independent older cohort (45-80 years) from a different national context. MotDem showed strong convergence with established measures of apathy and anhedonia, alongside more modest associations with depressive symptomatology. Together, these findings show that motivational aging is multifaceted and poorly captured by traditional unitary assessment. MotDem provides a multidimensional framework for measuring distinct motivational drivers of heterogeneous aging trajectories, with implications for resilience, wellbeing, and neurodegenerative risk.

24.
bioRxiv (Bioinfo) 2026-06-11

GeroQubit: a lightweight, honesty-first de-novo design platform for geroscience-native small molecules with calibrated uncertainty

Computational molecule generation has outpaced its own credibility. We present GeroQubit, a GPU-free de-novo design platform that organizes candidates along a target × tissue × hallmark model and reports every signal alongside its measured baseline. We treat our tissue aging-signature readout as a mechanistic structural prior that we explicitly disclose is not validated against lifespan, and we surface efficacy only through a structure-to-lifespan k-NN whose weak but real signal (leave-one-out rho ~ 0.145) is wrapped in empirically-calibrated conformal intervals (90% target, 90.3% measured coverage). On a held-out retrospective recovery of ~1,940 ChEMBL binders against decoys, the score reaches ROC-AUC 0.945 with ~20x enrichment at 1% (BEDROC 0.91) and survives a scaffold-disjoint split, yet we report that it collapses to near-random (AUC 0.62) on genuinely novel chemotypes. Molecules are assembled reaction-first, so every candidate carries a verified synthetic route and atom-level synthon provenance; ADMET is handled as a multi-objective Pareto problem. We frame the disclosed weak signals and the hard-case failures not as flaws but as the honest, decision-useful output the field's own critics demand.
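The "90% target, 90.3% measured coverage" claim refers to conformal prediction intervals. As a rough illustration of the idea, the sketch below implements generic split conformal prediction on synthetic data and measures empirical coverage. The predictor, the data, and all names are hypothetical stand-ins; this is not GeroQubit's pipeline or its k-NN model.

```python
import math
import random


def conformal_halfwidth(calib_residuals, alpha=0.10):
    """Split-conformal interval half-width from absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual, which gives
    marginal coverage of at least 1 - alpha for exchangeable data.
    """
    n = len(calib_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calib_residuals)[rank - 1]


rng = random.Random(0)


def predict(x):
    # Stand-in point predictor (hypothetical, not the paper's model).
    return 0.1 * x


def sample(n):
    # Synthetic data: weak linear signal plus unit Gaussian noise.
    points = []
    for _ in range(n):
        x = rng.uniform(-5.0, 5.0)
        points.append((x, 0.1 * x + rng.gauss(0.0, 1.0)))
    return points


calib = sample(1000)
test = sample(5000)

# Calibrate the half-width on held-out residuals, then check coverage.
q = conformal_halfwidth([abs(y - predict(x)) for x, y in calib], alpha=0.10)
covered = sum(abs(y - predict(x)) <= q for x, y in test)
coverage = covered / len(test)
print(f"half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

The point of reporting measured coverage next to the target is that the interval width adapts to how noisy the predictor actually is: a weak signal yields wide but honest intervals rather than overconfident point estimates.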

25.
arXiv (CS.LG) 2026-06-11

A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

arXiv:2601.21817v3 Announce Type: replace-cross Abstract: Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
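To make the modeling idea concrete, one natural way to add a judge-specific discrimination parameter to the Bradley-Terry-Luce model is sketched below. The symbols $\theta_i$ (latent quality of model $i$), $\gamma_k$ (discrimination of judge $k$), and the particular normalizations are assumptions for illustration; the abstract does not give the paper's exact parameterization.

```latex
% Standard BTL: probability that model i is preferred over model j
\Pr(i \succ j) = \sigma(\theta_i - \theta_j),
\qquad \sigma(t) = \frac{1}{1 + e^{-t}}

% Judge-aware variant (assumed form): judge k scales the quality gap
\Pr(i \succ j \mid \text{judge } k) = \sigma\bigl(\gamma_k\,(\theta_i - \theta_j)\bigr),
\qquad \gamma_k > 0

% Log-likelihood over comparisons (i_n, j_n, k_n), with y_n = 1 if i_n wins
\ell(\theta, \gamma) = \sum_{n}
  \Bigl[ y_n \log \sigma\bigl(\gamma_{k_n}(\theta_{i_n} - \theta_{j_n})\bigr)
       + (1 - y_n) \log \sigma\bigl(-\gamma_{k_n}(\theta_{i_n} - \theta_{j_n})\bigr) \Bigr]

% Example normalizations that remove the shift and scale ambiguity
\sum_i \theta_i = 0, \qquad \sum_k \log \gamma_k = 0
```

In this form, a judge with $\gamma_k$ near zero answers almost at random and contributes little to the likelihood, while a judge with large $\gamma_k$ is treated as highly informative. Setting every $\gamma_k = 1$ recovers the unweighted BTL baseline.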