Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-18

Target-confidence Recourse Using tSeTlin machines: TRUST

arXiv:2606.18832v1 Announce Type: cross Abstract: Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.

02.
arXiv (CS.CL) 2026-06-18

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.

03.
arXiv (CS.AI) 2026-06-12

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

arXiv:2606.12485v1 Announce Type: cross Abstract: Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.

04.
arXiv (CS.LG) 2026-06-12

Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training

arXiv:2601.17654v2 Announce Type: replace Abstract: The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive and contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus on optimizing either dynamic or static energy consumption. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time-energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time-energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.

05.
arXiv (CS.CV) 2026-06-11

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%–91% recognition – too low – and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.

06.
arXiv (CS.AI) 2026-06-16

Multi-Grade Deep Learning for Partial Differential Equations with Applications to the Burgers Equation

arXiv:2309.07401v2 Announce Type: replace-cross Abstract: Deep neural networks (DNNs) show great promise for solving partial differential equations (PDEs), but their deep architectures introduce complex, large-scale, non-convex optimization challenges. Nonlinear PDEs, like the viscous Burgers' equation, compound these difficulties due to steep gradients and shock-like solutions. To address this, we propose a two-stage multi-grade deep learning (TS-MGDL) method. In the first stage, shallow networks are trained progressively grade by grade to fit the target function from low- to high-frequency components; previously learned grades are frozen, and each new residual block is trained solely to minimize the remaining approximation error. The second stage unfreezes and retrains selected layers using the first-stage network as initialization, achieving an interpretable, stable hierarchical refinement while mitigating optimization complexity. Furthermore, we theoretically prove that each grade and stage in TS-MGDL monotonically reduces the loss function under an appropriate optimization strategy. Numerical experiments on 1D, 2D, and 3D viscous Burgers' equations demonstrate that TS-MGDL significantly outperforms single-grade learning (SGL), reducing predictive errors by up to a factor of 60.

07.
medRxiv (Medicine) 2026-06-18

Multicluster measles outbreak with a substantial proportion of modified cases in Tokyo, Japan, January-May 2026

Tokyo experienced a measles outbreak (260 cases) in early 2026 despite elimination status. Adults aged 20-39 years were most affected, and 38% of cases were modified measles, increasing with prior vaccination. Although incidence rose until April, the effective reproduction number; R(t) fell below 1, consistent with outbreak control. Multiple clusters were identified, but many cases lacked epidemiological links, suggesting that modified measles is less likely to be considered in differential diagnosis. Intensive contact tracing and surveillance contributed to limiting transmission.

08.
arXiv (CS.LG) 2026-06-16

Cross-Silo De-Anonymization Under Local Differential Privacy: Threat Model, Phase Transition, and Coordination Necessity

arXiv:2606.16763v1 Announce Type: cross Abstract: When a person's records appear in k independent data silos, each protected by (epsilon, delta)-differential privacy, standard composition yields a valid (k*epsilon, k*delta)-DP guarantee for the joint output. This worst-case bound, however, does not answer the concrete inference question: at what k can an adversary actually identify a target person? This paper develops the information-theoretic framework needed to answer that question. We introduce cross-silo person-level DP (XSP-DP), a Pufferfish-style privacy notion whose adjacency relation captures all records of a single person across all silos simultaneously, and verify that the standard basic composition bound carries over to this adjacency model. Within this framework we prove that de-anonymization undergoes a phase transition at k* = Theta(log n / epsilon^2) (population size n, per-silo RR parameter epsilon): a Fano lower bound shows any estimator fails for k > k*. An explicit XOR + randomized-response construction demonstrates information synergy: each silo's output is individually uninformative about the target, yet the joint mutual information is strictly positive. For non-coordinated binary randomized-response mechanisms, we prove that de-anonymization is inevitable once k exceeds the threshold, establishing that cross-silo coordination is necessary. These results provide a baseline threat model and Theta-level threshold for cross-silo inference attacks under local DP.

09.
medRxiv (Medicine) 2026-06-10

Transcriptomic Architecture of Type 2 Diabetes in Human Pancreatic Islets:An Integrative Meta-Analysis and Machine Learning Framework for Biomarker Discovery

作者:

Background. Type 2 diabetes mellitus (T2D) is defined by progressive pancreatic {beta}-cell dysfunction whose molecular underpinnings remain incompletely understood. Single-cohort transcriptomic analyses of donor islets have yielded heterogeneous gene lists of limited cross-study reproducibility, constraining both mechanistic interpretation and biomarker development. Methods. We combined two complementary analytical strategies applied to four public human islet transcriptomic cohorts (GSE25724, GSE20966, GSE38642, and GSE164416; n = 7-57 donors per contrast). For the integrative arm, three microarray datasets and one bulk RNA-seq dataset were processed independently and unified through gene-level random-effects meta-analysis, hallmark pathway scoring (GSVA/MSigDB), and iterative module refinement, yielding a two-axis disease framework. For the diagnostic arm, a consensus multi-method machine learning pipeline, combining LASSO penalized logistic regression, Support Vector Machine Recursive Feature Elimination (SVM-RFE), and Random Forest importance scoring, was applied to 184 differentially expressed genes from the RNA-seq cohort, with all normalization steps performed within leave-one-out cross-validation (LOOCV) folds to prevent data leakage. Machine learning classification of the RNA-seq cohort was additionally subjected to external transportability testing in the independent bulk human islet RNA-seq cohort GSE50244 using an overlap-restricted reduced score and a threshold fixed in the discovery cohort. Results. Meta-analysis across all four cohorts identified 337 high-confidence T2D-associated genes (96.1% directional concordance in beta-cell-enriched tissue). These were distilled into two refined 14-gene modules: ImmuneStress (MICB, HLA-DRA, HLA-DPA1, IL1R2, and others) and BetaCellIdentitySecretion (RASGRP1, PPP1R1A, SLC2A2, and others), whose composite IsletDysfunctionScore provided the most stable cross-platform separation of non-diabetic from T2D islets (Hedges' g = 1.80, p = 9.83 x $10^-17$, $text{I}^2$= 0%). Consistent with progressive disease, IsletDysfunctionScore increased monotonically from non-diabetic to impaired glucose tolerance to T2D. Separately, the machine learning pipeline derived a 10-gene diagnostic panel: GABRA2, SLC2A2, ARG2, DKK3, PRIMA1, TAFA4, HHATL, PARVG, RNU1-70P, and the novel lncRNA ENSG00000284653, that achieved perfect discrimination in LOOCV (AUC = 1.000, sensitivity = 1.000, specificity = 1.000, zero misclassifications across all 57 donors). A leakage-verification experiment confirmed that this performance reflected genuine biological signal: global quantile normalization prior to cross-validation collapsed AUC to 0.380. External testing showed that 8 of the 10 panel genes were measurable in GSE50244. The frozen 8-gene reduced score retained strong discrimination (external AUC = 0.907), with 6 of 8 genes preserving directional concordance, but the discovery-derived threshold did not transfer because the external score distribution was shifted upward and compressed, yielding complete sensitivity but zero specificity at the frozen cutoff Conclusions. Integrating pathway-level meta-analysis with machine learning classification, we present a coherent two-axis model: immune/stress activation and loss of beta-cell identity/secretory competence, together with a compact, biologically interpretable 10-gene diagnostic signature. Panel genes converge on GABA signaling, glucose transport, arginine metabolism, WNT pathway inhibition, and a novel lncRNA, providing both mechanistic hypotheses and high-priority targets for external validation. These findings offer a reproducible transcriptomic scaffold for future mechanistic, biomarker, and clinical translation studies of human islet dysfunction. They also support external transportability of the core biological signal, while indicating that absolute operating thresholds are cohort-dependent and would require recalibration before deployment in independent datasets.

10.
arXiv (CS.CL) 2026-06-18

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law among several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

11.
arXiv (CS.AI) 2026-06-17

Towards Distributed Inference of LLMs on a P2P Network

arXiv:2606.17059v1 Announce Type: cross Abstract: Prefix caching can reduce LLM inference latency by reusing KV caches across requests with shared prompts, but cluster-scale reuse is challenging because caches are partitioned across nodes. We propose a decentralized, prefix-cache-aware routing scheme for peer-to-peer LLM serving. Each node maintains a local radix tree of its own cached prefixes and asynchronously refreshed estimates of peer caches using periodic anti-entropy. Requests are routed to the node with the longest estimated prefix match, without centralized coordination or KV-cache transfer. Stale metadata only causes cache misses, not incorrect outputs, making weak consistency sufficient for correctness. Evaluation on simulated MMLU workloads show that decentralized routing improves latency under low communication delay and skewed prefix distributions, while high network latency and affinity-induced hotspots limit its benefits.

12.
arXiv (quant-ph) 2026-06-16

Counterdiabatic Raman Atom Optics for Compact High-Sensitivity Gravimetry

arXiv:2606.16945v1 Announce Type: new Abstract: Large-momentum-transfer (LMT) atom interferometry provides a route toward enhanced inertial sensitivity in compact quantum sensors, but its scalability is limited by the accumulation of pulse-transfer errors across long Raman pulse sequences. We investigate theoretically the use of stimulated Raman shortcut-to-adiabatic passage (STIRSAP) for high-fidelity LMT atom optics in a Mach–Zehnder interferometer geometry. The counterdiabatic correction is encoded directly into the Raman pulse envelopes, eliminating the need for auxiliary microwave or radio-frequency control fields. Numerical simulations based on an effective Raman model show that $1~\mu\mathrm{s}$ STIRSAP pulses achieve single-pulse transfer fidelities of $F_\pi = 0.99902$ while maintaining negligible pulse-time overhead even at high momentum order. We analyze the resulting tradeoff between interferometric phase enhancement and compound contrast decay and identify an unconstrained shot-noise optimum near $n\approx270$. The analysis further shows that practical operation at extreme LMT order is constrained by wave-packet separation, vibration noise, Doppler detuning, and accumulated systematic effects rather than by pulse duration itself. These results establish superadiabatic Raman control as a promising approach for scalable high-fidelity atom optics and clarify the physical limitations governing compact high-order atom interferometers.

13.
arXiv (math.PR) 2026-06-16

A 0-1 Law for Multifractal Spectra via the HGDS Scale Derivative

arXiv:2606.15850v1 Announce Type: new Abstract: We prove that the multifractal spectrum D(h,omega) of a stochastic process is almost surely deterministic under a scale decorrelation condition on the HGDS scale derivative. The key difficulty is that the pointwise Hölder exponent lives in the germ sigma-algebra, where classical 0-1 laws do not reach. We get around this by working with the geometry accumulation integral G_Lambda, which is a genuine Lebesgue integral over scales and concentrates almost surely. The boundary case – log-correlated fields – is sharp: the variance summability condition fails exactly there.

14.
medRxiv (Medicine) 2026-06-11

Incremental costs of transitioning from four to eight WHO-recommended antenatal care visits in Uganda: A costing analysis from a societal perspective

Background In 2016, the World Health Organization revised its antenatal care (ANC) recommendation from four to eight visits. For low- and middle-income countries like Uganda, where achieving even four visits remains a challenge, this transition has significant cost implications for both the health system and households. This study estimated the incremental costs of adopting the eight-visit model from a societal perspective. Methods The study was conducted in six government health facilities in southwestern Uganda. A micro-costing approach estimated health facility costs (personnel, equipment, consumables, and overhead). Costs incurred at patients end (transport, ultrasound, medical expenses, and time) were collected from 785 women using a questionnaire, with all costs in 2025 USD. Results For an average of 4.3 visits, total cost per woman was $100.1: facility costs $43.7 (43.7%), and patient costs $56.4 (56.3%). Transitioning to eight visits would increase total cost by $57.8 (57.8%), of which $36.4 (63.0%) would fall on households, equivalent to 68.8% of average monthly household income. Total costs would rise by 55.4% ($115.5 to $179.5) at Health Center IVs and 64.3% ($102.3 to $168.1) at Health Center IIIs, with facility costs up 43.4% and 62.9% and patient costs up 61.2% and 65.7%, respectively. Conclusion Transitioning to eight ANC visits would impose a large financial burden on households, with the incremental patient cost equivalent to more than two-thirds of average monthly household income. Equitable implementation requires improving availability of medicines and diagnostics, subsidizing transport, exploring telemedicine or community-based models, and improving efficiency at lower-tier health centers.

15.
arXiv (CS.CL) 2026-06-11

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

Model Construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated in education contexts to support learning. However, these tools lack visual interactivity that is required by some learning contexts. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM supported architecture that assists in model construction within the open inquiry ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into a inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.

17.
medRxiv (Medicine) 2026-06-15

Shortened blastocyst vitrification achieves live birth rates comparable to standard protocols: an analysis of 3168 cryotransfers

Study question Do shortened blastocyst vitrification and warming protocols provide comparable live birth rates (LBR) and obstetrical and perinatal outcomes to traditional vitrification and warming protocols? Summary answer Shortened vitrification and warming protocols provide comparable LBR, obstetric and perinatal outcomes to traditional protocols. Shortened vitrification coupled with traditional multi step warming benefitted women >35yrs. What is known already Embryo viability following cryopreservation is dependent on blastomere survival and functional integrity, both impacted by ice crystal formation and osmotic gradients. Recent innovations in cryopreservation challenge the need for stepwise dehydration and rehydration protocols. While one step ''fast'' blastocyst warming protocols seem to provide equivalent clinical outcomes to traditional ''slow'' protocols, fewer studies investigate whether blastocyst dehydration rates can be similarly increased. A thorough safety and effectiveness evaluation remains necessary for both treatment success and offspring health. Study design, size, duration Three clinics within a network participated in this retrospective consecutive cohort study, with cycle data collected for 3603 warmed blastocysts resulting in 3168 frozen blastocyst transfers in 2170 patients between 2023 and 2025. We modelled the relationship between ''fast'' versus ''slow'' protocols and outcomes with Generalized Additive Models, and linear and logistic regressions where appropriate. Two tailed chi square with Yates correction was used to examine pregnancy loss and obstetrical and perinatal outcomes; p0.05). Importantly, women 35yrs or older at vitrification (n=1715 transfers) profited from a F/S strategy, which provided a significant increase in live birth rates (OR:1.42 [1.02-1.98] p=0.038) compared to S/S. The same improved live birth following a F/S strategy were also seen in embryos of lower quality (OR:1.78 [1.12-2.83] p=0.015), suggesting of a protective effect of this cryopreservation strategy on the developmental competence of impaired germplasm. Limitations, reasons for caution Factors affecting the results may be unaccounted for by the study retrospective nature. Wider implication of the findings Overall, shortened, ''faster'' vitrification and warming protocols provide comparable reproductive outcomes to traditional ones. The combination of shorter exposure to cryoprotectant (CPA) during vitrification and stepwise osmotic gradient during warming provided significant clinical benefits specifically to patients >35 and lower quality embryos, pointing to the possibility of adapting vitrification protocols to specific patients populations and optimizing their clinical outcomes.

18.
arXiv (CS.CL) 2026-06-19

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce \textsc{LedgerAgent}, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, \textsc{LedgerAgent} improves average pass\textasciicircum{}k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.

19.
arXiv (CS.CV) 2026-06-12

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow – segmenting the image, running Marching Cubes, and then manually cleaning up the result – is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass – no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.

20.
arXiv (CS.AI) 2026-06-19

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

arXiv:2606.20363v1 Announce Type: new Abstract: Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5\% to 20.5\%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.

21.
arXiv (CS.LG) 2026-06-12

Epistemic Uncertainty Is Not the Reducible Kind

作者:

arXiv:2606.12646v1 Announce Type: cross Abstract: The standard taxonomy of predictive uncertainty defines epistemic uncertainty as the part removable by collecting more data, while the standard measure identifies it with a mutual-information term. We prove the definition and the measure are extensionally inconsistent. On an explicit construction, the measure assigns all uncertainty to the epistemic class, yet no quantity of training data reduces it. Reducibility is instead a property of the pair (uncertainty, acquisition class), and the dichotomy resolves into three parts: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic uncertainty. An exact identity for the value of an observation shows that in-distribution data never reduces mechanism-irreducible uncertainty and generically increases it. Ensemble disagreement, the deployed epistemic estimate, tracks the training procedure rather than the epistemic term. It collapses to zero beneath a positive truth under consistent training, and equals hyperparameter-scaled initialization noise under interpolation. A finite-sample falsification test and seed-swept experiments confirm the theory.

22.
arXiv (CS.LG) 2026-06-16

Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model

arXiv:2606.15219v1 Announce Type: new Abstract: In this work, we tackle the following question: Can neural networks trained with gradient-based methods achieve the optimal computational-statistical tradeoff in learning Gaussian single-index models? Prior research has shown that any polynomial-time algorithm under the statistical query (SQ) framework requires $\Omega(d^{s^\star/2}\lor d)$ samples, where $s^\star$ is the generative exponent representing the intrinsic difficulty of learning the underlying model. However, it remains unknown whether neural networks can achieve this sample complexity. Inspired by prior techniques such as label transformation and landscape smoothing for learning single-index models, we propose a unified gradient-based algorithm for training a two-layer neural network in polynomial time. Our method is adaptable to a variety of loss and activation functions, covering a broad class of existing approaches. We show that our algorithm learns a feature representation that strongly aligns with the unknown signal $\theta^\star$, with sample complexity $\widetilde{O} (d^{s^\star/2} \lor d)$, matching the SQ lower bound up to a polylogarithmic factor for all generative exponents $s^\star\geq 1$. Furthermore, we extend our approach to the setting where $\theta^\star$ is $k$-sparse for $k = o(\sqrt{d})$ by introducing a novel weight perturbation technique that leverages the sparsity structure. We derive a corresponding SQ lower bound of order $\widetilde{\Omega}(k^{s^\star})$, matched by our method up to a polylogarithmic factor. Our framework, especially the weight perturbation technique, is of independent interest, and suggests potential gradient-based solutions to other problems such as sparse tensor PCA.

23.
arXiv (CS.LG) 2026-06-12

Auditing Discriminatory Patterns in Mortgage Lending Through Association Rules and Fair Binning

arXiv:2606.12435v1 Announce Type: cross Abstract: Mortgage lending in the United States exhibits persistent racial and gender disparities. We investigate whether standard data preprocessing steps, specifically attribute binning, amplify these disparities in downstream pattern mining. Using 103,481 cleaned mortgage applications from the HMDA 2023 dataset (Chicago metropolitan area), we build a three-stage pipeline: (1) a PySpark data cleaning and binning pipeline that implements both standard equal-frequency binning and the epsilon-biased fair binning algorithm from Asudeh et al. [1], (2) FP-Growth association rule mining that compares denial patterns under both binning regimes, and (3) K-Means clustering with a per-cluster disparate impact audit. Our standard binning shows 9.63% racial bias in income discretization, consistent with the 8-10% reported in prior work. Fair binning with seven race groups is infeasible at epsilon=0.03 and only succeeds at epsilon=0.08 with a Price of Fairness of 29.4%. FP-Growth reveals that high debt-to-income ratio is the dominant denial predictor (67.2% confidence, 2.81 lift), while racial bias does not appear as explicit high-support rules. However, K-Means clustering followed by a disparate impact audit flags 10 out of 45 cluster-group pairs, showing that Black applicants face significantly higher denial rates than White applicants even among financially similar groups.

24.
arXiv (CS.AI) 2026-06-19

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

arXiv:2606.19704v1 Announce Type: new Abstract: Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.

25.
arXiv (CS.CL) 2026-06-12

BOUTEF: A Multilingual Corpus for FakeNews in North Africa – Language as a Weapon

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.