Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-17

Silent Manipulation of Mental Health Treatment Recommendations from a Large Language Model

Importance. Large language models (LLMs) increasingly inform mental health decisions by patients and clinicians. Inference-time activation steering can shift model behavior on a target dimension without altering weights or prompts and without disclosure to users, allowing treatment recommendations to be silently changed for commercial or ideological reasons. Objective. To determine whether directional activation steering can shift an open-weights LLM's depression treatment recommendations. Design, Setting, and Participants. This non-human subjects study applied directional activation steering to an open-weights LLM (DeepSeek V4 Flash) responding to 12 depression-advice scenarios (4 favoring medication, 4 favoring avoidance, 4 neutral), generated at 30 amplitudes from -1.5 to +1.5 in 0.1 increments plus an unsteered baseline. Exposures. A single steering direction contrasting antidepressant medication with self-directed approaches (diet, exercise, meditation, dietary supplements), constructed from 16 paired training prompts and applied at the attention output of every transformer block; weights and system prompt were held constant. Main Outcomes and Measures. The extent to which medication and four self-care categories were addressed, scored 0 to 3 by a human-validated LLM rater (Claude Opus 4.7), the medication-versus-self-care balance, and clinician referral, estimated per unit of amplitude using mixed-effects models with a scenario random intercept. Results. Across 372 generations, steering produced a graded, dose-dependent shift in the medication-versus-self-care balance, which declined by 0.32 per unit of amplitude (beta=-0.32; 95% CI, -0.39 to -0.25; P < .001); medication extent fell and self-care extent rose. The shift was largest for scenarios with no stated treatment preference (beta = -0.44; 95% CI, -0.54 to -0.34; P < .001). A clinician referral appeared in 322 of 372 responses (87%) and did not vary with steering amplitude (P = .63). Conclusions and Relevance. In this open-weights LLM providing depression treatment information, inference-time activation steering shifted treatment recommendations without altering weights, prompt structure, or safety outputs, with the largest effect among users expressing no treatment preference. These findings suggest a need for LLM disclosure standards and independent auditing as such models inform clinical decisions.

02.
arXiv (CS.CL) 2026-06-18

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs – designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate – instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

04.
arXiv (CS.CL) 2026-06-18

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.

05.
bioRxiv (Bioinfo) 2026-06-14

Virtual phenotypic screening discovers novel scaffolds inhibiting the PI3K/mTOR pathway

Phenotypic drug discovery has yielded many first-in-class small-molecule drugs by discovering modulators of disease phenotypes in physiologically relevant cellular systems. However, high-content phenotypic assays lack the ultra-high-throughput scalability of target-based screens. Recent advances in virtual screening present an opportunity to address this bottleneck, but have been limited to simple phenotypes like viability, restricted to small repurposing libraries, or lack in-depth biological validation. Here, we present PhenoCompass, a multimodal co-embedding model that aligns compound structures and high-content phenotypic imaging to enable virtual phenotypic screening over billion-compound libraries. Following training on the Joint Undertaking in Morphology dataset with more than 100,000 Cell Painting compound profiles, retrospective validation with historical biochemical high-throughput screening data demonstrates that PhenoCompass ranks compounds according to their biochemical target engagement. Leveraging PhenoCompass, we performed a prospective screen of 3.8 billion Enamine REAL compounds for inhibitors of PI3K/mTOR pathway, a critical signaling cascade whose aberrant activation is a common tumor driver. This search identified 11 novel compounds with pathway-consistent Cell Painting readout and diverse scaffolds, a 54-fold enrichment over the training set. Orthogonal validation experiments using a FOXO3A reporter assay and direct kinase inhibition confirmed seven structurally novel inhibitors with distinct mechanisms of action. These results highlight the convergence of diverse molecular target profiles onto a shared morphological pathway signature and establish PhenoCompass as a robust framework for high-content phenotypic virtual screening.

06.
arXiv (CS.AI) 2026-06-18

AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

arXiv:2606.18617v1 Announce Type: cross Abstract: There exist numerous tutor training platforms. However, few provide AI-driven training and evaluation for human tutors based on real-life performance. We present an AI-driven system that assesses both open responses during training and authentic real-life tutoring. Unlike platforms that only assess learning through online training or simulations, our system utilizes Generative AI (Gemini-2.5-pro) to analyze transcriptions of authentic tutoring, measuring the transfer of tutor skills to real-life application. Human tutors instructing students remotely in math (N=86) completed six scenario-based lessons, averaging a significant 7.4% learning gain. Using mixed-effects models across 405 session-to-lesson pairs, we found that training performance significantly predicted real-life transcript scores with an effect size of 0.25 SD. Model comparison (AIC/BIC) indicated averaging open response and multiple choice performance during training predicted real-life tutor performance best, although open responses were comparatively more predictive. Exploratory analysis showed that after training, tutors were significantly more likely to encounter pedagogical opportunities to apply their skills (61.1% to 68.9%) and demonstrated higher execution quality within those opportunities (65.5% to 68.1%). Interrupted time series analysis suggested that these tutor improvements were part of a gradual trend over time rather than an immediate intervention effect of training. We illustrate an AI-driven method to link tutor training with real-life assessment. In doing so, we contribute open datasets, AI prompts, and scoring rubrics to support transparency and reproducibility.

07.
arXiv (CS.CL) 2026-06-16

RASST: Retrieval-Augmented Simultaneous Speech Translation

Simultaneous speech translation produces target text incrementally from partial speech input. Recent speech large language models have markedly improved SST quality but still struggle with rare and domain-specific terminology. Retrieval augmentation has helped in automatic speech recognition and neural machine translation, but extending it to SST is non-trivial: retrieval must be fast and accurate under partial speech, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which addresses both challenges. For accurate cross-modal retrieval under partial input, RASST trains a lightweight speech-text retriever that produces chunkwise terminology hints for the Speech LLM via multi-scale retrieval. To use these hints correctly, we synthesize training data that teaches the Speech LLM to decide whether and when to apply each retrieved term. Experiments on ACL 60/60 dev set and the ESO test set show that RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points, with negligible computational overhead.

08.
bioRxiv (Bioinfo) 2026-06-10

HOMED enables hierarchical and multimodal optimization of DNA methylation deconvolution across tissues

Cellular heterogeneity is a major confounder in bulk DNA methylation data for epigenome-wide association studies. Existing reference-based DNAm deconvolution methods often ignore hierarchies among related cell types and may generalize poorly across datasets due to limited variability in reference profiles. We developed HOMED (Hierarchically Optimized Methylation Deconvolution), a framework that integrates cell-lineage hierarchies, single-cell RNA sequencing-guided deconvolution, and paired bulk RNA-seq/DNAm data for CpG signature optimization. Across simulated and real peripheral blood mononuclear cell, lung, and placental datasets, HOMED consistently yielded the highest PCCs and lowest RMSEs, outperforming existing scRNA-seq-guided DNAm deconvolution methods, improving accuracy, resolution, and cross-tissue generalizability.

09.
arXiv (CS.CV) 2026-06-19

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores – an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) – plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.

10.
arXiv (math.PR) 2026-06-11

Martingale Solutions to a Stochastic Keller-Segel System with nonlocal Source and Super-linear Noise

arXiv:2606.11774v1 Announce Type: new Abstract: Global nonnegative martingale solutions are shown to exist for a stochastic Keller-Segel system with a nonlocal Fisher-KPP source and super-linear multiplicative noise. The result is obtained for nonnegative initial data with no smallness assumption, provided that the nonlocal source term is dominant. The main difficulty stems from the absence of a coercive structure and the super-linear nature of the noise. An additional cut-off with finite L^2 norm in the classical Galerkin method is added to establish a well-posed approximation problem. Moreover, due to the nonlocal Fisher-KPP structure, it is necessary to prove the positivity of the approximating solution in order to obtain uniform estimates. In the compactness arguments, the usual tightness argument in the framework of Hilbert spaces cannot be directly applied to the uniform estimates obtained in this paper. As a result, we develop a more general version of the compactness argument and tightness criterion, presented in the appendix, which will be applied throughout the paper. This allows for the global existence of nonnegative martingale solutions to be derived from Jakubowski's version of the Skorokhod Theorem, along with a thorough discussion of the convergence properties.

11.
arXiv (CS.LG) 2026-06-17

Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning

arXiv:2606.17233v1 Announce Type: new Abstract: In many engineering applications, a single high-fidelity model produces multiple quantities of interest (QoIs) under the same input parameters, e.g. finite element models of complex physical systems. To alleviate the high computational cost of direct model evaluations, surrogate models are widely used to construct efficient approximations of model responses. Naturally, the accuracy of surrogates strongly depends on the quality of the experimental design (ED). However, a single ED may not provide an adequate representation for all outputs simultaneously, especially when different outputs exhibit varying sensitivities to the input variables. A straightforward solution is to perform separate sampling for each output, but this results in increased sampling complexity and computational cost. From a statistical perspective, such an approach also ignores potential correlations among all outputs and may compromise data consistency. To address this issue, an adaptive sequential sampling method for constructing polynomial chaos expansion surrogate models is generalized for vector valued QoIs. The method sequentially selects new samples from a candidate pool based on their local contribution to the output variance, while balancing distance-based exploration of the input space and exploitation of aggregated variance information across all outputs. Its performance is compared with non-sequential Latin Hypercube Sampling through several numerical examples from engineering problems. Numerical results demonstrate that the proposed strategy improves both surrogate accuracy and stability, and provides a more reliable estimation of second-order statistics.

12.
bioRxiv (Bioinfo) 2026-06-14

Generative design of antigen-specific T-cell receptor sequences with a conditional diffusion model

T cell receptor (TCR)-based immunotherapy holds immense potential for treating cancers and infectious diseases, where highly antigen-specific TCR recognition is crucial for adaptive immunity against tumors and pathogens. Engineering or de novo generation of the complementarity-determining region 3 (CDR3) loops of TCRs using artificial intelligence offers a powerful alternative to designing reactive TCRs rather than laborious experimental screening. However, current in silico approaches are constrained by weak conditional guidance, limited flexibility, and a lack of rigorous functional validation. To address these limitations, we introduce TCRDiff, a generative diffusion framework for designing antigen-specific TCRs conditioned on peptide-MHC (pMHC) targets and germline-encoded variable genes. By leveraging pre-trained knowledge from massive T-cell repertoires and TCR-pMHC recognition data, TCRDiff generates CDR3{beta} sequences with state-of-the-art fidelity to native binding TCRs through a denoising diffusion process. Furthermore, incorporating the interface geometry features generated TCR-pMHC complexes with superior structural plausibility. As a proof of concept, we deployed TCRDiff in a systematic pipeline to design candidate TCRs for immunotherapy. In vitro activation assays validated that TCRDiff-generated TCRs specifically recognize the MAGE-A3 epitope with minimized off-target cross-reactivity. Together, TCRDiff establishes a powerful, validated computational paradigm to accelerate the development of TCR-based immunotherapies.

13.
arXiv (CS.CL) 2026-06-12

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

作者:

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.

14.
arXiv (CS.CV) 2026-06-17

Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications

Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm) and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and the dataset generated can be used to refine pose estimation methods for hand-impaired populations.

15.
arXiv (CS.AI) 2026-06-15

Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

arXiv:2602.00845v3 Announce Type: replace Abstract: Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, but yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states, establishing guarantees, including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval. The code is available at https://github.com/dl-m9/InfoReasoner

16.
arXiv (CS.CL) 2026-06-19

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

17.
medRxiv (Medicine) 2026-06-22

''Circumstantial Determinants'': An Efficient Approach to Reaching People in Need of HIV Prevention?

HIV prevention and testing programmes primarily reach people who self-refer or attend routine health services. Higher-risk individuals are missed if they are healthy, under-estimate their risk of infection or under-report sexual risk-behaviours. We assess a new approach to address limitations in existing programmes by targeting HIV services on ''Circumstantial Determinants'' (CDs) of HIV risk - the social circumstances, settings, and norms associated with behaviours that increase risk of HIV acquisition. Data on potential CDs and sexual behaviour were collected in a population survey in Zimbabwe in 2018/19 (N=9141). HIV-negative individuals reporting [&ge;] 1 sexual risk-behaviours were defined as the 'priority population' for HIV prevention. For each sex, six circumstantial determinants were associated with being in the priority population (aOR [&ge;] 1.30; p [&le;] 0.01). Reach and efficiency of CDs (and combinations) were calculated; ROC curve algorithms evaluated their ability to identify priority population membership; and HIV prevention condom cascades were compared between CD-defined priority population subgroups. Example findings include that targeting men at bars and beerhalls could reach 48.5% of the priority population and 25.1% of lower-risk men. These percentages increase to 77.1% and 53.7% if men with poor mental health, no religious affiliation, negative social capital, or living on agricultural estates are also targeted. Targeting women with poor mental health could reach 32.0% of the priority population and 21.3% of lower-risk women. Targeting additional circumstantial determinants increases these percentages to 54.1% and 37.5%, respectively. Cascade barriers to condom use differed between CD-defined subgroups. The Circumstantial Determinants approach demonstrates proof-of-concept potential to strengthen HIV prevention services.

18.
Nature Medicine 2026-06-08

Effects of SGLT2 inhibition on incident heart failure in carriers of cardiomyopathy-associated genetic variants

Although the beneficial effects of sodium–glucose cotransporter 2 (SGLT2) inhibition in heart failure (HF) have been well established, it is unknown whether SGLT2 inhibition confers benefit in carriers of rare variants in cardiomyopathy-associated genes. Here we evaluated whole-exome sequencing data from the randomized DECLARE-TIMI 58 trial, in which adults with type 2 diabetes and increased cardiovascular risk were randomized to dapagliflozin or placebo treatment. Pathogenic or likely pathogenic variants (P/LP) in high-confidence cardiomyopathy genes were identified, and treatment effects on hospitalization for HF (HHF) were compared between carriers of such variants and noncarriers. Among 12,685 patients for whom sequence data were obtained, 121 carried a cardiomyopathy variant (76 dilated cardiomyopathy, 25 hypertrophic cardiomyopathy and 25 arrhythmogenic cardiomyopathy). Over a median follow-up of 4.2 years, dapagliflozin lowered the risk of HHF more strongly in carriers (hazard ratio 0.18, 95% confidence interval 0.04–0.86) than in noncarriers (hazard ratio 0.70, 95% confidence interval 0.57–0.86; P interaction 0.03). Absolute risk reduction was 13.0% in carriers and 1.0% in noncarriers (P interaction 0.03). Most carriers (82%) had no prior HF, and in carriers without prior HF, treatment with dapagliflozin reduced the absolute risk of HHF by 12.8%, compared with a reduction of 0.6% in noncarriers (P interaction 0.01). The findings from this cohort of older and high-risk patients raise the possibility that SGLT2 inhibitor treatment should be started early to prevent HF in individuals who carry P/LP cardiomyopathy variants. These results need to be confirmed in a prospective, dedicated trial of preventive HF treatments in carriers of P/LP cardiomyopathy-associated variants. In a whole-exome sequencing analysis, the beneficial effects of the SGLT2 inhibitor dapagliflozin in reducing the risk of future heart failure hospitalization in individuals with type 2 diabetes were markedly greater in individuals who carried a cardiomyopathy-associated genetic variant compared with noncarriers, suggesting a personalized preventative therapy based on genetic information.

19.
arXiv (CS.LG) 2026-06-16

Polynomial-Time Mistake-Bounded Language Generation

arXiv:2606.16077v1 Announce Type: cross Abstract: In this note, we introduce a polynomial-time version of the mistake-bounded language generation (MBLG) framework due to Kleinberg, Peale, and Reingold (2026). We observe that the family of parities of variables, and the family of conjunctions of literals, are polynomial-time MBLG. Our main result states that the family of monotone Boolean functions with polynomially-many maxterms is polynomial-time MBLG. This family includes all monotone Boolean functions, computable by polynomial-size decision trees. Our technique can be presented as a new combinatorial game about writing numbers on a board.

20.
medRxiv (Medicine) 2026-06-10

Development of an Open-Access Action Observation Video Library for Upper Limb Motor Rehabilitation

Background: Occupational therapists can improve stroke survivors hand and arm movement and participation in daily activities through action observation (AO). AO involves watching another persons hand or arm complete a movement or task. While research generally supports the use of AO with stroke survivors, there are limited AO videos are available to occupational therapists which makes applying AO challenging. Objective: The purpose of this work is to develop structured and widely accessible tool to support access to AO for stroke survivors, occupational therapists, and researchers. Methods: To develop an AO video library for stroke rehabilitation, functional and non-functional upper limb task deficits were first identified through clinical observations and clinician interviews to establish a prioritized list of daily activities. In collaboration with media production specialists, healthy adult volunteers were recruited and filmed performing these tasks from both first- and third-person perspectives. The recorded videos were then systematically edited, enhanced with instructional title slides, and distributed via a public YouTube channel for clinical application and a categorized digital repository for research purposes. Results: Initial assessments revealed a complete lack of familiarity, awareness, and utilization of AO resources among local occupational therapists, despite high perceived clinical utility. To address this gap, a final library of 150 tasks was established, resulting in the production of 419 finalized, standardized videos featuring six healthy volunteers. For clinical application, these videos were hosted on a free, public YouTube channel organized into 18 functional playlists, while a parallel set was structured into distinct movement categories for research repository storage. Conclusion: By providing a structured and highly accessible tool, this repository enables clinicians, researchers, and caregivers to readily implement evidence-based action observation interventions in both clinical and home settings.

21.
arXiv (CS.LG) 2026-06-19

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

arXiv:2606.19549v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training – chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.

22.
arXiv (CS.AI) 2026-06-12

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

arXiv:2606.12918v1 Announce Type: cross Abstract: Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered as which agents are most responsible for system safety, and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We propose the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. GGuided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.

23.
arXiv (quant-ph) 2026-06-11

Non-Hermitian Delocalization Realizes Random Dirac Criticality in One Dimension

arXiv:2606.12089v1 Announce Type: cross Abstract: Non-Hermitian systems can evade Anderson localization and exhibit delocalized states even in one dimension. Here, we show that such non-Hermitian delocalized states under periodic boundary conditions (PBC) are intrinsically critical, realizing the universality class of one-dimensional random Dirac fermions. By linking spectral winding to topological Anderson transitions via Hermitization, we demonstrate that the delocalized PBC states exhibit a Dirac-type criticality with universal algebraic correlations. In contrast to Hermitian systems, where this criticality occurs only at fine-tuned transition points, it emerges generically in non-Hermitian systems as a consequence of spectral topology. These results identify a universal mechanism by which non-Hermiticity promotes criticality, providing a unified description of non-Hermitian delocalization in one dimension.

24.
arXiv (CS.CL) 2026-06-19

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at https://huggingface.co/davidmsmiley/MiqraBERT

25.
arXiv (CS.CV) 2026-06-16

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may overwhelm the visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to uncover visibility and hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for calibrated quality score. Furthermore, to ensure efficient and purposeful tool callings, we introduce a batch-aware training strategy to reward tool interactions that can yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.