Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (quant-ph) 2026-06-16

Intermodal entanglement in a quantum optical model of HHG due to the back-action on the driving field

arXiv:2603.01315v2 Announce Type: replace Abstract: Preparation of nonclassical light with special quantum properties is essential for quantum technologies. High-harmonic generation (HHG) is a process that not only enables the creation of attosecond pulses but also has the potential to generate light with intricate quantum properties. In a recent experiment [1], nonclassical inter-harmonic correlations were measured from an HHG source. In this work, we theoretically investigate entanglement between different harmonics within an effective quantum optical model. This model implements a significant degree of simplification regarding the processes within the target material, treating the material through susceptibilities, as is usual in quantum optics. Such an approach yields a general description of HHG, permitting the implications that can be derived within it to hold broadly. We find that entanglement is produced as a result of the often neglected back-action. We can qualitatively reproduce experimentally measured nonclassicalities, which suggests that intermodal entanglement can, to an extent, be considered a universal phenomenon associated with HHG, rather than a result of using specific material targets.

02.
arXiv (CS.CL) 2026-06-19

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.

03.
arXiv (CS.CL) 2026-06-16

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness – separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges – correct or not – are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from "is the edge correct?" to "admit or abstain under provenance uncertainty?" Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges – at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.

04.
arXiv (CS.AI) 2026-06-17

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

arXiv:2606.17904v1 Announce Type: new Abstract: Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation to constrain them to approved steps. In practice, however, operator queries frequently stray from this path, requiring models to recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer manufacturer converted into 1,676 multi-turn conversations that contrast compliant with out-of-scope utterances. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inadequate step rather than fabricating facts. The inherent plausibility and authority of this mapped but wrong advice exposes a challenging vulnerability for grounding systems.

05.
PLOS Computational Biology 2026-06-22

Heterogeneous suppressive effect of Wolbachia incompatible insect technique coupled with sterile insect technique across time and historical Ae. aegypti abundance - using distributional synthetic controls

Authors:

Yichen Zhai, Chia-Chen Chang, Zhiyong Xi, Cheong Huat Tan, Lee Ching Ng, Jue Tao Lim

Background: Biological control tools, such as the Wolbachia incompatible insect technique, are a promising class of interventions to modify and suppress Aedes aegypti mosquitoes to reduce the risk of Aedes-borne diseases. Due to the spatial nature of the intervention, intervention effects can be spatio-temporally heterogeneous. Yet, most evaluations of field-based technologies rely on average treatment effects, which preclude characterization and understanding of treatment effect heterogeneities and the factors influencing them. Methods: Here, we developed a causal inference framework using distributional synthetic controls to explicitly account for spatio-temporal trap-level mosquito abundance data to ascertain the entomological efficacy of Wolbachia in suppressing Ae. aegypti abundance. This method is able to construct counterfactual distributions of intervened areas, provide detailed comparisons to actual distributions and quantify treatment effects of the intervention on mosquito abundance over different quantiles. By applying our framework to trap-level mosquito abundance data from 57,990 unique mosquito traps routinely maintained and measured twice a week, and a large-scale field trial of the Wolbachia incompatible insect technique coupled with sterile insect technique (IIT-SIT) in Singapore, we (1) quantified heterogeneous treatment effects for IIT-SIT across time since intervention, the traps’ historical mosquito abundance, and calendar time, (2) assessed whether elimination of wild-type Aedes aegypti was possible in intervention locations and (3) examined whether suppressive effects in spillover locations adjacent to directly intervened locations were heterogeneous. Results: IIT-SIT interventions led to a strong suppressive effect on adult Aedes aegypti abundance. From the onset of intervention in directly treated locations, sector-specific intervention effectiveness (IE) ranged from 24.04% in the earliest treatment period to 86.08% in the latest treatment period. Raw reductions in Ae. aegypti abundance were also found to increase over time as sectors were intervened over longer time periods. In spillover sectors, IE was lower in magnitude and more variable, but average IE reached a maximum of 78.08% at 2 years post-treatment. Wolbachia interventions also led to an increase in the percentage of traps recording no mosquitoes from 6.8% at the start of intervention to 33.01% at 124 weeks post-intervention. We found that IE was higher in sectors with lower historical mosquito abundance. However, IE converged across sectors with different historical mosquito abundance as intervention time increased. Conclusion: This study revealed spatial heterogeneities in suppressing wild-type female Ae. aegypti by IIT-SIT and provided strong evidence that IIT-SIT can drastically suppress wild-type Ae. aegypti populations despite heterogeneous treatment effects over time.

06.
arXiv (CS.LG) 2026-06-16

ANCHOR: Error-Controlled Adaptive Numerical Correction for Neural Operator Time Marching

arXiv:2512.19643v2 Announce Type: replace Abstract: Numerical simulation of time-dependent partial differential equations (PDEs) is central to scientific and engineering applications, but high-fidelity solvers are often prohibitively expensive for long-horizon or time-critical settings. Neural operator (NO) surrogates offer fast inference across parametric and functional inputs; however, most autoregressive NO frameworks remain vulnerable to compounding errors, and ensemble-averaged metrics provide limited guarantees for individual inference trajectories. In practice, error accumulation can become unacceptable beyond the training horizon, and existing methods lack mechanisms for online monitoring or correction. To address this gap, we propose ANCHOR (Adaptive Numerical Correction for High-fidelity Operator Rollouts), an online, instance-aware hybrid inference framework for stable long-horizon prediction of nonlinear, time-dependent PDEs. ANCHOR treats a pretrained NO as the primary inference engine and adaptively couples it with a classical numerical solver using a physics-informed, residual-based error estimator. Inspired by adaptive time-stepping in numerical analysis, ANCHOR monitors an exponential moving average (EMA) of the normalized PDE residual to detect accumulating error and trigger corrective solver interventions without requiring access to ground-truth solutions. We show that the EMA-based estimator correlates strongly with the true relative L2 error, enabling data-free, instance-aware error control during inference. Evaluations on six canonical PDEs: 1D and 2D Burgers', 2D Allen-Cahn, 2D Cahn-Hilliard, 2D Navier-Stokes, and 3D heat conduction, demonstrate that ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and significantly improves robustness over standalone neural operators, while remaining substantially more efficient than high-fidelity numerical solvers.
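The monitoring idea in the abstract above (track an exponential moving average of the normalized PDE residual and trigger a solver correction when it grows too large) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ema_monitor`, the smoothing factor `alpha`, the threshold `tol`, the residual values, and the reset of the average after a correction are all hypothetical choices made for the example.

```python
def ema_monitor(residuals, alpha=0.3, tol=0.5):
    """Return the rollout steps at which a corrective solver call would fire.

    residuals: normalized PDE residuals, one per rollout step (made-up inputs).
    alpha:     EMA smoothing factor (assumed value).
    tol:       trigger threshold on the EMA (assumed value).
    """
    ema = 0.0
    triggers = []
    for step, r in enumerate(residuals):
        # Standard exponential moving average update.
        ema = alpha * r + (1.0 - alpha) * ema
        if ema > tol:
            triggers.append(step)
            # Assumption: the estimate restarts after the solver corrects the state.
            ema = 0.0
    return triggers


# Hypothetical residual trace that drifts upward and is then corrected.
demo = [0.1, 0.2, 0.4, 0.9, 1.5, 0.1, 0.1]
print(ema_monitor(demo))
```

Because the average smooths single-step spikes, the trigger in this toy trace fires only after the residual has been elevated for two consecutive steps, which is the behavior one would want from a detector of accumulating rather than transient error.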

07.
arXiv (CS.CL) 2026-06-18

Fair Cognitive Impairment Detection Through Unlearning

Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-model fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.

08.
arXiv (CS.LG) 2026-06-19

FloatDoor: Platform-Triggered Backdoors in LLMs

arXiv:2606.19535v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
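The root cause named in the abstract above, non-associative floating-point arithmetic, is easy to observe in isolation. The snippet below is only a sketch of that numerical effect, not of the attack: it shows that changing the grouping of a sum changes the result, which is why kernels that reduce in different orders can disagree in the last bits. The variable names are illustrative.

```python
# Same three addends, two different reduction orders.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one possible accumulation order
right = a + (b + c)  # another order, as a different kernel might use

print(left == right)       # the two groupings do not agree exactly
print(abs(left - right))   # the gap is on the order of one unit in the last place
```

The discrepancy here is tiny, but it is deterministic for a given order of operations, which is what makes it usable as a platform signature in principle.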

09.
arXiv (quant-ph) 2026-06-12

Geometric Algebra Quantum Gate Decomposition

arXiv:2606.12480v1 Announce Type: new Abstract: Quantum gates are usually described through matrix and tensor-product formalisms that often obscure their geometric structure. In this work, we formulate the Pauli and Clifford groups within the complex Geometric Algebra (GA) framework. We show that the Pauli group is naturally identified with the group of blades up to a global phase, thereby providing a geometric interpretation of Pauli operators and their commutation relations in terms of oriented subspaces. We further prove that Clifford operators are generated by products of $\pi/4$-Pauli rotors and introduce a greedy Pauli rotor decomposition algorithm whose empirical behavior suggests unexpectedly compact decompositions for Clifford operators. Finally, we show that Clifford+T universality admits a natural geometric interpretation through $\pi/8$-rotors within this framework.
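The role of the two rotor angles in the abstract above can be checked numerically in the ordinary matrix picture. The sketch below is not the paper's geometric-algebra formulation or its decomposition algorithm. It assumes the common convention that a Pauli rotor is exp(-i·θ·P) for a Pauli matrix P, and the helper names (`matmul`, `dagger`, `rotor`, `close`) are hypothetical. Under that assumption, a π/4 rotor about Z maps X to Y under conjugation (Clifford behavior), and a π/8 rotor about Z equals the T gate up to a global phase.

```python
import cmath
import math

def matmul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(A):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(A[j][i]).conjugate() for j in range(2)] for i in range(2)]

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

def rotor(P, theta):
    """exp(-i*theta*P) = cos(theta)*I - i*sin(theta)*P, valid because P*P = I."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * I2[i][j] - 1j * s * P[i][j] for j in range(2)] for i in range(2)]

def close(A, B, eps=1e-12):
    """Entrywise comparison up to floating-point tolerance."""
    return all(abs(A[i][j] - B[i][j]) < eps for i in range(2) for j in range(2))

# pi/4 rotor: conjugation sends the Pauli X to the Pauli Y.
R4 = rotor(Z, math.pi / 4)
x_image = matmul(matmul(R4, X), dagger(R4))
print(close(x_image, Y))

# pi/8 rotor: diagonal with relative phase exp(i*pi/4), i.e. T up to global phase.
R8 = rotor(Z, math.pi / 8)
relative_phase = R8[1][1] / R8[0][0]
print(abs(relative_phase - cmath.exp(1j * math.pi / 4)) < 1e-12)
```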

10.
medRxiv (Medicine) 2026-06-18

Rare Coding Variants Reveal Distinct Genetic Architectures Across Multidimensional Sleep Phenotypes

Sleep and circadian traits have been widely studied using common variants, but the contribution of rare coding variation remains unclear. We analyzed rare coding variants in 397,065 whole-exome sequenced UK Biobank participants across 36 sleep phenotypes from self-report, diagnoses, sleep medication use and accelerometry, and meta-analyzed results with 171,536 whole-genome sequenced All of Us participants of diverse ancestries, with replication in the Mass General Brigham Biobank (N = 31,275). We identified 260 genes associated with sleep phenotypes, including novel associations with sleep medication use in 29 genes, 24 of which have not previously been reported for any sleep phenotype. We observed modest but significant rare variant heritability and strong genetic correlations between sleep medication use, insomnia and fatigue. Temporal gene expression trajectory analyses indicate that genes associated with self-reported sleep traits show constant high prenatal expression, whereas genes linked to sleep medication phenotypes exhibit peak expression in the late prenatal period. These findings highlight distinct biological mechanisms captured by different measurement sources of sleep phenotypes and reveal rare-variant-informed targets for therapeutic discovery.

11.
arXiv (CS.AI) 2026-06-18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

arXiv:2606.19245v1 Announce Type: new Abstract: Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

12.
arXiv (CS.CL) 2026-06-17

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real-document benchmark of 122 tasks across five professional domains (financial, legal, medical, scientific, DevOps) using actual SEC filings, Federal Register rules, PubMed abstracts, arXiv papers, and GitHub postmortems. Paraphrasing, the strongest defense on synthetic benchmarks, shows no statistically significant attack success rate reduction on real documents (p=0.500) while degrading utility from 91.8% to 82.8%. We introduce PARSE (Provenance-Aware Retrieval Sanitization), a domain-aware, fact-preserving sanitization pipeline that classifies each sentence by injection likelihood, extracts structured facts before rewriting, and verifies fact preservation via a consistency-checking loop. A directiveness gate routes 59% of real enterprise documents to a lightweight path, concentrating computational cost on high-risk documents. PARSE achieves 15.6% attack success rate – a 38% reduction versus the 25.4% baseline – at 86.9% utility, the only condition that is both statistically significant (p=0.014, adequately powered) and maintains near-baseline utility. Practitioners should evaluate defenses on domain-matched real documents, not synthetic proxies.
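The headline figures in the abstract above mix relative and absolute changes, so a short arithmetic check may help when reading them. The sketch below uses only the percentages quoted in the abstract. The helper name `relative_reduction` is hypothetical, and treating 91.8% as the undefended baseline utility is an assumption inferred from the wording, not something the abstract states outright.

```python
def relative_reduction(baseline, treated):
    """Fractional drop from baseline to treated (both given in percent)."""
    return (baseline - treated) / baseline


# Attack success rate: 25.4% baseline vs 15.6% with PARSE (figures from the abstract).
asr_rel = relative_reduction(25.4, 15.6)
asr_abs = 25.4 - 15.6  # absolute change in percentage points

# Utility cost in percentage points, assuming 91.8% is the undefended baseline.
parse_utility_drop = 91.8 - 86.9
paraphrase_utility_drop = 91.8 - 82.8

print(f"relative ASR reduction: {asr_rel:.1%}")
print(f"absolute ASR reduction: {asr_abs:.1f} points")
print(f"utility drop, PARSE: {parse_utility_drop:.1f} points")
print(f"utility drop, paraphrasing: {paraphrase_utility_drop:.1f} points")
```

The relative reduction works out to roughly 38.6%, consistent with the 38% quoted in the abstract, while the absolute reduction is just under 10 percentage points.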

13.
medRxiv (Medicine) 2026-06-22

Use of the Pharmacy First service in England in the first 12 months: geographic variation and health system context

Objectives: The Pharmacy First (PF) service was introduced across England from 31 January 2024 to expand the clinical role of community pharmacies and improve access to primary care. This paper describes use of PF in its first 12 months, in terms of uptake, access routes, consultation outcomes, geographic variations, service costs and antimicrobial supply. Methods: A descriptive analysis of all PF consultations submitted for payment to NHS Business Services Authority in England between 31 January 2024 and 31 January 2025. Pharmacy-level consultation data were linked to national data on population, location and pharmacy characteristics. PF use was examined using population-standardised consultation rates and consultations per pharmacy. Results: During the first year of implementation, 2,205,731 PF consultations were recorded as delivered across 11,349 pharmacies, with payment of GBP123 million to pharmacies. Uptake increased steadily over time. Most consultations were for acute sore throat (33%) and uncomplicated urinary tract infection (27%), with corresponding antibiotics, phenoxymethylpenicillin and nitrofurantoin being the most supplied. Most people self-referred (74%) into the service, with 95% of consultations managed without onward referral. Substantial geographic variation was observed. Northern regions had higher use based on the eligible population. The South East and Midlands had higher activity per pharmacy. London showed a distinct pattern, with higher self-referral into the service, lower medication supply and higher referral to other healthcare services. Higher consultation volume was weakly associated with pharmacy characteristics, including opening hours, pharmacy type and retail setting, and local context, in terms of socio-economic and geographic factors. Conclusions: PF had immediate uptake and is operating primarily as a direct-access model for common acute conditions. 
Findings suggest that PF is contributing to improved access to care and may shift demand away from general practice. However, the service uptake appears to be shaped by geographic location, proximity to other healthcare services and pharmacy characteristics.

14.
arXiv (CS.LG) 2026-06-16

A Multimodal Approach to Alzheimer's Diagnosis: Geometric Insights from Cube Copying and Cognitive Assessments

arXiv:2512.16184v2 Announce Type: replace Abstract: Early and accessible detection of Alzheimer's disease (AD) remains a critical clinical challenge, and cube-copying tasks offer a simple yet informative assessment of visuospatial function. This work proposes a multimodal framework that converts hand-drawn cube sketches into graph-structured representations capturing geometric and topological properties, and integrates these features with demographic information and neuropsychological test (NPT) scores for AD classification. Cube drawings are modeled as graphs with node features encoding spatial coordinates, local graphlet-based topology, and angular geometry, which are processed using graph neural networks and fused with age, education, and NPT features in a late-fusion model. Experimental results show that graph-based representations provide a strong unimodal baseline and substantially outperform pixel-based convolutional models, while multimodal integration further improves balanced classification performance and discriminative ability. SHAP-based interpretability analysis identifies specific graphlet motifs associated with corner integrity and edge continuity as key predictors, closely aligning with clinical observations of distorted cube drawings in AD. Together, these findings establish graph-based analysis of cube-copying behavior as an interpretable, non-invasive, and scalable framework for Alzheimer's disease screening.

15.
arXiv (CS.LG) 2026-06-18

A Survey on Data-Driven Models for Soil Moisture Regression and Classification

arXiv:2606.18316v1 Announce Type: new Abstract: Soil Moisture (SM) modelling constitutes a complex spatiotemporal learning problem characterised by nonlinear environmental interactions, heterogeneous data sources, and limited ground observations. Physics-based approaches, such as water balance models, rely on explicit hydrological equations and high-quality inputs, but their computational cost and scalability limitations restrict large-scale deployment. Data-driven artificial intelligence (AI) methods have emerged as flexible alternatives, enabling the extraction of empirical relationships between soil moisture and environmental variables with reduced modelling assumptions. This work presents a structured survey of AI-based models for soil moisture estimation and classification. Existing approaches are organized into five categories: (a) statistical time-series models, (b) geostatistical methods, (c) classical machine learning (ML) models, (d) deep learning (DL) models, and (e) probabilistic/Bayesian methods. These models leverage historical soil moisture records, meteorological variables, vegetation indices, topography, soil characteristics, and geolocation data to perform regression or classification tasks.

16.
arXiv (CS.CV) 2026-06-12

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

17.
arXiv (CS.LG) 2026-06-18

Toward Simultaneously Optimal Regret in U-Calibration

arXiv:2606.18527v1 Announce Type: cross Abstract: U-calibration studies online forecasting algorithms whose predictions can be consumed by any unknown downstream agent, guaranteeing sublinear regret simultaneously for all proper loss functions. Existing U-calibration algorithms achieve worst-case optimal $O(\sqrt{T})$ regret for every bounded proper loss, but they fail to adapt to easier losses: as we show, even for smooth losses such as squared loss, they incur $\Omega(\sqrt{T})$ regret instead of the optimal $O(\log T)$ regret. In this work, we show that this limitation is not inherent. Specifically, we design a single forecast algorithm that simultaneously achieves $\tilde O(\sqrt{T})$ regret for every bounded proper loss and $O(\log T)$ regret for every bounded smooth proper loss. More generally, our algorithm also attains logarithmic regret for losses that are smooth relative to the log-barrier, which include several non-Lipschitz examples. Our approach is based on a novel variant of Follow-the-Perturbed-Leader (FTPL) in which perturbations are applied directly in the prediction space using self-concordant noise. The resulting analysis also departs substantially from prior FTPL analyses due to the complex nature of this noise and may be of independent interest.

18.
arXiv (CS.CV) 2026-06-12

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves both glyph quality and local detail fidelity.

19.
arXiv (CS.CL) 2026-06-16

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.

20.
arXiv (CS.AI) 2026-06-17

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

arXiv:2606.17368v1 Announce Type: new Abstract: Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.

21.
arXiv (CS.LG) 2026-06-16

A Fully First-Order Layer for Differentiable Optimization

arXiv:2512.02494v2 Announce Type: replace Abstract: Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{O}(1)$ time, leading to an overall complexity of $\tilde{O}(\delta^{-1}\epsilon^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. The source code is available at https://github.com/guaguakai/FFOLayer.
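
To make concrete the cost that the abstract refers to, here is a minimal sketch of implicit differentiation through an unconstrained quadratic inner problem. It shows the Hessian linear solve that a first-order method aims to avoid. It is not the paper's algorithm and does not use the FFOLayer API; the toy problem and all function names are assumptions chosen for illustration.

```python
import numpy as np

def inner_solution(Q, theta):
    """y*(theta) = argmin_y 0.5 * y^T Q y - theta^T y, with Q symmetric positive definite."""
    return np.linalg.solve(Q, theta)

def outer_loss(Q, theta, target):
    """Outer objective evaluated at the inner solution."""
    y = inner_solution(Q, theta)
    return 0.5 * np.sum((y - target) ** 2)

def implicit_hypergradient(Q, theta, target):
    """Gradient of outer_loss with respect to theta via the implicit function theorem.

    The inner optimality condition is Q y - theta = 0, so dy/dtheta = Q^{-1}.
    The chain rule then needs one linear solve with the inner Hessian Q.
    """
    y = inner_solution(Q, theta)
    grad_y = y - target
    return np.linalg.solve(Q, grad_y)  # the Hessian solve

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
Q = M @ M.T + 4.0 * np.eye(4)  # symmetric positive definite
theta = rng.normal(size=4)
target = rng.normal(size=4)
print(implicit_hypergradient(Q, theta, target))
```

For an inner problem with n variables, the solve scales roughly cubically in n with a dense factorization, and the Hessian itself has to be formed or applied, which is the compute and memory burden the abstract describes.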

22.
arXiv (CS.CL) 2026-06-11

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.
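
The idea of compressing a candidate pool into a small, complementary portfolio can be illustrated with a greedy maximum-coverage selection. This is a generic stand-in, not the CuraFlow procedure, which the abstract does not specify; the workflow names, the query ids, and the function `greedy_portfolio` are hypothetical.

```python
def greedy_portfolio(solved_by, k):
    """Pick up to k workflows that together solve as many queries as possible.

    solved_by maps a workflow name to the set of query ids it solves.
    Returns the chosen workflow names in selection order and the covered query ids.
    """
    chosen, covered = [], set()
    remaining = dict(solved_by)
    for _ in range(k):
        # Choose the workflow that adds the most queries not yet covered.
        best = max(remaining, key=lambda w: len(remaining[w] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left adds coverage, so further picks would be redundant
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

pool = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}, "D": {5}}
print(greedy_portfolio(pool, k=2))
```

Here workflow C is never selected because A already solves every query that C solves, which mirrors the abstract's point that a useful portfolio favors complementary over redundant candidates.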

23.
medRxiv (Medicine) 2026-06-18

Avidity of anti-pertussis toxin antibodies is associated with symptomatic Bordetella pertussis infection in a novel controlled human infection model

Background: The association between functional antibody responses following Bordetella pertussis infection and symptomatic disease remains unclear. We characterized the maturation of anti-pertussis toxin (PT) IgG avidity after human challenge with B. pertussis and determined its association with symptomatic infection. Methods: Healthy adults were intranasally inoculated with live B. pertussis organisms in a controlled human infection model and monitored for development of pertussis symptoms (NCT05136599). Serum samples were collected one day before inoculation and at 14, 28, 56, 180, and 365 days post-challenge. Anti-PT IgG avidity was tested using a titration of ammonium isothiocyanate (the bond-breaking agent) to quantify a wide range of antibody avidities, from low to very high. Associations between covariates and avidity were examined using linear regression models, and high-dimensional analyses were used to integrate all data. Findings: Anti-PT IgG avidity increased in both symptomatic (n=20) and asymptomatic (n=10) participants after the challenge, reached maximum levels at day 56, and then declined through day 365. Symptomatic participants developed significantly higher levels of high- and very-high-avidity anti-PT antibodies at 28, 56, 180, and 365 days post-challenge than those who remained asymptomatic. In multivariate analyses, symptomatic infection was associated with higher levels of high- and very-high-avidity anti-PT IgG at days 180 and 365 after challenge. Distinct avidity profiles in symptomatic versus asymptomatic participants emerged from day 28 onwards, with the symptomatic group having higher levels of higher-avidity antibodies. However, levels of medium-high-, high-, and very-high-avidity antibodies in symptomatic participants were lower at day 365 after challenge than at their peak. Interpretation: Anti-PT IgG avidity was associated with symptomatic B. pertussis infection and thus may serve as a surrogate of clinical disease outcome. These results highlight that antibody avidity provides a functional assay, in addition to antibody quantitation, for dissecting immune responses to pertussis. Anti-PT IgG avidity should be investigated further in natural pertussis outbreaks to determine whether it can differentiate symptomatic from asymptomatic infections for epidemiologic purposes.

24.
arXiv (CS.CV) 2026-06-16

The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture–shape gap between CNNs and attention models.
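
The classic input-level version of the phase/magnitude swap can be sketched in a few lines of NumPy. Note the assumption: this operates on raw 2D arrays in pixel space, whereas the paper applies the transplant to hidden-layer activations, and the function name `swap_phase` is illustrative.

```python
import numpy as np

def swap_phase(magnitude_donor, phase_donor):
    """Build a hybrid image from one image's Fourier magnitude and another's phase."""
    if magnitude_donor.shape != phase_donor.shape:
        raise ValueError("both images must have the same shape")
    magnitude = np.abs(np.fft.fft2(magnitude_donor))
    phase = np.angle(np.fft.fft2(phase_donor))
    hybrid_spectrum = magnitude * np.exp(1j * phase)
    # For real inputs the hybrid spectrum is conjugate-symmetric up to rounding,
    # so the imaginary part of the inverse transform is numerical noise.
    return np.fft.ifft2(hybrid_spectrum).real

rng = np.random.default_rng(0)
a = rng.normal(size=(32, 32))  # stands in for the magnitude donor image
b = rng.normal(size=(32, 32))  # stands in for the phase donor image
hybrid = swap_phase(a, b)
print(np.corrcoef(hybrid.ravel(), b.ravel())[0, 1])
print(np.corrcoef(hybrid.ravel(), a.ravel())[0, 1])
```

Even on random arrays the hybrid correlates far more strongly with the phase donor than with the magnitude donor, which is the asymmetry the abstract tests for inside trained networks.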

25.
arXiv (CS.AI) 2026-06-19

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

arXiv:2606.19528v1 Announce Type: cross Abstract: Fine-tuning of Large Language Models (LLMs) using Low-Rank Adaptation (LoRA) on an end-user's data offers personalized experiences while keeping data private, but faces severe memory constraints on consumer hardware. Peak memory during fine-tuning often exceeds device limits, especially for models with billions of parameters and long-context training data. This paper introduces a suite of complementary techniques to reduce memory footprint without sacrificing model quality: (1) base model quantization with on-the-fly dequantization, (2) memory-efficient checkpointing combining selective activation caching and disk offloading, (3) softmax approximation using semantically relevant token subsets, and (4) logits masking. Experiments on Llama-3.2 3B and Qwen-2.5 3B demonstrate up to $26\times$ and $28\times$ reduction in peak memory, enabling fine-tuning on resource-constrained devices.
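
Technique (1) can be pictured with a small NumPy sketch of a LoRA linear layer whose frozen base weight is stored in int8 and dequantized only inside the forward pass. The symmetric per-row int8 scheme, the function names, and the shapes are assumptions for illustration; the abstract does not state which quantization format the paper uses.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row int8 quantization, so that w is approximately q * scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def lora_forward(x, q, scale, lora_a, lora_b, alpha):
    """y = x W^T + (alpha / r) * x A^T B^T, with W rebuilt from int8 on the fly.

    Shapes: x (batch, d_in), q (d_out, d_in), lora_a (r, d_in), lora_b (d_out, r).
    """
    w = q.astype(np.float32) * scale  # temporary float copy, discarded after use
    rank = lora_a.shape[0]
    return x @ w.T + (alpha / rank) * (x @ lora_a.T) @ lora_b.T

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)    # frozen base weight
q, scale = quantize_int8(w)
lora_a = rng.normal(size=(2, 16)).astype(np.float32)
lora_b = np.zeros((8, 2), dtype=np.float32)        # common zero init for B
x = rng.normal(size=(4, 16)).astype(np.float32)
y = lora_forward(x, q, scale, lora_a, lora_b, alpha=4.0)
print(y.shape, q.nbytes, w.nbytes)
```

Only the int8 matrix and the small adapter matrices stay resident between steps; in this sketch the stored base weight takes a quarter of the float32 footprint, at the price of a dequantization on every forward pass.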