Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-12

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

02.
arXiv (CS.LG) 2026-06-16

Enhancing Visual Feature Attribution via Weighted Integrated Gradients

arXiv:2505.03201v4 Announce Type: replace-cross Abstract: Integrated Gradients (IG) is a widely used attribution method in explainable AI, particularly in computer vision applications where reliable feature attribution is essential. A key limitation of IG is its sensitivity to the choice of baseline (reference) images. Multi-baseline extensions such as Expected Gradients (EG) assume uniform weighting over baselines, implicitly treating all baseline images as equally informative. In high-dimensional vision models, this assumption often leads to noisy or unstable explanations. This paper proposes Weighted Integrated Gradients (WG), a principled approach that evaluates and weights baselines to enhance attribution reliability. WG introduces an unsupervised criterion for baseline suitability, enabling adaptive selection and weighting of baselines on a per-input basis. The method preserves the core axiomatic properties of IG in a generalized weighted-baseline form. Under an expected, proxy-based fitness–relevance monotonicity assumption, WG provides a probabilistic justification for assigning larger weights to more informative baselines. Experiments on commonly used image datasets and models show that WG improves over EG under our protocol, with up to 36% gains across evaluated convolutional and Transformer architectures. These gains come with additional fitness-evaluation cost, so WG should be viewed as an attribution-fidelity trade-off rather than a faster alternative to EG. By moving beyond the assumption that all baselines contribute equally, Weighted Integrated Gradients offers a clearer and more reliable approach to explaining computer-vision models, improving both understanding and practical usability in explainable AI.

03.
PLOS Computational Biology 2026-06-12

Ten simple rules for executing an inherited research plan in computational biology

by Sahar Javaheri Tehrani, Toni Ingolf Gossmann Trainees in computational biology frequently inherit research plans whose aims, datasets, analytical strategies, and technical constraints were defined before their arrival. These plans often emerge from grants, collaborations, legacy codebases, shared high-performance computing environments, or partially completed analyses. While such plans provide a useful scaffold, they rarely specify all implementation details, prior assumptions, evaluation criteria, or dependencies needed for reliable execution. The transition from inheriting a partially articulated plan to producing reproducible results therefore creates an execution gap: a phase in which trainees must reconstruct what the project is, which elements are fixed, which remain negotiable, and which technical or organizational assumptions need to be tested before full-scale analysis begins. In this Ten Simple Rules article, we provide a practice-oriented framework for stabilizing inherited computational biology projects before workflows, benchmarks, and decision paths become entrenched. We do not claim that the individual practices described here are novel in isolation. Rather, our contribution is to organize familiar practices into a sequenced framework for a recurrent but under-articulated phase of computational research: inherited-plan execution. Computational biology makes this phase especially important because projects often combine heterogeneous datasets, fragile software environments, undocumented preprocessing choices, benchmarking assumptions, distributed collaborators, and asymmetrical access to contextual knowledge. By making this transition visible and operational, the rules aim to help trainees, supervisors, and collaborators reduce ambiguity, test feasibility, document decisions, and support reproducible and equitable project execution under real-world constraints.

04.
arXiv (CS.AI) 2026-06-16

Token Reduction Should Go Beyond Efficiency in Generative Models – From Vision, Language to Multimodality

arXiv:2505.18227v4 Announce Type: replace-cross Abstract: In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, agentic framework design, and broader ML and scientific domains.

05.
arXiv (CS.AI) 2026-06-11

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

arXiv:2606.11463v1 Announce Type: cross Abstract: Accurate loss reserving is foundational to insurer solvency, yet accelerating climate driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than Chain Ladder, Bornhuetter Ferguson, and Cape Cod methods. Using 15 plus years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15, 20% in reserve accuracy for catastrophe exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

06.
arXiv (CS.AI) 2026-06-17

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

arXiv:2606.17312v1 Announce Type: new Abstract: Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently – a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion – measuring how much sampled answers differ – but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness – consistent with settings where multiple plausible solution paths remain competitive – while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.

07.
arXiv (CS.AI) 2026-06-16

MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

arXiv:2603.00680v4 Announce Type: replace Abstract: Long-horizon agents face the challenge of growing context size during interaction with environment, which degrades the performance and stability. Existing methods typically introduce the external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content and aligning with the agent's overarching task objectives. To address these limitations, we propose the self-memory policy optimization algorithm (MemPO), which enables the agent (policy model) to autonomously summarize and manage their memory during interaction with environment. By improving the credit assignment mechanism based on memory effectiveness, the policy model can selectively retain crucial information, significantly reducing token consumption while preserving task performance. Extensive experiments and analyses confirm that MemPO achieves absolute F1 score gains of 25.98 over the base model and 7.1 over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%. The code is released at https://github.com/TheNewBeeKing/MemPO.

08.
arXiv (CS.AI) 2026-06-16

When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

arXiv:2606.15695v1 Announce Type: cross Abstract: Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts. Existing FCIL methods often preserve old knowledge through input-space synthesis, but they can be fragile under heterogeneous task streams and difficult to transfer across modalities. To alleviate such issues, we propose PRO, a framework that replaces synthetic input replay with projected rehearsal orchestration. To remove external pretraining, we evaluate all methods under the same warmup. After this, PRO maintains compact class-level projected memories on the server and allows clients perform balanced pseudo multi-task training over current examples and old projected memories. To handle stronger representation drift, we further introduce PRO-MAX, which augments PRO with neighborhood-weighted memory alignment while preserving the same server-light principle that the server only aggregates model updates and memory statistics. Across image, text, and graph benchmarks, PRO and PRO-MAX improve retention and final utility under heterogeneous streams while remaining competitive in homogeneous FCIL. Even when baselines are given expanded replay budgets, they degrade under supervision imbalance and stage misalignment, indicating that replay quantity alone does not resolve replay-quality failures. Additional weak-task diagnostics further show that larger replay mismatch is associated with larger downstream degradation, while our method keeps projected memories better aligned with the evolving representation.

09.
arXiv (CS.LG) 2026-06-12

Adaptive Weighted Averaging

arXiv:2606.12763v1 Announce Type: new Abstract: We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable "no-compromise" guarantee: they are never worse than standard random iterate selection, and yet can be significantly better in benign settings.

10.
arXiv (CS.AI) 2026-06-15

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

arXiv:2606.13705v1 Announce Type: cross Abstract: Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These "surgeries" can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.

11.
arXiv (quant-ph) 2026-06-12

Schrödinger Symmetry in Spherically-symmetric Static Mini-superspaces with Matter Fields

arXiv:2512.13651v3 Announce Type: replace-cross Abstract: Schr\"{o}dinger symmetry has been shown to emerge in a ``fluid limit" from the full superspace to several mini-superspace models. To investigate one aspect of the robustness of this emergent symmetry, we consider two spherically-symmetric static mini-superspace models with matter fields at the classical level: (i) a Maxwell field with a cosmological constant and (ii) $n$ massless scalar fields. By developing a method based on canonical transformations, we demonstrate that for model (i), 3D Schrödinger symmetry emerges, and the solution is the (anti-)de Sitter Reissner-Nordström spacetime, and for model (ii), $(2+n)$D Schrödinger symmetry appears, and the solution is a generalized Janis-Newman-Winicour spacetime and its ``interior", a Kantowski-Sachs type closed universe. Furthermore, for the vacuum model, we find that 2D Schrödinger symmetry holds with different lapse functions and mini-superspace coordinates, suggesting the potential, yet unconfirmed, covariance of the symmetry. Finally, we propose a physical interpretation of the symmetry under the Hamiltonian constraint $H$: symmetry generators commuting with $H$ map a solution to another one, while those non-commuting with $H$ generate a new theory with the Schrödinger symmetry and the transformed configuration is a solution to the new theory. These results reinforce the robustness of the emergent Schrödinger symmetry and open new frontiers for exploring dynamics of matter and gravity.

12.
arXiv (CS.CV) 2026-06-15

MCR-VQGAN: A Scalable and Cost-Effective Tau PET Synthesis Approach for Alzheimer's Disease Imaging

Tau positron emission tomography (PET) is a critical diagnostic modality for Alzheimer's disease (AD), but its widespread clinical adoption is hindered by radiation exposure, limited availability, high clinical workload, and substantial financial costs. To address these limitations, we propose the Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network (MCR-VQGAN) to synthesize high-fidelity tau PET images from structural T1-weighted MRI. MCR-VQGAN advances the standard VQGAN architecture through three enhancements: multi-scale convolutions, ResNet blocks, and Convolutional Block Attention Modules (CBAM), which collectively improve the capture of local and global features. Using 222 paired T1-weighted MRI and tau PET scans from the ADNI database, we trained and compared MCR-VQGAN against cGAN, WGAN-GP, CycleGAN, and baseline VQGAN. MCR-VQGAN achieved superior image synthesis performance across all metrics (MSE = 0.0056 +/- 0.0061, PSNR = 30.65 +/- 4.47 dB, SSIM = 0.9263 +/- 0.0469). A CNN-based AD classifier trained on real tau PET achieved comparable accuracy on real (63.64%) and synthetic (65.91%) images, indicating that diagnostically relevant features are preserved. Regional SUVR-equivalent analysis across Braak-defined ROIs further indicated strong agreement between real and synthetic tau PET (Pearson r = 0.78-0.88; ICC = 0.71-0.84), with the strongest agreement in Braak V/VI (ICC = 0.838). Together, these results suggest that MCR-VQGAN offers a promising and scalable surrogate for conventional tau PET imaging, potentially improving the accessibility of tau biomarkers for AD research and clinical workflows.

13.
arXiv (CS.CV) 2026-06-16

Chroma-gated, differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

OKLCH – the cylindrical (lightness, chroma, hue) form of Ottosson's Oklab color space – is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued – a threshold switch that fires only at an achromatic endpoint – so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+\sigma^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $\sigma\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate's properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method's limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable – older engines, image and video pipelines, or GPU shaders.

14.
arXiv (CS.AI) 2026-06-12

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

arXiv:2601.19072v3 Announce Type: replace-cross Abstract: Large Language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations – where the generated review comments are ungrounded in the actual code – poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for a hallucination detection in LLM-generated code review comments without the reference. In this work, we design HalluJudge that aims to assess the grounding of generated review comments based on the context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference of the actual LLM-generated code review comments in the real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference of the actual LLM-generated review comments in the online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

15.
arXiv (CS.LG) 2026-06-18

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

arXiv:2606.18774v1 Announce Type: new Abstract: We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at https://github.com/AIGNLAI/LAMDA-ORBIT.

16.
medRxiv (Medicine) 2026-06-11

Long-term Penetrance of Disease Variants in Genes Prioritized for Genomic Newborn Screening: Evidence from Adult Biobanks

Importance: Genomic newborn screening (gNBS) is a potential public health intervention, but its positive predictive value (PPV) remains uncertain. Estimating the prevalence and penetrance of pathogenic and likely pathogenic (P/LP) variants in genes prioritized for screening may clarify the long-term PPV and clinical utility of gNBS. Objective: To compare ICD-based ascertainment, electronic medical record (EMR) review, and clinical assessment of genetic disorders in adults with P/LP variants in 54 genes prioritized for gNBS. Design: Two-cohort observational study with EMR review and clinical assessment in the hospital-based cohort. Setting: The U.K. Biobank (UKB) and Mass General Brigham Biobank (MGBB). Participants: 451,877 adults from the UKB and 53,371 from the MGBB, all with exome sequencing data. Exposures: P/LP variants in 54 genes prioritized through expert consensus for gNBS, in genotypes consistent with each gene's inheritance pattern. Main outcomes and measures: The primary outcome was the absolute difference in the proportion of MGBB participants identified as affected by ICD versus EMR ascertainment. Secondary outcomes included findings from clinical assessments of undiagnosed MGBB participants, corrected UKB penetrance estimates, and extrapolation to U.S.. annual birth cohorts and living adults. Results: P/LP variants were identified in 665 UKB participants (0.15%) and 82 MGBB participants (0.15%), approximately 1 in 650. In MGBB, EMR review revealed that 58/82 individuals (70.7%) were undiagnosed, although 25 of 58 (43.1%) had documented symptoms. Disease-associated ICD codes were found in 39.0% (32/82) of participants, whereas EMR review identified symptoms in 59.8% (49/82, McNemar P

17.
arXiv (CS.LG) 2026-06-16

One-Step Generalization Ratio Guided Optimization for Domain Generalization

arXiv:2606.16301v1 Announce Type: new Abstract: Domain Generalization (DG) aims to train models that generalize to unseen target domains but often overfit to domain-specific features, known as undesired correlations. Gradient-based DG methods typically guide gradients in a dominant direction but often inadvertently reinforce spurious correlations. Recent work has employed dropout to regularize overconfident parameters, but has not explicitly adjusted gradient alignment or ensured balanced parameter updates. We propose GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that leverages the One-Step Generalization Ratio (OSGR) to quantify each parameter's contribution to loss reduction and assess gradient alignment. By dynamically equalizing OSGR via a preconditioning factor, GENIE prevents a small subset of parameters from dominating optimization, thereby promoting domain-invariant feature learning. Theoretically, GENIE balances convergence contribution and gradient alignment among parameters, achieving higher OSGR while retaining SGD's convergence rate. Empirically, it outperforms existing optimizers and enhances performance when integrated with various DG and single-DG methods.

18.
arXiv (CS.LG) 2026-06-12

Disparate Impact in Synthetic Data Generation

arXiv:2606.13105v1 Announce Type: new Abstract: We revisit the fairness notion of disparate impact for synthetic data generation (SDG), that assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, that address the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

19.
medRxiv (Medicine) 2026-06-16

Reliability and construct validity of the Technology Device Interference Scale in a sample of children and parents

There is increasing interest in parent-child technoference: the interference with personal interactions caused by technology devices. This study examined the reliability and construct validity of the Technology Device Interference Scale (TDIS) to measure technoference in a sample of Canadian parents and children. Parents (n=883) and children (n=376) were recruited from clinical and community settings and completed the TDIS for their own and family member technoference over three timepoints (T1=2023, T2=2024, T3=2025). TDIS internal consistency, test-retest reliability, and construct validity were assessed using Cronbachs alpha, intraclass correlation coefficient, and confirmatory factor analysis, respectively. The TDIS showed good internal consistency and adequate to good construct validity when used by children to report on their own technoference (all >.70; CFI>.95, TLI>.95, RMSEA.70; CFI>.95, TLI>.90, RMSEA[≤].11). The TDIS had low to acceptable internal consistency and poor model fit for parent report of their own technoference ( range: .63 - .66; CFI

20.
arXiv (CS.LG) 2026-06-12

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

arXiv:2606.13104v1 Announce Type: new Abstract: Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

21.
arXiv (CS.CV) 2026-06-16

Active Reference Acquisition in Few-Shot Font Generation

Few-shot font generation aims to synthesize the remaining glyphs of a font given one or a few reference glyphs while preserving stylistic consistency, thereby supporting font designers in efficiently completing a typeface. Existing methods primarily focus on improving generation quality given a fixed reference set. However, when the current reference glyphs are insufficient to represent the target style, few-shot font generation may fail to produce satisfactory results. In practical scenarios, additional reference glyphs can often be obtained from the designer when necessary. Accordingly, we propose a new framework, Active Reference Acquisition in Few-Shot Font Generation, in which the model sequentially decides which character to acquire next as an additional reference. Furthermore, we propose a reference part-coverage-based acquisition function to efficiently query the designer. Motivated by the observation that font styles are well characterized by local structural parts, we represent each glyph using a histogram of local features and select query characters that maximize the expected part coverage of the reference set. By prioritizing characters that contain parts not yet covered by the current references, the proposed method progressively expands the diversity of visual parts in the reference set. As a result, generation quality is improved with fewer queries. Experiments on the Google Fonts dataset demonstrate that the proposed method achieves higher generation quality than random querying and reference-agnostic baselines. The code is available at https://github.com/matsuo-shinnosuke/ActiveRef-FontGen.

22.
arXiv (CS.CL) 2026-06-17

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

23.
arXiv (CS.CL) 2026-06-15

Succeeding at Scale: Enterprise Retrieval Benchmark Construction and Index-Preserving Query Adaptation for Multi-Tenant Search

Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data." This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices. We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling. We further study and systematically evaluate index-preserving query-only adaptation strategies that fine-tune only the query-encoder while keeping the document indices fixed. Experiments on DevRev-Search, SciFact, and FiQA-2018 show that parameter-efficient fine-tuning of the query encoder delivers a remarkable quality-efficiency trade-off, enabling scalable and practical enterprise multi-tenant retrieval.

24.
arXiv (CS.CL) 2026-06-16

REFLEX: Reflective Evolution from LLM Experience

作者:

Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a train-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.

25.
arXiv (CS.CV) 2026-06-15

SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.