Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-15

Filum Terminale Diameter on Routine Pediatric MRI: A Large-Cohort Clinical Reference in 3,406 Children and the Age-Dependent Meaning of the 2-mm Thickened-Filum Threshold

Background. A filum diameter >2 mm is the conventional MRI threshold for a thickened filum, but it derives from small, mostly adult series showing no age dependence; whether one cutoff suits all of childhood is untested. Objective. To build an age-specific filum-diameter reference on routine pediatric MRI and test, adjusting for image resolution, whether the 2-mm threshold is age-stationary. Materials and methods. In this retrospective study an nnU-Net tracer measured the maximal filum diameter on consecutive lumbosacral MRI; versus manual tracing it showed negligible bias but moderate single-measure agreement. After excluding report-confirmed fatty filum, lipoma, or tethered cord, the proportion >2 mm was analysed within one acquisition protocol and by logistic regression adjusting for voxel size and slice thickness. Results. Of 7,245 examinations, 3,869 (53%) were traceable; untraced ones were younger (median 0.75 vs 2.0 years). The presumed-normal cohort had median diameter 1.48 mm. At matched resolution, 2 mm marked the 94th percentile in infants (5.6% exceeded it) but the 83rd by 3-6 years (17.4%); the age effect persisted after adjusting for voxel size and slice thickness (3-6 years vs infants, adjusted OR 4.7; P < .001). Conclusion. Filum diameter clusters near 1.5 mm, and the fixed 2-mm cutoff flags ~5% of infants but ~17% of preschoolers. Caliber should be judged against an age-specific clinical reference, not one fixed cutoff; a thick filum is not itself a diagnosis of tethered cord.

02.
PLOS Medicine 2026-05-20

Prescribed hormonal contraceptive use trends in the Estonian Biobank: A longitudinal observational study

by Jelisaveta Džigurski, Märt Möls, Kristi Läll, Hannah Currant, Mall Eltermaa, Estonian Biobank Research Team , Reedik Mägi, Lili Milani, Triin Laisk Background Hormonal contraceptives (HCs) are widely used and have well-documented population-level statistics. Previous studies with short follow-ups have focussed on individual HC use and side effects. However, the same aspects over longer periods, HC formulation switching, and the impact of genetic factors on HC side effects remain understudied due to the limited availability of suitable datasets. We investigated whether the Estonian Biobank (EstBB) is suitable for studying genetic risk for HC side effects. Methods and findings This is a longitudinal descriptive study combining prescribed HC purchase data collected from 2004 to 2022 with genetic and health data from 73,071 female EstBB HC users aged 15–55 at the time of purchase. HC usage was defined by the Anatomical Therapeutic Chemical (ATC) codes G02B, G03A, and G03HB01. Methods included calculating age-stratified annual user prevalence, inferring usage periods from purchases, assessing formulation switching, identifying the International Classification of Diseases, Tenth Revision (ICD-10)-based side effect-related diagnoses and thromboembolism risk factors, and assessing carrier status for Factor V Leiden (FVL, rs6025) and prothrombin G20210A (PTM, rs1799963) genetic variants as proof-of-concept. Over 19 years, 20 HC formulations with five administration routes (oral pills, transdermal patches, vaginal rings, subdermal implants, intrauterine devices) were used. In the EstBB, combined HCs were the most commonly used among users aged 15–29, while progestin-only HC use increased with age and over time, comparable to the Estonian population. Overall, 64.2% (n = 46,920) of users switched formulations at least once, with 17.7% (n = 12,929) being rapid switchers. Side effect-related diagnoses were observed in 23.1% (n = 2,982) of rapid switchers, with excessive/irregular menstrual bleeding being the most common. Genetic analysis revealed that 5.3% (n = 3,886) of users carried at least one variant previously associated with increased thrombosis risk (3.5% (n = 2,556) carried FVL only, 1.8% (n = 1,276) PTM only, and 0.07% (n = 54) both). Carriers of thrombosis-associated variants had a significantly higher percentage of thrombosis (6.5%) than non-carriers (4.2%; OR = 1.61, 95% CI [1.40, 1.84], p 

04.
arXiv (CS.AI) 2026-06-11

From Consumption to Reflection: Designing Human-AI Relations for Stable Reasoning

arXiv:2606.11195v1 Announce Type: cross Abstract: Large language models (LLMs) have transformed how humans access information, but not how we reason with it. Their fluency accelerates consumption while bypassing the slow, reflective processes that underpin sound judgment. This paper introduces Relational Reflective Intelligence (RRI), an inference-time governance layer that operationalizes reflection through auditable reasoning loops. RRI operates not inside the model but around it, providing a practical structure for stable, auditable reasoning between humans and LLMs. The core premise is that LLMs inherit cognitive vulnerabilities similar to those that shape human thought: reliance on intuitive shortcuts, confusion between representation and reality, and a preference for coherence over falsification. When humans and models share these tendencies, their errors compound. We refer to this as relational drift, a failure that arises from interaction rather than from the model alone. Addressing this requires a shift from modeling relations between words to structuring relations between model outputs and human reasoning. RRI provides this missing layer through three components: the Rose-Frame, which identifies likely breakdowns in reasoning; the Architect's Pen, which introduces targeted reflection steps at critical moments; and an inference-time workflow that embeds these steps without retraining the model. Together, these elements transform human-AI interaction into a joint reasoning system with explicit checkpoints, conflict surfacing, and an auditable trail of assumptions. Rather than making machines think like humans or forcing humans to reason like machines, RRI creates a structured interaction in which both compensate for each other's limitations. It reframes AI safety as a cognitive architecture problem, where reliable decisions depend on embedding reflection directly into the interaction process.

05.
arXiv (CS.CL) 2026-06-16

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.

06.
arXiv (CS.LG) 2026-06-18

Data-driven sparse identification of governing PDEs via knockoff filters and multi-criteria trade-offs

arXiv:2605.26631v2 Announce Type: replace-cross Abstract: We propose KO-PDE-IDENT, a data-driven framework for identifying parsimonious partial differential equations (PDEs) with false discovery rate (FDR) control. PDE discovery from noisy observations is often hindered by extreme multicollinearity among candidate terms, which causes typical sparse-regression methods to select spurious terms. To address this problem, KO-PDE-IDENT initially mines a support set of potential candidate terms via model-X knockoff filters with finite-sample FDR control, then refines and ranks the surviving PDE alternatives. The framework integrates three components. First, knockoff feature statistics are constructed by coupling $\ell_{0}$-constrained adaptive best-subset selection with SHapley Additive exPlanations (SHAP), yielding an effective and computationally efficient difference statistic. Second, a recursive feature elimination (RFE) procedure removes terms whose marginal contributions are dispensable and assesses statistical necessity through knockoff-perturbed hypothesis testing. Third, the final model selection is formulated as a multi-criteria decision-making (MCDM) problem, where the optimal governing equation is the alternative that best balances a wide range of criteria such as predictive accuracy, model complexity and coefficient uncertainty. We evaluate KO-PDE-IDENT on five canonical PDEs under severe noise corruption. Empirical results show that our framework can exactly recover the true PDE structure, eliminating false discoveries while retaining all true underlying terms, with low coefficient estimation error.

07.
arXiv (quant-ph) 2026-06-16

Weak continuous measurements require more work than strong ones

arXiv:2502.09732v4 Announce Type: replace Abstract: Understanding the energy cost of quantum measurement process and its connection to the measurement performance faces the challenge of modeling the objectification process. The latter, turns the measurement result into an objective fact, available to independent observers, and is responsible for the measurement irreversibility. To address this issue, we propose and analyze a dynamical model of quantum measurement, able to capture nonideal (weak and inefficient) measurements. In this model, the objectification is induced by a contact with a macroscopic reservoir at equilibrium which is responsible for the redundant broadcast of the measurement outcome (producing a Spectrum Broadcast Structure (SBS) state) while inducing decoherence in the pointer basis, in the line of the theory of quantum Darwinism. We analyze the performance of the obtained measurement process by introducing figures of merit to quantify the strength of the measurement and its efficiency. We also derive and a lower bound on the measurement work cost that we can relate to the measurement quality. We take as an illustration the readout of a qubit via its coupling to a harmonic oscillator. We investigate the long sequences of extremely short and weak measurements (a.k.a continuous measurements), to find under which conditions they converge to an ideal (projective) measurement and analyze their work cost. Surprisingly, we find that a sequence converging to projective measurement has a much larger work cost than an equivalent strong measurement obtained from a single intense interaction with the apparatus. We extend this result to a large class of models owing to scaling arguments. Our analysis offers new insights into the trade-offs between measurement strength, energy consumption, and information extraction in quantum measurement protocols.

08.
arXiv (CS.CL) 2026-06-12

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

09.
arXiv (CS.CL) 2026-06-16

The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

Automatic semantic change detection aims to identify how word meanings shift over time, offering insights into both linguistic and societal change. Despite recent progress in computational lexical semantic change (LSC), existing benchmarks and methods struggle to capture bi-directional semantic change, particularly cases where words simultaneously gain and lose senses. This problem is especially challenging for words that have both slang and standard meanings. To address these gaps, we introduce two complementary benchmark datasets. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, sense loss, and stability across three time periods, enabling the study of complex semantic trajectories. The SlangTrack Word Sense Disambiguation (ST-WSD) dataset provides fine-grained, instance-level sense annotations for words combining slang and standard usages, supporting systematic benchmarking of WSD and semantic change detection models. Using these benchmarks, we systematically evaluate models across different methodological families: unsupervised clustering using contextualised embeddings, supervised machine learning, transformer-based models, and state-of-the-art large language models. Among the evaluated systems, the few-shot GPT-4o model achieved the strongest aggregate performance on Exact Sense Match (ESM) and multi-label accuracy; however, Macro-F1 scores near 0.5 across all systems show that rare slang senses remain difficult, which we identify as the central open challenge.

10.
arXiv (CS.CL) 2026-06-17

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LLM response quality relied on expert-authored S&P misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 S&P prompts and categorizes them into nine categories covering a wide range of S&P topics. From the S&P prompts, we sampled 450 and performed a thematic analysis to characterize the S&P questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking S&P prompts, where users ask for recommendations, guidance, or specific S&P information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

11.
arXiv (CS.LG) 2026-06-15

More with LESS – Local Scene Representations for Tactile Imaging

arXiv:2606.14344v1 Announce Type: new Abstract: Tactile imaging seeks to reconstruct the internal structure of soft objects through touch sensing, with applications in medical diagnosis and robotic manipulation. Recent self-supervised learning approaches have shown promising results, but rely on global, unstructured representations and robot-controlled sensing, limiting generalization and practical use. We propose Local Encoder for Spatial Sensing (LESS), an object-centric tactile representation that exploits the local nature of touch. The tactile scene is modeled as a grid of recurrent encoders with local receptive fields, whose states are fused to reconstruct 2D or 3D images of internal structure. This compositional design enables strong generalization: models trained on single-inclusion phantoms accurately image objects with multiple inclusions and varying sizes. The local structure further supports spatial uncertainty estimation. In addition, we enable hand-held tactile imaging via external pose tracking and human-like palpation data, and extend tactile imaging to full 3D reconstruction.

12.
arXiv (CS.CV) 2026-06-15

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

13.
arXiv (CS.LG) 2026-06-15

Recursively Trained Diffusion Models: Limiting Collapse Distribution and Spectral Characterization

arXiv:2606.13796v1 Announce Type: cross Abstract: Recursive training of generative models on their own outputs can lead to model collapse, a compounding drift away from the true data distribution. Existing theoretical works bound finite-round error accumulation in the context of diffusion models, but two questions remain open:~what distribution does the recursion converge to, and how fast? We answer both, isolating a mechanism distinct from imperfect learning: even with perfect score estimation and exact sampling, the early stopping of the reverse diffusion (required for numerical stability) drives a progressive drift away from the data distribution. We prove that this recursion converges geometrically to a unique limiting distribution, which admits a closed-form characterization as an infinite mixture of increasingly Gaussian-smoothed versions of the data distribution. A Hermite spectral decomposition of this limit reveals that recursive training acts as a low-pass filter: higher-order modes, which encode fine non-Gaussian structure, are attenuated much more strongly than coarse modes. This spectral picture motivates annealed truncation schedules that progressively shrink truncation times across retraining rounds; we prove that any schedule converging to $0$ asymptotically eliminates recursive compounding. Finally, we show our idealized characterization is robust: in the presence of discretization and score estimation errors, the learned distribution remains in a Wasserstein-2 ball around the ideal limit, with mode-dependent contraction rates that contract high-order errors faster than low-order ones. We validate the theory on synthetic Gaussian mixtures and CIFAR-10.

14.
arXiv (CS.CL) 2026-06-15

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher's capabilities into student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

15.
arXiv (quant-ph) 2026-06-15

Computational regimes in matrix-product-state-based quantum trajectory simulations

arXiv:2606.13779v1 Announce Type: new Abstract: Efficient simulation of open quantum systems is central to modeling noisy quantum hardware and many-body dynamics. In trajectory-based tensor network methods, cost is often associated with trajectory-level quantities such as entanglement growth or bond dimension. However, the total cost of a fixed-accuracy simulation also depends on statistical sampling, and the interplay between per-trajectory complexity and sampling effort remains poorly understood. Here we introduce a cost-resolved framework for matrix product state (MPS)-based quantum trajectory simulations that decomposes total cost into memory per trajectory, runtime per trajectory, and sampling effort. We show that physically equivalent stochastic unravelings of the same Lindblad dynamics do not necessarily reduce total cost, but instead redistribute cost between trajectory complexity and statistical convergence. This trade-off is quantified by two dimensionless inflation factors: a bond dimension inflation $\alpha$ and a sampling inflation $\kappa$, which together determine the preferred unraveling under hardware-dependent memory and parallelism constraints. We provide a practical protocol for extracting $(\alpha,\kappa)$ from modest pilot simulations and demonstrate it using benchmarks across multiple noise channels. The resulting decision maps show that the computationally favorable unraveling can change with noise strength, time-step resolution, system size, and available parallelism. These results establish unraveling choice as a hardware-aware simulation design problem rather than an intrinsic optimization of trajectory entanglement alone.

16.
arXiv (CS.AI) 2026-06-19

eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization

arXiv:2606.19921v1 Announce Type: new Abstract: This work proposes an element-based Convolutional Neural Network (CNN) to accelerate density-based Topology Optimization (TO), termed eCNNTO. TO generally undergoes a large number of iterations, where finite element analysis is performed in every iteration, leading to the efficiency bottleneck especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO is proposed to build upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, the method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs CNN with residual connections to address this issue. On top of it, a novel training strategy is introduced to further enhance the optimization efficiency, where the training dataset consists of the final stage density histories rather than early ones. This change can also help reduce the required training data size. eCNNTO requires only a small dataset to train and yet it can be generalized to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, as well as non-design domains. In the end, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction of iterations, respectively.

17.
arXiv (CS.CV) 2026-06-12

Language-Guided Abstraction for Visual Reasoning

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methodology is either pure language or vision-only (i.e., VARC). The former depends heavily on LLMs, consuming billions of parameters. The latter often struggles to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured, fitting with the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is taken in the training process and will be discarded during inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art. Ablation studies further confirm the contribution of the two new designs towards the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.

18.
arXiv (CS.AI) 2026-06-16

Intrinsic Computational Functionalism and Simulated Consciousness

arXiv:2606.15348v1 Announce Type: cross Abstract: A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.

19.
arXiv (quant-ph) 2026-06-16

Decoherence-free algebras in quantum dynamics

arXiv:2403.12926v2 Announce Type: replace Abstract: In this Article we analyze the algebraic properties of the asymptotic dynamics of finite-dimensional open quantum systems in the Heisenberg picture. In particular, a natural product (Choi-Effros product) can be defined in the asymptotic regime. Motivated by this structure, we introduce a new space called the Choi-Effros decoherence-free algebra. Interestingly, this space is both a C*-algebra with respect to the composition product, and a B*-algebra with respect to the Choi-Effros product. Moreover, such space admits a direct-sum decomposition revealing a clear relationship with the attractor subspace of the dynamics. In particular, the equality between the attractor subspace and the Choi-Effros decoherence-free algebra is a necessary and sufficient condition for a faithful dynamics. Finally, we show how all the findings do not rely on complete positivity but on the much weaker Schwarz property.

20.
arXiv (CS.LG) 2026-06-15

AGORA: Can Deliberation and Governance Gates Absorb Participation Bias in Transit Planning?

arXiv:2606.13696v1 Announce Type: cross Abstract: Transit network design depends not only on the optimization algorithm but also on who shows up to the public hearing. Current practice often collects one-directional comments from self-selected attendees, leaving participant mix as an uncontrolled source of outcome variation. We present AGORA, a framework that holds the network, demand, and solver fixed while systematically varying meeting composition through stakeholder agents, structured deliberation, and governance gates. Across two standard benchmark networks at different scales, we find that (i) aggregate outcomes vary little across compositions, but on tail risk and fairness disparity, representative sampling still tends to outperform skewed compositions; (ii) without deliberation, composition produces no variation at all, showing that deliberation is the mechanism through which who attends affects outcomes; and (iii) governance gates compress cross-profile variance without shifting the average outcome on Mandl, but low acceptance on Mumford0 shows thresholds require instance-specific calibration. These findings reframe participation bias from an uncontrollable input to a process-design problem: even without guaranteed representative attendance, well-structured deliberation and governance criteria can substantially reduce how much outcomes depend on who is in the room.

21.
arXiv (CS.CV) 2026-06-17

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.

22.
arXiv (CS.LG) 2026-06-19

The Significance of Style Diversity in Annotation-Free Synthetic Data Generation

arXiv:2606.20400v1 Announce Type: new Abstract: Generating high-utility synthetic data for intent classification typically requires human-annotated seed data, which is often unavailable in fast-paced industrial settings. In this paper, we propose a framework for synthetic dialogue generation that works entirely without human-annotated data, relying solely on intent definitions. Our proposed dialogue generation framework utilizes two different types of topic and style attributes to improve data diversity. Also, we propose two novel post-hoc stylization models called Univ and Exam to transform synthetic LLM-generated utterances into more varied, human-like linguistic styles. To enhance data quality, we utilize an LLM-as-a-judge filtering process. Experimental results on both industrial and public datasets demonstrate that the proposed approach achieves up to 93.3% of the performance obtained using human-annotated training data. Crucially, the findings reveal that style diversity is more critical than topic diversity for synthetic data utility, as it prevents models from learning spurious stylistic correlations. Furthermore, the study shows that incorporating style attributes during the generation process is more effective than post-hoc style adaptation.

23.
arXiv (CS.CV) 2026-06-16

Conditional Multi-Event Temporal Grounding in Long-Form Video

Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy "always-empty" models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.

24.
arXiv (CS.CL) 2026-06-12

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types – nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

25.
arXiv (CS.AI) 2026-06-12

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

arXiv:2606.12871v1 Announce Type: new Abstract: Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SAs evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.