Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-11

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages – embedding, reranking, and LLM generation – on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression – a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

02.
arXiv (CS.LG) 2026-06-17

Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

arXiv:2603.20775v2 Announce Type: replace Abstract: In personalized marketing, uplift models estimate the incremental effect of an intervention by modeling how customer behavior would change under alternative treatments using counterfactual analysis. However, real-world marketing data often exhibit various biases, such as selection bias, spillover effects, measurement error, and unobserved confounding. These biases can adversely affect both the accuracy of uplift estimation and the validity of evaluation metrics. Despite the importance of bias-aware assessment, there remains a lack of systematic studies evaluating how different models and metrics perform under such biased conditions. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets inherently lack counterfactual ground truth. This limitation renders the direct validation of evaluation metrics infeasible and prevents the precise quantification of biases. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking. This approach effectively bridges the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) the stability of evaluation metrics is linked to their mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and evaluation metrics under real-world data imperfections.

03.
arXiv (CS.LG) 2026-06-18

Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks

arXiv:2602.02056v3 Announce Type: replace-cross Abstract: Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.

04.
arXiv (CS.CL) 2026-06-18

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

05.
arXiv (CS.CL) 2026-06-18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

06.
arXiv (CS.LG) 2026-06-19

On the Oracle Complexity of Interpolation-Based Gradient Descent

arXiv:2606.19878v1 Announce Type: new Abstract: Recent work on first-order optimizers for empirical risk minimization (ERM) has suggested that smoothness of ERM loss functions in the training data, rather than in the optimization parameters, can be leveraged to improve the oracle complexity of gradient descent (GD) methods. In this paper, we propose an inexact gradient method, piecewise polynomial interpolation-based gradient descent (PPI-GD), which approximates the full gradient in each iteration by querying the first-order oracle at equidistant points in the data domain to construct polynomial interpolants of the resulting gradient samples over appropriately sized patches of the data domain. We analyze the oracle complexity of PPI-GD for strongly convex and non-convex loss functions when the data space dimension is bounded by a polylogarithmic function of the number of training samples, and find it to outperform several GD variants in key regimes when the loss function is sufficiently smooth. Furthermore, our analysis extends several techniques from the error analysis of bicubic spline interpolants to the setting of $d$-variate tensor product polynomial interpolants which may be of independent interest in interpolation analysis.

07.
arXiv (CS.LG) 2026-06-18

A Streaming Sparse Cholesky Method for Derivative-Informed Gaussian Process Surrogates Within Digital Twin Applications

arXiv:2511.00366v2 Announce Type: replace-cross Abstract: Digital twins are developed to model the behavior of a specific physical asset (or twin), and they can consist of high-fidelity physics-based models or surrogates. A highly accurate surrogate is often preferred over multi-physics models as they enable forecasting the physical twin future state in real-time. To adapt to a specific physical twin, the digital twin model must be updated using in-service data from that physical twin. In this paper, we combine and extend several previous surrogate-related advancements with the goal of demonstrating an end-to-end digital twin (DT) solution for predicting performance of an aircraft structure (the physical asset). To this end, we extend Gaussian process (GP) models to include derivative data, for improved accuracy, with dynamic updating to ingest physical twin data during service. Including derivative data, however, comes at a prohibitive cost of increased covariance matrix dimension. We circumvent this issue through our modified dynamic sparse Cholesky linear system solver. Numerical experiments demonstrate that the prediction accuracy of the derivative-enhanced sparse Cholesky GP method produces improved models upon dynamic data additions. Lastly, we demonstrate the developed algorithm within a DT framework to model fatigue crack growth in an aerospace vehicle, thereby exhibiting through our assembled engineered system how digital twin technologies can be combined in practice.

08.
arXiv (CS.AI) 2026-06-12

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

arXiv:2606.12581v1 Announce Type: cross Abstract: Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and accounts for graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction is strongly dependent on both the network type (single-layer vs. multirelational) and the downstream task ($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.

09.
arXiv (CS.CL) 2026-06-11

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

A social highlighter's most useful signal – which passages a crowd of readers marks – exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies – it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.

10.
arXiv (CS.CV) 2026-06-11

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\mathbf{\beta}$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\mathbf{\beta}=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6 %, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

11.
arXiv (quant-ph) 2026-06-11

An Introduction to the Foundations and Interpretations of Quantum Mechanics

arXiv:2603.09818v2 Announce Type: replace Abstract: This article surveys a selection of key conceptual and interpretational developments in quantum mechanics, tracing the theory from its foundational postulates to contemporary discussions of measurement, nonlocality, and the emergence of classicality. Beginning with the structure of Hilbert space and the postulates governing state evolution and measurement, the epistemic stance of the Copenhagen interpretation and its modern reformulations are examined. The Einstein-Podolsky-Rosen argument, Bell's theorem, and Hardy's paradox are then discussed as probes of locality and realism, alongside the deterministic but explicitly nonlocal de Broglie-Bohm theory. The measurement problem and the implications of contextuality are analyzed in relation to objective collapse models, which introduce new physical dynamics to account for definite outcomes. Finally, the role of decoherence in the suppression of interference and the emergence of classical behavior is explored, together with the interpretational frameworks of many-worlds and consistent histories. This material aims to provide a coherent introductory overview of how several of the most prominent interpretations address the central concern of what quantum mechanics tells us about the nature of physical reality.

12.
arXiv (CS.CV) 2026-06-16

Selective Synergistic Learning for Video Object-Centric Learning

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.

13.
arXiv (CS.LG) 2026-06-18

Wasserstein Policy Learning for Distributional Outcomes

arXiv:2606.19117v1 Announce Type: cross Abstract: Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare defined as the mean of scalar-valued potential outcomes. In this paper, we study offline policy learning with distribution-valued outcomes, where each potential outcome is a probability measure on $\mathbb{R}$ and the reward is defined through a utility functional applied to the Wasserstein barycenter of induced outcome distributions. We establish statistical guarantees for the policy learning framework based on both Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators. By handling the challenging uniform deviation over the product of the combinatorial policy class and the infinite-dimensional quantile domain, we prove that the finite-sample regret has leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N-dim}(\Pi)/N})$. In the one-dimensional Wasserstein setting and under the stated regularity conditions, the leading regret rate is still governed by the policy-class complexity. Moreover, we provide a minimax lower bound establishing the sharpness of the leading dependence on $N$ and $\mathrm{N-dim}(\Pi)$.

14.
arXiv (CS.CL) 2026-06-18

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

15.
arXiv (CS.LG) 2026-06-15

Closed-loop discovery of out-of-distribution processing protocols by evolutionary search and uncertainty-aware learning

arXiv:2606.13859v1 Announce Type: cross Abstract: Many materials and chemical systems exhibit history-dependent responses, where functional outcomes are governed not only by final-state variables but by the time-dependent sequence of fields, temperatures, or chemical potentials applied during operation. Discovering new processing protocols is therefore a high-dimensional search problem in which the control variable is an entire waveform or sample history, and conventional strategies either remain confined to conservative interpolative families or become prohibitively measurement intensive. Here, a closed-loop workflow is introduced that couples evolutionary search over a compact waveform representation with uncertainty-aware deep kernel learning to generate, rank, and experimentally validate candidate protocols. Applied to ferroelectric thin films, with the scanning-probe tip-bias waveform as the protocol and the nonlinear electromechanical response as the reward, the workflow discovers waveform families that enhance nonlinearity by de-aging the film. Spatially resolved before/after measurements show that the best-performing waveforms selectively activate pre-existing, weakly pinned domain-wall segments, whereas the worst drive long-range irreversible switching. This framework reframes protocol tuning as out-of-distribution discovery, generalizable to synthesis and annealing trajectories, battery formation protocols, and other high-dimensional control problems.

16.
arXiv (CS.CV) 2026-06-12

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.

18.
arXiv (CS.LG) 2026-06-15

PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion

arXiv:2606.14510v1 Announce Type: new Abstract: Macrocyclic peptides are promising therapeutic candidates for intracellular targets, but their design requires simultaneous control over non-natural monomer chemistry, ring topology, membrane permeability, and target binding. Existing SMILES- or HELM-string generative models either operate in long atom-level sequence spaces or treat monomers as symbolic tokens with limited chemical grounding. We introduce PepALD, an Autoregressive Latent Diffusion (ALD) foundation model for de novo macrocyclic peptide generation. The model represents HELM monomers with structured chemical embeddings, generates each residue through context-conditioned diffusion in chemically informed latent space, predicts R-group-aware ring closures during autoregressive generation, and aligns the denoiser to affinity rewards using winner-protected diffusion-adapted preference optimization. In silico experiments demonstrate PepALD's generation quality and reward-optimization performance against representative peptide generation baselines.

19.
arXiv (CS.CV) 2026-06-15

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research done to improve the depth network, efforts on the pose remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the estimated scene scales by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth .

20.
arXiv (CS.CV) 2026-06-16

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

21.
arXiv (CS.CL) 2026-06-15

Detecting undisclosed LLM-generated content in parliamentary texts

In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

22.
arXiv (CS.CV) 2026-06-11

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes the accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact of an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enable more reliable reasoning with vision-language models than direct prompting alone.

23.
arXiv (CS.AI) 2026-06-18

Code-Augur: Agentic Vulnerability Detection via Specification Inference

arXiv:2606.18619v1 Announce Type: cross Abstract: The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.

24.
arXiv (CS.AI) 2026-06-18

Rescaling MLM-Head for Neural Sparse Retrieval

arXiv:2606.18811v1 Announce Type: cross Abstract: Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse under standard SPLADE training recipes. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

25.
arXiv (CS.LG) 2026-06-16

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

arXiv:2606.15531v1 Announce Type: new Abstract: Fine-tuning aligned language models on benign tasks (e.g. math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment resides in model weights, they do not by provide a general formal framework for deriving guarantees about when fine-tuning degrades it – leaving the field without principled tools for predicting or preventing alignment collapse. We develop a local geometric framework through geometric analysis of parameter-space trajectories and apply it to understand the fragility of alignment in fine-tuning. While first-order analysis suggests orthogonal updates are safe, we prove this is illusory: the curvature of the fine-tuning loss induces second-order acceleration that can induce second-order drift into alignment-sensitive regions. We formalize a construct of our framework as the Alignment Instability Condition (AIC), three geometric properties that, when present, are sufficient to guarantee degradation. Our main result proves quartic onset of alignment degradation along gradient-flow trajectories, determined by how sharply alignment depends on specific parameters and how strongly tasks couple to these parameters. These findings yield formal sufficient conditions under which static first-order protection can fail under gradient descent. We further empirically validate the framework's foundations, showing that the Fisher Information Matrix provides a proxy for the degree of safety degradation across diverse fine-tuning.