Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-17

Know Thy Reasoner: Not All Language Models Explore Alike

arXiv:2604.10827v2 Announce Type: replace Abstract: Compute scaling for LLM reasoning trades off exploring solution approaches (breadth) against refining promising ones (depth), yet why a given trade-off works, and why it often fails to transfer across models, remains unclear. We argue that the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We formalize this with a framework decomposing reasoning uncertainty, deriving when depth-based refinement outperforms parallel sampling, and validate it across three model families at both inference and training. Our central finding is that the diversity regime dictates the strategy: low-diversity aligned models benefit from depth-based refinement with lightweight intrinsic signals, whereas high-diversity base models are often harmed by it, and instead need breadth or stronger signals to compensate.

02.
arXiv (CS.CV) 2026-06-17

Looped World Models

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yield up to 100x parameter efficiency over conventional approaches with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.

03.
arXiv (CS.CV) 2026-06-15

StereoGeo: an end-to-end stereo camera calibration method

In this work, we propose StereoGeo, an end-to-end network-based approach for stereo camera calibration. Our method estimates the focal lengths and gravity directions of the left and right cameras, as well as the relative extrinsic transformation relating them. Existing methods often rely on calibration patterns in structured environments or address only a single camera configuration, being limited to either intrinsic or extrinsic estimation, and depending on a multi-view setups. StereoGeo extends the GeoCalib algorithm, integrating deep neural network feature extraction with a differentiable optimizer. Extensive experiments on real-world benchmarks demonstrate that StereoGeo achieves competitive performance for intrinsic calibration and provides accurate stereo extrinsic estimation, outperforming existing methods that are limited to monocular settings. The dataset used in this work is partially publicly available at https://github.com/meddourimane/StereoGeo-dataset.

04.
arXiv (CS.CL) 2026-06-16

Metacognitive Myopia in Large Language Models

Large Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally embedded stereotypes, influence moral judgments, or amplify positive evaluations of majority groups. We propose metacognitive myopia as a cognitive-ecological framework accounting for a conglomerate of established and emerging LLM biases. Our theoretical framework posits that biased samples in the information environment cause five symptoms of metacognitive myopia in LLMs: integration of invalid embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. Moreover, it posits that the two main components of metacognition, monitoring and control, could account for these five symptoms. Accordingly, we further outline how monitoring and control could be approximated technically, for instance, through hidden parallel reasoning histories that allow interactive LLMs to evaluate risks of myopic inference before generating overt responses. Our theoretical framework provides a novel perspective on flawed human-machine interactions and agentic AI and raises significant ethical concerns regarding the implementation of LLMs in organizational structures and high-stakes decisions.

05.
arXiv (CS.AI) 2026-06-18

scGTN: Deep Siamese Graph Transformer Network for Single-cell RNA Sequencing Clustering

arXiv:2606.18672v1 Announce Type: cross Abstract: Single-cell RNA sequencing (scRNA-seq) serves a pivotal role in characterizing gene expression at the cellular level, enabling the identification of cell types and advancing the understanding of cellular heterogeneity. Despite the significant progress in scRNA-seq data clustering, we argue that current methods always ignore the sparsity and noise, as well as the complex intercellular structural information inherent in scRNA-seq data. Toward this end, in this paper, we propose a novel single-cell RNA-seq clustering framework via deep Siamese Graph Transformer Network (termed scGTN), which explicitly integrates gene expression profile and intercellular structural dependencies for cell clustering. In particular, we formulate scRNA-seq data as a graph and construct two augmented graph views that serve as dual views to capture complementary intercellular information. Then, a Siamese graph transformer network is employed to explicitly incorporate shortest-path information and node-wise distances for capturing richer structural relationships between cells. Finally, we employ an optimal transport strategy to guide the cell clustering in a self-supervised manner. Extensive experiments on multiple benchmark scRNA-seq datasets demonstrate that our scGTN consistently outperforms existing methods. Our code is available at https://github.com/W-RMSL/scGTN.

06.
arXiv (CS.CL) 2026-06-16

HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.

07.
arXiv (CS.LG) 2026-06-19

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

arXiv:2606.19993v1 Announce Type: new Abstract: We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix's low-rank approximation with a backward-signal influence metric. Starting from the activation-aware optimum of SVD-LLM(W), AIR runs a single closed-form alternating least squares (ALS) sweep that integrates influence element-wise under a monotone-descent guarantee. AIR is layer-local and composes orthogonally with end-to-end methods: alone it exceeds ACIP, and AIR+LoRA outperforms it further. AIR improves perplexity over SVD-LLM(W) by >18% at

08.
arXiv (CS.CL) 2026-06-16

FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to rely on reputation shortcuts. To address this gap, we introduce FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. FraudSMSWalker contains 699 bilingual chains, including 332 fraudulent and 367 benign cases, across ten service scenarios. The model-visible input consists of the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark further includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible under the service context but also appear in scam flows. We evaluate nine web agents under masked browser-agent protocols and conduct URL-visibility ablations. The results show that current agents can detect suspicious cues, but struggle to preserve benign recall and often produce positive predictions that are weakly supported by the observed evidence. These findings position FraudSMSWalker as a benchmark for measuring whether web agents can make fraud judgments that remain both accurate and evidence-grounded when direct reputation shortcuts are suppressed. The associated code and dataset are accessible at the \href{https://anonymous.4open.science/w/FraudMessageWalker-Bench}{anonymous link}.

09.
arXiv (CS.LG) 2026-06-11

Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

arXiv:2606.09289v2 Announce Type: replace Abstract: Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.

10.
arXiv (CS.AI) 2026-06-19

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

arXiv:2604.08552v2 Announce Type: replace-cross Abstract: Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.

11.
arXiv (CS.CV) 2026-06-17

Similarity-based representation factorization for revealing interpretable dimensions in representational data

The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.

12.
arXiv (CS.LG) 2026-06-18

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule may signal life on one planet and abiotic chemistry on another. Global climate models (GCMs) provide this understanding, but individual runs can require up to millions of core-hours and substantial domain expert time. Machine-learning emulators could remove this bottleneck, but progress has been limited by the absence of a curated, multi-model exoclimate dataset. We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves. We evaluate seven baselines spanning simple methods, deep learning, and Gaussian processes. GP-based methods perform best, suggesting that ThousandWorlds exposes a regime where off-the-shelf deep learning does not yet succeed. Data: https://doi.org/10.57967/hf/8695. Code: https://github.com/edstevenson/ThousandWorlds.

13.
arXiv (CS.LG) 2026-06-16

On the Benefits of Weight Normalization for Overparameterized Matrix Sensing

arXiv:2510.01175v2 Announce Type: replace Abstract: While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.

14.
bioRxiv (Bioinfo) 2026-06-18

Predicting optimal growth temperatures of bacteria using learned structural information from a single protein

Temperature is a fundamental determinant of bacterial physiology and ecology. Optimal growth temperature (OGT) is highly variable across species, contributing to differences in where and when species are most likely to thrive. Although the OGTs for most bacteria remain unknown, the increasing availability of genomes from uncultivated and cultivated taxa has made it advantageous to build genomic, cultivation-independent models to infer OGT. However, pre-existing genomic models often lack the generalizability and mechanistic grounding required for robust inferences of OGT. We propose a novel framework for predicting bacterial OGT which uses learned protein structural signatures of thermal adaptation. We hypothesize that biophysical tradeoffs which dictate enzymatic functions across variable temperatures provide a more robust empirical basis for OGT prediction than broad genomic features. Our OGT-predicting model, ROSEATE, is based on a single gene, adenylate kinase (ADK), that encodes for a ubiquitous enzyme essential for energy homeostasis. ROSEATE uses high-dimensional latent space encoding via MSA Transformer, a protein language model which embeds ADKs in a manner which preserves biophysical information about embedded proteins. We show that the accuracy of the ROSEATE model is on par with other genome-based models, has a high degree of phylogenetic generalizability, and the ESM embeddings effectively capture key temperature-adaptive enzyme characteristics derived from AlphaFold structures. Because ROSEATE is based on analyses of a single ubiquitous protein, it can be used with metagenomic data to infer the community-level variation in bacterial OGTs. We demonstrate this feature of ROSEATE by reconstructing ADK sequences from over 500 environmental and host-associated metagenomes, successfully distinguishing community-wide thermal preferences across diverse habitats, from polar oceans to mammalian guts. By transitioning from genomic proxies to informationally dense protein structural features, this work provides an efficient, interpretable tool for predicting bacterial OGTs across taxa and whole communities.

15.
arXiv (CS.CV) 2026-06-18

Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In the paper, we explore the benefits of discrete wavelet transformation and propose a spiking pyramid wavelet-based model (SPWM) for high-efficient and low-energy target. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications of resource-limited devices.

16.
arXiv (quant-ph) 2026-06-17

Unclonable Encryption in the Haar Random Oracle Model

arXiv:2603.11437v2 Announce Type: replace-cross Abstract: We construct unclonable encryption (UE) in the Haar random oracle model, where all parties have query access to $U,U^\dagger,U^*,U^T$ for a Haar random unitary $U$. Our scheme satisfies the standard notion of unclonable indistinguishability security, supports reuse of the secret key, and can encrypt arbitrary-length messages. That is, we give the first evidence that (reusable) UE, which requires computational assumptions, exists in "microcrypt", a world where one-way functions may not exist. As one of our central technical contributions, we build on the recently introduced path recording framework to prove a natural ``unitary reprogramming lemma'', which may be of independent interest.

17.
arXiv (quant-ph) 2026-06-16

Efficient Implementation of a Single-Qutrit Gate Set via Coherent Control

arXiv:2507.06860v2 Announce Type: replace Abstract: Qutrits offer the potential for enhanced quantum computation by exploiting an enlarged Hilbert space. However, the synthesis of high-fidelity and fast qutrit gates, particularly for single qutrits, remains an ongoing challenge, as it involves overcoming intrinsic constraints in quantum platforms. Here, we develop a novel framework for the efficient implementation of a single-qutrit gate set via coherent control, leveraging SU(3) dynamics while obviating platform-specific constraints such as those arising from the selection rule. As a proof-of-principle demonstration, we realize 35-ns qutrit Hadamard and X gates using a superconducting transmon, achieving an average fidelity of 99.5\%, as verified by randomized benchmarking. We further demonstrate two paradigmatic quantum circuits, which can be naturally extended to scalable qudit algorithms for phase estimation and parity check. In addition, we propose an SU(3)-based decomposition strategy for an arbitrary single-qutrit gate and numerically demonstrate its substantial efficiency improvement over conventional SU(2)-based protocols. By addressing the challenge of efficiently implementing single-qutrit gates, our protocol paves the way for realizing high-performance qutrit processors in diverse quantum platforms.

18.
arXiv (CS.CL) 2026-06-16

Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning

Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Evaluated on diverse multimodal reasoning benchmarks spanning mathematical and general domains, DUPL achieves solid gains. It improves Qwen2.5-VL accuracy by up to $12.3%$ (3B) and $7.9%$ (7B), and Qwen3-VL-Instruct by up to $10.7%$ (4B) and $12.4%$ (8B), consistently outperforming GRPO, while seamlessly generalizing to alternative algorithms (DAPO, $+6.5%$ avg) and architectures (LLaVA-OneVision-1.5, $+4.7%$ avg). These results demonstrate that DUPL is an effective and generalizable approach for multimodal RLVR.

19.
arXiv (CS.CL) 2026-06-11

Scenario-based Probing and Steering Cultural Values in Large Language Models–Extended Version

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart–Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.

20.
arXiv (CS.AI) 2026-06-16

Multi-Sensor Fusion for UAV Classification Based on Feature Maps of Image and Radar Data

arXiv:2410.16089v2 Announce Type: replace Abstract: The unique cost, flexibility, speed, and efficiency of modern UAVs make them an attractive choice in many applications in contemporary society. This, however, causes an ever-increasing number of reported malicious or accidental incidents, rendering the need for the development of UAV detection and classification mechanisms essential. We propose a methodology for developing a system that fuses already processed multi-sensor data into a new Deep Neural Network to increase its classification accuracy towards UAV detection. The DNN model fuses high-level features extracted from individual object detection and classification models associated with thermal, optronic, and radar data. Additionally, emphasis is given to the model's Convolutional Neural Network (CNN) based architecture that combines the features of the three sensor modalities by stacking the extracted image features of the thermal and optronic sensor achieving higher classification accuracy than each sensor alone.

21.
arXiv (CS.CL) 2026-06-17

Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models

Online user generated content games (UGCGs) are increasingly popular among children and adolescents for social interaction and more creative online entertainment. However, they pose a heightened risk of exposure to explicit content, raising growing concerns for the online safety of children and adolescents. Despite these concerns, few studies have addressed the issue of illicit image-based promotions of unsafe UGCGs on social media, which can inadvertently attract young users. This challenge arises from the difficulty of obtaining comprehensive training data for UGCG images and the unique nature of these images, which differ from traditional unsafe content. In this work, we take the first step towards studying the threat of illicit promotions of unsafe UGCGs. We collect a real-world dataset comprising 2,924 images that display diverse sexually explicit and violent content used to promote UGCGs by their game creators. Our in-depth studies reveal a new understanding of this problem and the urgent need for automatically flagging illicit UGCG promotions. We additionally create a cutting-edge system, UGCG-Guard, designed to aid social media platforms in effectively identifying images used for illicit UGCG promotions. This system leverages recently introduced large vision-language models (VLMs) and employs a novel conditional prompting strategy for zero-shot domain adaptation, along with chain-of-thought (CoT) reasoning for contextual identification. UGCG-Guard achieves outstanding results, with an accuracy rate of 94% in detecting these images used for the illicit promotion of such games in real-world scenarios.

22.
arXiv (CS.AI) 2026-06-16

Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment

arXiv:2606.14948v1 Announce Type: cross Abstract: LLMs have substantially improved software engineering yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. We propose an agentic judging pipeline using a strong LLM as a scalable proxy for expert architectural evaluation, comprising two judges: the Architecture Complexity Judge (ACJ), which estimates codebase-specific architectural understanding a task demands, and the Architecture Quality Judge (AQJ), which evaluates patch conformance to repository-specific architectural conventions via source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B on 3,360 curated instances achieves resolved rates of up to 27.2% on SWE-bench Verified - up to 540% over the base model and 256% over unfiltered fine-tuning. Meanwhile, the trained models achieve strong cross-language generalization and consistent improvements in architectural patch quality.

23.
arXiv (CS.LG) 2026-06-12

The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

arXiv:2606.13637v1 Announce Type: new Abstract: Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

24.
arXiv (CS.CV) 2026-06-11

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{cal}{=}200$ queries, $N{=}5$ generations). Our ridge-based score beats on AUROC with 5-20 points gain, while demonstrating tempered degradation under calibration-label scarcity.

25.
arXiv (CS.CL) 2026-06-16

A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Large language models (LLMs) are increasingly being used in a zero-shot (generative) fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we use a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting model's assessment accuracy, we systematically varied (i) contextual knowledge prompted to the models like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs few shot, amount of reasoning effort, model sizes, structured subscales vs direct scalar prediction, output rescaling and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative, even exceeding human raters agreement with self-reported scores; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters while closed-weight (gpt-o3-mini, gpt-5) alternatives improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Beyond agreement with self-reports, LLMs' estimates discriminated PTSD severity from depression, anxiety, and alcohol use, and prospectively predicted future mental healthcare expenditure. Together, these results suggest that contextual knowledge and modeling strategies meaningfully affect accuracy and clinical utility of LLM-based assessments of PTSD severity.