Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-16

Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches impose significant per-query latency and infrastructure cost. Very little research has asked whether CPU-based classifiers, such as support vector machines and gradient-boosted trees trained on TF-IDF features, can match their accuracy across the conditions that production deployments encounter. We evaluate five CPU classifier families, Mamba-130M as an SSM-based GPU classifier, and transformer-based GPU models (DeBERTa-v3 and Gemma-2B with LoRA) across nine jailbreak sources and three regimes: in-distribution (D1), out-of-distribution (D2), and adversarially obfuscated (D3). On D1, the best CPU classifier matches the best transformer GPU model at roughly one-fifth the deployment cost. On D2, CPU classifiers fail via confident miscalibration, producing high-confidence false negatives that bypass escalation entirely. On D3, CPU classifiers outperform transformer GPU models by more than 26 percentage points in F1. Based on these complementary failure modes, we design GuardChain, a three-stage safety pipeline (Regex -> CPU -> GPU) that routes each prompt to the cheapest stage capable of a confident decision. The CPU stage alone resolves 80\% of in-distribution prompts at near-peak accuracy, and the GPU stage recovers the out-of-distribution failures. For practitioners deploying LLM safety at scale, this work provides evidence that GPU-class infrastructure is unnecessary for the majority of traffic.

02.
arXiv (CS.LG) 2026-06-11

CaReTS: A Multi-Task Framework Unifying Classification and Regression for Time Series Forecasting

arXiv:2511.09789v2 Announce Type: replace Abstract: Recent advances in deep forecasting models have achieved remarkable performance, yet most approaches still struggle to provide both accurate predictions and interpretable insights into temporal dynamics. This paper proposes CaReTS, a novel multi-task learning framework that combines classification and regression tasks for multi-step time series forecasting problems. The framework adopts a dual-stream architecture, where a classification branch learns the stepwise trend into the future, while a regression branch estimates the corresponding deviations from the latest observation of the target variable. The dual-stream design provides more interpretable predictions by disentangling macro-level trends from micro-level deviations in the target variable. To enable effective learning in output prediction, deviation estimation, and trend classification, we design a multi-task loss with uncertainty-aware weighting to adaptively balance the contribution of each task. Furthermore, four variants (CaReTS1–4) are instantiated under this framework to incorporate mainstream temporal modelling encoders, including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and Transformers. Experiments on real-world datasets demonstrate that CaReTS outperforms state-of-the-art (SOTA) algorithms in forecasting accuracy, while achieving higher trend classification performance.

03.
arXiv (CS.AI) 2026-06-19

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

arXiv:2601.21542v3 Announce Type: replace-cross Abstract: Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-steps generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision ``anchors'' and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to 100+ NFEs Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.

04.
PLOS Medicine 2026-06-18

Association between initial benzodiazepine prescribing patterns and time to benzodiazepine discontinuation: A population-based retrospective cohort study

by Nikki Bozinoff, Tanya S. Hauck, Robert A. Kleinman, Matthew E. Sloan, Beth A. Sproule, Simone N. Vigod, Jennifer Wyman, Priscila Pequeno, Tara Gomes Background Long-term benzodiazepine use has been associated with increased risk of morbidity and mortality. Preventing long-term use through safer prescribing practices has received little attention to date. We sought to better understand associations between initial prescription characteristics and duration of benzodiazepine use. Methods and findings This was a retrospective population-based cohort study of 1,820,808 adults in Ontario with incident benzodiazepine prescriptions between January 1, 2013 and December 31, 2020, with follow-up to December 31, 2021. The primary exposure was duration of the index prescription (≤7 days—referent group, 8–14 days, 15–30 days, or >30 days). Secondary exposures were: (a) duration of action of index benzodiazepine(s) prescription (short-acting, long-acting or both); (b) number of benzodiazepine dispensed on index (1 or 2+); and (c) mean daily dose of the index prescription in Diazepam Milligram Equivalents (DMEs). The primary outcome was time to benzodiazepine discontinuation in days. Multivariable models were adjusted for age, sex, anxiety, insomnia, and substance use disorders as well as other important comorbidities and socio-demographic characteristics. The median age at index was 53 years (Interquartile Range (IQR) 38–67), and 62.6% were women. The median time to discontinuation in women was 16 days (IQR: 6–29) while the median time to discontinuation in men was 19 days (IQR: 6–29). Lorazepam was the most commonly prescribed benzodiazepine on index (63.9%), followed by clonazepam (17.3%) and diazepam (5.8%). In multivariable Cox Proportional Hazards Models, longer index prescriptions were associated with a lower likelihood of benzodiazepine discontinuation (adjusted Hazard Ratio (aHR) 0.54 (95% Confidence Interval (CI) [0.54,0.54]) for 8–14 days; aHR 0.26 (95% CI [0.25,0.26] for 15–30 days and aHR 0.14 (95% CI [0.14,0.14]) for >30 days, compared to ≤7 days, respectively). Being prescribed two or more benzodiazepines versus 1 was also associated with a reduced likelihood of discontinuation (aHR 0.59 (95% CI [0.57,0.61])), as was being prescribed long-acting benzodiazepines (aHR 0.80 (95% CI [0.80,0.80])) or a combination of short and long acting benzodiazepine (aHR 0.84 (95% CI [0.80,0.88])) versus short-acting benzodiazepines alone. Mean daily doses of >5 to ≤10 DME and >10 to ≤20 DME were associated with an increased likelihood of discontinuation (aHR 1.03 (95% CI [1.03,1.03]); aHR: 1.03 (95% CI [1.03,1.04])), whereas doses >20 DME were associated with a reduced likelihood of discontinuation (aHR 0.98 (95% CI [0.97,0.98])) compared with ≤5 DME. Findings may be subject to bias from unmeasured confounding. Conclusion This large population-based cohort study found that prescribing shorter courses of benzodiazepines, use of a single benzodiazepine, use of a short-acting agent, were associated with reduced likelihood of long-term benzodiazepine use. Findings suggest that simple changes to prescribing practices could reduce prolonged benzodiazepine use and the morbidity and mortality associated with long-term use of these medications.

05.
arXiv (quant-ph) 2026-06-16

Hardy and Cabello Arguments in Spatial and Temporal Frauchiger-Renner Scenarios

arXiv:2606.15467v1 Announce Type: new Abstract: We investigate Hardy- and Cabello-type logical structures within spatial and temporal extensions of the Frauchiger–Renner (FR) framework, embedding these constructions directly into the FR multi-observer architecture. In the spatial multi-observer scenario, both Hardy and Cabello contradictions arise, with the Cabello construction yielding the stronger violation,$\(\Delta_Cabello^{\max}=0.1078\)$, which exceeds the maximal Hardy probability $\(P_{H}^{\max}=\frac{5\sqrt{5}-11}{2}\approx 0.09017\)$. We then develop a sequential temporal FR protocol based on coherent multi-observer measurements performed on a single spin-$\tfrac12$ system. In this temporal setting, the Hardy contradiction disappears identically due to dynamical constraints imposed by sequential state updates, whereas a finite Cabello-type violation survives, \(\Delta_Cabello^{\max}\approx 0.0674\). Our results establish a fundamental structural distinction between spatial entanglement and temporal multi-observer correlations in FR-type logical scenarios, and demonstrate that certain observer-independent description failures persist even without spacelike separation.

06.
arXiv (CS.CV) 2026-06-16

Simulation-Based Multi-Fillet Evaluation of Woody Breast Poultry Fillets

Woody breast (WB) is a myopathy in modern broiler chickens that causes the breast muscle to become unusually stiff and fibrous, leading to decreased meat quality and significant economic losses. State-of-the-art automated WB detection relies on a side-view imaging system to analyze the bending behavior of a single fillet as it falls off a conveyor belt. While highly accurate, this approach is constrained by its single-fillet field of view, creating throughput bottlenecks on commercial processing lines. In this paper, we address this limitation via a novel multi-fillet detection architecture utilizing a top-down camera configuration. To validate our approach, we first develop a high-fidelity digital twin of an industrial conveyor system. Next, we synthesize a diverse dataset of 3D fillet meshes and model their viscoelastic bending dynamics using a physics-based simulation engine. Lastly, a continuous 2D shape deformation score is extracted from the top-down perspective as the simulated fillets traverse the roller precipice. Experimental results demonstrate that the top-down shape score effectively captures the contour changes of the fillets as it bends, providing a robust and scalable alternative to a side-view imaging system for simultaneous multi-fillet WB evaluation.

07.
arXiv (CS.CV) 2026-06-15

S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning

Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (\model), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, \model leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that \model successfully extracts domain-specific concepts where standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective – rather than relying on static generation and disjoint filtering – we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggest that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.

08.
arXiv (CS.AI) 2026-06-12

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

arXiv:2605.02249v2 Announce Type: replace Abstract: We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

09.
arXiv (CS.CV) 2026-06-16

OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

X-ray contraband detection is critical for security in large-scale logistics and transportation, yet conventional detectors struggle to adapt to emerging contraband types and lack fundamental visual understanding. Vision-language models (VLMs) offer strong generalization but are hindered by the scarcity of high-quality X-ray image-caption data. To bridge this critical gap, we present MMXray, a meticulously curated benchmark of 52,124 image-caption pairs spanning 28 fine-grained classes of X-ray contraband. To enrich MMXray with realistic occlusion patterns, we further introduce CleanDET, a dedicated synthesis dataset containing clean foreground contraband images from 28 categories and background images with diverse density levels, together with AnyContraSyn, a controllable synthesis method designed to operate on CleanDET. We also develop OnePipe, an extensible pipeline for systematic data curation. Built on MMXray, we propose OneFocus, a unified VLM that supports four core tasks: visual question answering, contraband localization, classification, and image understanding. OneFocus achieves state-of-the-art performance in X-ray contraband understanding and demonstrates robust cross-domain generalization, establishing a strong vision-language baseline for security screening.

10.
arXiv (CS.LG) 2026-06-11

Learning Patterns and Abstractions from Perceptual Sequences

作者:

arXiv:2503.10973v2 Announce Type: replace Abstract: Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts – a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation – letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.

11.
arXiv (CS.CV) 2026-06-11

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at https://github.com/MCG-NJU/SpikeTAD.

12.
arXiv (CS.CL) 2026-06-18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

13.
arXiv (CS.AI) 2026-06-12

Hellinger Multimodal Variational Autoencoders

arXiv:2601.06572v4 Announce Type: replace-cross Abstract: Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha-divergence$ family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.

14.
arXiv (CS.LG) 2026-06-15

Attention-Based Estimation of the Individual Treatment Benefit Probability under Dose Variation

arXiv:2606.13821v1 Announce Type: new Abstract: Estimating the probability that a treatment outperforms a control for an individual patient, called the Individual Probability of Treatment Benefit (IPTB), offers a clinically intuitive alternative to population-average metrics. However, existing methods for IPTB estimation are largely confined to binary treatment settings, despite the prevalence of dose-varying interventions in clinical practice. We propose a general framework for IPTB estimation with ordinal outcomes under discrete dose assignments, called Dose-AIPTB (Dose Attention-based IPTB). Our approach recasts the problem as binary classification over the unobserved sign of the individual treatment effect, constructing pseudo-labels from covariate-similar pairwise comparisons and aggregating them via attention mechanisms or Nadaraya-Watson kernel regression. This formulation naturally accommodates multiple discrete dose levels, extending beyond the binary treatment paradigm. Through numerical experiments on real-world and synthetic data under covariate shift, varying sample sizes, and heterogeneous outcomes, we demonstrate that attention-based aggregation consistently outperforms kernel alternatives. The framework provides a foundation for personalized dose selection grounded in individual-level benefit probabilities. Codes implementing the model are publicly available at https://github.com/NTAILab/AIPTBDose.

15.
arXiv (CS.LG) 2026-06-17

RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports

arXiv:2606.17062v1 Announce Type: cross Abstract: Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as "effusion" and "no effusion" receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at https://github.com/jdh-algo/RadSEM.

16.
arXiv (CS.AI) 2026-06-16

VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

arXiv:2606.15096v1 Announce Type: new Abstract: The Riemann Hypothesis remains one of the central unsolved problems in mathematics. Rather than claiming proof, we investigate whether a verifiable AI-assisted reasoning system can produce reliable, formally checked partial progress while explicitly identifying the remaining mathematical obstructions. We apply the Verifiable Growing Physical Transformer with Recursive Self-Improvement (VGPT-RSI) to two RH-adjacent certification tasks. First, we construct and verify a finite RH-boundary certificate for inequality on a parameterized safe lower curve over a region. The numerical boundary curve is converted into a certificate-backed lower curve, audited using outward-rounded interval arithmetic and Arb/FLINT ball arithmetic, and then checked in Rocq/CoqInterval for the parameterized theorem. Second, we initiate a formal Lagarias-route certificate. Lagarias criterion states that RH is equivalent to the global inequality. We formalize the finite quantity and produce a Coq-checked finite certificate. The final system identifies the exact unresolved mathematical bottlenecks: formalizing the Lagarias equivalence, proving the global tail theorem beyond any finite cutoff, and potentially reducing counterexamples to colossally abundant or related extremal integers. These results demonstrate that VGPT-RSI can produce certified RH-adjacent formal progress, organize proof dependencies, and avoid overclaiming when the remaining obstruction is genuinely mathematical.

17.
arXiv (CS.CL) 2026-06-16

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

18.
arXiv (CS.CV) 2026-06-12

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy in inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.

19.
arXiv (CS.LG) 2026-06-16

Functional Gradient Descent with Adaptive Representations

arXiv:2606.16926v1 Announce Type: cross Abstract: Functional optimization problems are typically solved by optimizing the parameters of a fixed representation, such as a neural network, resulting in highly nonconvex losses that complicate both training and theoretical analysis. An interesting alternative is functional gradient descent (FGD), that is, gradient descent directly in function space, which benefits from strong convergence results and admits a clean theory. However, FGD is difficult to implement in practice because functional gradients are infinite-dimensional, and thus cannot be fully computed nor stored in memory. Existing implementations therefore rely on fixed approximations, which introduce approximation error. We propose a new, theoretically-grounded FGD algorithm that adapts the representation of the functional gradients over the course of optimization. By explicitly incorporating this approximation into the analysis, we establish convergence to a stationary point (for smooth losses) and to a global minimizer (under smoothness + a Polyak-Lojasiewicz-type condition) regardless of our approximations. To the best of our knowledge, this is the first implementable FGD method with such guarantees in a general setting. We demonstrate the effectiveness of our method on regression, numerical solution of PDEs, and modern computer vision. Across settings, our method consistently outperforms both FGD with fixed approximations and neural network baselines in efficiency and accuracy.

20.
arXiv (CS.LG) 2026-06-15

Beyond a Single Explanation of the Adam–SGD Gap

arXiv:2606.14259v1 Announce Type: new Abstract: Prior work has identified several factors that can contribute to the performance gap between Adam and SGD, spanning data aspects, architecture design, and optimization properties. Yet these explanations are often studied in isolation, leaving their relative importance unclear. In this work, we revisit these hypotheses through a controlled empirical study across vision, language, genomics, and graph tasks, spanning modern and classical architectures, and carefully designed training setups. Our results suggest that no single factor consistently explains the Adam–SGD gap. For instance, the Adam advantage can (1) persist under a uniform vocabulary distribution yet nearly disappear under a heavy-tailed one; (2) reverse in favor of SGD in softmax-attention models; and (3) become larger under soft architectural modifications, e.g., when ReLU is replaced by a GeLU nonlinearity. This suggests that the gap arises from nontrivial data and architecture interactions, rather than from a single common factor. Yet, we observe a pattern across our settings: a crossover batch size at which the relative advantage shifts from SGD to Adam as the batch size scales. These empirical results are captured by our theoretical gap model, which predicts this batch-size-dependent crossover. Our perspective helps reconcile several existing hypotheses while offering practical insights across domains.

21.
arXiv (CS.AI) 2026-06-19

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

arXiv:2604.08552v2 Announce Type: replace-cross Abstract: Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.

22.
arXiv (CS.CL) 2026-06-16

T-Mem: Memory That Anticipates, Not Archives

Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). On this regime prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

23.
arXiv (CS.CV) 2026-06-19

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework for process-supervised progressive shape assembly in the rendered 2D domain, without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. Unlike text-only CoT, each decision is grounded in a rendered state, making counts, attachments, topology, and intermediate part-addition errors inspectable across the trajectory. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming direct generation by +24.2 points on component numeracy and +19.3 points on structural topology. SoT establishes a transparent testbed for rendered-domain structure-aware generation. The code is available at https://github.com/yuhuo03/Shape-of-Thought.

24.
arXiv (CS.AI) 2026-06-17

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

arXiv:2606.17904v1 Announce Type: new Abstract: Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation to constrain them to approved steps. In practice, however, operator queries frequently stray from this path, requiring models to recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer manufacturer converted into 1,676 multi-turn conversations that contrast compliant with out-of-scope utterances. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inadequate step rather than fabricating facts. The inherent plausibility and authority of this mapped but wrong advice exposes a challenging vulnerability for grounding systems.

25.
arXiv (quant-ph) 2026-06-15

Fulde-Ferrell superfluids in an asymmetric three-component Fermi Gas

arXiv:2602.24006v2 Announce Type: replace-cross Abstract: An asymmetric three-component Fermi gas, featuring Raman-induced spin-orbit coupling between the first and second components and contact interaction only between the first and third components, introduces both spin-orbit coupling and population imbalance-two mechanisms known to stabilize the Fulde-Ferrell superfluids.We systematically study Fulde-Ferrell superfluids in an asymmetric three-component Fermi gas { in two dimensions and at zero temperature} by finding the global minima of the thermodynamic potential. We reveal a new class of composite Fulde-Ferrell superfluids that emerges when strong spin-orbit coupling generates a double-well structure in momentum space within the lower spin-orbit-coupled band. The key features of these composite superfluids are identified.