Academic Intelligence · Curated Daily

Explore the frontiers of global academic research

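AcademicHub aggregates up-to-the-minute literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature-analysis briefings across interdisciplinary fields.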

01.
arXiv (CS.CL) 2026-06-11

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.
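
The swapped-reference construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `JudgeCase` record, the `build_swapped_cases` helper, and the example question and entities are hypothetical and do not come from the paper. Each reference (original or swapped) is paired with a candidate aligned to either entity, and the expected grade is determined solely by agreement with the reference shown to the judge.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JudgeCase:
    question: str
    reference: str          # reference answer shown to the judge
    candidate: str          # candidate answer to be graded
    expected_correct: bool  # label implied by the provided reference alone


def build_swapped_cases(question: str, original: str, swapped: str) -> list[JudgeCase]:
    """Pair each reference (original or swapped) with each aligned candidate."""
    cases = []
    for reference, candidate in product((original, swapped), repeat=2):
        # A reference-adherent judge should accept the candidate only when
        # it matches the reference it was given, even if that reference
        # conflicts with the judge's own knowledge.
        cases.append(JudgeCase(question, reference, candidate, candidate == reference))
    return cases


# Hypothetical example: "Sydney" plays the role of the incorrect swapped entity.
cases = build_swapped_cases("What is the capital of Australia?", "Canberra", "Sydney")
for case in cases:
    print(case)
```

The case of interest is the one where both reference and candidate are the swapped entity: a judge that follows the reference should mark it correct, while a judge leaning on parametric knowledge may not.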

02.
arXiv (CS.LG) 2026-06-16

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

arXiv:2606.15054v1. Sparse autoencoders (SAEs) detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously, claiming dictionary slots regardless of content alignment. This matters because sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read. We replace the score with a learned blend of cosine similarity and input magnitude, letting the optimizer choose how much norm to use; a per-feature extension lets each feature decide independently. In both regimes, training is free to recover inner product but never does, with no feature ever choosing more than half-magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts far more often than standard, filling dictionary slots that inner product wastes on norm detectors. Loss reweighting that equalizes gradients barely closes the gap, confirming forward-pass score geometry as the lever. The advantage is not universal across tasks or depths, but we believe cosine scoring should be the default for dictionary learning on normalized representations.
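
One way to picture a blend of cosine similarity and input magnitude is the NumPy sketch below. The parameterization `cos(x, w) * ||x||**alpha` is an assumption chosen for illustration, not necessarily the form used in the paper; `blended_scores` and `alpha` are hypothetical names. With `alpha = 0` the score is pure cosine similarity, and with `alpha = 1` it reduces to the inner product with unit-norm feature directions.

```python
import numpy as np


def blended_scores(x, W, alpha):
    """Score features as cos(x, w_i) * ||x||**alpha (assumed parameterization).

    x:     (batch, d) inputs
    W:     (n_features, d) encoder directions
    alpha: scalar, or (n_features,) array for a per-feature blend
    """
    W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    norms = np.linalg.norm(x, axis=1, keepdims=True)   # (batch, 1)
    cosine = (x @ W_unit.T) / norms                     # (batch, n_features)
    return cosine * norms ** alpha                      # broadcasts over features


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
x = rng.normal(size=(4, 8))

pure_cosine = blended_scores(x, W, 0.0)                  # ignores input norm
inner_product = blended_scores(x, W, 1.0)                # scales with input norm
per_feature = blended_scores(x, W, np.full(16, 0.5))     # per-feature exponent
```

Under this assumed form, scaling an input by a constant leaves the `alpha = 0` scores unchanged, which is the norm-invariance the abstract argues for on normalized representations.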

03.
arXiv (CS.AI) 2026-06-12

The KG-ER Conceptual Schema Language

arXiv:2508.02548v3. We propose KG-ER, a conceptual schema language for knowledge graphs that describes the structure of knowledge graphs independently of their representation (relational databases, property graphs, RDF) while helping to capture the semantics of the information stored in a knowledge graph.

04.
arXiv (CS.LG) 2026-06-15

The Program Is Still There: A Conservation Law for Program Discovery

arXiv:2606.13799v1. Finding the shortest program that generates a sequence is uncomputable, and for six decades that fact has been mistaken for a wall around finding any generating program. It is not a wall but a price, and this paper measures it. For every algorithm that learns about a candidate program only through its score, a class spanning Levin search, evolutionary methods, simulated annealing, and the cross-entropy method, we define the coupling width of a search problem and prove an unconditional worst-case lower bound, exponential in that width with base one less than the domain size. From it follows a conservation law: structural knowledge injected into a search trades one for one against the search it removes, and their sum can never fall below the length of the program sought. Levin's 1973 upper bound and the lower bound proved here are the two ends of one conserved quantity, closing on each other as the instruction set grows. The only escape is to read a candidate's structure rather than its score, and its price, which we prove for generic targets, is incompleteness. A deterministic engine built on this theory recovers a generating program, certified by compressing its data and predicting an unseen continuation, for 2,383 of 3,914 sequences across four independent populations, including 244 of the 256 elementary cellular automata, with measured discovery cost rising along program length more than an order of magnitude inside the score-oracle worst case.

05.
arXiv (CS.AI) 2026-06-11

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

arXiv:2606.11400v1. Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
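
The two operations the abstract describes, building a steering vector from contrasting instructions and reading out the position of maximal attention change, can be sketched generically in NumPy. This is a minimal sketch on synthetic arrays, assuming a mean-difference steering vector added to hidden states; the function names, the `strength` parameter, and the toy data are hypothetical, and no real LALM is involved.

```python
import numpy as np


def steering_vector(acts_instructed, acts_neutral):
    """Mean activation difference between two instruction conditions (same audio)."""
    return acts_instructed.mean(axis=0) - acts_neutral.mean(axis=0)


def apply_steering(hidden, vector, strength=1.0):
    """Add the scaled steering vector to every position's hidden state."""
    return hidden + strength * vector


def locate_event(attn_base, attn_steered):
    """Index of the audio token whose attention weight increased the most."""
    return int(np.argmax(attn_steered - attn_base))


# Synthetic activations: 5 prompts per condition, hidden size 4.
acts_instructed = np.full((5, 4), 2.0)
acts_neutral = np.full((5, 4), 0.5)
vector = steering_vector(acts_instructed, acts_neutral)
steered = apply_steering(np.zeros((3, 4)), vector, strength=2.0)

# Synthetic attention over 4 audio tokens before and after steering.
attn_base = np.array([0.25, 0.25, 0.25, 0.25])
attn_steered = np.array([0.10, 0.10, 0.70, 0.10])
peak = locate_event(attn_base, attn_steered)
```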

06.
arXiv (CS.CL) 2026-06-11

Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs

Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs), as they cause models to behave normally on clean inputs while producing attacker-specified responses when hidden triggers are present. Removing such unknown backdoors is particularly challenging when the defender does not know the backdoor attack types or the internal mechanisms formed through backdoor training. In this work, we propose a simple but effective backdoor removal method based on shared internal mechanisms across different backdoors. First, we show that different backdoors with the same task (attack objective) induce similar trigger-activated changes in the internal activations. Motivated by this observation, our method intentionally embeds a backdoor with a known trigger (dummy backdoor) and then removes it through further fine-tuning on dummy-triggered inputs paired with clean responses. Since the dummy backdoor and the unknown backdoor can rely on shared internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. We evaluate our method on three backdoor attack types across multiple model families. Experimental results show that our method substantially reduces the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defense methods in both backdoor removal effectiveness and utility preservation. These findings suggest that a defender-controllable backdoor can serve as a helpful proxy for mitigating unknown backdoors in generative LLMs.

07.
arXiv (CS.CL) 2026-06-12

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.

08.
arXiv (CS.LG) 2026-06-17

Beyond Independent Genes: Learning Module-Inductive Representations for Single-Cell Gene Perturbation Prediction

arXiv:2602.04901v2. Predicting transcriptional responses to genetic perturbations is a central problem in functional genomics. In practice, perturbation responses are rarely gene-independent but instead manifest as coordinated, program-level transcriptional changes among functionally related genes. However, most existing methods do not explicitly model such coordination, due to gene-wise modeling paradigms and reliance on static biological priors that cannot capture dynamic program reorganization. To address these limitations, we propose scBIG, a module-inductive perturbation prediction framework that explicitly models coordinated gene programs. scBIG induces coherent gene programs from data via Gene-Relation Clustering, captures inter-program interactions through a Gene-Cluster-Aware Encoder, and preserves modular coordination using structure-aware alignment objectives. These structured representations are then modeled using conditional flow matching to enable flexible and generalizable perturbation prediction. Extensive experiments on multiple single-cell perturbation benchmarks show that scBIG consistently outperforms state-of-the-art methods, particularly on unseen and combinatorial perturbation settings, achieving an average improvement of 6.7% over the strongest baselines. The code is available at https://github.com/ttruan2426-dot/scBIG.

09.
arXiv (CS.AI) 2026-06-16

Honeypot Protocol

arXiv:2604.13301v1. Trusted monitoring, the standard defense in AI control, is vulnerable to adaptive attacks, collusion, and strategic attack selection. All of these exploit the fact that monitoring is passive: it observes model behavior but never probes whether the model would behave differently under different perceived conditions. We introduce the honeypot protocol, which tests for context-dependent behavior by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding the task, environment, and scoring identical. We evaluate Claude Opus 4.6 in BashArena across all three conditions in both honest and attack modes. The model achieved 100% main task success and triggered zero side tasks uniformly across conditions, providing a baseline for future comparisons with stronger attack policies and additional models.

10.
arXiv (CS.LG) 2026-06-15

Scalable Deep Unfolding of Conic Optimizers

arXiv:2606.13825v1. Deep unfolding (DU) accelerates iterative optimizers by introducing learnable components and training them through unrolled iterations, but extending DU to the large-scale semidefinite programs (SDPs) common in robotics has remained limited. Unrolling a full-update conic solver such as COSMO exposes two obstacles that prior work on learned conic solvers has not: backpropagating through the per-iteration linear-system solve incurs memory quadratic in the problem size once the coefficient matrix is formed explicitly, and backpropagating through the positive semidefinite (PSD) cone projection becomes numerically unstable when eigenvalues coincide. We address the first obstacle with a matrix-free implicit differentiation rule that operates entirely through matrix-vector products, reducing memory from $O(n^2)$ to $O(n)$ and enabling backpropagation at scales where direct factorization runs out of memory. We address the second with a backward rule based on the Dalečkii–Krein representation of the Fréchet derivative, which remains well-defined under repeated eigenvalues. Together these make it possible to learn lightweight hyperparameter policies and warm-starts for a full-update conic solver. We evaluate on nonlinear covariance steering problems solved via sequential convex programming (SCP), as well as standalone SDPs and second-order cone programs ranging from max-cut and Lovász $\vartheta$ SDPs to robust estimation and control problems. The learned policies outperform state-of-the-art solvers across all problems, and can provide up to a 50$\times$ speedup depending on the class. When used as a subroutine in SCP, the learned approach delivers over a 30$\times$ speedup compared to COSMO.
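
For readers unfamiliar with the Dalečkii–Krein formula, the NumPy sketch below applies its textbook form to the PSD projection. It is a generic illustration, not the authors' implementation: the function names, the tolerance `tol`, and the convention `f'(0) = 0` are assumptions. For a symmetric matrix with eigendecomposition U diag(λ) Uᵀ and f(t) = max(t, 0), the directional derivative is U (Γ ∘ (Uᵀ H U)) Uᵀ, where Γ holds first divided differences of f and falls back to f' when two eigenvalues coincide.

```python
import numpy as np


def psd_projection(X):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    lam, U = np.linalg.eigh(X)
    return (U * np.maximum(lam, 0.0)) @ U.T


def psd_projection_jvp(X, H, tol=1e-10):
    """Directional derivative of the PSD projection at X along symmetric H."""
    lam, U = np.linalg.eigh(X)
    f = np.maximum(lam, 0.0)
    fprime = (lam > 0.0).astype(float)       # assumed convention: f'(0) = 0
    diff = lam[:, None] - lam[None, :]
    close = np.abs(diff) <= tol              # coinciding eigenvalue pairs
    safe = np.where(close, 1.0, diff)        # avoid dividing by zero
    gamma = np.where(close, fprime[:, None], (f[:, None] - f[None, :]) / safe)
    return U @ (gamma * (U.T @ H @ U)) @ U.T


H = np.array([[0.0, 1.0], [1.0, 0.0]])

# Repeated eigenvalues: the projection is locally the identity, so the
# derivative should return H itself rather than blow up.
at_repeated = psd_projection_jvp(2.0 * np.eye(2), H)

# Mixed signs: the off-diagonal coupling is scaled by a divided difference.
at_mixed = psd_projection_jvp(np.diag([1.0, -1.0]), H)
```

Because the divided-difference matrix is symmetric, the same expression also serves as the backward (vector-Jacobian) rule when `H` is replaced by the incoming gradient.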

11.
arXiv (CS.CV) 2026-06-12

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R{\cdot}S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
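
The relation $\sigma_\theta = \sigma_t / r$ follows from the small-angle picture: a tangential shift of $\sigma_t$ meters seen at range $r$ subtends roughly $\sigma_t / r$ radians. The short Python sketch below only evaluates that relation at a few ranges; the function name and the example values (0.5 m, ranges of 5 to 50 m) are hypothetical and chosen purely for illustration.

```python
import math


def angular_sigma(sigma_t: float, r: float) -> float:
    """Angular uncertainty (radians) from translational uncertainty sigma_t (m) at range r (m)."""
    if r <= 0.0:
        raise ValueError("range r must be positive")
    return sigma_t / r


sigma_t = 0.5  # hypothetical translational uncertainty in meters
for r in (5.0, 10.0, 50.0):
    sigma_theta = angular_sigma(sigma_t, r)
    print(f"r = {r:5.1f} m  sigma_theta = {math.degrees(sigma_theta):.2f} deg")
```

Nearby cells receive a wide angular spread and distant cells a narrow one, which is the distance-adaptive behavior the abstract refers to.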

12.
arXiv (CS.LG) 2026-06-16

LoComposition: Terrain-Adaptive Energy-Efficient Quadruped Locomotion without Gait Priors

arXiv:2606.15896v1. Learning-based quadrupedal locomotion typically relies on complex reward formulations that entangle task specification, operational limits, gait preference, and terrain adaptation within a single optimization objective. We instead treat these functions through distinct mechanisms: rewards for task specification, constraints for operational limits, energy minimization for gait preference, and exteroceptive perception for adapting energy use to terrain difficulty. We show that these components jointly enable efficient, terrain-adaptive locomotion, and that removing each component exposes a distinct failure mode. Our formulation removes explicit gait priors (including air-time, contact-count, and foot-clearance targets) in favor of emergent behavior. Compared to a conventional complex-reward baseline, our formulation achieves comparable terrain traversal while reducing cost of transport by 56% and operational-limit violations by 96%. The resulting policies transfer zero-shot to a physical Unitree Go2 using LiDAR-based elevation mapping. Project website with videos: https://tinyurl.com/locomposition.

13.
PLOS Computational Biology 2026-06-08

Statistics of cortical representational drift can enable robust readout

By Charles Micou and Timothy O’Leary. Representational drift of fixed stimuli, learned tasks and familiar environments is observed in many brain areas, leading to reconfiguration of population codes over days to weeks. This raises the question of whether downstream brain regions employ mechanisms to track changes in population activity and thus preserve the fidelity of the information they extract. We show that the statistical properties of drift have a significant impact on such mechanisms. Over an extended period, a net change in population tuning due to drift can arise from an accumulation of small changes distributed across the population, or via abrupt jumps that affect smaller subsets of cells at each time point. We demonstrate that an adaptive readout can exploit the heavy-tailed statistics of abrupt jumps to maintain a more stable readout using a simple inference mechanism. Using experimental data, we investigate the extent to which heavy-tailed drift statistics are observed during representational drift in the posterior parietal cortex and visual cortex. We find that experimentally measured drift does not conform to a Gaussian random walk. Instead, we find sudden jumps in neural tuning that would be advantageous for a downstream observer adapting to changes in representation. These observations motivate future study to determine whether adaptive decoding mechanisms exist in the brain and to determine the physiological mechanisms that shape the statistics of representational drift.

14.
arXiv (quant-ph) 2026-06-19

Optimal multi-spectral squeezing via deterministic 2D-phase optimization

arXiv:2606.20192v1. Optimization routines are ubiquitous in quantum information technologies and essential to reach the resource levels required by quantum protocols. Specifically, multi-spectral squeezing for use in such protocols requires that losses be kept minimal at every stage, including coherent detection, which is performed by interfering the signal with a classical local-oscillator beam. This in turn requires control over all optical degrees of freedom of the beam in order to optimize the detection. The most general framework for this optimization relies on agnostic, off-the-shelf machine-learning techniques. Here we take the opposite approach: by focusing on a physical description of the specific optical process, we develop a deterministic sequential algorithm that provably reaches the global maximum of the visibility in a pixel basis and scales linearly with the number of pixels, thereby offering an efficient and theoretically grounded alternative to black-box optimization. In our waveguide-based setup, the optimized mask increases the visibility from 76% to 84%, corresponding to a 20% gain in mode-matching efficiency. Multi-spectral squeezing measurements confirm that this improvement translates directly into quantum readout: for the most squeezed spectral mode, the squeezing increases from $-2.08$ dB to $-2.64$ dB, consistent with the inferred efficiency gain. These results establish deterministic spatial phase shaping as an effective, interpretable route to enhanced multimode squeezing in waveguide platforms.
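
The quoted figures can be cross-checked with a few lines of arithmetic. The sketch below rests on two textbook assumptions that the abstract does not spell out: mode-matching efficiency scales as the square of the visibility, and measured squeezing follows a pure-loss model in which the noise reduction below shot noise, 1 - V, is proportional to the total detection efficiency with all other losses unchanged. The function names are hypothetical.

```python
def mode_matching_efficiency(visibility: float) -> float:
    """Assumed relation: efficiency equals visibility squared."""
    return visibility ** 2


def noise_reduction(squeezing_db: float) -> float:
    """1 - V for a measured noise variance V given in dB relative to shot noise."""
    return 1.0 - 10.0 ** (squeezing_db / 10.0)


# Relative efficiency gain implied by the visibility change 76% -> 84%.
gain_from_visibility = mode_matching_efficiency(0.84) / mode_matching_efficiency(0.76) - 1.0

# Relative efficiency gain implied by the squeezing change -2.08 dB -> -2.64 dB.
gain_from_squeezing = noise_reduction(-2.64) / noise_reduction(-2.08) - 1.0

print(f"from visibility: {gain_from_visibility:.1%}")
print(f"from squeezing:  {gain_from_squeezing:.1%}")
```

Under these assumptions the visibility change implies a gain of about 22% and the squeezing change a gain of about 20%, both in line with the roughly 20% improvement the abstract reports.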

15.
arXiv (CS.CV) 2026-06-19

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. Reference implementations of both the proposed method and the compared trackers are available on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

16.
arXiv (CS.LG) 2026-06-17

Learning in Matching Games with Bandit Feedback

arXiv:2506.03802v2. We introduce a learning problem in a generalized two-sided matching market, where agents select actions to interact with their match. Specifically, we consider a setting in which matched agents engage in zero-sum games with initially unknown payoff matrices, and we investigate whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of a matching equilibrium, where a matching \( \mathfrak{m} \) and a set of agent strategies \( X \) form an equilibrium if no agent has an incentive to deviate from \( (\mathfrak{m}, X) \). To quantify deviations of a candidate solution \( (\mathfrak{m}, X) \) from the equilibrium \( (\mathfrak{m}^\star, X^\star) \), we introduce the notion of matching instability, which serves as a regret measure for the learning problem. We propose a UCB-based algorithm in which agents form preferences and select actions according to optimistic estimates of the payoffs. Our analysis establishes a sublinear, instance-independent regret upper bound, further supported by empirical evidence.
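
"Optimistic estimates of the payoffs" refers to the upper-confidence-bound idea from the bandit literature. The Python sketch below shows the generic UCB1-style index only, as background; it is not the paper's algorithm, and the function name, the exploration constant `c`, and the toy numbers are assumptions.

```python
import math


def ucb_index(mean_payoff: float, pulls: int, t: int, c: float = 2.0) -> float:
    """Empirical mean plus an exploration bonus that shrinks as pulls grow."""
    if pulls == 0:
        return math.inf  # untried options are maximally optimistic
    return mean_payoff + math.sqrt(c * math.log(t) / pulls)


# Toy example: three candidate actions after t = 100 rounds.
means = [0.40, 0.55, 0.50]
pulls = [60, 30, 10]
indices = [ucb_index(m, n, t=100) for m, n in zip(means, pulls)]
chosen = max(range(len(indices)), key=indices.__getitem__)
```

In this toy example the least-tried action is chosen despite not having the highest empirical mean, because its confidence bonus is the largest.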

17.
arXiv (CS.CV) 2026-06-17

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Vision–language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.

18.
arXiv (quant-ph) 2026-06-15

Fourier analysis of quantum neural network with non-linear data embedding

arXiv:2606.14206v1. Fourier analysis has become a crucial tool for understanding the expressivity of Variational Quantum Circuit (VQC) models, as well as an important indicator of barren plateaus (BP). While existing literature has only studied angle-embedded VQCs in a noiseless environment, here we develop the Fourier analysis of VQCs with non-linear data embedding, with particular focus on amplitude embedding, which provides a naturally compact encoding scheme. We first investigate a subtle difference in the domain of input features within amplitude embedding that leads to a distinct expressivity of the zero-frequency Fourier coefficient. By assuming that the ensemble of unitaries generated from the parameter space forms at least a 2-design with respect to the unitary group, we derive, via Weingarten calculus, that the mean of the Fourier coefficients is concentrated at zero, and the variance scales at an exponentially decaying order with respect to the multi-dimensional frequency magnitude. When a noise channel with unitary Kraus operators and probabilities $\{p_k\}$ is taken into account, the variance is further suppressed by a factor $\left(\sum_k p_k^2\right)^{Q}$

19.
arXiv (quant-ph) 2026-06-15

Efimov Effect in Ultracold Microwave-Shielded Polar Molecules

arXiv:2602.21433v2. A quantum-mechanical description is presented for the three-body physics of shielded dipolar molecules, including a prediction of observable Efimov physics. Despite the anisotropic and long-range nature of the interaction, shielding enables a regime in which universality emerges already at the two-body level and extends to the three-body sector, where Efimov physics emerges. On the negative side of the scattering-length resonance, computed trimer binding energies display the characteristic scaling expected for Efimov resonances. Finally, starting from positive-energy trap states, the sudden approximation can be used to create trimer bound states, offering a way to create or detect these molecular trimers. Moreover, the three-body parameter expressed in dipolar units is found to be universal.

20.
arXiv (CS.AI) 2026-06-17

LLM Consumer Behavior Theory: Foundations of a Novel Research Field

arXiv:2606.18005v1. Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consumer theory, which has traditionally modeled humans as the primary decision-makers. In this paper, we introduce LLM Consumer Behavior Theory, a new field of study concerned with analyzing consumer behavior in agentic markets. Drawing on classical and behavioral economics alongside recent advances in Natural Language Processing, we formalize how human preferences are reflected and acted upon by LLM-based agents, and how agent-level decisions aggregate into market demand. We unify previously fragmented literature on LLM decision-making, human behavior simulation, and preference elicitation under a common economic lens, highlighting where assumptions, such as rationality and heterogeneity, may fail in agentic markets. Rather than providing empirical validation, this paper outlines the scope of LLM consumer behavior and identifies open research questions related to alignment, preference representation, and market dynamics.

21.
arXiv (CS.CL) 2026-06-11

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

22.
arXiv (CS.CL) 2026-06-11

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish matching upper and lower bounds on the excess risk, characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost ${\sf C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/7})$. These rates are certified by complementary lower bounds – statistical, via an information-theoretic two-point reduction, and optimization-side, via a first-order oracle argument – rendering the two-stage law tight up to constants, logarithmic factors, and a condition-number gap. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the bounds of generalization.

23.
arXiv (CS.CV) 2026-06-18

Recognizing and Reconstructing a Multi-Unit Floor Plan

Digital twins have major potential to become a significant part of urban management in emergency planning, as they allow more efficient design of escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating such twins remains a largely manual effort because 3D representations are scarce and available only for some new buildings. In this paper, we therefore aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective, together with a reconstruction stage of the pipeline that vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques on several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods achieve a mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.

24.
arXiv (CS.CV) 2026-06-16

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.

25.
arXiv (CS.LG) 2026-06-18

Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks

arXiv:2602.02056v3. Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.
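
B-spline locality means that at any input only degree + 1 basis functions are nonzero, so a gradient step on a single spline touches only that many coefficients. The pure-Python sketch below illustrates this with the standard Cox-de Boor recursion on uniform knots. It is a floating-point toy under stated assumptions, not the fixed-point FPGA design from the paper; `online_update`, the learning rate, and the knot layout are hypothetical.

```python
def bspline_basis(i, p, x, knots):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree p."""
    if p == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den > 0:
        left = (x - knots[i]) / left_den * bspline_basis(i, p - 1, x, knots)
    if right_den > 0:
        right = (knots[i + p + 1] - x) / right_den * bspline_basis(i + 1, p - 1, x, knots)
    return left + right


def online_update(coeffs, x, target, knots, degree=3, lr=0.1):
    """One SGD step on 0.5 * (spline(x) - target)**2; returns the touched indices."""
    basis = [bspline_basis(i, degree, x, knots) for i in range(len(coeffs))]
    error = sum(c * b for c, b in zip(coeffs, basis)) - target
    touched = [i for i, b in enumerate(basis) if b != 0.0]
    for i in touched:
        coeffs[i] -= lr * error * basis[i]  # the gradient is zero everywhere else
    return touched


degree = 3
num_coeffs = 8
knots = [float(k) for k in range(num_coeffs + degree + 1)]  # uniform knots 0..11
coeffs = [0.0] * num_coeffs
touched = online_update(coeffs, x=4.3, target=1.0, knots=knots, degree=degree)
print(touched, coeffs)
```

Only four of the eight coefficients change in this step, independent of how many coefficients the spline has, which is the sparsity that the abstract links to on-chip resource scaling.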