Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
Nature (Science) 2026-06-17

Navigating a crowded developing brain leaves neurons with broken DNA

As neurons migrate to their final destinations in the forming brain, their DNA gets damaged. The brain has evolved a fix, but there can be lasting consequences if repair fails. As neurons migrate to their final destinations in the forming brain, their DNA gets damaged. The brain has evolved a fix, but there can be lasting consequences if repair fails.

03.
arXiv (CS.CV) 2026-06-11

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by $+3.43$, $+8.01$, and $+15.74$ over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

04.
arXiv (CS.CL) 2026-06-16

Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.

05.
arXiv (CS.CV) 2026-06-12

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R{\cdot}S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.

06.
arXiv (CS.AI) 2026-06-15

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

arXiv:2606.14031v1 Announce Type: new Abstract: Identifying conditions that a certain drug takes therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions where such relations can apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug–disease relations from biomedical research literature. We create the first dataset that has manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts with 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: https://github.com/guantingluo98/Drug-ACE

07.
arXiv (CS.CV) 2026-06-18

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

08.
arXiv (CS.CV) 2026-06-18

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1% point absolute gain. Our code is available at: https://github.com/WnQinm/Annotator.

09.
arXiv (CS.CL) 2026-06-16

G-Loss: Graph-Guided Fine-Tuning of Language Models

Traditional loss functions, including cross-entropy, contrastive, triplet, and su pervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.

10.
arXiv (CS.CL) 2026-06-16

LLM-based Visual Code Completion for Aerospace Geometric Design

Recent advances in both Large Language Models (LLMs) and Vision Language Models (VLMs) have seen a step change in their ability to perform visual code completion, but the aerospace industry, which prioritizes safety and explainabilty over rapid LLM adoption, currently has no publicly announced LLM-based geometric design copilot systems in commercial use by aerospace Original Equipment Manufacturers (OEMs). This paper presents a LLM-based visual programming copilot application for aerospace engineering design tasks, using a visual programming variant of the ReAct methodology and GPT 5.4. In addition to the copilot, we describe Wingbuilder, a new Grasshopper plugin library with custom components for aerospace-specific geometry abstraction, and an associated Aerospace Visual Programming Dataset (AVPD) with 18 aerospace expert designed tasks at different levels of difficulty alongside ground truth solutions. We evaluate our copilot application with a user trial involving two experienced aerospace engineers from a large aircraft manufacturing company. We find our copilot visual programming ReAct methodology was successful in generating suggestions that participants found helpful, but slow ReAct inference times limit its usefulness to more complex time-consuming tasks where waiting for good copilot solution suggestion was worthwhile. Participants reported they liked the tool and would be willing to use it in the future.

11.
arXiv (CS.AI) 2026-06-16

From Correlation to Causation in Lane Change Prediction for Automated Driving: A Causal Explanation Framework

arXiv:2606.15756v1 Announce Type: cross Abstract: Lane-change prediction is a central task in intelligent vehicles, where early maneuver anticipation can support safer decision-making. However, many existing approaches mainly learn statistical associations between observed driving variables and future maneuvers, while overlooking the causal dependencies among the input variables themselves. This limits interpretability, especially when physically related variables such as longitudinal gap, relative longitudinal velocity, and Time-To-Collision (TTC) are treated as independent flat inputs. This article presents a causal-inference-based framework for lane-change prediction and explanation. The proposed approach combines linguistic feature construction, expert-constrained causal discovery, deep structural causal modeling with Deep End-to-end Causal Inference (DECI), intervention-based effect analysis, refutation testing, and recursive causal-chain explanation. The objective is not only to predict the future maneuver, but also to identify candidate variables that directly contribute to the prediction, the upstream factors influencing them, and the causal chains through which these effects propagate. The framework achieves average F1-scores above 95% during the first three seconds before the lane-marking crossing event. Beyond prediction accuracy, the framework uses intervention-based effect analysis to distinguish influential from weakly influential variables under the learned causal structure. It further distinguishes candidate direct contributors from mediated effects and generates contrastive causal-chain explanations that clarify why the predicted maneuver is favored and why the alternative maneuvers are less supported. The main contribution is therefore a mechanism-aware lane-change prediction pipeline that moves beyond correlation-based classification toward more interpretable causal reasoning for maneuver prediction.

12.
arXiv (CS.AI) 2026-06-16

Ranking Abuse via Strategic Pairwise Data Perturbations

arXiv:2604.17805v2 Announce Type: replace-cross Abstract: Pairwise ranking systems based on Maximum Likelihood Estimation (MLE), such as the Bradley-Terry model, are widely used to aggregate preferences from pairwise comparisons. However, their robustness under strategic data manipulation remains insufficiently understood. In this paper, we study the vulnerability of MLE-based ranking systems to adversarial perturbations. We formulate the manipulation task as a constrained combinatorial optimization problem and propose an Adaptive Subset Selection Attack (ASSA) to efficiently identify high-impact perturbations. Experimental results on both synthetic data and real-world election datasets show that MLE-based rankings exhibit a sharp phase-transition behavior: beyond a small perturbation budget, a limited number of strategic voters can significantly alter the global ranking. In particular, our method consistently outperforms random and greedy baselines under constrained budgets. These findings reveal a fundamental sensitivity of MLE-based ranking mechanisms to structured perturbations and highlight the need for more robust aggregation methods in collective decision-making systems.

13.
arXiv (CS.CV) 2026-06-16

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

14.
arXiv (CS.LG) 2026-06-16

MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection

arXiv:2602.09329v3 Announce Type: replace Abstract: Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark AdBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvrBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: We provide standardized train/test splits for all datasets, public/private benchmark partitions with held-out test labels for the latter reserved toward an online leaderboard, and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, as well as individual performances as references for future research. All benchmarks containing 2,446 datasets combined are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.

15.
arXiv (quant-ph) 2026-06-12

Quantum Error Correction Codes for Truncated SU(2) Lattice Gauge Theories

Authors:

arXiv:2511.13721v2 Announce Type: replace Abstract: We construct two quantum error correction codes for pure SU(2) lattice gauge theory in the electric basis truncated at the electric flux $j_max=1/2$, which are applicable on quasi-1D plaquette chains, 2D honeycomb and 3D triamond and hyperhoneycomb lattices. The first code converts Gauss's law at each vertex into a stabilizer while the second only uses half of the vertices and is locally the carbon code. Both codes are able to correct single-qubit errors. The electric and magnetic terms in the SU(2) Hamiltonian are expressed in terms of logical gates in both codes. The logical-gate Hamiltonian in the first code exactly matches the spin Hamiltonian for gauge singlet states found in previous work.

16.
arXiv (CS.AI) 2026-06-15

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

arXiv:2606.14239v1 Announce Type: new Abstract: Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards – signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

17.
arXiv (CS.CV) 2026-06-16

Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This become particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase the performance. Beyond fine-tuning, VIOLIN improves various small scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited data settings.

18.
arXiv (quant-ph) 2026-06-19

Exact Markovian Dissipation Requires Singular Energy Resources

arXiv:2606.19510v1 Announce Type: new Abstract: The Gorini–Kossakowski–Lindblad–Sudarshan (GKLS) equation describes irreversible quantum dynamical semigroups. We show that this description cannot be exact under physically regular energy conditions. We prove that the open-system survival probability under physically regular energy conditions has sublinear decay, whereas any dissipative GKLS semigroup has a linear short-time decay. Hence exact Markovian dissipation requires singular energy resources: an unbounded-below total Hamiltonian or infinite initial energy, and a divergent interaction-energy moment. Therefore, a dissipative time-independent GKLS equation should be regarded as an effective description rather than the exact reduced dynamics of a Hamiltonian dilation satisfying physically regular energy conditions.

19.
arXiv (CS.AI) 2026-06-19

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

arXiv:2606.19733v1 Announce Type: cross Abstract: Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

20.
arXiv (CS.LG) 2026-06-11

Machine-learning clustering of close-in exoplanet populations: links to pebble accretion

arXiv:2606.11737v1 Announce Type: cross Abstract: Close-in exoplanets exhibit a wide range of orbital architectures and physical properties shaped by both formation conditions and migration processes. Although population-synthesis models predict distinct planetary populations, establishing a quantitative connection between observed exoplanets and synthetic populations remains challenging. We investigate the intrinsic organisation of close-in exoplanets using physically motivated dynamical parameters and connect the resulting populations to pebble-accretion formation pathways. A two-stage Gaussian mixture model (GMM) is applied to an observed sample of close-in exoplanets, performing unsupervised probabilistic clustering in a feature space dominated by dynamical descriptors of planet-star interactions. The resulting clusters are mapped onto a pebble-accretion synthetic population within a statistically motivated three-dimensional parameter space. Formation-related quantities, including gas availability, gas fraction, and ice-rock mass ratio, are then used to interpret the mapped populations. We identify statistically supported sub-populations without imposing predefined classification boundaries, including very-massive gas giants, hot giants, warm-Jupiter-dominated systems, and lower-mass giants. The mapped synthetic populations reveal systematic differences in formation timing, gas accretion, and solid growth histories. In particular, very-massive gas giants are preferentially associated with earlier formation epochs than hot-giant and warm-Jupiter-dominated populations. These results demonstrate that physically motivated machine-learning approaches can provide a statistically robust framework for linking observed exoplanet populations to theoretical planet formation pathways.

21.
arXiv (CS.CL) 2026-06-16

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy. Code is available at https://github.com/DreamSoul-AI/OBCache.

22.
arXiv (CS.LG) 2026-06-12

Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines

arXiv:2606.13454v1 Announce Type: cross Abstract: Equilibrium Propagation offers a compelling alternative to traditional machine learning for training energy-based networks. Here we demonstrate a hybrid optical-digital implementation of EP using a Spatial Photonic Ising Machine (SPIM). The SPIM exploits the gauge transformation method to optically encode both continuous neuron states and rank-1 binary trainable patterns as phase modulations via a spatial light modulator, with inference realized using a finite difference scheme. The experimental system is evaluated on the Wine classification dataset. The potential of this approach, including the use of continuous couplings and structured coupling matrices, is evaluated numerically on the more complex MNIST dataset. Our work provides a concrete pathway toward energy-efficient physical implementations of Equilibrium Propagation.

23.
arXiv (CS.CV) 2026-06-12

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/

24.
arXiv (CS.CV) 2026-06-15

High-Fidelity Video Compression based on Invertible Neural Transform and Implicit Conditioning

Learning-based video compression has recently achieved competitive rate-distortion performance compared to conventional video codecs. However, most existing methods rely on non-invertible analysis-synthesis transforms, with reconstruction quality subject to both quantization and transform approximation errors. This limitation becomes particularly restrictive at higher quality points, where quantization errors are small and transform-induced distortion dominates. To address this, we propose InnVC, an Invertible neural network based Video Codec for wide-range and high-fidelity compression. The core idea is to preserve an invertible main transform path prior to quantization, while injecting content-adaptive context through a compact implicit conditioning field. This decouples strongly correlated video content from harder-to-model fine details, allowing different components to specialize in complementary reconstruction tasks for more efficient compression. To further improve compressibility, we introduce a scheduled masking strategy that progressively concentrates informative content into fewer latent channels for more effective entropy coding. Experiments on the UVG and MCL-JCV benchmarks show that InnVC achieves strong compression performance over a broad quality range, being particularly effective in the high-quality regime, yielding BD-rate reductions of 21.66% in PSNR and 46.06% in MS-SSIM relative to x265 on UVG. To the best of our knowledge, InnVC is the first neural video codec covers operating poins from low bitrate to high fidelity within a single architecture scale, spanning more than 20 dB in PSNR.

25.
arXiv (CS.AI) 2026-06-11

From Awareness to Action: Understanding and Overcoming the Research-Practice Gap in Algorithmic Fairness for Public Health

arXiv:2606.11214v1 Announce Type: cross Abstract: Algorithmic fairness is essential for responsible ML-driven public health research, yet its practical implementation remains limited. To investigate this awareness-action gap, we conducted a sequential mixed-methods study comprising expert interviews, an online survey, and systematic mapping. The expert interviews informed the design of the survey, which in turn revealed fragmented definitions of fairness, limited training and guidance, reliance on external sources, and rare use of formal assessment, mitigation, or monitoring. These findings were subsequently mapped onto three established research-practice gap lenses: the Knowledge-Practice Gap, the Knowledge-to-Action Cycle, and the Knowing-Doing Gap, each offering complementary perspectives. Building on this synthesis, we introduce the Fairness-to-Action framework, which integrates methodological, organizational, and systemic dimensions to identify where translation of algorithmic fairness knowledge stalls. Our analysis shows that fairness remains weakly institutionalized, translation mechanisms are externally driven, and system-level priorities continue to emphasize accuracy over fairness. These insights suggest critical leverage points for advancing safe, fair, and ethical ML-driven public health research practice.