Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-18

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($\Delta$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

02.
arXiv (CS.CL) 2026-06-16

Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contributions of each modality by training models on wrist-only and chest-only inputs. In addition, we implement a late-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 +/- 0.13%) and macro-F1 score (98.56 +/- 0.17%). These findings demonstrate the effectiveness of sensor fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.

03.
arXiv (CS.AI) 2026-06-12

The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

arXiv:2606.12415v1 Announce Type: cross Abstract: The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist – understood broadly to encompass any professional with advanced legal training – operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.

04.
medRxiv (Medicine) 2026-06-16

Upper airway disease in primary ciliary dyskinesia: Clinical management and factors influencing decision-making, a multicentre analysis

Background Upper airway disease is common in primary ciliary dyskinesia (PCD), but management evidence is limited. We aimed to describe management practices and identify factors influencing management decisions. Methods Using data from the Ear-Nose-Throat (ENT) Prospective International Cohort of patients with PCD (EPIC-PCD) and an ENT-specialist survey across participating centres, we described management practices recorded at routine follow-up. We assessed clinical factors associated with practices via mixed-effects logistic regression models. In a subgroup of patients, we assessed factors associated with initiation or discontinuation of practices. Results We included 579 patients: median age 15 years, 46% female. Nasal rinsing (54%) and nasal corticosteroids (22%) were most frequently prescribed. Among 466 patients with available data, 47 had grommets (10%) and 42 hearing aids (9%). Nasal corticosteroids and rinsing were more frequently prescribed in patients with polyps (odds ratio [OR] 3.74, 95% confidence interval [CI] 1.80-7.76; OR 3.39, 95% CI 1.37-8.37) or turbinate hypertrophy (OR 1.89, 95% CI 1.03-3.47; OR 2.89, 95% CI 1.55-5.38), and upper airway nebulisation in patients with frequent nasal symptoms (OR 2.86, 95% CI 1.11-7.39). Management practices differed between centres, as seen also by the specialists survey responses. In 177 patients with multiple visits, initiation of nasal rinsing was associated with frequent nasal symptoms (OR 3.18, 95% CI 1.24-8.18) and turbinate hypertrophy (OR 3.21, 95% CI 1.20-8.59). Conclusion Upper airway disease management in PCD varies and is partly guided by symptom burden and clinical findings. This variation across centres highlights the need for care standardisation and PCD-specific management guidelines.

05.
arXiv (CS.AI) 2026-06-19

eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization

arXiv:2606.19921v1 Announce Type: new Abstract: This work proposes an element-based Convolutional Neural Network (CNN) to accelerate density-based Topology Optimization (TO), termed eCNNTO. TO generally undergoes a large number of iterations, where finite element analysis is performed in every iteration, leading to the efficiency bottleneck especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO is proposed to build upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, the method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs CNN with residual connections to address this issue. On top of it, a novel training strategy is introduced to further enhance the optimization efficiency, where the training dataset consists of the final stage density histories rather than early ones. This change can also help reduce the required training data size. eCNNTO requires only a small dataset to train and yet it can be generalized to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, as well as non-design domains. In the end, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction of iterations, respectively.

06.
arXiv (CS.CL) 2026-06-18

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

07.
arXiv (CS.AI) 2026-06-16

PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency

arXiv:2510.15966v2 Announce Type: replace Abstract: Memory systems are fundamental to AI agents, yet existing work often lacks adaptability to diverse tasks and overlooks the constructive and task-oriented role of AI agent memory. Drawing from Piaget's theory of cognitive development, we propose PISA, a pragmatic, psych-inspired unified memory system that addresses these limitations by treating memory as a constructive and adaptive process. To enable continuous learning and adaptability, PISA introduces a trimodal adaptation mechanism (i.e., schema updation, schema evolution, and schema creation) that preserves coherent organization while supporting flexible memory updates. Building on these schema-grounded structures, we further design a hybrid memory access architecture that seamlessly integrates symbolic reasoning with neural retrieval, significantly improving retrieval accuracy and efficiency. Our empirical evaluation, conducted on the existing LOCOMO benchmark and our newly proposed AggQA benchmark for data analysis tasks, confirms that PISA sets a new state-of-the-art by significantly enhancing adaptability and long-term knowledge retention.

08.
arXiv (CS.LG) 2026-06-17

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

arXiv:2602.23116v3 Announce Type: replace Abstract: We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM) – capturing intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix – to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, derived using only strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a generic polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a dimension-wise improved regret (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that ``fast'' regrets are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.

09.
arXiv (quant-ph) 2026-06-16

High-dimensional coherence to entanglement transduction under canonical noise

arXiv:2606.16695v1 Announce Type: new Abstract: We develop an analytical framework for coherence-to-entanglement conversion in bipartite high-dimensional quantum systems, so-called qunits. An arbitrary coherent input qunit is coupled to an incoherent ancilla through a generalized controlled-shift operation, producing a maximally correlated bipartite state. By analyzing the partial transpose of the output state, we establish an exact dimension-independent connection between the input coherence and the generated entanglement. We then study how this conversion is affected by three standard noise processes applied after the conversion step: phase damping, global depolarizing noise, and independent amplitude damping. The resulting expressions show that these channels degrade entanglement in qualitatively different ways. Phase damping leads to a uniform attenuation of the entanglement generated from coherence, depolarizing noise introduces pairwise thresholds associated with entanglement sudden death, and amplitude damping produces an asymmetric decay governed by relaxation toward the ground state. For maximally coherent inputs, the general results reduce to simple closed-form behavior, allowing direct comparison of the three noise mechanisms as the system dimension increases. In particular, global depolarizing noise exhibits a dimension-dependent sudden-death threshold, while amplitude damping leads to a smooth suppression in the maximally coherent case. These results provide useful analytical benchmarks for high-dimensional resource conversion and for assessing noisy entanglement generation in qudit-based quantum-information settings.

10.
arXiv (CS.CL) 2026-06-16

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.

11.
arXiv (CS.LG) 2026-06-19

Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations

arXiv:2606.20417v1 Announce Type: new Abstract: Inverse problems for differential equations arise throughout science and engineering, where one seeks to infer unknown model parameters from noisy or incomplete observations. Traditional numerical methods for these problems are often computationally expensive, particularly in Bayesian settings where evaluating the likelihood becomes costly for complex forward models and high-dimensional parameter spaces. To address this challenge, we introduce DeepGaLA, a neural-network surrogate for differential equation solvers that provides uncertainty-aware predictions, reducing overconfident inference when training data are limited. To evaluate the fidelity of the surrogate-induced posterior approximations in practice, we show that a short run of delayed-acceptance Markov chain Monte Carlo can serve as an effective diagnostic. Across a range of numerical experiments, DeepGaLA delivers forward-model approximations with accuracy comparable to established Gaussian-process surrogates, while better maintaining efficiency as parameter dimension grows. Moreover, it can incorporate differential-equation constraints, including in nonlinear settings. Overall, these results indicate that uncertainty-quantified neural surrogates can enable scalable and reliable Bayesian inference for inverse problems in complex systems.

12.
arXiv (CS.AI) 2026-06-17

DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

arXiv:2604.24357v2 Announce Type: replace-cross Abstract: Diffusion language models generate without a fixed left-to-right order, leaving token ordering as a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering, which respectively suffer from train–test mismatch and myopic exploration. We introduce DPRM (Doob -transform Process Reward Model), a plug-in token-ordering module that keeps the host architecture, denoising objective and supervision unchanged, and modifies only the ordering policy. DPRM starts from confidence-driven ordering and gradually shifts to process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove convergence of its stagewise Soft-BoN approximation, show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and establish a sample-complexity advantage under tractable optimization assumptions. Across nine hosts covering language reasoning, test-time scaling, protein, single-cell, molecular, DNA, text-to-image generation, and VQA, DPRM order variants improve several language, DNA, and multimodal settings while also identifying boundary cases where confidence-only ordering or task-specific utilities are preferable. Code is available at: https://github.com/DakeBU/DPRM-DLLM

13.
bioRxiv (Bioinfo) 2026-06-19

Evaluation of analysis modes for RNA coexpression in single-cell and bulk tissue

Coexpression of transcripts presents the most common means of computational inference of transcription factor regulation, and is often combined with other data types to infer regulatory networks. With the growing popularity of single-cell approaches, there are questions about how best to extract coexpression information from the data. Recently we reported a simulation study that explored the differences among coexpression performed at different levels: across single cells (xCell, per cell type), across subjects from pseudobulked single-cell data (xSubject, per cell type), or across subjects using bulk tissue samples (xBulk). Here we test predictions made by those models using real data. We consider both preservation (consistency of coexpression findings across different levels of analysis of the same data) and replicability across independent studies, as well as biological interpretability. We find that preservation across levels is limited, indicating the choice of analysis level will affect outcomes. We show that xCell coexpression is more replicable across studies compared to xSubject. xBulk coexpression is dominated by patterns driven by variability in cellular composition and fails to capture much coexpression that is reliably detected at finer resolutions. While all modes of analysis exhibit some enrichment for known regulatory relationships, it was highest with the xCell mode. Finally, we present a case study of the effect of analysis modes on a schizophrenia-associated pattern, reinforcing the importance of analytic choices in the interpretation and replicability of coexpression analyses. Together with our modeling study, this work emphasizes the importance of understanding sources of expression covariation as they relate to the goals of the analysis, and recommend single-cell-based data with biological replicates should be the focus of attempts to infer dynamic regulatory interactions that are more likely to be replicable by others.

14.
arXiv (quant-ph) 2026-06-17

Time-spectral control of accidental coincidences in daylight entanglement-based free-space QKD

arXiv:2606.17365v1 Announce Type: new Abstract: Daylight entanglement-based free-space quantum key distribution (QKD) is limited by accidental coincidences from receiver-admitted background light. We develop and experimentally validate a receiver-level framework linking receiver bandwidth, accepted temporal width, and background-noise density to Bob singles, sifted-key rate, error rate, and quantum bit error rate (QBER) in telecom-wavelength BBM92 QKD. Indoor sweeps show that useful sifted counts saturate near the source-matched bandwidth, whereas broader bandwidth or higher background mainly increases accidental contamination. Increasing the accepted temporal width leaves Bob singles nearly unchanged but directly raises QBER by enlarging the random-overlap probability. A two-dimensional design map shows that the temporal-window margin contracts rapidly with increasing background-to-signal ratio, while the bandwidth margin remains comparatively broad near source-matched filtering. A 10 m rooftop daylight experiment demonstrates operation in the predicted low-accidental regime, yielding a mean sifted-key rate of 2,811 cps and a mean QBER of 4.43%.

15.
arXiv (CS.AI) 2026-06-19

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

arXiv:2606.20520v1 Announce Type: cross Abstract: Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.

16.
arXiv (CS.LG) 2026-06-12

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

作者:

arXiv:2606.07334v2 Announce Type: replace-cross Abstract: This report treats chord-symbol sequences as an interpretable, controllable time series for genre-local harmonic modeling. The frozen Music Transformer base - released as a pop-jazz fine-tune endpoint but verified in this revision weight-identical to the pop-only Phase-0 baseline, so all gains are measured over a pure-pop prior (see Changes in v2) - is extended to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction (macro gains +2.89 to +3.61 percentage points); LoRA and IA3 score highest, but pairwise Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: at a common corpus size IA3 stays on top while LoRA drops to last, so the small method gaps are partly data-driven rather than representational. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting the adaptation effect is largely lightweight conditioning over a reusable harmonic base rather than genre-specific adapter memory. Further diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation that v2 reinterprets as a same-weights control, chord-only genre classification, output-distribution statistics, real-song evaluation, duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. Perceived genre authenticity and musical quality are left to controlled listener evaluation.

17.
arXiv (quant-ph) 2026-06-16

REGRID-QAOA: A Resource-Efficient Graph-Reduced Hybrid QAOA Framework for Physics-Constrained Power System Islanding

arXiv:2606.15083v1 Announce Type: new Abstract: Quantum computing has rapidly emerged as a powerful paradigm for tackling computationally demanding problems. In particular, quantum optimization shows strong promise for hard combinatorial problems in power systems, where increasing distributed energy penetration heightens the need for intentional islanding to maintain grid reliability and resilience. However, power system islanding is an NP-hard combinatorial optimization problem that becomes computationally prohibitive for classical solvers as network size grows, motivating the use of quantum computing as a promising alternative pipeline. This study develops a resource-efficient hybrid QAOA islanding framework that brings physics-constrained power-system partitioning into the quantum optimization workflow. The framework combines coherency-informed graph reduction, physics-aware constraint modeling, and structured post-processing to efficiently convert shallow-circuit QAOA samples into high-quality feasible islanding decisions without deep circuits or large shot budgets. The proposed framework is validated on the standard IEEE benchmark systems (9-, 14-, 24-, 30-, 39-, and 57-bus), demonstrating that the hybrid workflow achieves Gurobi-optimal solution quality with a clear quantum resource advantage over vanilla QAOA, while the resulting islanding solutions satisfy all physical feasibility requirements after network separation. This study establishes QAOA-based islanding as a viable quantum approach for critical infrastructure, with structured post-processing as the key enabler of quantum resource efficiency.

18.
arXiv (quant-ph) 2026-06-15

Resolving the Edge of a Quantum Pyramid

arXiv:2606.14698v1 Announce Type: new Abstract: Standing on the shoulders of giants, we resolve the quantum pyramids conjecture, confirming the globally information-optimal measurement for an ensemble of equiangular equiprobable pure states, as conjectured by Englert and \v{R}eháček (arXiv:0905.0510). We do so by proving the remaining entropy inequalities of Holevo and Utkin (arXiv:2506.06700), which certify optimality for obtuse and flat pyramids. For obtuse pyramids, our key contribution is a rigorous proof that local minimizers of the corresponding entropy inequality cannot have three distinct coordinate values. We show that eliminating this family can be reduced to a neat algebraic reciprocal inequality relating branches of the Lambert $W$ function, which may be of independent interest. For flat pyramids, we prove a tight $\ell^p$ inequality for zero-sum vectors that was recently conjectured, proved analytically in dimension $d=3$, and computationally verified for $d\leq 200$ by Holevo and Utkin (arXiv:2603.24017). We prove this bound for all $d\geq 2$ via a technique in symmetric inequalities known as the equal variables method.

19.
arXiv (CS.CV) 2026-06-19

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer–evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.

20.
arXiv (quant-ph) 2026-06-17

SPICE-Q and Large-Scale Quantum Chip Production

arXiv:2606.17907v1 Announce Type: new Abstract: We propose SPICE-Q, a SPICE-inspired design-technology co-optimization framework for superconducting quantum processors. Rather than replacing tools such as HFSS, Qiskit Metal, pyEPR, SQcircuit, SQuADDS, scqubits, or QuTiP, SPICE-Q aims to connect them through a unified, traceable data chain spanning process rules, layout, electromagnetic simulation, energy-participation-ratio and circuit quantization, Hamiltonian extraction, noise analysis, cryogenic test, and manufacturing feedback. The central mapping is from process and PDK constraints to layout geometry, electromagnetic modes, equivalent circuit parameters, effective Hamiltonians, and finally metrics such as frequency, coupling, anharmonicity, decoherence, readout performance, and yield. This flow must capture Josephson-junction variability, transmon frequency allocation, resonator and Purcell constraints, coupler crosstalk, microwave routing, 3D interconnects, material/interface loss, package modes, and wafer-scale process statistics. By introducing standardized model interfaces, statistical parameter models, model cards, version governance, and closed-loop calibration from cryogenic and fabrication data, SPICE-Q frames superconducting quantum-chip design as an engineering workflow rather than a collection of isolated simulations. We argue that scalable and fault-tolerant quantum processors will require such a continuous model chain from device physics and electromagnetic fields to quantum dynamics, noise, manufacturability, and system-level yield.

21.
arXiv (CS.CV) 2026-06-16

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textsc{Light Forcing}, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (e.g., 84.5 on VBench) and efficiency (e.g., $1.2{\sim}1.3\times$ end-to-end speedup). Combined with other efficient solutions, \textsc{Light Forcing} further achieves a $2.0{\sim}3.0\times$ end-to-end speedup across diverse GPUs (e.g., 27.4\,FPS on RTX 5090 and 33.9\,FPS on H100). Code is released via this \href{https://github.com/chengtao-lv/LightForcing}{link}.

22.
arXiv (CS.CV) 2026-06-16

Geometric Action Model for Robot Policy Learning

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

23.
arXiv (CS.LG) 2026-06-18

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

arXiv:2605.07022v3 Announce Type: replace Abstract: Manually curated biomedical repositories – spanning bioactivity, genomics, and chemistry – are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks – blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions – Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard – e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.

24.
arXiv (CS.CV) 2026-06-16

G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

Cross-modal place recognition (CMPR) enables camera-only robots to localize against pre-built LiDAR maps in autonomous navigation scenarios. This image-to-point-cloud setting is challenged by two coupled ambiguities: the modality gap between perspective RGB appearance and sparse metric geometry, and perceptual aliasing among urban places with similar roads, facades, intersections, and object arrangements. Instead of treating CMPR as a single global descriptor matching problem, we argue that reliable retrieval requires both geometry-aware representation alignment and fine-grained candidate verification. In this paper, we propose G2IA, a geometry-guided instance-aware framework for image-to-point-cloud place recognition. In the retrieval stage, visual geometry priors from VGGT and instance features are integrated to construct place descriptors that are more compatible with LiDAR-derived map representations. In the refinement stage, the retrieved candidates are re-ranked by explicitly verifying whether local instance shapes and their relative spatial layouts are consistent across modalities. Experiments on public benchmarks demonstrate that G2IA consistently improves image-to-point-cloud place recognition under different localization thresholds, and exhibits strong cross-dataset generalization.

25.
arXiv (CS.AI) 2026-06-11

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

arXiv:2606.08530v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.