Academic Intelligence · Curated Daily

Explore the Global Frontier of Academic Research

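AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.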

01.
arXiv (CS.CL) 2026-06-24

CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening workflows, and a text-to-speech (TTS) system can preserve the written string while changing the spoken meaning. We introduce CN-NewsTTS Bench v0.1, an open target-level benchmark for evaluating whether Chinese news TTS products pronounce such targets correctly from raw text, without user-side rules, LLM rewriting, SSML hints, or manual edits. The release contains a 200-record development set, an 800-record public test set, 992 public auto-evaluable targets, fixed transcripts from a three-ASR ensemble, an automatic target scorer, and initial results for seven product TTS systems. We additionally report ASR-route diagnostics, ASR-subset ablations, category-level results, confidence intervals, and provider configuration metadata. The best system reaches 0.879 strict accuracy, while several systems remain below 0.60.

02.
arXiv (CS.AI) 2026-06-16

Orchestrated Reality: From Role-Play to Living, Playable Game Worlds – LLM-Driven World Simulation as a Parameterized-Action POMDP

arXiv:2606.16014v1. Many games rely on storytelling combined with systems that track levelling, NPC behaviour, and consequence simulation; bridging tightly-authored narrative with deeply-simulated worlds – most acute in sandbox and open-world settings – has been prohibitively expensive. LLM-driven worlds open a new path: a single harness can coordinate numerical state, narrative voice, storytelling pacing, and rule logic together. Realising this requires the LLM system to sustain a persistent world (who is where, what has just happened, what is currently true), which today's deployed systems do not: the narrative voice asserts state in free prose without any validated representation, so a fully autonomous game engine remains infeasible. We treat this as an architectural choice, not a limitation of language models, and report work in progress on a framework – orchestrated reality – that makes the world a canonical object owned by a singleton orchestration agent analogous to the tabletop-RPG Game Master (GM). We formalise an LLM-driven game world for a human player as a Parameterized-Action POMDP: state is a tree of canonical JSON entities, actions decompose as $a=(k, x_k)$ (a discrete intent kind plus structured JSON parameters), the agent observes only a narrative projection $o=O(s)$ of state, and the transition kernel $F$ is an LLM-driven Plan-Diff-Validate-Apply (PDVA) pipeline that commits schema-validated, content-hashed JSON deltas. We give the formal model, a JSON-state example, a worked single-turn example, and a catalogue of 15 illustrative incidents drawn from a real deployment showing the framework in action. Empirical validation through a planned human player study – together with multi-NPC concurrent agency and deployment as an RL environment – is situated as future work.
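The idea of committing schema-validated, content-hashed JSON deltas can be pictured with a small sketch. This is an illustration only, not the paper's pipeline: the `SCHEMA` table, the `plan` stub (standing in for the LLM planner), and the entity and field names are all hypothetical, and the hash is simply SHA-256 over a canonical JSON serialization.

```python
import copy
import hashlib
import json

# Hypothetical schema (an assumption): field name -> required Python type.
SCHEMA = {"location": str, "hp": int}


def content_hash(obj):
    """SHA-256 over a canonical (sorted-key, compact) JSON serialization."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan(state, action):
    """Stand-in for the LLM planner: maps an action (k, x_k) to a proposed delta."""
    kind, params = action
    if kind == "move":
        return {params["who"]: {"location": params["to"]}}
    return {}


def validate(state, delta):
    """Accept a delta only if it touches known entities with well-typed fields."""
    for entity, fields in delta.items():
        if entity not in state:
            return False
        for field, value in fields.items():
            if field not in SCHEMA or not isinstance(value, SCHEMA[field]):
                return False
    return True


def apply_delta(state, delta):
    """Return (new_state, delta_hash); the input state is left untouched."""
    if not validate(state, delta):
        raise ValueError("delta rejected by schema validation")
    new_state = copy.deepcopy(state)
    for entity, fields in delta.items():
        new_state[entity].update(fields)
    return new_state, content_hash(delta)


if __name__ == "__main__":
    world = {"player": {"location": "tavern", "hp": 10}}
    proposed = plan(world, ("move", {"who": "player", "to": "market"}))
    world, digest = apply_delta(world, proposed)
    print(world, digest[:12])
```

Keeping validation separate from planning means a malformed proposal is rejected before it can alter the canonical state, which is the property the abstract attributes to the validate step.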

03.
arXiv (CS.CL) 2026-06-18

Continuous Audio Thinking for Large Audio Language Models

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

04.
arXiv (CS.CL) 2026-06-12

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.

05.
arXiv (CS.CV) 2026-06-16

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

06.
arXiv (CS.CV) 2026-06-16

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on the training strategy or network design to align the enhanced result with the labels in datasets, ignoring that these labels are selected from the enhanced results of previous UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of such models is limited, while collecting true labels for underwater images is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression, following underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on the fusion of physics and priors, achieves SOTA performance on the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

07.
arXiv (CS.AI) 2026-06-11

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

arXiv:2606.11909v1. Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

08.
arXiv (CS.CL) 2026-06-12

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57–77% overall. On multi-turn missions, all models score 13–29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4–18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.

09.
arXiv (math.PR) 2026-06-24

On domains of elliptic operators with distributional coefficients

arXiv:2509.24950v2. We show how one can use recently gained insights from the study of singular SPDEs, more particularly the study of singular operators via the theory of Paracontrolled Distributions, to construct domains for (singular) elliptic operators. Formally we consider \[ A (u) = (1 - \Delta) u + \nabla V \cdot \nabla u + \xi u + \operatorname{div} (\rho u), \] where $V \in \mathcal{C}^{\delta}$, $\xi \in \mathcal{C}^{- 2 + \delta}$, $\rho \in \mathcal{C}^{- 1 + \delta}$, $\operatorname{div} \rho = 0$, and which satisfy a structural assumption that is notably satisfied when $\xi$ is a sub-critical noise, see [MvZ22]. We also show that under this assumption, one can construct a continuous change of variables $\Theta$ which satisfies \[ A \Theta - (1 - \Delta) \in \mathcal{L} (H^{2 - \delta''} ; H^{\delta'}) \] which allows us to define $A$ rigorously and parametrise a domain. Moreover, for suitably regularised operators \[ A_{\varepsilon} (u) := (1 - \Delta) u + \nabla V_{\varepsilon} \cdot \nabla u + (\xi_{\varepsilon} + c_{\varepsilon}) \cdot u + \operatorname{div} (\rho_{\varepsilon} \cdot u), \] we show that for a strongly converging regularised change of variables $\Theta_{\varepsilon} \rightarrow \Theta$ we have \[ A_{\varepsilon} \Theta_{\varepsilon} \rightarrow A \Theta \text{ in } \mathcal{L} (H^2 ; L^2) \] which in particular implies norm resolvent convergence to a limiting closed operator. Finally, we give a class of examples and show how to apply these results to prove strong analytical local well-posedness for a singular Schrödinger equation formally given by \[ i \partial_t u + (1 - \Delta) u + \nabla V \cdot \nabla u + \xi \cdot u = - | u |^2 u \] for singular $V, \xi$ and that its solution is the limit of the classical solutions of a regularised equation.

10.
Nature (Science) 2026-06-19

Daily briefing: Human detritus remakes geology

What, exactly, is a rock? Plus, a stem-cell success for a severe autoimmune disease and evidence that ‘AI deskilling’ is real. Researchers have tracked the electrical activity of individual brain cells during conversation in real time. Plus, the history of GPS and a cross-species transplant that could reveal clues about the origin of animals.

11.
arXiv (CS.CL) 2026-06-24

Progressive Alignment Objectives for Aligner-Encoder based ASR

Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice, this alignment often forms abruptly in the upper layers, making training sensitive and brittle on long utterances. We propose InterAligner, which adds an intermediate Aligner objective so alignment can form progressively across depth, together with an intermediate CTC loss (InterCTC) to stabilize optimization. On LibriSpeech with a 17-layer Conformer, a final-only Aligner reaches 5.0/7.8 WER (test-clean/other). InterCTC improves to 3.4/6.0, and InterAligner further reduces WER to 3.1/5.6 with the largest gains on long utterances.

12.
arXiv (CS.AI) 2026-06-16

Automated ultrasound doppler angle estimation using deep learning

arXiv:2508.04243v2. Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract image features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold, thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep learning-based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.

13.
arXiv (quant-ph) 2026-06-17

Cavity-enhanced superconducting response in an underdoped cuprate

arXiv:2606.18084v1. Superconductors carry electrical current without resistance when paired electrons condense into a coherent macroscopic quantum state. In underdoped cuprates, evidence suggests that pairing-related correlations and superconducting fluctuations can survive above the temperature at which global coherence is lost, pointing to phase fluctuations as a key limitation on superconductivity in this regime. Motivated by recent demonstrations of cavity-modified collective states in quantum materials, we investigate whether superconducting coherence can be stabilized by engineering the electromagnetic environment of the superconductor. We study an underdoped YBa$_2$Cu$_3$O$_{7-\delta}$ thin film in a tunable terahertz cavity formed with a semi-transparent gold mirror. From temperature-dependent terahertz transmission measurements, we find that the cavity enhances the superconducting response below the critical temperature, with an increase of the inferred superfluid weight. The effect becomes more pronounced at smaller cavity lengths and is accompanied by an upward shift of the superconducting onset temperature. Calculations based on a cavity-coupled model for phase-fluctuating superconductors capture these trends and support an interpretation in terms of cavity-enhanced phase stiffness. These results showcase the potential of cavity engineering for designing emergent functionalities in correlated systems.

14.
arXiv (CS.LG) 2026-06-24

Accelerated Stochastic Min-Max Optimization Based on Bias-corrected Momentum

arXiv:2406.13041v3. Lower-bound analyses for nonconvex strongly-concave minimax optimization problems have shown that stochastic first-order algorithms require at least $\mathcal{O}(\varepsilon^{-4})$ sample complexity to find an $\varepsilon$-stationary point. Some works indicate that this complexity can be improved to $\mathcal{O}(\varepsilon^{-3})$ when the stochastic loss gradient is Lipschitz continuous. The question of achieving enhanced convergence rates under distinct conditions remains open. In this work, we address this question for optimization problems that are nonconvex in the minimization variable and strongly concave or Polyak-Łojasiewicz (PL) in the maximization variable. We introduce novel bias-corrected momentum algorithms utilizing efficient Hessian-vector products. We establish convergence conditions and demonstrate a lower iteration complexity of $\mathcal{O}(\varepsilon^{-3})$ for the proposed algorithms. The effectiveness of the proposed method is validated through applications to robust logistic regression and robust adaptive cruise control.

15.
arXiv (CS.AI) 2026-06-15

Learning Developmental Scaffoldings to Guide Self-Organisation

arXiv:2605.14998v3. From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself; instead, it is often offloaded to initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.

16.
arXiv (CS.LG) 2026-06-17

Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks

arXiv:2507.20708v3. The rapid deployment of AI systems in high-stakes domains, including those classified as high-risk under the EU AI Act (Regulation (EU) 2024/1689), has intensified the need for reliable compliance auditing. For binary classifiers, regulatory risk assessment often relies on global fairness metrics such as the Disparate Impact ratio, widely used to evaluate potential discrimination. In typical auditing settings, the auditee provides a subset of its dataset to an auditor, while a supervisory authority may verify whether this subset is representative of the full underlying distribution. In this work, we investigate to what extent a malicious auditee can construct a fairness-compliant yet representative-looking sample from a non-compliant original distribution, thereby creating an illusion of fairness. We formalize this problem as a constrained distributional projection task and introduce mathematically grounded manipulation strategies based on entropic and optimal transport projections. These constructions characterize the minimal distributional shift required to satisfy fairness constraints. To counter such attacks, we formalize representativeness through distributional-distance-based statistical tests and systematically evaluate their ability to detect manipulated samples. Our analysis highlights the conditions under which fairness manipulation can remain statistically undetected and provides practical guidelines for strengthening supervisory verification. We validate our theoretical findings through experiments on standard tabular datasets for bias detection. Code is publicly available at https://github.com/ValentinLafargue/Inspection.
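For readers unfamiliar with the Disparate Impact ratio mentioned above, a minimal sketch of its usual definition follows: the positive-prediction rate of the unprivileged group divided by that of the privileged group. The function name, group labels, and toy data are illustrative assumptions and are not taken from the paper's code.

```python
def disparate_impact(predictions, groups, unprivileged, privileged):
    """Ratio of positive-prediction rates: unprivileged group over privileged group.

    predictions: iterable of 0/1 classifier outputs.
    groups: iterable of group labels, aligned with predictions.
    """
    def positive_rate(group):
        selected = [p for p, g in zip(predictions, groups) if g == group]
        if not selected:
            raise ValueError(f"no samples for group {group!r}")
        return sum(selected) / len(selected)

    privileged_rate = positive_rate(privileged)
    if privileged_rate == 0:
        raise ValueError("privileged group has no positive predictions")
    return positive_rate(unprivileged) / privileged_rate


if __name__ == "__main__":
    # Toy data (made up): group "a" is favoured 1 time in 4, group "b" 3 times in 4.
    preds = [1, 0, 0, 0, 1, 1, 1, 0]
    grps = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(disparate_impact(preds, grps, unprivileged="a", privileged="b"))
```

Because the ratio depends only on the two group-wise positive rates, a sample can be reweighted or subsampled to move it toward 1 without changing the underlying classifier, which is the kind of manipulation the abstract studies.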

17.
arXiv (CS.LG) 2026-06-18

Ensuring Trustworthy Online A/B Testing: Addressing Five Key Questions on CUPED

arXiv:2606.18750v1. A/B testing has become the gold standard for data-driven decision-making in large-scale online experimentation, providing critical guidance for feature launch, pricing optimization, and user experience enhancement. To maximize statistical sensitivity, many technology companies routinely employ Controlled-experiment Using Pre-Experiment Data (CUPED), a technique that achieves substantial variance reduction while preserving the unbiasedness of estimating the average treatment effect. Despite its widespread adoption, several critical methodological and practical nuances of CUPED remain underexplored. This paper systematically addresses five frequently encountered yet overlooked questions regarding the application of CUPED. First, we provide a comparative analysis of various post-CUPED estimators to identify the optimal adjustment specification. Second, we evaluate the validity of regression-based adjustments and delineate robust variance estimation methods tailored for such frameworks. Finally, we extend our investigation to complex but common scenarios, including multi-arm experiments and two-stage sampling designs. Our findings reveal that in these settings, naive reliance on standard variance estimators can lead to severely misleading inferences. By offering rigorous theoretical insights and extensive experimental validation, this work deepens the conceptual understanding of CUPED. Notably, the recommended methodologies have been successfully deployed and integrated into ByteDance's experimentation platform.
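As background, the textbook single-covariate form of CUPED can be sketched in a few lines: each outcome is shifted by theta times the centred pre-experiment covariate, with theta = cov(Y, X) / var(X) estimated on the pooled sample. This is a generic illustration on simulated data, not one of the estimators the paper compares or recommends; the simulation parameters are arbitrary assumptions.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y_i - theta * (x_i - mean(x))."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


def simulate(n=2000, effect=0.5, seed=0):
    """Simulated experiment (assumed setup): the outcome tracks the pre-period metric."""
    rng = random.Random(seed)
    pre, post, arm = [], [], []
    for _ in range(n):
        x = rng.gauss(10.0, 2.0)          # pre-experiment covariate
        treated = rng.random() < 0.5      # random assignment
        y = x + rng.gauss(0.0, 1.0) + (effect if treated else 0.0)
        pre.append(x)
        post.append(y)
        arm.append(treated)
    return pre, post, arm


def diff_in_means(values, arm):
    """Average treatment effect estimate: treated mean minus control mean."""
    treated = [v for v, t in zip(values, arm) if t]
    control = [v for v, t in zip(values, arm) if not t]
    return statistics.fmean(treated) - statistics.fmean(control)


if __name__ == "__main__":
    pre, post, arm = simulate()
    adjusted = cuped_adjust(post, pre)
    print("raw variance:     ", statistics.variance(post))
    print("adjusted variance:", statistics.variance(adjusted))
    print("raw estimate:     ", diff_in_means(post, arm))
    print("adjusted estimate:", diff_in_means(adjusted, arm))
```

Because the covariate is measured before assignment, the correction term has the same expectation in both arms, so the difference in means stays unbiased while its variance shrinks.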

18.
Science (Express) 2026-06-04

Long-range extended chains arising from polymerization-driven spontaneous assembly

Authors: unknown

A central challenge for conjugated polymers is to achieve long-range order while remaining solution-processable, which is essential for matching the electrical performance of their crystalline inorganic semiconductor counterparts. Here we show that n-doped poly(benzodifurandione) (n-PBDF) can undergo polymerization-driven spontaneous assembly (PSA), in which chain growth, chemical doping, and structural ordering are intrinsically coupled, yielding long-range chain extension over hundreds of nanometers. We reveal that the spontaneously formed n-PBDF nanoribbons arise from a self-initiated, convergent growth mechanism driven by cooperative monomer–polymer interactions and stabilized by proton-coupled duplex chains and the polymer’s intrinsic polyelectrolyte character. With long-range extended chains in the nanoribbons, the aligned n-PBDF thin films demonstrate metallic-level conductivity ($>10^4$ siemens per centimeter).

19.
arXiv (CS.LG) 2026-06-24

Stochastic Expectation Maximization for Robust State-Space Radio Interferometric Imaging

arXiv:2606.23944v1. State–space models provide a flexible framework for analyzing dynamical systems, yet they often rely on Gaussian assumptions that fail to capture heavy-tailed or outlier-prone measurement noise. We propose a robust estimation scheme for linear state–space models subject to compound-Gaussian noise, as encountered for instance in radio interferometry affected by radio-frequency interference (RFI). The method relies on a Stochastic Approximation Expectation–Maximization (SAEM) algorithm in which the standard E-step is replaced by Monte Carlo sampling of the latent states and noise texture through closed-form Gibbs updates, enabling tractable inference despite the heavy-tailed likelihood. Numerical experiments show that the proposed method significantly improves reconstruction fidelity and robustness to RFI, outperforming a Gaussian EM algorithm and even an oracle RTS smoother. These results highlight the benefits of heavy-tailed state–space modeling and SAEM-based inference in interference-dominated imaging scenarios.

20.
arXiv (CS.AI) 2026-06-24

A Fair Evaluation of Graph Foundation Models for Node Property Prediction

arXiv:2606.24509v1. Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called GFMs, particular interest has been paid to GFMs designed for node property prediction tasks, which is one of the most popular settings in Graph ML with lots of real-world applications from fraud detection in financial and social networks to recommendation systems for e-commerce and user-generated content platforms. While a number of GFMs for this task have been recently proposed, the field has not converged to a unified evaluation setting, and different works evaluate their models in widely different ways, preventing reliable comparison of GFMs with each other and with other types of models. In this work, we conduct a fair and rigorous reevaluation of 9 recent GFMs for node property prediction, comparing them to strong Graph Neural Network (GNN) baselines. We find that, among these GFMs, only the most recent ones based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs in predictive performance, although at a higher inference cost.

21.
arXiv (CS.LG) 2026-06-12

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

arXiv:2606.13194v1. Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
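The notion of a Pareto frontier over accuracy and deployment cost can be made concrete with a short sketch. It keeps every model that no other model beats on both macro-F1 (higher is better) and latency (lower is better). The model names and numbers below are invented placeholders, not results from the benchmark, and the paper's analysis also considers memory and model size.

```python
def pareto_frontier(models):
    """Return names of non-dominated models.

    models: list of (name, macro_f1, latency_ms) tuples.
    A model is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, f1, latency in models:
        dominated = any(
            other_f1 >= f1
            and other_latency <= latency
            and (other_f1 > f1 or other_latency < latency)
            for _, other_f1, other_latency in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Placeholder entries (hypothetical values for illustration only).
    candidates = [
        ("compact", 0.80, 5.0),
        ("large", 0.81, 50.0),
        ("slow_weak", 0.78, 40.0),
        ("tiny", 0.70, 1.0),
    ]
    print(pareto_frontier(candidates))
```

In this toy data, the entry that is both slower and less accurate than another model drops out, while models that trade accuracy against latency all remain on the frontier.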

22.
bioRxiv (Bioinfo) 2026-06-12

Systematic functional annotation of thousands of BAHD acyltransferases in plant genomes using Protein Language Model and phylogenomic tools

The functional annotation of plant genes lags significantly behind their genomic annotation. Closing this gap requires thorough cataloging of reported protein activities alongside predictive methods that scale beyond sequence-similarity inference. Focusing on the BAHD acyltransferase enzyme family as a model, we assembled FuncZymeDB-BAHD, a large database of 2,705 LLM-retrieved and curated enzyme-acceptor-donor activities covering 336 BAHDs from 156 plant species, a 2-to-6-fold expansion over Swiss-Prot and prior compilations. We further developed FuncPred-OG, which maps queries to orthologous groups and previously characterized enzymes in FuncZymeDB-BAHD, returning hits with high evidence provenance. FuncPred-OG enabled functional prediction of over half of BAHDs across 85 plant proteomes, of which five novel predictions were validated via in vitro assays and recent studies. For the remaining BAHDs without FuncPred-OG annotation, we developed FuncPred-AI, where logistic-regression classifiers trained on protein language model embeddings achieved high Area-Under-the-Precision-Recall-curve (AUPR) scores and correct-hit rates up to 93%. FuncPred-AI yielded >1 probable donor/acceptor annotation for 99.9% (8894/8897) of BAHDs in our pan-plant dataset. Finally, the FuncPred workflow and datasets were deployed on a web portal for broader utilization, potentially reducing experimentalist efforts for selecting candidates from days to minutes. Overall, this framework provides a generalizable template for functional annotation of entire enzyme families.

23.
arXiv (CS.CL) 2026-06-11

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics – Faithfulness, Coverage, Informativeness, and Acuity – to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

24.
arXiv (CS.CL) 2026-06-16

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging: reinforcement learning matches the prompt-only inference setting but often suffers from sparse rewards and weak credit assignment, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

25.
arXiv (CS.CV) 2026-06-18

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.