Academic Intelligence · Curated Daily

Explore the Frontiers of Global Scholarship

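AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.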

01.
arXiv (CS.CL) 2026-06-18

TW-LegalBench: Measuring Taiwanese Legal Understanding

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill a gap in evaluating LLMs on Taiwanese law: existing common-law benchmarks focus on English sources, and civil-law benchmarks focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

02.
arXiv (CS.AI) 2026-06-15

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

arXiv:2606.14239v1. Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards – signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

03.
arXiv (CS.LG) 2026-06-18

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

arXiv:2606.19186v1. Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; (2) asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) noise suppression using stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate 80% improvement in recall of delayed/false triggers and 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

04.
arXiv (math.PR) 2026-06-16

Malliavin Calculus for the stochastic Cahn-Hilliard equation driven by fractional noise

arXiv:2601.10490v2. The stochastic partial differential equation analyzed in this work is the Cahn-Hilliard equation perturbed by an additive fractional white noise (fractional in time and white in space). We work in the case of one spatial dimension and apply Malliavin calculus to investigate the existence of a density for the stochastic solution $u$. In particular, we show that $u$ admits continuous paths almost surely and construct a localizing sequence through which we prove that its Malliavin derivative exists locally, and that its law is absolutely continuous with respect to the Lebesgue measure on $\bf R$, thus establishing that a density exists. A key contribution of this work is the analysis of the stochastic integral appearing in the mild formulation: we derive sharp estimates for the expectation of the $p$-th power ($p \geq 2$) of the $L^{\infty}(D)$-norm of this stochastic integral as well as for the integral involving the $L^{\infty}(D)$-norm of the operator associated with the kernel appearing in the integral representation of the fractional noise, all of which are essential for this study.

05.
arXiv (CS.LG) 2026-06-17

Finsler Geometry, Graph Neural Networks, and You

arXiv:2606.17185v1. Graph neural network architectures based on the graph Laplacian approximate the Laplace-Beltrami operator, thus limiting their application to isotropic operators. As a nonlinear alternative to the Laplace-Beltrami operator, we consider estimates of the Finsler Laplacian on point clouds sampled from a manifold. We prove that these discrete estimates converge to the true operator on the manifold as the number of point samples grows. Moreover, we show that this operator can be expressed as a graph neural network layer, which we use to define a family of Finslerian graph neural networks constrained to express Finsler geometry. We show that Finslerian graph neural networks recover the geometry underlying nonlinear diffusion equations in practice.

06.
arXiv (CS.LG) 2026-06-11

DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics

arXiv:2606.11651v1. Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

07.
arXiv (CS.CL) 2026-06-17

Perceptual compensation for tonal context in self-supervised speech models

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

08.
arXiv (CS.CV) 2026-06-11

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

In cloth-changing person re-identification (CC-ReID), it is critical to learn clothes-invariant features, which provide discriminative ID features that remain robust against clothing changes. However, a spurious correlation currently prevents existing ReID methods from effectively extracting these clothes-invariant features. This spurious correlation arises from clothing ownership: clothing is rarely shared across different identities, so models tend to memorize clothing cues for identity recognition, and this strategy generalizes poorly to unseen clothing. In this paper, we propose Causal Clothes-Invariant Learning (CCIL), which explicitly shifts CC-ReID from likelihood learning P(Y|X) to causal intervention learning P(Y|do(X)) to block the clothing shortcut. CCIL realizes this intervention through three modules: a Confounder Dictionary, an Intervention Module, and Disentangle Regularization. The causality-based modeling makes the entire model naturally clothes-invariant, effectively preventing the capture of spurious correlations in feature learning. Extensive experiments validate the effectiveness of CCIL. On the PRCC and DeepChange datasets, CCIL achieves Rank-1 accuracies of 66.4% and 59.2%, outperforming state-of-the-art methods by 1.4 and 4.1 percentage points, respectively.
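
For readers unfamiliar with do-notation, the contrast between the two objectives can be sketched with the standard backdoor adjustment. This is the textbook formula and not necessarily the exact form used by CCIL; the confounder variable $c$ (for example, an entry in a clothing dictionary) is an assumption made for illustration.

```latex
% Likelihood learning: the confounder c is weighted by P(c | X),
% so clothing cues correlated with the input leak into the prediction.
P(Y \mid X) = \sum_{c} P(Y \mid X, c)\, P(c \mid X)

% Causal intervention (backdoor adjustment): c is averaged over its prior P(c),
% which cuts the dependence of the confounder on the input.
P(Y \mid do(X)) = \sum_{c} P(Y \mid X, c)\, P(c)
```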

09.
medRxiv (Medicine) 2026-06-16

Risk beliefs, intensive digital information and demand for a new preventative health product in public clinics: Evidence from an experiment in Zimbabwe.

Demand for preventative health care is weak in low-income settings. In a field experiment in a low-income, high-risk setting, we evaluated whether demand for a new bio-medical preventative health product, offered free at public health clinics, responds to digital feedback-based intensive information on health risks and benefits of prevention along with a clinic referral enabling access to the product. In our sample of women aged 18-24 years, we find a large correction in risk beliefs sustained six months after the intervention. Against a background of very low baseline usage, within six months we find a 5.8 percentage point increase in take-up of the prevention method, a level of uptake which is very large relative to the control group. Reassuringly, there is no meaningful difference in uptake between baseline high-risk and low-risk individuals.

10.
arXiv (CS.CL) 2026-06-18

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

11.
arXiv (CS.CL) 2026-06-11

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < .0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

12.
medRxiv (Medicine) 2026-06-18

Hospital staff views on the visibility, role and impact of Acute Learning Disability Liaison Services in Wales: a service evaluation

People with a learning disability experience marked health inequalities. In Wales, Acute Learning Disability Liaison Services (ALDLS) are delivered by specialised learning disability services, and all roles within them are undertaken by Learning Disability Liaison Nurses (LDLN). These services aim to enable access to, and delivery of, secondary care by supporting reasonable adjustments, facilitating communication, and coordinating care for people with learning disability during hospital encounters. However, independent evidence of the impact of ALDLS on patient care remains limited. This evaluation aims to address this evidence gap by examining hospital staff perceptions of the visibility, role, and impact of ALDLS across Welsh Health Boards, with the aim of informing service design and development and improving secondary care access and care for people with learning disability. The service evaluation used a qualitative approach involving interviews and a focus group with hospital staff across the seven Welsh Health Boards who had experience working with or interacting with ALDLS staff to care for patients with learning disability. Findings cover six key areas: i) visibility and delivery of ALDLS, ii) barriers and challenges to effective ALDLS delivery, iii) enablers of effective ALDLS delivery, iv) positive impacts for patients with learning disability, v) negative impacts and unintended consequences when the service is absent or limited, and vi) participants' recommendations for future improvements of ALDLS. To synthesise the findings, we developed an overview diagram, which illustrates how ALDLS may influence care quality in acute hospitals. The overview places the liaison service at the centre, showing how organisational enablers and barriers shape its delivery, and how its core functions support improvements in safety, timeliness, effectiveness, efficiency, equity, and patient-centred care.
From the findings we have identified recommendations for practice and policy. These include that ALDLS should be recognised as a core, safety-critical component of acute hospital care for people with a learning disability, rather than an optional add-on. In practice, services should be more visibly embedded within routine pathways, with consistent site-based presence, clear referral criteria, early identification through electronic flagging and notification systems, and routine involvement in multidisciplinary planning for complex admissions and procedures. At policy level, ALDLS provision should be recognised within equality and patient safety frameworks as an essential service requiring sustained investment, national minimum configuration standards, adequate staffing, and better-integrated digital systems to support continuity, equitable access, and person-centred care.

13.
arXiv (CS.CV) 2026-06-12

Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

WiFi Channel State Information (CSI) has shown promise for single-person gait identification, raising interest in its use for contactless biometrics, continuous authentication, and passive identification. However, the feasibility of multi-person identification on low-cost commodity devices remains unclear. A critical question is whether weak multi-person performance is primarily an algorithmic limitation, or whether it reflects a more fundamental sensing ceiling on commodity WiFi hardware. We address this question through a systematic empirical study using commodity ESP32 WiFi sensors. We evaluated six signal separation methods (FastICA, SOBI, PCA-ICA, NMF, Wavelet, and Tensor decomposition) across seven scenarios spanning 1-10 people in both controlled and realistic indoor environments. To investigate beyond classification accuracy, we introduce three diagnostic metrics: intra-subject variability (ISV), inter-subject distinguishability (ISD), and performance degradation rate (PDR). Across all methods, performance remains moderate (39%-56% accuracy), with limited evidence that algorithmic choice alone solves the problem. The best-performing method, NMF, reaches 56% accuracy, while all methods exhibit extremely high feature-space overlap (97%-99%), unstable within-subject representations, and marked environmental sensitivity. These findings suggest that, under commodity ESP32 CSI constraints, dense multi-person gait identification is limited more by sensing quality and spatial diversity than by the chosen separation algorithm. Our results have direct implications for security and privacy: they call into question the practicality of commodity WiFi CSI as a robust multi-user biometric primitive for authentication, while also placing important bounds on the passive identification capabilities achievable with low-cost off-the-shelf WiFi hardware.

14.
arXiv (CS.LG) 2026-06-16

Coercivity and Local Convergence of Physical Learning in Linear Circuits

arXiv:2606.15443v1. Physical learning methods train physical networks to perform computational tasks using only local update rules, exploiting the physics of the system to handle the global transfer of information. We provide the first local convergence analysis of three such methods – Equilibrium Propagation (EP), Coupled Learning (CL), and a new method we call Adjoint Coupled Learning (AL) – for linear circuits, in the limit of small nudging for both discrete and continuous time. EP and AL perform gradient descent on a natural loss function, while CL follows modified dynamics with an additional cubic correction. Assuming the existence of a solution, we identify a coercivity condition, expressed as a rank condition on a matrix built from the network's incidence structure, under which the training loss decays exponentially and the parameters converge to the solution manifold. We show that coercivity can fail by exhibiting a kite circuit in which a symmetry causes the coercivity constant to degenerate on the solution manifold, but prove using Sard's theorem that such degeneracies are non-generic: coercivity holds at every point of the solution manifold for almost every choice of desired output.

15.
arXiv (quant-ph) 2026-06-16

Reconstruction of detector error model for quantum error correction

arXiv:2606.16288v1. Fault-tolerant quantum computing fundamentally relies on the accurate characterization of circuit-level noise to optimize decoding algorithms. However, extracting complex multi-body error correlations remains challenging. Contemporary greedy inference algorithms can suffer from statistical distortion, discarding true physical mechanisms while introducing many unphysical false positives. Here, we introduce the Correlation-Analysis-based Hypergraph Reconstruction (CAHR) algorithm, a globally consistent framework to invert experimental syndrome statistics directly into discrete physical hypergraphs. By coupling exact algebraic correlation equations with a top-down concurrent-pruning strategy, CAHR recovers the fault topology without false positives for both $d=5$ rotated surface codes and dense 8-body 2D color codes in our benchmark settings. Furthermore, we show that exact continuous parameter extraction in dense codes is limited by a variance cascade, where absolute statistical variance accumulates linearly from high- to low-degree mechanisms. This motivates a two-stage inference paradigm: utilizing CAHR to extract the fault topology, followed by continuous probability optimization. This provides a practical approach for characterizing and decoding highly correlated noise in realistic quantum hardware.

17.
arXiv (CS.CV) 2026-06-19

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
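
The accept-or-escalate step described above can be sketched as a simple threshold rule. This is a minimal illustration, not the SAFE-Cascade implementation: the names `RoutedAnswer`, `route`, `ask_vlm`, and `vlm_call_reduction` are hypothetical, and it assumes the router outputs a probability that the text-only answer is acceptable (the abstract does not say whether the routing probability refers to acceptance or escalation).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedAnswer:
    answer: str
    used_vlm: bool
    accept_probability: float


def route(
    text_answer: str,
    accept_probability: float,
    ask_vlm: Callable[[], str],
    threshold: float = 0.5,
) -> RoutedAnswer:
    """Keep the text-only answer if the router is confident enough, else escalate.

    `accept_probability` is assumed to come from a trained router (hypothetical here);
    `ask_vlm` is only called when escalation is needed, so its cost is avoided otherwise.
    """
    if accept_probability >= threshold:
        return RoutedAnswer(text_answer, False, accept_probability)
    return RoutedAnswer(ask_vlm(), True, accept_probability)


def vlm_call_reduction(vlm_invocation_rate: float) -> float:
    """Percentage of queries that avoided the VLM, relative to always calling it."""
    return round(100.0 - vlm_invocation_rate, 1)
```

Raising `threshold` sends more queries to the VLM, and lowering it accepts more text-only answers, which is the accuracy-cost trade-off the demo lets users explore. With the reported 73.1% VLM invocation rate, `vlm_call_reduction(73.1)` reproduces the 26.9% reduction in VLM calls quoted in the abstract.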

18.
arXiv (CS.CV) 2026-06-11

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/

19.
arXiv (CS.AI) 2026-06-19

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

arXiv:2606.19568v1. Acoustic gunshot detection is a problem with applications across civilian public safety, military operations, and wildlife conservation, yet the field lacks a rigorous exploration of feature extraction techniques with a focus on generalization to realistic data. The mixed effectiveness of commercial gunshot detection and classification systems indicates an open problem that is not adequately addressed by the current literature. In this paper, we present a systematic investigation of common feature extraction techniques using a dataset of 23,000 gunshot recordings across 85 firearms and 21 calibers. We benchmark three feature extraction techniques with 12 total unique parameter sets using ResNet-18. Our results demonstrate that using the correct feature extraction technique can improve top-1 accuracy by up to 20%, and utilizing the correct parameters for a given feature extraction technique can improve that value by up to 4.7%.

20.
bioRxiv (Bioinfo) 2026-06-21

SPA-C: an hybrid tool to accurately scaffold genomes using Hi-C and Deep-Learning

Genome assembly is a computational pipeline designed to reconstruct chromosomes from small sequencing reads. Following their assembly, contiguous sequences (contigs) are arranged into chromosome-long sequences during scaffolding. Hi-C, which provides long-range linkage information between regions of the genome and is widely used in recent large sequencing projects, is often required to correctly order contigs. Several tools have been developed to automate this task following either statistical or deep-learning approaches. Statistical approaches summarise 2D Hi-C matrices into contact densities across sequences, thus ignoring informative visual patterns. The sole existing deep-learning tool uses a transformer-based computer vision model to correct the assembly. It has been trained on several species and uses Hi-C matrices directly. Yet it comes as a supplementary step in the scaffolding process, introducing extra computation time, and has been trained on a dataset that might contain labelling errors, which could lead to sub-optimal results. We propose SPA-C, a hybrid pipeline combining the strengths of both approaches. Linkage prediction is handled with a frugal CNN-based model and a graph-solving algorithm is used to generate the scaffolds. Through our input's design, the model is able to both correct errors within assemblies and link contigs, leveraging small, local Hi-C contact matrices. We handled low-complexity regions that might induce erroneous predictions using an external tool, improving the overall accuracy of generated assemblies. On a benchmark of six diverse genomes and four standard metrics, SPA-C outperformed four out of four state-of-the-art methods while achieving comparable start-to-end computation time. Python and Bash scripts are available on GitHub (https://github.com/SPA-C/SPA-C.git) and Zenodo (https://doi.org/10.5281/zenodo.19000361).

21.
arXiv (CS.LG) 2026-06-11

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

arXiv:2606.11235v1. A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
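
To make the cost of conventional resampling concrete, here is a minimal sketch of a generic permutation test, the kind of baseline the abstract contrasts against. It is not FewRS: the function name `permutation_p_value`, the difference-in-means statistic, and the default of 1,000 resamples are assumptions chosen for illustration.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=1000, seed=0):
    """Empirical p-value for the absolute difference in means of two samples.

    Each iteration builds one resampled dataset by shuffling the pooled values
    and re-splitting them, then recomputes the test statistic on it.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        resampled = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if resampled >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

The running time grows linearly with `n_resamples`, and every iteration repeats the full analysis. That is cheap for a difference in means but expensive when the analysis is a pattern-mining or graph computation, which is the setting where reducing the number of resampled datasets matters.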

22.
arXiv (CS.CV) 2026-06-18

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts, but they face three challenges: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
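
The headline Pearson correlation compares, policy by policy, the success rate estimated inside the video model with the rate measured on the real robot. A minimal sketch of that computation follows. The helper name `pearson` and the success rates are made up for illustration and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-policy success rates (one entry per policy).
real_world = [0.20, 0.35, 0.50, 0.55, 0.70, 0.80, 0.90]
simulated = [0.25, 0.30, 0.45, 0.60, 0.65, 0.85, 0.88]
print(round(pearson(real_world, simulated), 3))
```

A value near 1 means the evaluator preserves the relative ordering and spacing of policies. Pearson correlation alone does not show how costly any ranking mistakes are, which is presumably why the abstract reports MMRV alongside it.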

23.
arXiv (CS.CV) 2026-06-18

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics, as well as in MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
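
Character error rate (CER) is commonly defined as the Levenshtein edit distance between the reference text and the OCR output, divided by the reference length. The sketch below assumes that standard definition. The paper's exact normalisation and OCR pipeline are not stated in the abstract, and the label strings are invented.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # delete ref_char
                current[j - 1] + 1,                        # insert hyp_char
                previous[j - 1] + (ref_char != hyp_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: text on the original label vs. OCR of the edited image.
print(character_error_rate("ACME COLA", "ACME C0LA"))
```

Because the denominator is the reference length, CER can exceed 1 when the edited image contains much more text than the original label.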

24.
arXiv (quant-ph) 2026-06-17

Entanglement transition in unitary system-bath dynamics

arXiv:2512.06081v3. The evolution of a system coupled to baths is commonly described by a master equation that, in the long-time limit, yields a steady-state density matrix. However, when the same evolution is unraveled into quantum trajectories, it is possible to observe a transition in the scaling of entanglement within the system as the system-bath coupling increases, a phenomenon that is invisible in the trajectory-averaged reduced density matrix of the system. Here, we go beyond the paradigm of trajectories from master equations and explore whether a qualitatively analogous entanglement-scaling transition emerges in a single unitary evolution of the combined system-bath setup, without monitoring the dynamics of the system. We investigate the scaling of entanglement in a unitary quantum setup composed of a two-dimensional lattice of free fermions, where each site is coupled to a fermionic bath. As the system-bath coupling increases, the logarithmic fermionic negativity reveals an entanglement transition from logarithmic-law to area-law scaling. This occurs while the system's steady-state properties are trivial, highlighting that the signatures of these different scalings are within the bath-bath correlations. Evidence of the transition is also found in the mutual information and the correlations of the full system-bath setup, suggesting that the entanglement transition is underpinned by a change in the spatial structure of quantum information.

25.
arXiv (CS.AI) 2026-06-17

A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation

arXiv:2606.18075v1. Retrieval-Augmented Generation (RAG) has emerged as a paradigm for enhancing large language models (LLMs) with external knowledge, yet existing graph-based methods face a fundamental limitation: entity-centric and chunk-centric approaches operate on representations anchored to original text without true knowledge fusion. While entity-centric methods connect logically related content and chunk-centric methods preserve context, both retrieve information separately through similarity search, missing emergent understanding from their synthesis. In this paper, we propose HyGRAG, a hierarchical graph RAG framework that transcends source documents by addressing three core challenges: constructing summaries that genuinely integrate contextual and relational information, leveraging these synthesized representations to access emergent knowledge during retrieval, and efficiently updating hierarchical structures for dynamic corpora. Specifically, we design hierarchical index structures over hybrid graphs with both chunk and entity nodes, then iteratively cluster them and generate LLM-based summaries. Then, we design context and relation-aware retrieval that searches across all abstraction levels while expanding through community membership. Moreover, we enable dynamic knowledge update through attachment-based algorithms with only local re-summarization. Experimental results show that HyGRAG improves the average accuracy of multi-hop reasoning tasks by 9.7%, while maintaining reasonable efficiency.