Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-15

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19–0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise–pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
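The multi-trial aggregation recommended above can be illustrated with a small calculation. The following is a minimal sketch, not the paper's method: it assumes each trial independently agrees with the reference verdict with a fixed probability `p_agree`, whereas the paper's reliability curve is estimated empirically. The function names and the flip-rate definition (share of trials that disagree with the most common verdict) are illustrative assumptions.

```python
from collections import Counter
from math import comb


def flip_rate(verdicts):
    """Fraction of trials that disagree with the most common verdict.

    This is one plausible definition of a flip rate (an assumption here).
    """
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def majority_recovery_prob(p_agree, n_trials):
    """Probability that a strict majority of n independent trials
    matches the reference verdict, under a binomial model."""
    if n_trials % 2 == 0:
        raise ValueError("use an odd number of trials to avoid ties")
    need = n_trials // 2 + 1
    return sum(
        comb(n_trials, k) * p_agree**k * (1 - p_agree) ** (n_trials - k)
        for k in range(need, n_trials + 1)
    )


def trials_needed(p_agree, target=0.95, max_trials=101):
    """Smallest odd trial count whose majority vote reaches the target
    recovery probability, or None if max_trials is not enough."""
    for n in range(1, max_trials + 1, 2):
        if majority_recovery_prob(p_agree, n) >= target:
            return n
    return None


if __name__ == "__main__":
    verdicts = ["A"] * 44 + ["B"] * 6  # hypothetical 50-trial record
    print(flip_rate(verdicts))
    print(trials_needed(0.8))
```

Under this simplified model, a question whose judge is at chance (`p_agree = 0.5`) never reaches the target however many trials are aggregated, which is one reason to report uncertainty explicitly rather than rely on voting alone.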

02.
arXiv (CS.LG) 2026-06-12

PhysMetrics.Weather: An Evaluation Framework for Physical Consistency in ML Weather Models

arXiv:2606.10642v2 Announce Type: replace Abstract: Machine learning weather prediction (MLWP) models have achieved impressive forecasting performance at a small fraction of the computational cost required by traditional physics-based methods. However, they are primarily (1) data-driven and (2) evaluated using pixel-wise error metrics (e.g., RMSE), so there are no guarantees that their forecasts are consistent with known physical laws. We introduce PhysMetrics.Weather, an evaluation framework that assesses the physical realism of MLWP models across three types of metrics: conservation, spectral, and dynamical. By quantifying physical realism, this tool guides the development of physics-informed architectures and helps evaluate whether MLWP models are reliable for operational use. Our framework is available on GitHub at https://github.com/Emmakast/PhysMetrics.Weather.

03.
arXiv (quant-ph) 2026-06-15

Efficient Simulation of Szegedy Quantum Walk Formulations and Algorithms

arXiv:2606.14226v1 Announce Type: new Abstract: Quantum walks provide a versatile framework for quantum algorithms across a wide range of applications. We develop efficient classical simulation methods for Szegedy quantum walks that avoid explicit construction of the full unitary evolution operator. Unlike previous approaches restricted to a particular walk formulation, our framework is built from fundamental update and reflection operators, enabling the simulation of a broader class of Szegedy walk formulations. We further extend these methods to phase-estimation-based algorithms coupled to the walk, including implementations suitable for large sparse graphs. The resulting methods achieve optimal $O(N^2)$ complexity for dense graphs with $N$ nodes. For sparse graphs, the computational cost scales linearly with the number of edges, which is $O(N)$ in many cases. We implement the framework in the Python package SQWLib and illustrate its capabilities through simulations of representative algorithms, including quantum simulated annealing and quantum search on graphs. These results provide a practical tool for studying Szegedy-walk-based algorithms numerically beyond purely analytical treatments.

04.
arXiv (CS.CV) 2026-06-16

A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT

Multiphasic contrast-enhanced CT (CECT) is widely used for abdominal lesion characterization, yet it carries inherent risks of contrast-induced nephropathy, escalates acquisition burden, and heavily contributes to radiologist workload. To address these challenges, we introduce a novel multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation, which learns to synthesize contrast-enhanced findings from single-phase non-contrast CT (NCCT). To support this, we curated a large-scale dataset of paired NCCT-CECT studies and their corresponding contrast-enhanced radiology reports from two centers, partitioned into internal sets and an external validation cohort. Under a unified evaluation protocol, we benchmarked five contemporary deep learning architectures encompassing chest-specific, abdomen-specific, and general-purpose multimodal domains. Extensive experiments demonstrate that NCCT retains diagnostic signals, achieving an average multi-organ AUC of 69.1% on the internal cohort and 63.1% on the external cohort, respectively. By releasing this dataset and standardized benchmark publicly, this study aims to catalyze future research into safer, resource-efficient, and globally accessible contrast-free abdominal imaging workflows. Code is available at: https://github.com/xmed-lab/TriALS-Report.

05.
bioRxiv (Bioinfo) 2026-06-16

scIsoAgent enables autonomous isoform-resolved characterization and sequence-informed interpretation of long-read single-cell transcriptomes

Alternative isoform usage can alter gene function independently of total gene expression, creating a need to resolve transcript isoforms at single-cell resolution. Long-read single-cell RNA sequencing meets this need by linking cellular identity to transcript isoforms and sequence-level features. Realizing its full biological value requires reproducible workflows that connect specialized long-read analysis with biological interpretation. Existing large language model (LLM)-based biomedical agents support general omics analysis, but are not designed for isoform-resolved long-read single-cell workflows. Here, we present scIsoAgent, an autonomous LLM-powered scientific agent for long-read single-cell RNA-seq analysis. scIsoAgent turns heterogeneous long-read single-cell inputs into traceable isoform-resolved workflows, using stage-aware planning and persistent computational context to support both execution and interpretation. Across complementary evaluations, this design improved the continuity from analysis planning to executable, interactive workflows compared with general-purpose LLM baselines. In real-data reanalysis, scIsoAgent recovered major findings from published long-read single-cell resources and extended a representative differential transcript usage event into a sequence-informed functional hypothesis. By linking full-length isoform sequences with model-inferred transcript properties, scIsoAgent connects observed isoform usage with potential sequence-level functional consequences. These results demonstrate that autonomous scientific agents can transform fragmented long-read single-cell analysis into coherent, reproducible workflows for isoform-resolved discovery and biological interpretation.

06.
arXiv (quant-ph) 2026-06-16

Efficient Magic State Factory Via Transversal Non-Clifford Gate

arXiv:2606.16199v1 Announce Type: new Abstract: Magic-state preparation is a central component of fault-tolerant quantum computing. Recent theoretical and experimental successes in code-switch-based magic-state preparation have underscored the promise of these methods for quantum error correction. Magic-state cultivation has likewise been demonstrated in both numerical and experimental settings. However, a thorough comparison between magic-state cultivation and code-switch-based magic-state factories is still missing. In this work, we carry out end-to-end simulations of magic-state preparation using code switching and compare its resource requirements and performance against magic-state cultivation. As part of this analysis, we develop a lattice-surgery protocol for transfer between the doubled color code and the rotated surface code. We extend the complete code-switching protocol to the $d=5$ doubled color code and perform the corresponding end-to-end simulations. Finally, we propose two fault-tolerant magic-state preparation protocols that combine phase-kickback checks with a transversal non-Clifford gate.

08.
arXiv (CS.AI) 2026-06-15

FPGA-Based Neural Network Accelerators for Space Applications: A Survey

arXiv:2504.16173v3 Announce Type: replace-cross Abstract: Space missions are becoming increasingly ambitious, necessitating high-performance onboard spacecraft computing systems. In response, field-programmable gate arrays (FPGAs) have garnered significant interest due to their flexibility, cost-effectiveness, and radiation tolerance potential. Concurrently, neural networks (NNs) are being recognized for their capability to execute space mission tasks such as autonomous operations, sensor data analysis, and data compression. This survey serves as a valuable resource for researchers aiming to implement FPGA-based NN accelerators in space applications. By analyzing existing literature, identifying trends and gaps, and proposing future research directions, this work highlights the potential of these accelerators to enhance onboard computing systems.

09.
bioRxiv (Bioinfo) 2026-06-17

MetaHarmonizer: robust biomedical metadata harmonization and a contamination control for inflated LLM performance on public benchmarks

Public biomedical repositories hold substantial reuse potential, but inconsistent metadata routinely blocks integration across studies. Recent LLM-based harmonization approaches address scale but suffer from non-determinism, hallucinated ontology terms, and, in their highest-accuracy configurations, dependence on proprietary APIs or labeled fine-tuning data. A more fundamental concern is that LLM accuracies on widely used public benchmarks may substantially overstate transferable capability: under a contamination-controlled evaluation protocol we developed, the apparent LLM-only advantage on the GDC schema-mapping benchmark is inverted, and three out of five LLMs recover 80–100% of GDC identifiers from zero-schema context, suggesting direct memorization. Building on this insight, we present MetaHarmonizer, an automated metadata harmonization system designed to be robust by construction: SchemaMapper aligns attribute names across schemas, and OntologyMapper standardizes values to controlled vocabularies. Both modules implement a multi-stage cascade that escalates to more resource-intensive methods only when earlier stages fall short, with all candidates grounded in pre-defined controlled vocabularies to preclude hallucinated outputs and LLMs used only as bounded preprocessing components rather than inference-time dependencies. On the GDC schema-matching benchmark, SchemaMapper with the deployment-optimized LLM-generated alias dictionary achieved 71.6% Top-1 accuracy and higher Recall@GT than the Magneto bipartite variants, recovering significantly more ground-truth mappings. With the best-performing alias dictionary, it reached the highest Top-1/Top-5/Recall@GT and matched the best Magneto reranker (a fine-tuned LLM reranker) on MRR; it also outperformed LLM-only approaches under contamination-controlled conditions. On four EFO benchmarks, OntologyMapper achieved 77.9–95.5% Top-1 accuracy, outperforming text2term by up to 16.4 pp and direct LLM inference (against the smaller corpus) by 19.2 pp because memorization is not a viable shortcut for this task. Across both modules, calibrated confidence scores separate correct from incorrect predictions (AUC 0.73–0.94), enabling principled human-in-the-loop triage. Inference is fully local, deterministic, and computationally efficient: seconds for schema mapping and under a minute for ontology mapping of up to ~7,000 terms against the pre-indexed 33,230-term corpus. Released as a Python package with a domain-agnostic architecture, MetaHarmonizer provides a scalable foundation for improving the FAIRness of biomedical data and enabling cross-study integration, alongside an evaluation methodology applicable to any LLM-augmented bioinformatics evaluation built on public benchmarks.

10.
arXiv (CS.CV) 2026-06-17

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

Accurate mechanical properties (materials), namely Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially varying ($E$, $\nu$, $\rho$) for input 3D objects across representations, improving resolution, accuracy, and memory efficiency over the state of the art. The foundation of our technique is a sparse and adaptive voxel structure (SAV) that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties while using less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

11.
arXiv (CS.AI) 2026-06-12

Towards Personalized Federated Learning for Dysarthric Speech Recognition

arXiv:2606.13253v1 Announce Type: cross Abstract: Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization: a parameter-based averaging strategy and an embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO.

12.
arXiv (math.PR) 2026-06-12

Counterintuitive problems in discrete probability

arXiv:2606.07516v2 Announce Type: replace Abstract: This manuscript contains a collection of counterintuitive problems in discrete probability, together with detailed solutions. The dataset was constructed as part of a broader research project investigating the capabilities of the latest-generation Large Language Models (LLMs) in solving discrete probability problems, in order to assess whether LLMs tend to make systematic reasoning errors associated with known cognitive biases. The problems collected here are specifically designed to challenge heuristic reasoning strategies that often lead to intuitively appealing but mathematically incorrect conclusions. The dataset combines several types of problems. Some are adapted from classical probabilistic paradoxes and the cognitive-bias literature, while others originate from recreational mathematics sources or were developed by ourselves following similar principles. The primary purpose of this document is to provide a transparent and publicly accessible reference for the problems used in our experimental evaluation of language models, together with detailed human-made solutions. At the same time, we believe that this collection may also prove useful for future research on probabilistic reasoning, cognitive biases, and the evaluation of reasoning capabilities in artificial intelligence systems.

13.
arXiv (CS.CV) 2026-06-17

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

14.
arXiv (CS.AI) 2026-06-19

Advancing DialNav through Automatic Embodied Dialog Augmentation

arXiv:2606.19948v1 Announce Type: new Abstract: For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav provides a framework for holistic evaluation of the dialog–execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset. We then introduce two complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme that aligns navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both the Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.

15.
arXiv (CS.CV) 2026-06-12

Distributional Loss for Robust Classification

This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

16.
arXiv (CS.LG) 2026-06-19

Folded Transport MCMC: Eliminating Label Switching by Sampling on a Fundamental Domain

arXiv:2606.04307v2 Announce Type: replace Abstract: In Bayesian mixture models and other exchangeable-component models, the posterior is invariant under permutation of component labels, creating $m!$ equivalent modes (the label-switching problem). Standard MCMC methods either mix poorly across these modes or rely on post-hoc relabelling that cannot guarantee the sampler has converged. We propose Folded Transport MCMC (FolT-MCMC), which eliminates label switching before sampling by restricting the Markov chain to a fundamental domain, a sorted or reflected subspace containing exactly one representative from each symmetric mode. The proposal is a learned normalising flow whose density is symmetrised over the group orbits, ensuring correct targeting on the reduced space. We show that this construction preserves a computable convergence diagnostic based on the oscillation of the log-density ratio, and that the diagnostic becomes sharper on the fundamental domain whenever the original-space flow under-covers one or more symmetric modes. Experiments on Gaussian mixtures ($d = 2$–$20$), label-switching targets (up to 24 equivalent modes), a standard Bayesian three-component mixture posterior, and real accelerometer data from a supertall building show improvement ratios of 2x to 145x, with the folded diagnostic stable across dimensions while the unfolded diagnostic collapses.
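The idea of a sorted fundamental domain can be shown on a toy one-dimensional mixture. This is a minimal sketch under stated assumptions, not the FolT-MCMC sampler itself: it only folds a parameter vector onto the sorted representative of its permutation orbit, and the function name `fold_to_sorted_domain` and the choice to sort by component mean are illustrative.

```python
from itertools import permutations

import numpy as np


def fold_to_sorted_domain(means, weights):
    """Map mixture parameters to the representative of their permutation
    orbit in which the component means are in ascending order."""
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(means, kind="stable")
    return means[order], weights[order]


if __name__ == "__main__":
    means = np.array([2.0, -1.0, 0.5])      # hypothetical component means
    weights = np.array([0.2, 0.5, 0.3])     # matching mixture weights

    # All 3! = 6 relabellings fold onto the same point.
    folded = set()
    for perm in permutations(range(3)):
        idx = list(perm)
        m, w = fold_to_sorted_domain(means[idx], weights[idx])
        folded.add((tuple(m), tuple(w)))
    print(len(folded))
```

Because every relabelling of the same mixture lands on one point, a chain restricted to the sorted region sees a single mode where the unrestricted posterior has $m!$ copies.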

17.
arXiv (CS.CL) 2026-06-16

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

18.
arXiv (CS.LG) 2026-06-11

Interpretable Neural Marked Statistics for Cosmological Inference

arXiv:2606.11295v1 Announce Type: cross Abstract: Recovering cosmological information beyond the power spectrum is a central goal for upcoming cosmological surveys, since the late-time non-Gaussian signal in the matter density cannot be accessed through two-point statistics alone. Marked statistics fold part of this information back into the two-point level by reweighting the field with non-linear functions. We propose a neural marking scheme that generalizes this process through a set of interpretable, physically motivated transformations, allowing the gain in cosmological information to be interpreted directly at the morphological level. We employ a contrastive learning objective to align learnable marked summaries with the underlying cosmological parameters. At $k_{\max}=0.2\,h\,\mathrm{Mpc}^{-1}$, our neural mark tightens the marginalized constraint on $\sigma_8$ by $2.9\times$ and on $\Omega_m$ by $1.8\times$ compared to classical marks, breaking the $\Omega_m$–$\sigma_8$ degeneracy at the Fisher information level. It further reduces the parameter MSE across our cosmological parameter prior by $1.45\times$ over the best classical mark. The learned latent geometry aligns with the $\Omega_m$ and $\sigma_8$ directions in parameter space, indicating that the contrastive objective recovers the dominant axes of cosmological information. Our approach opens the door to more powerful, interpretable summary statistics for cosmological inference.

19.
medRxiv (Medicine) 2026-06-18

The relationship between serotonin transporter occupancy and extracellular serotonin concentration is hyperbolic, not linear: implications for safely tapering antidepressants

Background: Hyperbolic tapering is an increasingly recognized approach for discontinuing serotonin reuptake inhibitor (SRI) antidepressants that involves non-linear dose reductions with equal stepwise reductions in serotonin transporter (SERT) occupancy to mitigate withdrawal symptoms. Its theoretical basis is the hyperbolic relationship between SRI dose and SERT occupancy reported in radioligand imaging studies. Hyperbolic tapering implicitly assumes that changes in SERT occupancy approximate changes in biologic effect and withdrawal risk. Because SERT occupancy plateaus across the therapeutic dose range of SRIs, this framework predicts relatively small biologic effects and withdrawal risk within this range. However, SERT occupancy influences serotonergic activity only indirectly via its effects on extracellular serotonin concentrations, and the relationship between these two variables is poorly characterized. Methods: We developed a two-pathway clearance model derived from mass-action kinetics to evaluate the steady-state relationship between SERT occupancy and extracellular serotonin concentrations under chronic SRI treatment. Results: Our analysis indicates that serotonin concentrations increase hyperbolically as transporter occupancy increases, suggesting that biologically meaningful differences in serotonergic signaling persist across the therapeutic dose range of SRIs despite plateauing occupancy. Conclusions: Our model predicts a hyperbolic relationship between SERT occupancy and extracellular serotonin concentrations, suggesting that changes in occupancy may not map proportionally onto serotonergic effect. These findings provide a potential mechanistic explanation for dose-dependent clinical effects of SRIs despite plateauing transporter occupancy and generate testable hypotheses regarding antidepressant tapering strategies. Empirical validation is warranted.
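The abstract does not give the model equations, so the following is only one simple form that a two-pathway steady-state clearance model could take. All symbols are illustrative assumptions, not taken from the paper: $S$ is extracellular serotonin concentration, $R$ a constant release rate, $k_T$ the first-order SERT-mediated clearance constant at zero occupancy, $k_0$ a SERT-independent clearance constant, and $\theta \in [0, 1]$ the fraction of transporters occupied.

```latex
\begin{align}
\frac{dS}{dt} &= R - \bigl[k_T(1-\theta) + k_0\bigr]\,S = 0
  && \text{(steady state)} \\
S(\theta) &= \frac{R}{k_T(1-\theta) + k_0} \\
\frac{S(\theta)}{S(0)} &= \frac{k_T + k_0}{k_T(1-\theta) + k_0} \\
\frac{dS}{d\theta} &= \frac{R\,k_T}{\bigl[k_T(1-\theta) + k_0\bigr]^{2}}
\end{align}
```

In this assumed form, $S$ is a rectangular hyperbola in $\theta$, and its slope grows as occupancy rises, so equal steps in occupancy near the plateau correspond to larger changes in concentration than the same steps at low occupancy.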

20.
arXiv (CS.AI) 2026-06-17

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

arXiv:2606.17897v1 Announce Type: new Abstract: Long-term human path forecasting in crowds is critical for autonomous mobile platforms (such as self-driving cars and social robots) to avoid collisions and plan effectively. Although current methods take social interactions into account for prediction, they do not reveal which kinds of social interaction occur among people or how those interactions affect pedestrians' decision-making, which limits their robustness. Social interactions in pedestrian walking are numerous and hard to label and quantify. In this paper, we quantify and interpret how pedestrians interact with others by proposing Learn to Cluster. Our clustering of social interactions is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments on several trajectory prediction benchmarks demonstrate that our method learns the patterns of social interactions and effectively integrates them into pedestrian trajectory prediction.

21.
arXiv (CS.CV) 2026-06-19

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.

22.
arXiv (CS.CL) 2026-06-17

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
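The segment-then-pool pipeline described above can be sketched on synthetic features. This is a rough illustration, not ZeroSyl's actual procedure: the abstract says only that L2 norms of intermediate-layer features are used, so placing boundaries at local minima of the frame-wise norm is an assumption, as are the function names. Random or hand-made arrays stand in for WavLM features, and the K-means discretization step is omitted.

```python
import numpy as np


def boundaries_from_norm_minima(features):
    """Return segment boundaries at interior local minima of the
    frame-wise L2 norm (an assumed boundary rule)."""
    norms = np.linalg.norm(features, axis=1)
    interior = np.arange(1, len(norms) - 1)
    is_min = (norms[1:-1] < norms[:-2]) & (norms[1:-1] <= norms[2:])
    return [0] + interior[is_min].tolist() + [len(norms)]


def mean_pool_segments(features, boundaries):
    """Average the frames inside each [start, end) segment."""
    return np.stack(
        [features[s:e].mean(axis=0) for s, e in zip(boundaries[:-1], boundaries[1:])]
    )


if __name__ == "__main__":
    # Five frames of 2-D stand-in features with a norm dip at frame 2.
    feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.5, 0.0], [3.0, 0.0], [1.0, 0.0]])
    bounds = boundaries_from_norm_minima(feats)
    print(bounds)
    print(mean_pool_segments(feats, bounds))
```

The pooled vectors are what a clustering step such as K-means would then discretize into unit IDs for language-model training.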

23.
arXiv (CS.CL) 2026-06-17

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that contaminate academic literature and erode trust in science. We present PseudoBench, an adversarial benchmark for evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives. PseudoBench contains 200 curated pseudoscientific claim-evidence pairs across five domains and evaluates agents through an end-to-end research pipeline from experiments to writing. Testing seven state-of-the-art agents, we find that current systems readily produce persuasive reports that align with pseudoscientific premises with near-zero refusal rates and the highest resistance of only 27.4%. Stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. These findings reveal an alarming capacity to fuel pseudoscience, calling for scientific alignment before widespread deployment.

24.
arXiv (CS.AI) 2026-06-19

Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions

arXiv:2606.20459v1 Announce Type: new Abstract: IVF pregnancy rates are routinely modeled using patient-level variables, while high-resolution laboratory environmental data remain underutilized. We show that this is a missed opportunity. Rather than relying on raw sensor averages, we engineer 55 context-aware temporal features, including rolling thermal stability, simultaneous temperature-humidity adherence, peak stress duration, and post-stress recovery speed, that capture the dynamics of incubator microenvironments. On 61 weeks of data from an Asian IVF clinic, these features reduce cross-validated prediction error to 1.27%, compared to 3–5% for raw averages. We then train a hierarchical Bayesian Beta regression model that shares environmental effects across an Asian and a Northern European clinic via partial pooling, while preserving site-specific baselines. On held-out data from the Northern European clinic, the model achieves $R^2 = 0.86$ and a 64% error reduction for the 35–39 age group over a naive baseline, demonstrating that structured environmental monitoring contains clinically meaningful, transferable signal.
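Two of the feature types named above can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's feature definitions: "rolling thermal stability" is taken to mean a rolling standard deviation of temperature, "simultaneous temperature-humidity adherence" the share of readings where both values sit inside target ranges, and the ranges and sensor values shown are hypothetical.

```python
from statistics import pstdev


def rolling_thermal_stability(temps, window):
    """Population standard deviation of temperature over each full
    trailing window; lower values mean a more stable incubator."""
    return [
        pstdev(temps[i - window + 1 : i + 1])
        for i in range(window - 1, len(temps))
    ]


def joint_adherence(temps, humidities, temp_range, humidity_range):
    """Fraction of readings where temperature AND humidity are both
    inside their target ranges at the same time."""
    in_range = [
        temp_range[0] <= t <= temp_range[1]
        and humidity_range[0] <= h <= humidity_range[1]
        for t, h in zip(temps, humidities)
    ]
    return sum(in_range) / len(in_range)


if __name__ == "__main__":
    temps = [37.0, 37.6, 37.1]          # hypothetical readings, deg C
    humidities = [95.0, 95.0, 80.0]     # hypothetical readings, % RH
    print(rolling_thermal_stability(temps, window=2))
    print(joint_adherence(temps, humidities, (36.8, 37.2), (90.0, 100.0)))
```

Unlike a raw average, both features are sensitive to when and how readings deviate, which is the kind of temporal context the abstract contrasts with sensor means.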

25.
arXiv (CS.LG) 2026-06-16

A Fully First-Order Layer for Differentiable Optimization

arXiv:2512.02494v2 Announce Type: replace Abstract: Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{O}(1)$ time, leading to an overall complexity of $\tilde{O}(\delta^{-1}\epsilon^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. The source code is available at https://github.com/guaguakai/FFOLayer.