Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-16

Dealing with Annotator Disagreement in Hate Speech Classification

Hate speech detection is a crucial task, especially on social media where harmful content can spread quickly. Collecting social media content (tweets etc.) to train machine learning models is easy, but detecting and categorizing hate speech can be difficult due to the inherently subjective nature. This subjectivity leads to frequent disagreement among annotators, particularly for subtle or borderline content. Traditional approaches either discard non-consensus samples or force a ''gold standard'' through expert adjudication, ignoring valuable information about uncertainty and diverse human perspectives. We examine the largely overlooked problem of annotator disagreement in hate speech classification and evaluate a range of aggregation methods, including majority voting, ordinal strategies (minimum, maximum, and mean), and analyze their impact across binary, 4-class, and 6-class classification tasks. In addition, we leverage annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches. Among others, we show that filtering non-consensus samples results in over-optimistic results and that the perceived strength provides a complementary signal that enhance classification performance. Finally, we establish new state-of-the-art results for hate speech detection in Turkish tweets, and demonstrate that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems.

02.
arXiv (CS.CL) 2026-06-12

Reward Modeling for Multi-Agent Orchestration

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.

03.
arXiv (CS.CL) 2026-06-11

SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional calibration problem: the appropriate uncertainty target for a prompt should be estimated from repeated model outputs rather than from an isolated response. However, group rollouts alone are insufficient, since the resulting target must provide a useful training signal. Existing targets only partially satisfy this requirement. We propose SAGE, Semantic-Answer Guided Entropy, a group-level uncertainty target that constructs an answer-conditioned uncertainty geometry over sampled responses. SAGE preserves categorical, numeric, and symbolic answer distinctions while maintaining a smooth and scale-preserving calibration signal. We further apply this target through Group-Uncertainty Preference Optimization, or GUPO, an uncertainty-channel training framework that supervises verbal uncertainty expressions rather than the full response. Experiments across factual, mathematical, and multiple-choice reasoning tasks show improved uncertainty ranking, lower calibration error, and reduced overconfidence.

04.
arXiv (CS.CL) 2026-06-12

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

05.
arXiv (CS.CV) 2026-06-24

MSPL: Multi-Step Pseudo-Labeling for Open-Vocabulary Object Detection

Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts, such as crowding or occlusion. In this paper, we introduce MSPL, a framework that incorporates multi-step visual reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps-object localization, category recognition, and background grounding-where these intermediate reasoning states serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that MSPL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr by 3.2 and 2.2, respectively, on OV-LVIS. Code and models are available at https://github.com/hchoi256/mspl.

06.
arXiv (math.PR) 2026-06-11

Additive Noise, Shift Recovery, and Signed Signals in the Cumulative Distribution Transform

arXiv:2606.11432v1 Announce Type: cross Abstract: The cumulative distribution transform (CDT) is a quantile-based transport representation that exactly linearizes one-dimensional translations of positive densities. We study how this structure behaves under additive perturbations and how it can be exploited for shift recovery. Under a local nondegeneracy condition, we derive a first-order expansion showing that additive noise in physical space induces a nonlocal perturbation in CDT space through the primitive of the noise, weighted by the reciprocal density. This yields an explicit description of transform-domain sensitivity and shows, in particular, that perturbations are amplified in low-density regions. When the physical-space perturbation is modeled as a centered Gaussian random field, the induced first-order CDT perturbation is again Gaussian, with an explicit covariance kernel. We then use this structure to study recovery in CDT coordinates. In the known-template setting, the transport shift is obtained by projection onto the constant mode, giving an explicit estimator together with exactness in the noiseless case and a stability bound under perturbations. In the unknown-template setting, multiple observations permit joint recovery of the shifts and a common template up to the natural constant-mode gauge, leading to a simple de-shift–and–average procedure. We also consider a signed-signal analogue based on the signed cumulative distribution transform (SCDT), where shifts are estimated numerically by feature matching and unknown templates are recovered by alternating alignment and averaging. Numerical experiments validate the perturbation analysis and illustrate effective recovery for both density-valued and signed signals.

07.
arXiv (CS.LG) 2026-06-24

Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines

arXiv:2605.19178v2 Announce Type: replace-cross Abstract: The great success of neural networks primarily arises from the presence of the large number of weight parameters combined with nonlinearities in the input-output relationship of single neurons. In this work, we study the relationship between the statistical properties of the weights and the nonlinearity of the hidden unit in Restricted Boltzmann Machines (RBMs) on the one side, and the distribution induced on binary visible units. We do this for four commonly used activation functions: Linear, Step, ReLU, and Exponential, and make qualitative predictions about the ability of these models to learn distributions with strong higher order interactions over the visible nodes. We show that in general, in an ensemble of RBMs with Gaussian weights, these distributions are rare and hard to learn, except when the hidden unit activation function is an Exponential.

08.
medRxiv (Medicine) 2026-06-23

Attention and memory in Parkinson's disease: a discriminant analysis approach

Background. Cognitive impairment in Parkinson's disease (PD) is highly prevalent and heterogeneous. Assessing multiple cognitive domains is challenging and risks redundancy. This study evaluated whether a discriminant analysis approach could optimize the selection of specific tasks and measures for identifying attention and memory deficits in PD. Methods. Thirty PD patients and 25 cognitively unimpaired (CU) controls completed four experimental tasks: two assessing attention (flanker and spatial Stroop), one for recognition memory, one for working memory (n-back). Following group-level difference analyses, a discriminant analysis was performed to identify which tasks, and performance metrics possessed the highest sensitivity for distinguishing PD patients from CU individuals. Results. At the group level, PD patients exhibited significantly worse conflict costs in both attention tasks and lower sensitivity scores (d') in the recognition memory task compared to CU controls. The discriminant analysis revealed that time-based measures from the spatial Stroop task and the sensitivity score from the recognition memory task provided the highest discriminating power to differentiate between the two groups. Conclusion. These findings suggest that cognitive deficits in PD can be identified with high diagnostic accuracy using a targeted subset of metrics, eliminating the need for extensive and redundant neuropsychological testing batteries for attention and memory, without needing an extensive number of cognitive tasks for attention and memory.

09.
medRxiv (Medicine) 2026-06-10

General-purpose large language models can achieve physician-level accuracy in complex medical data extraction

Background: Unstructured data represent about 80% of total electronic health records (EHR) data. Structuring this free text is essential for advancing clinical research, including cohort selection for trials, retrospective studies, and the development of disease registries. While manual chart review (MCR) remains the gold standard for extracting this clinical data, the process is inherently slow, resource-intensive, and susceptible to errors from human fatigue. We evaluated the extraction accuracy, safety, and efficiency of the HeLIX (Hepatology Logic-Integrated Extraction) framework, a Large Language Model (LLM) protocol using Google Gemini 3 Pro, compared to a gold-standard Manual Chart Review (MCR). Methods: A prospective validation study was conducted using 50 high-complexity, simulated hepatology discharge summaries designed to replicate the real-world heterogeneity of EHRs. The HeLIX framework employed a Zero-Shot, Structured Chain-of-Thought (CoT) prompting strategy enforced by a three-layer architecture: Clinical Reasoning Trace, Schema Enforcement, and Evidence Verification. The model extracted 45 distinct clinical variables. Performance was benchmarked against a consensus MCR. Results: Across 2,250 evaluated data points, the model achieved an overall Extraction Accuracy of 99.24% (95% CI: 98.8%-99.5%), with perfect concordance in 35/45 (77.8%) variables. For binary diagnostic variables, the model demonstrated an overall F1-score of 0.98, Recall of 0.99 and substantial inter-rater reliability (Cohens {kappa} = 0.97). Hallucinations were exceptionally rare (2/2250; 0.08%). Critical errors affecting clinical management occurred in only 2 instances (

10.
arXiv (CS.AI) 2026-06-11

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

arXiv:2606.11922v1 Announce Type: cross Abstract: Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves 64.48% score, outperforming the AST baseline by 5%. Code is available at https://github.com/RSC-Toolkit/Lung-SRAD.

11.
arXiv (math.PR) 2026-06-25

Localized Centered Second-Chaos Operator

arXiv:2606.26065v1 Announce Type: new Abstract: We prove a localized continuous-frequency operator estimate for centered Gaussian chaoses of order two. The result applies to operator-valued centered second chaoses, including Wick-centered same-family variants, between Hilbert spaces. In the model, two Gaussian frequency legs at scale $N$, an input leg at scale $Q$, and an output leg at scale $M$ are coupled through a soft incidence kernel; non-orthogonal Gaussian profiles are represented by covariance synthesis maps. The proof combines four oriented flattenings, rectangular non-commutative Khintchine inequalities, soft-incidence Schatten bounds, and Sobolev–Besov dyadic summation. The time lift gives $L^p$ operator convergence, while a Galerkin stabilization hypothesis gives pathwise full-cutoff convergence by the first Borel–Cantelli lemma. Under $\mathcal G(N)\lesssim N^{-\Gamma}$ one obtains the window \[ \Gamma>\frac d2, \qquad s

12.
arXiv (CS.CV) 2026-06-18

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

13.
arXiv (CS.CV) 2026-06-12

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.

14.
arXiv (CS.AI) 2026-06-12

EPIG: Emotion-Based Prompting for Personalised Image Generation

arXiv:2606.13247v1 Announce Type: new Abstract: Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However, commonly used prompting strategies remain relatively generic, limiting the model's ability to accurately express emotional intent and nuanced affective attributes. This work proposes EPIG, a method that enhances emotional expressiveness at the prompt level prior to image generation. Grounded in psychologically informed emotion representations (valence-arousal) and leveraging structured, role-aware prompt enrichment, EPIG enriches emotion-related components of prompts without modifying or retraining the image generation backbone. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent visual outputs, with particular effectiveness in controlling arousal. EPIG is lightweight, training-free, and well suited for resource-constrained and personalized image generation scenarios. Experimental results on a benchmark of 10 diverse prompts show that EPIG reduces mean arousal error compared to strong baselines, including naive insertion and LLM-based prompt expansion, with reductions of 14% and 12%, respectively. These improvements are statistically significant. EPIG also preserves valence alignment and semantic consistency, as measured by CLIPScore and supported by ablation studies. The effect is more pronounced on prompts containing explicit subjects such as humans, children, or animals, where the reduction reaches 17%, highlighting the subject-sensitive behavior of the proposed method.

15.
arXiv (math.PR) 2026-06-15

Boltzmann-Like Occupation of Nonequilibrium Steady States on Dense Networks

arXiv:2606.14542v1 Announce Type: cross Abstract: A central problem in statistical physics is to extend the Boltzmann distribution to nonequilibrium steady states (NESS). We prove that NESS on large dense networks have Boltzmann-like occupation despite extensive entropy production. We further show that the active-matter heuristic of "low rattling" is asymptotically exact. Intuitively, these NESS spend a greater fraction of their time in states they leave more slowly. This explanation extends to the broader class of "equiaccessible" steady states, which play a role in our analysis akin to that of equilibrium in linear response.

16.
arXiv (CS.CV) 2026-06-15

HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation

We present an unsupervised single image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF) like image from a High-Field (HF) magnitude image and vice-versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates a HF to ULF transformation by estimating the tissue-type Signal-to-Noise ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data from generated from standard 3T T$_1$-weighted images for qualitative assessments and paired 3T-64mT T$_1$-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise and initial seeding.

17.
arXiv (CS.LG) 2026-06-16

Exploding and vanishing gradients in deep neural networks: the effect of residual connections

arXiv:2606.17013v1 Announce Type: cross Abstract: The well known phenomenon of exploding and vanishing gradients in deep neural networks is analyzed using multiplicative ergodic theory. The effect of adding a residual connection is explained in this context. Specifically, a characterization of Liapunov exponents due to Furstenberg and Kifer is exploited in order to make a precise statement about the Liapunov spectrum and the effect of residual connections on it.

18.
arXiv (CS.LG) 2026-06-16

PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation

arXiv:2606.15452v1 Announce Type: new Abstract: Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. We observe that rare events leave distinct topological fingerprints - transitions in Betti numbers from point-cloud embeddings - that are more stable and discriminative than statistical moments. We introduce PHINN, a flow-matching framework using dynamic Betti curves as conditioning signals and a persistence landscape loss for homology consistency. It scales to multivariate data, includes a natural-language interface to set Betti targets, supports cross-domain meta-learning and few-shot generation, and provides certified adversarial robustness. On financial, epidemiological, and multi-modal benchmarks, PHINN outperforms statistical and diffusion baselines in topological fidelity (beta-RMSE down 41-63%, transition accuracy up 84%) and matches jump-diffusion models in tail coverage while exceeding them in shape fidelity. All results have 95% confidence intervals.

19.
arXiv (CS.AI) 2026-06-16

AnonShield: Scalable On-Premise Pseudonymization for CSIRT Vulnerability Data

arXiv:2606.15650v1 Announce Type: cross Abstract: We present AnonShield, a high-throughput, on-premise pseudonymization system that combines GPU-accelerated NER, streaming processing, caching, and schema-aware configuration. Evaluated on datasets up to 550 MB (70,951 records), AnonShield reduces processing time from over 92 hours to under 10 minutes (up to 738x speedup) while achieving up to 94.2% F1-score and 96.7% recall. Our results show that scalable pseudonymization of vulnerability data is feasible without sacrificing analytical utility, enabling compliant data sharing in operational CSIRT environments.

20.
arXiv (CS.CV) 2026-06-16

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. This work we propose a Temporal Pyramid Adapter that utilize parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrated self-supervised XLS-R representations combined with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated cross multiple benchmark including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that Temporal Pyramid model obtained AUC of 99.24% and a EER of 3.87% on the PartialSpoof database, which is significantly outperforming the base model and several SOTA baseline such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that while spoofing artifact are independent from language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.

21.
arXiv (CS.CV) 2026-06-25

Gastroendoscopy View Synthesis: A New Real Dataset and Evaluation

Novel view synthesis (NVS) is an active research topic in computer vision, owing to the success of neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) methods. While NVS opens the door to potential applications in gastroendoscopy, such as extending the field of view of endoscopic images and enabling digital twins for 3D archiving and endoscopist manipulation training, the dataset is insufficient to evaluate NVS for gastroendoscopy. In this paper, we present the first real gastroscopy dataset for NVS, namely the GastroNVS dataset, which contains a set of gastroscopic images, camera poses, and a point cloud for real gastroendoscopy inspection. To assess the suitability of the GastroNVS dataset, we evaluate several 3DGS methods and discuss the challenges for future development. The dataset is available on request from our project page.

22.
bioRxiv (Bioinfo) 2026-06-11

Tumour evolution as ground truth for cancer whole-genome sequencing

Cancer genomes are shaped by evolutionary processes that couple mutagenesis, clonal selection, chromosomal instability, spatial growth and treatment response into structured genomic patterns, yet current benchmarking strategies largely ignore this evolutionary dependency. Here, we present SCOUT, a large-scale synthetic whole-genome sequencing resource of over 200 samples, designed for systematic benchmarking of tumour genomic analysis and evolutionary inference under controlled evolutionary ground truth. Unlike conventional task-specific simulations, SCOUT models tumour evolution as a latent generative process that simultaneously shapes mutations, copy-number alterations, variant allele frequencies, mutational signatures and clonal architectures. SCOUT recapitulates key features of solid and haematological malignancies, including driver mutations, chromosomal instability, intratumour heterogeneity, spatial sampling and treatment-associated evolutionary dynamics in tumour and matched-normal longitudinal and multi-region sequencing designs. Using SCOUT, we benchmarked widely used methods for somatic variant detection, copy-number analysis, mutational signature inference and tumour evolutionary reconstruction. Across analytical tasks, performance deteriorated in low-purity, highly subclonal and structurally complex tumours, while spatial sampling bias and hypermutation generated spurious evolutionary signals that confounded tumour interpretation across multiple inference layers. Evolutionary simulations further distinguished lineage-restricted genetic bottlenecks from multi-lineage resistance dynamics associated with tumour plasticity. Tumour purity consistently exerted a stronger effect on inference accuracy than sequencing depth. Together, our results establish evolutionary ground truth as a prerequisite for reproducible benchmarking and biologically interpretable analysis of cancer whole-genome sequencing data.

23.
arXiv (CS.CV) 2026-06-16

MixTeX: Data-Efficient LaTeX OCR via Synthetic Pretraining and Limited Fine-Tuning

LaTeX OCR converts scientific document images into editable LaTeX code. Existing systems rely on large paired datasets, which are costly to collect and limited for low-resource languages. This paper presents MIXTEX, a data-efficient system using synthetic pretraining without real LaTeX sources. Unlike Nougat that depends on arXiv datasets, we generate training data by randomly pairing grammatical Wikipedia text with LaTeX formulas, requiring only syntactic correctness. This eliminates dependency on real document collections, enables scalable data generation (120M tokens), and supports low-resource languages. Following synthetic pretraining, adaptation requires only 400 real samples. Evaluation on a 977-sample benchmark with printed and handwritten English and Chinese shows that this two-stage strategy outperforms methods trained on large real datasets while requiring less human effort and computation. Data, code, and models are publicly available.

24.
arXiv (CS.LG) 2026-06-15

IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings

arXiv:2603.05556v2 Announce Type: replace Abstract: Integer sequences in the OEIS span values from single-digit constants to astronomical factorials and exponentials, making prediction challenging for standard tokenised models that cannot handle out-of-vocabulary values or exploit periodic arithmetic structure. We present IntSeqBERT, a dual-stream Transformer encoder for masked integer-sequence modelling on OEIS. Each sequence element is encoded along two complementary axes: a continuous log-scale magnitude embedding and sin/cos modulo embeddings for 100 residues (moduli $2$–$101$), fused via FiLM. Three prediction heads (magnitude regression, sign classification, and modulo prediction for 100 moduli) are trained jointly on 274,705 OEIS sequences. At the Large scale (91.5M parameters), IntSeqBERT achieves 95.85% magnitude accuracy and 50.38% Mean Modulo Accuracy (MMA) on the test set, outperforming a standard tokenised Transformer baseline by $+8.9$ pt and $+4.5$ pt, respectively. An ablation removing the modulo stream confirms it accounts for $+15.2$ pt of the MMA gain and contributes an additional $+6.2$ pt to magnitude accuracy. A probabilistic Chinese Remainder Theorem (CRT)-based Solver converts the model's predictions into concrete integers, yielding a 7.4-fold improvement in next-term prediction over the tokenised-Transformer baseline (Top-1: 19.09% vs. 2.59%). Modulo spectrum analysis reveals a strong negative correlation between Normalised Information Gain (NIG) and Euler's totient ratio $\varphi(m)/m$ ($r = -0.851$, $p < 10^{-28}$), providing empirical evidence that composite moduli capture OEIS arithmetic structure more efficiently via CRT aggregation.

25.
arXiv (CS.CL) 2026-06-16

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at https://github.com/miso-choi/TruthProbe.