Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-17

Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model to MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31\% and 2.12\% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.

02.
arXiv (CS.CV) 2026-06-19

Current World Models Lack a Persistent State Core

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9{,}600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

03.
arXiv (CS.CL) 2026-06-17

Rift: A Conflict Signature for Deception in Language Models

Authors:

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

04.
arXiv (quant-ph) 2026-06-19

Quantum deformations of $\mathcal{U}(\mathfrak{sl}(2, \mathbb{R}))$. Part I: Fidelity and experimental benchmarking

arXiv:2606.19462v1 Announce Type: new Abstract: This work explores the effects of both the standard quantum $q$-deformation and the non-standard $h$-deformation of the Hopf algebra $\mathcal{U}(\mathfrak{sl}(2, \mathbb{R}))$ on multi-qubit systems. By constructing the states of a Hilbert space of $N$ qubits through the Clebsch-Gordan coefficients associated with the deformed algebras, we show that these states naturally coincide with the eigenstates of the Hamiltonian of the $q$- and $h$-deformed Kittel-Shore models. We compare the resulting deformed states with those typically targeted in quantum information experiments, providing a bridge between algebraic constructions and experimentally relevant quantum resources. Fidelities with respect to the undeformed states are computed to establish how the quantum correlations are affected, both for few-qubit systems (including Dicke and non-Dicke states), and in the macroscopic limit ($N \to \infty$) through closed-form formulas derived for arbitrary Dicke states. The results reveal different behaviors between the two deformations. The $q$-deformation smoothly modifies the states and maintains a residual overlap with the original configurations, while the $h$-deformation rapidly makes the states orthogonal to their undeformed counterparts. Both models demand a standard $N^{-1}$ rescaling to preserve fidelity stability in the macroscopic limit.

05.
arXiv (CS.LG) 2026-06-16

Deep Learning-Based Lunar Crater Terrain Relative Navigation

arXiv:2606.14776v1 Announce Type: cross Abstract: Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from the monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by the consensus-based outliers removal method. The estimated measurements are then used to refine an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even if the spacecraft is off from its actual location up to 5 km, TRN could recover from this situation, achieving navigation error reduction to a few hundred meters. It should be noted that in order to maintain crater feature correspondences, it is important to match the image resolution and the scales within the scene to the detector training set distribution.

06.
arXiv (CS.CV) 2026-06-11

DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds

We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: https://wrchen530.github.io/dynatok/.

07.
arXiv (CS.LG) 2026-06-17

Informative Missingness to Generate Irregular Clinical Time Series

arXiv:2606.17106v1 Announce Type: new Abstract: Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology, making it important to model it directly rather than treat it as a preprocessing artifact. Here we present a diffusion-based approach for generating clinical time series that jointly models laboratory values and their observation patterns using the public Data Analytics Challenge on Missing Data Imputation (DACMI) benchmark derived from MIMIC-III. To preserve realistic sampling, we align chart times into 4-hour intervals and segment admissions into 7-day windows, producing trajectories that pair each lab value with a corresponding observation indicator. Standard transformations and normalization are applied to stabilize training. Our method extends the TimeDiff framework to learn continuous lab values and discrete missingness patterns through complementary diffusion objectives. Experiments show that the generated data closely match real patient trajectories across individual lab distributions and joint value-missingness embeddings, demonstrating that diffusion models can capture clinically meaningful dependencies between patient physiology and clinicians' testing behavior under MNAR-like (missing-not-at-random) missingness. These preliminary results indicate that our model can serve as an initial component toward developing clinical foundation models. By producing synthetic priors that preserve key physiology-missingness relationships, this work motivates the subsequent training of Prior-Data Fitted Networks capable of leveraging informative missingness, which we will investigate in the extended work.

08.
arXiv (CS.CV) 2026-06-18

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI), a critical challenge unaddressed by existing benchmarks that conflate HOI with static attributes, relying on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that reliably evaluates instance-level interaction by letting VLM Q&A after thinking with images containing grounded Human-Object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.

09.
arXiv (CS.LG) 2026-06-19

Calibrating Generative Models to Feature Distributions with MMD Finetuning

arXiv:2606.19496v1 Announce Type: new Abstract: Generative models can produce individually plausible samples while deviating substantially from a target set in the distribution of key features. For example, a model pretrained on broad drug-like chemical space may generate molecules whose molecular features differ from those of a therapeutic class of interest, such as known antibiotics. Correcting such distributional miscalibration is challenging: direct finetuning on the target set can overfit and does not control which features are matched. To fill this gap, we introduce kernel Calibrating Generative Models (kCGM). kCGM minimizes a maximum mean discrepancy (MMD) between generated and target feature distributions using an unbiased score-function estimator, with KL regularization to remain close to the pretrained model. On a target set of 174 antibiotics, direct finetuning sacrifices chemical validity for feature-distribution matching, whereas kCGM improves target feature matching while increasing validity. We further demonstrate kCGM in protein and DNA generation tasks, showing it can adapt autoregressive, continuous-space diffusion, and discrete diffusion models using only feature-level supervision. Code is available at https://github.com/smithhenryd/cgm.

10.
arXiv (CS.CV) 2026-06-12

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to $2.9\%$.

11.
arXiv (CS.LG) 2026-06-16

TextResNet: Decoupling and Routing Optimization Signals in Compound AI Systems via Deep Residual Tuning

arXiv:2602.08306v2 Announce Type: replace Abstract: Textual Gradient-style optimizers (TextGrad) enable gradient-like feedback propagation through compound AI systems. However, they do not work well for deep chains. The root cause of this limitation stems from the Semantic Entanglement problem in these extended workflows. In standard textual backpropagation, feedback signals mix local critiques with upstream contexts, leading to Attribution Ambiguity. To address this challenge, we propose TextResNet, a framework that reformulates the optimization process to achieve precise signal routing via four key innovations. Firstly, in the forward pass, it enforces Additive Semantic Deltas to preserve an Identity Highway for gradient flow. Secondly, in the backward pass, it introduces Semantic Gradient Decomposition via a Semantic Projector to disentangle feedback into causally independent subspaces. Thirdly, it implements Causal Routing, which routes projected signals to their specific components. Finally, it performs Density-Aware Optimization Scheduling to leverage the disentangled signals to dynamically allocate resources to key system bottlenecks. Our results show that TextResNet not only achieves superior performance compared to TextGrad, but also exhibits remarkable stability for agentic tasks in compound AI systems where baselines collapse. Code is available at https://github.com/JeanDiable/TextResNet.

12.
arXiv (CS.CV) 2026-06-11

AGE-MIL: Anchor-Guided Evidence Learning for Patient-Level Prediction

Existing computational pathology methods predominantly operate within whole-slide image (WSI)-level multiple instance learning (MIL) paradigms, while patient-level modeling remains underexplored. In routine pathological practice, however, pathologists derive diagnostic and prognostic conclusions by integrating evidence across multiple WSIs rather than relying on any single slide. This discrepancy creates a fundamental misalignment when patient-level supervision is directly imposed on conventional MIL frameworks, often leading to unstable optimization and degraded predictive reliability. To address this issue, we propose Anchor-Guided Evidence MIL (AGE-MIL), a weakly supervised framework for patient-level prediction. AGE-MIL constructs a patient-level anchor from slide representations to capture global pathological context and guide the retrieval and integration of diagnostically relevant local patches, enabling robust patient-level modeling. Patient-level risk is further modeled as an evidence accumulation process, promoting stable optimization under weak supervision. AGE-MIL is evaluated on six clinically relevant patient-level prediction tasks from two independent cohorts. Experimental results show that the proposed framework consistently outperforms eight state-of-the-art MIL methods. Code is available at https://github.com/wodeniua/AGE-MIL.

13.
arXiv (CS.AI) 2026-06-16

Learning Permutation Distributions via Reflected Diffusion on Ranks

arXiv:2603.17353v2 Announce Type: replace-cross Abstract: The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

14.
PLOS Medicine 2026-05-20

Associations between hematologic dynamics during pregnancy and obstetric complications: A retrospective observational study

by Veronica Tozzo, Rachel Petherbridge, Kaitlyn James, Sarah Hsu, Deepti Pant, Chloe Michalopoulos, Brody H. Foy, Tanayott Thaweethai, Christopher Mow, Jacqueline Maya, Carolina Batlle Camero, Lydia Shook, Kathryn J. Gray, Logan Mauney, John M. Higgins, Camille E. Powe Background Pregnancy alters hematologic state as measured by complete blood count (CBC), but the longitudinal changes in CBC indices that define healthy pregnancies are not well established. In a large cohort based at an academic health system in the United States, we aimed to define reference intervals and typical longitudinal changes in CBC indices during pregnancy. We then tested for associations between extreme CBC values for gestational age or extreme longitudinal changes in CBC indices and obstetric complications. Methods and findings We studied nine CBC indices in individuals with singleton pregnancies who delivered after 30 weeks’ gestation and presented for prenatal care prior to 20 weeks. The electronic health record (EHR)-based Maternal Health Cohort (Massachusetts General Hospital; 1998–2016) formed our discovery cohort of 45,992 pregnancies, 18% of which had relevant complications. We developed a validation cohort of 48,868, 27% with complications from EHR data in the Mass General Brigham healthcare system from 2016 to 2024. In pregnancies without complications in the discovery cohort, we derived gestational-age-specific reference intervals (2.5th–97.5th percentile) and established typical intra-pregnancy longitudinal changes. In the validation cohort, we then tested CBC values outside of the 26–29 weeks’ gestation reference interval and CBC rare changes (uncommon changes in magnitude and direction) between 7–14 and 26–29 weeks’ gestation for association with a composite outcome (hypertensive disorders of pregnancy, small for gestational age birthweight, preterm birth) and its individual components using generalized estimating equations. Derived reference intervals differed from those in the literature for mean red cell volume, mean red cell hemoglobin, red cell count, and mean red cell hemoglobin concentration; reference intervals for other indices were similar to those previously published. In validation, hematocrit, hemoglobin, and red cell count values above their gestational-age specific reference intervals were associated with increased risk of the composite obstetric outcome: odds ratios (ORs) of 1.4 (95% CI [1.2, 1.5] p 

15.
arXiv (CS.CL) 2026-06-16

Evaluating LLM Personalization via Semantic Constraint Verification

Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets to verify personalization constraints via a Natural Language Inference (NLI) model. Moving beyond binary scoring, NLICV categorizes LLM behaviors into four distinct modes: personalization, generalization, sycophancy, and failure. Extensive experiments demonstrate that NLICV aligns closely with human annotations while drastically reducing the latency and token costs associated with LLM judges (up to 2100 inference speedup). Finally, through an ablation-based procedure, NLICV pinpoints the exact sentences driving the constraint verification, yielding faithful, understandable evidence for its evaluations.

16.
arXiv (quant-ph) 2026-06-17

Optimal Probe State for Phase Estimation Under Covariant Measurement

arXiv:2606.18169v1 Announce Type: new Abstract: We study the optimization of input states for phase estimation under covariant measurements. Building on Holevo's framework, which provides the optimal covariant measurement for a fixed input state, we further optimize over the input state itself. For a general even $2\pi$-periodic cost function with non-negative Fourier coefficients, we derive a necessary and sufficient condition for the optimal input state: Its Fock coefficients are determined, up to arbitrary phases, by the eigenvector corresponding to the largest eigenvalue of a Toeplitz matrix defined by the cost function. This characterization yields an explicit expression for the attainable lower bound of the average cost under optimal covariant measurements and shows that this bound asymptotically approaches zero in the infinite-energy limit. For the specific cost function $W(\theta,\tilde{\theta})=4\sin^2[(\theta-\tilde{\theta})/2]$, we obtain the optimal input state and the corresponding minimum average cost in closed form, demonstrating Heisenberg scaling with respect to the mean photon number.

17.
arXiv (CS.LG) 2026-06-16

Dynamic Link Prediction with Temporally Enhanced Signed Graph Neural Networks

arXiv:2605.26290v2 Announce Type: replace Abstract: Temporal signed networks (TSNs) model the time evolution of cooperative and adversarial relationships that arise in applications such as social media analysis, trust and reputation systems, and financial transaction networks. While graph neural networks (GNNs) perform well for static or unsigned link prediction, effective learning in temporal signed graphs remains challenging due to the interaction of signed relations, evolving structure, and balance-theoretic constraints. To address this gap, we propose a modular temporal enhancement framework for signed GNNs that integrates historical context into otherwise static architectures. The framework introduces a Historical Context Integration Module (HCIM) that combines learnable recency-aware temporal weighting, LSTM-based embedding trajectory modeling, and multi-head temporal attention to capture both short- and long-term signed interaction dynamics. Historical information is fused with current node representations using either global or node-adaptive weighting, allowing the architecture-agnostic framework to accommodate heterogeneous temporal behaviors. We instantiate the approach on the Self-Explainable Signed Graph Transformer (SE-SGformer), preserving interpretability while extending it with temporal awareness. Experiments on real-world and synthetic TSNs, including Bitcoin OTC, Bitcoin Alpha, Reddit, and small-world network models, demonstrate consistent and statistically significant improvements over the static baseline.

18.
arXiv (CS.CV) 2026-06-11

Contactless 3D Human Body Measurement Using Depth Cameras for Smart Health Monitoring

Contactless body measurement technologies are becoming increasingly significant for smart health monitoring, digital health applications, and remote patient assessment. Traditional anthropometric measurements typically necessitate physical contact and trained personnel, which may constrain scalability in remote healthcare settings. In this study, we introduce a depth camera-based framework for estimating human body measurements utilizing 3D point cloud data. An Orbbec Astra 2 depth camera was employed to capture RGB images, depth maps, and 3D point clouds of participants. The captured point cloud was processed using Python-based tools, including Open3D, NumPy, and OpenCV, to segment the human body from the background. Key anthropometric measurements, such as height and arm span, were computed. The measurements were obtained through a combination of spatial filtering and landmark selection on the 3D point cloud, followed by the projection of the computed measurements onto the corresponding RGB image using camera intrinsic parameters. In addition to linear measurements, the approximate body volume and visible surface area were estimated using voxel-based occupancy analysis and mesh-based surface reconstruction methods. The experimental results from a single depth capture demonstrated that accurate body measurements and geometric estimates could be obtained from depth camera data without physical contact. This study provides a foundation for future real-time systems that integrate depth sensing with intelligent health monitoring and generative AI models for smart healthcare applications.

19.
arXiv (CS.CL) 2026-06-12

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

20.
arXiv (CS.CV) 2026-06-19

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

21.
arXiv (CS.CL) 2026-06-16

Creative Collision: Directorial Persona Steering and Competition in Large Language Models

Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer setting in which two semantically opposing steering vectors are superimposed – a regime we call Creative Collision. Concretely, we construct directorial persona vectors for Steven Spielberg (optimistic, redemptive moral valence) and Martin Scorsese (dark, morally ambiguous) via mean-difference activation contrast on curated screenplay-derived corpora, then interpolate between them with a scalar mixing parameter $\alpha \in [0,1]$ and a steering coefficient $\lambda$. Across five evaluation axes – moral valence, generation coherence, surface style, directional dominance, and vector geometry – three principal findings emerge: (i)~Spielberg's representational signature exhibits robust directional dominance, suppressing Scorsese's moral influence across almost the entire interpolation range; (ii)~intermediate collision points paradoxically improve generation coherence relative to pure single-director steering at high $\lambda$; and (iii)~both personas localise maximally to layer~28 of a 40-layer decoder-only transformer, revealing a shared moral-tone substrate. These results illuminate the geometry of competing semantic directions in transformer residual streams and have direct implications for controllable creative generation and value-aligned narrative synthesis.

22.
arXiv (CS.AI) 2026-06-19

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

arXiv:2606.20381v1 Announce Type: new Abstract: FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.

23.
arXiv (CS.CL) 2026-06-16

ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: https://github.com/dinhthienan33/SemEval2026-Task4-ttda704.

24.
arXiv (quant-ph) 2026-06-17

Emergent de Sitter Space and Non-Unitary Tensor Networks from Non-Hermitian Quantum Criticality

arXiv:2606.17983v1 Announce Type: new Abstract: Extending the holographic principle to de Sitter (dS) spacetimes remains one of the most vital open frontiers in quantum gravity, where a microscopic, bottom-up tensor-network framework that relates boundary quantum data to emergent de Sitter spacetime is still lacking. In this work, we first show the emergence of de Sitter spacetime from boundary entanglement by formulating a non-unitary continuous multi-scale entanglement renormalization ansatz (cMERA) for a concrete non-Hermitian critical fermion chain. Within this emergent spacetime, we analyze the associated geodesics and show that they act as extremal Ryu-Takayanagi (RT) surfaces undergoing a smooth timelike-to-null transition. Remarkably, we demonstrate that this continuum trajectory dictates a distinct tensor-network architecture in which the bond-counting contribution naturally truncates at the discrete timelike-to-null transition toward the deep infrared. In the resulting architecture, the null ray along the horizon is represented by zero-cost links, since the associated cut severs no tensor legs. This network structure successfully reproduces the logarithmic scaling of non-unitary critical entanglement entropy, offering a bond-counting picture for the de Sitter RT formula. Our results provide the long-sought dS/(c)MERA correspondence at the level of both emergent spacetime and discrete holographic entanglement.

25.
arXiv (CS.AI) 2026-06-17

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

arXiv:2606.17269v1 Announce Type: new Abstract: In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.