Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-16

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

02.
arXiv (CS.CV) 2026-06-24

A Geometry-Informed Computer Vision Method for Detecting and Examining Overtaking Vehicles From A Bicycle

Instrumented bicycle studies have produced direct field evidence on vehicle passing behavior, but extracting overtaking events from continuous rear-facing video has remained dependent on manual, frame-by-frame annotation. This bottleneck constrains sample sizes and limits naturalistic cycling safety research. We present a geometry-informed computer vision pipeline that automates overtaking event detection from a single bicycle-mounted camera without multi-sensor configurations or explicit camera calibration. The system combines RT-DETR object detection with ByteTrack multi-object tracking through a three-stage geometric validation module enforcing bearing angle trend, apparent size growth, and spatial confirmation criteria derived from perspective projection principles. Validated on 315 manually annotated real-world overtaking events from urban roads in Ann Arbor, Michigan, the pipeline achieved 97.8% recall with zero false positives. The system identified overtaking intentions a mean of 2.44 seconds before vehicle passage, with 84.1% of events exceeding the 1.5-second human reaction time threshold, demonstrating feasibility for active cyclist warning. Lateral passing distance measurements from 96 events revealed 33.3% of passes below the 5-foot (152.4 cm) threshold, consistent with non-compliance rates in prior field and self-reported studies. A preliminary calibration-free lateral distance estimation approach using bounding box geometric features achieved mean absolute errors of 13-14 cm under leave-one-out cross-validation, sufficient to distinguish close passes from standard passes for safety categorization. By automating event isolation from consumer-grade footage, the system removes the primary annotation bottleneck of instrumented bicycle research and provides a scalable foundation for vehicle-bicycle interaction analysis across larger datasets and diverse urban environments.

03.
arXiv (CS.CV) 2026-06-18

SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error resilience of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a Secure and Privacy-Aware Approximate CNN Acceleration framework integrated within a heterogeneous RV32IMC RISC-V System-on-Chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, an approximation-aware decision framework is introduced that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). Evaluation across 11 state-of-the-art approximate MAC architectures identifies the Iterative Logarithmic Multiplier (ILM) as the most suitable design, achieving 51.7% area reduction, 81.5% power reduction, and 2.13x throughput improvement compared with an accurate radix-4 Booth MAC, while only reducing ResNet-20/CIFAR-10 accuracy by 2.82 percentage points. FPGA implementation on a Xilinx VC707 platform achieves 58.4 GOPS/W energy efficiency at 250 MHz, while 28-nm CMOS physical implementation validates ASIC feasibility.
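
For readers unfamiliar with logarithmic multipliers, the following is a minimal software sketch of the idea commonly described under the name iterative logarithmic multiplier: each operand is split into its leading power of two plus a residue, the product is approximated with shifts and adds only, and optional correction stages approximate the neglected residue product in the same way. The function names and the `corrections` parameter are illustrative assumptions; the SPARX hardware unit may be organized differently.

```python
def leading_one(n):
    """Index of the most significant set bit of a positive integer."""
    return n.bit_length() - 1


def basic_block(a, b):
    """Shift-and-add approximation of a*b for positive integers.

    Writes a = 2**ka + ra and b = 2**kb + rb and returns
    (2**(ka+kb) + ra*2**kb + rb*2**ka, ra, rb).
    The neglected error term is exactly ra * rb.
    """
    ka, kb = leading_one(a), leading_one(b)
    ra, rb = a - (1 << ka), b - (1 << kb)
    approx = (1 << (ka + kb)) + (ra << kb) + (rb << ka)
    return approx, ra, rb


def ilm_multiply(a, b, corrections=0):
    """Approximate a*b for non-negative integers.

    `corrections` is the number of extra stages (an illustrative knob);
    each stage adds an approximation of the remaining error term.
    """
    if a == 0 or b == 0:
        return 0
    product, ra, rb = basic_block(a, b)
    for _ in range(corrections):
        if ra == 0 or rb == 0:
            break  # remaining error is zero, so the result is exact
        term, ra, rb = basic_block(ra, rb)
        product += term
    return product


if __name__ == "__main__":
    for stages in range(3):
        print(stages, ilm_multiply(13, 11, stages), "exact:", 13 * 11)
```

Because every stage only underestimates the remaining error, the approximation never exceeds the exact product, and the number of correction stages trades accuracy against hardware cost.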

04.
arXiv (CS.CL) 2026-06-18

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and find the following: (a) narrative texts come from diverse sources beyond just literature; (b) theoretical synthesis and validation are potential outcomes; (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives, and addressing narrative levels beyond fabula and discourse. For future directions, instead of pursuing a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary, social, and cultural analysis; generating narratives in situated contexts; and continuing experiments whose outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

05.
arXiv (math.PR) 2026-06-15

Universality for Products of Random Matrices with i.i.d. Entries and the Fuss–Catalan Number

arXiv:2606.14450v1. Let \((w_{ij})_{i,j\ge1}\) be a single infinite array of independent identically distributed real- or complex-valued entries of mean zero, variance \(\sigma^2\), and finite fourth moment. Set \(W_n=(w_{ij})_{1\le i,j\le n}\) and \(X_n=n^{-1/2}W_n\). For every fixed \(k\ge1\), we identify the almost sure limiting operator norm of several fixed products built from this family. Define the \(k\)-th freeness coefficient by \[ \gamma_k:=\sqrt{\frac{(k+1)^{k+1}}{k^k}}. \] Then we prove \[ \|X_n^k\|\to\sigma^k\gamma_k \qquad \text{almost surely.} \] The same limit holds for products sampled with replacement from any fixed finite pool of independent copies of \(X_n\); in particular, it holds for the product of \(k\) independent copies. Thus, the freeness coefficient captures the non-commuting characteristic between large random matrices under the finite fourth moment assumption. The improvement of the classical Bai–Yin-type power estimate from the scale \(\sigma^k(k{+}1)\) to \(\sigma^k \sqrt{k{+}1}\) is a direct corollary of our result. The main technical challenge is to prove the upper bound using a high-moment expansion of \(\mathbb{E}\operatorname{Tr}((X_n^kX_n^{*k})^m)\). The leading zero-defect trace words are tree-like and are counted by the Fuss–Catalan number \[ F_{k,m}= \frac1{km+1}\binom{(k+1)m}{m}. \] The combinatorial tool helps to devise a defect-sensitive global enumeration: if \(L=km\) and \[ r=(L+1-v)+(L-q), \] then the number of admissible word classes with defect \(r\) is at most \(F_{k,m}(Cm)^{Dr}\). This polynomial-in-\(m\) loss, with degree proportional to the defect, is summable in the logarithmic moment range.
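
A quick numerical sanity check of how the Fuss–Catalan count relates to the limit stated in the abstract. Heuristically, if the leading contribution to the \(2m\)-th moment grows like \(\sigma^{2km}F_{k,m}\), then taking \(2m\)-th roots and letting \(m\) grow should reproduce \(\sigma^k\gamma_k\). The sketch below checks that \(F_{k,m}^{1/(2m)}\) approaches \(\gamma_k\) from below; the function names are illustrative and not taken from the paper.

```python
from math import comb, exp, log, sqrt


def fuss_catalan(k, m):
    """F_{k,m} = binom((k+1)m, m) / (km + 1); the division is always exact."""
    return comb((k + 1) * m, m) // (k * m + 1)


def gamma(k):
    """gamma_k = sqrt((k+1)^(k+1) / k^k), as defined in the abstract."""
    return sqrt((k + 1) ** (k + 1) / k ** k)


def moment_root(k, m):
    """F_{k,m}^(1/(2m)), computed through logarithms to avoid float overflow."""
    return exp(log(fuss_catalan(k, m)) / (2 * m))


if __name__ == "__main__":
    for k in (1, 2, 3):
        roots = [round(moment_root(k, m), 4) for m in (10, 100, 1000)]
        print(f"k={k}: roots={roots}, gamma_k={gamma(k):.4f}")
```

For \(k=1\) the numbers \(F_{1,m}\) are the ordinary Catalan numbers and \(\gamma_1=2\), which matches the familiar operator-norm limit \(2\sigma\) for a single matrix with i.i.d. entries.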

06.
arXiv (CS.CL) 2026-06-19

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit–DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
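
To make the surprisal idea concrete, here is a toy sketch: a smoothed bigram model is fitted to "native" sequences of discrete unit IDs, and a test sequence is scored by its mean surprisal in bits. The bigram model, the add-alpha smoothing, and the toy sequences are all assumptions chosen for brevity; the abstract does not specify the token language model or the scoring details.

```python
import math
from collections import Counter


class BigramTokenLM:
    """Add-alpha smoothed bigram model over discrete speech-unit IDs (toy stand-in)."""

    def __init__(self, vocab_size, alpha=1.0):
        self.vocab_size = vocab_size
        self.alpha = alpha
        self.pair_counts = Counter()
        self.context_counts = Counter()

    def fit(self, sequences):
        """Count adjacent token pairs in the native training sequences."""
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1
        return self

    def surprisal(self, prev, cur):
        """-log2 P(cur | prev) under the smoothed bigram estimate."""
        num = self.pair_counts[(prev, cur)] + self.alpha
        den = self.context_counts[prev] + self.alpha * self.vocab_size
        return -math.log2(num / den)

    def mean_surprisal(self, seq):
        """Average per-transition surprisal of a token sequence, in bits."""
        pairs = list(zip(seq, seq[1:]))
        if not pairs:
            raise ValueError("need at least two tokens to score a sequence")
        return sum(self.surprisal(p, c) for p, c in pairs) / len(pairs)


if __name__ == "__main__":
    native = [[0, 1, 2, 3]] * 20          # hypothetical native unit sequences
    lm = BigramTokenLM(vocab_size=4).fit(native)
    print("native-like:", lm.mean_surprisal([0, 1, 2, 3]))
    print("deviant:    ", lm.mean_surprisal([3, 2, 1, 0]))
```

In this toy setup the reversed sequence contains transitions never seen in the native data, so it receives a higher mean surprisal, which is the signal the abstract associates with phonotactic deviation.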

07.
Nature (Science) 2026-06-24

Dietary cholesterol activates a Ral-dependent pathway driving LDLR turnover

Metabolism of the hepatic low-density lipoprotein receptor (LDLR) is a key determinant of cholesterol homeostasis. The molecular switches that coordinate LDLR trafficking and turnover in response to nutritional cues, including high dietary cholesterol, remain poorly defined. Here we identify a new pathway regulated by Ral GTPases that links extracellular cholesterol signals to the intracellular trafficking machinery controlling LDLR turnover. Chronic dietary cholesterol activates the Ral proteins by increasing RAS activity, routing LDLR to lysosomes for degradation and inhibiting its recycling independently of transcriptional regulation or PCSK9. Constitutive activation of Ral via RalGAPB deletion or overexpression of constitutively active Ral mutants in hepatocytes reduces LDLR levels and impairs cholesterol clearance. Ral engages the endocytic RalBP1–REPS1 complex to promote LDLR internalization and lysosomal routing, where LDLR is degraded by the lysosomal protease cathepsin A (CTSA). Ral activation directs CTSA towards lysosomes for maturation while limiting its secretion, further promoting LDLR degradation in lysosomes. Genetic variants in this pathway significantly associate with altered cholesterol in humans. Pharmacological inhibition of CTSA activity increases hepatic LDLR function and improves cholesterol clearance, offering a potential new therapeutic strategy for hypercholesterolaemia and cardiovascular disease. Chronic dietary cholesterol activates Ral GTPases, which promote LDLR internalization and lysosomal degradation through RalBP1–REPS1 and CTSA, thereby reducing cholesterol clearance, whereas CTSA inhibition restores LDLR function and may offer a therapeutic strategy for cardiovascular disease.

08.
arXiv (CS.CL) 2026-06-12

RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as the QA type and the source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

09.
arXiv (CS.LG) 2026-06-11

My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents

arXiv:2606.11256v1. Designing molecules with target properties is most useful when candidate structures are accompanied by feasible synthetic routes. We introduce My Chemical Harness, a route-native evolutionary framework for goal-directed molecular design in which the search population consists of executable synthetic pathways rather than isolated molecular graphs. Each route is built from purchasable building blocks and reaction templates, executed by deterministic chemistry tools, and scored through task-specific molecular oracles. Large language models (LLMs) are used only as strategy controllers that select high-level preferences over route length, move type, reaction families, motifs, and exploration pressure, while local code performs route construction, validation, deduplication, scoring, selection, and memory updates. This separation lets the LLM guide exploration without allowing it to introduce hallucinated products or unsupported reaction steps. On a soluble epoxide hydrolase proxy task, our LLM agent improves over single-pass LLM and deterministic controllers, reaching state-of-the-art performance across the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics. These results suggest that constrained LLM agents can play a significant role in molecular discovery without requiring training, fine-tuning, or dedicated generative models.

10.
arXiv (CS.CL) 2026-06-24

Blockwise Policy-Drift Gating for On-Policy Distillation

On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that local teacher-support matching is a simple and effective repair. This paper introduces blockwise policy-drift gating, a lightweight student-only old-current drift controller for OPD under rollout reuse. The method computes log-probability shifts between the behavior student and the current student on the sampled token path, aggregates these shifts over fixed blocks or spans, and uses the resulting detached, mean-normalized gates to reweight OPD position losses. It does not change teacher targets, teacher top-K supports, or the rollout policy. In a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget for all trained variants, we use pass@8 as the primary problem-level solve-rate metric. Fixed 64-token block gating improves sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23. On Teacher-TopK/LSM, Block64 gives the best four-benchmark mean pass@8 among trained students. The results identify local old-current policy drift as a practical control signal for reused OPD rollouts and motivate block-level gating as a simple default for improving solve-rate robustness.
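
The gating pipeline described above can be sketched in a few lines. The abstract does not say how a block's drift is mapped to a gate, so this sketch assumes, purely for illustration, that a block's drift is the mean absolute log-probability shift and that the raw gate is `exp(-drift / temperature)`; the function names and the `temperature` parameter are hypothetical. In an autograd framework the gates would additionally be detached so that no gradient flows through them.

```python
import math


def block_drift_gates(logp_behavior, logp_current, block_size=64, temperature=1.0):
    """Per-position gates from blockwise drift between two student policies.

    logp_behavior / logp_current: log-probabilities of the sampled tokens under
    the behavior (rollout-time) student and the current student.
    ASSUMPTION: drift = mean |shift| per block, raw gate = exp(-drift / temperature).
    """
    if len(logp_behavior) != len(logp_current):
        raise ValueError("log-probability sequences must have equal length")
    if not logp_behavior:
        raise ValueError("empty trajectory")
    shifts = [abs(c - b) for b, c in zip(logp_behavior, logp_current)]
    raw = []
    for start in range(0, len(shifts), block_size):
        block = shifts[start:start + block_size]
        drift = sum(block) / len(block)
        raw.extend([math.exp(-drift / temperature)] * len(block))
    mean_gate = sum(raw) / len(raw)
    return [g / mean_gate for g in raw]  # mean-normalized: gates average to 1


def gated_loss(position_losses, gates):
    """Reweight per-position OPD losses with gates that are treated as constants."""
    return sum(g * l for g, l in zip(gates, position_losses)) / len(gates)


if __name__ == "__main__":
    behavior = [-1.0, -1.0, -1.0, -1.0]
    current = [-1.0, -1.0, -2.0, -2.0]   # the second block has drifted
    gates = block_drift_gates(behavior, current, block_size=2)
    print(gates, gated_loss([0.5, 0.5, 0.5, 0.5], gates))
```

Mean normalization keeps the overall loss scale unchanged, so the gates only redistribute weight between low-drift and high-drift blocks.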

11.
arXiv (CS.CV) 2026-06-24

EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A) generation, the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

12.
arXiv (CS.CV) 2026-06-11

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

13.
arXiv (math.PR) 2026-06-19

The central heat trace on large compact classical groups

arXiv:2511.08288v2. We study the large-$N$ asymptotics of the central trace of the heat kernel on compact classical groups. For every classical family $G_N\subset \mathrm{GL}_N(\mathbb{C})$, we prove a full large-$N$ asymptotic expansion, using a highest weights/partitions correspondence adapted to the large-rank regime, under which the eigenvalues of the Laplace–Beltrami operator stabilize as observables in the algebra of shifted symmetric functions. Then, we prove a random surface representation of the trace in terms of ramified coverings of the torus. We provide two independent applications: an explicit large-rank counting law for the Casimir spectrum, with exponential Hardy–Ramanujan-type growth in contrast with the polynomial behavior of Weyl's law at fixed rank, and a rigorous probabilistic formulation of the Yang–Mills/Hurwitz duality on a two-dimensional torus initiated by Gross and Taylor, completing a previous work of the authors. We also extend this duality to a Yang–Mills/Gromov–Witten duality by expressing the coefficients of the central heat trace as explicit functionals of the generating function of Gromov–Witten invariants.

14.
arXiv (CS.CL) 2026-06-16

Understanding the Behaviors of Environment-aware Information Retrieval

Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting that strategies learned for one retriever are ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at https://github.com/LCO-Embedding/Envs-aware-Information-Retrieval.

15.
arXiv (CS.LG) 2026-06-24

Macro Graph of Experts for Billion-Scale Multi-Task Recommendation

arXiv:2506.10520v5. Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we introduce the Macro Graph of Experts (MGOE) framework, the first approach capable of leveraging macro graph embeddings to capture task-specific macro features while modeling the correlations between task-specific experts. Specifically, we propose the concept of a Macro Graph Bottom, which, for the first time, enables multi-task learning models to incorporate graph information effectively. We design the Macro Prediction Tower to dynamically integrate macro knowledge across tasks. MGOE has been deployed at scale, powering multi-task learning for a leading billion-scale recommender system, Alibaba. Extensive offline experiments conducted on three public benchmark datasets demonstrate its superiority over state-of-the-art multi-task learning methods, establishing MGOE as a breakthrough in multi-task graph-based recommendation. Furthermore, online A/B tests confirm the superiority of MGOE in billion-scale recommender systems.

16.
arXiv (CS.LG) 2026-06-24

LoMime: Query-Efficient Membership Inference using Model Extraction in Label-Only Settings

arXiv:2602.18934v2. Membership inference attacks (MIAs) threaten the privacy of machine learning models by revealing whether a specific data point was used during training. Existing MIAs often rely on impractical assumptions, such as access to public datasets, shadow models, confidence scores, or knowledge of the training data distribution, making them vulnerable to defenses like confidence masking and adversarial regularization. Label-only MIAs, even under strict constraints, suffer from high query requirements per sample. We propose a cost-effective label-only MIA framework based on transferability and model extraction. By querying the target model $M$ using active sampling, perturbation-based selection, and synthetic data, we extract a functionally similar surrogate model $S$ on which membership inference is performed. This shifts the query overhead to a one-time extraction phase, eliminating repeated queries to $M$. Our method matches the performance of state-of-the-art label-only MIAs while significantly reducing query costs and operating under strict black-box constraints. On benchmark tabular datasets, we show that a query budget equivalent to testing the membership of approximately $1\%$ of the training samples is sufficient to extract $S$ and achieve membership inference accuracy within $\pm 1\%$ of that obtained when attacking $M$ directly. We also evaluate the effectiveness of standard defenses, including DP-SGD and regularization, proposed for label-only MIAs against our attack. Finally, we present preliminary results extending our framework to deep neural networks trained on image datasets, demonstrating promising transferability and membership inference performance under label-only access while highlighting directions for further optimization.

17.
arXiv (CS.CV) 2026-06-17

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.

18.
arXiv (quant-ph) 2026-06-12

Matrix phase-space representations for quantum symmetries

arXiv:2606.12769v1. We introduce a general phase-space representation that includes global quantum symmetries in the basis expansion. This method, called matrix phase-space, projects the basis onto a reduced Hilbert space, which can greatly reduce sampling errors of many-body quantum simulations and unifies several previous phase-space methods. The purpose of this paper is to provide detailed proofs of basic theorems and operator identities. We also treat several different types of symmetries. To illustrate the benefits of matrix phase-space methods, we give a detailed derivation of a recent application to the topical problem of verifying the outputs of Gaussian boson sampling (GBS) quantum computers with photon number resolving detectors. This has exponential complexity, and using parity symmetry reduces sampling errors by very large factors relative to earlier methods.

19.
PLOS Computational Biology 2026-06-15

Environmental “knees” and “wiggles” as strong stabilizers of species’ range limits set by interspecific competition

by Farshad Shirani, Benjamin G. Freeman

Whether interspecific competition is a major contributing factor to setting species’ range limits has been debated for a long time. Theoretical studies have proposed that the interactions between interspecific competition and disruptive gene flow along an environmental gradient can halt range expansion of ecologically similar species where they meet. However, the stability of such range limits has not been well addressed. We use a deterministic mathematical model of adaptive range evolution over a continuous habitat to show that the range limits set by interspecific competition are unlikely to be evolutionarily stable if the environmental optima for fitness-related traits vary (almost) linearly in space. That is, in a linear environment without a dispersal barrier or a third (or more) species, the range borders formed between two competing species constantly move towards the weaker species. We demonstrate that environmental nonlinearities such as “knees” and “wiggles” (wherein an isolated sharp change or a step-like change occurs in the steepness of a trait optimum) can strongly stabilize competitively formed range limits. The stabilization mechanism relies on the contrast that such nonlinearities create in the level of disruptive gene flow to the peripheral population of each species, and succeeds when an additional process, such as Allee effects, prevents the establishment of an infinitesimal population in the presence of an abundant competitor. We show that the stability of the range limits at these nonlinearities is robust against moderate environmental disturbances. Whether strong disturbances such as rapid high-amplitude climate changes can destabilize such range limits depends on how the competitive dominance of the species changes across the nonlinearity. Therefore, our findings underscore the importance of assessing species’ competitive ability when predicting responses to climate change, and identify geographic regions where established range limits are likely to persist as well as regions where shifting limits may eventually stabilize.

20.
arXiv (CS.CV) 2026-06-11

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.

21.
arXiv (CS.CL) 2026-06-11

M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset

Existing real-world datasets for multimodal fact-checking have multiple limitations: they contain few instances, cover only one or two languages, focus on only one task, or rely on external news article sets for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent a diverse range of cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake image detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks affects verdict prediction performance. We make our dataset and code publicly available.

22.
arXiv (CS.AI) 2026-06-19

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

23.
arXiv (CS.AI) 2026-06-16

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On a testbed of 16 H800 GPUs, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration. We release Frontier at https://github.com/NetX-lab/Frontier.
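
For readers unfamiliar with the term, a discrete-event simulator advances time by jumping from one scheduled event to the next rather than ticking a clock. The toy sketch below shows only that general idea with a single first-in-first-out worker and a fixed service time. It is not Frontier's design or API: the function `simulate`, its parameters, and the single-worker model are all hypothetical and chosen for brevity.

```python
import heapq
from collections import deque

ARRIVAL, FINISH = 0, 1  # event kinds; arrivals sort before finishes on ties

def simulate(arrival_times, service_time):
    """Toy discrete-event loop: one FIFO worker, fixed service time.

    Returns a dict mapping request id -> completion time.
    """
    # Event queue ordered by (time, kind, request id).
    events = [(t, ARRIVAL, rid) for rid, t in enumerate(arrival_times)]
    heapq.heapify(events)
    waiting = deque()
    busy = False
    completed = {}

    while events:
        now, kind, rid = heapq.heappop(events)  # jump to the next event
        if kind == FINISH:
            completed[rid] = now
            busy = False
        else:
            waiting.append(rid)
        # Start the next waiting request whenever the worker is idle.
        if not busy and waiting:
            nxt = waiting.popleft()
            heapq.heappush(events, (now + service_time, FINISH, nxt))
            busy = True
    return completed


if __name__ == "__main__":
    print(simulate([0.0, 1.0, 1.5], service_time=2.0))
```

A serving simulator of the kind described in the abstract would replace the fixed service time with predicted computation, communication, and memory costs and would model many cooperating workers; none of that is attempted here.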

24.
medRxiv (Medicine) 2026-06-19

Extraction of Glaucoma Diagnosis, Type, and Severity from Clinical Notes using Secure Cloud-based Large Language Models

Purpose: To evaluate the performance of secure cloud-based large language models (LLMs) in extracting glaucoma diagnosis, type, and severity from free-text clinical notes in the electronic health record (EHR). Design: Retrospective chart review analysis. Participants: 1,250 subjects from the Bascom Palmer Ophthalmic Repository. Methods: Clinical notes of glaucoma-related encounters between 2014 and 2024 were extracted from the Bascom Palmer Ophthalmic Repository. Two fellowship-trained glaucoma specialists annotated clinical notes for glaucoma presence, type, and severity at the eye level. The dataset was split into development (10%), validation (10%), and test (80%) sets. Development and validation sets were used for prompt engineering and refinement, and the held-out test set was used for evaluation. Five LLMs (Claude Opus 4.6, DeepSeek-V3.2, GPT-5.2, Grok 4.1, and Qwen3.6-35B-A3B) were accessed via Azure AI Foundry within HIPAA-compliant containers. Model performance was assessed using standard metrics. Clinician-entered ICD-10 codes were also compared with adjudicated labels. Main Outcome Measures: Gwet AC1, accuracy, sensitivity, specificity, and F1-score. Results: Inter-grader agreement was high for glaucoma detection (Gwet AC1 = 0.930; 95% CI: 0.917-0.945), type classification (Gwet AC1 = 0.917; 95% CI: 0.904-0.930), and severity staging (Gwet AC1 = 0.901; 95% CI: 0.884-0.916). For glaucoma diagnosis, LLMs demonstrated high overall accuracy, with Claude achieving 97.5%, DeepSeek 96.0%, GPT 96.2%, Grok 94.4%, and Qwen 95.5%. F1 scores for glaucoma detection ranged from 95.4% to 98.9% across models. For glaucoma type classification, accuracies were 97.1%, 94.2%, 94.2%, 94.0%, and 94.4% for Claude, DeepSeek, GPT, Grok, and Qwen, respectively. F1 scores for the most prevalent type (POAG) ranged from 96.3% to 98.9%. For severity staging, accuracies were 95.0%, 94.8%, 94.5%, 94.0%, and 95.2%, respectively, with F1 scores ranging from 89.7% to 96.3% across severity categories and models. ICD-10 codes demonstrated substantially lower performance for type and severity staging, with overall accuracies of 89.2% and 58.5%, respectively. Conclusions: Secure cloud-based LLMs accurately extracted glaucoma diagnosis, type, and severity information from free-text ophthalmology notes, achieving performance approaching expert clinician adjudication while substantially outperforming ICD-based phenotyping approaches, particularly for disease severity classification. These findings demonstrate the potential of LLMs to transform unstructured clinical documentation into scalable, research-ready phenotypic data for large-scale glaucoma cohort development and EHR-based ophthalmic research.
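
The inter-grader agreement above is reported as Gwet AC1. For orientation, the sketch below computes the standard two-rater form of that coefficient, AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and pe = (1 / (Q - 1)) * sum over categories of pi_q * (1 - pi_q), with pi_q the average share of ratings in category q across both raters. The function name `gwet_ac1` and the toy labels are illustrative assumptions; the study's own computation, including its confidence intervals, is not described in the abstract and is not reproduced here.

```python
from collections import Counter

def gwet_ac1(rater1, rater2, categories=None):
    """Gwet's AC1 for two raters over nominal labels.

    categories: the full label set of the scale. If omitted, the labels
    observed in the data are used, which is an assumption of this sketch.
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater1)
    cats = list(categories) if categories is not None else sorted(set(rater1) | set(rater2))
    q = len(cats)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of items given the same label by both raters.
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement from the pooled category proportions.
    pooled = Counter(rater1) + Counter(rater2)
    pe = sum((pooled[c] / (2 * n)) * (1 - pooled[c] / (2 * n)) for c in cats) / (q - 1)

    return (pa - pe) / (1 - pe)


if __name__ == "__main__":
    grader_a = ["glaucoma", "glaucoma", "glaucoma", "none"]
    grader_b = ["glaucoma", "glaucoma", "none", "none"]
    print(round(gwet_ac1(grader_a, grader_b), 4))
```

Passing the full label set through `categories` matters when some category of the scale never appears in the sample, because Q enters the chance-agreement term directly.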

25.
arXiv (CS.CL) 2026-06-16

Revisiting the Systematicity in Negation in the Era of In-Context Learning

Understanding the meaning of negated sentences remains a challenge for language models, even in the era of large language models (LLMs). We analyze the systematicity of LLMs' understanding of negation from two perspectives: behavioral systematicity and representational systematicity. For behavioral systematicity, we confirm that through demonstrations and in-context learning, LLMs can recognize negation expressions and scope within sentences to some extent, but they fail to achieve perfect performance. In particular, the difficulty of negation scope recognition varies with the output format. For representational systematicity, we analyze the extent to which function vectors can be robustly constructed from in-context examples for tasks that are essential to understanding negation. The experiments suggest that while function vectors can be composed for negation cue extraction tasks, extracting function vectors for recognizing scope is more challenging.