Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-12

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

arXiv:2305.08175v5 Announce Type: replace-cross Abstract: Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal; however, it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.

02.
arXiv (CS.CV) 2026-06-11

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
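The zero-initialization idea in the abstract above can be illustrated with a small sketch. It is not the SceneMiner code: the `ZeroInitAdapter` class, its residual form, and the layer sizes are hypothetical, chosen only to show why a newly added sub-module whose output projection starts at zero leaves the shared activation stream unchanged at initialization.

```python
import random


def linear(weights, bias, x):
    """Plain dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class ZeroInitAdapter:
    """Hypothetical residual sub-module: out = x + W_up @ relu(W_down @ x + b_down) + b_up."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # The down-projection is random, as in any freshly added layer.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
        self.b_down = [0.0] * hidden
        # The up-projection is all zeros, so the residual branch contributes nothing yet.
        self.w_up = [[0.0] * hidden for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, x):
        h = [max(0.0, v) for v in linear(self.w_down, self.b_down, x)]
        delta = linear(self.w_up, self.b_up, h)
        return [xi + di for xi, di in zip(x, delta)]


shared_stream = [0.3, -1.2, 0.7, 2.5]
adapter = ZeroInitAdapter(dim=4, hidden=8)
# At initialization the adapter is an exact identity on the shared stream.
print(adapter(shared_stream) == shared_stream)
```

Because the residual branch outputs exact zeros, any frozen head reading the stream sees the same values as before the adapter was added. The sketch does not cover what happens once the adapter is trained, or the abstract's second ingredient of freezing every parameter that feeds the shared stream.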

03.
arXiv (quant-ph) 2026-06-24

Biophysical EPR Using Superconducting Resonators

arXiv:2606.23952v1 Announce Type: new Abstract: We present innovations that enable the use of superconducting resonators for high sensitivity, high bandwidth pulsed electron paramagnetic resonance (EPR) measurements on biologically relevant samples with enhanced stability and throughput. A custom-built X-band pulsed EPR spectrometer with AWG and digital IF capability generated by an FPGA was used to control a novel patterned thin film planar superconducting microstrip resonator capable of generating Rabi fields sufficient to achieve 6 ns pi/2 Gaussian pulses using a 100 W solid-state HPA. The system allows automated sequential calibration, measurement, and analysis of five 3.5 uL samples contained in a sample cartridge. Performance was validated through measurements of double electron-electron resonance (DEER) distances in a variety of spin-labeled protein samples with biologically relevant concentrations, including measurements below 10 uM. The results enable broadening the scope of applications for both superconducting resonators and the use of EPR in biotechnology.

04.
arXiv (math.PR) 2026-06-18

Stability of Khintchine-type inequalities via log-monotonicity

arXiv:2606.19313v1 Announce Type: new Abstract: We investigate Khintchine-type inequalities for the weighted sums $S=\sum_ka_kX_k$ of independent copies of a symmetric random variable $X$. We show how log-monotonicity of the sequence $r_k(X)=k! \mathbb{E}[X^{2k}]/(2k)!$ implies sharp comparisons between the $L_p$ and $L_2$ norms of $S$ for every even integer $p\geq 2$, extending classic Khintchine-type inequalities and yielding new results in the log-convex setting. We also investigate the stability of our inequalities. Our first stability inequality sharpens the classic inequality by a deviation of the coefficient vector from the coordinate extremizers, while the second quantifies deviation from the Gaussian limit. Our results recover recent stability inequalities for random signs and apply to a broad class of distributions, including type-$\mathscr{L}$ random variables, ultra sub-Gaussian random variables and Gaussian mixtures.
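As a worked illustration of the sequence $r_k(X)=k!\,\mathbb{E}[X^{2k}]/(2k)!$ from the abstract above, the following sketch evaluates it exactly for two textbook cases: random signs, where $\mathbb{E}[X^{2k}]=1$, and a standard Gaussian, where $\mathbb{E}[X^{2k}]=(2k)!/(2^k k!)$. The function names are ours, and whether the ratio behaviour shown here matches the paper's precise notion of log-monotonicity is an assumption.

```python
from fractions import Fraction
from math import factorial


def r_signs(k):
    """r_k for a random sign: E[X^(2k)] = 1, so r_k = k! / (2k)!."""
    return Fraction(factorial(k), factorial(2 * k))


def r_gaussian(k):
    """r_k for a standard Gaussian: E[X^(2k)] = (2k)! / (2^k k!), so r_k = 2^(-k)."""
    moment = Fraction(factorial(2 * k), 2**k * factorial(k))
    return factorial(k) * moment / factorial(2 * k)


def consecutive_ratios(r, n):
    """Ratios r_{k+1} / r_k for k = 0, ..., n-1."""
    return [r(k + 1) / r(k) for k in range(n)]


print(consecutive_ratios(r_signs, 3))     # ratios 1/2, 1/6, 1/10
print(consecutive_ratios(r_gaussian, 3))  # ratios 1/2, 1/2, 1/2
```

For random signs the ratios $r_{k+1}/r_k = 1/(2(2k+1))$ strictly decrease, so the sequence is log-concave in the standard sense. For the Gaussian the ratio is constant, so $\log r_k$ is linear in $k$, which is the boundary case between log-concave and log-convex sequences.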

05.
arXiv (CS.AI) 2026-06-12

What Type of Inference is Active Inference?

arXiv:2606.04935v2 Announce Type: replace Abstract: Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.

06.
arXiv (CS.LG) 2026-06-24

Experiments with Optimal Model Trees

arXiv:2503.12902v4 Announce Type: replace Abstract: Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to "classic" decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.

07.
arXiv (quant-ph) 2026-06-15

Electromagnetic Wightman functions and vacuum densities for a brane intersecting the AdS boundary

arXiv:2604.17583v2 Announce Type: replace-cross Abstract: We investigate the combined effects of a brane intersecting the AdS boundary and background gravitational field on the local characteristics of the electromagnetic vacuum. Two types of boundary conditions on the brane are considered, which are higher-dimensional generalizations of the perfect electric (PEC) and perfect magnetic (PMC) boundary conditions in Maxwell's electrodynamics. The brane-induced contributions to the Wightman functions of the vector potential and field tensor are explicitly extracted. Simple expressions in terms of elementary functions are provided. The behavior of the vacuum expectation values (VEVs) is mimicked by a scalar field with a negative effective mass squared determined by the radius of the AdS spacetime. The expectation values of the squared electric and magnetic fields and of the energy-momentum tensor are investigated as local characteristics of the vacuum state. The brane-induced contributions to these VEVs have opposite signs for the PEC and PMC conditions. For the PMC condition, this contribution is negative for the electric field squared and positive for the magnetic field squared. The VEV of the energy-momentum tensor has a nonzero off-diagonal component. The brane-induced vacuum energy density is positive for the PMC condition, whereas the normal and parallel stresses change sign as functions of the distance from the brane. Unlike the problem involving a planar boundary in the Minkowski bulk, the vacuum energy-momentum tensor does not vanish in (3+1)-dimensional AdS spacetime.

08.
arXiv (CS.LG) 2026-06-17

Resource-Efficient Variational Quantum Classifier

arXiv:2511.09204v3 Announce Type: replace-cross Abstract: We introduce the unambiguous quantum classifier based on Hamming distance measurements combined with classical post-processing. The proposed approach improves classification performance through a more effective use of ansatz expressivity, while requiring significantly fewer circuit evaluations. Moreover, the method demonstrates enhanced robustness to noise, which is crucial for near-term quantum devices. We evaluate the proposed method on a breast cancer classification dataset. The unambiguous classifier achieves an average accuracy of 90%, corresponding to an improvement of 6.9 percentage points over the baseline, while requiring eight times fewer circuit executions per prediction. In the presence of noise, the improvement is reduced to approximately 3.1 percentage points, with the same reduction in execution cost. We substantiate our experimental results with theoretical evidence supporting the practical performance of the approach.

09.
arXiv (CS.AI) 2026-06-15

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

arXiv:2606.14202v1 Announce Type: cross Abstract: Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

10.
arXiv (CS.CV) 2026-06-15

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.

11.
arXiv (CS.LG) 2026-06-15

How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

arXiv:2606.13733v1 Announce Type: cross Abstract: Multi-agent systems (MAS) were expected to overcome the limitation of single-agent systems (SAS) through collaboration. However, under typicality conditions on the task's constraint graph and bounded inter-agent communication, we prove that the success probability of a MAS is closely tied to the connectivity of task constraints, where each agent has limited information-processing capacity. Specifically, the success probability decays exponentially with an information bottleneck that emerges from partitioning the task's constraint graph among agents. We define this quantity as the minimum cut cost $C_{\min}$ of the potential constraint graph of each task. This information-theoretic bound applies to both open systems with external feedback and closed systems without. We validate our theory on both synthetic experiments and real-world empirical data from SWE-bench submissions. From our framework, effective MAS design should incorporate task-inherent constraints alongside engineering optimization, and when $C_{\min}$ is high, practitioners should restructure tasks rather than simply scaling agents or communication.
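To make the minimum cut cost $C_{\min}$ from the abstract above concrete, here is a toy sketch for two agents. Everything in it is an assumption made for illustration: the brute-force search over bipartitions, the unit edge weights, the example graph, and the decay rate `alpha` in the exponential. The abstract states only that success probability decays exponentially with the bottleneck, not this exact formula.

```python
from itertools import combinations
from math import exp


def cut_cost(edges, part_a):
    """Total weight of constraint edges whose endpoints land on different agents."""
    return sum(w for u, v, w in edges if (u in part_a) != (v in part_a))


def min_cut_two_agents(nodes, edges):
    """Brute-force minimum cut over all splits of the nodes into two non-empty parts."""
    anchor, rest = nodes[0], nodes[1:]
    best_cost, best_part = None, None
    # Fixing `anchor` in part A avoids counting each split twice;
    # sizes stop at len(rest) - 1 so that part B is never empty.
    for size in range(len(rest)):
        for subset in combinations(rest, size):
            part_a = {anchor, *subset}
            cost = cut_cost(edges, part_a)
            if best_cost is None or cost < best_cost:
                best_cost, best_part = cost, part_a
    return best_cost, best_part


def illustrative_success_bound(c_min, alpha=1.0):
    """Hypothetical exponential decay exp(-alpha * C_min); alpha is not from the paper."""
    return exp(-alpha * c_min)


# Two tightly coupled sub-tasks (triangles) linked by a single constraint c-d.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b", 1), ("b", "c", 1), ("c", "a", 1),
         ("d", "e", 1), ("e", "f", 1), ("f", "d", 1),
         ("c", "d", 1)]
c_min, part = min_cut_two_agents(nodes, edges)
print(c_min, sorted(part))  # 1 ['a', 'b', 'c']
```

In this toy graph the cheapest split follows the single bridging constraint, so each agent can own one triangle. A densely connected task graph has no cheap split, which is the situation where the abstract recommends restructuring the task rather than adding agents.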

12.
medRxiv (Medicine) 2026-06-15

Mucosal and Systemic Antibodies Associated with Clinical Protection in a Pertussis Controlled Human Infection Model

Background The engagement of mucosal and systemic immunity in preventing Bordetella pertussis colonization and infection in humans, the impact of prior vaccination on host immunity and protective outcomes, and the dynamics of the host response following exposure remain poorly understood. Methods Healthy adults were challenged intranasally with increasing doses, $10^6$-$10^8$ colony-forming units (CFUs), of B. pertussis D420 (NCT05136599). Shedding (PCR and culturing) and symptom development were monitored up to 21 days post-challenge. Serum and nasal wash IgA and IgG were measured before challenge (baseline) and up to 6 months post-challenge. Findings Antibodies increased post-challenge only in infected individuals, primarily nasal IgA. Participants who remained uninfected had higher baseline levels of filamentous hemagglutinin (FHA)-specific mucosal IgA and IgG, and higher serum IgA against fimbriae 2/3 (FIM). FHA was negatively associated with bacterial load and was a key discriminator between shedders and non-shedders, up to one week post-challenge. By day 14 post-challenge, pertussis toxin (PT) IgG and FIM IgA in both serum and mucosal samples were negatively associated with bacterial colonization. The majority (96.7%) of acellular pertussis (aP) vaccine recipients (n=23, median age 2.0 years) became infected, compared to 69.4% of those who received whole-cell pertussis vaccine (n=36; median age 32.0 years), and their antibody responses remained distinct following infection. Interpretation Nasal FHA antibodies emerged as early predictors of protection against pertussis infection, while PT IgG and FIM IgA antibodies may reflect clearance after infection. aP-primed individuals were more susceptible to infection, despite their younger age and more recent vaccination.
Funding CDC Contract #75D30122C15467 and CDC IPA Agreement #24IPA2417512 Disclaimer: The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention, US Department of Health and Human Services.

13.
arXiv (CS.CV) 2026-06-24

TuringViT: Making SOTA Vision Transformers Accessible to All

Modern VLMs and VLA systems commonly adopt off-the-shelf ViTs such as SigLIP2 as visual encoders, but diverse downstream requirements in latency, temporal modeling, and VLM integration often call for customized SOTA-level ViTs. Training such encoders remains beyond the reach of much of the community, as it requires massive image-text data, while standard softmax attention makes high-resolution or dynamic-resolution pretraining prohibitively costly and often forces low-resolution pretraining followed by post-hoc adaptation. TuringViT addresses these challenges with three key designs: Turing Linear Attention (TLA) for efficient sequence modeling, VISTA-Curation to construct supervision-rich image-video training data, and native dynamic-resolution pretraining that supports flexible inputs from the start and transfers seamlessly to downstream VLMs. As a result, TuringViT outperforms leading open-source ViT baselines with only 10% of the data, achieves stronger downstream VLM performance, and delivers substantially better latency scaling on high-resolution inputs. Our scaling-law analysis further shows that TuringViT continues to improve predictably with curated data scale, far from saturation. Its fast adaptation, hardware-friendly design, and efficient deployment have made it a unified visual foundation across XPeng's AI systems. More broadly, TuringViT provides a reproducible pipeline that dramatically lowers the cost for the community to train, customize, and deploy SOTA-level ViTs, moving toward making such Vision Transformers accessible to all.

14.
arXiv (math.PR) 2026-06-16

An Analytical Methodology for Quantifying Airspace Conflict Rate and Complexity

arXiv:2606.14897v1 Announce Type: cross Abstract: Air traffic growth, advanced air mobility, and increasingly autonomous operations are driving the need for scalable and adaptive airspace design methodologies. Central to this challenge is quantifying how traffic flow structure and demand, governed in part by airspace geometry, influence conflict generation and operational complexity. This paper presents an analytical framework for computing conflict rate and conflict probability in structured airspace using stochastic flow models. Traffic streams are modeled as renewal processes with prescribed inter-arrival time distributions, while interactions between flows are captured through geometry-dependent minimum spacing constraints at merges and crossings. Within this formulation, closed-form upper bounds on the expected conflict rate and conflict probability per aircraft are derived as functions of flow configuration and demand. These metrics are interpreted as complementary measures of airspace complexity, reflecting controller workload and per-aircraft operational risk. The methodology is applied to representative hexagonal cell geometries with varying routing structures and flow distributions. Results reveal non-monotonic tradeoffs between routing flexibility, capacity, and conflict generation, with intermediate flow configurations outperforming both highly constrained and highly distributed cases. The proposed framework provides a tractable tool for evaluating airspace design alternatives and complexity-informed traffic management strategies.

15.
arXiv (CS.LG) 2026-06-19

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

arXiv:2606.20035v1 Announce Type: cross Abstract: Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic–exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
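The stabilization described in the abstract above (a smooth positivity mapping followed by log-domain clipping) can be sketched for a single scalar product unit. This is a minimal illustration under our own assumptions, not the PU-UNet block: the choice of softplus as the positivity map, the `eps` offset, the clip range, and clipping each log term rather than the sum are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_product_unit(inputs, exponents, eps=1e-6, clip=10.0):
    """Compute prod_i p(x_i)^(w_i) in the log domain, where p(x) = softplus(x) + eps.

    Each log term is clipped to [-clip, clip] so that inputs near zero or very
    large inputs cannot push the exponent to extreme values.
    """
    log_sum = 0.0
    for x, w in zip(inputs, exponents):
        log_x = math.log(softplus(x) + eps)    # argument is positive, so log is defined
        log_x = max(-clip, min(clip, log_x))   # log-domain clipping
        log_sum += w * log_x
    return math.exp(log_sum)


# Ordinary inputs: behaves like a plain product of the positive-mapped features.
print(stable_product_unit([1.0, 2.0], [1.0, 1.0]))
# Extreme input: a naive log(softplus(-1000.0)) would fail on log(0.0),
# while the clipped version returns a small finite value.
print(stable_product_unit([-1000.0], [1.0]))
```

With moderate exponents the clipped log-sum stays far from the overflow range of `math.exp`. A trained network would additionally need to keep the learned exponents bounded, which this sketch does not address.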

16.
arXiv (CS.CL) 2026-06-15

ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

Fixed-cardinality retrieval injects a constant top-K chunks into the generator regardless of query complexity, causing over-retrieval for narrow queries and under-retrieval for compositional ones. We describe ScoreGate, a lightweight score-space decision mechanism that controls retrieval cardinality at inference time using two scores already produced by the standard pipeline: bi-encoder similarity s_i and cross-encoder reranker score r_i, with no additional model inference calls required. Its core insight is that cross-encoder affirmation can rescue semantically relevant chunks that bi-encoder retrieval ranks poorly due to vocabulary mismatch – a failure mode unaddressed by fixed-K or single-score thresholding. On MS MARCO (200 dev queries), ScoreGate achieves MRR@10 = 0.401 with 35% fewer retained chunks than Standard Top-K. On an internal benchmark (n=300, Fleiss' kappa=0.87), ScoreGate observed zero false positives (95% CI [96.4%, 100%]) at 97.77-99.34% recall, with 34.8% fewer tokens per query and only 31ms added latency. Results on both MS MARCO and real-world production traffic suggest that adaptive retrieval cardinality can improve retrieval efficiency without degrading retrieval quality.
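The abstract above does not spell out the fusion rule, so the following is only a hypothetical sketch of how two existing scores could drive adaptive cardinality. The z-score normalization, the equal-weight average, and both thresholds are our assumptions, not ScoreGate's actual statistics. The sketch does reproduce the stated core idea: a strongly affirmed reranker score can rescue a chunk that the bi-encoder ranked poorly.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize scores within one query's candidate list."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def select_chunks(bi_scores, ce_scores, fuse_threshold=0.3, rescue_threshold=0.9):
    """Return indices of chunks to keep; the number kept varies per query.

    A chunk is kept if its fused (averaged) z-score clears `fuse_threshold`,
    or if the cross-encoder score alone clears `rescue_threshold`.
    Both thresholds are illustrative values, not tuned ones.
    """
    z_bi, z_ce = zscores(bi_scores), zscores(ce_scores)
    kept = []
    for i, (zb, zc, r) in enumerate(zip(z_bi, z_ce, ce_scores)):
        fused = 0.5 * (zb + zc)
        if fused >= fuse_threshold or r >= rescue_threshold:
            kept.append(i)
    return kept


bi = [0.82, 0.80, 0.41, 0.40, 0.39]   # bi-encoder similarities s_i
ce = [0.95, 0.30, 0.97, 0.10, 0.05]   # cross-encoder reranker scores r_i
print(select_chunks(bi, ce))                        # [0, 1, 2]
print(select_chunks(bi, ce, rescue_threshold=2.0))  # [0, 1]
```

Chunk 2 has a weak bi-encoder score but the highest reranker score, so it survives only through the rescue rule. Raising the rescue threshold above the attainable score range disables the rule and the chunk is dropped.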

17.
arXiv (CS.AI) 2026-06-16

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

arXiv:2606.14817v1 Announce Type: cross Abstract: This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

18.
arXiv (CS.CL) 2026-06-17

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
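The calibration idea in the abstract above can be sketched for a single attention head. This is a simplified illustration under our own assumptions: the threshold is taken as the average k-th largest score over calibration rows, the scores are treated as generic attention scores, and the compensation techniques mentioned in the abstract are not modeled.

```python
import math


def calibrate_threshold(calibration_rows, k):
    """Static threshold for one head: mean of the k-th largest score in each row."""
    kth_values = [sorted(row, reverse=True)[k - 1] for row in calibration_rows]
    return sum(kth_values) / len(kth_values)


def sparsify_row(row, theta):
    """Keep (index, score) pairs at or above the fixed threshold; no sorting needed."""
    return [(i, s) for i, s in enumerate(row) if s >= theta]


# Calibration: choose theta so that about k = 2 elements survive per row.
calibration_rows = [
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.2, 0.6, 0.1],
]
theta = calibrate_threshold(calibration_rows, k=2)
print(math.isclose(theta, 0.55))                 # True

# Inference: the same static theta is applied to new rows.
print(sparsify_row([0.7, 0.56, 0.2, 0.1], theta))   # [(0, 0.7), (1, 0.56)]
print(sparsify_row([0.9, 0.6, 0.58, 0.1], theta))   # three elements survive
```

Unlike top-k selection, the threshold test is an elementwise comparison, so no per-row sort is required at inference time. The number of surviving elements matches k only on average, which is the trade-off a calibrated static threshold makes.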

19.
arXiv (CS.CV) 2026-06-19

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.

20.
arXiv (CS.AI) 2026-06-12

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

arXiv:2606.13051v1 Announce Type: new Abstract: Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and secondly, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

21.
arXiv (CS.CL) 2026-06-18

TW-LegalBench: Measuring Taiwanese Legal Understanding

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill a gap in evaluating LLMs on Taiwanese law: existing common-law benchmarks focus on English sources, and civil-law benchmarks focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

22.
arXiv (CS.AI) 2026-06-11

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

arXiv:2509.23248v3 Announce Type: replace Abstract: The rapid advancement of large language models (LLMs) has enabled the emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. Its integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we systematically review enhancement methods to identify mechanisms suitable for edge adaptation. Subsequently, we present a distributed framework that combines reasoning enhancement via adaptive chain-of-thought (CoT) prompting with scalable deployment through a distributed mixture-of-experts (MoE) architecture. An important innovation of this approach is modeling reasoning depth as a dynamic network resource variable, which is optimized jointly with expert activation and transmission power. This mechanism allows the system to dynamically regulate expert networks and reasoning complexity according to task requirements and device capabilities. Experimental evaluations in mobile edge environments demonstrate that the proposed framework effectively balances reasoning quality and resource efficiency. The results show that with less than one second of additional inference time, both accuracy and latency satisfaction rate can reach 90%, validating the practical viability of deploying sophisticated LLM reasoning in resource-constrained MEGI systems.

23.
arXiv (CS.CV) 2026-06-11

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

With the development of video understanding, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, a unified framework capable of addressing multiple tasks simultaneously is still lacking, and building one is a promising direction for the next generation of AI. To this end, we propose a single unified framework, called Temporal2Seq, that formulates the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from the TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of the three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and achieve advantages over single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, where it outperforms the task-specific model.

24.
arXiv (CS.CL) 2026-06-16

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy – the number of distinct reasoning steps required to answer a clinical question from an EHR – as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p

25.
arXiv (CS.CL) 2026-06-19

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. Achieving this requires comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features, drawn from the scientific literature. This literature contains valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that uses a domain-adapted Small Language Model to answer Mars terraforming-related questions and to convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.