Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
medRxiv (Medicine) 2026-06-17

Short-term relaxation after cervical rotatory manipulation is more closely associated with somatosensory input than cracking sound: a randomized controlled EEG study

Background Cervical rotatory manipulation is commonly used for neck-related symptoms and is often accompanied by a cracking sound. This sound is frequently regarded as a sign of successful manipulation, but whether it contributes substantially to the immediate relaxation response remains unclear. Objective This study examined whether short-term relaxation after cervical rotatory manipulation is more closely related to manipulation-associated sensory input than to the cracking sound cue alone. Methods In this single-session, three-arm, parallel randomized controlled study, 54 healthy volunteers were allocated to cervical rotatory manipulation, sham manipulation, or sham manipulation plus simulated cracking sound. Subjective outcomes were assessed before and after intervention, including positive affect, negative affect, comfort, and satisfaction. Eyes-closed resting-state electroencephalography was recorded before and after intervention. Prespecified neural outcomes included frontal alpha power, frontal alpha/beta ratio, occipital individual alpha frequency, and alpha-band fronto-parietal and fronto-temporal functional connectivity. Results Cervical rotatory manipulation produced greater improvements in positive affect, comfort, and satisfaction than sham manipulation or sham manipulation plus simulated cracking sound, whereas negative affect remained generally stable across groups. These subjective responses were accompanied by short-term electroencephalography changes, particularly in frontal alpha/beta and alpha-band fronto-parietal and fronto-temporal functional connectivity. Changes in frontal alpha/beta ratio were positively associated with changes in positive affect. In contrast, simulated cracking sound alone did not reproduce the full subjective or electroencephalography response observed after real manipulation. Conclusions The immediate relaxation response after cervical rotatory manipulation appears to be more closely related to manipulation-associated sensory input than to the cracking sound cue alone. These findings provide preliminary neurophysiological evidence for distinguishing real manipulation effects from sound-related contextual cues.

02.
arXiv (CS.CV) 2026-06-16

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

03.
arXiv (math.PR) 2026-06-19

The systole of random hyperbolic 3-manifolds

arXiv:2406.11783v2 Announce Type: replace-cross Abstract: We study the systole of a model of random hyperbolic 3-manifolds introduced by Petri and Raimbault, answering a question posed in that same article. These are compact manifolds with boundary constructed by randomly gluing truncated tetrahedra along their faces. We prove that the limit, as the volume tends to infinity, of the expected value of their systole exists and we give a closed formula of it. Moreover, we compute a numerical approximation of this value.

04.
arXiv (CS.CL) 2026-06-15

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.

05.
arXiv (math.PR) 2026-06-18

Rigidity of infinite exchangeable sequences with Gaussian marginals

arXiv:2606.18654v1 Announce Type: new Abstract: We study infinite exchangeable sequences with Gaussian one-dimensional marginals. We formulate the conjecture that joint Gaussianity of a single pair of coordinates forces the entire sequence to be a Gaussian process. Although this conjecture remains open, we prove that joint Gaussianity of the first four coordinates is sufficient. We also establish the corresponding two-point criterion under the additional assumption that the directing measure is almost surely infinitely divisible.

06.
arXiv (CS.AI) 2026-06-12

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

arXiv:2606.12976v1 Announce Type: new Abstract: Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LATEX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image to LATEX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LATEX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LATEX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning

07.
arXiv (CS.AI) 2026-06-18

DeepInflation: an AI agent for research and model discovery of inflation

arXiv:2601.14288v2 Announce Type: replace-cross Abstract: We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi-agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval-augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single-field slow-roll inflationary potentials consistent with the latest observations (with the ACT DR6 results taken as an example) or any given $n_s$ and $r$, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non-experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy-cosmo/DeepInflation.

08.
arXiv (quant-ph) 2026-06-24

Quantum turbulence in the many-body regime

arXiv:2606.23822v1 Announce Type: cross Abstract: We discuss phenomenology associated with turbulent hydrodynamics in quantum fluids from a condensed-matter perspective. We begin with weakly-interacting superfluids, often modeled by a mean-field theory governed by the Gross-Pitaevskii equation. Considering the effect of quantum fluctuations beyond the mean-field approximation, we propose a study of many-body quantum effects in turbulent hydrodynamics, especially near zero temperature. We motivate examples of quantum many-body systems where such effects may be uncovered. These include bosons confined in a periodic potential in low spatial dimensions (one and two), and the associated quantum critical point of the superfluid-insulator transition, realized in present-day ultracold-atom and quantum computing platforms. We conclude by listing a set of (open) questions that may be answered using modern quantum many-body techniques. This article is part of the theme issue 'Frontiers of turbulence and statistical physics'.

09.
arXiv (CS.CL) 2026-06-16

Can LLM Coding Agents Reason About Time Series?

Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.

10.
arXiv (CS.AI) 2026-06-24

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

arXiv:2606.23743v1 Announce Type: cross Abstract: Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective acceleration strategy is highly instance-specific: a recipe that works well for one combination of model, hardware, and inference configuration often does not transfer to another. Different models vary in architecture, numerical sensitivity, and attention concentration patterns. Inference settings differ in spatial and temporal resolution and video duration, while hardware platforms differ in memory hierarchy, supported numerical formats, and kernel throughput. These factors create a large tuning space, making manual performance engineering costly. We present Sol Video Inference Engine, an agentic, native, training-free acceleration framework for video diffusion models. It organizes five broadly applicable techniques, cache, sparse attention, token pruning, quantization, and kernel fusion, into an agentic acceleration stack for instance-specific optimization. For a concrete deployment target defined by a model, hardware platform, and serving configuration, parallel skill agents optimize the implementation of each technique, an agent integrator composes them into a global acceleration stack, and a human validator provides feedback on generation quality. We instantiate this workflow on three video models with different sizes and architectures: 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. With little human effort, the full stack achieves more than 2x end-to-end acceleration while maintaining near-lossless VBench quality, demonstrating the effectiveness of the agent framework for video diffusion acceleration.

11.
arXiv (math.PR) 2026-06-16

Atypical Decay Rates for Atypical Heights in Random Recursive Trees

arXiv:2604.20139v2 Announce Type: replace Abstract: We establish the large deviation probabilities for the height of random recursive trees, revealing polynomial upper-tail decay and stretched-exponential lower-tail decay. Remarkably, the lower tail features an atypical prefactor that grows to infinity more slowly than any $n$-fold iterated logarithm.

12.
arXiv (CS.CL) 2026-06-11

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

Authors:

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation – Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition – the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p

13.
arXiv (CS.AI) 2026-06-12

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

arXiv:2606.12841v1 Announce Type: cross Abstract: Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

14.
arXiv (CS.AI) 2026-06-18

Recursive Joint Simulation in Games

arXiv:2402.08128v3 Announce Type: replace Abstract: Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds ``from the inside'' – meaning, for an agent that finds itself inside the game and has self-locating uncertainty.

15.
arXiv (CS.AI) 2026-06-24

When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs

arXiv:2606.24370v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into decision-support roles in business and policy contexts. While prior benchmark studies have primarily evaluated LLMs' causal reasoning capabilities, a more fundamental epistemic dimension has been overlooked: Causal Caution, defined as the propensity to refrain from causal judgment when empirical evidence is insufficient. This study examines the systematic suppression of Causal Caution that occurs when LLMs shift from academic to practical advisory contexts. Using an evaluation rubric inspired by Pearl's Causal Hierarchy (the PCH score), we conducted experiments on four high-performance LLMs – Claude Sonnet 4.6, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro – across 480 trials. Causal Caution maintenance rates were 91.7–100.0% in academic contexts but dropped to 6.7–18.3% in practical advisory contexts (Fisher's exact test, p < .001 across all models). Furthermore, when restricted to practical prompts requesting concrete recommendations or explanatory rationales, only 1 of 200 responses (0.5%) maintained Causal Caution. A brief self-correction prompt – "Please reconsider this judgment from the perspective of causal relationships" – restored the expression of Causal Caution to maintenance rates of 71.4–100.0% (McNemar's test, p < .001 across all models). These results suggest that helpfulness-oriented response patterns may suppress the expression of Causal Caution in practical advisory contexts, with important implications for organizational governance. The findings indicate that this suppression reflects context-dependent variation in expression rather than an underlying capability limitation, suggesting that multi-agent architectures that separate proposal generation from causal auditing may offer a promising governance design.

16.
arXiv (quant-ph) 2026-06-16

Entanglement as a Witness of Quantum Coherence: A Bipartite Monty-Hall Protocol

arXiv:2604.25953v3 Announce Type: replace Abstract: We present a bipartite protocol inspired by the Monty Hall puzzle that operationally distinguishes quantum coherence from classical ignorance. A principal qutrit is entangled with an ancillary qutrit via a controlled unitary, preparing $|\Psi\rangle = \frac{1}{\sqrt{3}}(|A,0\rangle + |B,1\rangle + |C,2\rangle)$. A rank-1 projective discard then eliminates one basis state, leaving a coherent superposition of the two remaining states. Finally, the ancilla and qutrit are measured, yielding joint probabilities that encode the interplay between superposition and measurement back-action. We show that the conditional probability $P(B|anc=0)$ takes the value $1/4$ in both quantum mechanics and the classical ignorant-host model, making it unsuitable as a witness. The true quantum-classical separation emerges in conditional joint probabilities that correlate ancilla outcomes with specific discard operations. We define witnesses $\mathcal{W}_{i,j} = P(anc=i, qutrit=j \mid discard k)$ where $j$ differs from the ancilla-implied state. Quantum mechanics predicts $\mathcal{W} = 1/4$, while any classical epistemic model with perfect initial correlations yields $\mathcal{W} = 0$. We provide the explicit $9 \times 9$ unitary matrix, a complete analysis of all measurement outcomes, and a detailed proof of the violation. The witness is fully immune to white noise and robust against moderate dephasing. The protocol requires only a single pair of entangled qutrits and sequential measurements – no spatial separation, no multiple copies, and no complex sets of incompatible observables. This makes it suitable for advanced undergraduate laboratories and provides a pedagogically accessible test of the ontic-epistemic distinction in quantum foundations.

17.
bioRxiv (Bioinfo) 2026-06-11

GeroEngine: Generative single-cell aging trajectories reveal a bidirectionally traversable identity core and direction-specific inflammatory remodeling

Authors:

Single-cell RNA sequencing (scRNA-seq) maps aging tissues at high resolution but is destructive, preventing longitudinal tracking; dropout and zero-inflation artifacts, amplified by shift-invariant linear simulations, confound age-associated variability. We developed GeroEngine, a technical-artifact-aware framework combining VAE-based trajectory simulation, LOPO cross-validation, linear baselines, reverse traversal, and reverse-directed network inference. In microglia and HSCs, the VAE reduced technical-artifact carryover while preserving trajectory heterogeneity and improving alignment to artifact-reduced reference manifolds. Consensus GeroTargets and GeroRegulators defined tissue-specific GeroNetworks organized into three pillars: lineage/replication identity collapse, a sex-dimorphic endocrine/stress core, and inflammatory remodeling. Forward and reverse simulations aligned to the common young[-&gt;]old aging axis revealed a sign-coherent, direction-specific program: identity/replication targets were bidirectionally recovered, whereas MHC/NF-{kappa}B inflammatory programs were preferentially forward-recovered. These results support identity collapse as a deep traversable core of aging and nominate upstream homeostatic restoration over downstream inflammatory suppression.

18.
arXiv (CS.LG) 2026-06-12

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

arXiv:2606.13221v1 Announce Type: new Abstract: Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations.To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo .

19.
arXiv (CS.AI) 2026-06-24

InSight: Self-Guided Skill Acquisition via Steerable VLAs

arXiv:2606.24884v1 Announce Type: cross Abstract: Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.

20.
bioRxiv (Bioinfo) 2026-06-15

SMLMFlow: Improving Structural Resolution in Single Molecule Localization Microscopy with Flow Matching

While Single Molecule Localization Microscopy (SMLM) aims to generate precise coordinates of molecular targets in cells, the resulting point clouds are inherently blurred by additive noise sources across the experimental, imaging, and processing workflow. This blurring often limits SMLM's ability to accurately quantify complex assembled structures required to address biological issues, despite reported localization precision down to a couple of nanometers. Here, we present SMLMFlow, a machine learning framework for improving structural resolution in SMLM datasets that combines a graph neural network and a hierarchical transformer with flow matching. We show that SMLMFlow improves structural resolution and downstream quantification across different structures, including filaments and protein nano-clusters, and generalizes to new unseen photophysics models.

21.
medRxiv (Medicine) 2026-06-10

Assessment of the accuracy of lung lesions diagnosis in adolescents with osteosarcoma using artificial intelligence

Background. Lung metastases in osteosarcoma (OS) are the main cause of the death. The accuracy of the diagnosis of nodules by computed tomography (CT) of the lungs is critically important for determining the disseminated stage of the disease and planning surgical treatment. The use of artificial intelligence (AI) in the search for lung nodules increases the accuracy of diagnosis and reduces the chance of missing metastases. Objective: to evaluate the accuracy of lung nodules diagnosis in adolescents with OS using AI. Methods. A retrospective assessment of CT scans of adolescents with OS was performed. A pathological nodule with an average size of [&ge;]4 mm was considered a target finding. The diagnostic accuracy of an AI algorithm previously trained on an adult dataset was evaluated, and the number of false positives (FP) and false negatives (FN) was determined. Sensitivity, specificity, accuracy, area under the ROC curve (AUC), positive predictive value, negative predictive value, and F1-measure were calculated. Based on the obtained results, the effectiveness of the algorithm was assessed. Results. 248 CT scans of adolescents with OS were evaluated. The following results were obtained: in 5 cases, the AI algorithm showed a FP result (2.02%), in 34 cases, it showed a FN result (13.71%), and in 209 cases, a correct result (both true positive and true negative) (84.27%). The diagnostic accuracy of the algorithm was 0.843 (95% CI 0.794-0.887). The application of the AI algorithm in the practice of an X-ray doctor in a specific clinical task would allow to increase the sensitivity from 0.805 to 0.891, while ensuring an absolute decrease in the number of FN results by 8.59% and a relative decrease by 44%. Conclusion. The obtained results confirm the practical value of the application of the AI algorithm and justify the implementation of AI-assisted systems in the diagnostic protocols for lung metastases in adolescents with OS.

22.
arXiv (CS.CL) 2026-06-17

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.

23.
arXiv (CS.AI) 2026-06-24

SemChunk-C: Semantic Segmentation for C Code

arXiv:2606.23697v1 Announce Type: cross Abstract: Semantic segmentation of code written in a C-family language remains a challenging problem, due to the language's complex syntax, macro expansion, and irregular structural patterns. Existing chunking methods, such as fixed-sized windows, heuristic splitting, and syntax-based tools, often fail to capture meaningful functional units, limiting the efficacy of retrieval and other downstream LLM driven tasks. In this paper, we address the problem of chunking in C-related languages. First, we define a set of code chunk categories. Second, we train an LLM-based classifier to a) identify chunk boundaries, and b) assign each chunk a descriptive functional attribute (a category), which can be useful for downstream tasks. By leveraging the LLM's ability to capture semantic context within the code, we assume flexible chunk boundaries, allowing to adapt to the specific structure and context of each instance. Third, we introduce SemChunk-C, a family of lightweight language models for semantic chunking of C-related files (.c, .cpp, .h, .cs, etc.). These models are based on the first four Ettin encoders [1] with 17M, 32M, 68M, and 150M parameters. Despite their relatively small size, they are capable of identifying cohesive code units, such as data structures, interface blocks, and other components. Furthermore, we demonstrate the robustness of our approach on real-world code, including challenging constructs such as nested definitions and macros. We test our approach on various datasets, and show that it achieves high boundary accuracy and semantic coherence, matching or outperforming chunkers that are based on much larger code-oriented LLMs. We also validate the improved performance of the downstream tasks on a few curated benchmarks.

24.
arXiv (CS.AI) 2026-06-18

Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

arXiv:2606.18258v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prevalence, researchers and practitioners lack methods and empirical insights to make informed decisions about when and what types of human-like behaviors LLMs should exhibit. To fill this gap, we present a multi-dimensional analysis of the prevalence, potential effects, and controllability of these behaviors using LLM-as-a-judge and human evaluation. Across 21,000 multi-turn conversations from four widely used models (gpt-4o, gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash), we find that human-like behaviors are pervasive but vary across models and user factors (conversation goals and user profiles). In terms of perceived appropriateness, human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate from LLMs than from humans. Finally, we show that system prompting can control these behaviors, though it requires careful evaluation to avoid unintended effects. We discuss the implications of our findings and provide recommendations for responsible LLM design and evaluation.

25.
arXiv (CS.CL) 2026-06-12

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.