Academic Intelligence · Curated Daily

Explore the Frontiers of Global Research

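AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.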

01.
arXiv (CS.AI) 2026-06-19

Human-like autonomy emerges from self-play and a pinch of human data

arXiv:2606.19370v1 Announce Type: cross Abstract: Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.

02.
arXiv (math.PR) 2026-06-16

Flowing to Normality and the Fate of the Single Ring Theorem

arXiv:2606.15791v1 Announce Type: cross Abstract: Random non-hermitian matrix ensembles with double-sided rotation invariance obey, in the limit of large matrix size, the Single Ring Theorem, which states that the support of the mean eigenvalue distribution in the complex plane is either a disk or an annulus. In contrast, rotational-invariant random normal matrix ensembles can have mean eigenvalue densities supported over any number of concentric annuli in the complex plane. In this paper we introduce and investigate, both analytically and numerically, a non-hermitian matrix model which flows from a generic matrix distribution obeying the Single Ring Theorem to a distribution of normal matrices by tuning a parameter which penalizes non-normality. We observe numerically breakdown of the Single Ring Theorem as the model flows towards normality, and determine the critical value of the parameter at which the transition occurs. We also study in detail the behavior of the singular values of these matrices under the flow. These singular values form a Fermi gas confined to the positive half-line. In particular, we find that at small values of the flow parameter, the interparticle spacings in the gas exhibit Wigner-Dyson repulsion, whereas for asymptotically large values of the flow parameter, at the normal matrix endpoint of the flow, the spacing statistics is Poissonian. The flow interpolates continuously between these two types of statistics. However, this change in statistics is not related directly to breaking of the Single Ring Theorem, which occurs very early-on along the flow, in the regime of Wigner-Dyson statistics. Finally, we introduce a certain ensemble of random permutations associated with the gas, and make a conjecture on how to use it in order to reconstruct approximately the average density of complex eigenvalues from that of the singular values in the large-$N$ limit.

03.
arXiv (CS.CV) 2026-06-16

Towards Global AI-Driven Cervical Cancer Screening

The global elimination of cervical cancer is a key public health goal set by the World Health Organization (WHO), with screening programs reducing mortality by up to 80%. However, access to experts and biopsy services is limited in low- to middle-income countries (LMICs). Deep learning (DL)-based algorithms offer promising support for screening, but most existing approaches have been developed and validated on private datasets from single countries. We present the first DL-based approach to cervical cancer screening validated on data from multiple countries. Technically, we phrase the problem of detecting and classifying lesions in colposcopy images as a multi-task learning problem, in which we simultaneously perform image-level classification and lesion segmentation. Our model was trained on a private data set of acid stain colposcopy images with manually generated lesion segmentation masks and corresponding histopathological results, employing extensive data augmentation to address image variability. In an in-distribution validation with pathology results serving as ground truth, our algorithm outperformed medical experts (Balanced Accuracy: 0.68 vs 0.64) in CIN1- (Cervical intraepithelial neoplasia grade 1 or lower) versus CIN2+ (grade 2 or higher) classification. External validation on four colposcopy data sets from four countries featuring radical differences in prevalence and patient characteristics yielded superior performance of our method compared to baseline methods. Performance variability across countries was high, with AUC values ranging from 0.54 to 0.80. Overall, algorithm performance varied with age, transformation zone (cervical area most prone to lesion development), presence of comorbidities and pathognomonic signs, with comorbidities having by far the largest negative effect. Future work should focus on improving model robustness and generalizability.
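
For readers unfamiliar with the balanced accuracy metric quoted in this abstract, the following is a minimal sketch of its standard definition: the mean of sensitivity and specificity, which does not reward a classifier for simply predicting the majority class. The function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


# Made-up toy labels for illustration only: 1 = CIN2+, 0 = CIN1-.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
print(f"balanced accuracy = {score:.3f}")
```

In this toy case plain accuracy would be 6/8 = 0.75, while balanced accuracy is lower because one of the two positive cases is missed.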

04.
arXiv (CS.CL) 2026-06-17

Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors

Contrastive Language-Image Pre-training models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor studies, however, usually validate attacks on a small attack-native task, leaving unclear whether the same poisoned checkpoint remains exposed, weakens, or becomes not applicable when reused through other interfaces. We introduce DIFE, a Deployment-Interface Footprint Evaluation framework that audits backdoored CLIP checkpoints across deployment interfaces. DIFE makes various evaluations comparable by specifying each interface's component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable CLIP component or component combination that carries exposure and explains where risk transfers. Auditing reproduced CLIP backdoors with DIFE reveals a structured landscape: native success is not a checkpoint-level risk certificate, exposure follows component footprints, text-side poisoning does not yield textual-encoder control, and some coupled attacks remain mechanism-bound. This audit reveals an important gap in existing CLIP backdoors: a textual encoder that itself becomes a reusable carrier of adversarial behavior. We therefore introduce BadTextTower to fill this gap. BadTextTower produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean.

05.
arXiv (math.PR) 2026-06-12

Quenched and Annealed CLTs for the one-periodic Aztec diamond in random environment

arXiv:2510.11846v2 Announce Type: replace Abstract: We study the asymptotic behavior of random dimer coverings of the one-periodic Aztec diamond in random environment. We investigate quenched limit theorems for the height function and we extend annealed limit theorems that were recently studied in [arXiv:2507.08560]. We consider more general choices of random edge weights (independence is not assumed) and we distinguish two cases where the random edge weights satisfy the Central Limit Theorem (CLT) under different scalings. For both cases, we prove convergence to the Gaussian Free Field for the quenched fluctuations. For the annealed version, it had been shown in [arXiv:2507.08560], that Gaussian Free Field fluctuations can be dominated by the much larger fluctuations of the random environment. To access quenched fluctuations we analyze the Schur process with random parameters in a way that allows to prove the annealed CLT for the height function for non i.i.d. weights. We consider specific examples where we determine the asymptotic fluctuations.

06.
arXiv (CS.AI) 2026-06-24

Multimedia and Visual Analytics in the Agentic Era

arXiv:2504.06138v3 Announce Type: replace-cross Abstract: Professional users need tools to help them gain actionable insights from large multimedia collections. Foundation models and AI agents have rapidly changed the playing field, and improving their accuracy, trustworthiness, and reasoning capabilities are active topics in the computer vision, machine learning, and multimedia communities. Most current research focuses on benchmark driven algorithmic improvements. The multimedia community is the place to go beyond algorithms and consider complete multimedia analytics systems that support professional users in their complex tasks and achieve a true teaming of humans and AI. Supporting users with machine learning and visualizations has been studied for decades in the visual analytics field. In this paper, we propose a framework to bring multimedia and visual analytics together and indicate how it could impact current and new multimedia analytics solutions. Additional information can be found at https://staff.fnwi.uva.nl/m.worring/analytics-model.html

07.
arXiv (CS.LG) 2026-06-16

Discovering Subgroups with Exceptional Survival Characteristics

arXiv:2602.22179v2 Announce Type: replace Abstract: In many applications, it is important to identify subpopulations that survive longer or shorter than the rest of the population. In medicine, for example, it allows determining which patients benefit from treatment, and in predictive maintenance, which components are more likely to fail. Existing methods for discovering subgroups with exceptional survival characteristics rely on restrictive assumptions about the survival model (e.g. proportional hazards), require pre-discretized features, and, as they compare average statistics, tend to overlook individual heterogeneity. In this paper, we propose Sysurv, a non-parametric, fully differentiable method that discovers human-readable rules selecting subgroups with exceptional survival characteristics. Empirical evaluation on a wide range of datasets and settings, including a case study on cancer data, shows that Sysurv reveals insightful and actionable survival subgroups, outperforming the state of the art.

08.
arXiv (CS.AI) 2026-06-16

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

arXiv:2606.15315v1 Announce Type: new Abstract: Personalized routing in public transit systems remains challenging due to the difficulty of capturing and integrating diverse user preferences into routing algorithms. This paper presents ChatPlanner, a novel framework that leverages Large Language Models (LLMs) to enable preference-aware public transit routing. Our approach employs fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to extract routing parameters and interpret nuanced user preferences from natural language queries, subsequently integrating these preferences into the objective function of a public transit routing algorithm. This study designs preference-aware datasets incorporating eight personas and five contexts to establish scoring standards for both fine-tuning and RAG. This work conducted three experiments to validate the solutions' feasibility, extraction of routing information and preferences, and solution set quality and completeness. Results demonstrate that ChatPlanner generates feasible solutions reliably. Fine-tuning enforces the required output structure and learns general preference patterns, while RAG provides query-specific context to resolve imprecise or conversational expressions and calibrate continuous scores. The combination of both achieves the highest accuracy in routing information extraction and user preference interpretation. Results based on selected case studies show that by capturing user preferences, ChatPlanner identifies valuable solutions across different dimensions that existing route planners overlook, generating more valuable route alternatives. This research establishes a new paradigm for integrating natural language understanding into transportation optimization.

09.
arXiv (quant-ph) 2026-06-11

Mach's principle in atomic transitions

arXiv:2606.11608v1 Announce Type: new Abstract: We investigate the atomic transition probabilities in atom-mirror set-ups that are in circular motion. In one scenario, the atom is in circular motion inside a static cylindrical mirror. In the other scenario, the cylindrical mirror rotates around its central axis while the atom remains static. We report structural similarity in the atomic transition probabilities between these two cases – these probabilities are equivalent upon interchanging the field frequencies between the two scenarios. We interpret such an observation as a semi-classical phenomenon analogous to the classical Mach's principle.

10.
arXiv (CS.AI) 2026-06-24

A Unified Framework for Runtime Verification and Model-Based Diagnosis in LOLA

arXiv:2606.23720v1 Announce Type: cross Abstract: We present an integrated framework that unifies runtime verification and model-based diagnosis within the stream specification language LOLA. By encoding system descriptions, component health states, and observations into a single stream-based formalism, the approach enables continuous, online fault localization directly alongside fault detection, without requiring separate toolchains. The framework supports both time-invariant and transient faults, and naturally accommodates nondeterministic observations.

11.
arXiv (CS.AI) 2026-06-18

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

arXiv:2606.18548v1 Announce Type: cross Abstract: Adaptive AI ethics instruction in graduate research training benefits from intake measures that reflect differences in prior LLM experience. Prior coursework or workshop attendance is an obvious candidate, but it is not clear whether it is associated with pre-instruction ratings on key AI perception items. We compare three candidate intake features, self-reported usage frequency, self-rated LLM familiarity, and prior AI education, across five baseline perception outcomes in 93 bioscience graduate and postdoctoral trainees enrolled in a required research ethics course. Usage frequency shows Holm-corrected associations with all five outcomes, self-rated familiarity with three, and prior AI education with none. A threshold-like pattern at the lower end of the scale is most visible for training interest and accuracy trust rather than appearing as a uniform gradient across all five outcomes. In a short intake survey, reported LLM use is more consistently associated with these perceptions than prior coursework or workshops, with self-rated familiarity serving as a secondary indicator. These results suggest that simple pre-instruction behavioral signals can inform lightweight intake profiling for adaptive AI ethics education.
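
The "Holm-corrected" associations mentioned in this abstract refer to the Holm step-down procedure for multiple comparisons. A minimal sketch of that standard procedure follows; the function name and the five p-values are hypothetical and are not results from the study.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return one reject/keep flag per hypothesis.

    P-values are visited from smallest to largest; the k-th smallest (k = 0, 1, ...)
    is compared against alpha / (m - k). Testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject


# Hypothetical p-values for five outcomes (illustration only).
p_values = [0.03, 0.001, 0.04, 0.01, 0.02]
print(holm_reject(p_values))
```

With these made-up inputs only the two smallest p-values survive the correction, even though all five are below the uncorrected 0.05 level.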

12.
arXiv (CS.LG) 2026-06-17

Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry

arXiv:2606.17093v1 Announce Type: new Abstract: Learning-based single-shot fringe projection profilometry (FPP) has been studied mostly at close range. The long-range regime (standoff beyond 1 m) remains largely unaddressed: inverse-square intensity falloff lowers fringe signal-to-noise ratio and degrades physical ground truth, the single-shot problem is ill-posed because fringe-order information is absent from one image, and these architectures have not been studied mechanistically. We present a diagnose-repair-verify study using mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) as convergent diagnostics: they agree on one physical failure locus, driving and verifying an architectural repair. On a photorealistic synthetic benchmark (15,600 fringe images, 50 objects at 1.5-2.1 m), a best UNet baseline reaches 14.54 mm object mean absolute error (MAE). Three probes (linear probing, Grad-CAM, flat-plane out-of-distribution test) converge: the baseline solves the task via object-boundary shape priors rather than fringe-phase decoding. We repair this with PhiCalNet, which outputs wrapped phase rather than depth and applies a fixed differentiable calibration layer mapping phase to depth, removing the shape-prior solution from the hypothesis space architecturally rather than by a loss penalty. A physics-informed loss that enforces the same physics as a soft penalty on a depth-regressing network yields no measurable gain, isolating the architecture as the operative factor. PhiCalNet reduces object MAE 3.3x to 4.46 mm; the residual is carried by 0.103% of pixels at the +/-pi wrap discontinuity. Pixel-wise conformal UQ confirms the diagnosis: rejecting the top 5% of object pixels by snapshot disagreement cuts PhiCalNet RMSE by 64% (20.6->7.4 mm) versus 3.5% for the baseline. MI and UQ converge on the same failure locus.

13.
arXiv (CS.LG) 2026-06-12

Accelerating Speculative Diffusions via Block Verification

arXiv:2606.13426v1 Announce Type: new Abstract: Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions – which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
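
The acceptance-rejection scheme and the residual distribution described in this abstract can be illustrated for the discrete, single-token case, which the abstract calls straightforward. This is a sketch of standard speculative sampling over a small vocabulary, not the paper's continuous-diffusion method or its block verification; the distributions `p` (target) and `q` (draft) are made-up examples.

```python
import random


def residual(p, q):
    """Normalized positive part of p - q, sampled from when a draft is rejected."""
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        return list(p)  # p equals q, so a rejection never happens; fall back to p
    return [d / total for d in diff]


def speculative_sample(p, q, rng):
    """Draw one sample distributed according to p using proposals from q."""
    x = rng.choices(range(len(q)), weights=q)[0]  # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with prob min(1, p/q)
        return x
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


# Made-up target and draft distributions over three tokens.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)  # empirical frequencies should be close to p
```

The output matches the target because each token x is produced with probability min(p(x), q(x)) through acceptance plus max(p(x) - q(x), 0) through the residual, and these sum to p(x).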

14.
Nature Biotechnology 2026-06-22

Affordable centimeter-scale 3D microscopy with submicrometer resolution

Author: Unknown

Submicrometer-resolution three-dimensional (3D) imaging of large samples has been constrained by the short working distance, high cost and inflexible design of immersion objectives. We developed hybrid solid–liquid optics (HySIL) — a refractive framework with index-matched components — for submicrometer-resolution 3D imaging of centimeter-scale samples in various immersion media using inexpensive air objectives.

15.
arXiv (quant-ph) 2026-06-12

Statistical Mechanics and Symmetries of Non-Abelian Anyon Proliferation: From Deformation to Decoherence

arXiv:2606.12527v1 Announce Type: new Abstract: Topological quantum computation relies on braiding non-Abelian anyons, but requires the underlying topological order to survive imperfect state preparation and environmental noise. We show that the instability of topological order to wavefunction deformations and to decoherence, with the latter probed by syndrome distributions, are generically captured by stat-mech models whose symmetries naturally expose the corrupting anyonic excitations. As an example, we combine this framework with Monte-Carlo simulations to resolve the stability of $D_4$ topological order under deformations and quantum channels that proliferate multiple non-Abelian anyon species that individually are unable to condense. We show that beyond a finite threshold, proliferation of two non-Abelian anyon species parasitically condenses a shared Abelian-anyon fusion outcome, destroying the topological order. Our symmetry-based approach sharply differentiates the resulting trivial phase from that obtained by condensing all Abelian charges; in other words, the trivial phase "remembers" which anyons condensed. This framework provides a first step into identifying the relevant symmetry for optimal decoders, conditioned on syndrome measurements, of non-Abelian topological order.

16.
arXiv (quant-ph) 2026-06-17

Frequency-Division Multiplexed CV-QKD System

arXiv:2603.20718v2 Announce Type: replace Abstract: We propose a frequency-division multiplexed (FDM) continuous-variable quantum key distribution (CV-QKD) system with enhanced spectral efficiency through optimized channel spacing of low-symbol-rate signals. A four-channel 10-Mbaud FDM-CV-QKD system was experimentally demonstrated using Gaussian modulation, a transmitted local oscillator, and homodyne detection. Despite the inter-channel interference, under a finite-size scenario (m=1.25x10^6), the system achieved a 3.6-fold back-to-back secret key rate gain and outperformed the single-channel frequency-upconverted signal up to 26.8 km.

17.
arXiv (CS.AI) 2026-06-16

Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

arXiv:2606.15441v1 Announce Type: cross Abstract: Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rate on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defense methods are confined to recognizing specific attack patterns, rather than assessing whether the intent of every embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, and the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training-based method that grounds defense decisions on the user tasks rather than attacker-controlled data. At each tool-output step, the defender undertakes chain-of-thought reasoning verifying that its actions are consistent with the user task. Leveraging red-teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, achieving broad coverage of injection-reformulation strategies. Together, these allow the defender to be optimized via multi-objective reinforcement learning and achieve better safety-utility trade-off. Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs.

18.
arXiv (CS.LG) 2026-06-19

Predictability as a Fine-Grained Measure for Privacy

arXiv:2606.20546v1 Announce Type: new Abstract: Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker's core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information about unknown individuals after observing the algorithm's output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.

19.
arXiv (CS.AI) 2026-06-19

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

arXiv:2604.11556v2 Announce Type: replace-cross Abstract: LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

21.
arXiv (quant-ph) 2026-06-11

Dark state spectroscopy in nonlinear waveguide quantum electrodynamics

arXiv:2606.11997v1 Announce Type: new Abstract: Quantum systems face a fundamental trade-off: they must remain decoupled from the environment to maintain long coherence times, yet they require interactions with the environment to be accessible for measurement. As a prime example, emitter arrays coupled to waveguides facilitate collective modes that, owing to interference, can suppress radiation into the waveguide. While complete destructive interference creates perfectly dark states with infinite lifetimes, their inherent decoupling makes them unmeasurable in standard waveguide quantum electrodynamics. Consequently, current approaches must rely on system non-idealities that permit measurement but limit the coherence times. In this work, we lift this limitation by proposing the use of weakly squeezed light generated in $\chi^{(2)}$ nonlinear waveguides for the spectroscopy of completely dark states. We show that the fluorescence spectrum probes transitions between the dressed dark states of the emitter array. This work paves the way towards the measurement and control of dark states, with applications for robust quantum memories, computation, and communication.

22.
arXiv (quant-ph) 2026-06-12

Achieving Heisenberg limit under noisy conditions with quantum Zeno dynamics and dynamical decoupling

arXiv:2606.13205v1 Announce Type: new Abstract: Quantum Zeno dynamics (QZD) and dynamical decoupling (DD) are useful tools that enable the effective suppression of noise in quantum systems. We consider the problem of when (i) noise can be suppressed and (ii) Heisenberg limit (HL) can be achieved in quantum metrology, and prove necessary and sufficient conditions for when QZD and DD are useful for achieving these two goals. We also show that in the Markovian regime, there are scenarios where preventing errors using QZD/DD may enable HL to be achieved where current QEC methods may not. Finally, we demonstrate that the combination of both techniques can allow individually imperfect QZD and DD strategies to saturate HL.

23.
arXiv (CS.CV) 2026-06-19

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.
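
As background for the mIoU figure in this abstract, here is a minimal sketch of plain mean Intersection-over-Union over flattened label arrays. It does not reproduce the challenge's composite score, whose exact definition is not given here; the function name and toy labels are illustrative assumptions.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average per-class IoU, skipping classes absent from both labels and predictions."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in labels or predictions")
    return sum(ious) / len(ious)


# Made-up per-pixel labels for a six-pixel image with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(y_true, y_pred, num_classes=3))
```

Here the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.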

24.
arXiv (CS.CV) 2026-06-16

SceneCraft: Interactive System for Image Editing via Scene Graph

Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.

25.
arXiv (CS.CV) 2026-06-11

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%–91% recognition – too low – and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.