论文广场 - AcademicHub

01.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2602.18154

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

作者:

Mirae Kim ↗Seonghun Jeong ↗Youngjun Kwak ↗

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

阅读与讨论 → 访问原文 →

02.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2602.13344

FireRed-Image-Edit-1.0 Technical Report

作者:

Super Intelligence Team ↗Changhao Qiao ↗Chao Hui ↗Chen Li ↗Cunzheng Wang ↗Dejia Song ↗Jiale Zhang ↗Jing Li ↗Qiang Xiang ↗Runqi Wang ↗Shuang Sun ↗Wei Zhu ↗…

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at https://github.com/FireRedTeam/FireRed-Image-Edit/ .

阅读与讨论 → 访问原文 →

03.

bioRxiv (Bioinfo) 2026-06-22 DOI: HASH:5d04178a2a73f8cad4dc53d0b32fa659

When Less Is Not More: DICEPro Mitigates the Impact of Incomplete Reference Matrices on Cellular Frequency Deconvolution.

作者:

BA ↗Thiebaut ↗Hinaut ↗Hejblum ↗B. P ↗

Cellular deconvolution aims to estimate the frequencies of different cell populations from gene expression measurements in a biological sample. Supervised approaches, such as CIBERSORTx and DISSECT, critically depend on the reference signature matrix, which encodes the gene expression profiles of cell-types based on prior knowledge. Despite numerous deconvolution methods, the impact of missing cell populations in the reference matrix remains understudied. Here, we evaluate the robustness of state-of-the-art deconvolution approaches using simulations based on real dataset examples combined with statistical modeling, validated against published data, and multiple real benchmark datasets. Results show that deconvolution performance remains stable when the reference matrix includes most cell-types, but declines sharply as the matrix becomes incomplete, especially for abundant cell populations. To address the limitations of incomplete reference matrices, we introduce DICEPro, an optimization-based framework designed to enhance existing deconvolution methods. By systematically adjusting the reference signatures, DICEPro better accounts for missing or underrepresented cell populations, leading to improved precision and robustness. We show that DICEPro consistently boosts deconvolution performance across both simulated datasets, derived from real data examples, and multiple real biological datasets, offering a practical solution when standard methods are hindered by incomplete references.

阅读与讨论 → 访问原文 →

04.

medRxiv (Medicine) 2026-06-22 DOI: HASH:b2879555027a8a36b99a0407967c4b73

Modelling the decadal expansion of West Nile virus in Italy: the role of climatic, anthropogenic, and macroecological drivers

作者:

Marcolin ↗Bella ↗Del Manso ↗Dorigatti ↗Pezzotti ↗Poletti ↗Riccardo ↗Di Marco ↗

Abstract BACKGROUND West Nile virus (WNV) is a growing health burden in Italy. Anticipating human infection risk is hampered by the pathogen's complex ecology, highlighting the need for comprehensive early-warning tools. AIM We aimed to model municipal-level WNV risk in Italy and characterize its decadal expansion in Italy, providing a comprehensive ecological understanding of viral emergence. METHODS We applied a machine learning framework to annual human WNV case data from 2014 to 2024. The model integrated a suite of environmental, socio-economic, and macroecological predictors to generate risk projections. We evaluated the model's performance through multiple validation settings. We also performed an anticipation test for the 2025 epidemic season, using 2024 environmental data to assess the model's predictive accuracy against observed 2025 human cases. RESULTS Our model achieved robust performance (True Skill Statistic > 0.4) and captured WNV progressive expansion from 184 predicted positive municipalities in 2014 to 2,012 in 2024 (an 11-fold increase in 11 years). Seasonal minimum temperature was the primary risk driver, followed by monitoring year and population density, indicating active spatial spread. Environmental suitability consistently preceded clinical detection. Municipalities with cases in 2023-2024 exhibited significantly higher predicted suitability during 2018-2022 than those without cases (average risk 0.58 vs 0.20). Our model successfully identified emerging risk hotspots along the Adriatic coast and southern Italy before the official human spillover of 2025. CONCLUSION Embedding macroecological drivers into WNV risk modelling provides an improved understanding of drivers of rapid WNV expansion. Our model enables proactive risk mapping, surveillance efforts, and targeted public health measures.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14035

Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

作者:

Dinh-Khoi Vo ↗Nhut-Thanh Le-Hinh ↗Viet-Tham Huynh ↗Tam V. Nguyen ↗Minh-Triet Tran ↗Trung-Nghia Le ↗

Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data. We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output. Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments. Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.20244

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

作者:

Bo Yin ↗Xiaobin Hu ↗Chengming Xu ↗Ruolin Shen ↗Mo Yang ↗Jiangning Zhang ↗Peng-Tao Jiang ↗Cheng Tan ↗Shuicheng YAN ↗

arXiv:2606.20244v1 Announce Type: cross Abstract: Vision-language models (VLMs) often underperform on evidence intensive tasks because decisive visual evidence are small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via light-weight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: \url{https://github.com/YinBo0927/SPOT-E}

阅读与讨论 → 访问原文 →

07.

medRxiv (Medicine) 2026-06-15 DOI: HASH:266b3d0161a06b994fc62243b3890604

Wellbeing After Stroke-2 (WAterS-2): a feasibility study with process evaluation exploring inclusive, accessible, online psychological support after stroke

作者:

Longley ↗Woodward-Nutt ↗Cotterill ↗Chouliaria ↗Thomas ↗Bamford ↗Bowen ↗Patchwood ↗

Objectives: Explore feasibility and acceptability of upskilling a workforce to deliver a co-developed intervention, based on Acceptance and Commitment Therapy (ACT), to support psychological adjustment post-stroke targeting underserved groups. Design: Multi-site, single-arm feasibility study with embedded mixed-methods process evaluation (ISRCTN17628580). Setting: Four NHS community stroke services across England. Participants: 1. Stroke survivors [≥]18 years of age, [≥]4 months post-stroke, reporting psychological difficulties adjusting to stroke, able to consent and access remote group sessions in English; 2. Group facilitators from NHS stroke services, not ACT specialists. Intervention: WAterS-2: an eight-session, remotely-delivered ACT-informed group intervention. Outcome measures: Recruitment, fidelity, safety, acceptability and perceived value were assessed using fidelity checklists, post-intervention surveys and semi-structured interviews with stroke survivors and facilitators. Clinical outcomes including mood (HADS), wellbeing (ONS4), psychological flexibility (AAQ-ABI), measured post-group and three-months later. Results: Nineteen stroke survivors recruited (mean 9.6 months post-stroke; n=5 (26%) minoritised ethnicities; n=10 (52%) with aphasia). Thirteen facilitators - including two peer support workers - delivered the intervention with fidelity following structured training across four services. Drop-out was low (2/19; 11%); with 15 (79%) attending [≥]5/8 sessions. Remote data collection was feasible (79% follow-up completion), with no adverse events recorded. Acceptability was high: survivors valued peer connection, grounding and mindfulness practices. ACT metaphors were helpful for some but challenging for others, including some with aphasia. Online delivery was suitable but limited informal connection. Facilitators reported increased capability, incorporating ACT skills into routine care. NHS workforce pressures and geographically-constrained referral pathways limited recruitment reach. Conclusions: WAterS-2 is feasible, safe, acceptable and inclusive. A mixed workforce, including NHS peer support workers, can be upskilled to deliver with fidelity. Inclusion of underserved groups is achievable but requires active strategies beyond standard NHS referral routes. Findings inform a provisional logic model and a future pragmatic trial.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.CL) 2026-06-18 DOI: arXiv:2606.18986

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

作者:

Yafeng Wu ↗Huu Hiep Nguyen ↗Thin Nguyen ↗Hung Le ↗

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

阅读与讨论 → 访问原文 →

09.

bioRxiv (Bioinfo) 2026-06-19 DOI: HASH:5832b185205a24918018690f5d2e8931

Sanjeevani: A manually curated anti-cancerous phytochemical database integrated with downstream analysis tools.

作者:

Jha ↗Shukla ↗Shingan ↗Das ↗

Background: Cancer continues to pose a massive global health burden. While plant-derived phytochemicals offer promising therapeutic leads, existing natural product databases often lack cancer specificity, dataset downloadability, and integrated screening tools. Methods: We developed Sanjeevani, an integrative web platform cataloguing 4,823 curated anticancer phytochemicals. Using a balanced dataset of 9,646 molecules, we trained Support Vector Machine (SVM), Random Forest, and K-Nearest Neighbours classifiers using a hybrid feature representation of RDKit descriptors and 2048-bit ECFP4 fingerprints. The platform also integrates AutoDock Vina for web-based molecular docking for binding affinity, poses prediction and ADMET-AI for pharmacokinetics estimation. Results: The SVM model demonstrated the strongest predictive capability, achieving a top test accuracy of 0.966 and a ROC-AUC of 0.992. Benchmarking across five docking tools confirmed that AutoDock Vina successfully balanced computational automation with literature-consistent binding affinity replication. The final architecture provides rapid interactive 2D/3D visualizations integrated with downstream analysis tools. Conclusion: Sanjeevani provides an open-access, one-stop pipeline that bridges the gap between raw natural product data and actionable computational screening, accelerating natural product-based oncology drug discovery.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.15144

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

作者:

Jann Railey Montalan ↗David Demitri Africa ↗Jimson Paulo Layacan ↗Richell Isaiah Flores ↗Ivan Yuri De Leon ↗Lance Calvin Gamboa ↗

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2606.13720

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

作者:

Elisabetta Rocchetti ↗Alfio Ferrara ↗

arXiv:2606.13720v1 Announce Type: new Abstract: Arditi et al. (2024) has shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) – nullspace projection and counterfactual flipping – on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite – an intriguing distinction that warrants further investigation in future work.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15301

Discovering Lattice Reduction Strategies via Self-Play

作者:

Mohamed Malhou ↗Kristin Lauter ↗Ludovic Perret ↗

arXiv:2606.15301v1 Announce Type: cross Abstract: The Lenstra-Lenstra-Lovász (LLL) algorithm is a seminal contribution to computer science used for lattice basis reduction, yet its polynomial-time outputs produce bases that are far from optimal as the dimension grows. We show that deep reinforcement learning can discover strictly superior, generalizable reduction strategies by interacting with the primitive action space of LLL. We formulate lattice reduction as a single-player Markov Decision Process (MDP) and train a deep residual network using an AlphaZero-style self-play pipeline augmented with adaptive-horizon MCTS (Monte Carlo Tree Search), which couples multi-step network predictions with an entropy-gated expansion mechanism. The resulting policy, DeltaStar, is trained exclusively on small $8$-dimensional $q$-ary lattices and requires fewer primitive row operations than LLL. Crucially, it generalizes zero-shot to unseen moduli and higher dimensions up to $n=32$ without retraining.

阅读与讨论 → 访问原文 →

13.

Nature (Science) 2026-06-12 DOI: HASH:8c8b5c0f67a9ac34a88d4f948fbbc87e

‘Student Geng’ ignites research-integrity scandal in China after calling out senior academics<b> </b>

作者:

Xiaoying You ↗

Video blogger’s viral accusations of data manipulation in Nature journals have sparked intense debate and speedy institutional investigations. Video blogger’s viral accusations of data manipulation in Nature journals have sparked intense debate and speedy institutional investigations.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.CL) 2026-06-17 DOI: arXiv:2512.15792

A Multifaceted Analysis of Social Biases in Large Language Models

作者:

Xulang Zhang ↗Rui Mao ↗Erik Cambria ↗

Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.12069

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

作者:

Hong Li ↗Yankang Dong ↗Yue Xu ↗Yihan Tang ↗Mingzhu Li ↗Jiamin Qiu ↗Qihang Yao ↗Xing Zhu ↗Yujun Shen ↗Nan Xue ↗Yong-Lu Li ↗

Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited and proper datasets and benchmarks also lack. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20 K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. Then we propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these exceed the performance of methods without alignment and align with whole-object images.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14556

Visual Quality Score Assessment of Large White Goods in Remanufacture with Multi-View Deformable-DETR

作者:

Paul Koch ↗Vivek Chavan ↗

Remanufacturing large white goods is essential for a circular economy, yet visual quality assessment remains a manual bottleneck for training and pricing. Conventional detection methods require extensive annotation and struggle with small defects in high-resolution multi-view data. We present a multi-view framework based on Deformable-DETR for automated quality scoring that aggregates information across redundant views to extract fine-grained features. To enhance robustness with limited labels, we employ self-supervised pretraining followed by supervised fine-tuning on expert-annotated scores. Additionally, a linear projection over frozen feature maps identifies regions of interest to explain model decisions. Evaluated on an industrial multi-view dataset, our approach delivers precise quality assessments while reducing reliance on manual annotation and per-part customization, enabling scalable and transparent inspection for remanufacturing lines.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.13081

Emotional regulation improves deep learning-based image classification

作者:

Riccardo Emanuele Landi ↗Jo\~ao M. F. Rodrigues ↗Marta Chinnici ↗

arXiv:2606.13081v1 Announce Type: cross Abstract: Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally-influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and -100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach overcomes the related work in image classification based on CIFAR, revealing Emotional Regulation as the new state-of-the-art in emotion-augmented deep learning for large-scale vision datasets. The study also enforces evidence of the impact of affective states in improving machine learning tasks' optimization, encouraging further investigation on emotion-inspired architectures.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.LG) 2026-06-15 DOI: arXiv:2507.10834

From Small to Large: A Graph Convolutional Network Approach for Solving Assortment Optimization Problems

作者:

Guokai Li ↗Pin Gao ↗Stefanus Jasin ↗Zizhuo Wang ↗

arXiv:2507.10834v4 Announce Type: replace Abstract: Assortment optimization seeks to select a subset of substitutable products, subject to constraints, to maximize expected revenue. The problem is NP-hard due to its combinatorial and nonlinear nature and arises frequently in industries such as e-commerce, where platforms must solve thousands of such problems each minute. We propose a graph convolutional network (GCN) framework to efficiently solve constrained assortment optimization problems. Our approach constructs a graph representation of the problem, trains a GCN to learn the mapping from problem parameters to optimal assortments, and develops three inference policies based on the GCN's output. Owing to the GCN's ability to generalize across instance sizes, patterns learned from small-scale samples can be transferred to large-scale problems. Theoretical results are established to show the expressive power of the proposed GCN, and explain the underlying mechanism of the size generalization ability. Numerical experiments show that a GCN trained on instances with 20 products achieves over 85% of the optimal revenue on problems with up to 2,000 products within seconds, outperforming existing heuristics in both accuracy and efficiency. We further extend the framework to settings with an unknown choice model using transaction data and demonstrate similar performance and scalability.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.12099

ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation

作者:

Junlin Hao ↗Haoshuai Fu ↗Xibin Song ↗Wei Li ↗Ruigang Yang ↗Xinggong Zhang ↗Jinchuan Zhang ↗

Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17118

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

作者:

Yuanteng Chen ↗Peisong Wang ↗Zhilei Liu ↗Nanxin Zeng ↗Yuantian Shao ↗Shiqiang Lang ↗Tao Liu ↗Chuangyi Li ↗Qinghao Hu ↗Gang Li ↗Jing Liu ↗Jian Cheng ↗…

arXiv:2606.17118v1 Announce Type: cross Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skew frequency statistics, obscuring experts critical for informative visual content. To bridge gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17220

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

作者:

Mingxu Tao ↗Jiawei Hu ↗Xian Zhou ↗Wenpeng Hu ↗Jiajun Cheng ↗Yunbo Cao ↗Zhunchen Luo ↗Guotong Geng ↗

arXiv:2606.17220v1 Announce Type: new Abstract: Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. It motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedbacks. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a highcapacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that LLM's capabilities to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CL) 2026-06-18 DOI: arXiv:2606.18584

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

作者:

Fan Xu ↗Jian Luo ↗MingWen Wang ↗GuoDong Zhou ↗

Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of speech-driven features towards language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features towards CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embedding and the MFCC-based features. Evaluation of two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to the state-of-the-art methods.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2512.09373

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

作者:

Haobo Jiang ↗Jin Xie ↗Jian Yang ↗Liang Yu ↗Jianmin Zheng ↗

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. Particularly, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves the superior registration accuracy and outstanding computational efficiency.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17438

Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects

作者:

Ingu Yeo ↗Hyung-Gun Chi ↗Jae-Sang Hyun ↗

This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.

阅读与讨论 → 访问原文 →

25.

medRxiv (Medicine) 2026-06-18 DOI: HASH:241755eb4546a393bea28600c499bfca

Empirical Validation and Predictive Utility of the Perinatal Grief Scale in Men after Perinatal Loss

作者:

Ravaldi ↗Mosconi ↗Nespoli ↗Fumagalli ↗Vannacci ↗

Background. The Perinatal Grief Scale (PGS) is a widely used instrument for assessing grief following pregnancy loss, yet no study has validated it specifically in men despite documented use in several studies. This gap is critical given fathers' persistent underrepresentation in perinatal bereavement research and the absence of empirically supported screening thresholds for this population. Methods. This cross-sectional validation study used data from the OPALE project (Observatory on PerinatAL hEalth) conducted by the CiaoLapo Foundation in Italy. Among 276 fathers who experienced stillbirth or miscarriage, we examined criterion validity by testing the association between PGS scores and trauma-related symptomatology assessed via three validated instruments: the Revised Impact of Event Scale (RIES, n=103), National Stressful Events Survey Short Scale (NSESSS, n=95), and SCL-90 (n=173). We systematically tested multiple threshold combinations to identify optimal discriminative performance. Results. The PGS demonstrated excellent criterion validity. The optimal threshold (PGS >=92) showed sensitivity 81.0%, specificity 81.8%, and Youden's J index 0.628. Fathers scoring >=92 had 19.12 times the odds of high trauma symptoms (95% CI: 9.35 to 39.14, p

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络