Esther Sun

Publications

Interspeech 2026 In Preparation

Towards Comprehensive Emotional Characterization: Ambiguity-Aware Emotion Reasoning in Speech

First Author

Hierarchical Emotion Modeling · Cross-Model Distillation · Explainable SER

Key Points: TBD

ICML 2026 Under Review

ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools — From Consensus Learning to Ambiguity-Driven Emotion Reasoning

First Author | 🔗 Paper

Agentic LLM reasoning · Tool-augmented inference · RL alignment (GRPO)

TL;DR: Proposes ADEPT, a pioneering agentic framework that transforms Speech Emotion Recognition (SER) from a static classification task into an active, evidence-grounded reasoning process using RL-aligned MLLMs. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning.

Key Points:

  • Autonomous Agentic Framework: Engineered an MLLM-based autonomous agent designed for multi-turn Speech Emotion Recognition (SER). The framework redefines emotion inference as an iterative inquiry process, where the agent programmatically orchestrates a specialized toolkit of semantic and acoustic probes to decode complex paralinguistic cues.
  • Evidence-Grounded Rationalization: Developed a retrieval-augmented rationalization mechanism that anchors model predictions in verifiable physical evidence. By extracting and analyzing pitch trajectories, energy profiles, and spectral centroids, the system effectively mitigates text-biased hallucinations and provides high-fidelity, auditable reasoning traces.
  • Reinforcement Learning Alignment (GRPO): Implemented Group Relative Policy Optimization (GRPO) to refine the agent’s multi-step decision-making trajectories. This alignment strategy optimizes the policy-driven evidence acquisition process, rewarding logical rigor and penalizing non-informative tool invocations to ensure the reasoning path aligns with human-expert diagnostic standards.
  • Ambiguity-Driven Reasoning: Formulated a novel approach to address the “Consensus Paradox” by treating annotator disagreement as a valuable supervision signal rather than noise. By modeling emotional ambiguity through a consensus-driven inquiry process, the framework significantly improves robustness and recovery of co-occurring minor emotions on benchmark datasets including MSP-Podcast and IEMOCAP.
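The acoustic probes mentioned in the evidence-grounding bullet (pitch trajectories, energy profiles, spectral centroids) can be approximated with standard signal processing. A minimal numpy sketch, assuming a mono 16 kHz signal; pitch tracking is omitted here since it needs a dedicated algorithm (e.g. pYIN), and these helper names are illustrative, not the paper's toolkit:

```python
import numpy as np

SR = 16_000  # assumed sample rate of the input audio

def frame(signal, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping analysis frames."""
    n = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return signal[idx]

def energy_profile(signal):
    """Per-frame RMS energy, one of the cues the agent can probe."""
    return np.sqrt((frame(signal) ** 2).mean(axis=1))

def spectral_centroid(signal):
    """Per-frame spectral centroid in Hz (a brightness cue)."""
    frames = frame(signal) * np.hanning(512)  # window to limit leakage
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(512, d=1.0 / SR)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)

# Sanity check on a synthetic 440 Hz tone: the centroid sits near 440 Hz.
t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 440 * t)
centroid = spectral_centroid(tone)
```

Features like these give the agent physical quantities to cite in its reasoning trace, rather than free-floating claims about the audio.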

This work explores how agentic multimodal models can move beyond pattern recognition toward verifiable, evidence-grounded reasoning under perceptual ambiguity.
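The GRPO alignment described above hinges on group-relative advantages: each sampled trajectory's reward is standardized against its own group, so no learned value critic is needed. A minimal sketch of that computation, assuming scalar trajectory rewards; this illustrates the advantage estimate only, not the paper's full training loop:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward against the
    group of responses sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled reasoning trajectories with scalar rewards
# (e.g. correctness minus a penalty for non-informative tool calls).
adv = group_relative_advantages([1.0, 0.2, 0.2, -0.4])
# Trajectories above the group mean get positive advantage and are
# reinforced; below-mean trajectories are pushed down.
```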

ICASSP 2026 Accepted

Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration

First Author | 🔗 Paper

Discrete Audio Tokenization · Multi-layer Attention Fusion · SSL · Neural Codecs

TL;DR: Introduces a multi-layer fusion framework that recovers the substantial Speech Emotion Recognition (SER) performance lost to audio discretization, enabling semantic-rich discrete tokens to rival high-fidelity continuous features.

Key Points:

  • Performance Recovery: Developed a novel framework that recovers 75% of the performance drop induced by discretization by integrating multi-layer WavLM tokens with 74-dimensional openSMILE paralinguistic features.
  • Architecture Innovation: Designed and benchmarked dual fusion strategies—Layer-First and Modality-First—achieving an 8% gain with the Layer-First approach by prioritizing hierarchical feature extraction.
  • Ablation & Trade-offs: Conducted systematic studies across 24 WavLM layers and 5 codebook sizes (K=256 to 4000) to identify optimal compression-accuracy trade-offs for emotion preservation.
  • Neural Codec Benchmarking: Evaluated against three major neural codecs (SpeechTokenizer, DAC, and EnCodec), demonstrating that semantic-rich SSL tokens provide a 50%+ performance advantage over reconstruction-focused models for downstream SER tasks.

This work establishes semantic-rich discrete speech tokens as a viable alternative to continuous acoustic features for affective computing, bridging representation learning and downstream emotion reasoning under aggressive compression constraints.
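The Layer-First strategy above can be pictured as fusing across the 24 WavLM layers first, then pooling over time, then appending the 74-dimensional paralinguistic vector. A minimal numpy sketch of that data flow, using a simple softmax layer weighting in place of the model's learned attention fusion (dimensions and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, D, P = 24, 100, 768, 74   # WavLM layers, frames, hidden dim, openSMILE dims

layer_states = rng.normal(size=(L, T, D))   # per-layer token embeddings
para_feats = rng.normal(size=(P,))          # utterance-level paralinguistic vector
layer_logits = rng.normal(size=(L,))        # learnable weights in the real model

def layer_first_fuse(states, logits, para):
    """Layer-First: fuse across layers first, then pool over time,
    then concatenate the paralinguistic features."""
    w = np.exp(logits - logits.max())
    w /= w.sum()                             # softmax over layers
    fused = np.tensordot(w, states, axes=1)  # (T, D) weighted layer sum
    pooled = fused.mean(axis=0)              # (D,) temporal mean pooling
    return np.concatenate([pooled, para])    # (D + P,) classifier input

emb = layer_first_fuse(layer_states, layer_logits, para_feats)
```

The Modality-First variant would instead pool each layer and fuse token and paralinguistic streams before combining layers; the paper reports the Layer-First ordering works better.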

NeurIPS 2025 Workshop Accepted

Systematizing LLM Persona Design: A Four-Quadrant Technical Taxonomy for AI Companion Applications

First Author | 🔗 Paper

AI Companionship · Embodied Intelligence · Technical Taxonomy

TL;DR: Systematizes the fragmented landscape of AI persona design by introducing a novel four-quadrant technical taxonomy that maps companion applications from virtual agents to embodied systems, providing actionable design guidelines for researchers and practitioners.

Key Points:

  • Four-Quadrant Taxonomy: Proposed a systematic framework categorizing AI companions along two orthogonal axes—embodiment level and interaction modality—to clarify design trade-offs across diverse application domains.
  • Technical Design Patterns: Identified and documented recurring architectural patterns for persona consistency, memory management, and emotional coherence across 50+ commercial and research systems.
  • Embodiment Spectrum Analysis: Analyzed the progression from text-only virtual companions to fully embodied robotic agents, mapping key technical challenges at each stage of the embodiment continuum.
  • Actionable Guidelines: Synthesized practical design recommendations for balancing persona authenticity, user safety, and system scalability in real-world AI companion deployments.

This work provides a unifying systems-level lens for LLM persona agents, enabling principled design and evaluation of AI companions across virtual, multimodal, and embodied interaction settings.
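The two orthogonal axes of the taxonomy can be read as a lookup over (embodiment, modality) pairs. A toy sketch of that mapping; the axis values and quadrant labels here are illustrative placeholders, not the paper's exact terminology:

```python
from enum import Enum

class Embodiment(Enum):
    VIRTUAL = 0
    PHYSICAL = 1

class Modality(Enum):
    TEXT_ONLY = 0
    MULTIMODAL = 1

def quadrant(embodiment: Embodiment, modality: Modality) -> str:
    """Place a companion system on the two orthogonal axes."""
    table = {
        (Embodiment.VIRTUAL, Modality.TEXT_ONLY): "Q1: chat companion",
        (Embodiment.VIRTUAL, Modality.MULTIMODAL): "Q2: virtual avatar",
        (Embodiment.PHYSICAL, Modality.TEXT_ONLY): "Q3: screen-based device",
        (Embodiment.PHYSICAL, Modality.MULTIMODAL): "Q4: embodied robot",
    }
    return table[(embodiment, modality)]
```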

IEEE TPAMI 2024 Under Review

ForgeryGPT: Multimodal Large Language Model for Explainable Image Forgery Detection and Localization

Co-author | 🔗 Paper

Fine-grained forgery localization · Vision–language reasoning · Multimodal LLM grounding

TL;DR: Leverages Multimodal LLMs to build an explainable image forgery detection and localization system that generates natural-language rationales for the visual manipulations it identifies.

Key Points:

  • Hybrid Vision Architecture: Developed a “Vocabulary-enhanced Vision Encoder” integrating a trainable ViT with a frozen CLIP encoder to capture fine-grained, domain-specific forgery artifacts.
  • Adaptive Forgery Prompting: Proposed an “Object-agnostic Forgery Prompt” mechanism using 12-dimensional learnable embeddings to enable dynamic adaptation across diverse forgery scenarios, achieving a 15% improvement over baselines.
  • Large-Scale Synthesis Pipeline: Constructed a comprehensive 768K+ multimodal dataset using a multi-granularity mask generation pipeline that combines random segmentation with Segment Anything Model (SAM) techniques.
  • Explainable Localization: Contributed the FL-Expert module to provide precise spatial localization coupled with linguistic explanations of splicing, copy-move, and removal manipulations.

ACM MM 2023 Accepted

ECENet: Explainable and Context-Enhanced Network for Multimodal Fact Verification

Co-author | 🔗 Paper

Dual-granularity Attention · Cross-modal Alignment · Hierarchical Reasoning

TL;DR: Introduces a state-of-the-art multimodal fact verification framework that utilizes dual-granularity attention and hierarchical reasoning to generate evidence-based justifications for news veracity.

Key Points:

  • Dual-Granularity Attention: Engineered an Improved Coarse- and Fine-grained Attention Network (CFgAN) that enhanced contextual comprehension of image-text correlations by 12.1%.
  • Hierarchical Reasoning Framework: Developed a unified architecture for feature extraction and cross-modal fusion, enabling the system to perform complex evidence-based inference.
  • Benchmark Leadership: Achieved SOTA performance on major benchmarks, including 87.7% accuracy on NewsCLIPpings and an 81.5 F1-score on FACTIFY.
  • Justification Generation: Focused on “explainability” by ensuring the network provides logical justifications alongside its verification results, as showcased in oral and poster presentations at ACM MM 2023.
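Dual-granularity attention of the kind described above applies cross-modal attention at two resolutions: individual text tokens attending to image patches (fine) and pooled summaries attending to each other (coarse). A generic numpy illustration of that idea, not ECENet's CFgAN architecture (shapes are made up):

```python
import numpy as np

def attention(q, k, v):
    """Standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
text = rng.normal(size=(12, 64))     # 12 text-token embeddings
patches = rng.normal(size=(49, 64))  # 7x7 grid of image-patch embeddings

# Fine-grained: every text token attends over all image patches.
fine = attention(text, patches, patches)             # (12, 64)
# Coarse-grained: pooled text summary attends to pooled image summary.
coarse = attention(text.mean(0, keepdims=True),
                   patches.mean(0, keepdims=True),
                   patches.mean(0, keepdims=True))   # (1, 64)
fused = np.concatenate([fine.mean(0), coarse[0]])    # joint evidence vector
```

The coarse path captures whether image and text agree globally, while the fine path localizes which tokens and patches drive the verdict, which is what makes the justification generation possible.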

AAAI 2023 Accepted

Unimodal Feature-Enhanced and Cross-Modal Correlation Learning for Multimodal Fact Verification

Co-author | 🔗 Paper

Multimodal feature engineering · Cross-modal correlation learning