RAVEL: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

1Virginia Tech, 2University of Illinois Urbana-Champaign

TL;DR We propose a training-free approach that leverages knowledge graph-based Retrieval-Augmented Generation to enhance the image generation and editing capabilities of text-to-image (T2I) diffusion models. We also introduce a novel RAG context-guided self-correction mechanism. The approach enables generation of contextually and narratively accurate images for complex, domain-specific scenarios that standard T2I models struggle with, using only simple, high-level user prompts.

Teaser Image

We introduce RAVEL, a training-free approach that uses graph-based RAG to enhance T2I models with context-aware guidance. It improves generation of rare, complex concepts and supports disentangled image editing. A self-correction module further refines visual and narrative accuracy.

Abstract

Despite impressive visual fidelity, current text-to-image (T2I) diffusion models struggle to depict rare, complex, or culturally nuanced concepts due to training data limitations. We introduce RAVEL, a training-free framework that significantly improves rare concept generation, context-driven image editing, and self-correction by integrating graph-based retrieval-augmented generation (RAG) into diffusion pipelines. Unlike prior RAG and LLM-enhanced methods reliant on visual exemplars, static captions, or models' pretrained knowledge, RAVEL leverages structured knowledge graphs to retrieve compositional, symbolic, and relational context, enabling nuanced grounding even in the absence of visual priors. To further refine generation quality, we propose SRD, a novel self-correction module that iteratively updates prompts via multi-aspect alignment feedback, enhancing attribute accuracy, narrative coherence, and semantic fidelity. Our framework is model-agnostic and compatible with leading diffusion models including Stable Diffusion XL, Flux, and DALL-E 3. We conduct extensive evaluations across three newly proposed benchmarks - MythoBench, Rare-Concept-1K, and NovelBench. RAVEL consistently outperforms SOTA methods across perceptual, alignment, and LLM-as-a-Judge metrics. These results position RAVEL as a robust paradigm for controllable and interpretable T2I generation in long-tail domains.

Method

Paper Method Diagram

An overview of our framework: Our (a) image generation and (b) self-correction approaches leverage context-rich data from a knowledge graph to enhance image fidelity, as well as contextual and narrative coherence for complex characters.
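To make the retrieval step concrete, a minimal sketch of knowledge-graph-guided prompt enrichment is shown below. The toy triple store, `retrieve_context` function, and prompt template are all hypothetical illustrations, not the paper's actual knowledge graph or retriever.

```python
# Hypothetical sketch of knowledge-graph-guided prompt enrichment.
# The toy graph, triples, and template below are illustrative only.

def retrieve_context(graph, entity, max_facts=3):
    """Return (relation, object) facts about `entity` from a triple store."""
    facts = [(r, o) for (s, r, o) in graph if s == entity]
    return facts[:max_facts]

def enrich_prompt(user_prompt, entity, graph):
    """Append retrieved relational facts to a high-level user prompt."""
    facts = retrieve_context(graph, entity)
    if not facts:
        return user_prompt
    detail = "; ".join(f"{r} {o}" for r, o in facts)
    return f"{user_prompt}, depicted with {detail}"

# Toy knowledge graph stored as subject-relation-object triples.
KG = [
    ("Gandabherunda", "has", "two eagle heads"),
    ("Gandabherunda", "symbolizes", "immense strength"),
]

enriched = enrich_prompt("a painting of Gandabherunda", "Gandabherunda", KG)
```

The enriched prompt, rather than the raw user prompt, is then passed to the T2I backbone, which is what lets the generator ground attributes it would otherwise hallucinate.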

Qualitative Results

Rare Concept Generation

Rare Concept Generation

RAVEL enhances image generation by integrating contextual details often overlooked by standard models for a variety of domains. *Note that the reference images are shown solely for illustrative purposes and are not used by our framework.

Rare Concept Generation - Diverse Domains

Rare Concept Generation - Diverse Domains

RAVEL effectively generates complex mythological and fictional concepts without prior visual exemplars. The first three rows show concepts from global mythology, while the last two rows show characters from Project Gutenberg novels.

Self-Correcting RAG-Guided Diffusion (SRD)

Self-Correction Results

Our self-correction mechanism ensures accurate depictions of concepts via iterative, context-aware prompt refinement.
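The iterative loop can be sketched as follows. Here `generate`, `score_aspects`, and `refine` are hypothetical stand-ins for the diffusion backbone, the multi-aspect alignment feedback, and the prompt rewriter; this is an illustration of the control flow under those assumptions, not the actual SRD implementation.

```python
# Illustrative sketch of an SRD-style self-correction loop with toy stand-ins.

def generate(prompt):
    """Stand-in for a T2I model: echo the prompt as the 'image description'."""
    return prompt

def score_aspects(image_desc, context):
    """Toy alignment check: fraction of context facts reflected in the output."""
    hits = sum(1 for fact in context if fact in image_desc)
    return hits / len(context)

def refine(prompt, image_desc, context):
    """Append the context facts missing from the current output."""
    missing = [fact for fact in context if fact not in image_desc]
    return prompt + ", " + ", ".join(missing)

def self_correct(prompt, context, max_rounds=2, threshold=1.0):
    """Regenerate until all context facts are reflected or rounds run out."""
    image_desc = generate(prompt)
    for _ in range(max_rounds):
        if score_aspects(image_desc, context) >= threshold:
            break
        prompt = refine(prompt, image_desc, context)
        image_desc = generate(prompt)
    return prompt, image_desc

context = ["two eagle heads", "ornate plumage"]
final_prompt, final_desc = self_correct("a statue of Gandabherunda", context)
```

In the actual framework the feedback signal comes from alignment scoring against the retrieved KG context rather than simple substring checks, but the stop-when-aligned loop structure is the same idea.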

Editing

Editing Results

Our method enhances disentangled editing by adding relationally accurate elements without explicit instructions, while ControlNet either adds generic objects or fails to make any edit.

Qualitative Comparison

Qualitative Comparison

RAVEL effectively enhances different T2I models, such as SDXL, Flux, and DALL-E 3, to accurately generate complex, rare characters across diverse domains.

Qualitative Comparison

Qualitative comparison of RAVEL with other RAG-based T2I methods. We compare RAVEL with state-of-the-art RAG-based methods (Re-Imagen, ImageRAG, RDM) and baseline diffusion models (SDXL, Flux) across different rare concept categories, such as mythological characters (Gandabherunda), rare animals (saola, aye-aye lemur), and cultural artifacts (kapala bowl). Baseline and RAG methods frequently hallucinate incorrect attributes (single-headed birds, generic deer, wrong lemur species, ornate non-skull bowls) or generate visually plausible but contextually inaccurate variants, due to a lack of structured relational knowledge and because the available visual exemplars are inconsistent or rare. *Reference images are shown for illustration only and are not used by our method.

Quantitative Results

We conduct a comprehensive evaluation of RAVEL in two stages, assessing both its foundational RAG component and its image generation. We benchmark our approach against SOTA T2I models, such as Flux, SDXL, and DALL-E 3, across image generation and two rounds of self-correction.

Evaluation Across Diverse Benchmarks

We evaluate RAVEL across four benchmarks: the standard T2ICompBench and our three newly proposed benchmarks (MythoBench, Rare-Concept-1K, and NovelBench). These benchmarks target distinct challenges: compositional accuracy, symbolic complexity, fine-grained rarity, and zero-shot generalization, respectively. RAVEL consistently outperforms all baselines on metrics such as Attribute Accuracy, Context Relevance, and Visual Fidelity.

Evaluation Across Our New Benchmarks

Comparison with SOTA T2I Models

SOTA Benchmarking Table

We compare RAVEL's performance on rare concept generation with SOTA T2I models across key quantitative metrics. Our method significantly improves text-image alignment and attribute accuracy across multiple diffusion backbones.

BibTeX

@misc{venkatesh2025ravelrareconceptgeneration,
      title={RAVEL: Rare Concept Generation and Editing via Graph-driven Relational Guidance}, 
      author={Kavana Venkatesh and Yusuf Dalva and Ismini Lourentzou and Pinar Yanardag},
      year={2025},
      eprint={2412.09614},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09614}, 
}