
Overview

Stable Diffusion 3 (SD3) is Stability AI’s latest-generation text-to-image model, built on a completely redesigned architecture. Compared with SDXL, it significantly improves text rendering, prompt understanding, and image quality.
SD3 uses a new Multimodal Diffusion Transformer (MMDiT) architecture, making it fundamentally different from SDXL’s UNet-based approach.

Key Features

Text Rendering

Outstanding text generation:
  • Accurate spelling in images
  • Multiple text elements
  • Various fonts and styles
  • Proper text integration

Prompt Understanding

Improved comprehension:
  • Better complex prompt handling
  • More accurate composition
  • Better spatial relationships
  • Fewer artifacts

Image Quality

Enhanced visuals:
  • Better detail preservation
  • Improved colors and lighting
  • More coherent compositions
  • Reduced common artifacts

Efficiency

Optimized performance:
  • VRAM use close to SDXL (~12GB vs ~9GB for generation)
  • Competitive speed
  • Better quality per step
  • Efficient training

Model Variants

SD3 Medium

The primary SD3 model optimized for quality and accessibility.
from hypergen import model

m = model.load("stabilityai/stable-diffusion-3-medium-diffusers")
m.to("cuda")
Specifications:
  • Parameters: 2B (transformer), 8B total with text encoders
  • Resolution: Native 1024x1024
  • VRAM: 12GB minimum, 16GB recommended
  • Architecture: Multimodal Diffusion Transformer (MMDiT)

SD3 Large (Coming Soon)

A larger variant with enhanced capabilities.
SD3 Large is not yet publicly available. Check the Stability AI website for release information.

What’s New in SD3

Architectural Changes

SD3 introduces several key differences from SDXL:
  • MMDiT Architecture
  • Text Encoders
  • Rectified Flow
MMDiT (Multimodal Diffusion Transformer):
  • Replaces UNet with transformer architecture
  • Processes text and image jointly
  • Better cross-modal understanding
  • More efficient attention mechanisms
Impact:
  • Superior text rendering
  • Better prompt comprehension
  • More coherent compositions
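To make "processes text and image jointly" concrete, here is a minimal single-head sketch in plain Python. It assumes a scheme in which each modality keeps its own query/key/value projections while attention runs over the concatenated token sequence. All function and variable names are illustrative, and a real MMDiT block adds multiple heads, normalization, and timestep conditioning that are omitted here.

```python
import math

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(tokens, proj):
    """Apply one modality's own q/k/v matrices to each of its tokens."""
    return [{name: matvec(proj[name], tok) for name in ("q", "k", "v")}
            for tok in tokens]

def joint_attention(text_tokens, image_tokens, text_proj, image_proj):
    """Single-head attention over the concatenated text + image sequence."""
    joint = project(text_tokens, text_proj) + project(image_tokens, image_proj)
    scale = 1.0 / math.sqrt(len(joint[0]["q"]))
    dim = len(joint[0]["v"])
    outputs = []
    for query in joint:
        # Every token scores against ALL tokens, regardless of modality.
        scores = [scale * sum(a * b for a, b in zip(query["q"], other["k"]))
                  for other in joint]
        weights = softmax(scores)
        outputs.append([sum(w * other["v"][d] for w, other in zip(weights, joint))
                        for d in range(dim)])
    n_text = len(text_tokens)
    # Split the joint result back into per-modality streams.
    return outputs[:n_text], outputs[n_text:]

# Toy example: 1 text token and 2 image tokens of width 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
proj = {"q": identity, "k": identity, "v": identity}
text_out, image_out = joint_attention(
    [[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]], proj, proj
)
```

Because the text token attends to the image tokens (and vice versa), changing the image tokens changes the text output, which is the cross-modal coupling the list above refers to.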

Improvements Over SDXL

Feature | SDXL | SD3
Text rendering | Poor | Excellent
Complex prompts | Good | Excellent
Spatial understanding | Good | Better
Architecture | UNet | Transformer
Text encoders | 2 (CLIP) | 3 (CLIP + T5)
Artifacts | Occasional | Fewer

Loading SD3 with HyperGen

Basic Loading

from hypergen import model

# Load SD3 Medium
m = model.load("stabilityai/stable-diffusion-3-medium-diffusers")
m.to("cuda")

# Generate an image
image = m.generate("A cat holding a sign that says 'HELLO WORLD'")
image[0].save("output.png")
SD3 excels at generating text in images. Try prompts that include signs, labels, or text elements!

Optimized Loading

For better performance:
from hypergen import model

# Load with fp16 or bf16 precision
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16",  # or "bfloat16"
    variant="fp16",
    use_safetensors=True,
)
m.to("cuda")

Memory-Optimized Loading

For 12GB VRAM GPUs:
from hypergen import model

m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Enable memory optimizations
m.enable_vae_slicing()
m.enable_attention_slicing()

Training LoRAs with SD3

SD3 supports LoRA training with HyperGen’s optimized pipeline.

Basic LoRA Training

from hypergen import model, dataset

# Load model
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Load dataset
ds = dataset.load("./my_images")

# Train LoRA
lora = m.train_lora(
    ds,
    steps=1000,
    rank=16,
    alpha=32,
    learning_rate=1e-4,
)
Quick Training (12GB VRAM)

For fast iteration:
lora = m.train_lora(
    ds,
    steps=800,
    learning_rate=1e-4,
    rank=16,
    alpha=32,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Settings:
  • Standard rank (16)
  • Works on 12GB VRAM
  • Training time: ~12 minutes (50 images)
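The `batch_size` and `gradient_accumulation_steps` settings above combine into an effective batch size. The sketch below works through that arithmetic for the quick-training configuration. It assumes that `steps` counts optimizer updates (this page does not say how HyperGen counts them), and the helper names are illustrative rather than part of any library.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps):
    """Number of samples that contribute to each optimizer update."""
    return batch_size * gradient_accumulation_steps

def passes_over_dataset(steps, batch_size, gradient_accumulation_steps, num_images):
    """Approximate epoch count, ASSUMING `steps` counts optimizer updates."""
    samples_seen = steps * effective_batch_size(batch_size, gradient_accumulation_steps)
    return samples_seen / num_images

# Quick-training settings from above, with a 50-image dataset.
quick_batch = effective_batch_size(1, 4)
quick_epochs = passes_over_dataset(800, 1, 4, 50)
print(quick_batch, quick_epochs)
```

If `steps` instead counts individual forward/backward passes, divide the epoch estimate by `gradient_accumulation_steps`.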

Training for Different Use Cases

Learning an artistic style:
lora = m.train_lora(
    ds,
    steps=1500,
    learning_rate=1e-4,
    rank=24,
    alpha=48,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Dataset:
  • 50-200 images in target style
  • Consistent aesthetic
  • High resolution (1024x1024+)
  • Captions describing content
Learning a specific subject:
lora = m.train_lora(
    ds,
    steps=1200,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Dataset:
  • 20-100 images of subject
  • Variety of poses and angles
  • Detailed captions
  • Different lighting conditions
Learning text rendering:
lora = m.train_lora(
    ds,
    steps=1000,
    learning_rate=1e-4,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
)
Dataset:
  • Images with various text elements
  • Different fonts and styles
  • Captions describing the text content
  • Variety of text placements

Inference Parameters

Basic Generation

image = m.generate(
    prompt="A neon sign that says 'Open 24/7' on a brick wall",
    negative_prompt="blurry, low quality",
    num_inference_steps=40,
    guidance_scale=7.0,
    height=1024,
    width=1024,
)

Parameter Guide

prompt (str, required)
Text description of the desired image. SD3 excels at complex, detailed prompts with multiple elements.

negative_prompt (str, default: "")
What to avoid in the generated image. Recommended:
negative_prompt="blurry, low quality, distorted"

num_inference_steps (int, default: 28)
Number of denoising steps. SD3’s default is 28 steps (vs 50 for SDXL):
  • 15-20: Fast, good quality
  • 28-40: Better quality (recommended)
  • 40-50: Highest quality

guidance_scale (float, default: 7)
How closely to follow the prompt. SD3 uses slightly lower guidance than SDXL:
  • 5-6: More creative
  • 7-8: Balanced (recommended)
  • 9-10: Very literal

Speed Priority

num_inference_steps=20
guidance_scale=6.5
Time: ~2.5s (RTX 4090)

Balanced

num_inference_steps=28
guidance_scale=7.0
Time: ~3.5s (RTX 4090)

Quality Priority

num_inference_steps=40
guidance_scale=7.5
Time: ~5s (RTX 4090)
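The three presets above can be kept in one place and merged with per-call overrides. This is a small sketch: `PRESETS` and `generation_kwargs` are hypothetical names, not part of HyperGen, and the values are copied from the presets listed above.

```python
# Preset values copied from the Speed / Balanced / Quality settings above.
PRESETS = {
    "speed": {"num_inference_steps": 20, "guidance_scale": 6.5},
    "balanced": {"num_inference_steps": 28, "guidance_scale": 7.0},
    "quality": {"num_inference_steps": 40, "guidance_scale": 7.5},
}

def generation_kwargs(preset="balanced", **overrides):
    """Return a copy of the preset with any explicit overrides applied."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    kwargs = dict(PRESETS[preset])  # copy so the shared preset is never mutated
    kwargs.update(overrides)
    return kwargs

kwargs = generation_kwargs("quality", height=1024, width=1024)
# Intended use, assuming `m` is a model loaded as shown earlier:
#   image = m.generate(prompt="...", **kwargs)
```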

Text Generation in Images

SD3’s standout feature is accurate text rendering:
# Simple text
image = m.generate(
    prompt='A coffee shop sign that says "FRESH COFFEE" in bold letters',
    num_inference_steps=28,
)

# Multiple text elements
image = m.generate(
    prompt='''
    A street scene with multiple signs: a red "STOP" sign,
    a blue "Main Street" street sign, and a neon "CAFE" sign
    ''',
    num_inference_steps=40,
)

# Stylized text
image = m.generate(
    prompt='A vintage poster with "SUMMER SALE" in retro typography',
    num_inference_steps=28,
)
For best text results:
  • Put text in quotes
  • Describe the text style (bold, neon, handwritten, etc.)
  • Specify the object containing the text (sign, poster, label)
  • Keep text relatively short (1-5 words)
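The tips above can be folded into a small prompt builder. This is an illustrative sketch: `build_text_prompt` is a hypothetical helper, and rejecting text longer than five words is a deliberately strict reading of the "keep text short" guideline that you may want to relax.

```python
def build_text_prompt(container, text, style=None, max_words=5):
    """Compose a prompt that quotes the text, names its container, and adds a style."""
    words = text.split()
    if not words:
        raise ValueError("text must not be empty")
    if len(words) > max_words:
        raise ValueError(f"text has {len(words)} words; keep it to {max_words} or fewer")
    prompt = f'A {container} that says "{text}"'
    if style:
        prompt += f" in {style} letters"
    return prompt

prompt = build_text_prompt("coffee shop sign", "FRESH COFFEE", style="bold")
print(prompt)
```

The resulting string matches the first example in this section and can be passed to `m.generate` as the `prompt` argument.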

Performance Benchmarks

Generation Performance

Based on NVIDIA RTX 4090, 1024x1024 resolution:
Steps | VRAM | Time | Quality
20 | ~12GB | ~2.5s | Good
28 | ~12GB | ~3.5s | Excellent
40 | ~12GB | ~5s | Excellent+
50 | ~12GB | ~6s | Outstanding
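Dividing each time in the table above by its step count shows that generation time scales roughly linearly with steps, at about 0.12 to 0.125 seconds per step on this hardware. The sketch below does that arithmetic and uses it for a rough estimate at other step counts. The figures are the approximate ones from the table, and the helper name is illustrative.

```python
# Steps -> approximate seconds, copied from the table above (RTX 4090, 1024x1024).
BENCHMARKS = {20: 2.5, 28: 3.5, 40: 5.0, 50: 6.0}

seconds_per_step = {steps: secs / steps for steps, secs in BENCHMARKS.items()}

def estimate_seconds(steps, per_step=0.125):
    """Rough linear estimate; ignores fixed costs such as text encoding and VAE decode."""
    return steps * per_step

for steps, rate in seconds_per_step.items():
    print(f"{steps} steps: {rate:.3f} s/step")
print(estimate_seconds(32))
```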

Training Performance

LoRA training on RTX 4090, 50 images:
Configuration | VRAM | Time (1000 steps)
Rank 16, Batch 1 | ~12GB | ~15 min
Rank 32, Batch 1 | ~14GB | ~18 min
Rank 64, Batch 1 | ~16GB | ~22 min
Rank 32, Batch 2 | ~18GB | ~20 min
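The per-1000-step times above can be scaled to other step counts for a rough planning estimate. This sketch assumes training time grows linearly with steps, which the table does not itself establish, and `estimate_training_minutes` is a hypothetical helper.

```python
# (rank, batch_size) -> approximate minutes per 1000 steps, from the table above.
MINUTES_PER_1000_STEPS = {
    (16, 1): 15,
    (32, 1): 18,
    (64, 1): 22,
    (32, 2): 20,
}

def estimate_training_minutes(rank, batch_size, steps):
    """Scale the benchmarked time linearly to a different step count."""
    base = MINUTES_PER_1000_STEPS[(rank, batch_size)]
    return base * steps / 1000

quick = estimate_training_minutes(16, 1, 800)       # quick-training configuration
character = estimate_training_minutes(32, 1, 1200)  # subject/character configuration
print(quick, character)
```

The 800-step, rank-16 estimate comes out at 12 minutes, in line with the quick-training figure quoted earlier on this page.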

Comparison with SDXL

Metric | SDXL | SD3
Generation time (28 steps) | ~3s | ~3.5s
VRAM (generation) | ~9GB | ~12GB
VRAM (training, rank 16) | ~9GB | ~12GB
Text rendering | Poor | Excellent
Overall quality | Excellent | Excellent+

Best Practices

Prompt Engineering for SD3

Complex Compositions

SD3 excels at complex scenes.
Good:
prompt = """
A bustling farmer's market with wooden stalls selling
fresh vegetables, a red awning overhead, people shopping,
warm afternoon sunlight, photorealistic
"""
SD3 better understands spatial relationships and multiple elements.

Training Best Practices

1. Dataset Preparation

Prepare high-quality data:
  • Use 1024x1024 or higher resolution
  • 20-150 images for most use cases
  • Ensure consistent quality
  • Remove duplicates
  • Include variety in poses/angles
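For the "remove duplicates" step, a standard-library script can flag byte-identical files before training. This is a minimal sketch: it only catches exact copies (not resized or re-encoded near-duplicates), and the function name and suffix list are illustrative choices.

```python
import hashlib
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}

def find_exact_duplicates(folder):
    """Return groups of image file names in `folder` whose bytes are identical."""
    by_digest = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest.setdefault(digest, []).append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]

# Example use (the folder name is a placeholder):
#   for group in find_exact_duplicates("./my_images"):
#       print("duplicates:", group)
```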
2. Caption Quality

Write effective captions.
Good caption:
A person wearing a blue jacket and jeans, standing
in front of a brick wall, natural daylight from the
left, neutral expression, looking at camera
Poor caption:
person
Tips for SD3:
  • Describe spatial relationships
  • Include text content if present
  • Describe lighting and colors
  • Be detailed but natural
3. Hyperparameter Selection

Start with recommended settings:
steps = 1000
learning_rate = 1e-4
rank = 16          # anywhere from 16 to 32, depending on VRAM
alpha = 2 * rank   # keep alpha at twice the rank
batch_size = 1
gradient_accumulation_steps = 4
Adjust based on results and VRAM.
4. Monitoring Progress

Save and test checkpoints:
lora = m.train_lora(
    ds,
    steps=2000,
    save_steps=500,
    output_dir="./checkpoints"
)
Test multiple checkpoints to find optimal stopping point.
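To plan which checkpoints to compare, it helps to list the steps at which they would be written. The sketch below assumes a checkpoint is saved every `save_steps` steps up to and including the final step; this page does not state HyperGen's exact behavior or file naming, so only the step numbers are computed, and `checkpoint_steps` is a hypothetical helper.

```python
def checkpoint_steps(total_steps, save_steps):
    """Steps at which a checkpoint is ASSUMED to be written (every `save_steps`)."""
    if save_steps <= 0:
        raise ValueError("save_steps must be positive")
    return list(range(save_steps, total_steps + 1, save_steps))

steps_to_test = checkpoint_steps(2000, 500)
print(steps_to_test)
```

Generating the same prompt and seed from each of these checkpoints makes it easier to see where the LoRA starts to overfit.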

Memory Optimization

m.enable_vae_slicing()
  • Reduces VRAM by ~10-15%
  • Minimal performance impact
  • Recommended for all users
m.enable_attention_slicing()
  • Reduces VRAM by ~15-20%
  • Small performance impact
  • Useful for 12GB GPUs
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
  • Reduces VRAM by ~40-50% compared with float32
  • Minimal quality impact
  • Strongly recommended
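The saving from half precision follows from bytes per parameter. This sketch estimates weight memory only for the 2B-parameter transformer listed in the specifications; it ignores activations, the text encoders, and the VAE, so real-world savings will differ from the ideal 50%. The helper name is illustrative, and GB here means 10^9 bytes.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

fp32_gb = weight_memory_gb(2e9, "float32")
fp16_gb = weight_memory_gb(2e9, "float16")
saving = 1 - fp16_gb / fp32_gb
print(f"float32: {fp32_gb} GB, float16: {fp16_gb} GB, saving: {saving:.0%}")
```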

Troubleshooting

Solutions for out-of-memory errors during generation:
  1. Use float16 precision:
    m = model.load(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype="float16"
    )
    
  2. Enable memory optimizations:
    m.enable_vae_slicing()
    m.enable_attention_slicing()
    
  3. Reduce resolution:
    image = m.generate(prompt, height=768, width=768)
    
Solutions for out-of-memory errors during training:
  1. Reduce LoRA rank:
    lora = m.train_lora(ds, rank=16, alpha=32)
    
  2. Use gradient accumulation:
    lora = m.train_lora(ds, batch_size=1, gradient_accumulation_steps=8)
    
  3. Use float16:
    m = model.load(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype="float16"
    )
    
Tips for better text:
  1. Put text in quotes:
    prompt = 'A sign that says "OPEN"'
    
  2. Describe the text container:
    prompt = 'A wooden sign with "WELCOME" carved in it'
    
  3. Keep text short (1-5 words)
  4. Increase inference steps:
    image = m.generate(prompt, num_inference_steps=40)
    
  5. Adjust guidance:
    image = m.generate(prompt, guidance_scale=8.0)
    
Solutions for poor LoRA training results:
  1. Increase training steps:
    lora = m.train_lora(ds, steps=1500)
    
  2. Improve dataset quality:
    • Add more images
    • Write better captions
    • Use higher resolution
  3. Adjust learning rate:
    lora = m.train_lora(ds, learning_rate=5e-5)
    
  4. Increase LoRA rank:
    lora = m.train_lora(ds, rank=32, alpha=64)
    

Example Workflows

Text-Rich Image Generation

from hypergen import model

# Load SD3
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Generate image with text
image = m.generate(
    prompt='''
    A vintage travel poster with "VISIT PARIS" in bold art deco
    letters at the top, the Eiffel Tower in the background,
    warm sunset colors
    ''',
    num_inference_steps=28,
    guidance_scale=7.0,
)

image[0].save("paris_poster.png")

LoRA Training for Character

from hypergen import model, dataset

# Load model
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Load character dataset
ds = dataset.load("./character_images")

# Train character LoRA
lora = m.train_lora(
    ds,
    steps=1200,
    learning_rate=5e-5,
    rank=32,
    alpha=64,
    batch_size=1,
    gradient_accumulation_steps=4,
    save_steps=400,
    output_dir="./character_lora"
)

print("Character LoRA training complete!")

Batch Generation with SD3

from hypergen import model

# Load SD3
m = model.load(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype="float16"
)
m.to("cuda")

# Generate multiple images
prompts = [
    'A storefront with "BOOKS" written above the door',
    'A coffee cup with "MORNING" printed on it',
    'A street sign that says "MAIN ST"',
    'A neon sign saying "PIZZA" in red letters',
]

for i, prompt in enumerate(prompts):
    image = m.generate(
        prompt=prompt,
        num_inference_steps=28,
        guidance_scale=7.0,
    )
    image[0].save(f"text_image_{i}.png")

print(f"Generated {len(prompts)} images with text")

SD3 vs SDXL: When to Use Which

Use SD3 When

SD3 is better for:
  • Text in images (signs, labels, posters)
  • Complex compositions with multiple elements
  • Precise spatial relationships
  • Detailed scene understanding
  • Latest technology and improvements
Example use cases:
  • Product mockups with labels
  • Signage and branding
  • Posters and advertisements
  • Complex scene compositions

GPU Requirements

Minimum

VRAM: 12GB
GPUs:
  • RTX 3060 (12GB)
  • RTX 4070
Capabilities:
  • Generation: 1024x1024 
  • Training: Rank 16 
  • Batch size: 1 

Recommended

VRAM: 16GB
GPUs:
  • RTX 4080
  • RTX 4090
  • A10
Capabilities:
  • Generation: 1024x1024 
  • Training: Rank 32 
  • Batch size: 1-2 

Optimal

VRAM: 24GB+
GPUs:
  • RTX 4090
  • A100
  • H100
Capabilities:
  • Generation: Up to 2048x2048 
  • Training: Rank 64+ 
  • Batch size: 2-4 
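The three tiers above can be turned into a starting-point lookup for LoRA settings. This is a sketch under stated assumptions: `recommended_lora_settings` is a hypothetical helper, it picks the lower end of each batch-size range, and it applies the alpha = 2 × rank rule used elsewhere on this page.

```python
def recommended_lora_settings(vram_gb):
    """Map available VRAM to the rank/batch tier listed above."""
    if vram_gb < 12:
        raise ValueError("below the 12GB minimum listed for SD3 Medium")
    if vram_gb >= 24:
        rank, batch_size = 64, 2   # Optimal tier
    elif vram_gb >= 16:
        rank, batch_size = 32, 1   # Recommended tier
    else:
        rank, batch_size = 16, 1   # Minimum tier
    return {"rank": rank, "alpha": 2 * rank, "batch_size": batch_size}

settings = recommended_lora_settings(16)
print(settings)
# Intended use, assuming `m` and `ds` are loaded as in the training examples:
#   lora = m.train_lora(ds, **settings)
```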
