
Introduction

HyperGen provides a production-ready API server that serves diffusion models with an OpenAI-compatible interface. Deploy any diffusers model with request queuing, batching, and authentication.

Key Features

OpenAI-Compatible

Drop-in replacement for OpenAI’s image generation API. Use the official OpenAI Python client.

Request Queue

Automatic request queuing and batching for optimal GPU utilization.

LoRA Support

Load and serve models with LoRA adapters dynamically.

Authentication

Optional API key authentication for secure deployments.

Production-Ready

Built on FastAPI + uvicorn with async request handling.

Easy Deployment

Single command to start serving any model.

Quick Start

Start a server in one command:
hypergen serve stabilityai/stable-diffusion-xl-base-1.0
Then use it from Python:
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # No auth by default
    base_url="http://localhost:8000/v1"
)

response = client.images.generate(
    model="sdxl",
    prompt="A cat holding a sign that says hello world",
    n=1,
    size="1024x1024"
)
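If you request base64 output (response_format="b64_json"), the returned image data can be decoded and saved directly. A minimal sketch, continuing from the response object above:

import base64

# Assumes the request above was made with response_format="b64_json"
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)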

Architecture

Components

1. API Server

FastAPI-based HTTP server that handles incoming requests and responses.
  • OpenAI-compatible endpoints
  • Request validation with Pydantic
  • Authentication middleware
  • Health check endpoints
2. Request Queue

Thread-safe async queue that manages incoming generation requests (see the sketch after this list).
  • FIFO ordering
  • Configurable max size
  • Request tracking with unique IDs
  • Future-based result delivery
3. Model Worker

Background worker that processes requests from the queue.
  • Loads and manages the model
  • Batch processing (future feature)
  • LoRA loading and switching
  • GPU memory management
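The queue and worker described above can be sketched with asyncio primitives. The outline below is illustrative rather than HyperGen's actual implementation, and every name in it is hypothetical:

import asyncio
import uuid

class RequestQueue:
    # Bounded FIFO queue with future-based result delivery.
    def __init__(self, max_size: int = 100):
        self._queue = asyncio.Queue(maxsize=max_size)   # FIFO ordering, configurable max size

    async def submit(self, payload: dict) -> asyncio.Future:
        request_id = uuid.uuid4().hex                   # unique ID for request tracking
        future = asyncio.get_running_loop().create_future()
        try:
            self._queue.put_nowait((request_id, payload, future))
        except asyncio.QueueFull:
            raise RuntimeError("queue full")            # the server maps this to HTTP 503
        return future

    async def worker(self, generate_fn):
        # Background worker: pull requests off the queue and run the model.
        while True:
            request_id, payload, future = await self._queue.get()
            try:
                future.set_result(generate_fn(**payload))   # model forward pass on the GPU
            except Exception as exc:
                future.set_exception(exc)
            finally:
                self._queue.task_done()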

OpenAI Compatibility

HyperGen implements OpenAI’s image generation API:

Endpoint: /v1/images/generations

Standard OpenAI parameters:
Parameter        Type     Description
prompt           string   Text prompt for generation
model            string   Model identifier (informational)
n                integer  Number of images (1-10)
size             string   Image size (e.g., "1024x1024")
response_format  string   "url" or "b64_json"

HyperGen extensions:
Parameter            Type     Default  Description
negative_prompt      string   None     Negative prompt
num_inference_steps  integer  50       Inference steps (1-150)
guidance_scale       float    7.5      Guidance scale (1.0-20.0)
seed                 integer  None     Random seed for reproducibility
lora_path            string   None     Path to LoRA weights
lora_scale           float    1.0      LoRA strength (0.0-2.0)
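The official OpenAI Python client does not expose these extension parameters as named arguments, but it can forward them in the request body via its extra_body option. A minimal sketch (the parameter values are arbitrary examples):

response = client.images.generate(
    model="sdxl",
    prompt="A watercolor painting of a lighthouse",
    size="1024x1024",
    extra_body={
        "negative_prompt": "blurry, low quality",  # HyperGen extension parameters
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
        "seed": 42,
    },
)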

Other Endpoints

Endpoint: /health
Health check endpoint for monitoring. Response:
{
  "status": "healthy",
  "model": "stabilityai/stable-diffusion-xl-base-1.0",
  "queue_size": 0,
  "device": "cuda"
}

Endpoint: /v1/models
List available models (OpenAI-compatible). Response:
{
  "object": "list",
  "data": [
    {
      "id": "stabilityai/stable-diffusion-xl-base-1.0",
      "object": "model",
      "created": 1234567890,
      "owned_by": "hypergen"
    }
  ]
}
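Both endpoints can be polled with any HTTP client; for example, with the Python requests library:

import requests

health = requests.get("http://localhost:8000/health").json()
print(health["status"], health["queue_size"])

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])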

Request Flow

  1. Client sends request to /v1/images/generations (see the example after this list)
  2. Server validates request parameters
  3. Server checks authentication (if enabled)
  4. Request added to queue with unique ID
  5. Worker picks up request from queue
  6. Model generates images on GPU
  7. Results returned to client via async future
  8. Response formatted as OpenAI-compatible JSON
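The same flow can be exercised without the OpenAI client by posting directly to the endpoint. A minimal sketch, assuming a local server without authentication:

import requests

payload = {
    "prompt": "A cat holding a sign that says hello world",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json",
}
resp = requests.post("http://localhost:8000/v1/images/generations", json=payload)
resp.raise_for_status()        # 400/401/503 errors surface here
data = resp.json()["data"]     # OpenAI-compatible response body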

Performance

Single Request Processing

Typical latency (SDXL, 50 steps, RTX 4090):
  • Queue time: <10ms
  • Generation time: ~3-5 seconds
  • Total: ~3-5 seconds

Queue Management

The request queue handles multiple concurrent requests:
  • Requests are processed FIFO (first in, first out)
  • Max queue size configurable (default: 100)
  • A full queue returns HTTP 503 (Service Unavailable); see the retry sketch below
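Because a full queue is reported as HTTP 503, clients can simply retry with backoff. A minimal sketch (the endpoint URL and timings are illustrative):

import time
import requests

def generate_with_retry(payload: dict, retries: int = 5, backoff: float = 2.0) -> dict:
    for attempt in range(retries):
        resp = requests.post("http://localhost:8000/v1/images/generations", json=payload)
        if resp.status_code != 503:            # anything other than "queue full"
            resp.raise_for_status()
            return resp.json()
        time.sleep(backoff * (attempt + 1))    # linear backoff before retrying
    raise RuntimeError("server queue stayed full; giving up")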

Batch Processing

Batch processing for multiple requests is coming in Phase 2.
Future feature:
  • Automatically batch compatible requests
  • Process multiple prompts in one forward pass
  • Configurable max batch size

Deployment Scenarios

Local Development

hypergen serve stabilityai/stable-diffusion-xl-base-1.0
No authentication, localhost only.

Production Deployment

hypergen serve stabilityai/stable-diffusion-xl-base-1.0 \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key \
  --max-queue-size 100
With authentication and external access.
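When the server is started with --api-key, clients authenticate by passing the same value as their OpenAI API key. A minimal sketch (host and key are placeholders):

from openai import OpenAI

client = OpenAI(
    api_key="your-secret-key",                 # must match the server's --api-key
    base_url="http://your-server:8000/v1",     # external host and port from --host/--port
)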

Behind Reverse Proxy

# HyperGen server
hypergen serve model_id --host 127.0.0.1 --port 8000

# Nginx config
location /v1/ {
    proxy_pass http://localhost:8000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

Docker Deployment

Official Docker images coming soon.
Example Dockerfile:
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# The CUDA runtime image ships without Python, so install it first
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

RUN pip3 install hypergen

CMD ["hypergen", "serve", "stabilityai/stable-diffusion-xl-base-1.0", "--host", "0.0.0.0"]

Roadmap

Phase 1 (Current)

  • FastAPI server with OpenAI endpoints
  • Request queue management
  • Model worker
  • API key authentication
  • LoRA support (server flag)
  • Complete inference implementation (in progress)

Phase 2 (Planned)

  • Request batching for multiple prompts
  • Dynamic LoRA hot-swapping via API
  • Metrics and monitoring endpoints
  • Rate limiting
  • Streaming responses
  • Image-to-image endpoints

Phase 3 (Future)

  • Multi-GPU serving
  • Model caching and auto-scaling
  • Load balancing across workers
  • WebSocket support
  • Video generation endpoints

Monitoring and Debugging

Health Checks

Check server health:
curl http://localhost:8000/health

Logging

HyperGen logs to stdout with INFO level by default:
2024-01-15 10:30:45 - INFO - Starting HyperGen server...
2024-01-15 10:30:45 - INFO - Model: stabilityai/stable-diffusion-xl-base-1.0
2024-01-15 10:30:50 - INFO - Server ready!
2024-01-15 10:31:00 - INFO - Request abc123 queued (queue size: 1)

Error Handling

The server returns standard HTTP status codes:
  • 200 OK - Success
  • 400 Bad Request - Invalid parameters
  • 401 Unauthorized - Missing/invalid API key
  • 500 Internal Server Error - Generation failed
  • 503 Service Unavailable - Queue full

Security Considerations

Always use API keys in production deployments!
Best Practices:
  1. Use strong API keys:
    # Generate a secure key
    openssl rand -hex 32
    
  2. Run behind HTTPS:
    • Use nginx or similar reverse proxy
    • Enable SSL/TLS certificates
  3. Firewall rules:
    • Restrict access to trusted IPs
    • Use VPN or internal network
  4. Rate limiting:
    • Use nginx rate limiting
    • Or implement application-level limits
  5. Monitor usage:
    • Track API usage
    • Alert on anomalies

Next Steps