Introduction
HyperGen provides a production-ready API server that serves diffusion models with an OpenAI-compatible interface. Deploy any diffusers model with request queuing, batching, and authentication.

Key Features
- OpenAI-Compatible: Drop-in replacement for OpenAI's image generation API. Use the official OpenAI Python client.
- Request Queue: Automatic request queuing and batching for optimal GPU utilization.
- LoRA Support: Load and serve models with LoRA adapters dynamically.
- Authentication: Optional API key authentication for secure deployments.
- Production-Ready: Built on FastAPI + uvicorn with async request handling.
- Easy Deployment: Single command to start serving any model.
Quick Start
Start a server in one command.

Architecture
Components
1. API Server
FastAPI-based HTTP server that handles incoming requests and responses.
- OpenAI-compatible endpoints
- Request validation with Pydantic
- Authentication middleware
- Health check endpoints
2. Request Queue
Thread-safe async queue that manages incoming generation requests.
- FIFO ordering
- Configurable max size
- Request tracking with unique IDs
- Future-based result delivery
3. Model Worker
Background worker that processes requests from the queue.
- Loads and manages the model
- Batch processing (future feature)
- LoRA loading and switching
- GPU memory management
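The queue and worker described above can be pictured with a short asyncio sketch. This is illustrative only, not HyperGen's actual internals: the names (submit, worker), the dict-based request record, and the model.generate interface are assumptions.

```python
import asyncio
import uuid

queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # configurable max queue size

async def submit(prompt: str):
    """Enqueue a request and wait for the worker to deliver the result."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    request = {"id": uuid.uuid4().hex, "prompt": prompt, "future": future}
    queue.put_nowait(request)  # raises asyncio.QueueFull when at capacity
    return await future        # future-based result delivery

async def worker(model) -> None:
    """Background worker: processes queued requests in FIFO order."""
    while True:
        request = await queue.get()
        try:
            images = model.generate(request["prompt"])  # assumed model interface
            request["future"].set_result(images)
        except Exception as exc:
            request["future"].set_exception(exc)
        finally:
            queue.task_done()
```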
OpenAI Compatibility
HyperGen implements OpenAI's image generation API.

Endpoint: POST /v1/images/generations
Standard OpenAI parameters:
| Parameter | Type | Description |
|---|---|---|
| prompt | string | Text prompt for generation |
| model | string | Model identifier (informational) |
| n | integer | Number of images (1-10) |
| size | string | Image size (e.g., "1024x1024") |
| response_format | string | "url" or "b64_json" |
Additional parameters (beyond the OpenAI spec):
| Parameter | Type | Default | Description |
|---|---|---|---|
| negative_prompt | string | None | Negative prompt |
| num_inference_steps | integer | 50 | Inference steps (1-150) |
| guidance_scale | float | 7.5 | Guidance scale (1.0-20.0) |
| seed | integer | None | Random seed for reproducibility |
| lora_path | string | None | Path to LoRA weights |
| lora_scale | float | 1.0 | LoRA strength (0.0-2.0) |
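For example, a request through the official OpenAI Python client might look like the following. The base_url, port, API key, and model name are assumptions for a local deployment; the HyperGen-specific parameters are passed via extra_body because they are not part of the OpenAI SDK's signature.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local HyperGen address
    api_key="YOUR_API_KEY",               # any placeholder if auth is disabled
)

result = client.images.generate(
    model="stabilityai/stable-diffusion-xl-base-1.0",  # informational
    prompt="a watercolor painting of a lighthouse at dusk",
    n=1,
    size="1024x1024",
    response_format="b64_json",
    extra_body={
        "negative_prompt": "blurry, low quality",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
        "seed": 42,
    },
)

image_b64 = result.data[0].b64_json  # standard OpenAI response shape
```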
Other Endpoints
GET /health
Health check endpoint for monitoring.
GET /v1/models
List available models (OpenAI-compatible).
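Because the endpoint is OpenAI-compatible, the same SDK can list models; the base URL and key are again assumptions for a local deployment.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")
for model in client.models.list():
    print(model.id)
```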
Request Flow
1. Client sends request to /v1/images/generations
2. Server validates request parameters
3. Server checks authentication (if enabled)
4. Request added to queue with unique ID
5. Worker picks up request from queue
6. Model generates images on GPU
7. Results returned to client via async future
8. Response formatted as OpenAI-compatible JSON
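The flow above maps naturally onto a FastAPI handler. The sketch below is illustrative, not HyperGen's implementation: it assumes the same in-process queue as the Components sketch, that a background worker consumes it, and that each generated image arrives as PNG-encoded bytes.

```python
import asyncio
import base64
import uuid
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # consumed by a background worker

class ImageRequest(BaseModel):  # request validation with Pydantic
    prompt: str
    n: int = 1
    size: str = "1024x1024"
    response_format: str = "b64_json"

@app.post("/v1/images/generations")
async def generate_images(req: ImageRequest):
    future = asyncio.get_running_loop().create_future()
    try:
        queue.put_nowait({"id": uuid.uuid4().hex, "prompt": req.prompt, "future": future})
    except asyncio.QueueFull:
        raise HTTPException(status_code=503, detail="Request queue is full")
    images = await future  # worker sets the result when generation finishes
    # Format the result as an OpenAI-compatible response body
    return {"data": [{"b64_json": base64.b64encode(img).decode()} for img in images]}
```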
Performance
Single Request Processing
Typical latency (SDXL, 50 steps, RTX 4090):
- Queue time: <10ms
- Generation time: ~3-5 seconds
- Total: ~3-5 seconds
Queue Management
The request queue handles multiple concurrent requests:
- Requests are processed FIFO (first in, first out)
- Max queue size configurable (default: 100)
- Queue full returns HTTP 503 (Service Unavailable)
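On the client side, a full queue surfaces as an HTTP 503, which the OpenAI SDK raises as an APIStatusError carrying the status code. A simple backoff-and-retry wrapper might look like this (illustrative; the helper name is hypothetical):

```python
import time
import openai

def generate_with_retry(client: openai.OpenAI, prompt: str, retries: int = 5):
    """Retry image generation when the server's queue is full (HTTP 503)."""
    for attempt in range(retries):
        try:
            return client.images.generate(prompt=prompt)
        except openai.APIStatusError as err:
            if err.status_code != 503 or attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off while the queue drains
```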
Batch Processing
Batch processing for multiple requests is coming in Phase 2. It will:
- Automatically batch compatible requests
- Process multiple prompts in one forward pass
- Configurable max batch size
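For context, the "one forward pass" part is something diffusers pipelines already support when given a list of prompts, so the server-side work is mainly grouping compatible requests. A standalone sketch (model ID and device are assumptions):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompts = ["a red fox in fresh snow", "a lighthouse at dusk"]
# One forward pass over the whole batch; returns one image per prompt.
images = pipe(prompt=prompts, num_inference_steps=30).images
```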
Deployment Scenarios
Local Development
Production Deployment
Behind Reverse Proxy
Docker Deployment
Official Docker images coming soon.
Roadmap
Phase 1 (Current)
- FastAPI server with OpenAI endpoints
- Request queue management
- Model worker
- API key authentication
- LoRA support (server flag)
- Complete inference implementation
Phase 2 (Planned)
- Request batching for multiple prompts
- Dynamic LoRA hot-swapping via API
- Metrics and monitoring endpoints
- Rate limiting
- Streaming responses
- Image-to-image endpoints
Phase 3 (Future)
- Multi-GPU serving
- Model caching and auto-scaling
- Load balancing across workers
- WebSocket support
- Video generation endpoints
Monitoring and Debugging
Health Checks
Check server health:
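For example, a simple probe with the requests library (host and port are assumptions for a local deployment):

```python
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()  # non-2xx (or a connection error) means the server is unhealthy
print(resp.status_code)
```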
Logging
HyperGen logs to stdout at INFO level by default.

Error Handling
The server returns standard HTTP status codes:
- 200 OK - Success
- 400 Bad Request - Invalid parameters
- 401 Unauthorized - Missing/invalid API key
- 500 Internal Server Error - Generation failed
- 503 Service Unavailable - Queue full
Security Considerations
Best Practices:
1. Use strong API keys (an example key generator is sketched after this list).
2. Run behind HTTPS:
   - Use nginx or similar reverse proxy
   - Enable SSL/TLS certificates
3. Firewall rules:
   - Restrict access to trusted IPs
   - Use VPN or internal network
4. Rate limiting:
   - Use nginx rate limiting
   - Or implement application-level limits
5. Monitor usage:
   - Track API usage
   - Alert on anomalies
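A strong key can be generated with the Python standard library; how it is then configured on the server (flag or environment variable) is deployment-specific and not shown here.

```python
import secrets

# ~256 bits of entropy, URL-safe; suitable as an API key value.
print(secrets.token_urlsafe(32))
```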