Why Background Removal is Harder to Scale Than Generative AI Models

Technical · Infrastructure · AI · Performance

A deep dive into the technical challenges of scaling background removal APIs compared to generative models like Stable Diffusion.

Marwen.T, Lead Software Engineer
October 21, 2025 · 10 min read

When people think about AI-powered image processing, they often assume all models face similar infrastructure challenges. However, after building and scaling RemBG to process millions of images, we've learned that background removal presents fundamentally different—and harder—scaling problems than generative models like Stable Diffusion.

The Hidden Complexity of Background Removal at Scale

At first glance, both background removal and text-to-image generation seem like similar problems: feed an input to a neural network, get a processed result. But when you're handling thousands of requests per second, the differences become painfully obvious.

The Core Problem: Unpredictable Input Dimensions

Users upload images of wildly different sizes:

  • Smartphones: 4032×3024 (iPhone, 12 MP), 4000×3000 (Samsung)
  • DSLRs: 6000×4000 (24 MP), 8000×6000, even larger
  • Screenshots: 1920×1080, 2560×1440, 3840×2160
  • Social media: 1080×1080 (Instagram), 1200×628 (Facebook)
  • Product photos: Literally any resolution imaginable

Generative Models: The Luxury of Known Dimensions

When you're running Stable Diffusion or similar generative models, you have a massive advantage: you know the output dimensions upfront. Users typically request:

  • 512×512 pixels
  • 768×768 pixels
  • 1024×1024 pixels
  • Or other predefined aspect ratios

This predictability is a game-changer for infrastructure optimization. Here's why:

# Generative AI - easy batching: every request in a batch shares the same shape
batch_512 = [request1, request2, request3, ...]   # All 512x512
batch_1024 = [request4, request5, ...]            # All 1024x1024

# Process each batch efficiently on the GPU
results_512 = model.generate(batch_512)
results_1024 = model.generate(batch_1024)

With known dimensions, you can:

  • Batch requests perfectly - Group all 512×512 requests together and process them in one GPU operation
  • Pre-allocate memory - Know exactly how much VRAM you need before processing (see the rough sketch after this list)
  • Optimize tensor operations - No padding, no resizing, just pure computational efficiency
  • Predict processing time accurately - Each batch has consistent compute requirements
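To make the second point concrete, here is a rough sketch of the arithmetic: when every tensor in a batch has the same shape, the VRAM needed just for the batch's input tensor is known before anything reaches the GPU. The sketch assumes float32 RGB inputs and ignores model weights and activations for brevity.

def batch_input_vram_mb(batch_size: int, height: int, width: int,
                        channels: int = 3, bytes_per_value: int = 4) -> float:
    """Rough VRAM needed just for the input tensor of a fixed-shape batch."""
    return batch_size * channels * height * width * bytes_per_value / 1e6

print(batch_input_vram_mb(32, 512, 512))    # ~100 MB for a 32 x 512x512 batch
print(batch_input_vram_mb(16, 1024, 1024))  # ~201 MB for a 16 x 1024x1024 batch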

Background Removal: The Wild West of Dimensions

Now compare this to background removal. As the list above shows, users upload images of wildly different shapes and sizes, and they all land in the same request queue.

Here's a real sample of consecutive requests we received:

Request | Dimensions | Size | Type
#1 | 4032×3024 | 12.2 MP | iPhone Photo
#2 | 800×600 | 0.5 MP | Thumbnail
#3 | 6000×4000 | 24 MP | DSLR
#4 | 1080×1920 | 2.1 MP | Mobile Portrait
#5 | 3000×2000 | 6 MP | Product Shot
#6 | 450×800 | 0.4 MP | Small Image

The real problem isn't arrival timing — it's shape instability.

Yes, we can queue requests for 50–150 ms to form micro-batches. That's not the blocker. What breaks performance is batching heterogeneous H×W tensors. High-performance inference paths (TensorRT profiles, cuDNN autotune, CUDA Graphs, even Torch/Inductor) deliver peak throughput only with fixed or tightly bounded shapes. Mix shapes in a batch and you force plan swaps, re-tuning, extra copies, and wasteful padding — throughput tanks even if you "just queue them."
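To make the shape problem concrete, here is a minimal PyTorch sketch (illustrative only, with deliberately scaled-down dimensions): tensors with different H×W simply cannot be stacked into one batch, and padding them to a common extent means the GPU spends most of its time on pixels that carry no information.

import torch
import torch.nn.functional as F

# Two "uploads" with different spatial extents (C, H, W)
small = torch.rand(3, 600, 800)     # ~0.5 MP thumbnail
large = torch.rand(3, 2000, 3000)   # ~6 MP product shot

# Direct batching fails: stack() requires identical shapes
try:
    torch.stack([small, large])
except RuntimeError as err:
    print(err)  # "stack expects each tensor to be equal size ..."

# The only way to batch them as-is is to pad the small image up to the large
# one; over 90% of the padded tensor is then zeros the GPU still processes.
pad_right, pad_bottom = 3000 - 800, 2000 - 600
padded_small = F.pad(small, (0, pad_right, 0, pad_bottom))
batch = torch.stack([padded_small, large])  # shape: (2, 3, 2000, 3000)
print(batch.shape)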

Why naive batching fails with mixed sizes

  1. Kernel/plan selection is shape-specific. Change H×W and the backend reselects kernels or swaps engines.

  2. Memory planning collapses. Workspaces and VRAM are sized per shape; padding to the largest image explodes memory.

  3. Execution cache thrash. Dynamic-shape engines still expect bounded ranges; unbounded heterogeneity triggers cache misses.

  4. CUDA Graphs need static extents. Varying dims = multiple graphs or no capture → higher launch overhead.

What actually works in production

Queue + size buckets + bounded caps + short max-wait. We downscale each image (preserving aspect ratio) to its bucket's cap, pad to that cap (not to the largest image in the wild), and reuse prebuilt engines/profiles per bucket. That keeps GPU utilization high without OOMs or latency spikes.
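For the "prebuilt engines/profiles per bucket" part, the idea is to give the builder one tightly bounded optimization profile per bucket so a request never forces re-tuning at serving time. The sketch below uses TensorRT's Python API purely as an illustration; the input name "input", the bucket caps, and the batch caps are assumptions, not our production configuration. The bucketing and padding side is shown in full later in the post.

import tensorrt as trt

# Illustrative sketch only: one optimization profile per size bucket so each
# bucket always hits a pre-tuned execution path. Engine building is omitted.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

BUCKET_CAPS = {"small": 1024, "medium": 2048, "large": 4096}   # assumed caps
BATCH_CAPS = {"small": 32, "medium": 16, "large": 8}           # assumed batch sizes

for bucket, cap in BUCKET_CAPS.items():
    profile = builder.create_optimization_profile()
    batch = BATCH_CAPS[bucket]
    # set_shape(name, min, opt, max): keep the shape range tight per bucket
    profile.set_shape("input", (1, 3, 256, 256), (batch, 3, cap, cap), (batch, 3, cap, cap))
    config.add_optimization_profile(profile)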

Why This Destroys Traditional Batching

The Naive Approach (That Doesn't Work)

When we first launched, we tried the obvious solution:

# DON'T DO THIS - It's terrible
def naive_batch_processing(requests):
    # Find the largest image in the batch
    max_width = max(req.width for req in requests)
    max_height = max(req.height for req in requests)

    # Resize all images to match the largest
    padded_batch = []
    for req in requests:
        padded_image = pad_to_size(req.image, max_width, max_height)
        padded_batch.append(padded_image)

    # Process the batch
    results = model.process(padded_batch)

    # Crop results back to original sizes
    return [crop_to_original(result, req) for result, req in zip(results, requests)]

This approach has catastrophic problems:

  1. Memory explosion: If you have 5 small images (800×600) and 1 huge image (6000×4000), you're padding 5 images to 6000×4000. That's roughly 24 megapixels × 5 images of wasted memory: in float32 RGB, each padded tensor is about 288 MB carrying under 6 MB of real content.

  2. Wasted computation: The GPU is processing millions of padding pixels that contribute nothing to the final result.

  3. Unpredictable GPU memory: You never know when a massive image will join the batch, causing OOM errors.

  4. Latency spikes: Small images get delayed waiting for large images to process.

What About Processing One at a Time?

# Also bad - Terrible GPU utilization
def sequential_processing(requests):
    results = []
    for req in requests:
        result = model.process(req.image)  # Process individually
        results.append(result)
    return results

This "solution" wastes your expensive GPU resources. Modern GPUs (A100, H100) are designed for parallel processing. Running images one-by-one is like using a Ferrari to commute 5 miles—technically it works, but you're wasting 90% of its capability.

GPU utilization drops from 95%+ to under 30% with sequential processing.

Our Solution: Dynamic Smart Batching

After months of optimization, we developed a dynamic batching system that achieves near-optimal GPU utilization while handling arbitrary image sizes.

The Three-Tier Architecture

Our solution categorizes images into size buckets and processes each bucket optimally:

class DynamicBatcher:
    def __init__(self, model):
        self.model = model              # background-removal model wrapper
        self.size_buckets = {
            'small': [],    # < 1 MP
            'medium': [],   # 1-5 MP
            'large': [],    # 5-15 MP
            'xlarge': []    # > 15 MP
        }
        self.bucket_thresholds = {
            'small': (1024, 1024),
            'medium': (2048, 2048),
            'large': (4096, 4096),
            'xlarge': (8192, 8192)
        }

    def categorize_request(self, request):
        megapixels = (request.width * request.height) / 1_000_000
        if megapixels < 1:
            return 'small'
        elif megapixels < 5:
            return 'medium'
        elif megapixels < 15:
            return 'large'
        else:
            return 'xlarge'

    async def process_batch(self, bucket_name):
        bucket = self.size_buckets[bucket_name]
        max_dims = self.bucket_thresholds[bucket_name]

        # Resize images to the bucket's max dimensions
        resized_batch = [
            resize_preserve_aspect(req.image, max_dims)
            for req in bucket
        ]

        # Pad to a uniform size within the bucket
        padded_batch = [
            pad_to_size(img, max_dims)
            for img in resized_batch
        ]

        # Process efficiently on GPU
        results = self.model.process(padded_batch)

        # Restore to original sizes, then release the bucket
        restored = [
            restore_original_size(result, req.original_size)
            for result, req in zip(results, bucket)
        ]
        bucket.clear()
        return restored
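For context, here is how a single request might flow into the batcher above. This is a minimal sketch: the Request class, segmentation_model, and user_upload are stand-ins, not our production objects.

class Request:
    """Hypothetical stand-in for our internal request object."""
    def __init__(self, image, width, height):
        self.image = image
        self.width = width
        self.height = height
        self.original_size = (width, height)
        self.wait_time = 0  # ms since enqueue, updated by the queue

batcher = DynamicBatcher(model=segmentation_model)          # segmentation_model: placeholder
req = Request(image=user_upload, width=4032, height=3024)   # ~12.2 MP iPhone photo

bucket = batcher.categorize_request(req)    # -> 'large'
batcher.size_buckets[bucket].append(req)    # queued until the bucket triggers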

Key Optimizations

1. Adaptive Batch Sizing

Different size categories get different batch sizes based on VRAM requirements:

BATCH_SIZES = {
    'small': 32,    # Can fit many small images
    'medium': 16,   # Moderate batch size
    'large': 8,     # Fewer large images
    'xlarge': 2     # Process huge images carefully
}

2. Timeout-Based Triggering

Don't wait forever for a bucket to fill:

import asyncio

# Runs as a method on DynamicBatcher (one loop per bucket)
async def smart_batch_trigger(self, bucket_name):
    bucket = self.size_buckets[bucket_name]
    max_batch = BATCH_SIZES[bucket_name]
    max_wait_ms = 100  # Don't wait more than 100ms

    while True:
        if len(bucket) >= max_batch:
            # Bucket is full, process now
            await self.process_batch(bucket_name)
        elif len(bucket) > 0 and bucket[0].wait_time > max_wait_ms:
            # Has requests and the oldest is waiting too long
            await self.process_batch(bucket_name)
        await asyncio.sleep(0.01)  # Check every 10 ms

3. Intelligent Image Resizing

Before batching, we resize images to their bucket's maximum dimensions while preserving aspect ratio:

from PIL import Image

def resize_preserve_aspect(image, max_dims):
    max_w, max_h = max_dims
    img_w, img_h = image.size

    # Calculate scaling factor
    scale = min(max_w / img_w, max_h / img_h)

    if scale >= 1:
        # Image is smaller than the bucket max, keep original
        return image

    # Resize down to fit within the bucket
    new_w = int(img_w * scale)
    new_h = int(img_h * scale)
    return image.resize((new_w, new_h), Image.LANCZOS)
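The pad_to_size helper used alongside it isn't shown in this post. A minimal PIL-based sketch, assuming the (image, (max_w, max_h)) signature from the batcher above and an RGBA canvas so the padding stays transparent, would look like this:

from PIL import Image

def pad_to_size(image, max_dims):
    """Place the image at the top-left of a transparent canvas sized to the bucket cap."""
    max_w, max_h = max_dims
    canvas = Image.new("RGBA", (max_w, max_h), (0, 0, 0, 0))
    canvas.paste(image, (0, 0))
    return canvas

In practice the helper would also need to track the placement and scale so the resulting mask can be mapped back to the original image extent afterwards.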

The Results: Performance at Scale

The impact of our dynamic batching system was dramatic:

Metric | Before | After | Improvement
GPU Utilization | 28-45% (variable) | 82-94% (consistent) | 3× better
Average Latency | 2.8 s per image | 0.9 s per image | 68% faster
P95 Latency | 12.5 s (spikes!) | 2.1 s (predictable) | 83% faster
Throughput | ~180 img/sec | ~520 img/sec | 3× higher
OOM Errors | 2-3 per day | Zero (6 months) | 100% eliminated

Bottom line: We tripled throughput while dramatically reducing latency and eliminating crashes.

Additional Optimizations We Implemented

1. Model Quantization

We use INT8 quantization for our models, reducing memory footprint by 4× with minimal accuracy loss (<0.5% decrease in mIoU).
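As a rough illustration of the tooling involved (the file names below are placeholders, and our production pipeline differs in the details), ONNX Runtime's post-training quantization is one common entry point. For conv-heavy segmentation models you would typically use static quantization with a calibration set to realize speedups; the dynamic variant below is simply the shortest example of the idea.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Placeholder file names, for illustration only
quantize_dynamic(
    model_input="rembg_fp32.onnx",
    model_output="rembg_int8.onnx",
    weight_type=QuantType.QInt8,   # store weights as INT8 (~4x smaller than float32)
)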

2. Multi-Model Pipeline

Different image types use different optimized models; a minimal routing sketch follows the list:

  • People/portraits: High-accuracy model with attention to hair and fine details
  • Products: Fast model optimized for solid objects with clear edges
  • General: Balanced model for mixed content
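The routing itself can be as simple as a lookup table. In this sketch the model names and the classify_content helper are illustrative, not our production identifiers:

MODEL_BY_CONTENT = {
    "portrait": "rembg-portrait-hq",    # fine detail around hair and edges
    "product":  "rembg-product-fast",   # solid objects, clean edges
    "general":  "rembg-general",        # balanced default
}

def pick_model(image) -> str:
    # classify_content: hypothetical lightweight classifier run before segmentation
    content_type = classify_content(image)
    return MODEL_BY_CONTENT.get(content_type, MODEL_BY_CONTENT["general"])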

3. Preemptive Scaling

We monitor queue depth per bucket and spin up additional GPU instances before latency degrades:

if bucket_queue_depth['large'] > THRESHOLD:
    scale_up_gpu_instances(count=2)

Lessons for AI Infrastructure Engineers

If you're building a similar system, here are our key takeaways:

  1. Don't assume all AI workloads are the same - Generative models and per-image processing have completely different characteristics

  2. Measure everything - We instrumented latency, queue depth, GPU utilization, memory usage per bucket. You can't optimize what you don't measure.

  3. Start simple, optimize with data - Our first version was naive sequential processing. We only added complexity based on real production bottlenecks.

  4. Bucketing is your friend - When you can't predict inputs, categorize them and handle each category optimally.

  5. Balance latency vs. throughput - The timeout-based triggering was crucial—don't sacrifice latency for marginal throughput gains.

Conclusion

Background removal might seem simpler than generative AI models, but scaling it efficiently is significantly harder due to unpredictable input dimensions. While Stable Diffusion can batch 32 requests of identical 512×512 images and process them in parallel, background removal APIs must handle wildly varying image sizes—from 800×600 thumbnails to 8000×6000 professional photos—all in the same request queue.

Our dynamic batching solution with size-based bucketing, adaptive batch sizes, and timeout-based triggering allowed us to achieve 3× higher throughput, 68% lower latency, and near-perfect reliability compared to naive approaches.

If you're building RemBG into your application, you can now process images at scale without worrying about these infrastructure complexities—we've already solved them for you.


Start Building on RemBG's API

Get in Touch

Have architectural questions about scaling your own image processing pipeline? Or curious about our benchmarking methodology? Reach out via our contact page — we're always happy to talk infrastructure with fellow engineers.


Built by engineers who've processed 100M+ images in production. Now available for your applications.


Ready to Try RemBG?

Start removing backgrounds with our powerful API. Get 60 free credits to test it out.

Get API Access | Try Free Tool