Tutorials & Tips | 10 min read

The Inpainting Problem: What Happens Behind the Foreground

When you shift a 2D image to create a stereo pair, parts of the background are revealed that were never photographed. Filling those gaps is the inpainting problem.

Every 2D-to-3D conversion faces the same fundamental challenge: when you shift pixels horizontally to create the second eye view, you reveal regions of the image that were occluded in the original frame. The foreground object has moved, but behind it is... nothing. A strip of missing pixels that need to be filled with plausible content.

This is the disocclusion inpainting problem, and it is the single hardest part of automated stereo conversion. Depth estimation has gotten remarkably good. Stereo warping is a solved geometric problem. But filling in what was never photographed — convincingly, consistently, across thousands of frames — remains genuinely difficult.
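To make the gap concrete, here is a minimal sketch of the shift on a single image row. The function name `warp_scanline` and the integer, non-negative per-pixel `disparity` list are assumptions for illustration; a real pipeline would derive sub-pixel disparity from a depth map and work on whole frames.

```python
def warp_scanline(colors, disparity):
    """Forward-warp one image row into the second-eye view.

    colors:    pixel values for the row
    disparity: non-negative integer shift per pixel (larger = nearer the camera)
    Returns (warped_row, hole_mask); holes hold None in warped_row.
    """
    width = len(colors)
    warped = [None] * width
    winner = [-1] * width  # disparity of the pixel currently written at each target
    for x in range(width):
        tx = x - disparity[x]  # nearer pixels shift further left
        # When two pixels land on the same target, the nearer one wins.
        if 0 <= tx < width and disparity[x] > winner[tx]:
            warped[tx] = colors[x]
            winner[tx] = disparity[x]
    holes = [value is None for value in warped]
    return warped, holes


# Background (disparity 0) with a 3-pixel foreground object (disparity 2).
colors = [10, 11, 12, 90, 91, 92, 16, 17, 18, 19]
disparity = [0, 0, 0, 2, 2, 2, 0, 0, 0, 0]
warped, holes = warp_scanline(colors, disparity)
print(warped)  # [10, 90, 91, 92, None, None, 16, 17, 18, 19]
```

The foreground has covered two background pixels on its left and uncovered two on its right. The hole is exactly as wide as the disparity difference between foreground and background.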

The simplest approach is to stretch the adjacent background pixels to fill the gap. This works for smooth, textureless regions (sky, walls, gradients) but produces visible smearing when the background has texture or structure. A brick wall behind a person becomes a stretched, distorted mess at the edges.
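As a rough sketch of the stretching idea on one row: take a small band of background next to the hole and resample it over the band plus the hole. The name `stretch_fill`, the `band` parameter, and the assumption that the background always sits to the right of the hole are illustrative choices, not a fixed recipe.

```python
def stretch_fill(row, holes, band=4):
    """Fill each run of holes by stretching the background band to its right.

    Assumes the background lies on the right side of every hole.
    Uses nearest-neighbour resampling to keep the sketch short.
    """
    out = list(row)
    width = len(row)
    x = 0
    while x < width:
        if not holes[x]:
            x += 1
            continue
        start = x
        while x < width and holes[x]:
            x += 1
        end = x  # first valid pixel after the run, or width
        # Collect up to `band` valid source pixels right of the hole.
        src = []
        while end + len(src) < width and len(src) < band and not holes[end + len(src)]:
            src.append(row[end + len(src)])
        if not src:
            continue  # hole touches the right border; nothing to stretch from
        span = (end - start) + len(src)  # hole plus source band
        for i in range(span):
            out[start + i] = src[i * len(src) // span]
    return out


row = [10, 90, 91, 92, None, None, 16, 17, 18, 19]
holes = [value is None for value in row]
print(stretch_fill(row, holes))  # [10, 90, 91, 92, 16, 16, 17, 18, 18, 19]
```

Four background samples now cover six pixels, so some values repeat. On a smooth gradient that is invisible; on brickwork the repeated samples are the smearing described above.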

The next level is structure-aware inpainting. You detect the background texture pattern (if it has one) and continue it into the disoccluded region. This works well for regular patterns — bricks, tiles, fabric — but fails for irregular or unique content. You can't continue a painting or a face by repeating patterns.
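One simple way to sketch pattern continuation on a single row is to look for a shift that maps the neighbouring background onto itself, then copy pixels from one period away. The names `estimate_period` and `continue_pattern`, the thresholds, and the assumption that the background lies to the right of the hole are all hypothetical; production tools typically use 2D patch matching instead.

```python
def estimate_period(samples, max_period):
    """Return (period, error): the shift that best maps samples onto themselves.

    Only shifts covered by at least two full periods are considered.
    """
    best_p, best_err = 0, float("inf")
    for p in range(1, min(max_period, len(samples) // 2) + 1):
        pairs = len(samples) - p
        err = sum(abs(samples[i] - samples[i + p]) for i in range(pairs)) / pairs
        if err < best_err:  # strict '<' keeps the shortest period on ties
            best_p, best_err = p, err
    return best_p, best_err


def continue_pattern(row, holes, context=12, max_period=6, tolerance=2.0):
    """Fill hole runs by repeating a periodic background found to their right.

    Returns (filled_row, still_missing); runs with no clear period stay unfilled.
    """
    out = list(row)
    missing = list(holes)
    width = len(row)
    x = width - 1
    while x >= 0:
        if not holes[x]:
            x -= 1
            continue
        end = x + 1  # first pixel right of the hole run
        while x >= 0 and holes[x]:
            x -= 1
        start = x + 1  # leftmost hole pixel of the run
        samples = []
        for sx in range(end, min(end + context, width)):
            if holes[sx]:
                break
            samples.append(row[sx])
        period, err = estimate_period(samples, max_period)
        if err > tolerance:
            continue  # irregular content: leave it for another method
        for hx in range(end - 1, start - 1, -1):  # fill right to left
            out[hx] = out[hx + period]
            missing[hx] = False
    return out, missing


bricks = [5, 9, None, None, None, 1, 5, 9, 1, 5, 9, 1]
filled, missing = continue_pattern(bricks, [v is None for v in bricks])
print(filled)  # [5, 9, 1, 5, 9, 1, 5, 9, 1, 5, 9, 1]
```

When no shift reproduces the background within `tolerance`, the hole is left untouched, which is the failure case for unique content such as a painting or a face.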

Neural inpainting models (such as those derived from diffusion architectures like Stable Diffusion) offer the most promising approach. Given the surrounding context, they can generate plausible content for the missing region. The results are often impressive: the generated content matches the style, lighting, and texture of the surrounding background. But they introduce a new problem, temporal consistency. Run independently on each frame, the model generates different plausible content every time, causing the inpainted regions to shimmer and shift over time.
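The shimmer can be put into numbers. The sketch below measures the mean frame-to-frame change inside the inpainted mask only. The `flicker` helper is hypothetical, and it assumes a locked-off shot with a static background, so any change inside the mask comes from the inpainting rather than from real motion.

```python
def flicker(frames, mask):
    """Mean absolute change between consecutive frames, inside the mask only.

    frames: list of rows (one per frame); mask: True where pixels were inpainted.
    """
    idx = [i for i, inpainted in enumerate(mask) if inpainted]
    if len(frames) < 2 or not idx:
        return 0.0
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(cur[i] - prev[i]) for i in idx)
    return total / (len(idx) * (len(frames) - 1))


# Three frames of the same row: real pixels are stable, inpainted ones vary.
frames = [
    [10, 10, 50, 52, 10],
    [10, 10, 58, 47, 10],
    [10, 10, 51, 55, 10],
]
mask = [False, False, True, True, False]
print(flicker(frames, mask))                   # 7.0
print(flicker(frames, [not m for m in mask]))  # 0.0
```

Each generated frame looks plausible on its own; the problem only shows up in the difference between frames.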

The practical solution is a hybrid approach. Use simple stretching for smooth regions (fast and stable). Use structure-aware continuation for textured but regular backgrounds. Use AI inpainting only for complex regions, then apply aggressive temporal smoothing to the inpainted areas specifically. This layered strategy produces the best balance of quality and stability, at the cost of pipeline complexity.
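A minimal sketch of the two pieces of that hybrid: a router that picks a fill method from the background samples next to a hole, and a temporal filter that touches only inpainted pixels. The names `choose_method` and `smooth_inpainted`, the thresholds, and the use of an exponential moving average are assumptions for illustration. The smoothing also assumes the background behind the hole is static; moving shots would need motion compensation first.

```python
def choose_method(context, smooth_thresh=2.0, period_tol=2.0):
    """Pick a fill strategy from the background samples beside a hole."""
    if len(context) < 2:
        return "stretch"
    # Smooth region: neighbouring samples barely differ.
    steps = [abs(b - a) for a, b in zip(context, context[1:])]
    if sum(steps) / len(steps) <= smooth_thresh:
        return "stretch"
    # Regular texture: some shift of two or more pixels maps the samples onto themselves.
    for p in range(2, len(context) // 2 + 1):
        pairs = len(context) - p
        err = sum(abs(context[i] - context[i + p]) for i in range(pairs)) / pairs
        if err <= period_tol:
            return "pattern"
    return "neural"


def smooth_inpainted(prev_out, cur, mask, alpha=0.25):
    """Blend the current frame with the previous output, inside the mask only.

    alpha is the weight of the new frame; smaller values smooth more aggressively.
    """
    return [
        alpha * c + (1 - alpha) * p if inpainted else c
        for p, c, inpainted in zip(prev_out, cur, mask)
    ]


print(choose_method([100, 101, 100, 102, 101, 100, 101, 100]))  # stretch
print(choose_method([5, 9, 1, 5, 9, 1, 5, 9]))                  # pattern
print(choose_method([3, 40, 7, 88, 21, 60, 2, 75]))             # neural

mask = [False, False, True, True, False]
print(smooth_inpainted([10, 10, 50, 52, 10], [10, 10, 58, 47, 10], mask))
# [10, 10, 52.0, 50.75, 10]
```

Keeping the filter inside the mask matters: real pixels pass through unchanged, so the smoothing cannot blur or ghost genuine detail.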

For most prosumer content, the disocclusion artifacts are small enough that simple stretching with edge blending is sufficient. The inpainting problem becomes acute only at higher stereo baselines (deeper 3D effect) and with footage that has thin foreground objects against detailed backgrounds. If your 3D looks fine at a mild baseline, you don't need to solve the hard inpainting problem — and sometimes the right answer is to use less depth rather than better inpainting.
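The link between baseline and gap size can be estimated with the standard parallel pinhole stereo relation, disparity = focal length × baseline / depth. The article does not state a camera model, so treat this as an assumption; `hole_width_px` and the example numbers are hypothetical.

```python
def hole_width_px(focal_px, baseline, z_fg, z_bg):
    """Approximate disocclusion width in pixels at a foreground edge.

    focal_px: focal length in pixels
    baseline: virtual camera separation (same unit as the depths)
    z_fg, z_bg: distances to the foreground object and the background
    """
    # Each surface shifts by focal_px * baseline / depth; the gap is the difference.
    return focal_px * baseline * (1.0 / z_fg - 1.0 / z_bg)


# Subject at 2 m in front of a wall at 10 m, 1000 px focal length.
mild = hole_width_px(1000, 0.03, 2.0, 10.0)  # about 12 px
deep = hole_width_px(1000, 0.06, 2.0, 10.0)  # about 24 px
print(round(mild, 2), round(deep, 2))
```

Under this model the gap grows linearly with the baseline, which is why halving the depth effect is often cheaper than upgrading the inpainting.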