Temporal Consistency: The Invisible Quality That Separates Good 3D from Great
A depth map that looks perfect on a single frame can produce unwatchable video if it flickers between frames. Temporal consistency is how you fix it.
Look at a single frame from a depth estimation model and it might look flawless — clean edges, accurate relative distances, plausible surface normals. Play the video and the illusion collapses. Objects shimmer. Edges swim. Flat surfaces breathe in and out as the model's per-frame estimates disagree with each other by small but visible amounts. This is the temporal consistency problem, and it is the single largest quality gap between academic depth estimation and production-grade stereo conversion.
The root cause is straightforward. Most depth models process each frame independently. They have no memory of what they predicted for the previous frame. Even with identical scene content, minor variations in lighting, compression artifacts, or sub-pixel motion cause the model to produce slightly different depth values frame to frame. These differences are small in absolute terms but catastrophic for stereo viewing, where the human visual system is extraordinarily sensitive to depth instability.
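One rough way to put a number on this instability is to measure how much the depth changes between consecutive frames of a shot whose true depth is static. The sketch below is an illustration under assumptions, not a standard metric: `flicker_score` is a hypothetical helper, the depth maps are assumed to be stacked in a `(T, H, W)` NumPy array, and the "model output" is simulated by adding small Gaussian noise to a fixed scene.

```python
import numpy as np

def flicker_score(depth_frames: np.ndarray) -> float:
    """Mean absolute depth change between consecutive frames.

    depth_frames has shape (T, H, W). The score is only meaningful as a
    flicker measure on shots where the true depth is (nearly) static.
    """
    diffs = np.abs(np.diff(depth_frames, axis=0))  # shape (T-1, H, W)
    return float(diffs.mean())

# Synthetic static scene: a horizontal depth ramp repeated for 30 frames.
rng = np.random.default_rng(0)
true_depth = np.tile(np.linspace(0.0, 1.0, 64), (48, 1))   # (48, 64)
stable = np.repeat(true_depth[None], 30, axis=0)           # (30, 48, 64)

# Simulated per-frame predictions: same scene, small independent errors.
noisy = stable + rng.normal(0.0, 0.01, size=stable.shape)

print(flicker_score(stable))  # 0.0
print(flicker_score(noisy))   # small but nonzero
```

Each noisy frame is within about one percent of the truth, so any single still would pass inspection. The nonzero score is the frame-to-frame disagreement that shows up as shimmer in playback.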
The solution has two layers. The first is post-processing: apply a temporal smoothing filter that enforces consistency between adjacent frames. Simple exponential moving averages work for static scenes but fail during camera motion or scene cuts. More sophisticated approaches use optical flow to warp the previous frame's depth into the current frame's coordinate space before blending. This preserves consistency during motion while still allowing the depth to change when it should.
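Both post-processing variants can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a production filter: the function names and the `alpha` weight are hypothetical, the flow field is assumed to come from some external optical flow estimator and to point from each current-frame pixel back to its position in the previous frame, and sampling is nearest-neighbour with no occlusion or scene-cut handling.

```python
import numpy as np

def ema_smooth(depth_frames: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Exponential moving average over time; alpha weights the history."""
    out = np.empty_like(depth_frames, dtype=np.float64)
    out[0] = depth_frames[0]
    for t in range(1, len(depth_frames)):
        out[t] = alpha * out[t - 1] + (1.0 - alpha) * depth_frames[t]
    return out

def warp_backward(prev_depth: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp prev_depth into the current frame's coordinates.

    flow[y, x] = (dx, dy) is assumed to be backward flow: the content at
    current pixel (x, y) was at (x + dx, y + dy) in the previous frame.
    """
    h, w = prev_depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_depth[src_y, src_x]

def flow_blend(prev_smoothed, cur_depth, flow, alpha=0.8):
    """Blend the current estimate with the motion-compensated history."""
    return alpha * warp_backward(prev_smoothed, flow) + (1.0 - alpha) * cur_depth

# Toy pan: a depth edge moves 2 px to the right between two frames.
prev = np.where(np.arange(32) < 10, 1.0, 5.0) * np.ones((8, 1))
cur = np.where(np.arange(32) < 12, 1.0, 5.0) * np.ones((8, 1))
flow = np.zeros((8, 32, 2))
flow[..., 0] = -2.0  # current content came from x - 2 in the previous frame

plain = ema_smooth(np.stack([prev, cur]))[1]
compensated = flow_blend(prev, cur, flow)
print(np.abs(plain - cur).max())        # large: the edge is smeared
print(np.abs(compensated - cur).max())  # near zero: history was aligned first
```

The plain average mixes depth from the wrong side of the moving edge, which is why it breaks down under camera motion. Aligning the history first removes that error in this toy case. Real footage would also need a validity mask for occluded pixels and a history reset at cuts.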
The second layer is model-level: train or fine-tune depth models with temporal loss terms that penalize frame-to-frame inconsistency. Video-oriented models such as Video Depth Anything, which builds on Depth Anything V2, are starting to incorporate this, but the field is still young. Most available models, Depth Anything V2 itself included, are trained purely on single-image depth, and temporal consistency is left as an exercise for the pipeline builder.
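In generic form, a temporal loss term adds a penalty on the disagreement between the current prediction and the previous prediction after it has been aligned to the current frame. The sketch below shows only the arithmetic in NumPy; it is not the loss used by any particular model, and the function names, the `valid_mask` input, and the weight `lam` are assumptions made for illustration. Actual training would express the same computation in an autodiff framework.

```python
import numpy as np

def temporal_loss(pred_prev_warped, pred_cur, valid_mask):
    """Mean absolute disagreement between consecutive predictions.

    pred_prev_warped: previous prediction, already aligned to the current frame
    valid_mask: 1 where the alignment is trusted (no occlusion), else 0
    """
    diff = np.abs(pred_cur - pred_prev_warped) * valid_mask
    return diff.sum() / max(valid_mask.sum(), 1)

def total_loss(pred_cur, target_cur, pred_prev_warped, valid_mask, lam=0.5):
    """Per-frame accuracy term plus a weighted temporal term."""
    accuracy = np.abs(pred_cur - target_cur).mean()
    return accuracy + lam * temporal_loss(pred_prev_warped, pred_cur, valid_mask)

# Two predictions that are each only 0.1 off, but in opposite directions.
target = np.full((4, 4), 2.0)
pred_prev = target + 0.1
pred_cur = target - 0.1
mask = np.ones((4, 4))

print(temporal_loss(pred_prev, pred_cur, mask))       # approximately 0.2
print(total_loss(pred_cur, target, pred_prev, mask))  # approximately 0.2
```

The example shows why a single-image loss is not enough: each prediction has the same small per-frame error, yet the two disagree with each other by twice that amount, and only the temporal term sees it.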
For practical stereo conversion, the pipeline approach is more reliable than hoping the model solves it. You run single-frame depth estimation (because single-image models are currently the more mature and accurate option), then apply temporal stabilization as a separate stage. This separation of concerns lets you upgrade the depth model independently of the stabilization logic, and it lets you tune stabilization strength per project: a documentary with slow, steady camera moves tolerates different settings than a fast-cut action film, where heavy smoothing lags behind the motion.
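The two-stage structure might look like the following sketch, where the depth model is passed in as a plain callable so it can be swapped without touching the stabilizer. Everything here is hypothetical scaffolding: `stabilized_depth`, `strength`, and `cut_threshold` are illustrative names, the stabilizer is a simple moving average with a crude scene-cut reset, and `toy_model` stands in for a real depth network.

```python
from typing import Callable, Iterable, Iterator, Optional
import numpy as np

DepthModel = Callable[[np.ndarray], np.ndarray]  # frame -> depth map

def stabilized_depth(
    frames: Iterable[np.ndarray],
    estimate_depth: DepthModel,
    strength: float = 0.8,
    cut_threshold: float = 0.5,
) -> Iterator[np.ndarray]:
    """Stage 1: per-frame depth. Stage 2: temporal stabilization."""
    history: Optional[np.ndarray] = None
    for frame in frames:
        depth = estimate_depth(frame)  # stage 1, swappable
        # Crude scene-cut check: a large jump in mean depth resets history.
        if history is not None and np.abs(depth - history).mean() > cut_threshold:
            history = None
        if history is None:
            history = depth.astype(np.float64)
        else:
            history = strength * history + (1.0 - strength) * depth
        yield history  # stage 2 output

def toy_model(frame: np.ndarray) -> np.ndarray:
    return frame.mean(axis=-1)  # stand-in for a real depth network

rng = np.random.default_rng(1)
shot_a = [np.full((4, 4, 3), 1.0) + rng.normal(0, 0.05, (4, 4, 3)) for _ in range(5)]
shot_b = [np.full((4, 4, 3), 3.0) for _ in range(3)]

out = list(stabilized_depth(shot_a + shot_b, toy_model, strength=0.8))
print(out[4].mean())  # close to 1.0
print(out[5].mean())  # 3.0: history was reset at the cut
```

Tuning per project then comes down to changing `strength` and `cut_threshold` in one place, and upgrading the depth model means passing a different callable.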
The takeaway for anyone building or using a conversion pipeline: if your depth maps look good as stills but bad as video, the problem is almost certainly temporal consistency. It is fixable, but it requires treating it as a first-class pipeline stage, not an afterthought.