Pushing the Boundaries with Real-Time Video Inpainting

Why Video Inpainting?

Wojtek Jasinski
Software Mansion

--

Our most recent project delved into the realm of object detection and instance segmentation for real-time video applications. We developed a model capable of pinpointing and then blurring objects in videos with exceptionally low latency. For a memorable demonstration, we blurred a bottle in a live video. You can read more about our object detection project in our article.

While our blurring technique served its purpose, we weren’t fully satisfied with the visual outcome. We started pondering over an exciting idea: what if we could make the bottle vanish altogether?

The quest to achieve this ambitious goal led us to numerous projects doing excellent work in Video Inpainting. It is a machine-learning task that is defined as filling in masked areas in a video with visually plausible and temporally consistent content. It is commonly used to erase an object from a video as displayed below.

Video Inpainting use case — removing a moving object from a video. This example comes from FGT.

The focus of our work

The focus of our work was to excel both in real-time scenarios and on static scenes, such as webcam videos or live streams. While the model might not score high on scenes with dynamically moving backgrounds, that was an acceptable trade-off for our intended use case.

We wanted our system to work in real time, which means it would need to achieve a throughput of at least 20 fps and should not introduce noticeable delays, allowing for easy integration into a streaming pipeline.

Diffusion Models: A Double-Edged Sword

At the tail end of 2022, a technique called diffusion modeling made a big splash in the machine learning domain. Delivering impressive results in image generation and inpainting tasks, diffusion models appeared promising. However, we quickly realized that their significant computational requirements made them impractical for real-time video inpainting.

In most instances, the striking results from diffusion models are attained by running the model for a few seconds on a single image. This process requires a high-grade GPU, often found in data centers. Although significant efforts have been made to streamline these models, they simply don’t meet our needs for real-time application.

Deconstructing Video Inpainting Methods

Numerous video inpainting models utilize vision transformers, optical flow estimation, or a blend of both. Recent noteworthy work in this domain includes ProPainter, which combines both approaches.

We opted for transformer-based methods for a few key reasons. While flow-based methods offer great performance on dynamic videos, they are not as useful for static scenes and introduce unnecessary overhead. Transformer methods, in contrast, generate the missing parts of frames from spatial and temporal context using attention mechanisms.

Unfortunately, the majority of Video Inpainting methods weren’t built with real-time applications in mind. Many are high-complexity models that run at a snail’s pace and were designed to inpaint videos offline: they depend on the context of the entire video, which isn’t suitable for online inference, where only past frames are available as context.

Comparison of offline (left) and online (right) video inpainting scenarios.

Our Model Architecture

The model architecture is based on the lightweight spatial-temporal decoupling mechanism introduced by Rui Liu et al. in DSTT. We improved upon it by utilizing Swin transformers, which reduced total computation without hurting the model’s performance. We also added Conditional Positional Encodings, allowing for dynamic input resolutions.

Overview of our model architecture. Details: 1 — high-level diagram; 2 — hierarchical encoder used in DSTT; 3 — two consecutive Swin transformer blocks, the second one with a window shift; 4 — our approach to windowed attention: no partitioning along the temporal dimension and division into windows along the spatial dimensions.
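To make the windowed attention idea concrete, below is a minimal PyTorch sketch of the variant described in point 4 of the figure: tokens are split into windows along the spatial dimensions only, and attention within each window spans every frame of the clip. The module layout, window size, and tensor shapes are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalWindowAttention(nn.Module):
    """Attention over spatial windows that span the full temporal extent of the clip."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) tokens; H and W are assumed divisible by the window size
        B, T, H, W, C = x.shape
        w = self.window
        # Partition the spatial plane into w x w windows, keeping all T frames together
        x = x.view(B, T, H // w, w, W // w, w, C)
        x = x.permute(0, 2, 4, 1, 3, 5, 6)            # (B, H/w, W/w, T, w, w, C)
        x = x.reshape(-1, T * w * w, C)               # each window holds T*w*w tokens
        out, _ = self.attn(x, x, x)                   # joint attention over time and the window
        # Undo the window partition
        out = out.reshape(B, H // w, W // w, T, w, w, C)
        out = out.permute(0, 3, 1, 4, 2, 5, 6).reshape(B, T, H, W, C)
        return out

# Example: tokens = torch.randn(1, 5, 28, 28, 128); SpatioTemporalWindowAttention(128)(tokens)
```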

As static scenes are less complex to inpaint than dynamic ones, we scaled our model down compared to the state-of-the-art models.

Datasets

To train our model, we used the YouTube-VOS dataset (sample video on the left), a standard among video inpainting research projects, and the IPN Hand dataset. The latter was used to tailor our model to specific applications like webcam videos for streamers. We aimed to improve the model’s dexterity in handling hands and faces, which are notoriously challenging for generative models to reproduce convincingly.

Sample videos from the YouTube-VOS (left) and IPN Hand (right) datasets.

Training

The training process of such models is self-supervised. We overlay randomly selected frames from the original video with our generated masks and prompt the model to restore the image sequence to its original form.

Simplified diagram illustrating the self-supervised training procedure of the inpainting model.
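To make the procedure concrete, here is a rough sketch of how such a training pair can be assembled. The rectangular mask generator is only a simplified stand-in for the masks we actually generate.

```python
import torch

def random_rect_mask(h: int, w: int) -> torch.Tensor:
    """Placeholder mask generator: a single random rectangle per frame."""
    mask = torch.zeros(1, h, w)
    mh = torch.randint(h // 4, h // 2, (1,)).item()
    mw = torch.randint(w // 4, w // 2, (1,)).item()
    top = torch.randint(0, h - mh, (1,)).item()
    left = torch.randint(0, w - mw, (1,)).item()
    mask[:, top:top + mh, left:left + mw] = 1.0
    return mask

def make_training_pair(frames: torch.Tensor):
    # frames: (T, C, H, W) clip of randomly selected frames from the original video
    T, _, H, W = frames.shape
    masks = torch.stack([random_rect_mask(H, W) for _ in range(T)])   # (T, 1, H, W)
    masked = frames * (1.0 - masks)       # model input: frames with the masked regions removed
    return masked, masks, frames          # the unmasked clip is the reconstruction target
```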

We used a mix of L1 loss, adversarial loss, and perceptual loss for training. The L1 loss, albeit simple, is a highly stable and effective component. The adversarial loss helps maintain temporal consistency and reconstruct intricate textures.
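For illustration, a weighted combination of these terms could look like the sketch below. The discriminator, the perceptual feature extractor, and the weights are placeholders rather than our exact training configuration.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, masks, discriminator, feat_extractor,
               w_l1=1.0, w_adv=0.01, w_perc=0.1):
    # Reconstruction term: pixel-wise L1, with extra weight inside the masked regions
    l1 = F.l1_loss(pred, target) + F.l1_loss(pred * masks, target * masks)
    # Adversarial term: the generator tries to make the discriminator score inpainted frames as real
    adv = -discriminator(pred).mean()
    # Perceptual term: L1 distance in a pretrained feature space (e.g. VGG activations)
    perc = F.l1_loss(feat_extractor(pred), feat_extractor(target))
    return w_l1 * l1 + w_adv * adv + w_perc * perc
```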

Inference

To generate masks, we used the RTMDet model we previously worked on. It detects objects and outputs segmentation masks at high speed. In our case, we used a model trained to detect bottles, but you could fine-tune it to work on any object type you wish!
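As a rough illustration, and assuming the MMDetection 3.x inference API that RTMDet ships with, pulling a combined segmentation mask out of the detector could look like this. The config and checkpoint names, the score threshold, and the output handling are placeholders, not our exact pipeline.

```python
import numpy as np
from mmdet.apis import init_detector, inference_detector

# Placeholder config / checkpoint names: substitute your own RTMDet-Ins files
model = init_detector(
    "rtmdet-ins_s_8xb32-300e_coco.py",
    "rtmdet-ins_finetuned_bottles.pth",
    device="cuda:0",
)

def object_mask(frame: np.ndarray, score_thr: float = 0.5) -> np.ndarray:
    """Combine all detections above the score threshold into one binary mask."""
    pred = inference_detector(model, frame).pred_instances
    keep = pred.scores > score_thr
    masks = pred.masks[keep].cpu().numpy()      # (N, H, W) boolean instance masks
    if masks.size == 0:
        return np.zeros(frame.shape[:2], dtype=bool)
    return masks.any(axis=0)
```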

During inference, we feed the model some of the past frames for context. We can reconstruct each frame individually or in small batches. Initially, we shied away from batching, since our goal was a real-time solution, but we found that at the cost of a minimal delay we gain a slight increase in temporal consistency while maintaining a throughput of over 25 fps. Finally, to speed up inference, torch.compile came in handy, enabling us to process inpainting about 20% faster.
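A simplified version of such an online loop is sketched below. The rolling context buffer, the small batch size, and the model’s call signature are illustrative assumptions rather than our production code.

```python
import collections
import torch

def run_online_inpainting(model, frame_source, context_len=10, batch=2):
    """frame_source yields (frame, mask) tensors shaped (C, H, W) and (1, H, W)."""
    model = torch.compile(model).eval()            # one-off compilation for faster forward passes
    context = collections.deque(maxlen=context_len)
    pending_frames, pending_masks = [], []
    with torch.no_grad():
        for frame, mask in frame_source:
            pending_frames.append(frame)
            pending_masks.append(mask)
            if len(pending_frames) < batch:        # wait until a small batch has accumulated
                continue
            # Build one clip: already-seen context frames followed by the new batch
            clip = torch.stack(list(context) + pending_frames).unsqueeze(0)
            masks = torch.stack(
                [torch.zeros_like(mask) for _ in context] + pending_masks
            ).unsqueeze(0)
            # Assumed call signature: model(frames, masks) -> inpainted frames, same layout as clip
            inpainted = model(clip, masks)
            yield inpainted[:, -batch:]            # emit only the newly inpainted frames
            context.extend(pending_frames)
            pending_frames, pending_masks = [], []
```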

A big drawback of the inpainting model is that it operates at quite a low resolution (480x288). Increasing it would yield visually better results, but we could not afford to do so because of our real-time constraints. To increase the final video quality, we implemented a mechanism for caching frames on the client side. The backend pipeline sends inpainted frames and segmentation masks to the frontend, which uses the original, full-resolution images everywhere except in the areas outlined by the segmentation masks. This hugely increases perceived visual quality and decreases the amount of data transmitted between server and client.
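Conceptually, the client-side compositing step boils down to the sketch below (our frontend does this in the browser, but the idea is the same): upscale the low-resolution inpainted frame and mask, then replace only the masked pixels.

```python
import torch
import torch.nn.functional as F

def composite(full_res: torch.Tensor, inpainted_lr: torch.Tensor, mask_lr: torch.Tensor) -> torch.Tensor:
    # full_res: (C, H, W) original frame kept on the client;
    # inpainted_lr: (C, h, w) low-res inpainted frame from the backend;
    # mask_lr: (1, h, w) float segmentation mask in [0, 1] from the backend.
    _, H, W = full_res.shape
    inpainted = F.interpolate(inpainted_lr.unsqueeze(0), size=(H, W),
                              mode="bilinear", align_corners=False)[0]
    mask = F.interpolate(mask_lr.unsqueeze(0), size=(H, W), mode="nearest")[0]
    # Keep the full-resolution pixels everywhere except inside the mask
    return full_res * (1.0 - mask) + inpainted * mask
```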

Results

The results have been spectacular. But don’t just take our word for it — check out our interactive demo and see the magic for yourself!

>>> https://swmansion.com/ai-ml/inpainting-demo
