Client-side AI model inference for real-time video processing — style transfer

Jakub Chmura
Software Mansion
5 min read · Oct 4, 2023


Ever wondered what you would look like in a Van Gogh painting? Probably not, but recent research in image-related tech lets you find out. Our model goes a step further, letting you transform your webcam feed into a Van Gogh painting, or any other style you’d like, in real time. We proudly introduce the result of our latest AI adventures: a real-time, offline style transfer solution.
Style transfer is the problem of taking two images, called the content and the style, and manipulating the content image so that it reflects the stylistic features of the style image. Our solution runs entirely offline in your browser, which required intense performance optimization of the neural network.

The reason behind it

Network latency and GPU memory swap bottlenecks inherent to a cloud-inference architecture limited us to around 20 fps. This motivated us to explore alternative ways to improve processing speed.
This work follows our previous experiments with real-time AI for object segmentation, which you can check out here. This time, the goal was to leverage state-of-the-art web technologies and test inference on edge devices.

Benefits

Our demo relies entirely on front-end technologies. This is a relatively new approach, and it brings several important advantages. First, any number of users can access the service simultaneously with no performance loss on the server side: since no heavy computation happens on the server, the service is cheap and easy to scale. Furthermore, it keeps operating seamlessly even in the unlikely event of a server outage, because everything runs locally on the client. Last but not least, on sufficiently powerful hardware this solution delivers very satisfactory performance.

The story of style transfer

We started our research with classic neural style transfer, which is essentially an iterative approach: with each pass of the algorithm, the output image is optimized to become more similar to both the style and the content images. This turned out to be far too slow for our use case, so we kept digging. We quickly found the paper “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, which introduces a simple feed-forward network that stylizes an image in a single pass. That network, combined with instance normalization layers, became our baseline. You can read more about it here.

Baseline problem

You might wonder: if the title says “real-time”, where does the problem lie? The claim does hold on powerful GPUs, but the reality is that most users do not have such high-performing hardware. To achieve our objective, we had to make the model efficient enough to run swiftly on a standard CPU, which is orders of magnitude slower than a GPU at the operations neural networks perform.

Optimization

After a series of exhaustive experiments with the aforementioned network, we discovered that the number of channels throughout the network can be reduced significantly, which made the model much more efficient. We also removed one upsampling and one downsampling layer while maintaining the output image quality. Our next step was knowledge distillation, which allowed us to ditch the skip connections entirely and replace them with a simple convolution combined with instance normalization. The skip connections were not necessary, as training remained stable without them, and removing them reduced memory bandwidth and simplified the computational graph. Finally, we used NetAdapt, an iterative model compression algorithm, to shrink the network even further.
We decided to train multiple models with different style images. Some models even tolerated small input sizes without a visible quality impact, though this wasn’t always possible. This way, we developed a solution with an impressive inference time of 20–30 ms on CPU, depending on the model. In fact, one of our smallest models turned out to be almost 400 times smaller than the original one.

Deployment

After successfully training multiple such models, we were left with deploying them in the browser. We were aware of the technologies used to reach near-native inference performance in browsers, such as WebAssembly (WASM) with SIMD and multithreading. We explored multiple libraries, but our final choice was to export the model to the ONNX format and run it with plain JS and onnxruntime-web.
It quickly turned out that something was wrong: the inference was significantly slower than native. After multiple daunting debugging sessions, we found that the multithreading feature relies on SharedArrayBuffer, which is only available when the page is cross-origin isolated, that is, served with specific Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy headers. Enabling it had a notable impact on performance, but we were still far from native speed. Finally, to reach the performance we needed for live video processing, we had to tweak the configuration options of onnxruntime-web.
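
To give you an idea, here is a minimal sketch of that kind of setup using the public onnxruntime-web API. The model filename and thread count are placeholders, not our production values:

```js
import * as ort from 'onnxruntime-web';

// SharedArrayBuffer (and thus WASM multithreading) is only available when
// the page is cross-origin isolated, i.e. served with:
//   Cross-Origin-Opener-Policy: same-origin
//   Cross-Origin-Embedder-Policy: require-corp

// These flags must be set before the first session is created.
ort.env.wasm.numThreads = 4; // falls back to 1 without cross-origin isolation
ort.env.wasm.simd = true;    // use the SIMD-enabled WASM binary

const session = await ort.InferenceSession.create('style_transfer.onnx', {
  executionProviders: ['wasm'],
});
```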

Implementation

The implementation itself is simple: take the input from the user’s camera via a video element, pass each frame through the neural network, and display the result on a canvas. Syncing the data formats between ONNX and the canvas/video required quite a lot of confusing preprocessing and post-processing work, but its impact on performance is negligible.
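
A rough sketch of that loop is below, reusing the ort import and session from the previous snippet. The tensor names, the 224×224 NCHW float32 layout, and the [0, 1] value range are assumptions about the export, not the exact code we shipped:

```js
const SIZE = 224; // assumed model resolution

async function processFrame(video, session, ctx) {
  // Draw the current video frame onto the canvas and read its pixels back.
  ctx.drawImage(video, 0, 0, SIZE, SIZE);
  const { data } = ctx.getImageData(0, 0, SIZE, SIZE); // interleaved RGBA uint8

  // Repack interleaved RGBA into planar RGB floats in [0, 1] (NCHW).
  const input = new Float32Array(3 * SIZE * SIZE);
  for (let i = 0; i < SIZE * SIZE; i++) {
    input[i] = data[4 * i] / 255;
    input[SIZE * SIZE + i] = data[4 * i + 1] / 255;
    input[2 * SIZE * SIZE + i] = data[4 * i + 2] / 255;
  }

  const feeds = { input: new ort.Tensor('float32', input, [1, 3, SIZE, SIZE]) };
  const { output } = await session.run(feeds); // 'output' is an assumed name

  // Repack the planar float output into RGBA and draw it.
  const pixels = new Uint8ClampedArray(4 * SIZE * SIZE);
  for (let i = 0; i < SIZE * SIZE; i++) {
    pixels[4 * i] = output.data[i] * 255;
    pixels[4 * i + 1] = output.data[SIZE * SIZE + i] * 255;
    pixels[4 * i + 2] = output.data[2 * SIZE * SIZE + i] * 255;
    pixels[4 * i + 3] = 255; // fully opaque
  }
  ctx.putImageData(new ImageData(pixels, SIZE, SIZE), 0, 0);

  requestAnimationFrame(() => processFrame(video, session, ctx));
}
```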

Sliders

As mentioned, the neural network we use includes instance normalization layers, which normalize their input and rescale it to a learned mean and deviation. It turns out that modifying the weights of those layers changes the style of the output image. The main concern here was finding a way to modify the weights while ensuring the ONNX export functions properly. We managed to express the modification as basic tensor operations within a forward pass of the network, which is an effective way of adjusting the weights without disrupting the network architecture or the export itself. Additionally, it removes the need to re-render the video each time the slider values change, resulting in a seamless user experience.
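
Purely as an illustration of the idea: if the exported graph takes an extra input that the in-graph tensor operations consume (the 'style_strength' name here is hypothetical), the slider only needs to swap one scalar feed:

```js
const slider = document.querySelector('#style-strength');

// Because the blending happens inside the graph, moving the slider just
// changes this one feed; the next frame already uses the new style mix.
function styleFeeds(imageTensor) {
  const strength = new ort.Tensor(
    'float32',
    new Float32Array([Number(slider.value)]), // e.g. in [0, 1]
    [1],
  );
  return { input: imageTensor, style_strength: strength };
}
```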

Final thoughts

The on-device machine learning deployment approach offers a compelling alternative to cloud-based inference. However, it is not without drawbacks. Since the technology is relatively new, the runtimes are still under heavy development, and the process can be time-consuming at times. Nevertheless, if you manage to overcome these challenges and use a lightweight model, it is definitely worth exploring, as it opens up endless possibilities. This is especially relevant with the emergence of technologies like WebGPU, making this method of ML deployment a valuable investment to delve into.
