6DOF
In virtual reality photos and video, there is a spectrum of formats with different degrees of immersion and correctness in 3D rendering (which is important for a comfortable viewing experience). Some are 2D, and others are stereoscopic pseudo-3D (a different image for the left and right eye, wrapped around a sphere or half-sphere). VR videos and photos typically respond to rotation of a user’s head (3 degrees of freedom, a.k.a. 3DOF), but do not respond to moving side to side, forward and backward, or up and down. In contrast, VR games and a tiny fraction of VR videos support 6 degrees of freedom (6DOF), which allows a user to move in all directions and see correct 3D images regardless of how they move. 3DOF VR video formats provide a suboptimal experience if viewers move their head at all, or look anywhere other than directly at the horizon, and these issues can cause motion sickness, double vision, and eye strain. 6DOF VR video mitigates these issues, but creating 6DOF VR video is much harder. Even big tech companies like Meta and Google have thus far not delivered a practical solution. It is hard because it requires a photorealistic 3D model of every frame of video to be estimated from the available sensors, compressed, and efficiently rendered in real time.
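To make the distinction concrete, here is a minimal numpy sketch (our own illustration, not Lifecast player code) of how a renderer’s view matrix differs between the two modes: a 3DOF player applies only the tracked head rotation, while a 6DOF player also applies the tracked head translation, so the rendered image responds to leaning and walking.

```python
import numpy as np

def view_matrix(head_rotation: np.ndarray, head_position: np.ndarray, six_dof: bool) -> np.ndarray:
    """Build a 4x4 world-to-view matrix from a tracked head pose.

    head_rotation: 3x3 rotation matrix from orientation tracking.
    head_position: 3-vector from positional tracking (ignored in 3DOF mode).
    """
    view = np.eye(4)
    view[:3, :3] = head_rotation.T  # inverse of an orthonormal rotation is its transpose
    if six_dof:
        view[:3, 3] = -head_rotation.T @ head_position  # also undo the head translation
    # In 3DOF mode the translation column stays zero, so moving the head side to
    # side, forward/back, or up/down has no effect on what is rendered.
    return view

# Example: the viewer leans 10 cm to the right without rotating their head.
R, p = np.eye(3), np.array([0.1, 0.0, 0.0])
print(view_matrix(R, p, six_dof=False))  # identity: no parallax (the 3DOF case)
print(view_matrix(R, p, six_dof=True))   # translation appears: parallax is possible
```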
Virtual production is a powerful new tool for 2D filmmaking. The idea is to render a 3D environment on a huge LED wall behind the actors, with the environment usually modeled in Unreal Engine or Unity. If the film camera moves, the image on the LED wall needs to respond accordingly, which is exactly the same problem as rendering 6DOF for VR. It is time-consuming and expensive to make environments that look photorealistic in Unreal and Unity. Lifecast started out making software for virtual reality, but we learned from talking with film-industry professionals that the same 6DOF video technology is a cost-effective and efficient way of creating photorealistic 3D environments for virtual production. Photogrammetry is similar, but applicable only to static scenes, whereas video allows the virtual environments to feel more alive. In this article we mostly explain our progress for VR; we are bringing the same volumetric video technology to virtual production as well.
Lifecast uses the terms “volumetric” and “6DOF” interchangeably. This terminology will offend some, while for others it conveys the idea clearly. In our view, the most precise use of the term volumetric is for 3D scene representations which assign some value to each point in 3D space, such as voxels or neural radiance fields (NeRF). However, it has also become common to refer to RGBD (color + depth map) images and video as “volumetric”, and for volumetric video software to operate on one or more streams of RGBD video. The important point is that RGBD allows for 6DOF rendering. What we are unveiling today is like RGBD on steroids; for the scholars, it’s an “inflated equiangular layered depth image”.
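As a rough illustration of why adding a depth channel is what unlocks 6DOF rendering, the sketch below (a simplified pinhole-camera model with hypothetical names, not Lifecast’s actual equiangular format) unprojects an RGBD image into a colored point cloud; once every pixel has a 3D position, the scene can be re-rendered from any nearby viewpoint.

```python
import numpy as np

def rgbd_to_point_cloud(rgb: np.ndarray, depth: np.ndarray, fov_deg: float = 90.0):
    """Unproject an RGBD image (H x W x 3 color, H x W depth in meters) into a
    3D point cloud, assuming a simple pinhole camera for illustration."""
    h, w = depth.shape
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    x = (xs - w / 2.0) / f * depth
    y = (ys - h / 2.0) / f * depth
    z = depth
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    return points, colors

# Re-rendering from a slightly translated viewpoint is then just a projection of
# (points - new_camera_position); this is the basic mechanism that makes RGBD
# "6DOF-renderable", and layered depth images extend it to handle occlusions.
```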
What does it mean to be immersive? Some VR videos have a 180 degree field of view (half a sphere), while others cover a full sphere. We believe that half a sphere is enough to be immersive, though others insist on a full 360 degree sphere. We are focusing on 180 degree content right now because it offers a favorable set of tradeoffs across the entire system, from cameras capturing the video, to processing and compression, to real-time playback.
When done well, 6DOF can be more immersive than 3DOF because the user can move, and correct 3D rendering is more immersive than stereoscopic pseudo-3D. However, 6DOF can also introduce its own visual artifacts that reduce immersion. Overcoming these artifacts by creating a photorealistic 3D model of every frame of video, and being able to render that in real time on limited hardware, is an open problem in computer vision and graphics.
Video is harder than photos. Existing techniques such as photogrammetry and NeRF can produce 6DOF 3D representations of a static scene from a large number of images taken from different points of view. However, basic formulations do not work for parts of a scene that move; extending them to video is non-trivial and involves tradeoffs with practicality. For example, prior work from Meta and Google on immersive volumetric video uses custom camera arrays with 24 or 46 cameras, in order to have many images of the scene at the same moment in time. Unfortunately, working with this many cameras isn’t very practical.
Academic publications provide an inspiring window into the future, but so far volumetric video for VR hasn’t become mainstream because it isn’t practical to create, edit, or watch. Lifecast believes the elements of a practical solution include:
It is possible to capture anything, anywhere, not just in a controlled environment.
It is possible to capture using reliable off-the-shelf cameras.
The amount of data captured is not prohibitively large.
The data can be processed into a volumetric representation in a reasonable amount of time.
The volumetric video can be edited using existing tools such as Adobe Premiere.
The video can be compressed efficiently and streamed over the internet.
The player runs in real time on the most popular, widely available mobile VR devices, which have relatively little GPU power compared with desktop VR systems.
The player runs on the web (not just standalone applications).
The player can be mixed into Unreal and Unity projects.
Lifecast’s software is designed with all of these goals in mind. Our approach is to work within the limitations of current hardware, and use more machine learning.
Recently, the Canon EOS R5 with dual fisheye lens has emerged as a category-redefining VR camera which can capture cinematic quality VR180 footage in 8K resolution. We developed a new pipeline for processing volumetric video which works with this camera, or any other VR180 camera (some other top-notch VR180 cameras include the FM Duo by FXG and the K2 Pro by Z-Cam).
Other approaches to volumetric video such as light stages use many cameras (sometimes hundreds) facing inward to capture a detailed 3D model of a person. Such systems have many uses, but they cannot capture fully immersive scenes on their own, and are not applicable to filming volumetric video in any location. RGBD depth sensors such as Azure Kinect have limited capabilities outdoors, and insufficient field of view and resolution for VR. Lifecast makes volumetric video using VR180 cameras, which have sufficient resolution and field of view for VR, but require more machine learning to process the data. Our first-generation pipeline for converting VR180 video to volumetric/6DOF shipped over a year ago. Since then, we have improved the visual quality of the results significantly.
Even in 2D, it has always been a challenge to make VR videos and photos look clear because we have to stretch a limited number of pixels to cover a sphere, or half a sphere. There just aren’t enough pixels to go around. A “projection” is a particular formula for wrapping a rectangular image around a sphere. For example, the most widely used projection for VR video and photos is equirectangular (which is also used to make maps of the earth).
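For reference, the equirectangular projection is just a linear mapping between pixel coordinates and latitude/longitude on the sphere. A minimal sketch of the standard pixel-to-ray math (not specific to Lifecast’s format) looks like this:

```python
import numpy as np

def equirect_pixel_to_ray(u: float, v: float, width: int, height: int) -> np.ndarray:
    """Map an equirectangular pixel (u, v) to a unit direction on the sphere.

    Longitude spans [-pi, pi] across the image width and latitude spans
    [-pi/2, pi/2] across the height; pixels near the poles cover far less
    solid angle than pixels at the equator, which is one reason resolution
    is spread unevenly over the sphere.
    """
    lon = (u / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - v / height) * np.pi
    return np.array([
        np.cos(lat) * np.sin(lon),  # x
        np.sin(lat),                # y (up)
        np.cos(lat) * np.cos(lon),  # z (forward)
    ])
```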
The VR180 format for VR videos and photos has recently become popular because it provides a good set of tradeoffs in resolution, field of view, and ease of stitching. VR180 consists of one image for the left eye and one for the right eye, both in equirectangular projection, but each eye gets 180 degrees (half a sphere) instead of a whole sphere.
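VR180 footage is commonly delivered with the two eyes packed side by side in a single frame; assuming that layout (a common convention, though cameras vary), splitting a frame into per-eye 180 degree equirectangular images is straightforward:

```python
import numpy as np

def split_vr180_frame(frame: np.ndarray):
    """Split a side-by-side VR180 frame (H x 2W x 3) into left/right eye images.

    Each half is a 180 degree equirectangular image: longitude spans only
    [-pi/2, pi/2] across its width, so all of its pixels are spent on the
    front hemisphere rather than the full sphere.
    """
    h, w, _ = frame.shape
    left_eye = frame[:, : w // 2]
    right_eye = frame[:, w // 2 :]
    return left_eye, right_eye
```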