Creating detailed maps of the environment has attracted growing interest in the age of self-driving vehicles and robots that carry out tasks autonomously. Until now, only small parts of the world have been mapped in 3D, and the need to do so has been limited to companies whose products rely on accurate 3D data. At the same time, with the advent of social media and photo and video sharing platforms, people are sharing more and more photos and videos on the internet. To give a rough idea, on YouTube alone 300 hours of video are uploaded every minute on average. A service that turns amateur videos into 3D scenes opens up a novel opportunity to collect and share memories. Combining 3D models of individual places would eventually lead to complete 3D maps of public places and buildings, up to whole cities.
OpenStreetMap is already picking up the idea of using videos for mapping purposes. We want to go further and allow the collected data to be used for map enhancement. This project aims to introduce a functional proof of concept for an online service that reconstructs 3D scenes from amateur videos. The final goal is to initiate a crowdsourcing approach by offering a service that might later be combined with concepts such as those seen in the product Ubipix.
2.1 Data Collection
In order to develop and test a functional 3D reconstruction pipeline we manually collected images and videos using digital cameras. Additionally, we downloaded several videos from YouTube that showed monuments in a sufficient number of consecutive frames. We thinned the data before processing it, meaning that we hand-selected certain frames or images to feed into the software. This allowed for shorter processing times and thereby faster development.
2.2 Reconstruction Software
We created a fully automated pipeline for reconstructing 3D point clouds from a set of images. To simplify user interaction, users can upload a whole video of the scene to be reconstructed. During preprocessing, we subsample the video, meaning that we extract only a limited number of frames per time interval.
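The subsampling step can be illustrated with a minimal sketch. It only computes which frame indices to keep; decoding the video is left out. The function name, its parameters and the default of one frame per second are assumptions made for illustration, not values taken from the project.

```python
def subsample_frame_indices(total_frames, fps, frames_per_interval=1, interval_seconds=1.0):
    """Return the indices of the frames to keep when thinning a video.

    Keeps `frames_per_interval` evenly spaced frames out of every
    `interval_seconds` of footage. All names and defaults are illustrative.
    """
    if fps <= 0 or frames_per_interval <= 0 or interval_seconds <= 0:
        raise ValueError("fps, frames_per_interval and interval_seconds must be positive")
    # Distance, measured in frames, between two consecutive kept frames.
    step = fps * interval_seconds / frames_per_interval
    indices = []
    k = 0
    while True:
        index = int(k * step)
        if index >= total_frames:
            break
        # Skip duplicates, which appear when the step is smaller than one frame.
        if not indices or index != indices[-1]:
            indices.append(index)
        k += 1
    return indices
```

For a 3-second clip recorded at 30 frames per second, subsample_frame_indices(90, 30.0) keeps frames 0, 30 and 60.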
Given images and their metadata (e.g. focal length, distortion parameters), the pipeline constructs a dense 3D model. The main principle is the so-called Structure from Motion (SfM) technique, which recovers the 3D structure of a scene from images taken from different viewpoints. The reconstruction pipeline consists of multiple modules. First, features are extracted from the images and matched across images to compute the transformations (relative rotation and translation) between them. These transformations are used to triangulate points in real-world coordinates. This also makes it possible to infer the original pose of the camera with respect to the scene.
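To make the triangulation idea concrete, the following sketch uses the simple midpoint method: given two camera centres and the viewing rays through a matched feature, it returns the point closest to both rays. This is a textbook technique chosen for illustration; it is not claimed to be the method used inside the reconstruction framework, and all names are hypothetical.

```python
def _dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(center1, direction1, center2, direction2):
    """Return the 3D point closest to two viewing rays (midpoint method).

    Each ray is given by a camera centre and a direction vector pointing
    from that centre through the matched image feature.
    """
    w = [p - q for p, q in zip(center1, center2)]
    a = _dot(direction1, direction1)
    b = _dot(direction1, direction2)
    c = _dot(direction2, direction2)
    d = _dot(direction1, w)
    e = _dot(direction2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point cannot be triangulated")
    # Ray parameters of the two mutually closest points.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * v for o, v in zip(center1, direction1)]
    p2 = [o + t * v for o, v in zip(center2, direction2)]
    # With noisy matches the rays do not intersect exactly, so take the midpoint.
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))
```

The parallel-ray check reflects why the images need to be taken from different viewpoints: without a baseline between the two cameras the depth of the point is undetermined.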
A second step, called bundle adjustment, optimizes the alignment of all the computed 3D points by minimizing the reprojection error of the point cloud. The framework used for this process is openMVG (open Multiple View Geometry), an open-source framework for bundle adjustment and sparse 3D reconstruction.
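The quantity that bundle adjustment minimizes can be sketched as follows. The example assumes an ideal pinhole camera without lens distortion and only evaluates the cost; the actual optimization over camera poses and points is left to a solver. The data layout and function names are illustrative assumptions and do not reflect the openMVG API.

```python
def project(point, rotation, translation, focal, principal=(0.0, 0.0)):
    """Project a 3D point into an image with an ideal pinhole camera."""
    # Transform the point from world coordinates into the camera frame.
    xc = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
          for r in range(3)]
    if xc[2] <= 0:
        raise ValueError("point lies behind the camera")
    # Perspective division followed by scaling with the focal length.
    return (focal * xc[0] / xc[2] + principal[0],
            focal * xc[1] / xc[2] + principal[1])


def total_reprojection_error(observations, points, cameras):
    """Sum of squared pixel distances between observed and projected points.

    observations: list of (camera_id, point_id, (u, v)) feature measurements
    points:       dict point_id -> (x, y, z)
    cameras:      dict camera_id -> (rotation, translation, focal)
    """
    total = 0.0
    for camera_id, point_id, (u, v) in observations:
        rotation, translation, focal = cameras[camera_id]
        pu, pv = project(points[point_id], rotation, translation, focal)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total
```

A solver would repeatedly adjust the entries of points and cameras so that this total decreases.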
In the last step of the pipeline, the sparse 3D reconstruction is refined. A dense 3D reconstruction of the scene is created by iteratively growing patches from the sparse 3D points, taking into account certain photometric constraints. The open-source library CMVS (Clustering Views for Multi-view Stereo) was used for this step.
As an optional step, we generate meshes from the 3D point clouds and apply texture to the 3D model. The freely available MeshLab was used for this post-processing step during development. This part of the pipeline is, however, excluded from the final version of the project, since it requires manual cleaning of the mesh and hence cannot be done automatically.
The entire reconstruction pipeline is currently running on the server.
2.3 Implementation as Online Service
We have implemented a basic, functioning web service where users can submit videos from YouTube and create 3D models from them. We have tested the service on our local machine and will soon migrate to a dedicated server before going public.
Making assumptions about expected numbers of users or videos per day is difficult and also not beneficial at this point. We were more interested in implementing the concept than in optimising for a particular usage profile. Therefore, the interaction between front-end and back-end was designed in a simple serial way.
Requests are processed in a first-come-first-served manner. A database keeps track of waiting and finished jobs. Once a job has finished successfully (no error during reconstruction), the resulting model is displayed on the web interface. The front-end (the web service) and the back-end (the reconstruction pipeline) thus communicate via the database.
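A first-come-first-served job table of this kind could look like the sketch below, which uses SQLite from the Python standard library. The schema, the extra status values 'running' and 'failed', and all function names are assumptions made for illustration; the source does not specify which database or schema the service uses.

```python
import sqlite3

# Hypothetical schema: one row per reconstruction request.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    video_url TEXT NOT NULL,
    status    TEXT NOT NULL DEFAULT 'waiting'
)
"""


def open_queue(path=":memory:"):
    """Open (or create) the job database shared by front-end and back-end."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn, video_url):
    """Front-end side: register a new request and return its job id."""
    cur = conn.execute("INSERT INTO jobs (video_url) VALUES (?)", (video_url,))
    conn.commit()
    return cur.lastrowid


def claim_next(conn):
    """Back-end side: take the oldest waiting job, or return None if idle."""
    row = conn.execute(
        "SELECT id, video_url FROM jobs WHERE status = 'waiting' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row


def finish(conn, job_id, ok):
    """Back-end side: record whether the reconstruction succeeded."""
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?",
                 ("finished" if ok else "failed", job_id))
    conn.commit()


def finished_jobs(conn):
    """Front-end side: list the jobs whose models can be displayed."""
    return conn.execute(
        "SELECT id, video_url FROM jobs WHERE status = 'finished' ORDER BY id"
    ).fetchall()
```

Ordering by the auto-incrementing id gives the serial, first-come-first-served behaviour; a single worker calling claim_next in a loop is enough for this design, and concurrent workers would need additional locking.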
User management: Users can register with the system and create their own profile, where they manage their videos and personal information. Registration is not mandatory, however.
Video upload module: Our web interface allows users to select videos from YouTube from which they want to create 3D models. Any video of a static scene is suitable. Because of storage constraints and for ease of implementation, we decided to use YouTube as the video storage server. A user who wants to create a 3D model from a personal video therefore first uploads the video to YouTube and then enters the link to the uploaded video on our web interface, along with some additional information. Several fields offer the option to add a description or location (city, country). An embedded map allows the user to point to the exact GPS location. We currently do not use the GPS locations, but they might be beneficial in the future when merging different point clouds based on location proximity. Some additional metadata is also requested from the user, mainly the focal length of the camera. This is optional; if it is not provided, the reconstruction pipeline uses default values.
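The handling of a submitted link and its optional metadata might be sketched as follows. The field names, the helper functions and in particular the default focal length are hypothetical placeholders; the source does not state which default values the pipeline uses.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical placeholder; the actual default and its unit are not given in the source.
DEFAULT_FOCAL_LENGTH = 35.0


def extract_video_id(url):
    """Return the video id from a common YouTube link, or None if unrecognised."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/")
    elif host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    return video_id or None


def build_submission(url, focal_length=None, description="", city="", country="", gps=None):
    """Validate a submission and fill in defaults for missing optional fields."""
    video_id = extract_video_id(url)
    if video_id is None:
        raise ValueError("not a recognised YouTube video link")
    if focal_length is not None and focal_length <= 0:
        raise ValueError("focal length must be positive")
    return {
        "video_id": video_id,
        "description": description,
        "city": city,
        "country": country,
        "gps": gps,  # stored for possible later use, not used by the pipeline
        "focal_length": DEFAULT_FOCAL_LENGTH if focal_length is None else focal_length,
        # Remember whether the value came from the user or from the fallback.
        "focal_length_is_default": focal_length is None,
    }
```

Recording whether the focal length is a fallback value keeps the option open to treat such jobs differently later, for example by letting the reconstruction refine the value.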
Videos and 3D model listing: 3D models will be displayed alongside the videos on a listing page. The web service is intended to be a completely open platform where users can submit any video, create 3D models and also see everybody else’s models.
Model display module: To display point clouds in a web browser, we use an open-source library called XB-PointStream. It supports WebGL-based point cloud streaming and rendering along with various user interactions. We chose this library because it supports streaming of point clouds, so the user does not have to wait until the entire point cloud is loaded. This is beneficial since the point clouds created by our reconstruction pipeline are usually large (tens of megabytes), and waiting for the entire point cloud to load would be tedious and impractical.
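The benefit of streaming can be illustrated independently of the rendering library with a small sketch that reads a point cloud in chunks, so that each chunk can be handed on as soon as it is parsed. The plain-text "x y z" line format, the chunk size and the function name are assumptions made for illustration; this is not the XB-PointStream API.

```python
def iter_point_chunks(lines, chunk_size=1000):
    """Yield lists of (x, y, z) tuples, at most `chunk_size` points each.

    `lines` is any iterable of text lines, e.g. an open file. Extra columns
    such as colour values are ignored; blank and '#' comment lines are skipped.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) < 3:
            continue
        chunk.append(tuple(float(value) for value in fields[:3]))
        if len(chunk) == chunk_size:
            # Hand over a full chunk before reading any further input.
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly smaller chunk.
        yield chunk
```

Because the function is a generator, a consumer can start displaying the first points while the remainder of the file is still being read.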
3.1 Testing Reconstruction Separately
As a proof of concept we downloaded a video from YouTube, extracted images from the video stream and constructed a dense 3D reconstruction in a fully automated way.
The above video was taken by a GoPro camera mounted on a quadcopter (DJI Phantom) flying above the famous Colosseum in Rome. We extracted images from a continuous shot in the video stream and fed them to the sparse reconstruction module. Below is the result. The green dots are the camera poses in the scene.
We then fed the sparse reconstruction to the multi-view stereo module for patch expansion and dense 3D reconstruction. Below is the result. As can be seen, even a small collection of 26 image frames can lead to a reasonably good reconstruction result.
3.2 Testing the Product as a Whole/Deployment
Depending on the performance we reach after deploying our service on the servers, we will advertise it to a smaller or larger number of people through mailing lists and announcements on social media.
Extensive research on available frameworks for 3D reconstruction and for front-end development allowed us to create a fully functional reconstruction service. Given that our software is running on a single machine with limited disk space, this project is still to be seen as a proof of concept.
Summary and Outlook:
An online service was developed for the automatic reconstruction of 3D models from amateur videos. The reconstruction pipeline was constructed from state-of-the-art computer vision and optimisation frameworks. The pipeline was embedded in a web interface allowing for public access and simple user management.
Processing amateur videos not only satisfies public demand but also offers many opportunities to develop this project further. Possible extensions include enhancing street maps by embedding 3D models of important points of interest. A further step might be to combine models created at neighbouring locations, refine individual models, or build larger 3D models and eventually create 3D representations of whole streets or even cities.