The WILDTRACK Seven-Camera HD Dataset ‒ CVLAB ‐ EPFL

The challenging and realistic setup of the ‘WILDTRACK’ dataset brings multi-camera detection and tracking methods into the wild.

It meets the need of deep learning methods for a large-scale multi-camera dataset of walking pedestrians in which the cameras’ fields of view largely overlap. Acquired with current high-end hardware, it provides HD-resolution data. Further, its high-precision joint calibration and synchronization should allow the development of new algorithms that go beyond what is possible with currently available datasets.

Camera 1: GoPro Hero 3
Camera 2: GoPro Hero 3
Camera 3: GoPro Hero 3
Camera 4: GoPro Hero 3
Camera 5: GoPro Hero 4
Camera 6: GoPro Hero 4
Camera 7: GoPro Hero 4

Download

To download the annotated dataset (frames & annotations):

Wildtrack_dataset_full.zip

To download the videos:

Camera 1: https://drive.google.com/open?id=1sGUnExmJM2_tFuBd9LNlexf0LN2m0_c-
Camera 2: https://drive.google.com/open?id=1OnaRN2qYhZ2n4rSaNZQJzQPd1Cl2fluk
Camera 3: https://drive.google.com/open?id=1I7ARLVVfZqdZTbYG_TPEfDt1Bc8N35Lb
Camera 4: https://drive.google.com/open?id=1sXn70X-bV_YGPv43r4-iMtK_Js09eUVB
Camera 5: https://drive.google.com/open?id=1ExTyezMWqmLefi2kJTQwYxZivJuag7G5
Camera 6: https://drive.google.com/open?id=1NM4kNjdyiC6JioOT90s9vFmMUkeSnV0z
Camera 7: https://drive.google.com/open?id=1pZ3pWBuaLgPWGfOcY-tAtsBAUyQZW_YX

Hardware and data acquisition

This new multi-camera dataset was acquired using seven statically positioned cameras with overlapping fields of view: four GoPro Hero 3 and three GoPro Hero 4 cameras. It comes with highly accurate joint-camera calibration as well as synchronization between the views’ sequences.

The data acquisition took place in front of the main building of ETH Zurich, Switzerland, in good weather conditions. The sequences have a resolution of 1920×1080 pixels and were shot at 60 frames per second.

Description of available files

Currently we provide:

Synchronized frames extracted at a frame rate of 10 fps, with 1920×1080 resolution, post-processed to remove the distortion;
Calibration files which use the pinhole camera model, compatible with the projection functions provided in the OpenCV library. Both the extrinsic and the intrinsic calibrations are available;
The ground-truth annotations in ‘json’ file format (please see the separate section below);
For ease of use with methods focusing on classification, we also provide a file we refer to as the ‘positions’ file, in ‘json’ file format. For details please refer to the section below.
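
As a rough illustration of how pinhole calibration files of this kind are typically used, the sketch below projects a 3D world point into pixel coordinates from an intrinsic matrix K and an extrinsic rotation R and translation t. It is written in plain Python so that it runs without dependencies. All numeric values are hypothetical and are not taken from the provided calibration files, and the function name project_point is ours. Since the provided frames are already undistorted, no distortion coefficients are modelled here.

```python
def project_point(K, R, t, X):
    """Project a 3D world point X to pixel coordinates with a pinhole camera.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: translation (world -> camera).
    Returns (u, v), or None if the point lies behind the camera.
    """
    # World -> camera coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        return None
    # Camera -> homogeneous pixel coordinates: x = K @ Xc
    x = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # Perspective division
    return (x[0] / x[2], x[1] / x[2])


# Hypothetical values for illustration only (NOT from the calibration files).
K = [[1000.0, 0.0, 960.0],
     [0.0, 1000.0, 540.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 5.0]

print(project_point(K, R, t, [0.0, 0.0, 0.0]))  # lands on the principal point
print(project_point(K, R, t, [1.0, 0.5, 0.0]))
```

With OpenCV, the same computation is usually done with cv2.projectPoints, which expects the rotation as a Rodrigues vector; cv2.Rodrigues converts between the vector and matrix forms.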

Please check this site for updates, which shall extend the download list with:

Full videos;
Corresponding points annotations which may be used for camera calibration algorithms;
A second part of this dataset which, albeit not annotated, can be used for unsupervised methods.

Positions file

The ‘positions file’ makes it possible to skip working with the calibration files and to focus, for instance, on classification, while making use of the fact that the cameras are static. It records where exactly each of a given set of volumes of space projects to in all of the views. The height of each volume corresponds to the height of an average person.

We discretize the ground surface as a regular grid. The 3D space occupied by a person standing at a particular position is modelled by a cylinder positioned centrally on the grid point. Each cylinder projects into each of the separate 2D views as a rectangle whose position in the view is given in pixel coordinates.
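
The cylinder-to-rectangle step can be sketched as follows: sample points on the bottom and top rims of the cylinder, project each one, and take the bounding rectangle of the projected pixels. This is only a minimal illustration, not the procedure used to build the provided file. The cylinder radius and height, the ground plane at Z = 0, the toy camera in toy_project, and the choice to treat partially visible cylinders as not visible are all assumptions of ours.

```python
import math


def cylinder_to_rectangle(project, x, y, radius, height, n_samples=16):
    """Bounding rectangle (xmin, ymin, xmax, ymax) of a projected cylinder.

    `project` maps a world point (X, Y, Z) to pixels (u, v), or returns None
    when the point is not visible. The ground plane is assumed to be Z = 0.
    """
    pixels = []
    for k in range(n_samples):
        angle = 2.0 * math.pi * k / n_samples
        px = x + radius * math.cos(angle)
        py = y + radius * math.sin(angle)
        for z in (0.0, height):  # bottom rim and top rim
            uv = project((px, py, z))
            if uv is None:
                return None  # simplification: partially visible -> not visible
            pixels.append(uv)
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


def toy_project(point, f=1000.0, cx=960.0, cy=540.0, cam_height=1.5):
    """Hypothetical camera at (0, 0, cam_height) looking along the world Y axis."""
    X, Y, Z = point
    if Y <= 0:
        return None  # point is behind the camera
    # Image v grows downwards, so points above the camera get smaller v.
    return (f * X / Y + cx, f * (cam_height - Z) / Y + cy)


# Hypothetical cylinder: 0.25 m radius, 1.8 m tall, 5 m in front of the camera.
rect = cylinder_to_rectangle(toy_project, x=0.0, y=5.0, radius=0.25, height=1.8)
print(rect)
```

In practice, the project argument would be replaced by a projection built from one view's calibration, and the result would additionally need to be checked against the 1920×1080 image bounds.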

Using a 480×1440 grid, totalling 691200 positions, and the provided camera calibration files, we generated such a file, which is available for download. Each position is assigned an ID using 0-based enumeration ([0, 691199]). The views’ ordering numbers in this file follow the same enumeration, i.e. they range between 0 and 6 inclusive. Positions which are not visible in a given view are assigned coordinates of -1.
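
The ID arithmetic can be sketched as below. Note that this page does not state how IDs are laid out on the grid: the sketch assumes that the 480-wide axis varies fastest, which should be verified against the downloaded file before relying on it. The function names, and the representation of a rectangle as a sequence of pixel coordinates, are likewise assumptions for illustration.

```python
GRID_FAST = 480    # ASSUMPTION: this axis varies fastest as the ID increases
GRID_SLOW = 1440
NUM_POSITIONS = GRID_FAST * GRID_SLOW  # 691200 positions, IDs 0..691199
NUM_VIEWS = 7                          # view indices 0..6


def position_id_to_cell(pos_id):
    """Map a 0-based position ID to (fast_index, slow_index) on the grid."""
    if not 0 <= pos_id < NUM_POSITIONS:
        raise ValueError("position ID out of range: %d" % pos_id)
    return (pos_id % GRID_FAST, pos_id // GRID_FAST)


def cell_to_position_id(fast_index, slow_index):
    """Inverse of position_id_to_cell under the same layout assumption."""
    return slow_index * GRID_FAST + fast_index


def is_visible(rect):
    """A position is not visible in a view when its coordinates are all -1."""
    return not all(c == -1 for c in rect)


print(NUM_POSITIONS)
print(position_id_to_cell(691199))
print(is_visible((-1, -1, -1, -1)), is_visible((10, 20, 50, 120)))
```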

Annotations

Full ground truth annotations are provided for 400 frames at a frame rate of 2 fps. On average, there are 20 persons in each frame. Thus, our dataset provides approximately 400×20×7 = 56,000 single-view bounding boxes. By interpolating, the number of annotations can be further increased. These annotations were generated by workers hired on Amazon Mechanical Turk.
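
One simple way to interpolate is sketched below: since the frames are extracted at 10 fps and annotated at 2 fps, there are four unannotated frames between two consecutive annotated ones, and a box can be linearly interpolated across them. This is a minimal sketch, assuming that the same person can be matched across the two annotated frames and that a box is stored as (xmin, ymin, xmax, ymax) in pixels. Both the box layout and the function name are our assumptions, and the box values are made up.

```python
def interpolate_boxes(box_a, box_b, steps):
    """Linearly interpolate between two boxes of the same person.

    Returns the `steps - 1` intermediate boxes strictly between box_a and box_b.
    """
    result = []
    for k in range(1, steps):
        w = k / steps  # interpolation weight of box_b
        result.append(tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b)))
    return result


FRAME_FPS = 10       # rate of the extracted frames
ANNOTATION_FPS = 2   # rate of the annotated frames
STEPS = FRAME_FPS // ANNOTATION_FPS  # 5 frame intervals per annotation interval

# Hypothetical boxes of one person in two consecutive annotated frames.
box_t0 = (100.0, 200.0, 150.0, 350.0)
box_t1 = (120.0, 210.0, 170.0, 360.0)

intermediate = interpolate_boxes(box_t0, box_t1, STEPS)
for box in intermediate:
    print(box)
```

Linear interpolation is only an approximation of the true motion, so interpolated boxes are better regarded as weaker labels than the manually annotated ones.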

Note that the annotations roughly correspond to the coordinates of the positions file described above, and thus include the ID of the position which is estimated to be occupied by the specific target. These position IDs are in accordance with the provided positions file.

Acknowledgment

This work was supported by the Swiss National Science Foundation, under grant CRSII2-147693 “WILDTRACK”.

Publication

WILDTRACK: A Multi-camera HD Dataset for Dense Unscripted Pedestrian Detection

T. Chavdarova; P. Baqué; A. Maksai; S. Bouquet; C. Jose et al.

2018. Proceedings of the IEEE international conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun 18-23, 2018. p. 5030-5039. DOI : 10.1109/CVPR.2018.00528.
