GES-X: a Large-scale 3D Co-speech Gesture Dataset

We collect a large-scale meshed 3D whole-body co-speech gesture dataset, dubbed GES-X, which contains more than 40M posture instances aligned with about 4.3K speaker audio tracks. To the best of our knowledge, this is the largest meshed 3D whole-body co-speech gesture dataset, with a duration 15 times that of the next largest one.

Basic information

  • Videos are collected from talk shows.
  • 4.3K videos in total, including both standing and sitting positions.
  • Integrates a wide range of attributes, including facial, mesh, phoneme, text, body, hand, and joint annotations.

Basic statistics

  • Total duration of videos is 450 hours.
  • Average video duration is 13 minutes.
  • More than 40M gesture frames.
  • The word corpus contains 80,720 words in total.

Examples

Some video-pose examples from the GES-X dataset. Original video clips are shown on the left, and the corresponding pose sequences are shown on the right.
If it takes too long to load the videos below, please view the videos here.

GES-X Statistics

Dataset Analysis and Statistics

Overview of GES-X Dataset:

Statistical comparison of our GES-X with existing datasets. The dotted line “- - -” separates datasets whose postures are built upon mesh representations from those that are not. Among meshed whole-body co-speech gesture datasets, the duration of our GES-X is 15 times that of the second largest one (i.e., BEAT2).


Dataset statistical comparison:

Dataset statistical comparison between our GES-X and existing meshed co-speech gesture datasets (i.e., BEAT2 and TalkSHOW). Our GES-X has a much larger word corpus and a more widely and uniformly distributed range of gesture motion degrees.

CoCoGesture

Our CoCoGesture framework and its key Mixture-of-Gesture-Experts (MoGE) block.

Task Introduction

In our work, we focus on the task of vivid and diverse co-speech 3D gesture generation from in-the-wild human voices.
Details about our task are shown in the figure below:


CoCoGesture Framework

Along with GES-X, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human voice prompts. Our key insight builds upon a custom-designed pretrain-finetune training paradigm.


Mixture-of-Gesture-Experts (MoGE)

Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block that adaptively fuses the audio embedding of the human speech with the gesture features from the pre-trained gesture experts via a routing mechanism.
This design ensures that the audio embedding is temporally coordinated with the motion features while preserving vivid and diverse gesture generation. A simplified routing-based fusion is sketched below.
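The following is a minimal PyTorch-style sketch of such a routing-based fusion, given only for illustration: the module names, tensor shapes, soft-routing formulation, and linear fusion are our simplifying assumptions here and do not correspond to the actual CoCoGesture implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoGEBlock(nn.Module):
    """Illustrative Mixture-of-Gesture-Experts block (assumed structure).

    A router conditioned on the audio embedding produces soft weights over
    several gesture experts; the weighted expert output is then fused with
    the audio embedding frame by frame.
    """

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        # Toy stand-in experts; in practice these would be pre-trained gesture models.
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)   # audio embedding -> expert weights
        self.fuse = nn.Linear(2 * dim, dim)         # audio + mixed gesture -> fused feature

    def forward(self, audio_emb: torch.Tensor, gesture_feat: torch.Tensor):
        # audio_emb, gesture_feat: [batch, frames, dim], temporally aligned
        weights = F.softmax(self.router(audio_emb), dim=-1)       # [B, T, E]
        expert_out = torch.stack(
            [expert(gesture_feat) for expert in self.experts], dim=-1
        )                                                          # [B, T, D, E]
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)    # [B, T, D]
        return self.fuse(torch.cat([audio_emb, mixed], dim=-1))    # [B, T, D]


# Example usage with random tensors (batch=2, 30 frames, 256-dim features).
if __name__ == "__main__":
    block = MoGEBlock(dim=256, num_experts=4)
    audio = torch.randn(2, 30, 256)
    gesture = torch.randn(2, 30, 256)
    print(block(audio, gesture).shape)  # torch.Size([2, 30, 256])
```

In a real system the experts would be large pre-trained gesture generators rather than single linear layers, and the concatenation-plus-linear fusion above simply stands in for whatever fusion operator the framework actually uses; only the per-frame routing idea is what this sketch is meant to convey.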

CoCoGesture Results

Visualization of Our Results

Demo Video

If it takes too long to load the videos below, please view the videos here.

Gesture Predictions

If it takes too long to load the videos below, please view the videos here.

Experiments

Extensive experiments and comprehensive results demonstrate that our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.

Methods Performance Comparison

Comparison with state-of-the-art counterparts on the BEAT2 and TalkSHOW datasets.

↑ indicates that higher is better.
↓ indicates that lower is better.
“-” denotes that the method cannot be applied to the TalkSHOW dataset due to the lack of text transcripts.
The term “zero-shot” indicates that the dataset contains unseen human voices.


Experiments on Model Scale and Pre-training

Ablation study on model scale and pre-training setting.
‡ denotes the variant without the pre-training stage.


Experiments on Mixture-of-Gesture-Experts (MoGE)

Ablation study of MoGE block on BEAT2 dataset.

Metadata

Dataset metadata format

The format of a metadata JSON file is shown below.
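As a rough illustration, a per-clip metadata entry might contain fields such as those in the Python sketch below, built only from the annotation types listed earlier (facial, mesh, phoneme, text, body, hand, and joint annotations). All field names, paths, and values here are hypothetical placeholders and do not reflect the actual GES-X schema.

```python
import json

# Hypothetical per-clip metadata entry; every key and path below is an
# illustrative assumption, not the released GES-X JSON format.
example_entry = {
    "clip_id": "gesx_000001",
    "duration_sec": 780.0,               # average clip length is about 13 minutes
    "pose": "standing",                  # or "sitting"
    "audio_path": "audio/gesx_000001.wav",
    "text_transcript": "…",
    "phonemes": ["…"],
    "annotations": {
        "face": "face/gesx_000001.npz",
        "body": "body/gesx_000001.npz",
        "hand": "hand/gesx_000001.npz",
        "joints": "joints/gesx_000001.npz",
        "mesh": "mesh/gesx_000001.npz",
    },
}

# Pretty-print the entry as JSON.
print(json.dumps(example_entry, indent=2, ensure_ascii=False))
```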