We collect a large-scale 3D meshed co-speech whole-body dataset, dubbed GES-X, that contains more than 40M posture instances across about 4.3K aligned speaker audios. To the best of our knowledge, this is the largest whole-body meshed 3D co-speech gesture dataset, whose duration is 15 times that of the next largest one.
Some video-pose examples from the GES-X dataset. Original video clips are shown on the left, and the corresponding pose sequences are shown on the right.
If it takes too long to load the videos below, please view the videos here.
Statistical comparison of our GES-X with existing datasets. The dotted line “- - -” separates datasets by whether their postures are built upon meshes. In terms of duration among meshed whole-body co-speech gesture datasets, our GES-X is 15 times that of the second largest one (i.e., BEAT2).
Dataset statistics comparison between our GES-X and existing meshed co-speech gesture datasets (i.e., BEAT2, TalkSHOW). Our GES-X has a much larger word corpus and a more uniformly distributed gesture motion degree.
In our work, we focus on the task of vivid and diverse co-speech 3D gesture generation from in-the-wild human voices.
Details about our task are shown in the figure below:
Along with GES-X, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight builds upon a custom-designed pretrain-finetune training paradigm.
Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block that adaptively fuses the audio embedding from human speech with the gesture features from the pre-trained gesture experts through a routing mechanism. This ensures the audio embedding is temporally coordinated with the motion features while preserving vivid and diverse gesture generation.
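To make the routing idea concrete, below is a minimal PyTorch sketch of a MoGE-style fusion block. The expert architecture, feature dimensions, and class names (`GestureExpert`, `MoGEBlock`) are illustrative assumptions for exposition, not the actual implementation used in CoCoGesture.

```python
# Minimal sketch of a Mixture-of-Gesture-Experts (MoGE)-style fusion block.
# Assumptions: frame-aligned audio features, a set of feed-forward gesture
# experts, and a per-frame softmax router; all names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureExpert(nn.Module):
    """One gesture expert (a simple feed-forward stand-in for a pre-trained expert)."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoGEBlock(nn.Module):
    """Fuses per-frame audio embeddings with expert gesture features via routing."""

    def __init__(self, audio_dim: int, motion_dim: int, num_experts: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, motion_dim)   # align audio to the motion feature space
        self.router = nn.Linear(motion_dim, num_experts)     # per-frame routing logits
        self.experts = nn.ModuleList([GestureExpert(motion_dim) for _ in range(num_experts)])

    def forward(self, audio_emb: torch.Tensor, motion_feat: torch.Tensor) -> torch.Tensor:
        # audio_emb:   (B, T, audio_dim)  frame-aligned speech features
        # motion_feat: (B, T, motion_dim) gesture features from the pre-trained experts' backbone
        a = self.audio_proj(audio_emb)                        # (B, T, motion_dim)
        fused = motion_feat + a                               # temporally aligned fusion
        weights = F.softmax(self.router(fused), dim=-1)       # (B, T, num_experts)
        expert_out = torch.stack([e(fused) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_out * weights.unsqueeze(2)).sum(-1)    # weighted sum over experts


if __name__ == "__main__":
    block = MoGEBlock(audio_dim=768, motion_dim=256)
    audio = torch.randn(2, 60, 768)    # 2 clips, 60 frames of speech features
    motion = torch.randn(2, 60, 256)   # matching gesture features
    print(block(audio, motion).shape)  # torch.Size([2, 60, 256])
```

The routing weights are computed per frame, so each time step can lean on a different expert, which is one simple way to keep the audio and motion streams temporally coordinated while retaining diversity across experts.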
Comparison with state-of-the-art counterparts on the BEAT2 and TalkSHOW datasets.
Comparison of our method with other state-of-the-art approaches. Our results are shown on the left, and the results of compared methods are shown on the right. If it takes too long to load the videos below, please view the videos here.
↑ means the higher the better.
↓ indicates the lower the better.
“-” denotes that the method cannot be applied to the TalkSHOW dataset due to the lack of text transcripts.
The term “zero-shot” indicates that the dataset contains unseen human voices.
Ablation Study: Full version results are shown on the left, and the ablated results are shown on the right. If it takes too long to load the videos below, please view the videos here.
Ablation study on model scale and pre-training setting.
‡ denotes the variant trained without the pre-training stage.
Ablation Study: Comparing full version results with ablated versions. If it takes too long to load the videos below, please view the videos here.
Ablation study of MoGE block on BEAT2 dataset.
Ablation study of direct training on the BEAT2 dataset. * denotes that our model is pre-trained on BEAT2, while † means the source training set is our GES-X.
The format of a metadata JSON file is shown below:
```
{'basic':  # Each item in the metadata list is a dict of a clip extracted from a longer video
    {'video_id': 'J-LQxBbQPTY',                           # Video ID in GES-X
     'video_path': 'TED_Talk_Videos_key/J-LQxBbQPTY.mp4', # Path relative to the dataset root
     'video_duration': 792.52,                            # Duration of the source video
     'video_resolution': [360, 640],
     'video_fps': 25.0,
     'clip_id': 'J-LQxBbQPTY_0000000',                    # Clip ID
     'clip_path': 'J-LQxBbQPTY_0000000.mp4',              # Path relative to the dataset root
     'clip_duration': 0.96,                               # Duration of the clip itself
     'clip_start_end_idx': [0, 24],                       # Start frame_id and end frame_id
     'clip_start_end_time': ['00:00:00.000', '00:00:00.960']  # Start timestamp and end timestamp
    },
 { ... }
}
```
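As a quick illustration of how the metadata can be consumed, here is a minimal sketch that loads the file and lists each clip. It assumes the metadata is valid JSON loaded as a sequence of clip entries, each carrying a 'basic' field as above; the file name `metadata.json` and the dataset root path are placeholders, not part of the official release layout.

```python
# Minimal sketch: enumerate clips from the GES-X metadata.
# Assumptions: the metadata file parses to an iterable of entries with a
# 'basic' dict matching the format shown above; paths below are placeholders.
import json
import os

DATASET_ROOT = "/path/to/GES-X"  # hypothetical dataset root

with open(os.path.join(DATASET_ROOT, "metadata.json"), "r") as f:
    metadata = json.load(f)

for item in metadata:
    basic = item["basic"]
    clip_path = os.path.join(DATASET_ROOT, basic["clip_path"])
    start_frame, end_frame = basic["clip_start_end_idx"]
    print(f'{basic["clip_id"]}: {basic["clip_duration"]:.2f}s, '
          f'frames {start_frame}-{end_frame} -> {clip_path}')
```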