We collect a large-scale 3D meshed co-speech whole-body dataset, dubbed GES-X, that contains more than 40M posture instances across about 4.3K aligned speaker audios. To the best of our knowledge, this is the largest whole-body meshed 3D co-speech gesture dataset, whose duration is 15 times that of the next largest one.
Some video-pose examples from the GES-X dataset. Original video clips are shown on the left, and the corresponding pose sequences are shown on the right.
If it takes too long to load the videos below, please view the videos here.
Statistical comparison of our GES-X with existing datasets. The dotted line “- - -” separates datasets by whether their postures are built upon meshes. In terms of duration among meshed whole-body co-speech gesture datasets, our GES-X is 15 times that of the second largest one (i.e., BEAT2).
Dataset statistics comparison between our GES-X and existing meshed co-speech gesture datasets (i.e., BEAT2, TalkSHOW). Our GES-X has a much larger word corpus and a more uniformly distributed gesture motion degree.
In our work, we focus on the task of vivid and diverse co-speech 3D gesture generation from in-the-wild human voices.
Details about our task are shown in the figure below:
Along with GES-X, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight builds upon a custom-designed pretrain-finetune training paradigm.
Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block that adaptively fuses the audio embedding from human speech with the gesture features from the pre-trained gesture experts through a routing mechanism. This ensures the audio embedding is temporally coordinated with the motion features while preserving vivid and diverse gesture generation.
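To make the routing idea concrete, below is a minimal PyTorch sketch of a MoGE-style fusion block. The expert architecture, feature dimensions, and class names (`GestureExpert`, `MoGEBlock`) are illustrative assumptions for exposition, not the actual implementation used in CoCoGesture.

```python
# Minimal sketch of a Mixture-of-Gesture-Experts (MoGE)-style fusion block.
# Assumptions: frame-aligned audio features, a set of feed-forward gesture
# experts, and a per-frame softmax router; all names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureExpert(nn.Module):
    """One gesture expert (a simple feed-forward stand-in for a pre-trained expert)."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoGEBlock(nn.Module):
    """Fuses per-frame audio embeddings with expert gesture features via routing."""

    def __init__(self, audio_dim: int, motion_dim: int, num_experts: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, motion_dim)   # align audio to the motion feature space
        self.router = nn.Linear(motion_dim, num_experts)     # per-frame routing logits
        self.experts = nn.ModuleList([GestureExpert(motion_dim) for _ in range(num_experts)])

    def forward(self, audio_emb: torch.Tensor, motion_feat: torch.Tensor) -> torch.Tensor:
        # audio_emb:   (B, T, audio_dim)  frame-aligned speech features
        # motion_feat: (B, T, motion_dim) gesture features from the pre-trained experts' backbone
        a = self.audio_proj(audio_emb)                        # (B, T, motion_dim)
        fused = motion_feat + a                               # temporally aligned fusion
        weights = F.softmax(self.router(fused), dim=-1)       # (B, T, num_experts)
        expert_out = torch.stack([e(fused) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_out * weights.unsqueeze(2)).sum(-1)    # weighted sum over experts


if __name__ == "__main__":
    block = MoGEBlock(audio_dim=768, motion_dim=256)
    audio = torch.randn(2, 60, 768)    # 2 clips, 60 frames of speech features
    motion = torch.randn(2, 60, 256)   # matching gesture features
    print(block(audio, motion).shape)  # torch.Size([2, 60, 256])
```

The routing weights are computed per frame, so each time step can lean on a different expert, which is one simple way to keep the audio and motion streams temporally coordinated while retaining diversity across experts.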
Comparison with state-of-the-art counterparts on the BEAT2 and TalkSHOW datasets.
Comparison of our method with other state-of-the-art approaches. Our results are shown on the left, and the results of compared methods are shown on the right. If it takes too long to load the videos below, please view the videos here.
↑ means the higher the better.
↓ indicates the lower the better.
“-” denotes that the method cannot be applied to the TalkSHOW dataset due to the lack of text transcripts.
The term “zero-shot” indicates that the dataset contains unseen human voices.
Ablation Study: Full version results are shown on the left, and the ablated results are shown on the right. If it takes too long to load the videos below, please view the videos here.
Ablation study on model scale and pre-training setting.
‡ denotes the variant trained without the pre-training stage.
Ablation Study: Comparing full version results with ablated versions. If it takes too long to load the videos below, please view the videos here.
Ablation study of MoGE block on BEAT2 dataset.
Ablation study of direct training on the BEAT2 dataset. * denotes that our model is pre-trained on BEAT2, while † means the source training set is our GES-X.
The format of a metadata JSON file is shown below:
```
{'basic':  # Each item in the metadata list is a dict of a clip extracted from a longer video
    {'video_id': 'J-LQxBbQPTY',                           # Video ID in GES-X
     'video_path': 'TED_Talk_Videos_key/J-LQxBbQPTY.mp4', # Path relative to the dataset root
     'video_duration': 792.52,                            # Duration of the source video
     'video_resolution': [360, 640],
     'video_fps': 25.0,
     'clip_id': 'J-LQxBbQPTY_0000000',                    # Clip ID
     'clip_path': 'J-LQxBbQPTY_0000000.mp4',              # Path relative to the dataset root
     'clip_duration': 0.96,                               # Duration of the clip itself
     'clip_start_end_idx': [0, 24],                       # Start frame_id and end frame_id
     'clip_start_end_time': ['00:00:00.000', '00:00:00.960']  # Start timestamp and end timestamp
    },
 { ... }
}
```
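As a quick illustration of how the metadata can be consumed, here is a minimal sketch that loads the file and lists each clip. It assumes the metadata is valid JSON loaded as a sequence of clip entries, each carrying a 'basic' field as above; the file name `metadata.json` and the dataset root path are placeholders, not part of the official release layout.

```python
# Minimal sketch: enumerate clips from the GES-X metadata.
# Assumptions: the metadata file parses to an iterable of entries with a
# 'basic' dict matching the format shown above; paths below are placeholders.
import json
import os

DATASET_ROOT = "/path/to/GES-X"  # hypothetical dataset root

with open(os.path.join(DATASET_ROOT, "metadata.json"), "r") as f:
    metadata = json.load(f)

for item in metadata:
    basic = item["basic"]
    clip_path = os.path.join(DATASET_ROOT, basic["clip_path"])
    start_frame, end_frame = basic["clip_start_end_idx"]
    print(f'{basic["clip_id"]}: {basic["clip_duration"]:.2f}s, '
          f'frames {start_frame}-{end_frame} -> {clip_path}')
```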