Co\(^3\)Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

We propose Co\(^3\)Gesture, a novel framework that enables coherent, concurrent co-speech gesture synthesis for two-person interactive movements. We also present a Temporal Interaction Module (TIM) to ensure temporal synchronization of the gestures with the corresponding speakers' voices while preserving desirable interactive dynamics.

Moreover, we collect a large-scale concurrent co-speech gesture dataset, dubbed GES-Inter, that contains more than 7M frames of diverse, high-quality two-person interactive posture sequences.

GES-Inter Dataset

Basic information

  • Our conversational videos are collected mostly from talk shows and interviews.
  • Each video contains exactly two people, framed at an appropriate size and performing noticeable gestures.
  • Clips are accompanied by attributes such as 3D poses (including facial and mesh parameters), phonemes, text transcripts, audio, speaker diarization, and audio-speaker alignment tables.

Basic statistics

  • 1,462 high-quality videos in total.
  • Videos are cut into 27,390 clips.
  • The total duration of the videos is 70 hours.
  • More than 7M video frames.

VIDEO Examples

Some video examples and their pose visualizations from the GES-Inter dataset. Original video clips are shown on the left, and the corresponding pose visualizations are shown on the right.
If it takes too long to load the videos below, please view the videos here.

(Slight differences are inevitable because of the smoothing operation.)

• Long Video examples

Video examples with relatively long durations. If it takes too long to load the videos below, please view the videos here.

AUDIO Examples

Some examples of extracted audio and the corresponding separated audio in the GES-Inter dataset. Original / mixed audio is shown on the left, and the corresponding separated tracks are shown on the right.

Data Processing

• Dataset Processing Pipeline

Acquisition, processing, and filtering of our GES-Inter dataset. To build a high-quality 3D co-speech gesture dataset with concurrent and interactive body dynamics, we collect a considerable number of conversational videos, which are then processed with automated methods to extract both audio and motion information.
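As a concrete illustration of the automated clip-cutting step, here is a minimal sketch that trims one clip and its audio track with ffmpeg, using the start/end timestamps stored in the clip metadata (see the Metadata section below). The helper name, re-encoding settings, and 16 kHz mono audio output are assumptions, not necessarily the exact pipeline used for GES-Inter.

import subprocess

def extract_clip_and_audio(video_path, start, end, clip_path, audio_path):
    """Cut one clip from a source video and dump its mono 16 kHz audio (hypothetical helper)."""
    # Re-encode the video so the cut is frame-accurate at the given timestamps.
    subprocess.run(
        ['ffmpeg', '-y', '-i', video_path, '-ss', start, '-to', end,
         '-c:v', 'libx264', '-an', clip_path],
        check=True,
    )
    # Extract the corresponding audio segment as mono 16 kHz WAV.
    subprocess.run(
        ['ffmpeg', '-y', '-i', video_path, '-ss', start, '-to', end,
         '-vn', '-ac', '1', '-ar', '16000', audio_path],
        check=True,
    )

# Example values taken from the metadata sample in the Metadata section.
extract_clip_and_audio(
    'J-LQxBbQPTY.mp4', '00:00:00.000', '00:00:00.960',
    'J-LQxBbQPTY_0000000.mp4', 'J-LQxBbQPTY_0000000.wav',
)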

• Audio Separation and Alignment

To obtain the individual audio signal of each speaker in the conversation while preserving identity consistency with the posture movements, we employ pyannote-audio to separate the mixed speech. Afterward, using the automatic speech recognition toolkit WhisperX, we obtain text transcripts and speech phonemes consistent with each speaker's audio.
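A minimal sketch of this step is given below, assuming a mixed-audio WAV per clip; the pretrained pipeline name, the Hugging Face token placeholder, and the midpoint-based word-to-speaker assignment are assumptions rather than the exact configuration used for GES-Inter.

import whisperx
from pyannote.audio import Pipeline

AUDIO_PATH = 'clip_audio.wav'      # hypothetical path to one clip's mixed audio
DEVICE = 'cuda'

# 1) Speaker diarization: who speaks when.
diarization = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1', use_auth_token='HF_TOKEN'
)(AUDIO_PATH)
speaker_turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

# 2) Transcription and word-level alignment with WhisperX.
audio = whisperx.load_audio(AUDIO_PATH)
asr_model = whisperx.load_model('large-v2', DEVICE)
result = asr_model.transcribe(audio)
align_model, align_meta = whisperx.load_align_model(
    language_code=result['language'], device=DEVICE
)
aligned = whisperx.align(result['segments'], align_model, align_meta, audio, DEVICE)

# 3) Assign each aligned word to the speaker whose turn covers its midpoint,
#    yielding per-speaker transcripts consistent with the separated audio.
for segment in aligned['segments']:
    for word in segment.get('words', []):
        if 'start' not in word:
            continue
        mid = 0.5 * (word['start'] + word['end'])
        speaker = next(
            (spk for s, e, spk in speaker_turns if s <= mid <= e), None
        )
        print(word['word'], speaker)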

Statistical Comparison

Comparison of GES-Inter with other datasets

The dotted line separates datasets according to whether their speech content is built from a conversational corpus.

Co\(^3\)Gesture

Model Pipeline

The overall pipeline of our Co\(^3\)Gesture. Given conversational speech audio, our framework generates concurrent co-speech gestures with coherent interactions.

SOTA Methods Comparison

Here we compare the results of our Co\(^3\)Gesture with those of other state-of-the-art methods, both visually and statistically.

• Videos Comparison

Our results are shown on the left, and the results of the compared methods are shown on the right. If it takes too long to load the videos below, please view the videos here.

• Ablation Study Comparison

Results of the full model are shown on the left, and the ablated variants are shown on the right. If it takes too long to load the videos below, please view the videos here.

• Image Comparison (Screenshots from Videos)

• Statistical Comparison

Comparison with state-of-the-art counterparts on our newly collected GES-Inter dataset. ↑ means the higher the better, and ↓ means the lower the better. ± denotes the 95% confidence interval. The dotted line separates methods adapted from single-person co-speech generation from those adapted from text2motion counterparts.
A user study on gesture naturalness, motion smoothness, and interaction coherency also verifies the superiority of our method.


Metadata

Dataset metadata format

The format of a metadata JSON file is shown below:


[
  {
    'basic': {                                                 # each item in the metadata list describes one clip extracted from a longer video
      'video_id': 'J-LQxBbQPTY',                               # video ID in GES-Inter
      'video_path': './J-LQxBbQPTY.mp4',                       # path relative to the dataset root
      'video_duration': 792.52,                                # duration of the source video (seconds)
      'video_resolution': [360, 640],
      'video_fps': 25.0,
      'clip_id': 'J-LQxBbQPTY_0000000',                        # clip ID
      'clip_path': 'J-LQxBbQPTY_0000000.mp4',                  # path relative to the dataset root
      'clip_duration': 0.96,                                   # duration of the clip itself (seconds)
      'clip_start_end_idx': [0, 24],                           # start and end frame indices in the source video
      'clip_start_end_time': ['00:00:00.000', '00:00:00.960']  # start and end timestamps in the source video
    }
  },
  {
    ...
  }
]
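
Since the metadata is a list of clip entries, it can be loaded and iterated directly. A minimal sketch is shown below; the file name metadata.json and the dataset root path are assumptions.

import json
from pathlib import Path

DATASET_ROOT = Path('./GES-Inter')                    # assumed dataset root path
with open(DATASET_ROOT / 'metadata.json') as f:       # assumed metadata file name
    metadata = json.load(f)

for item in metadata:
    basic = item['basic']
    clip_file = DATASET_ROOT / basic['clip_path']     # clip path is relative to the dataset root
    start_idx, end_idx = basic['clip_start_end_idx']
    n_frames = end_idx - start_idx                    # e.g. 24 frames ~= 0.96 s at 25 fps
    print(basic['clip_id'], n_frames, basic['clip_duration'])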