Co\(^3\)Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

We propose Co\(^3\)Gesture, a novel framework that enables coherent, concurrent co-speech gesture synthesis for two-person interactive movements. We also present a Temporal Interaction Module (TIM) to ensure temporal synchronization of the gestures with the corresponding speakers' voices while preserving desirable interactive dynamics.

Moreover, we collect a large-scale concurrent co-speech gesture dataset, dubbed GES-Inter, that contains more than 7M frames of diverse, high-quality two-person interactive posture sequences.

GES-Inter Dataset

Basic information

  • Our conversational videos are collected mostly from talk shows and interviews.
  • Each video contains exactly two people, framed at an appropriate size and performing noticeable gestures.
  • Clips are accompanied by attributes such as 3D poses (including facial and mesh parameters), phonemes, text transcripts, audio, speaker diarization, and audio-speaker alignment tables.

Basic statistics

  • 1,462 high-quality videos in total.
  • Videos are cut into 27,390 clips.
  • The total duration of the videos is 70 hours.
  • More than 7M video frames.

VIDEO Examples

Some video examples and their pose visualizations from the GES-Inter dataset. Original video clips are shown on the left, and the corresponding pose visualizations are shown on the right.
If it takes too long to load the videos below, please view the videos here.

(Slight differences are inevitable because of the smoothing operation.)

• Long Video examples

Video examples with relatively long durations. If it takes too long to load the videos below, please view the videos here.

AUDIO Examples

Some examples of extracted audio and the corresponding separated audio in the GES-Inter dataset. Original / mixed audio is shown on the left, and the corresponding separated tracks are shown on the right.

Data Processing

• Dataset Processing Pipeline

Acquisition, processing, and filtering of our GES-Inter dataset. To build a high-quality 3D co-speech gesture dataset with concurrent and interactive body dynamics, we collect a considerable number of conversational videos, which are then processed with automated methods to extract both audio and motion information.
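As a concrete illustration of the automated clip-cutting step, here is a minimal sketch that trims one clip and its audio track with ffmpeg, using the start/end timestamps stored in the clip metadata (see the Metadata section below). The helper name, re-encoding settings, and 16 kHz mono audio output are assumptions, not necessarily the exact pipeline used for GES-Inter.

import subprocess

def extract_clip_and_audio(video_path, start, end, clip_path, audio_path):
    """Cut one clip from a source video and dump its mono 16 kHz audio (hypothetical helper)."""
    # Re-encode the video so the cut is frame-accurate at the given timestamps.
    subprocess.run(
        ['ffmpeg', '-y', '-i', video_path, '-ss', start, '-to', end,
         '-c:v', 'libx264', '-an', clip_path],
        check=True,
    )
    # Extract the corresponding audio segment as mono 16 kHz WAV.
    subprocess.run(
        ['ffmpeg', '-y', '-i', video_path, '-ss', start, '-to', end,
         '-vn', '-ac', '1', '-ar', '16000', audio_path],
        check=True,
    )

# Example values taken from the metadata sample in the Metadata section.
extract_clip_and_audio(
    'J-LQxBbQPTY.mp4', '00:00:00.000', '00:00:00.960',
    'J-LQxBbQPTY_0000000.mp4', 'J-LQxBbQPTY_0000000.wav',
)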

• Audio Separation and Alignment

To obtain the individual audio signal of each speaker in the conversation while preserving identity consistency with the posture movements, we employ pyannote-audio to separate the mixed speech. Afterward, using the automatic speech recognition toolkit WhisperX, we obtain text transcripts and speech phonemes consistent with each speaker's audio.
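A minimal sketch of this step is given below, assuming a mixed-audio WAV per clip; the pretrained pipeline name, the Hugging Face token placeholder, and the midpoint-based word-to-speaker assignment are assumptions rather than the exact configuration used for GES-Inter.

import whisperx
from pyannote.audio import Pipeline

AUDIO_PATH = 'clip_audio.wav'      # hypothetical path to one clip's mixed audio
DEVICE = 'cuda'

# 1) Speaker diarization: who speaks when.
diarization = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1', use_auth_token='HF_TOKEN'
)(AUDIO_PATH)
speaker_turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

# 2) Transcription and word-level alignment with WhisperX.
audio = whisperx.load_audio(AUDIO_PATH)
asr_model = whisperx.load_model('large-v2', DEVICE)
result = asr_model.transcribe(audio)
align_model, align_meta = whisperx.load_align_model(
    language_code=result['language'], device=DEVICE
)
aligned = whisperx.align(result['segments'], align_model, align_meta, audio, DEVICE)

# 3) Assign each aligned word to the speaker whose turn covers its midpoint,
#    yielding per-speaker transcripts consistent with the separated audio.
for segment in aligned['segments']:
    for word in segment.get('words', []):
        if 'start' not in word:
            continue
        mid = 0.5 * (word['start'] + word['end'])
        speaker = next(
            (spk for s, e, spk in speaker_turns if s <= mid <= e), None
        )
        print(word['word'], speaker)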

Statistical Comparison

Comparison of GES-Inter with other datasets

The dotted line separates datasets according to whether their speech content is built from a conversational corpus.

Co\(^3\)Gesture

Model Pipeline

The overall pipeline of our Co\(^3\)Gesture. Given conversational speech audio, our framework generates concurrent co-speech gestures with coherent interactions.

SOTA Methods Comparison

Here we compare the results of our Co\(^3\)Gesture with those of other state-of-the-art methods, both visually and statistically.

• Videos Comparison

Our results are shown on the left, and the results of the compared methods are shown on the right. If it takes too long to load the videos below, please view the videos here.

• Ablation Study Comparison

Results of the full model are shown on the left, and the ablated variants are shown on the right. If it takes too long to load the videos below, please view the videos here.

• Image Comparison (Screenshots from Videos)

• Statistical Comparison

Comparison with state-of-the-art counterparts on our newly collected GES-Inter dataset. ↑ means the higher the better, and ↓ means the lower the better. ± denotes the 95% confidence interval. The dotted line separates methods adapted from single-person co-speech generation from those adapted from text2motion counterparts.
A user study on gesture naturalness, motion smoothness, and interaction coherency also verifies the superiority of our method.


Metadata

Dataset metadata format

The format of a metadata JSON file is shown below:


[
  {
    'basic': {                                                 # each item in the metadata list describes one clip extracted from a longer video
      'video_id': 'J-LQxBbQPTY',                               # video ID in GES-Inter
      'video_path': './J-LQxBbQPTY.mp4',                       # path relative to the dataset root
      'video_duration': 792.52,                                # duration of the source video (seconds)
      'video_resolution': [360, 640],
      'video_fps': 25.0,
      'clip_id': 'J-LQxBbQPTY_0000000',                        # clip ID
      'clip_path': 'J-LQxBbQPTY_0000000.mp4',                  # path relative to the dataset root
      'clip_duration': 0.96,                                   # duration of the clip itself (seconds)
      'clip_start_end_idx': [0, 24],                           # start and end frame indices in the source video
      'clip_start_end_time': ['00:00:00.000', '00:00:00.960']  # start and end timestamps in the source video
    }
  },
  {
    ...
  }
]
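
Since the metadata is a list of clip entries, it can be loaded and iterated directly. A minimal sketch is shown below; the file name metadata.json and the dataset root path are assumptions.

import json
from pathlib import Path

DATASET_ROOT = Path('./GES-Inter')                    # assumed dataset root path
with open(DATASET_ROOT / 'metadata.json') as f:       # assumed metadata file name
    metadata = json.load(f)

for item in metadata:
    basic = item['basic']
    clip_file = DATASET_ROOT / basic['clip_path']     # clip path is relative to the dataset root
    start_idx, end_idx = basic['clip_start_end_idx']
    n_frames = end_idx - start_idx                    # e.g. 24 frames ~= 0.96 s at 25 fps
    print(basic['clip_id'], n_frames, basic['clip_duration'])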