We propose Co\(^3\)Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis, including two-person interactive movements. We also present a Temporal Interaction Module (TIM) to ensure the temporal synchronization of the generated gestures with respect to the corresponding speakers' voices while preserving desirable interactive dynamics.
Moreover, we collect a large-scale concurrent co-speech gesture dataset, dubbed GES-Inter, which contains more than 7M frames of diverse, high-quality two-person interactive posture sequences.
Some video examples and their pose visualizations in the GES-Inter dataset. Original video clips are shown on the left, and the corresponding pose visualizations are shown on the right.
If it takes too long to load the videos below, please view the videos here.
(Slight differences are inevitable due to the smoothing operation.)
Video examples with relatively long durations. If it takes too long to load the videos below, please view the videos here.
Some examples of extracted audio and the corresponding separated audio in the GES-Inter dataset. Original / mixed audio clips are shown on the left, and the corresponding separated ones are shown on the right.
Acquisition, processing, and filtering of our GES-Inter dataset. To build a high-quality 3D co-speech gesture dataset with concurrent and interactive body dynamics, we collect a considerable number of videos. They are then processed using automated methods to extract both audio and motion information.
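The exact extraction tooling is not specified here; as a minimal, illustrative sketch (the output directory, sampling rate, and codec settings below are assumptions), the audio track of each collected clip can be pulled out with a standard ffmpeg call before the speech processing described next:

import subprocess
from pathlib import Path

def extract_audio(video_path: str, out_dir: str = "audio") -> str:
    """Extract a 16 kHz mono WAV track from a video clip with ffmpeg (illustrative settings)."""
    out_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
         str(out_path)],
        check=True,
    )
    return str(out_path)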
To obtain each speaker's individual voice signal while preserving identity consistency with the corresponding posture movements, we employ pyannote-audio to separate the mixed speech. Afterward, using the automatic speech recognition tool WhisperX, we acquire text transcripts and speech phonemes consistent with each speaker's audio.
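Below is a minimal sketch of how the two toolkits mentioned above are typically invoked. The specific model checkpoints ("pyannote/speaker-diarization-3.1", "large-v2"), the access token, and the file names are assumptions for illustration, not details of the dataset pipeline.

# Minimal sketch: diarize the mixed speech with pyannote-audio, then transcribe a
# (separated) single-speaker track with WhisperX. Checkpoint names, token, and
# file names below are assumptions.
from pyannote.audio import Pipeline
import whisperx

HF_TOKEN = "hf_..."   # placeholder Hugging Face access token (assumption)
device = "cuda"

# 1) Speaker diarization: who speaks when in the mixed two-person audio.
diar_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN)
diarization = diar_pipeline("mixed_speech.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")

# 2) ASR + word-level alignment with WhisperX on one speaker's audio.
asr_model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio("speaker_0.wav")
result = asr_model.transcribe(audio, batch_size=16)
align_model, align_meta = whisperx.load_align_model(
    language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, align_meta, audio, device)
print(aligned["segments"][0]["words"][:5])   # word-level timestamps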
The dotted line separates the data according to whether the speech content is built from a conversational corpus.
The overall pipeline of our Co\(^3\)Gesture. Given conversational speech audios, our framework generates concurrent co-speech gestures with coherent interactions.
Here we compare the results of our Co\(^3\)Gesture model with those of other state-of-the-art methods through visualizations and statistics.
Our results are shown on the left, and those of the compared methods are shown on the right. If it takes too long to load the videos below, please view the videos here.
Ablation Study Comparison: Results of the full model are shown on the left, and the ablated results are shown on the right. If it takes too long to load the videos below, please view the videos here.
Comparison with the state-of-the-art counterparts on our newly collected GES-Inter dataset. ↑ indicates higher is better, and ↓ indicates lower is better. ± denotes the 95% confidence interval.
The dotted line separates methods adopted from single-person co-speech gesture generation from those adopted from text2motion counterparts.
The user study on gesture naturalness, motion smoothness, and interaction coherency also verifies the superiority of our method.
The format of a metadata JSON file is shown below:
[
  {
    'basic':  # Each item in the metadata list is a dict describing a clip extracted from a longer video
      {
        'video_id': 'J-LQxBbQPTY',                                # Video ID in GES-Inter
        'video_path': './J-LQxBbQPTY.mp4',                        # Path relative to the dataset root
        'video_duration': 792.52,                                 # Duration of the source video (seconds)
        'video_resolution': [360, 640],
        'video_fps': 25.0,
        'clip_id': 'J-LQxBbQPTY_0000000',                         # Clip ID
        'clip_path': 'J-LQxBbQPTY_0000000.mp4',                   # Path relative to the dataset root
        'clip_duration': 0.96,                                    # Duration of the clip itself (seconds)
        'clip_start_end_idx': [0, 24],                            # Start frame_id and end frame_id
        'clip_start_end_time': ['00:00:00.000', '00:00:00.960']   # Start timestamp and end timestamp
      }
  },
  { ... }
]
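As a minimal sketch, assuming the metadata is stored as standard JSON in a file such as metadata.json (the filename is hypothetical), the clip entries can be loaded and iterated as follows:

import json

# Load the metadata list; each item holds a 'basic' dict describing one clip.
with open("metadata.json", "r") as f:   # hypothetical filename
    metadata = json.load(f)

for item in metadata:
    basic = item["basic"]
    start_idx, end_idx = basic["clip_start_end_idx"]
    print(basic["clip_id"], basic["clip_path"],
          f"{basic['clip_duration']:.2f}s",
          f"frames {start_idx}-{end_idx}")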