MMTrail

Basic information

The dataset is built from 290K source videos in total, most of them trailers.
Videos belong to diverse categories, including film & animation, entertainment, comedy, news & politics, etc.
Videos are accompanied by a wide range of multimodal captions, such as fine-grained video labels, descriptions, music captions, and speech transcriptions.

Basic statistics

20M+ high-quality video clips.
Total duration of the clips is 27.1k hours.
Average clip duration is about 5 s.
Each clip has an average of 10.7 words in its frame captions.
A 2M+ multimodally captioned subset, featuring over 7 types of captions, with videos averaging 13.8 seconds in duration.
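
As a quick consistency check on the figures above, dividing the total duration by the number of clips gives the average clip length. This is a minimal sketch that treats "20M+" as exactly 20 million clips (an assumption), so the result is only approximate.

```python
# Rough consistency check of the headline statistics.
# Assumption: "20M+" is treated as exactly 20 million clips.
TOTAL_HOURS = 27.1e3   # total duration in hours, from the statistics above
NUM_CLIPS = 20e6       # assumed exact clip count


def average_clip_seconds(total_hours: float, num_clips: float) -> float:
    """Return the mean clip duration in seconds."""
    return total_hours * 3600.0 / num_clips


avg_seconds = average_clip_seconds(TOTAL_HOURS, NUM_CLIPS)
print(f"Average clip duration: {avg_seconds:.2f} s")
```

The result is a little under 5 seconds, in line with the stated average.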

Examples

Some video-caption examples from the MMTrail dataset. Trailer video clips are shown on the left, and the corresponding captions are shown on the right.

Frame Captions:

A woman standing next to a wall holding a cell phone.
A woman in a blue shirt is looking at a cell phone.
A woman with long black hair standing in front of a wall.

Music Caption:
Scene Caption:
Merge Caption:

Frame Captions:

A man with long black hair is looking into the sky.
A man standing in front of a body of water at night.
A man standing in front of a body of water at night.

Music Caption:
Scene Caption:
Merge Caption:

Frame Captions:

A cat laying on a couch next to a window.
A person laying on a couch in a dark room.
A man sitting on top of a bed next to pillows.

Music Caption:
Scene Caption:
Merge Caption:

Frame Captions:

A person standing in a room with a laptop on their lap.
A person is seen through a window at night.
A blurry picture of a bus in the dark.

Music Caption:
Scene Caption:
Merge Caption:

Frame Captions:

A man sitting in front of a window looking down.
A man in a brown shirt holding a gun in his hand.
A man in a brown shirt is holding a drink.

Music Caption:
Scene Caption:
Merge Caption:

Frame Captions:

A person is standing in a dark room.
A man in a mask is standing in the dark.
A black cat sitting on top of a wooden floor.

Music Caption:
Scene Caption:
Merge Caption:

MMTrail Statistics

Dataset Analysis and Statistics

MMTrail Data Pipeline:

Data collection and cleaning pipeline of MMTrail. Starting from the source videos, we apply metadata cleaning, scene cutting, and basic filtering to obtain the full MMTrail-20M list, then apply High-Quality Selection to obtain MMTrail-2M.
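
The two-stage structure described above (basic filtering for the full list, then a stricter selection for the smaller subset) might be sketched as follows. This is only an illustration: the thresholds, the choice of scores used in each stage, and the two hypothetical clips are assumptions, not the criteria actually used to build MMTrail. The field names follow the metadata format of the dataset.

```python
from typing import Any, Dict, List

# Hypothetical thresholds for illustration only; the values actually used
# to build MMTrail-20M and MMTrail-2M are not stated in this document.
MIN_CLIP_SECONDS = 2.0
MIN_MEAN_AESTHETIC = 4.0
MIN_IMAGE_QUALITY = 40.0

Clip = Dict[str, Any]


def basic_filter(clips: List[Clip]) -> List[Clip]:
    """Stage 1 (assumed): drop clips that are too short."""
    return [c for c in clips if c["clip_duration"] >= MIN_CLIP_SECONDS]


def high_quality_selection(clips: List[Clip]) -> List[Clip]:
    """Stage 2 (assumed): keep clips whose scores clear every threshold."""
    selected = []
    for c in clips:
        # The aesthetic score is stored per key frame, so average it first.
        mean_aesthetic = sum(c["aesthetic_score"]) / len(c["aesthetic_score"])
        if (mean_aesthetic >= MIN_MEAN_AESTHETIC
                and c["image_quality"] >= MIN_IMAGE_QUALITY):
            selected.append(c)
    return selected


clips = [
    # Scores taken from the documented metadata example.
    {"clip_id": "zW1-6V_cN8I_0000141", "clip_duration": 9.92,
     "image_quality": 45.510545094807945,
     "aesthetic_score": [4.515582084655762, 4.1147027015686035, 3.796849250793457]},
    # The next two clips are made up for the example.
    {"clip_id": "hypothetical_short", "clip_duration": 0.8,
     "image_quality": 60.0, "aesthetic_score": [5.0, 5.0, 5.0]},
    {"clip_id": "hypothetical_low_quality", "clip_duration": 6.0,
     "image_quality": 20.0, "aesthetic_score": [3.0, 3.1, 2.9]},
]

full_list = basic_filter(clips)                # analogue of the full list
subset = high_quality_selection(full_list)     # analogue of the selected subset
print([c["clip_id"] for c in subset])
```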

Scores of MMTrail Dataset:

Statistics of the MMTrail clips. The reported measures are OCR score, video duration, optical flow score, clip duration, image quality, and aesthetic score. Together they illustrate the richness and diversity of MMTrail as a resource for multimedia research.

Examples of Highest & Lowest Scores:

Aesthetic Score:

Highest:

Lowest:

Optical Flow Score:

Highest:

Lowest:

Image Quality Score:

Highest:

Lowest:

OCR Score:

Highest:

Lowest:

Wordclouds of MMTrail Dataset:

Objects

Backgrounds

Word clouds of the (left) objects and (right) backgrounds in MMTrail. Most of the objects are humans, and most of the backgrounds are indoor scenes such as offices and kitchens.

Sources of MMTrail Dataset:

Distribution of video categories in the MMTrail dataset.

Human Evaluation of MMTrail Dataset:

Human evaluation results for the captioning models on MMTrail-Test. The X-axis shows the average evaluation score (0 to 10), and the Y-axis shows the average number of words per caption.
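
The average word count used as the Y-axis can be illustrated with a small sketch. How words were actually counted for the figure is not stated; splitting on whitespace is an assumption here, and the captions are the frame captions of the first example above.

```python
from typing import List


def average_word_count(captions: List[str]) -> float:
    """Mean number of whitespace-separated tokens per caption.

    Assumption: a "word" is any whitespace-delimited token, so trailing
    punctuation stays attached to the preceding word.
    """
    if not captions:
        raise ValueError("captions must not be empty")
    counts = [len(caption.split()) for caption in captions]
    return sum(counts) / len(counts)


frame_captions = [
    "A woman standing next to a wall holding a cell phone.",
    "A woman in a blue shirt is looking at a cell phone.",
    "A woman with long black hair standing in front of a wall.",
]

avg_words = average_word_count(frame_captions)
print(f"{avg_words:.2f} words per caption")
```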

Downloads

Split	Download	# Samples	Video Duration	Storage Space
Training (20M)	TODO (-)	20M	27.1 khrs	~8.0 TB
Training (2M)	TODO (-)	2M	12.2 khrs	~1.6 TB
Testing	TODO (2.77 MB)	1,000	3.5 hrs	794 MB

Metadata

Dataset metadata format

The format of a metadata JSON file is shown below:

[
{
'video_id': 'zW1-6V_cN8I', # Video ID in MMTrail
'video_path': 'group_32/zW1-6V_cN8I.mp4', # Path relative to the dataset root
'video_duration': 1645.52, # Duration of the source video
'video_resolution': [720, 1280],
'video_fps': 25.0,
'clip_id': 'zW1-6V_cN8I_0000141', # Clip ID
'clip_path': 'video_dataset_32/zW1-6V_cN8I_0000141.mp4', # Path relative to the dataset root
'clip_duration': 9.92, # Duration of the clip itself
'clip_start_end_idx': [27102, 27350], # Start and end frame indices of the clip within the source video
'image_quality': 45.510545094807945, # Image quality score
'of_score': 6.993135, # Optical flow score
'aesthetic_score': [4.515582084655762, 4.1147027015686035, 3.796849250793457],
'music_caption_wo_vocal': [{'text': 'This song features a drum machine playing a simple beat. A siren sound is played on the low register. Then, a synth plays a descending lick and the other voice starts rapping. This is followed by a descending run. The mid range of the instruments cannot be heard. This song can be played in a meditation center.', 'time': '0:00-10:00'}], # Description of the background music without vocals (human voice)
'vocal_caption': 'I was just wondering...', # Speech recognition result
'frame_caption': ['two people are standing in a room under an umbrella . ', 'a woman in a purple robe standing in front of a man . ', 'a man and a woman dressed in satin robes . '], # CoCa captions of three key frames
'music_caption': [{'text': 'This music is instrumental. The tempo is medium with a synthesiser arrangement and digital drumming with a lot of vibrato and static. The music is loud, emphatic, youthful, groovy, energetic and pulsating. This music is a Electro Trap.', 'time': '0:00-10:00'}], # Description of the background music
'objects': [' bed', 'Woman', ' wall', ' pink robe', ' pillow'],
'background': 'Bedroom',
'ocr_score': 0.0,
'caption': 'The video shows a woman in a pink robe standing in a room with a bed and a table, captured in a series of keyframes that show her in various poses and expressions.', # Caption generated by LLaVA and rewritten by LLaMA-13B
'polish_caption': 'A woman in a pink robe poses and expresses herself in various ways in a room with a bed and a table, capturing her graceful movements and emotive facial expressions.', # Polished caption generated by LLaVA and rewritten by LLaMA-13B
'merge_caption': 'In a cozy bedroom setting, a stunning woman adorned in a pink robe gracefully poses and expresses herself, her movements and facial expressions captured in a series of intimate moments. The scene is set against the backdrop of a comfortable bed and a table, with an umbrella standing in a corner of the room. The video features two people standing together under the umbrella, a woman in a purple robe standing confidently in front of a man, and a man and woman dressed in satin robes, all set to an energetic and pulsating electro trap beat with a synthesiser arrangement and digital drumming. The music is loud and emphatic, capturing the youthful and groovy vibe of the video.' # Final description of the video: all of the captions above, merged by LLaMA
}
]
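
To show how the fields of one record relate to each other, here is a minimal sketch that works on a subset of the example record above, written as a Python dict. It assumes that `clip_start_end_idx` holds frame indices at `video_fps` and that durations are in seconds; for this record that reading reproduces `clip_duration` exactly. The helper names are hypothetical and not part of any MMTrail tooling.

```python
from statistics import mean
from typing import Any, Dict, List

# A subset of the fields from the example record above.
record: Dict[str, Any] = {
    "clip_id": "zW1-6V_cN8I_0000141",
    "video_fps": 25.0,
    "clip_duration": 9.92,
    "clip_start_end_idx": [27102, 27350],
    "aesthetic_score": [4.515582084655762, 4.1147027015686035, 3.796849250793457],
    "objects": [" bed", "Woman", " wall", " pink robe", " pillow"],
}


def duration_from_frames(rec: Dict[str, Any]) -> float:
    """Clip length in seconds, assuming the indices are frames at video_fps."""
    start, end = rec["clip_start_end_idx"]
    return (end - start) / rec["video_fps"]


def clean_objects(rec: Dict[str, Any]) -> List[str]:
    """Strip stray whitespace and lowercase the object labels."""
    return [label.strip().lower() for label in rec["objects"]]


# One aesthetic score is stored per key frame; average them for a clip-level value.
mean_aesthetic = mean(record["aesthetic_score"])

print(duration_from_frames(record))
print(clean_objects(record))
print(round(mean_aesthetic, 3))
```

Normalizing the object labels is optional, but the leading spaces and mixed casing in the example suggest it may help before counting or grouping labels.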
