vstar

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

Yuxuan Wang	Zilong Zheng	Xueliang Zhao
Jinpeng Li	Yueqian Wang	Dongyan Zhao

Introduction

Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this work, we present Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation.

Data Statistics

VSTAR contains a total of 8159 TV episodes and corresponding meta data (genres, keywords, story lines). We specially annotate for each video about dialogue scene and topic transitions. The dataset are split into train/val/test: 172,041/9,573/9,779 ~90 seconds video clips, in which the 9,779 test video clips are held out for futher study.

Video clips

Train

Val

Test

Total

172,041

9,753

9,779

191,573

Annotations

Scene
Topic

Segments

Train

Val

Test

Total

417,154
633,649

25,777
41,834

26,165
42,010

469,096
717,493

Examples

Case study for scene segmentation of the base model. The green bars indicate correct segmentations, the red bars indicate wrong (false-positive) ones, and the yellow bars indicate missed (true-negative) ones. In addition, the grey bars indicate correct topic segmentation.

Download

Precomputed video features (we are polishing an agreement for additional access to the original video)

Dialogues and Annotations (BaiduNetDisk), Gdrive

Interested in VSTAR and want to try it? Please refer to our Github page on how to use the dataset. Any questions please feel free to contact Yuxuan Wang (flagwyx@gmail.com).

Citation

@inproceedings{wang-etal-2023-vstar,
    title = "{VSTAR}: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions",
    author = "Wang, Yuxuan  and
        Zheng, Zilong  and
        Zhao, Xueliang  and
        Li, Jinpeng  and
        Wang, Yueqian  and
        Zhao, Dongyan",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.276",
    pages = "5036--5048",
    abstract = "Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present \textbf{Video-grounded Scene{\&}Topic AwaRe dialogue (VSTAR)} dataset, a large scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.",
}