VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

Yuxuan Wang
Zilong Zheng
Xueliang Zhao
Jinpeng Li
Yueqian Wang
Dongyan Zhao

Introduction

Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics extracted from weakly aligned video and dialogue. Most existing benchmarks treat both modalities as a frame-independent visual understanding task, neglecting intrinsic attributes of multimodal dialogues such as scene and topic transitions. In this work, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. On top of VSTAR, we propose two benchmarks for video-grounded dialogue understanding, scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation.
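For concreteness, both segmentation benchmarks can be viewed as boundary prediction over the turn sequence of a clip. The sketch below only illustrates that framing and a simple boundary-level F1; it is not the released evaluation code, and the label layout and the boundary_f1 helper are assumptions made for this example.

    # Illustrative sketch (not the official evaluation script): scene/topic
    # segmentation framed as binary boundary prediction over dialogue turns.
    # The label convention here (1 = a new scene/topic starts at this turn)
    # is an assumption, not the released data format.
    from typing import List


    def boundary_f1(pred: List[int], gold: List[int]) -> float:
        """Micro F1 over per-turn boundary labels."""
        assert len(pred) == len(gold)
        tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
        fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
        fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)


    # Toy example: an 8-turn clip with gold boundaries at turns 3 and 6.
    gold = [0, 0, 0, 1, 0, 0, 1, 0]
    pred = [0, 0, 1, 1, 0, 0, 0, 0]
    print(f"boundary F1 = {boundary_f1(pred, gold):.2f}")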


Data Statistics

VSTAR contains a total of 8,159 TV episodes and the corresponding metadata (genres, keywords, story lines). For each video we annotate dialogue scene and topic transitions. The dataset is split into train/val/test sets of 172,041/9,753/9,779 ~90-second video clips, where the 9,779 test clips are held out for further study.
Video clips

    Split    Train      Val      Test     Total
    Clips    172,041    9,753    9,779    191,573

Annotations (number of segments)

    Split    Train      Val       Test      Total
    Scene    417,154    25,777    26,165    469,096
    Topic    633,649    41,834    42,010    717,493
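The totals above also give a rough sense of segmentation density per clip; the short sketch below simply derives the averages from the table.

    # Average number of segments per ~90-second clip, from the totals above.
    total_clips = 191_573
    scene_segments = 469_096
    topic_segments = 717_493

    print(f"avg scene segments per clip: {scene_segments / total_clips:.2f}")  # ~2.45
    print(f"avg topic segments per clip: {topic_segments / total_clips:.2f}")  # ~3.75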

Examples

Case study of scene segmentation by the base model. Green bars indicate correct segmentations, red bars indicate wrong (false-positive) ones, and yellow bars indicate missed (false-negative) ones. Grey bars indicate correct topic segmentations.

Download

  • Precomputed video features (we are preparing an agreement for additional access to the original videos)
  • Dialogues and annotations (BaiduNetDisk, Gdrive)
  • Interested in VSTAR and want to try it? Please refer to our GitHub page for how to use the dataset; a minimal loading sketch is also given below. For any questions, feel free to contact Yuxuan Wang (flagwyx@gmail.com).
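As a starting point before consulting the GitHub page, here is a minimal loading sketch. The file name and JSON fields (vstar_dialogues.json, dialogue, scene_boundaries, topic_boundaries) are hypothetical placeholders, not the actual release format.

    # Minimal loading sketch. The file name and field names below are
    # hypothetical placeholders; see the GitHub page for the real format.
    import json

    with open("vstar_dialogues.json", encoding="utf-8") as f:  # hypothetical file name
        clips = json.load(f)

    for clip in clips[:3]:
        turns = clip["dialogue"]                 # hypothetical: list of utterances
        scene_bounds = clip["scene_boundaries"]  # hypothetical: per-turn 0/1 labels
        topic_bounds = clip["topic_boundaries"]  # hypothetical: per-turn 0/1 labels
        print(len(turns), sum(scene_bounds), sum(topic_bounds))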

Citation

    @inproceedings{wang-etal-2023-vstar,
        title = "{VSTAR}: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions",
        author = "Wang, Yuxuan  and
            Zheng, Zilong  and
            Zhao, Xueliang  and
            Li, Jinpeng  and
            Wang, Yueqian  and
            Zhao, Dongyan",
        booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = jul,
        year = "2023",
        address = "Toronto, Canada",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.acl-long.276",
        pages = "5036--5048",
        abstract = "Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present \textbf{Video-grounded Scene{\&}Topic AwaRe dialogue (VSTAR)} dataset, a large scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.",
    }