LoGAN: Latent graph co-attention network for weakly-supervised video moment retrieval

Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Scopus citations

Abstract

The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to model relationships between the vision and language data, but these mechanisms lack contextual information between video frames, which can help determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach: we improve Recall@1 by 5-20% over prior weakly-supervised methods, and even achieve an 11% gain over strongly-supervised methods on DiDeMo, while using significantly fewer model parameters than other co-attention mechanisms.
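The frame-by-word interaction and pairwise frame reasoning described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `frame_by_word_context`, the additive fusion, the scaled dot-product edge weights, and the single round of message passing are all assumptions chosen for illustration, and the sketch has no learned parameters.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_by_word_context(frames, words):
    """Hypothetical illustration, not the LoGAN architecture itself.

    frames: (T, d) array of frame features.
    words:  (N, d) array of word features.
    Returns (context, word_attn, edge) with shapes (T, d), (T, N), (T, T).
    """
    # Frame-by-word similarity: one score per (frame, word) pair.
    sim = frames @ words.T                      # (T, N)

    # Each frame attends over the words of the query.
    word_attn = softmax(sim, axis=1)            # rows sum to 1
    attended = word_attn @ words                # (T, d)

    # Assumed fusion: add the attended language features to each frame.
    fused = frames + attended                   # (T, d)

    # Fully connected graph over frames: a weight for every pair of frames.
    d = fused.shape[1]
    edge = softmax(fused @ fused.T / np.sqrt(d), axis=1)   # (T, T)

    # One round of message passing gives each frame context from all others.
    context = edge @ fused                      # (T, d)
    return context, word_attn, edge


# Usage with random features: 6 frames, 4 words, 8-dimensional embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 8))
words = rng.normal(size=(4, 8))
context, word_attn, edge = frame_by_word_context(frames, words)
print(context.shape, word_attn.shape, edge.shape)
```

The edge matrix covers all pairs of frames, which is the source of the cross-frame context that the abstract says earlier co-attention methods lack.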

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2082-2091
Number of pages: 10
ISBN (Electronic): 9780738142661
State: Published - Jan 2021
Event: 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021 - Virtual, Online, United States
Duration: Jan 5 2021 to Jan 9 2021

Publication series

Name: Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021

Conference

Conference: 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
Country/Territory: United States
City: Virtual, Online
Period: 1/5/21 to 1/9/21

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Computer Science Applications
