TY - GEN
T1 - Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning
AU - Luo, Zhekun
AU - Guillory, Devin
AU - Shi, Baifeng
AU - Ke, Wei
AU - Wan, Fang
AU - Darrell, Trevor
AU - Xu, Huijuan
N1 - Funding Information:
Prof. Darrell’s group was supported in part by DoD, BAIR and
Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Weakly-supervised action localization requires training a model to localize the action segments in a video given only the video-level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag’s label is known, the main challenge is identifying which key instances within the bag trigger the bag’s label. Most previous models take attention-based approaches, applying attention to generate the bag’s representation from its instances and then training it via the bag’s classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instance assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M steps and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet 1.2.
AB - Weakly-supervised action localization requires training a model to localize the action segments in a video given only the video-level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag’s label is known, the main challenge is identifying which key instances within the bag trigger the bag’s label. Most previous models take attention-based approaches, applying attention to generate the bag’s representation from its instances and then training it via the bag’s classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instance assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M steps and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet 1.2.
UR - http://www.scopus.com/inward/record.url?scp=85093081608&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093081608&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-58526-6_43
DO - 10.1007/978-3-030-58526-6_43
M3 - Conference contribution
AN - SCOPUS:85093081608
SN - 9783030585259
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 729
EP - 745
BT - Computer Vision – ECCV 2020: 16th European Conference, Proceedings
A2 - Vedaldi, Andrea
A2 - Bischof, Horst
A2 - Brox, Thomas
A2 - Frahm, Jan-Michael
PB - Springer Science and Business Media Deutschland GmbH
T2 - 16th European Conference on Computer Vision, ECCV 2020
Y2 - 23 August 2020 through 28 August 2020
ER -