Video Scene Segmentation of TV Series Using Multimodal Neural Features

Authors

  • Aman Berhe LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay
  • Claude Barras LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay claude.barras@limsi.fr
  • Camille Guinaudeau LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay camille.guinaudeau@limsi.fr

DOI:

https://doi.org/10.6092/issn.2421-454X/8967

Keywords:

Unsupervised, Scene Segmentation, TV Series, Multimodal Fusion, Neural Features

Abstract

Scene segmentation of a video, a book or a TV series organizes it into Logical Story Units and is an essential step for representing, extracting and understanding its narrative structure. We propose an automatic scene segmentation method for TV series that groups adjacent shots using a combination of multimodal neural features: visual features and textual features, further augmented with temporal information, which may improve the clustering of adjacent shots. The reported experiments compare early and late fusion of the features, video frame subsampling and various shot clustering algorithms. The proposed method achieved good recall, precision and F-measure when tested on several seasons of two popular TV series.
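
The difference between early and late fusion when grouping adjacent shots can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine similarity, the fixed threshold and the toy feature vectors are all assumptions made for illustration, and the threshold rule only stands in for the shot clustering algorithms compared in the article.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def adjacent_similarity(features):
    """Cosine similarity between each shot and the shot that follows it."""
    f = l2_normalize(features)
    return np.sum(f[:-1] * f[1:], axis=1)

def early_fusion_similarity(visual, textual):
    """Early fusion: concatenate the normalized modalities, then compare shots."""
    fused = np.hstack([l2_normalize(visual), l2_normalize(textual)])
    return adjacent_similarity(fused)

def late_fusion_similarity(visual, textual, weight=0.5):
    """Late fusion: compare shots per modality, then mix the two scores."""
    sim_visual = adjacent_similarity(visual)
    sim_textual = adjacent_similarity(textual)
    return weight * sim_visual + (1.0 - weight) * sim_textual

def scene_starts(similarity, threshold=0.5):
    """Shot indices at which a new scene starts (hypothetical threshold rule)."""
    return [int(i) + 1 for i in np.flatnonzero(similarity < threshold)]

# Toy per-shot features (assumed values): four shots, two per scene.
visual = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
textual = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

early = scene_starts(early_fusion_similarity(visual, textual))
late = scene_starts(late_fusion_similarity(visual, textual))
print(early, late)
```

In this linear sketch, with unit-normalized modalities and cosine similarity, early fusion and late fusion with a weight of 0.5 give the same scores; they diverge once the weight changes or a non-linear clustering step replaces the threshold.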

Published

2019-07-31

How to Cite

Berhe, A., Barras, C., & Guinaudeau, C. (2019). Video Scene Segmentation of TV Series Using Multimodal Neural Features. Series - International Journal of TV Serial Narratives, 5(1), 59–68. https://doi.org/10.6092/issn.2421-454X/8967

Section

Narratives / Aesthetics / Criticism