Research

Research Experience

Research Interests: Multimodal Deep Learning, Computer Vision, NLP.

As an MSc student at the University of Saskatchewan (USask), I am currently conducting research in Computer Vision and Deep Learning with Dr. Mrigank Rochan. Our work has been accepted for presentation at venues such as IEEE WACV 2025 and the NeurIPS 2024 Workshop on Self-Supervised Learning. My undergraduate research on “Violent Activity Recognition” was published at IJCNN 2021 and has received over 77 citations so far. Beyond academia, I gained industry experience as a Machine Learning Engineer at Apurba Technologies, where I contributed to developing a large-scale Bengali OCR system for text recognition. Additionally, developing various academic and personal projects related to computer graphics, computer vision, and machine learning has further enriched my technical expertise.

Publications

ORCID Profile

Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence

Zahidul Islam (USask), Sujoy Paul (Google Research), Mrigank Rochan (USask)

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025, Tucson, Arizona

Feb 2025

Abstract - Existing methods typically rely either on expensive manually labeled frame-level annotations, or on a large external dataset of videos for weak supervision through category information. To overcome this, we focus on unsupervised video highlight detection, eliminating the need for manual annotations. We propose an innovative unsupervised approach that capitalizes on the premise that significant moments tend to recur across multiple videos of a similar category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video by measuring the similarities of audio features among audio clips of all the videos within each pseudo-category. Similarly, we also compute visual pseudo-highlight scores for each video using visual features. Subsequently, we combine audio and visual pseudo-highlights to create the audio-visual pseudo ground-truth highlight of each video for training an audio-visual highlight detection network. Extensive experiments and ablation studies on three highlight detection benchmarks showcase the superior performance of our method over prior work.

Paper

Test-Time Adaptation for Video Highlight Detection

Zahidul Islam (USask), Sujoy Paul (Google Research), Mrigank Rochan (USask)

Neural Information Processing Systems (NeurIPS) 2024 Workshop: Self-Supervised Learning - Theory and Practice

Dec 2024

Abstract - Existing video highlight detection methods often struggle to generalize due to varying content, styles, and audio-visual quality in unseen test videos. We propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during inference to better align with the specific characteristics of each test video, thereby improving its generalization and highlight detection performance. Highlight-TTA is jointly optimized during training using a self-supervised auxiliary task, cross-modality hallucinations, alongside the primary task of highlight detection within a meta-auxiliary training scheme to enable effective adaptation. During testing, we adapt the trained model using the self-supervised auxiliary task on the test video to enhance its highlight detection performance. Extensive experiments on three benchmark datasets demonstrate the effectiveness of Highlight-TTA.

Link

Efficient Two Stream Network for Violence Detection Using Separable Convolutional LSTM

Zahidul Islam, Mohammad Rukonuzzaman, Raiyan Ahmed, Md. Hasanul Kabir, Moshiur Farazi

International Joint Conference on Neural Networks (IJCNN) 2021

July 2021

Abstract - Automatically detecting violence from surveillance footage is a subset of activity recognition that deserves special attention because of its wide applicability in unmanned security monitoring systems, internet video filtration, etc. In this work, we propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet, where one stream takes background-suppressed frames as input and the other stream processes the differences between adjacent frames.

Paper

Towards Building A Robust Large-Scale Bangla Text Recognition Solution Using A Unique Multiple-Domain Character-Based Document Recognition Approach

AKM Shahariar Azad Rabby, Md. Majedul Islam, Zahidul Islam, Nazmul Hasan, Fuad Rahman

2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)

13-16 Dec. 2021

Abstract - Bangla is one of the world’s top ten popular languages in terms of the number of speakers. It also happens to have a complex script, primarily because of the presence of complex characters, e.g., graphemes that are composed of multiple single characters, and the characteristic shorthands, e.g., vowel diacritics and consonant diacritics, making the number of classes of this script recognition quite large, varied and challenging. In this paper, we present a unique large-scale Bangla document OCR solution based on character-level recognition modules.

Paper

A Comparative Analysis of Efficient Convolutional Neural Network Based Methods for Plant Disease Classification

Ridwan Mahbub, Samiha Anuva, Ifrad Khan, Zahidul Islam

2022 25th IEEE International Conference on Computer and Information Technology (ICCIT)

17-18 Dec. 2022

Abstract - For implementation of automated mechanisms to detect and classify plant disease, using heavy-weight convolutional neural network or CNN-driven solutions is often not practical, as farmers are not equipped with devices capable of running such heavy applications. This is why lightweight CNN architectures capable of operating on mobile and embedded devices are crucial. In this work, we present a comparative analysis and overview of different efficient CNN-based methodologies proposed for plant disease classification. Moreover, we fine-tuned off-the-shelf state-of-the-art efficient CNN architectures using transfer learning to analyze and determine the right balance of model size and accuracy.


Reviewing Experience

Worked as a reviewer for the conference IJCNN 2022 (x4)
Worked as a reviewer for the journal IEEE Access (x2)
Worked as a reviewer for the journal Applied Artificial Intelligence (AAI)