This work presents an interpretable and computationally efficient Video-Text Alignment (VTA) framework for cross-modal retrieval. Unlike end-to-end multimodal models that rely on large, opaque latent spaces and scale poorly with video length, VTA decomposes retrieval into transparent, modular components. The system first performs keyframe selection to reduce redundant temporal information and applies object detection to extract salient visual concepts. In parallel, captions are analyzed with part-of-speech tagging to identify meaningful nouns and verbs. By modeling co-occurrence statistics between detected objects and textual keywords, VTA prunes irrelevant candidates early, dramatically reducing the search space and cutting the number of trainable parameters to only 3% of those required for CLIP-based fine-tuning, while keeping the encoders frozen to avoid overfitting to limited video-text datasets.
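A minimal sketch of the co-occurrence pruning step described above, assuming per-video object labels and per-caption POS-filtered keywords have already been extracted; the function names, smoothing constant, and thresholding rule are illustrative assumptions, not the paper's exact formulation:

```python
from collections import Counter, defaultdict

def build_cooccurrence(train_pairs):
    """train_pairs: iterable of (object_labels, caption_keywords) for each video-caption pair."""
    obj_counts = Counter()
    joint_counts = defaultdict(Counter)
    for objects, keywords in train_pairs:
        for obj in set(objects):
            obj_counts[obj] += 1
            for kw in set(keywords):
                joint_counts[obj][kw] += 1
    return obj_counts, joint_counts

def keyword_given_object(obj, kw, obj_counts, joint_counts, smoothing=1e-6):
    """Smoothed conditional probability estimate P(keyword | object)."""
    return (joint_counts[obj][kw] + smoothing) / (obj_counts[obj] + smoothing)

def prune_candidates(query_keywords, candidates, obj_counts, joint_counts, threshold=0.01):
    """candidates: dict of video_id -> detected object labels.
    Keep a video only if some (object, keyword) pair exceeds the probability threshold."""
    kept = []
    for vid, objects in candidates.items():
        score = max(
            (keyword_given_object(o, k, obj_counts, joint_counts)
             for o in objects for k in query_keywords if o in obj_counts),
            default=0.0,
        )
        if score >= threshold:
            kept.append((vid, score))
    return sorted(kept, key=lambda x: x[1], reverse=True)
```

Videos surviving this filter are then passed to the genre-specific modules described next.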

After filtering, VTA further improves retrieval through genre-based clustering and lightweight contrastive learning modules specialized to semantically coherent subsets of the data. These modules employ simple linear projections on top of frozen encoders, yet achieve competitive accuracy by focusing on more homogeneous sample groups. The entire pipeline remains interpretable: it exposes explicit intermediate decisions, including detected objects, POS-tagged keywords, genre predictions, and the conditional probability estimates that link the two modalities. Experiments on the MSR-VTT benchmark demonstrate that VTA matches or surpasses state-of-the-art non-LLM baselines in both video-to-text and text-to-video retrieval, while offering constant-time inference and clear insight into its decision-making process.
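A minimal sketch, not the released VTA code, of what such a lightweight contrastive module could look like: one pair of linear projection heads over frozen encoder embeddings per genre cluster, trained with a standard symmetric InfoNCE objective (the class and function names, embedding dimension, and temperature are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProjectionHead(nn.Module):
    """Only these two linear layers are trainable; encoder embeddings are precomputed and frozen."""
    def __init__(self, dim_in: int, dim_out: int = 256):
        super().__init__()
        self.video_proj = nn.Linear(dim_in, dim_out)
        self.text_proj = nn.Linear(dim_in, dim_out)

    def forward(self, video_emb, text_emb):
        v = F.normalize(self.video_proj(video_emb), dim=-1)
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        return v, t

def symmetric_info_nce(v, t, temperature: float = 0.07):
    """Symmetric contrastive loss over a batch of paired (video, text) embeddings."""
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Usage: one head per genre cluster, trained only on that cluster's video-caption pairs.
head = LinearProjectionHead(dim_in=512)
video_emb = torch.randn(32, 512)   # placeholder for frozen video-encoder outputs
text_emb = torch.randn(32, 512)    # placeholder for frozen text-encoder outputs
v, t = head(video_emb, text_emb)
loss = symmetric_info_nce(v, t)
loss.backward()
```

Because each head is a single linear layer per modality and trains only on a homogeneous genre subset, the trainable-parameter count stays small relative to full CLIP fine-tuning.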