Blind video quality assessment (BVQA) has become a key part of video streaming pipelines, especially with the rise of short-form user-generated content (UGC). Without a reference video, BVQA estimates the quality of the underlying video content using human-guided metrics such as the mean opinion score (MOS). Many state-of-the-art methods employ large CLIP modules, leading to increasingly large deep-learning pipelines. Given the rapid growth of UGC, lightweight models could save streaming platforms massive amounts of compute and power.

We focus on developing a green-learning alternative that incorporates raw features from specific sub-domains. We find that fusing raw features capturing global and local detail with semantic information provides sufficient signal to predict MOS, even with trivial temporal schemes such as mean pooling. Currently, we generate raw features from natural scene statistics (NSS) models, including BRISQUE and V-BLIINDS, while local and semantic information is captured by a pre-trained Swin-T model. We plan to further reduce our model size using alternative feature extractors, including EfficientNet, MobileNet, or the discrete wavelet transform.
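The fusion step with mean pooling can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function name, feature dimensions (36-dim NSS features, 768-dim Swin-T embeddings), and the downstream regressor are all assumptions for demonstration.

```python
import numpy as np

def fuse_features(nss_feats: np.ndarray, semantic_feats: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame features over time, then concatenate.

    nss_feats:      (T, D_nss) per-frame natural scene statistics features
    semantic_feats: (T, D_sem) per-frame semantic embeddings (e.g., Swin-T)
    Returns a single (D_nss + D_sem,) video-level feature vector,
    which a lightweight regressor could then map to a predicted MOS.
    """
    # Trivial temporal scheme: mean pooling across the T frames.
    pooled_nss = nss_feats.mean(axis=0)
    pooled_sem = semantic_feats.mean(axis=0)
    return np.concatenate([pooled_nss, pooled_sem])

# Toy example: 30 frames, hypothetical 36-dim NSS and 768-dim semantic features.
rng = np.random.default_rng(0)
video_vec = fuse_features(rng.normal(size=(30, 36)), rng.normal(size=(30, 768)))
print(video_vec.shape)  # (804,)
```

In practice the pooled vector would feed a small regressor trained against MOS labels, keeping the learned component far lighter than end-to-end CLIP-based pipelines.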