NVIDIA AI Blueprint: Video Search and Summarization (VSS)
Overview
The NVIDIA Blueprint for Video Search and Summarization (VSS) provides a suite of reference architectures for building vision agents and AI-powered video analytics applications. These architectures bring together accelerated vision microservices, vision language models (VLMs), and large language models (LLMs). You can integrate these components into existing applications, run them as standalone microservices, or use them as part of a larger vision agent.
VSS is organized into three areas of processing and analysis: real-time video intelligence (feature extraction, embeddings, and stream understanding with results published to a message broker), downstream analytics (enrichment of metadata into trajectories, incidents, and verified alerts), and agentic and offline processing (orchestrated tools for search, Q&A, summarization, and clip retrieval, including via the Model Context Protocol).
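To make the three areas concrete, the following is a minimal, illustrative sketch of how data might flow between them. It is not the VSS implementation or its API: the `InMemoryBroker` class, the `Detection` record, the topic names, and all function names are hypothetical stand-ins. A real deployment would use the blueprint's own microservices and an actual message broker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical metadata record produced by real-time video intelligence."""
    stream_id: str
    timestamp: float
    label: str


class InMemoryBroker:
    """Toy stand-in for a message broker (assumption, for illustration only)."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def consume(self, topic):
        # Return all pending messages on the topic and clear it.
        messages, self._topics[topic] = self._topics[topic], []
        return messages


def realtime_stage(broker, frames):
    """Area 1: extract per-frame metadata and publish it to the broker."""
    for stream_id, timestamp, label in frames:
        broker.publish("detections", Detection(stream_id, timestamp, label))


def analytics_stage(broker, alert_labels):
    """Area 2: enrich raw metadata into incidents (here, a simple label filter)."""
    for detection in broker.consume("detections"):
        if detection.label in alert_labels:
            broker.publish("incidents", detection)


def agent_search(incidents, query):
    """Area 3: a toy search tool that an agent could call over incidents."""
    return [d for d in incidents if query.lower() in d.label.lower()]


broker = InMemoryBroker()
realtime_stage(
    broker,
    [("cam-1", 0.0, "person"), ("cam-1", 1.5, "forklift"), ("cam-2", 2.0, "person")],
)
analytics_stage(broker, alert_labels={"forklift"})
incidents = broker.consume("incidents")
results = agent_search(incidents, "forklift")
print(results)
```

The sketch only shows the separation of concerns: producers publish metadata, a downstream stage consumes and enriches it, and agent tools query the enriched results.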
This repository implements the blueprint and powers the NVIDIA build experience for natural-language video agents—search, summarization, visual Q&A, and related workflows—backed by generative AI, VLMs and LLMs, and NVIDIA NIM microservices as configured in the stacks below.
Use Case / Problem Description
The NVIDIA AI Blueprint for Video Search and Summarization addresses the challenge of deploying visual agents capable of interacting with large volumes of video data, both stored and streamed. It can be used to create vision AI agents for a wide range of use cases, such as monitoring smart spaces, warehouse automation, and standard operating procedure (SOP) validation. In these settings, quick and accurate video analysis can lead to better decision-making and improved operational efficiency.
Agent Workflows
We provide several reference agent workflows that demonstrate how an agent can use the individual components: