Video Q&A LLM (Gemini)

Overview

This tutorial demonstrates how to use the Gemini API to process and analyze video content.

Specifically, it shows how to upload a video file using File API, and then use a generative model to extract descriptive information about the video.

The workflow utilizes the gemini-1.5-flash model to generate a text-based description of a given video clip.

Additionally, it provides an example of Integrating the Gemini Model into a LangChain Workflow for Video Data, showcasing how to build a chain that processes and analyzes video content seamlessly within the LangChain framework.

Table of Contents

References


Environment Setup

Set up the environment. You may refer to Environment Setup for more details.

[Note]

  • langchain-opentutorial is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.

  • You can checkout the langchain-opentutorial for more details.

%%capture --no-stderr
!pip install langchain-opentutorial
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_google_genai",
        "google-generativeai",
    ],
    verbose=False,
    upgrade=False,
)

API KEY Issuance

  • Obtain an API KEY from the link.

Important:

  • The File API used in this tutorial requires API keys for authentication and access.

  • Uploaded files are linked to the cloud project associated with the API key.

Unlike other Gemini APIs, the API key also grants access to data uploaded via the File API, so it's crucial to store the API key securely.

# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "GOOGLE_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Video-Q&A-LLM-Gemini",
    }
)
Environment variables have been set successfully.
from dotenv import load_dotenv

load_dotenv()
True

Data preparation

license free video from pexels

Please download the video and copy it to the ./data folder for the tutorial

# Set video file name
video_path = "data/sample-video.mp4"

Upload and Preprocess video using Gemini API

Next, use the File API to upload the video file.

import google.generativeai as genai

print("Uploading files...")

# Upload the file and return the file object
video_file = genai.upload_file(path=video_path)

print(f"Upload complete: {video_file.uri}")
Uploading files...
    Upload complete: https://generativelanguage.googleapis.com/v1beta/files/ycq94nkeb9gd

After uploading the file, you can call get_file to verify if the API has successfully processed the file.

get_file allows you to check the uploaded file associated with the File API in the cloud project linked to the API key.

import time

# Videos need to be processed before you can use them.
while video_file.state.name == "PROCESSING":
    print("Please wait while the video upload and preprocessing are completed...")
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Raise an exception if the processing fails
if video_file.state.name == "FAILED":
    raise ValueError(video_file.state.name)

# Print completion message
print(
    f"\nVideo processing is complete!\nYou can now start the conversation: {video_file.uri}"
)
Please wait while the video upload and preprocessing are completed...
    
    Video processing is complete!
    You can now start the conversation: https://generativelanguage.googleapis.com/v1beta/files/ycq94nkeb9gd

Generate content (Gemini API)

After the video is preprocessed, you can use the generate_content function from Gemini API to request questions about the video.

# Prompt message
prompt = "Describe this video clip"

# Set model to Gemini 1.5 Flash
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

# request response to LLM
response = model.generate_content(
    [prompt, video_file], request_options={"timeout": 600}
)
# print response
print(response.text)
Here's a description of the video clip:
    
    The video shows an aerial, high-angle view of a red passenger train traveling along a railway line that runs parallel to a road through a picturesque valley. 
    
    
    Here's a breakdown of the scene:
    
    * **The Train:** A long, red passenger train is the central focus, moving from the bottom to the middle of the frame.  It's a fairly modern-looking train.
    
    * **The Valley:** The valley is lush green, with fields dotted with yellow wildflowers (likely dandelions).  The grass is vibrant and appears to be well-maintained pastureland. Several farmhouses and buildings are scattered throughout the valley. A small stream or river meanders alongside the road and tracks.
    
    * **The Mountains:** Towering mountains, partially snow-capped, form a dramatic backdrop. The mountains are steep and rocky, showcasing a mix of textures and shades of green and grey.
    
    * **The Atmosphere:** The overall atmosphere is peaceful and idyllic, with clear blue skies and abundant sunlight suggesting a pleasant spring or summer day.
    
    The video appears to be drone footage, smoothly following the train's progress through the valley. The camera angle provides a sweeping perspective that showcases the beauty of the landscape and the integration of the train within the environment. The entire scene evokes a sense of serene beauty and the charm of rural Switzerland.
    

Below is an example of stream output (with the stream=True option added).

# Prompt message
prompt = "What type of train is shown in this video, and what color is it?"

# Set model to Gemini 1.5 Flash
model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

# request stream response to LLM
response = model.generate_content(
    [prompt, video_file], request_options={"timeout": 600}, stream=True
)

# print stream response
for chunk in response:
    print(chunk.text, end="", flush=True)
That's a narrow-gauge railway train.  More specifically, it appears to be a type of railcar used on the  Appenzell Bahn (AB) in Switzerland.  The train is primarily red in color, with some black and white accents.

Integrating the Gemini Model into a LangChain Workflow for Video Data

Here is an example of using LangChain with the Gemini model.

The model is loaded via ChatGoogleGenerativeAI from langchain_google_genai, allowing multimodal data to be included in the content of HumanMessage using the media type.

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage

# Initialize the Gemini model with the specified version
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

# Create a message to send to the model and attach the video file as media input
message = HumanMessage(
    content=[
        {"type": "text", "text": "Please analyze the content of this video."},
        {
            "type": "media",
            "mime_type": video_file.mime_type,
            "file_uri": video_file.uri,
        },
    ]
)

# Stream the response and process each chunk
for chunk in llm.stream([message]):
    print(chunk.content)
This
     video shows an aerial view of a red train traveling along a railway line that runs
     parallel to a road through a picturesque valley in what appears to be the Swiss Alps

    
    
    Here's a breakdown of the content:
    
    * **Scenery:** The valley is lush green, with fields dotted with yellow wildflowers, likely
     dandelions.  The valley is surrounded by steep, verdant hillsides and majestic snow-capped mountains in the background. A small stream or river runs
     alongside the road and railway.  Several farmhouses or chalets are scattered throughout the valley.  The overall impression is one of idyllic rural Switzerland.
    
    * **Transportation:** A long red train is the central focus, moving steadily along the
     railway tracks. The train appears to be a passenger train, given its length and the typical design.  A road runs parallel to the tracks, offering a contrasting mode of transportation.
    
    * **Camera Work:** The video is shot from a
     drone, providing a high-angle, sweeping perspective. The camera follows the train as it moves through the valley, giving viewers a sense of the scale and beauty of the landscape. The drone maintains a relatively constant distance and speed to follow the train.
    
    * **Overall Impression:** The video is visually stunning and evokes a
     sense of tranquility and the beauty of nature in a mountainous region.  It's a perfect example of promotional footage for tourism, showcasing Switzerland's landscapes and transportation infrastructure.
    

File Deletion

Files are automatically deleted after 2 days, or you can manually delete them using files.delete().

# File deletion
genai.delete_file(video_file.name)

print(f"The video has been deleted: {video_file.uri}")
The video has been deleted: https://generativelanguage.googleapis.com/v1beta/files/ycq94nkeb9gd

Last updated