Google DeepMind's latest model can analyze hour-long videos in a single context window. We put it to the test.
Google DeepMind's Gemini Ultra 2 has achieved something no other production model can claim: the ability to process up to 10 million tokens of interleaved text, image, video, and audio in a single context window. This means you can feed the model an entire hour-long video and ask questions about any moment in it.
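As a rough sanity check on those two figures, the short Python sketch below works out how many tokens each second of video could consume if a full hour has to fit in the window. It uses only the numbers stated above. The function name is ours, and the result is a ceiling, not a description of how the model actually tokenizes video; it also ignores the tokens taken up by the prompt and the answer.

```python
CONTEXT_TOKENS = 10_000_000   # context window size stated in the article
VIDEO_SECONDS = 60 * 60       # one hour of video, in seconds


def max_tokens_per_second(context_tokens: int, video_seconds: int) -> float:
    """Upper bound on tokens per second of video if the whole clip must fit."""
    return context_tokens / video_seconds


# Hypothetical helper call: a ceiling only, not the model's real tokenization rate.
budget = max_tokens_per_second(CONTEXT_TOKENS, VIDEO_SECONDS)
print(f"At most {budget:.0f} tokens per second of video")
```

In other words, an hour-long video fits as long as frames and audio together average fewer than roughly 2,800 tokens per second.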
The 10M-token context window is more than a bigger number: it changes how video can be analyzed. Previous models required videos to be chunked into short segments, with each segment processed independently, so information in one segment was invisible when answering a question about another. Gemini Ultra 2 maintains coherent understanding across the entire video, catching references, callbacks, and thematic arcs that segment-based approaches miss.
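To make the difference concrete, here is a toy Python sketch of a question whose answer depends on two moments that are far apart in a video. It is an illustration only and says nothing about how Gemini Ultra 2 or any other model works internally. The `Event` type, the sample timeline, and the 10-minute segment length are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: int      # seconds from the start of the video
    kind: str   # "outfit" or "speech"
    text: str


def split_into_segments(events, segment_seconds):
    """Group events into fixed-length segments that are handled independently."""
    segments = {}
    for e in events:
        segments.setdefault(e.t // segment_seconds, []).append(e)
    return [segments[k] for k in sorted(segments)]


def outfit_at_first_mention(events, phrase):
    """Most recent outfit seen at or before the first mention of `phrase`,
    using only the events passed in. Returns None if either fact is missing."""
    ordered = sorted(events, key=lambda e: e.t)
    mention = next(
        (e for e in ordered if e.kind == "speech" and phrase in e.text), None
    )
    if mention is None:
        return None
    outfits = [e for e in ordered if e.kind == "outfit" and e.t <= mention.t]
    return outfits[-1].text if outfits else None


# Hypothetical timeline: the outfit is visible early, the topic comes up 45 minutes in.
timeline = [
    Event(30, "outfit", "blue blazer"),
    Event(1500, "speech", "welcome back after the break"),
    Event(2700, "speech", "let's look at quarterly revenue"),
]

# Whole video in one window: both facts are visible together.
whole = outfit_at_first_mention(timeline, "quarterly revenue")

# Independent 10-minute segments: no single segment contains both facts.
per_segment = [
    outfit_at_first_mention(seg, "quarterly revenue")
    for seg in split_into_segments(timeline, 600)
]

print(whole)        # blue blazer
print(per_segment)  # [None, None, None]
```

The segment that contains the mention has no outfit information, and the segment that shows the outfit has no mention, so every independent pass comes back empty. Real chunked pipelines often soften this with overlapping windows or per-segment summaries, but the underlying limitation is the same.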
Temporal reasoning questions require correlating information across distant parts of a video, for example: 'What was the speaker wearing when they first mentioned quarterly revenue?' In our benchmark testing, Gemini Ultra 2 answered 87% of these questions correctly. The next-best model, GPT-5 Turbo, scored 72% on the same test.
The practical applications span several industries. Legal teams can upload hour-long depositions and ask specific questions. Educators can have the model generate quizzes from lecture recordings. Media companies can automate content tagging and highlight generation across large video archives.
Gemini Ultra 2 is available for testing on Vincony.com. Upload your own video files and compare the model's understanding against GPT-5 Turbo and Claude 4 in real time.