ZDNET's key takeaways
- Gemini can watch YouTube, MP4, and MOV files.
- Claude still can't process video directly.
- ChatGPT needs Codex assistance for deeper video work.
AIs do a good job understanding text from prompts and documents. Most do quite well interpreting images, but what about video? Can your favorite AI actually understand what's in a video? And if it does, what can you do with it?
Also: I tested ChatGPT Plus vs. Gemini Pro to see which is better - and if it's worth switching
In this article, we test ChatGPT, Claude, and Gemini to see how well they grok the contents of videos, both from YouTube and local files. The results are surprising.
The tests
I fed each AI a set of three videos. One is a YouTube video I published last year about the scientific process of annealing (yes, I am as exciting on video as I am on ZDNET). I tested the AIs to see if they could understand what's in the video. Then, I tried to see if they could generate a better thumbnail than the one I used on my YouTube channel.
The second video is a motion test for the DJI Neo 2 drone. It's just a video of me standing in front of the drone, using gestures to control how the drone flies. No audio. I wanted to see if the AIs understand what's happening there. That one's in MP4 format.
Also: I tested ChatGPT vs. Claude to see which is better - and if it's worth switching
Finally, I have the original MOV file that I uploaded to YouTube for a walk-and-talk about my YouTube posting strategy. I posted it last year. For my test, I used the local video file. It's the same one that I actually uploaded to YouTube. I'm using the local version for my AI test, though, because I wanted to see how well the AIs could figure out what I'm talking about without any metadata, transcripts, or hints provided by YouTube. It's just the video itself. If you want to see the uploaded version, here's a link.
I tested the latest and best models: the $20-per-month ChatGPT Plus plan, the $20-per-month Gemini Pro plan, and the $100-per-month Claude Max plan, which I use for Claude Code.
My prompt was simple: "Can you watch this video?" I found that "watch this video" worked better than "understand" or "summarize," because both of those sent the AIs looking for metadata. By contrast, when I asked the AIs to "watch" the videos, they didn't all go hunting for metadata. All three AIs seemed to know that "watch" meant I wanted them to actually watch the videos themselves and demonstrate an understanding of what was in them.
And with that, let's dig in.
Claude: nope
Sometimes prolonging the pain is just cruel. Rip that Band-Aid off and get it over with. So, Claude is a big no. Neither the app nor the web interface can, well, let's have Claude tell us, shall we?
Claude says, "I can't watch video content directly. I can't watch videos. I don't have the ability to process video or audio content from YouTube links. I don't have the ability to process the visual or audio frames of an MP4 file. Same answer here -- I can't directly watch or listen to a .mov file. I don't process video or audio streams."
And so it goes.
Gemini: yep
Gemini does have a standalone app, but I didn't need to run it. Gemini's web interface did a great job understanding all the video formats.
It didn't matter whether I fed Gemini a YouTube URL, a 625MB MP4 file, or a whopping 1.65GB MOV file. Gemini, right in a browser tab, could handle it.
Also: This powerful Gemini setting made my AI results way more personal and accurate
The most interesting one is my drone test. That's because the video contains no audio (not even background noise) and no context other than me standing there and moving my arms.
And yet, Gemini was able to figure out exactly what I was doing:
In the video, you're testing out some hand gestures -- raising your palm to the camera as if signaling it to stop or move. The camera follows your lead, changing its angle and distance as you guide it through the yard and eventually back toward the house.
Looks like a successful test of those drone gestures! Is there anything specific about the footage or the camera's response you wanted to go over?
I mean, like, wow. Let's be clear: the drone itself was not visible in the video. It was acting as the camera. I'm betting there are a lot of humans who wouldn't understand what was happening there (I'm looking at you, my neighbors!), let alone an AI.
It did successfully understand my annealing video. It was able to identify sections, report on specific points I made verbally, and otherwise demonstrate its understanding.
It also understood the uploaded walk-and-talk video, identifying not only the location but also the various aspects of my commentary throughout the video.
Also: I tested ChatGPT Images 2.0 vs. Gemini Nano Banana to see which is better - this model wins
The one place Gemini fell down was in the transition from Gemini's understanding mode to Nano Banana's image mode. Despite Nano Banana's ability to make impressive images, it doesn't understand life and the world the way the new ChatGPT Images 2.0 does, and it shows.
I fed Gemini the original thumbnail for the video and told it, "Choose a single frame for maximum impact as a YouTube thumbnail, then, based on the context of the video and my existing YouTube thumbnail style, use Nano Banana to generate a high-click-value thumbnail."
The image on the left is my original thumbnail. The next two were Gemini's attempts. They're certainly vivid and might attract more clicks, but Gemini decided to make up a dude with a beard and put him in the image rather than use my suave and sophisticated visage. And it spelled "FIRE" as "FCIRE." And so it goes.
If you want to quickly get an understanding of a video or ask questions about it, Gemini is definitely the way to go. If you want to generate thoughtful imagery from the video, Gemini is not yet up to the task.
ChatGPT: sorta, with a bit more work
Then there's ChatGPT, which is more of a bad news, good news story.
First, the bad news. ChatGPT itself failed my tests. It couldn't read the YouTube link. And while it theoretically can understand videos, they need to be under 500MB in size. Mine were not.
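If you want to check whether your own footage will clear that bar before you waste an upload attempt, a quick preflight test is easy. Here's a minimal Python sketch; note that the 500MB figure is simply the cap I ran into, not a documented limit, so treat it as an assumption:

```python
import os

# 500MB is the cap I hit in testing; OpenAI doesn't publish an exact
# limit, so treat this number as an assumption, not a spec.
MAX_BYTES = 500 * 1024 * 1024

def fits_chatgpt_upload(path: str) -> bool:
    """Return True if a local video file is small enough to upload directly."""
    return os.path.getsize(path) <= MAX_BYTES
```

Both my MP4 (625MB) and MOV (1.65GB) files fail this check, which is why the rest of this section exists.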
Then, the good news. If you combine the OpenAI Codex app with ChatGPT, it can do some very cool stuff. Codex is the agentic workhorse of OpenAI's offerings. I kind of liken Codex to that handy neighbor we all know who can build almost anything out of almost anything.
Codex was able to read both local files and understand their meaning. For the drone test, Codex reported:
It looks like a backyard drone test shot.
What's happening:
- A person stands in a residential backyard and faces the camera/drone.
- They gesture a few times (including a hand raise/wave-like motion).
- The camera viewpoint moves around them over time, changing angle and distance while keeping them mostly centered.
- The sequence ends still in the same yard area, with no major scene change or further people/activities.
Yep. That's what it's about.
Codex was also able to watch and understand the walk-and-talk MOV file I uploaded. Initially, it couldn't process the content, so Codex asked permission to install some Python code and libraries to allow it to do audio transcription. Once it built all that out, it was able not only to view the video, but also to understand its context.
But then Codex couldn't watch the YouTube stream. OK, fine. This is Codex. So, I asked, "Can you download the full video and then work on it locally?"
That worked. It automagically wrote a Python script, installed some libraries, spun up video-downloading capability on the fly, and then watched my YouTube video.
But past I wanted it to make a thumbnail. I archetypal asked if it had entree to ChatGPT Images 2.0 (remember, they're some OpenAI tools). It responded, "I person entree to representation procreation tools successful this session, but I don't person a instrumentality explicitly labeled Images 2.0 exposed to me."
Let's conscionable disregard the unfortunate "exposed to me" phrasing. I had to explicate to Codex that Images 2.0 was a thing, and constituent it to OpenAI's tract for it to understand. At that point, the agentic instrumentality was alert of the images tool, but inactive couldn't bash overmuch with it.
So, that's erstwhile I acted arsenic the conduit betwixt Codex and ChatGPT. I told Codex, "Choose a azygous framework for the maximum interaction arsenic a YouTube thumbnail, export that thumbnail determination truthful ChatGPT tin get to it, oregon truthful I tin upload it to ChatGPT, and then, based connected discourse of the video and my existing YouTube thumbnail style, constitute a punctual for ChatGPT to make a high-click-value thumbnail."
Then, successful ChatGPT, I uploaded the archetypal thumbnail representation I showed you earlier, and the framework that Codex chose. I past pasted successful the punctual Codex created. As you tin see, Codex/ChatGPT got much close than Gemini did.
It picked up connected the white, yellow, and achromatic colour strategy for my lettering. It didn't see my logo, and it didn't see the yellowish stripe I usage for my titles, but I really rather similar the juxtapositioning of my representation implicit the torch flame. ChatGPT and Codex really utilized my image, dissimilar Gemini. But I bash instrumentality contented with the aluminum bar. I utilized level material. For immoderate reason, the OpenAI tools decided to marque it into quadrate tubing.
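As an aside, the "choose a single frame" step I delegated to Codex can be roughed out in a few lines of Python. This sketch scores frames by pixel variance (a crude contrast proxy, and entirely my assumption about one way to do it, not what Codex actually ran), with frames represented as flat lists of grayscale values:

```python
from statistics import pvariance

def frame_score(pixels: list[int]) -> float:
    """Score a frame by pixel variance -- a crude proxy for visual contrast.
    Flat, washed-out frames score low; busy, high-contrast frames score high."""
    return pvariance(pixels)

def pick_thumbnail_frame(frames: list[list[int]]) -> int:
    """Return the index of the frame with the highest contrast score."""
    return max(range(len(frames)), key=lambda i: frame_score(frames[i]))
```

In practice, you'd decode real frames with a tool like ffmpeg and feed in actual pixel data, and you'd probably want smarter scoring (faces, text, composition), but the basic idea of ranking frames by a heuristic is the same.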
Also: I tried ChatGPT Images 2.0: A fun, huge leap - and surprisingly useful for real work
Here's where ChatGPT's better image cognition comes into play. I corrected it on the tubing vs. bar situation and asked it to regenerate. I prompted, "That aluminum is flat bar material about 1/8-inch thick, not square tubing. Keep everything else, but please fix the aluminum."
We were close. I didn't like how it placed the Sharpie marks (which are used to tell when the metal is hot enough to bend), and the actual bend was far too sharp. One more prompt: "Good, but the bend is too sharp. It's not a clean right angle. There's a curve because the aluminum needs to flex as it bends. Please revise. Also, the Sharpie marks are perpendicular to the edge of the bar, not on an angle. They indicate where to bend."
That's good enough. I think it's possible to feed Codex and ChatGPT a video with no additional context and get out a YouTube thumbnail. You could probably use it to analyze other types of videos and produce images from those as well.
It's not super-convenient, but it does work pretty well.
AI can indeed watch video
There are a few things to note. First, the AIs were able to fully interpret the videos in much less time than their actual play time. Both the science video and the walk-and-talk are about 15 minutes long, but both Gemini and ChatGPT were able to "watch" and parse them for understanding in what I would say was about two or three minutes each.
Second, both show fairly powerful interpretation skills. I found their ability to understand that the silent video I gave them was a drone test rather impressive. The drone mostly stayed at human height, yet they were both able to figure out context from the frames in the video.
There are certainly some practical uses. I gave Gemini a YouTube video of a CBS report on the OpenAI trial and asked it to provide details about what was discussed. I can definitely see using it to scan through security camera video to find a specific kind of action quickly.
I can also definitely see giving the AI a longer video and having it pull out the big points. What was particularly useful is that Gemini time-stamped each of the key thoughts, so I could just click the time stamps and drop into the video at that point.
Then, of course, there's the actual use of these tools to generate YouTube thumbnails. I still prefer to do it by hand. But the fact that these AIs can extract usable frames and construct thumbnails means that creators have a new tool at their disposal.
Overall, I'm impressed with Gemini and the pairing of ChatGPT and Codex for video-watching ability. Isn't it interesting that Gemini doesn't need two tools (after all, it is called "Gemini"), but ChatGPT needs Codex? Things like this amuse me.
Even though Claude bombed at this test, it still has value. Claude is one of my favorites for vibe coding.
What productivity benefits can you see getting from the video-watching capabilities of these AIs? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.