Show ChatGPT what you see: Voice and image features are live (for a price)

ChatGPT logo in front of eyes on screens
(Image credit: Bing Image Creator | OpenAI)

What you need to know

  • OpenAI adds image and voice recognition functions to ChatGPT, with the latter exclusive to mobile devices alongside a new advanced text-to-speech engine.
  • Both features require a subscription to either ChatGPT Plus or ChatGPT Enterprise.
  • The update will gradually roll out to English-speaking users worldwide during the next two weeks.

OpenAI is working towards a more natural user experience for ChatGPT by implementing voice and image communication that works both ways. In theory, users can spend less time typing and pondering the most effective prompts and more time seeing answers. In a recent blog post detailing its plans to gradually roll out these new capabilities, OpenAI explains who will have access and when.

Those subscribed to an individual $20-per-month ChatGPT Plus plan or a business-focused Enterprise plan will begin to see image-based prompts and responses within the next two weeks on all platforms. Meanwhile, voice conversations will be exclusive to iOS and Android devices, with a manual opt-in found in the app's 'Settings' menu under 'New Features.' OpenAI aims to mitigate errors by gradually deploying these new modes, so don't fret if you can't see them yet.

Doesn't this technology already exist?

Bing can already interpret your speech and turn it into prompts, but there are always ways to improve. (Image credit: Windows Central)

Although OpenAI takes apparent pride in this announcement, speech recognition and text-to-speech technologies have existed for years. Almost any smartphone app can transcribe your voice into written prompts, although the quality of results can vary depending on the underlying code. ChatGPT now uses Whisper, OpenAI's open-source speech recognition system, to transcribe spoken words, while the company partnered with professional voice actors to train more lifelike speech for its generative AI.

While AI assistants like Bing Chat for mobile already exist on smartphones, OpenAI's demonstration of ChatGPT's new back-and-forth voice conversations shows a rapid response time. Anything that reduces the time between interpreting spoken prompts and hearing a natural-sounding reply will undoubtedly appeal to anyone who prefers not to type on smaller screens.

An interesting tidbit from the announcement details how the new text-to-speech model can generate 'human-like audio from just text and a few seconds of sample speech,' which could be even more exciting as a way for users to digitize custom-made voices for their AI assistants.

How can ChatGPT understand what it sees?

OpenAI showed how ChatGPT can help fix a bicycle, but the possibilities are limitless. (Image credit: Daniel Rubino)

The most thrilling part of this update is ChatGPT's new ability to infer details from any image you provide. After opening your smartphone's camera for a quick snap, you can optionally highlight specific areas of inquiry; a demo video shows a user asking for help with lowering a bicycle seat. Sure enough, the app gives detailed answers, including to follow-up questions about the required tools. Naturally, the risk of mistaken identities and hallucinations immediately comes to mind, and OpenAI acknowledges the challenges ahead.

"Prior to broader deployment, we tested the model for risk in domains such as extremism and scientific proficiency, (enabling us) to align on a few key details for responsible usage."


OpenAI already has experience with 'Be My Eyes,' an AI-powered mobile app that connects the sight-impaired community with volunteers who can help describe whatever the camera is pointed toward. Between this and the ChatGPT neural network, the accuracy of identifying objects and scenes should improve over time thanks to this database of information. However, OpenAI restricts the AI from making statements about the appearance of individuals, part of a balance between ethical guidelines and technical limitations.

The image-recognition code harnesses a combination of GPT-3.5 and GPT-4, capable of recognizing anything from real-world photographs to digital screenshots and text documents. As with anything else related to the almost limitless potential of ChatGPT, OpenAI explains that this emerging technology is focused foremost on the English language. However, that may change in the future, which seems likely enough based on the recent (and rapid) history of generative AI.

Ben Wilson
Channel Editor

Ben is the channel editor for all things tech-related at Windows Central. That includes PCs, the components inside, and any accessory you can connect to a Windows desktop or Xbox console. Not restricted to one platform, he also has a keen interest in Valve's Steam Deck handheld and the Linux-based operating system inside. He has been fueling this career with coffee since 2021, and you can usually find him behind one screen or another. Find him on Mastodon to ask questions or share opinions.