Saturday, January 13, 2024

Speech with ChatGPT.... honestly, it is amazing

A massively underrated feature of ChatGPT is its speech functionality on smartphones. If you haven’t tried it – do so.

The app, when opened, has a headphone symbol. Touch that, and you just speak, with the dialogue continuing. It’s quite liberating.

You can talk quicker and think more freely, and the transcription is shit hot – really good, even for someone like me with an accent. With speech you find yourself having fewer ‘texty’ thoughts, more free-flowing dialogue and more inquiry. The fact that you hear someone speak back also changes the dynamic. It is true dialogue, whereas text takes physical effort, needs a keyboard and depends upon your typing speed.

We shouldn’t be surprised at this. Our brains have evolved for speech dialogue. We did not have to learn how to speak and hear; that came naturally. We can all do it. Reading and writing, by contrast, take years to learn. The audio version seems more chatty than the text version, or maybe that’s my imagination.


It can cope with other major languages. You can translate any sentence you have in English – great for being out there in a foreign country trying to be understood. This is performance support, delivering help at a specific time of need, but you can also translate entire paragraphs by reading them in and waiting.

After downloading the app, simply click on the 'Headphones' symbol.

A white circle will appear - just speak to it.... and continue the dialogue...

It will remember what you say, which helps when you want to refine something or ask something specific when learning a language. Just tap the red button if you want to interrupt the conversation.

You can also 'pause' the conversation by tapping the pause button at the bottom left and leave it paused for as long as you wish.

Learning a language

Out for a walk and want to learn a language? Get it to ask you questions in, say, German, state what level you require, and it will tell you if your spoken English translation is correct. You can also ask it questions in German and get German replies. Clarifying any specific words is easy. I can see this revolutionising language learning.

Performance support

I made the point about performance support in translating, and this is perhaps the feature’s greatest advantage. I can imagine it helping whenever you need help but are not at a computer: on the factory floor, say, or in a meeting.

Short simulations

You can get it to do a spoken simulation. I’ve tried it with sales simulations, preparing for interviews, all sorts of tasks. It’s on the button. This really is learning by doing.


Of course, teachers, when inexperienced, often ask questions then don’t wait long enough for an answer. The teacher has automatised recall, but the learner may be retrieving the answer from long-term memory much more slowly. You learn to wait for what seems like an unnatural time, say three or more seconds. This is the sort of thing we may need to build into teaching dialogue systems using GenAI.

When latency is eliminated, and this speech has the same cadence as normal dialogue, we will see massive use in this mode. If you were asked how fast turn-taking was in real life, on average, what would you say? The fact that we have to listen, process, then think of an answer suggests something substantial. In fact it is around 300 ms.

A conversation is a social event. It takes two to tango: turns are taken (there is much less overlap than you may imagine), there are backchannels such as ‘mmm’ and ‘yeah’ that encourage others to continue, and there are different types of turns or handovers depending on the context and language game. An odd feature is the fact that we know much of what we are going to say before the other person is finished. This is why it feels different from text dialogue, where things are more considered and crisp.

We can see a time when LLMs consider their reply before the full prompt is actually written, so that free-flowing dialogue is quicker.


As AI has delivered dialogue, it seems sensible to consider dialogue as speech for all sorts of use cases, from simple queries to translations and learning. I’ve heard of people using it for brainstorming and storytelling. Try it when out with the dog, in the car… it’s a far better listener than any human.


Bear in mind that this is a Beta and that for Plus users GPT-4 has a cap of 50 messages every three hours. For users on the Enterprise plan there is no message cap. And it has some limitations, such as phonetically pronouncing Die in German as Die (as in Die Hard). Also, don’t ask it for the football scores - it's not a real-time personal assistant. For a Beta though, it's pretty amazing.
