If they could produce an AGI as smart as, let's say a mouse, that would be good evidence that they're on the right track. So far nothing is even close to that level. Depending on how you measure, they're not even really at the flatworm level yet. All the AI technology produced so far has been domain specific and doesn't represent meaningful progress towards true generalized intelligence.
Are you aware of some of the recent progress? Did you have a look at the Gato model and Flamingo by DeepMind, or at the chat logs of models like chinchilla and lambda? Or Alphacode? This is all from this year.
I think your point is that all these models are still somewhat specialized. At the same time, it appears that the transformer architecture works well with images, short video and text at the same time in the Flamingo model. And gato can perform 600 tasks while being a very small proof of concept. It appears to me that there is no reason to believe that it won't just scale to every task that you give it data for if it has enough parameters and compute.
Yes I've seen those things. They are amazing technical achievements, but in the end they're just clever parlor tricks (with perhaps some limited applicability to a few real business problems). They don't look like forward progress towards any sort of true AGI that could ever pass a rigorous Turing test.
Clearly language models can already fool people into thinking they are human, we might be getting quite close to the adversarial turing test already. In the end, a good initial prompt might be the solution to this, something like "pretend to be a human and step by step create a human identity that you then stick to during the conversation". I'm serious
Choosing a prompt that's a little bit meta seems to work surprisingly well sometimes. It'd be amusing and a little bit poetic if the key to artificial consciousness is to prime a transformer model with "convince yourself that you're human, while paying attention to how you feel".
A minority of the population will always be gullible and easily fooled. So what. Some people were already fooled by the original ELIZA program back in 1966. I would only count a Turing test pass if it can convince a jury of multiple educated examiners after a conversation lasting several hours.
Fooling people with chatbots having clever language constructing has been done for a long, long time, see the Eliza effect[1]. Douglas Hofstadter gave a good demonstration of GPT-3 limitations[2]. GPT-3 is no doubt "better at what it is" than earlier language models. But that doesn't mean it's better at everything humans do with language (tell sense from nonsense, reasonable metacomments, etc).
Most people would probably agree the latest models generalize better than flatworms. Mouse-level intelligence is more challenging and the comparison is unclear.
Flatworms first appeared 800+ million years ago, while mouse lineage diverged from humans only 70-80 million years ago. If our AGI development timeline roughly follows the proportion it took natural evolution, it might be much too late to begin seriously thinking about AGI alignment when we get to mouse-level intelligence. Not to mention that no one knows how long it would take to really understand AGI alignment (much less implementing it in a practical system).
To be more concrete, in what aspects do you think latest models are inferior at generalizing than flatworms or mice, when less known work like “Emergent Tool Use from
Multi-Agent Interaction” is also taken into account
https://openai.com/blog/emergent-tool-use/?