I literally had the same experience when I asked the top code LLMs (Claude Code, GPT-4o) to rewrite code from an Erlang/Elixir codebase to Java. They got some things right, but most things wrong, and it required a lot of debugging to figure out what went wrong.
It's absolute proof that they are still dumb prediction machines, fully reliant on the type of content they've been trained on. They can't generalize (yet), and if you want to use them for novel things, they'll fail miserably.
I just wish the LLM providers would realize this and instead provide specialized LLMs for each programming language. The results would likely be better.
The local models JetBrains IDEs use for completion are specialized per-language. For more general problems, I’m not sure over-fitting to a single language is any better for an LLM than it is for a human.
Sure, it's easier to solve an easier problem, news at eleven. In particular, translating from C# to Java could probably be automated with some 90% accuracy using a decent-sized bash script.
This mostly means that LLMs are good at simpler forms of pattern matching, and have a much harder time actually reasoning at any significant depth. (That's not easy even for the human intellect, the finest we currently have.)
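To make the "decent-sized bash script" point concrete, here's a toy sketch of what such a translator could look like: a handful of purely textual substitutions already covers a surprising amount of C#-to-Java surface syntax. This is illustrative only (the function name `cs2java` and the substitution list are my own invention); real translation needs a parser, since generics, properties, LINQ, and events won't survive regexes. Assumes GNU sed.

```shell
# Hypothetical sketch: map a few C# surface forms to their Java equivalents.
# Anything structural (properties, LINQ, events) is out of scope for regexes.
cs2java() {
  sed -e 's/^using /import /' \
      -e 's/\bstring\b/String/g' \
      -e 's/\bbool\b/boolean/g' \
      -e 's/Console\.WriteLine/System.out.println/g' \
      -e 's/\breadonly\b/final/g'
}

printf '%s\n' 'Console.WriteLine("hi");' 'readonly string name;' | cs2java
# -> System.out.println("hi");
# -> final String name;
```

That this gets you most of the way on trivial code, and falls apart on anything idiomatic, is exactly the "simpler forms of pattern matching" boundary described above.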
Claude Code / 4o struggle with this for me, but I had Claude Opus 4 rewrite a 2,500-line PowerShell script for embedded automation into Python and it did a pretty solid job. A few bugs, but cheaper models were able to clean those up. I still haven't found a great solution for general refactoring -- like, I'd love to split it out into multiple Python modules, but I rarely like how it decides to do that without me telling it specifically how to structure the modules.
I’m curious what your process was. If you just said “rewrite this in Java,” I’d expect that to fail. If you treated the LLM like a junior developer on an official project, worked with it to document the codebase, came up with a plan and tasks for each part of the codebase, and used a solid workflow prompt, I would expect it to succeed.
There is a reason to go the extra mile for juniors. They eventually learn and become seniors. With AI I'd rather just do it myself and be done with it.
But you can just do it once with AI. It’s just a scripted process that you would set up for any project. It’s just an onboarding process.
And to be clear, when I say do it once, I mean: obviously processes have to be iterated on to get exactly what you want out of them, but that’s just how process works. Once it’s working the way you want, you just reuse it.
If you try to ride a bicycle, do you expect to succeed on the first try? Getting AI code assistants to help you write high-quality code takes time. Little by little you develop a feel for which prompts work and which don't, which types of tasks the LLMs are likely to perform well on, and which are likely to result in hallucinations. It's a learning curve. A lot of people try once or twice, get bad results, and conclude that LLMs are useless. But few people conclude that bicycles are useless if they can't ride one after trying once or twice.
I will give you an example of where you are dead wrong, and one where the article is spot on (without diving into historic artifacts).
I run Home Assistant, and I don't get to play with/use it every day. Here, LLMs excel at filling in the legion of blanks in both the manual and end-user devices. There is a large body of work for them to summarize and work against.
I also play with SBCs. Many of these are "fringe" at best. Here, LLMs are, as you say, "not fit for purpose".
What kind of development you use LLMs for will determine your experience with them. The tool may or may not live up to the hype depending on how "common", well documented, and "frequent" your issue is. Once you start hitting these "walls", you realize that no, real reasoning, leaps of inference, and intelligence are still far away.
I've had the same experience. As long as the public level of knowledge is high, LLMs are massively helpful. Otherwise, not so much, and they're still hallucinating. It does not matter how highly you think of this public knowledge: QFT, QED, and gravity are fine; AD emulation on Samba or Atari BASIC, not so much.
If I were to program Atari BASIC, after finishing my Atari emulator on my C64, I would learn the environment and test my assumptions. Single-shot LLM questions won't do it. A strong agent loop probably could.
I believe that LLMs are yanking the needle to 80%. This level is easily achievable for professionals of the trade, and it is beyond the ability of beginners. LLMs are really powerful tools here. But if you are trying for 90%, LLMs keep trying to pull you back down.
And if you are trying for 100% on anything new, fringe, or exotic, LLMs are a disaster, because they do not learn and do not understand, even within the token window.
We learn that knowledge (power) and language proficiency are indicators of crystallized, but not fluid, intelligence.
80 percent of what, exactly?
A software developer's job isn't to write code; it's to understand poorly specified requirements.
LLMs do nothing for that unless your requirements are already public on Stack Overflow and GitHub. (And in that case, do you really need an LLM to copy-paste for you?)
LLMs whiffing hard on these sorts of puzzles is just amusing.
It gets even better if you change the clues from innocent things like "driving tests" or "day care pickup" to things that it doesn't really want to speak about. War crimes, suicide, dictators and so on.
Or just flat-out make up words out of whole cloth to use as "activities" in the puzzles.
> They'll never be fit for purpose. They're a technological dead-end for anything like what people are usually throwing them at, IMO.
This comment is detached from reality. LLMs in general have proven effective at creating complete, fully working, fully featured projects from scratch. You need to provide the necessary context and use popular technologies with enough corpus for the LLM to know what to do. If one-shot approaches fail, a few iterations are all it takes to bridge the gap. I know that to be a fact because I do it on a daily basis.
> Cool. How many "complete, fully working" products have you released?
Fully featured? One, so far.
I also worked on small backing services, and a GUI application to visualize the data provided by a backing service.
I lost count of the number of API testing projects I vibe-coded. I have a few instruction files that help me vibecode API test suites from the OpenAPI specs. Postman collections work even better.
> If you are far from an expert in the field maybe you should refrain from commenting so strongly because some people here actually are experts.
Your opinion makes no sense. Your so-called experts are claiming LLMs don't do vibecoding well. I, a non-expert, am quite able to vibecode my way to production-ready code. What conclusion are you hoping to draw from that? What do you think your experts' opinion will achieve? Will it suddenly delete the commits from LLMs and all the instruction prompts I put together? What point do you plan to make with your silly appeal to authority?
I repeat: non-experts are proving that what your so-called experts claim doesn't work is in fact possible, practical, and even mundane. What do you plan to draw from that?
Do what I couldn't with these supposedly capable LLMs:
- A Wear OS version of Element X for the Matrix protocol that works like Apple Watch's Walkie-Talkie and Orion: push-to-talk, easy switching between conversations/channels, and sending and playing back voice messages via the existing spec implementation so it works on all clients. Like Orion, users need to be able to replay missed messages, and to initiate and decline real-time calls. Bonus points for messaging, reactions, and switching between conversations via a list.
- Dependencies/task relationships in Nextcloud Deck and Nextcloud Tasks, e.g., `blocking`, `blocked by`, `follows` with support for more than one of each. A filtered view to show what's currently actionable and hide what isn't so people aren't scrolling through enormous lists of tasks.
- A Wear OS version of Nextcloud Tasks/Deck in a single app.
- Nextcloud Notes on Wear OS with feature parity with Google Keep.
- Implement portable identities in Matrix protocol.
- Implement P2P in Matrix protocol.
- Implement push-to-talk in Element for the Matrix protocol à la Discord, e.g., hold a key or press a button and start speaking.
- Implement message archiving in Element for the Matrix protocol à la WhatsApp, where an archived message thread no longer appears in the user's list of conversations and instead sits in an `Archived` area of the UI, but comes back out of the Archive view when a new message is received in it. Archive status needs to sync between devices.
Open source the repo(s) and issue pull requests to the main projects, provide the prompts and do a proper writeup. Pull requests for project additions need to be accepted and it all needs to respect existing specs. Otherwise, it's just yet more hot air in the comments section. Tired of all this empty bragging. It's a LARP and waste of time.
As far as I'm concerned, it is all slop and not fit for purpose. Unwarranted breathless hype akin to crypto with zero substance and endless gimmicks and kidology to appeal to hacks.
I guarantee you can't meaningfully do any of the above and get it into public builds with an LLM, but I would love to be proven wrong.
If they were so capable, it would be a revolution in FOSS, and yet anyone who heavily uses it produces a mix of inefficient, insecure, idiotic, bizarre code.