imiric 2 days ago

This is great. We need more research into solving this fundamental problem, yet AI companies prefer to chase benchmarks and pump out value-added products.

The RAG-based mitigation is interesting, but quite limited, as mentioned. It would only work if the user can provide ground-truth data, which for code generation is relatively straightforward but much more difficult for most other factual information. We can't directly rely on data from the web, since the sources need to be carefully reviewed by a human first, which is a labor-intensive task that requires domain experts.

So this approach seems like a band-aid, and wouldn't be generally applicable. I'm not in the AI industry, but from the perspective of a user it seems that the hallucination problem requires a much more foundational solution.

  • nijave 2 days ago

    I think there's room for more agentic systems that combine RAG, MCP and traditional static analyzer tools

    For instance, RAG could be used to provide coding standards and best practices, such as sanitizing user input used in file system lookups. MCP could be used to integrate with up-to-date, authoritative (official) docs. Static analysis tools could then run against the generated code and feed any errors back into the LLM to correct (rough sketch at the end of this comment).

    It seems a lot of tools rely on raw LLM queries and expect the IDE or other tools to take over instead of providing a consolidated experience.
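
    Rough sketch of the kind of consolidated pipeline I mean, in Python. retrieve_standards, fetch_official_docs and ask_llm are made-up placeholders for the RAG, MCP and model pieces (not any real API), and ruff just stands in for whatever static analyzer you'd actually run:

        import subprocess
        import tempfile

        def generate_with_checks(task: str) -> str:
            standards = retrieve_standards(task)   # RAG: coding standards / best practices (placeholder)
            docs = fetch_official_docs(task)       # MCP: current official docs (placeholder)
            prompt = f"{standards}\n{docs}\nTask: {task}"
            code = ask_llm(prompt)                 # placeholder for the actual model call

            # Static analysis pass: write the code out, lint it, feed findings back once
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            report = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
            if report.returncode != 0:
                code = ask_llm(prompt + "\nFix these findings:\n" + report.stdout)
            return code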

rbanffy a day ago

I find it interesting that we say "hallucination" when an LLM is wrong, but when we are wrong we simply made a mistake.

  • stitched2gethr a day ago

    Yes, although when I'm wrong I rarely say "Oh, now I see. This is definitely it!"

    • rbanffy a day ago

      They do a lot of embarrassment avoidance and gaslighting.

simonw 2 days ago

I still don't think hallucinations in generated code matter very much. They show up the moment you try to run the code, and with the current batch of "coding agent" systems it's the LLM itself that spots the error when it attempts to run what it wrote.

I was surprised that this paper talked more about RAG solutions than tool-use based solutions. Those seem to me like a proven solution at this point.

  • imiric 2 days ago

    I'm surprised to read that from a prominent figure in the industry such as yourself.

    The problem is that many hallucinations do not produce a runtime error and can be very difficult for a human to spot, even if the code is thoroughly reviewed, which in many cases doesn't happen. They can introduce security issues, do something completely different from what the user asked (or didn't ask) for, do things inefficiently, ignore conventions and language idioms, or just be dead code.
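
    A contrived illustration of what I mean (not taken from any real model output): ask for a secure password-reset token and get something that runs fine and looks plausible, yet quietly uses a non-cryptographic RNG. Nothing ever fails, so nothing gets fed back to the model:

        import random
        import string

        def reset_token(length: int = 32) -> str:
            # Runs without error and "works", but random is not a CSPRNG, so the
            # tokens are predictable. No compiler, interpreter or quick test run
            # will flag this; the fix (secrets.choice) requires a human to notice.
            alphabet = string.ascii_letters + string.digits
            return "".join(random.choices(alphabet, k=length))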

    For runtime errors, feeding them back to the LLM, as you say, might fix them. But even in those cases, the produced "fix" can often contain more hallucinations. I don't use agents, but I've often experienced the loop of pasting the error back to the LLM, only to get a confident yet non-working response using hallucinated APIs.

    So this problem is not something external tools can solve, and requires a much deeper solution. RAG might be a good initial attempt, but I suspect an architectural solution will be needed to address the root cause. This is important because hallucination is a general problem, and doesn't affect just code generation.

    • simonw a day ago

      If you define "hallucinations" to mean "any mistakes at all" then yes, a compiler won't catch them for you.

      I define hallucinations as a particular class of mistakes where the LLM invents, e.g., a function or method that does not exist. Those are solved by ensuring the code runs. I wrote more about that here: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
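
      For example, something like this (an invented snippet, but typical of the pattern) fails the first time it runs:

          import json

          # AttributeError: module 'json' has no attribute 'parse'.
          # The real API is json.loads; running the code surfaces the mistake immediately.
          data = json.parse('{"id": 1}')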

      Even beyond that more narrow definition of a hallucination, tool use is relevant to general mistakes made by an LLM. The new Phoenix.new coding agent actively tests the web applications it is writing using a headless browser, for example: https://simonwillison.net/2025/Jun/23/phoenix-new/

      The more tools like this come into play, the less concern I have about the big black box of matrices occasionally hallucinating up some code that is broken in obvious or subtle ways.

      It's still on us as the end users to confirm that the code written for us actually does the job we set out to solve. I'm fine with that too.

      • imiric a day ago

        > If you define "hallucinations" to mean "any mistakes at all" then yes, a compiler won't catch them for you.

        That's not quite my definition. If we're judging these tools by the same criteria we use to judge human programmers, then mistakes and bugs should be acceptable. I'm fine with this to a certain extent, even though these tools are being marketed as having superhuman abilities. But the problem is that LLMs create an entirely unique class of issues that most humans don't. Using nonexistent APIs is just one symptom of it. Like I mentioned in the comment below, they might hallucinate requirements that were never specified, or fixes for bugs that don't exist, all the while producing code that compiles and runs without errors.

        But let's assume that we narrow down the definition of hallucination to usage of nonexistent APIs. Your proposed solution is to feed the error back to the LLM. Great, but can you guarantee that the proposed fix will also not contain hallucinations? As I also mentioned, on most occasions when I've done this the LLM simply produces more hallucinated code, and I get stuck in a never-ending loop where the only solution is for me to dig into the code and fix the issue myself. So the LLM simply wastes my time in these cases.

        > The new Phoenix.new coding agent actively tests the web applications it is writing using a headless browser

        That's great, but can you trust that it will cover all real world usage scenarios, test edge cases and failure scenarios, and do so accurately? Tests are code as well, and they can have the same issues as application code.

        I'm sure that we can continue to make these tools more useful by working around these issues and using better adjacent tooling as mitigation. But the fundamental problem of hallucinations still needs to be solved. Mainly because it affects tasks other than code generation, where it's much more difficult to deal with.

        • simonw a day ago

          > Your proposed solution is to feed the error back to the LLM. Great, but can you guarantee that the proposed fix will also not contain hallucinations?

          You do it in a loop. Keep looping and fixing until the code runs.
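
          Roughly this shape, as a minimal sketch (real agents wire in their own runners and prompts; ask_llm_to_fix is a placeholder, not any particular tool's API):

              import subprocess

              def loop_until_it_runs(path: str, max_attempts: int = 5) -> bool:
                  for _ in range(max_attempts):
                      result = subprocess.run(["python", path], capture_output=True, text=True)
                      if result.returncode == 0:
                          return True  # it runs; a hallucinated API would have blown up by now
                      # Feed the traceback back and let the model try again (placeholder call)
                      ask_llm_to_fix(path, result.stderr)
                  return False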

          > but can you trust that it will cover all real world usage scenarios, test edge cases and failure scenarios, and do so accurately?

          Absolutely not. Most of my blog entry about why code hallucinations aren't as dangerous as other mistakes argues that this is the real problem humans need to solve when using LLMs to write code: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/...

          From the start of that article:

          > The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!

          • imiric 13 hours ago

            > You do it in a loop. Keep looping and fixing until the code runs.

            I suppose, but we shouldn't need to brute force our tools into working...

            And as you point out in that article, some of these issues won't be caught by the compiler or interpreter. Where we disagree is that I think most of these are introduced by the inherent problem of hallucination, not because the model is not large enough, or wasn't trained on the right data. I.e. I don't think this is something we can engineer our way out of, but that solving it will require changes at the architectural level.

            Yes, ultimately we still need existing software engineering practices to confirm that the output is correct, but in the age of "vibe coding", when people are deploying software that they've barely inspected or tested (many of whom don't even have the skills or experience to do so!), built by tools that can produce thousands of lines of code in an instant, all of those practices go out the window. This should scare all of us, since it will inevitably make the average quality of software go down.

            I reckon that the number of experienced programmers who will actually go through that effort is minuscule. Realistically, reviewing and testing code requires a great deal of effort, and is often not the fun part of the job. If these tools can't be relied on to help me with tasks I sometimes don't want to do, and if I have to babysit them at every step of the way, then how much more productive are they making me? There's a large disconnect between how they are being promoted and how they're actually used in the real world.

            Anyway, it's clear that we have different views on this topic, and we use LLMs very differently, but thanks for the discussion. I appreciate the work you're doing, and your content is always informative. Cheers!

      • HarHarVeryFunny a day ago

        I think the more general/useful definition of "hallucination" is any time the LLM predicts the next word based on the "least worst" (statistically) choice rather than on closely matching samples in the training data.

        The LLM has to generate some word each time it is called, and unless it recognizes soon enough that "I don't know" is the best answer (in and of itself problematic, since any such prediction would be based on the training data, not the LLM's own aggregate knowledge!), it may back itself into a corner where it has no well-grounded continuation, but nonetheless has to spit out the statistically best prediction, even if that is a very bad, ungrounded prediction such as a non-existent API, a "fits the profile" concocted answer, or anything else ...

        Of course the LLM's output builds on itself, so any ungrounded/hallucinated output doesn't need to be limited to a single word or API call, but may instead consist of a whole "just trying my best" sentence or chunk of code (better hope you have unit test code coverage to test/catch it).
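
        A toy illustration of the "has to spit something out" point, with made-up numbers: greedy decoding emits the argmax token even when the model's own distribution says nothing is well grounded, and "I don't know" is just another token competing on the same terms:

            import math

            logits = {"get_user": 1.2, "fetch_user": 1.1, "load_user": 1.0, "<unsure>": 0.2}
            total = sum(math.exp(v) for v in logits.values())
            probs = {tok: math.exp(v) / total for tok, v in logits.items()}

            best = max(probs, key=probs.get)
            print(best, round(probs[best], 2))  # "get_user" at ~0.32, emitted regardless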

    • lelele 2 days ago

      > The problem is that many hallucinations do not produce a runtime error [...]

      Don't hallucinations mean nonexistent things, that is, in the case of code: functions, classes, etc.? How could they fail to lead to a runtime error, then? The fact that LLMs can produce unreliable or inefficient code is a different problem, isn't it?

      • plausibilitious 2 days ago

        This argument is the reason why LLM output failing to match reality was labelled 'hallucination'. It makes it seem like the LLM only makes mistakes in a neatly verifiable manner.

        The 'jpeg of the internet' argument was more apt, I think. The output of LLMs might be congruent with reality and with how the prompt contents represent reality. But it might also not be, and in subtle ways too.

        If only all code that has any flaw in it would not run. That would be truly amazing. Alas, there are several orders of magnitude more sequences of commands that can be run than should be run.

      • imiric 2 days ago

        Hallucinations can manifest in different ways. Using nonexistent APIs is just one of them. The LLM could just as well hallucinate code that doesn't fix a problem, or hallucinate that a problem exists in the first place, all while using existing APIs. This might not be a major issue for tasks like programming, where humans can relatively easily verify the output, but in scientific fields this can be much more labor-intensive and practically infeasible to do, as this recent example[1] showcases. So hallucination is a problem that involves any fabricated output that isn't grounded in reality.

        Which isn't to say that it is a universal problem. In some applications such as image, video or audio generation, especially in entertainment industries, hallucinations can be desirable. They're partly what we identify as "creativity", and the results can be fun and interesting. But in applications where facts and reality matter, they're a big problem.

        [1]: https://news.ycombinator.com/item?id=44174965

        • HarHarVeryFunny a day ago

          You can test every line of code in your program, but how many people actually do?

          It's one thing if you are just creating a throwaway prototype, or something so simple that you will naturally exercise 100% of the code when testing it, but when you start building anything non-trivial it's easy to have many code paths/flows that are rarely executed or tested. Maybe you wrote unit tests for all the obvious corner cases, but did you consider the code's correctness when conditions A, then B, then C ... occur? Even 100% code coverage (every line of code tested) isn't going to help you there.
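
          A toy example of that gap: both tests below pass and give 100% line coverage, but the express-plus-international combination is never exercised (the "intended" pricing here is invented for illustration):

              def shipping_cost(express: bool, international: bool) -> float:
                  cost = 5.0
                  if express:
                      cost += 10.0
                  if international:
                      cost *= 2  # also doubles the express surcharge; only visible when both are true
                  return cost

              def test_express():
                  assert shipping_cost(express=True, international=False) == 15.0

              def test_international():
                  assert shipping_cost(express=False, international=True) == 10.0

              # Every line is executed, yet shipping_cost(True, True) == 30.0, which may
              # not be what the spec intended (say, 25.0); no test ever asks.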

          • simonw a day ago

            > You can test every line of code in your program, but how many people actually do?

            In my mind, that's what separates genuinely excellent professional programmers from everybody else.

            • HarHarVeryFunny a day ago

              I think it's perhaps more that you learn to write code that is easy to test and debug, consisting of some minimal set of simple orthogonal components, etc. You test every function of course, but learn to intuitively design out these combinatorial complexities that could nonetheless still be lurking, and pre-emptively include assertions to try to catch anything you may have overlooked.

  • fulafel 2 days ago

    Hopefully this will be a catalyst to bring language-integrated, machine-readable schema checking into wider use (à la Clojure), as static typing is crap for structured data/API stuff.
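
    Something along these lines in Python, with jsonschema as one possible stand-in for machine-readable schema checking (the schema itself is just an example):

        from jsonschema import ValidationError, validate

        user_schema = {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "email": {"type": "string"},
            },
            "required": ["id", "email"],
        }

        payload = {"id": "42", "email": "a@example.com"}  # id is a string, not an integer
        try:
            validate(instance=payload, schema=user_schema)
        except ValidationError as err:
            print(err.message)  # "'42' is not of type 'integer'"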

  • DarkNova6 a day ago

    "I still don't think hallucinations in generated code matter very much"

    Tell that to our Python developers who don't test anything outside of a narrow happy path.

    • simonw a day ago

      Yeah, if you use LLMs to write code and don't test outside the happy path you're vibe coding, you're not software engineering.

  • mucha a day ago

    Interesting. How do existing systems catch Task Requirement hallucinations?

    • simonw a day ago

      They don't. My comment was about "hallucinations in generated code".

th0ma5 2 days ago

I thought this was a very good read about many of the issues that are faced without having any ground truth to reason against. It is interesting how many different ways people have developed to work around missing information, and the marginal improvements they make in some benchmarks.

tmaly a day ago

I arrived at a similar conclusion through trial and error. Giving one-shot examples of code as context has given me the best results.

dzikibaz 2 days ago

How are "LLM hallucinations" different from a low-quality training dataset or randomly picked tokens due to overly random sampling settings?

  • citrin_ru 2 days ago

    What I see even in good models is that when you ask for something hard or impossible (but routine-looking), instead of replying “I cannot” they hallucinate. A better dataset would only help to solve problems which can be solved (based on that dataset).

nerdjon 2 days ago

My favorite is trying to use it to generate an IAM policy, where keys are just hallucinated based on expectations of what the keys would be called and are either wrong or they flat out don't exist if you are dealing with more advanced conditions.

  • mikeocool 2 days ago

    > keys are just hallucinated based on expectations of what the keys would be called and are either wrong or they flat out don't exist

    To be fair this is also how I generally try to write IAM configs without AI.

cryptica 2 days ago

I suspect hallucinations in LLMs are the result of contradictions in their training sets which were trained into them.

I suspect it's just like with humans. People who learn quickly and don't carefully curate their knowledge to resolve contradictions as they learn tend to make similar mistakes when it comes to subjects they did not invest much time in fully studying.

If I were an AI researcher, what I would try to do is find the highest-quality information possible concerning very few axiomatic topics, with as few contradictions as possible, then train it into the LLM until it can generate text and basic reasoning that is fully accurate... Then, once we have this basic but fully rational AI, start feeding it new data, but before giving it any piece of data to learn from, first ask the AI to indicate whether the new data contradicts any of its current knowledge. You only let it update its weights with the new data as-is if it does not contradict its existing knowledge. If it does contradict its existing knowledge, either discard it or maybe feed it the data with some synthetic preamble like "Some people believe that..." so that it's aware of the existence of this belief system but knows that it's not to be internalized as its own beliefs.

Or maybe there is a way to detect contradictions by looking at the weights themselves. You could roll back a round of training if the weights update in a way which suggests that a conflicting piece of information was learned in that round. Maybe a different ANN could look at the weights of the LLM during training, trained to detect contradictions and to decide when to roll back a round of training.

  • AdieuToLogic 2 days ago

    > I suspect hallucinations in LLMs are the result of contradictions in their training sets which were trained into them.

    A simpler explanation, and I posit a correct one, is that people anthropomorphize an algorithm by describing the results of a particular path within a statistical model used to generate tokens as "hallucinations" because they are unexpected by the person interpreting the text.

    > I suspect it's just like with humans.

    Therein lies the problem.

    • geon 2 days ago

      Yes. “Hallucinations” are not an edge case or an error mode, but part of the normal operation of the LLM.

dheera 2 days ago

On the other hand, I think it's cute when LLMs hallucinate a Python library that doesn't exist, because it probably means it's worth bringing into existence.