
Misconceptions of LLM Chatbots: For scientists and business professionals, it is critical to know the source of any AI-generated answer or assertion. If we cannot trace the sources accurately, we are unlikely to trust the output. Imagine reading a literature review where no sources were cited.
The technique used to provide as accurate a linkage as possible between source material and AI-generated answers is called Retrieval-Augmented Generation (RAG). This is what many organisations appear to be deploying.
Simplifying, this is essentially a ‘Google-like’ search on paragraphs of text rather than full documents, with the top n paragraphs (chunks) simply appended as context to your question (the LLM prompt). These paragraphs determine the usefulness of the answer you get; the LLM largely just summarises them.
The implication is that this is effectively an Information Retrieval (IR) driven technique; the extraction/summarisation happens at the end. It does not use the whole text corpus or information space, so it suffers the same issues raised with Google search: it only presents the top n results, i.e. whatever is statistically most similar (by embedding) to your prompt (question).
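To make the mechanics concrete, here is a minimal sketch of that retrieve-then-summarise pipeline. It assumes the open-source sentence-transformers library for embeddings; the model name and the `call_llm` function are placeholders of my own, not a description of any particular product.

```python
# A minimal sketch of the retrieve-then-summarise loop described above.
# Assumes the sentence-transformers library; `call_llm` is a placeholder
# standing in for whichever chat model is actually deployed.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_top_n(question: str, chunks: list[str], n: int = 5) -> list[str]:
    """Rank paragraph chunks by embedding similarity to the question."""
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    query_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec          # cosine similarity (unit-length vectors)
    top_idx = np.argsort(scores)[::-1][:n]   # only the top n chunks survive
    return [chunks[i] for i in top_idx]

def rag_answer(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(retrieve_top_n(question, chunks))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # the LLM largely summarises what the retriever handed it
```

Everything the answer can draw on has already been decided by the ranking step; the rest of the corpus never reaches the model.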
Just like Google, this lends itself to lookup/known-item questions, where there is a right answer, for example, “What is the age of the Bathonian?”. For exploratory search goals, this technique will likely, just like Google search, drive you to what is most obvious, for example, “What do we know about host rocks for copper in this area?”. Outliers will be missed and you may not be taken off the well-beaten path.
This can be an issue because we may think of “AI” in this sense as something special, more than it is: something that utilises the whole text corpus. It absolutely does not in this case. We are using the statistical similarity of words to come up with a ranked set of paragraphs which go into a summariser. The obvious will likely always out-compete the less obvious.
This may be very important to know so people are not ‘fooled’ by the highly convincing AI-generated answer. This is not a point about so-called ‘hallucinations’. You may have been given an answer which is true for your exploratory search question. The point is that you may not have been given the most useful and significant information held within the information space, because only a fraction of it is being used.
If we are interested in data-driven discovery for science and business opportunities, we need to look beyond limited ‘LLM-driven chatbot’ style techniques to include techniques which stack ‘everything’, including outliers, from our information space. Language Models and other Natural Language Processing techniques offer so much more than a Chatbot…
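By way of contrast, here is one simple, hedged illustration of working with the whole information space rather than a top-n slice: embed every chunk and surface the least typical ones by their distance from the corpus centroid. This is only a sketch under my own assumptions (same embedding model as above), not the specific techniques discussed in the paper linked below.

```python
# One illustrative way to "stack everything": score every chunk in the corpus,
# not just the n most similar to a question, and surface the least typical ones.
# A sketch under assumed tooling, not a definitive method.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def outlier_chunks(chunks: list[str], k: int = 10) -> list[str]:
    """Return the k chunks furthest from the corpus centroid, i.e. the least 'obvious'."""
    vecs = model.encode(chunks, normalize_embeddings=True)
    centroid = vecs.mean(axis=0)
    distances = np.linalg.norm(vecs - centroid, axis=1)  # atypicality score per chunk
    return [chunks[i] for i in np.argsort(distances)[::-1][:k]]
```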
I touch on some of these points in an ethics paper published late last year; link below.
https://www.journalofgeoethics.eu/index.php/jgsg/article/view/63