Nov 20, 2024

Like seemingly everyone else, we set out to build something useful with the latest open source artificial intelligence (AI) and large language models (LLMs), and to prove that we could deliver a tool that stands up to scrutiny. Along the way, the world did not stand still, of course. With every passing month the discussion around AI and LLMs heats up, and new companies and tools keep sprouting up from nowhere. Opinions and doomsday warnings fly around aplenty, too. It can be difficult to keep track of everything, but perhaps that is unnecessary. Here is our account of what we built and what we learned.

A couple of foundations

LLMs made simple

When one peels away all the layers, one implementation pattern seems to gain a lot of traction: retrieval-augmented generation (RAG). Looking in from the outside, RAG is surprisingly reminiscent of the web applications we routinely deploy.

[Illustration: a woman sitting cross-legged with a laptop, in the middle of a diagram about how LLMs work]

Granted, there are new and exciting technologies in there, but at the end of the day, it is still an application layer backed by a database, and we can package that technology into containers and host it somewhere. In terms of where the data resides (data sovereignty) and whether we have it under control (data privacy), there is nothing truly new, provided we use the actual tools and not nebulous cloud abstractions.
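
To make the analogy concrete, the request/response cycle of such an application can be sketched in a few lines of Python. The choice of Chroma as the vector store and a local Ollama endpoint serving a Llama model is an illustrative assumption, not necessarily the exact stack we deployed, and the documents and URLs below are placeholders:

```python
# A minimal sketch of the RAG request/response cycle: an application layer
# backed by a vector database, plus a locally hosted open source LLM.
import chromadb
import requests

client = chromadb.Client()
collection = client.get_or_create_collection("devportal_docs")

# Ingest preprocessed documentation chunks (in reality, a loop over all pages).
collection.add(
    ids=["dpf-operators", "dpf-workflows"],
    documents=[
        "DPF operators encapsulate a single data processing step...",
        "Workflows chain DPF operators into a processing pipeline...",
    ],
    metadatas=[
        {"url": "https://example.com/dpf/operators"},   # placeholder URLs
        {"url": "https://example.com/dpf/workflows"},
    ],
)

def answer(question: str) -> str:
    # 1. Retrieve the chunks most relevant to the question.
    hits = collection.query(query_texts=[question], n_results=2)
    context = "\n\n".join(hits["documents"][0])
    sources = [m["url"] for m in hits["metadatas"][0]]

    # 2. Ask the LLM to answer from the retrieved context only.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",  # assumed local Ollama endpoint
        json={"model": "llama3", "prompt": prompt, "stream": False},
    )
    # 3. Return the answer together with the source links.
    return resp.json()["response"] + "\n\nSources: " + "\n".join(sources)
```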

In our proof of concept implementation, we stuck very close to the metal, in a manner of speaking. The ingredients of open source LLMs, open source tooling, and open source hosting under Lagoon gave us complete control every step of the way. It allowed us to trust that we built a solid and well-understood LLM-driven chatbot that adheres to the privacy goals we had, follows the guiding prompts we set, and delivers responses that are well-reasoned and, most of all, useful.

Data quality matters

In computer science there is the notion of “garbage in, garbage out”: if the input data is of inferior quality, then the output of a tool will also be inferior. Nowhere is this more apparent than in artificial intelligence, hence we think of AI as the mother of all garbage-in, garbage-out problems. The incredible power of AI tools magnifies the potential for bad results if we do not pay attention to the quality of the input. We can certainly produce a lot of prettily packaged trash bags, but they won’t do anyone any good.

[Illustration: a hand holds a garbage bag over a machine that produces garbage. Source: Medium.com]

Lessons learned

The importance of preprocessing

Extremely careful and deliberate preprocessing and data enrichment were necessary to ensure valuable responses from the LLM. We had set ourselves a high bar with the chosen source datasets and knew that they offered little inherent context. We also learned that the prevalence of code samples and technical jargon further reduced the LLM’s ability to infer meaning and connections.

But that was the whole point of our exercise. Challenges are there to be overcome!

In a nutshell, three steps were key to our success:

  1. Obtaining source data in as clean a form as possible was essential. Switching to page scraping and converting to Markdown was key here, replacing the earlier approach of feeding data from the devportal via a JSON export full of extraneous and ultimately useless key/value pairs derived from the CMS field storage.
  2. Chunking in a semantically meaningful way was crucial to retain meaning and context. Ensuring that chunks were delimited by headings, and that those headings were as high-order as possible, was essential (see the sketches after this list).
  3. Data enrichment via question/answer pairs rounded out the pipeline. Conquering low context with what amounts to a double or even triple inference step was of great importance for establishing more context and connections, especially across separate pages.
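
To make step 2 more concrete, here is a rough sketch of heading-delimited chunking of a Markdown page. The function name and the choice to split on the two highest heading levels are illustrative, not our exact production pipeline:

```python
# Split a Markdown page into chunks delimited by its highest-order headings,
# so each chunk keeps its heading as context.
import re

def chunk_by_headings(markdown: str, max_level: int = 2) -> list[dict]:
    """Return chunks of the form {"heading": ..., "text": ...}."""
    heading = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$", re.MULTILINE)
    chunks, last_pos, last_heading = [], 0, "(page start)"
    for match in heading.finditer(markdown):
        body = markdown[last_pos:match.start()].strip()
        if body:
            chunks.append({"heading": last_heading, "text": body})
        last_heading, last_pos = match.group(2), match.end()
    tail = markdown[last_pos:].strip()
    if tail:
        chunks.append({"heading": last_heading, "text": tail})
    return chunks
```

Step 3 can be sketched in the same spirit: for every chunk, the LLM is asked which questions the chunk answers, and those questions are indexed alongside the original text so retrieval can match them. The prompt wording and the Ollama endpoint below are assumptions for illustration:

```python
# Generate question/answer enrichment for a chunk via an extra inference step.
import requests

def enrich_with_questions(chunk_text: str) -> list[str]:
    """Ask the LLM which questions this chunk can answer (illustrative prompt)."""
    prompt = (
        "Read the documentation excerpt below and write three questions "
        "a developer could answer with it, one per line.\n\n" + chunk_text
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",  # assumed local Ollama endpoint
        json={"model": "llama3", "prompt": prompt, "stream": False},
    )
    # These questions are stored next to the chunk in the vector store.
    return [q.strip() for q in resp.json()["response"].splitlines() if q.strip()]
```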

The importance of limited scope

Another key ingredient for success is limiting the scope of the source data set and setting boundaries via prompt engineering. Our chatbot doesn’t need to know everything and anything under the sun. In fact, the guiding prompts prohibit the LLM from using its generic training and task it with answering only questions that relate to the Data Processing Framework. It is interesting to explore the boundaries.
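
For illustration, a scope-limiting guiding prompt of this kind might look roughly like the following; the wording is an assumption, not the exact prompt we deployed:

```python
# A hedged illustration of a guiding (system) prompt that restricts scope.
GUIDING_PROMPT = (
    "You are a documentation assistant for the Data Processing Framework (DPF). "
    "Answer only from the context provided with each question; do not rely on "
    "your generic training data. If a question is not related to the technical "
    "documentation provided, say so politely and invite the user to ask about "
    "the documentation instead."
)
```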

Here is an edge case inspired by Adam, a data scientist who had no clue about DPF itself. This is the interaction we had with the chatbot:

Q: How can I add NLP to DPF?

A: "To add NLP (Natural Language Processing) capabilities to DPF (Data Processing Framework), you would typically need to create custom operators within the DPF framework that handle NLP tasks. This would involve implementing the necessary logic and algorithms for NLP tasks, such as text parsing, sentiment analysis, language translation, etc., as part of your custom operators.

Additionally, you may need to integrate external NLP libraries or tools into your DPF project to leverage pre-existing NLP functionalities efficiently. By creating custom operators in DPF that incorporate NLP capabilities, you can enhance the data processing workflows to handle textual data alongside the existing numerical simulation data.

For more detailed guidance on integrating NLP capabilities into DPF and creating custom operators, refer to the Using DPF capabilities in an existing project section for instructions on adding custom functionalities to DPF."

Sources:

This is very interesting! We still get a helpful answer, and the Llama3 model partially uses its generic training. Even though the question steps a little outside the set boundaries, the chatbot avoids outright violating the guiding prompts and provides an acceptable answer.

Here is a similar helpful example that combines source data with generic training:

Q: Please provide code to integrate DPF with Slack.

A: "I'm unable to find specific instructions or code snippets for integrating DPF (Data Processing Framework) with Slack within the technical documentation provided. It seems that the documentation primarily focuses on data processing workflows and does not cover integrations with communication platforms like Slack.

For integrating DPF with Slack, you may need to explore Slack API documentation and libraries that allow interaction with Slack programmatically. Typically, you would need to implement functionalities within your DPF project that can send and receive messages from Slack using the Slack API.

I recommend reviewing the Slack API documentation and exploring resources related to integrating Slack with C++ applications to get started on integrating DPF with Slack successfully.

If you have any other questions or need further assistance, feel free to ask!"

Sources:

However, stepping completely outside the boundaries, we get a blunt answer:

Q: How do I cook broccoli?

A: "I am sorry, I cannot provide information on cooking broccoli as it is not related to the technical documentation provided. If you have any questions regarding the content within the technical documentation, feel free to ask!"

Avoiding hallucinations

LLMs are known to hallucinate (invent answers out of thin air, often when they do not have enough context to answer the question that was posed). While we cannot guarantee that our chatbot does not hallucinate, its limited scope and the prohibition of stepping outside the boundaries certainly help to limit embarrassing results. In practice, we have not observed clearly hallucinated responses, although truth be told, we haven’t tried hard to mislead the bot.

We are confident that it is possible to lead the chatbot with consecutive carefully crafted prompts toward completely nonsensical output and force it to violate its built-in prompts. Pretty much every large language model to date does this, as documented in the linked Wikipedia article above.

However, our typical user persona is that of a software developer with considerable and specific training. It is reasonable to expect that they bring a good amount of critical thinking when evaluating responses. As our friend Matt Kleve showed, employing AI to understand and learn a new development area is not about getting ready-to-run code from ChatGPT, especially not about obtaining complete software modules. It is about efficiently getting started and speeding up the construction of necessary boilerplate code. Being an experienced developer, Matt quickly spotted issues in the ChatGPT-provided code and moved ahead without much difficulty. 

Moreover, our chatbot always provides one or more source links, which allows a developer to fact-check any answer and ensures that even partially correct answers provide a stepping stone toward a solution. In our opinion, source links are a must for any LLM system that intends to engender trust.

A completely private LLM-driven chatbot is competitive

As outlined above, building a competitive AI chatbot using open source tools and private hosting is absolutely feasible, regardless of source data complexity. 

Our competitive chatbot can:

  • Answer in clear English (or another language).
  • Provide technically correct answers.
  • Provide information of substantially similar quality to the source material.
  • Provide answers that are as good as or better than those from the major LLMs.

In addition, our chatbot can infer answers across several source pages, thanks to the data enrichment steps. This makes the chatbot objectively superior to traditional search engines because it can save a developer one or several steps in their journey to grasp and master the documentation and create code that uses the documented interfaces.

While the chatbot we built is barely more than a proof of concept, we feel it is quite ready to be battle-tested. Working with subject matter experts, it is relatively straightforward to detect problem areas, build more guiding prompts, or run additional data enrichment steps to steer the model to more useful, richer, and clearer responses.

In addition, while this chatbot deals with developer documentation and code, there is no reason why a similar chatbot cannot serve up answers for very different content, be that a content-rich government site, a library, statistical source data gained from research and engineering, or any other content body.

We are really looking forward to pushing the boundaries and exploring what is possible together with you!


Would you like to build your own LLM-driven chatbot? Please get in touch.

Are you interested in a demonstration of the chatbot's capabilities? Justin Winter (AI Program Lead at amazee.io) and Christoph Weber (Solutions Architect at Pronovix) hosted a webinar about developing and implementing a private LLM chatbot designed to enhance developer documentation.


All Pronovix publications are the fruit of a team effort, enabled by the research and collective knowledge of the entire Pronovix team. Our ideas and experiences are greatly shaped by our clients and the communities we participate in.

Christoph is a creative and versatile technical leader who can present complex subjects in plain English. He has extensive experience managing demanding computing projects and partnering with stakeholders of all stripes to optimize solutions. He is also a regular speaker at technical events, and in his spare time builds furniture that aligns with his penchant for simplicity.
