The realm of AI/ML, especially Generative AI, has garnered significant attention worldwide following the emergence of ChatGPT. Consequently, there has been a surge of interest in developing various models and tools within this domain.
In this article, we will look at how to interact with AI models using Java. But before that, let us go over what an “AI model” is and the terms and concepts related to it.
AI/ML Primer
Artificial intelligence (AI) models are computational algorithms crafted to process and produce information, often emulating human cognitive abilities. By assimilating patterns and insights from extensive datasets, these models have the capacity to generate predictions, text, images, or other forms of output, thereby augmenting a multitude of applications spanning diverse industries.
Numerous AI models exist, each tailored to serve specific purposes. While ChatGPT has garnered attention for its text input and output capabilities, other models and companies support a range of inputs and outputs to cater to diverse needs, including images, audio, video, etc.
What distinguishes models such as GPT (Generative Pre-trained Transformer) is that pre-training turns AI into a versatile developer tool, eliminating the need for a deep understanding of machine learning or model training.
LLM – Large Language Models
LLM (Large Language Model) refers to a type of AI model designed to understand and generate human-like text at a high level of proficiency. LLMs are trained on vast amounts of text data and are capable of performing a wide range of natural language processing tasks, including text generation, translation, summarization, question answering, and more. Examples of LLMs include GPT (Generative Pre-trained Transformer) models such as GPT-3, BERT (Bidirectional Encoder Representations from Transformers), and others. These models have demonstrated impressive capabilities in understanding and generating text, leading to their widespread use in various applications, including chatbots, virtual assistants, content creation tools, and more.
Integrating LLMs into applications requires access to LLM providers like OpenAI, Google Vertex AI, Azure OpenAI, etc., or software like Ollama, LM Studio, LocalAI, etc., that allows LLMs to be run locally. We will look at running LLMs locally later in this article.
Let’s look at a few more terms and concepts before we get into the code for integrating with LLMs.
Tokens
In the context of Large Language Models (LLMs), tokens refer to the basic units of text that the model processes. These tokens can represent individual words, subwords, or even characters, depending on how the model is trained and configured.
When a piece of text is input into an LLM, it is typically tokenized into smaller units before being processed by the model. Each token corresponds to a specific unit of text, and the model generates output based on the patterns and relationships it learns from the input tokens.
Tokenization is a crucial step in the operation of LLMs, as it allows the model to break down complex text data into manageable units for processing. By tokenizing text, LLMs can analyze and generate responses with a granular level of detail, enabling them to understand and generate human-like text.
Tokenization can vary based on the specific tokenization scheme used and the vocabulary size of the model.
In some tokenization schemes, a single word may be split into multiple tokens, especially if it contains complex morphology or is not present in the model’s vocabulary. For example:
- Word: “university”
- Tokens: [“uni”, “vers”, “ity”]
- Explanation: In this example, the word “university” is split into three tokens: “uni”, “vers”, and “ity”. This decomposition allows the model to capture the morphological structure of the word.
Conversely, multiple consecutive words may be combined into a single token, particularly in subword tokenization schemes like Byte Pair Encoding (BPE) or WordPiece. For example:
- Phrase: “natural language processing”
- Token: “natural_language_processing”
- Explanation: In this example, the phrase “natural language processing” is combined into a single token “natural_language_processing”. This allows the model to treat the entire phrase as a single unit during processing, which can be beneficial for capturing multi-word expressions or domain-specific terminology.
The examples provided above are for illustration purposes only and need not represent how the text is actually processed by an LLM.
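To make this concrete, here is a small, purely illustrative Java sketch of greedy subword matching in the spirit of BPE/WordPiece. The tiny vocabulary is made up for this example, and real LLM tokenizers use learned vocabularies with tens of thousands of entries, so treat this only as an intuition aid.
import java.util.ArrayList;
import java.util.List;
class ToySubwordTokenizer {
    // A made-up vocabulary purely for illustration; real tokenizers learn theirs from data
    private static final List<String> VOCAB =
            List.of("uni", "vers", "ity", "natural_language_processing");
    // Greedily match the longest vocabulary entry at each position,
    // falling back to single characters for unknown pieces
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = null;
            for (String piece : VOCAB) {
                if (text.startsWith(piece, i)
                        && (match == null || piece.length() > match.length())) {
                    match = piece;
                }
            }
            if (match == null) {
                match = String.valueOf(text.charAt(i)); // unknown character becomes its own token
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }
    public static void main(String[] args) {
        System.out.println(tokenize("university"));                  // [uni, vers, ity]
        System.out.println(tokenize("natural_language_processing")); // [natural_language_processing]
    }
}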
Prompts and Prompt Templates
Prompts
Prompts lay the groundwork for language-based inputs, directing an AI model towards generating particular outputs. While those acquainted with ChatGPT might view prompts as mere textual inputs submitted through a dialog box to the API, their significance extends beyond this. In numerous AI models, the prompt text transcends a mere string, encompassing broader contextual elements. As we saw in the previous section on tokens, how tokens are processed varies based on the context and the tokenization scheme.
Developing compelling prompts is a blend of artistic creativity and scientific precision. The significance of this interaction method has led to the emergence of “Prompt Engineering” as a distinct discipline. A plethora of techniques aimed at enhancing prompt effectiveness are continually evolving. Dedication to refining a prompt can markedly enhance the resultant output.
Prompt Templates
Prompt templates serve as structured guides for crafting effective prompts, helping users communicate their intentions clearly and succinctly to AI models.
Prompt templates can vary depending on the specific use case or application domain. They may include placeholders for variables or user inputs, guiding users to provide contextually relevant information. By following a prompt template, users can ensure consistency and clarity in their prompts, which in turn improves the performance and relevance of the AI model’s responses.
For example, a prompt template for a chatbot might include placeholders for the user’s inquiry, desired action, and any relevant context or constraints. By filling in these placeholders with specific details, users can create well-formed prompts that elicit accurate and useful responses from the chatbot. Following is a sample chatbot prompt template:
Planning to book a [service]?
Let me know your preferred date and time and
I'll assist you with the booking process.
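As a reference point, Langchain4j (which we use later in this article) ships a PromptTemplate class that captures this idea. The snippet below is a minimal sketch based on my reading of the 0.28.x API, where placeholders are written as {{variable}}; exact class and method names may differ between versions, and the template text itself is just an example.
import java.util.Map;
import dev.langchain4j.model.input.Prompt;
import dev.langchain4j.model.input.PromptTemplate;
class PromptTemplateExample {
    public static void main(String[] args) {
        // Placeholders are written as {{variable}} and filled in from a map of values
        PromptTemplate template = PromptTemplate.from(
                "Planning to book a {{service}}? Preferred date: {{date}}, time: {{time}}. " +
                "Please assist with the booking process.");
        Prompt prompt = template.apply(Map.of(
                "service", "table for two",
                "date", "2024-05-01",
                "time", "19:00"));
        System.out.println(prompt.text());
    }
}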
Enhancing/Updating the Data to the AI Model
The GPT-3.5/4.0 training dataset extends only until September 2021, which is an obvious limitation when up-to-date data is needed. Consequently, the model says that it does not know the answer to questions that require knowledge beyond that date. Such training datasets can range from a few hundred gigabytes to a few petabytes in size.
In order to incorporate additional data into the model, the following techniques are used:
- Fine Tuning: a conventional machine learning technique that adjusts the model’s parameters and alters its internal weighting. This process is extremely resource-intensive for large models like GPT, and certain models may not offer this capability at all.
- Retrieval Augmented Generation (RAG): RAG, also referred to as “Prompt Stuffing”, offers a pragmatic approach. In this method, the system extracts unstructured data from documents, processes it, and stores it in a vector database such as Chroma, Pinecone, Milvus, Qdrant, and others. During retrieval, when an AI model is tasked with answering a user’s query, the question along with all “similar” document fragments retrieved from the vector database is incorporated into the prompt forwarded to the AI model (see the sketch after this list).
- Function Calling: This mechanism facilitates the registration of custom user functions, linking large language models with external system APIs. These systems enable LLMs to access real-time data and execute data processing tasks on their behalf.
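To illustrate the “Prompt Stuffing” idea behind RAG, here is a deliberately simplified Java sketch that skips the embedding and similarity-search steps entirely and hard-codes the retrieved fragments. A real pipeline would embed the user’s question, query a vector database such as Chroma or Pinecone, and stuff only the top-scoring fragments into the prompt.
import java.util.List;
class PromptStuffingSketch {
    public static void main(String[] args) {
        String question = "What is the refund policy for damaged items?";
        // In a real RAG pipeline these fragments would come from a similarity search
        // against a vector database; they are hard-coded here purely for illustration
        List<String> retrievedFragments = List.of(
                "Refunds for damaged items are processed within 7 business days.",
                "Customers must report damage within 48 hours of delivery.");
        StringBuilder prompt = new StringBuilder(
                "Answer the question using only the information below.\n\n");
        for (String fragment : retrievedFragments) {
            prompt.append("- ").append(fragment).append("\n");
        }
        prompt.append("\nQuestion: ").append(question);
        // The assembled prompt would then be sent to the LLM,
        // for example via ChatLanguageModel.generate(prompt.toString())
        System.out.println(prompt);
    }
}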
Integrating LLMs into applications
Now let’s dive into the coding aspect of integrating LLMs into applications. The following are the prerequisites:
- Ollama: Ollama is a lightweight, extensible framework for building and running language models on your local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications. Download and install the appropriate binary for your OS.
- Langchain4j: LangChain4j is a Java library designed to simplify integrating AI and large language models (LLMs) into Java applications. It offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with a different LLM or embedding store, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports over 10 popular LLM providers and more than 15 embedding stores.
- JBang: JBang is a neat little tool that enables running Java code as a script. It directly runs the Java source file and saves the effort of setting up or configuring the project for Maven, Gradle or any other build system. It also manages dependencies on external libraries, declared in comments in the source itself, as we’ll see in the following code. You can also read about JBang in our previous article.
- First, download the Ollama binary and install it. Alternatively, you can install LM Studio, which also allows running LLM models locally. However, in this article, we will use Ollama.
- Next, download and run the Ollama LLM model. Executing the following command in the shell downloads and runs the LLM:
ollama run mistral
You can run any other model, such as llama2, phi, etc., as well. However, note that Ollama will download the required model, which will be a few gigabytes in size.
- Download and install JBang. When executing the code, JBang expects the Java binary to be in the PATH; if not, JBang will download the necessary JDK as well.
- Type the following code in your editor and save it as OllamaMistralExample.java
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.io.Console;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;
import dev.langchain4j.model.output.Response;
class OllamaMistralExample {
private static final String MODEL = "mistral";
private static final String BASE_URL = "http://localhost:11434";
private static Duration timeout = Duration.ofSeconds(120);
public static void main(String[] args) {
Console console = System.console();
String model = console.readLine(
"Welcome, Butler at your service!!\n\nPlease choose your model - Type '1' for the Basic Model and '2' for Streaming Model:");
String question = console.readLine("\n\nPlease enter your question - 'exit' to quit: ");
while (!"exit".equalsIgnoreCase(question)) {
if ("1".equals(model)) {
basicModel(question);
} else {
streamingModel(question);
}
question = console.readLine("\n\nPlease enter your question - 'exit' to quit: ");
}
}
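    // Basic mode: waits for the LLM to generate the complete response before printing it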
static void basicModel(String question) {
ChatLanguageModel model = OllamaChatModel.builder()
.baseUrl(BASE_URL)
.modelName(MODEL)
.timeout(timeout)
.build();
System.out.println("\n\nPlease wait...\n\n");
String answer = model.generate(question);
System.out.println(answer);
}
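    // Streaming mode: prints each token to the console as soon as the LLM generates it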
static void streamingModel(String question) {
StreamingChatLanguageModel model = OllamaStreamingChatModel.builder()
.baseUrl(BASE_URL)
.modelName(MODEL)
.timeout(timeout)
.temperature(0.0)
.build();
CompletableFuture<Response<AiMessage>> futureResponse = new CompletableFuture<>();
model.generate(question, new StreamingResponseHandler<AiMessage>() {
@Override
public void onNext(String token) {
System.out.print(token);
}
@Override
public void onComplete(Response<AiMessage> response) {
futureResponse.complete(response);
}
@Override
public void onError(Throwable error) {
futureResponse.completeExceptionally(error);
}
});
futureResponse.join();
}
}
- Now type the following command to run the program; we will see the explanation for this shortly.
jbang OllamaMistralExample.java
JBang automatically downloads the dependencies and runs this Java file.
Now let’s get into the code.
The comment lines at the top of the file are processed by JBang. The //JAVA comment line indicates the target JDK version, and the lines that start with //DEPS define the library dependencies. Here we define the Langchain4j libraries (core + ollama) that JBang downloads and processes. For further details about the JBang comment lines, please visit the JBang website.
The OllamaMistralExample class defines two methods apart from the main method: basicModel and streamingModel. The quick difference between them is that basicModel waits for the LLM to generate the full response and send it back; the user has to wait until the LLM completes the generation. LLMs generate one token at a time, so LLM providers offer a way to stream the tokens as soon as they are generated, which significantly improves the user experience because the user can start reading the response almost immediately rather than waiting for the entire response. The streamingModel method harnesses this streaming capability and starts to output the response as soon as it receives it from the LLM provider.
Langchain4j provides APIs for both the standard response and the streaming response. The ChatLanguageModel interface is for getting the standard response, and the StreamingChatLanguageModel interface is for the streaming response. Both interfaces provide similar methods; however, the StreamingChatLanguageModel requires a StreamingResponseHandler implementation to be passed as an argument. The StreamingResponseHandler interface specifies the following methods:
public interface StreamingResponseHandler<T> {
void onNext(String token);
default void onComplete(Response<T> response) {}
void onError(Throwable error);
}
- onNext gets called each time the LLM generates a token and sends it back.
- onComplete is a default method that does nothing; however, it can be overridden to deal with the complete response that gets delivered once the LLM has finished generating the response.
- onError is invoked when there is an error generating the response.
The basicModel method uses OllamaChatModel.builder() to build the class implementing the ChatLanguageModel interface, and the streamingModel method uses OllamaStreamingChatModel.builder() to build the class implementing the StreamingChatLanguageModel interface.
For both interface types, standard and streaming, the following fields need to be passed to the builders:
- Base URL: http://localhost:11434, the URL and port where Ollama exposes the LLM service.
- Model Name: mistral in this example.
- Timeout: The timeout is optional; however, it is safe to set it in a local environment because LLMs can be slow to generate a response due to resource constraints like no GPU, limited memory, etc.
Both the ChatLanguageModel and StreamingChatLanguageModel interfaces provide a similar generate method; however, as mentioned above, the StreamingChatLanguageModel’s generate method expects an additional argument, which is the implementation of the StreamingResponseHandler interface.
Try running the code above and enter the world of AI/ML using LLMs. What we have seen above is just the beginning. There’s a lot more to explore in this space, especially what Langchain4j offers: AiServices, Structured Data Extraction, Chains, Embedding, RAG, Function Calling and more.
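As a small teaser, the following sketch shows roughly what an AiServices-based assistant looks like, based on my understanding of the Langchain4j API: you declare a plain Java interface and Langchain4j generates an implementation backed by the chat model. The interface and method names here are arbitrary choices for illustration, and details may vary between versions.
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.service.AiServices;
class AiServicesTeaser {
    // A plain Java interface; Langchain4j generates an implementation backed by the chat model
    interface Assistant {
        String chat(String userMessage);
    }
    public static void main(String[] args) {
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("mistral")
                .build();
        Assistant assistant = AiServices.create(Assistant.class, model);
        System.out.println(assistant.chat("Tell me a joke about Java developers"));
    }
}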
Apart from Langchain4j, Spring AI also provides similar support for integrating AI/ML into Java applications. We’ll explore that in upcoming articles.
Happy Coding!