Exploring the Integration of Different LLMs into Your Workflow
As the landscape of language models continues to evolve, experimenting with different large language models (LLMs) can show you how each one performs on your tasks. This guide walks through the practical steps of swapping another LLM into an evaluation built with the vitals R package: creating a new chat object, re-running tasks across models, and comparing the results.
Setting Up a New Chat Object
To begin utilizing a different LLM, such as Google Gemini 3 Flash Preview, you need to create a new chat object. This initial setup is essential as it lays the groundwork for task evaluations. The code snippet below demonstrates how to establish this connection within R:
```r
library(ellmer)

my_chat_gemini <- chat_google_gemini(model = "gemini-3-flash-preview")
```
Once you have your new chat object, you can proceed to run various tasks with it.
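Before wiring the new chat into a task, it can help to confirm the connection works. ellmer chat objects expose a `chat()` method that sends a single turn; this sketch assumes your Gemini API key is already configured:

```r
# One-off request to confirm the chat object is wired up correctly.
# Requires a valid API key for the provider.
my_chat_gemini$chat("Reply with the single word: ready")
```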
Options for Running Tasks
To run the same task against a different model, you have three options:
1. **Clone an existing task.** Duplicate the task and set the new chat as its solver. This keeps the original task structure while applying the new model.

   ```r
   my_task_gemini <- my_task$clone()
   my_task_gemini$set_solver(generate(my_chat_gemini))
   my_task_gemini$eval(epochs = 3)
   ```
2. **Set the solver during evaluation.** Alternatively, clone the task and pass the new chat as `solver_chat` in the evaluation step.

   ```r
   my_task_gemini <- my_task$clone()
   my_task_gemini$eval(epochs = 3, solver_chat = my_chat_gemini)
   ```
3. **Create a new task from scratch.** This is the most flexible option: give the new task its own name and configure its dataset, solver, and scorer directly.

   ```r
   my_task_gemini <- Task$new(
     dataset = my_dataset,
     solver = generate(my_chat_gemini),
     scorer = model_graded_qa(
       partial_credit = FALSE,
       scorer_chat = ellmer::chat_anthropic(model = "claude-opus-4-6")
     ),
     name = "Gemini flash 3 preview"
   )

   my_task_gemini$eval(epochs = 3)
   ```
Managing API Keys
Before running evaluations, make sure an API key is set for each provider you intend to test; hosted models will fail to authenticate without one. Local LLMs, such as those served by Ollama, typically do not require a key.
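ellmer typically reads provider credentials from environment variables. A minimal sketch, assuming the conventional variable names for Google and Anthropic (check each provider's documentation for the exact name):

```r
# Placeholder values; in practice, set these in ~/.Renviron rather
# than in scripts, so keys never land in your code or history.
Sys.setenv(GOOGLE_API_KEY = "your-gemini-key")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-key")
```

`usethis::edit_r_environ()` opens your `.Renviron` file if you prefer to store keys there.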
Viewing Results Across Multiple Task Runs
Once you’ve executed tasks with various models, you can compare their results. The vitals_bind() function combines results from multiple tasks into one data frame:

```r
both_tasks <- vitals_bind(
  gpt5_nano = my_task,
  gemini_3_flash = my_task_gemini
)
```
The result is an R data frame with a column identifying each task, the score for each sample, and a metadata list-column holding the inputs and outputs of each model.
Flattening Metadata for Analysis
For a more straightforward analysis of the combined results, un-nest the metadata list-column into regular columns:

```r
library(tidyr)

both_tasks_wide <- both_tasks |>
  unnest_longer(metadata) |>
  unnest_wider(metadata)
```
With this structured data, you can now easily generate visual representations of your results.
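As a quick first look, you can tally scores by model. A minimal sketch, assuming the `task` and `score` columns produced by vitals_bind():

```r
library(dplyr)

# Count how often each score occurs for each model's task run.
both_tasks_wide |>
  count(task, score)
```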
Running Visualizations
With your results organized into a digestible format, you can create visualizations to compare model outputs. Using libraries like dplyr and ggplot2, you can filter the results and re-run the code each model produced.
```r
library(dplyr)

# Filter for bar chart results
barchart_results <- both_tasks_wide |>
  filter(id == "barchart")

for (i in seq_len(nrow(barchart_results))) {
  code_to_run <- extract_code(barchart_results$result[i])
  score <- as.character(barchart_results$score[i])
  task_name <- barchart_results$task[i]
  epoch <- barchart_results$epoch[i]

  cat("\n", strrep("=", 60), "\n")
  cat("Task:", task_name, "| Epoch:", epoch, "| Score:", score, "\n")
  cat(strrep("=", 60), "\n\n")

  tryCatch(
    {
      plot_obj <- eval(parse(text = code_to_run))
      print(plot_obj)
      Sys.sleep(3)
    },
    error = function(e) {
      cat("Error running code:", e$message, "\n")
      Sys.sleep(3)
    }
  )
}

cat("\nFinished displaying all", nrow(barchart_results), "bar charts.\n")
```
This script systematically cycles through each bar chart result, displaying the relevant information and visualizations.
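Beyond eyeballing individual plots, a summary chart can compare models directly. A minimal sketch, assuming Inspect-style score labels in which "C" marks a correct answer (adjust the comparison if your scorer grades differently):

```r
library(dplyr)
library(ggplot2)

# Share of correct answers per model; assumes the "C" (correct)
# score label. Adjust if your scorer uses different labels.
both_tasks_wide |>
  group_by(task) |>
  summarise(prop_correct = mean(score == "C")) |>
  ggplot(aes(x = task, y = prop_correct)) +
  geom_col() +
  labs(x = NULL, y = "Proportion correct")
```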
Testing Local LLMs
Local LLMs are appealing when data privacy is a concern, though the models you can run are constrained by your hardware. The vitals package, in conjunction with Ollama, enables seamless testing of various local LLMs.
To use local models, install and run the Ollama application. In a terminal, `ollama pull <model-name>` downloads a model and `ollama run <model-name>` starts it. For example:

```shell
ollama pull ministral-3:14b
```
With the rollama R package, you can also download and manage local models directly within your R environment:
```r
rollama::pull_model("ministral-3:14b")
```
Creating Tasks with Local Models
The flexibility of creating tasks with local models mirrors that of the cloud-based models. You can set up similar evaluation tasks with each local model, ensuring standardized tests across varied platforms.
```r
ministral_chat <- chat_ollama(model = "ministral-3:14b")

ollama_task <- Task$new(
  dataset = my_dataset,
  solver = generate(ministral_chat),
  scorer = model_graded_qa(
    scorer_chat = ellmer::chat_anthropic(model = "claude-opus-4-6")
  )
)

ollama_task$eval(epochs = 5)
```
By cloning and modifying the task for different models, you can quickly compare performance metrics.
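For instance, cloning the task and swapping the solver chat re-runs the same evaluation against another local model. The model name below is a hypothetical placeholder; substitute any model you have pulled:

```r
# Hypothetical second local model; substitute one you have pulled.
second_chat <- chat_ollama(model = "qwen3:8b")

second_task <- ollama_task$clone()
second_task$eval(epochs = 5, solver_chat = second_chat)
```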
Extracting Structured Data from Text
One noteworthy feature of the vitals package is support for evaluating structured data extraction from unstructured text. The generate_structured() solver prompts the model to return specific elements, such as topics, speaker names, and dates, in a fixed structure.
First, define the dataset for extraction, followed by constructing a type object that outlines the data structure you wish to retrieve:
```r
extract_dataset <- data.frame(…)

my_object <- type_object(
  workshop_topic = type_string(),
  speaker_name = type_string(),
  current_speaker_affiliation = type_string(),
  date = type_string("Date in yyyy-mm-dd format"),
  start_time = type_string("Start time in hh:mm format")
)
```
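Type objects like this come from ellmer, so you can also try them interactively before building a task. A minimal sketch, assuming a recent ellmer version in which the method is called chat_structured() (older releases named it extract_data()), and using a made-up input text:

```r
# One-off structured extraction, outside of any vitals task.
# The input text is hypothetical; requires a valid API key.
my_chat_gemini$chat_structured(
  "Join us for the Data Visualization workshop with Jane Doe
   (Example University) on 2025-06-12, starting at 09:00.",
  type = my_object
)
```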
Next, create structured tasks for your models with defined parameters:
```r
my_task_structured <- Task$new(
  dataset = extract_dataset,
  solver = generate_structured(solver_chat = my_chat, type = my_object),
  scorer = model_graded_qa(…)
)
```
Clone this task for other models, and remember to set parameters accordingly before running evaluations.
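Following the same cloning pattern as before, a sketch of re-running the structured task with the Gemini chat might look like:

```r
my_task_structured_gemini <- my_task_structured$clone()
my_task_structured_gemini$set_solver(
  generate_structured(solver_chat = my_chat_gemini, type = my_object)
)
my_task_structured_gemini$eval(epochs = 3)
```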
Evaluation and Results Compilation
As evaluations conclude, compile the results into a single data frame. This unified view makes it easier to compare performance across models and to ground decisions in the results.
```r
structured_tasks <- vitals_bind(…)  # Merge results
saveRDS(structured_tasks, "structured_tasks.Rds")
```
Conclusion
Integrating various LLMs into your workflow builds intuition for each model’s strengths and weaknesses and equips you to conduct systematic evaluations. The ability to run the same task across multiple models, compare the results, and extract structured data from text makes the vitals R package a valuable asset for working with LLMs.