Summarising Top 100 UK Climbs: Running Local Language Models with LM Studio and R

R-bloggers 2024-11-25

[This article was first published on Musings on R, and kindly contributed to R-bloggers].

Introduction

Since my last entry on this blog, the landscape of data science has been massively disrupted by the advent of large language models (LLMs). Areas in data science such as text mining and natural language processing have been revolutionised by the capabilities of these models.

Remember the days of manually tagging and reading through text data? Well, they’re long gone, and I can only say that not all blog posts age equally well (RQDA was one of my favourite R packages for analysing qualitative data; not only is it now redundant, it is also no longer available on CRAN). I am also not sure that there is much value anymore in using n-gram / word frequency analysis or word clouds to surface key themes from a large corpus of text data, when you can simply use an LLM these days to generate summaries and insights.

To get with the times (!), I have decided to explore the capabilities of LM Studio, a platform that allows you to run language models locally. The benefits of running a language model locally are:

  • you can interact with it directly from your R environment, without the need to rely on cloud-based services;
  • there is no need to pay for API calls – as long as you can afford the electricity bill to run your computer, you can generate as much text as you want!

In this blog post, I will guide you through the process of setting up LM Studio, integrating it with R, and applying it to a dataset on the UK’s top 100 cycling climbs (my latest pastime). We will create a custom function to interact with the language model, generate prompts for the model, and visualise the results. Let’s get started!

image from Giphy

Setting Up LM Studio

Install LM Studio and download models

Before we begin, ensure you have LM Studio installed on your machine.

After you have downloaded and installed LM Studio, open the application. Go to the Discover tab (sidebar), where you can browse and search for models. In this example, we will be using the Phi-3-mini-4k-instruct model, but you can of course experiment with any other model that you prefer – as long as you’ve got the hardware to run it!

Now, select the model from the top bar to load it:

To check that everything is working fine, go to the Chat tab on the sidebar and start a new chat to interact with the Phi-3 model directly. You’ve now got your language model up and running!

Required R Packages

To effectively work with LM Studio, we will need several R packages:

  • tidyverse – for data manipulation
  • httr – for API interaction
  • jsonlite – for JSON parsing

You can install/update them all with one line of code:

# Install necessary packages
install.packages(c("tidyverse", "httr", "jsonlite"))

Let us set up the R script by loading the packages and the data we will be working with:

# Load the packages
library(tidyverse)
library(httr)
library(jsonlite)

top_100_climbs_df <- read_csv("https://raw.githubusercontent.com/martinctc/blog/refs/heads/master/datasets/top_100_climbs.csv")

The top_100_climbs_df dataset contains information on the top 100 cycling climbs in the UK, which I’ve pulled from the Cycling Uphill website, originally put together by Simon Warren. There are 100 rows and the following columns in the dataset:

  • climb_id: unique identifier for the climb
  • climb: name of the climb
  • height_gain_m: height gain in meters
  • average_gradient: average gradient of the climb
  • length_km: total length of the climb in kilometers
  • max_gradient: maximum gradient of the climb
  • url: URL to the climb’s page on Cycling Uphill

Here is what the dataset looks like when we run dplyr::glimpse():

glimpse(top_100_climbs_df)
## Rows: 100
## Columns: 7
## $ climb_id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ climb            <chr> "Cheddar Gorge", "Weston Hill", "Crowcombe Combe", "P…
## $ height_gain_m    <dbl> 150, 165, 188, 372, 326, 406, 166, 125, 335, 163, 346…
## $ average_gradient <dbl> 0.05, 0.09, 0.15, 0.12, 0.10, 0.04, 0.11, 0.11, 0.06,…
## $ length_km        <dbl> 3.5, 1.8, 1.2, 4.9, 3.2, 11.0, 1.5, 1.1, 5.4, 1.4, 9.…
## $ max_gradient     <dbl> 0.16, 0.18, 0.25, 0.25, 0.17, 0.12, 0.25, 0.18, 0.12,…
## $ url              <chr> "https://cyclinguphill.com/cheddar-gorge/", "https://…
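As a quick sanity check on the data (an aside of mine, not part of the original workflow), the average gradient should roughly equal the height gained divided by the distance travelled. Using the Cheddar Gorge row shown above – 150 m gained over 3.5 km:

```r
# Hypothetical sanity check on one row of the dataset:
# average gradient ~ height gain / horizontal distance
height_gain_m <- 150
length_km <- 3.5

implied_gradient <- height_gain_m / (length_km * 1000)
round(implied_gradient, 3)  # ~0.043, in the same ballpark as the stored 0.05
```

The small discrepancy is expected, since the published average gradients are rounded.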

Our goal here is to use this dataset to generate text descriptions for each of the climbs using the language model. Since this is for text generation, we will do a bit of cleaning up of the dataset, converting gradient values to percentages:

top_100_climbs_df_clean <- top_100_climbs_df %>%
  mutate(
    average_gradient = scales::percent(average_gradient),
    max_gradient = scales::percent(max_gradient)
  )
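To see what this conversion buys us, here is a minimal illustration of scales::percent() on a couple of the gradient values from the dataset:

```r
library(scales)

# Proportions become human-readable percentage strings,
# which read far more naturally inside a text prompt
percent(c(0.05, 0.16))
```

A prompt containing "5%" is much less likely to be misread by the model than one containing the raw proportion 0.05.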

Setting up the Local Endpoint

Once you have got your model in LM Studio up and running, you can set up a local endpoint to interact with it directly from your R environment.

To do this, go to the Developer tab on the sidebar,and click ‘Start Server (Ctrl + R)’.

Setting up a local endpoint allows you to interact with the language model directly from your R environment. If you leave your default settings unchanged, the server will run on http://localhost:1234, with the chat completions endpoint at http://localhost:1234/v1/chat/completions.

In this article, we will be using the chat completions endpoint for summarising / generating text.

Writing a Custom Function to Connect to the Local Endpoint

The next step here is to write a custom function that will allow us to send our prompt to the local endpoint and retrieve the response from the language model. Since we have 100 climbs to describe, writing a custom function allows us to scale the logic for interacting with the model, which saves us time and reduces the risk of errors. We can also reuse this function as a template for other future projects.

Creating a custom function

Below is a code snippet for creating a custom function to communicate with your local LM Studio endpoint:

# Define a function to connect to the local endpoint
send_prompt <- function(system_prompt,
                        user_prompt,
                        endpoint = "http://localhost:1234/v1/chat/completions") {
  # Define the data payload for the local server
  data_payload <- list(
    messages = list(
      list(role = "system", content = system_prompt),
      list(role = "user", content = user_prompt)
    ),
    temperature = 0.7,
    max_tokens = 500,
    top_p = 0.9,
    frequency_penalty = 0.0,
    presence_penalty = 0.0
  )

  # Convert the data to JSON
  json_body <- toJSON(data_payload, auto_unbox = TRUE)

  # Send a POST request to the local server
  response <- POST(
    endpoint,
    add_headers("Content-Type" = "application/json"),
    body = json_body,
    encode = "json"
  )

  if (response$status_code == 200) {
    # Parse the response and return the generated text
    response_content <- content(response, as = "parsed", type = "application/json")
    response_text <- response_content$choices[[1]]$message$content
    response_text
  } else {
    stop("Error: Unable to connect to the language model")
  }
}

There are a few things to note in this function:

  1. The send_prompt function takes in three arguments: system_prompt, user_prompt, and endpoint.
  • We distinguish between the system and user prompts here, which is typically not necessary for a simple chat completion. However, it is useful for more complex interactions where you want to guide the model with specific prompts. The system prompt is typically used for providing overall guidance, context, tone, and boundaries for the behaviour of the AI, while the user prompt is the actual input that you want the AI to respond to.
  • The endpoint is the URL of the local server that we are connecting to. Note that we have used the chat completions endpoint here.
  2. The data_payload is a list that contains the messages (prompts) and the parameters that you can adjust to control the output of the language model. These parameters can vary depending on the model you are using - I typically search for the “API documentation” or the “API reference” for the model as a guide. Here are the parameters we are using in this example:
  • messages is a list of messages that the language model will use to generate the text. In this case, we have a system message and a user message.
  • temperature controls the randomness of the output. A higher temperature will result in more random output.
  • max_tokens is the maximum number of tokens that the language model will generate.
  • top_p is the nucleus sampling parameter, and an alternative to sampling with temperature. It controls the probability mass that the model considers when generating the next token.
  • frequency_penalty and presence_penalty discourage the model from repeating itself: the former penalises a token in proportion to how often it has already appeared, while the latter applies a flat penalty to any token that has appeared at all.

  3. The json_body is the JSON representation of the data_payload list. We need to transform the list into JSON format because this is what is expected by the local server. We do this with jsonlite::toJSON().

  4. The response object is the result of sending a POST request to the local server. If the status code of the response is 200, then we return the content of the response. If there is an error, we stop the function and print an error message.
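As a quick aside on the auto_unbox = TRUE argument passed to jsonlite::toJSON(): without it, jsonlite serialises R scalars as length-1 JSON arrays, which is not the shape the chat completions API expects. A minimal illustration (my own, not from the original post):

```r
library(jsonlite)

msg <- list(role = "user", content = "Hello")

# Without auto_unbox, scalars are wrapped in arrays:
# {"messages":[{"role":["user"],"content":["Hello"]}]}
toJSON(list(messages = list(msg)))

# With auto_unbox = TRUE, they become plain JSON strings,
# matching the payload shape the server expects:
# {"messages":[{"role":"user","content":"Hello"}]}
toJSON(list(messages = list(msg)), auto_unbox = TRUE)
```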

Now that we have our function, let us test it out!

Testing the Function

To ensure your function works as expected, run a simple test:

# Test the send_prompt function
test_hill <- top_100_climbs_df_clean %>%
  slice(1) %>% # Select the first row
  jsonlite::toJSON()

send_prompt(
  system_prompt = paste(
    "You are a sports commentator for the Tour de France.",
    "Describe the following climb to the audience in less than 200 words, using the data."
  ),
  user_prompt = test_hill
)
## [1] "Ladies and gentlemen, hold on to your helmets as we approach the infamous Cheddar Gorge climb – a true testament of resilience for any cyclist tackling this segment! Standing at an imposing height gain of approximately 150 meters over its lengthy stretch spanning just under four kilometers, it demands every last drop from our riders. The average gradient here is pitched at around a challenging 5%, but beware – the climb isn't forgiving with occasional sections that reach up to an extreme 16%! It’s not for the faint-hearted and certainly no place for those looking for respite along this grueling ascent. The Cheddar Gorge will separate contenders from pretenders, all in one breathtakingly scenic setting – a true masterclass of endurance that is sure to make any Tour de France rider's legs scream!"

Not too bad, right?
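One thing worth considering before scaling this up (an addition of mine, not from the original workflow): local servers can occasionally time out or error partway through a long batch. Wrapping the call in tryCatch() means a single failed request returns NA rather than aborting the whole run. A sketch, assuming the send_prompt() function defined above; the wrapper name is illustrative:

```r
# Hypothetical wrapper: returns NA_character_ instead of erroring,
# so one failed request does not abort a 100-row batch
send_prompt_safely <- function(system_prompt, user_prompt, ...) {
  tryCatch(
    send_prompt(system_prompt, user_prompt, ...),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NA_character_
    }
  )
}
```

You can then filter for NA descriptions afterwards and re-run just the failed rows.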

Running a prompt template on the Top 100 Climbs dataset

What we have created in the previous section is effectively a prompt template for the system prompt, while the user prompt is made up of the data we have on the climbs, converted to JSON format. To apply this programmatically to all 100 climbs, we can make use of the purrr::pmap_dfr() function in tidyverse, which takes a data frame as input, applies a function to each row, and binds the results back into a data frame:

# Define system prompt
sys_prompt <- paste(
  "I have the following data regarding a top 100 climb for road cycling in the UK.",
  "Please help me generate a description based on the available columns, ending with a URL for the reader to find more information."
)

# Generate descriptions for all climbs
top_100_climbs_with_desc <-
  top_100_climbs_df_clean %>%
  pmap_dfr(function(climb_id, climb, height_gain_m, average_gradient, length_km, max_gradient, url) {
    user_prompt <- jsonlite::toJSON(
      list(
        climb = climb,
        height_gain_m = height_gain_m,
        average_gradient = average_gradient,
        length_km = length_km,
        max_gradient = max_gradient,
        url = url
      )
    )

    # Generate the climb description
    climb_desc <- send_prompt(system_prompt = sys_prompt, user_prompt = user_prompt)

    # Return original data frame with climb description appended as column
    tibble(
      climb_id = climb_id,
      climb = climb,
      height_gain_m = height_gain_m,
      average_gradient = average_gradient,
      length_km = length_km,
      max_gradient = max_gradient,
      url = url,
      description = climb_desc
    )
  })

The top_100_climbs_with_desc data frame now contains the original data on the top 100 climbs, with an additional column description that contains the text generated by the language model. Note that this part might take a little while to run, depending on the specs of your computer and which model you are using.
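If the pmap_dfr() row-wise pattern is new to you, here is a toy example of the same mechanics on a made-up two-row data frame (the column names and values here are illustrative, not from the climbs dataset):

```r
library(purrr)
library(tibble)

toy_df <- tibble(name = c("Alpe", "Ventoux"), metres = c(1071, 1617))

# Each row's columns are passed to the function as named arguments;
# the one-row tibbles returned are bound together into a single data frame
result <- pmap_dfr(toy_df, function(name, metres) {
  tibble(name = name, summary = paste0(name, ": ", metres, " m of climbing"))
})

result$summary
# [1] "Alpe: 1071 m of climbing"    "Ventoux: 1617 m of climbing"
```

The climbs code above follows exactly this shape, with send_prompt() doing the work inside the per-row function.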

Here are a few examples of the generated descriptions:

Box Hill is a challenging climb in the UK, with an average gradient of approximately 5%, and it stretches over a distance of just under 2 kilometers (130 meters height gain). The maximum gradient encountered on this ascent reaches up to 6%. For more detailed information about Box Hill’s topography and statistics for road cyclists, you can visit the Cyclinguphill website: https://cyclinguphill.com/box-hill/

Ditchling Beacon stands as a formidable challenge within the UK’s top climbs for road cycling, boasting an elevation gain of 142 meters over its length. With an average gradient that steepens at around 10%, cyclists can expect to face some serious resistance on this uphill battle. The total distance covered while tackling the full ascent is approximately 1.4 kilometers, and it’s noteworthy for reaching a maximum gradient of up to 17%. For those keenly interested in road cycling climbs or looking to test their mettle against Ditchling Beacon’s steep inclines, further details are readily available at https://cyclinguphill.com/100-climbs/ditchling-beacon/.

Swains Lane is a challenging road climb featured on the top 100 list for UK cycling enthusiasts, standing proudly at number one with its distinctive characteristics: it offers an ascent of 71 meters over just under half a kilometer (0.9 km). The average gradient throughout this route maintains a steady and formidable challenge to riders, peaking at approximately eight percent – a testament to the climb’s consistent difficulty level. For those seeking even more rigorous testing grounds, Swains Lane features sections where cyclists can face gradients soaring up to an impressive 20%, which not only pushes physical limits but also demands a high degree of technical skill and mental fortitude from the riders tackling this climb. Riders looking for more detailed information about this top-tier British road ascent can visit https://cyclinguphill.com/swains-lane/ where they will find comprehensive insights, including historical data on past climbs and comparisons with other challenging routes across the UK cycling landscape.

If you are interested in exploring the entire dataset with the generated column, you can download this here.

Conclusion

In this blog, we’ve explored the process of setting up LM Studio and integrating local language models into your R workflow. We discussed installation, creating custom functions to interact with the model, setting up prompt templates, and ultimately generating text descriptions from a cycling climbs dataset.

Now it’s your turn! Try implementing the methods outlined in this blog and share your experiences or questions in the comments section below. Happy coding!
