Applications to real data

Demo

Demo 3
Author

Emi Tanaka

Published

October 15, 2024


This demo assumes that you have installed all the software requirements.

library(tidyverse)
library(SAI)
library(elmer)

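We’ll use LLMs to help us with the following data processing tasks: scoring sentiment, summarising text, identifying and translating languages, clustering, classification, and standardising data.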

Make sure to validate by doing spot checks or otherwise!

LLMs may propagate the majority view, so use them with caution, particularly for classification tasks.

1 AirBnB reviews data

The airbnb_reviews data from the SAI package contains AirBnB reviews in Sydney, Australia. We look at the listing “Heart of Rockdale 2” (ID 35428495) with host Jason. The host describes the neighbourhood of the listing as below.

good location easy to access all Sydney has to offer..

This listing has 20 reviews shown below.

reviews <- airbnb_reviews |> 
  filter(listing_id == 35428495) |> 
  pull(review_text)

reviews
 [1] "Excellent place to stay for single person travelling. All necessities are there hair dryer, kitchenette, cutlery, clean towels and so on. Highly recommended."                                                                                                                                                                                                                                                                             
 [2] "Good privacy and self contained."                                                                                                                                                                                                                                                                                                                                                                                                          
 [3] "Convenient location, east check in and clean space. Would stay again."                                                                                                                                                                                                                                                                                                                                                                     
 [4] "다음날 공항에 가기 전 묵기 좋았습니다<br/>깔끔해서 묵기 편했습니다<br/>하지만 난방기구가 하나도 없어서 저는 밤새 덜덜 떨면서 자야 했습니다...🥲<br/>너무 추운거 빼고는 완벽했습니다"                                                                                                                                                                                                                                                       
 [5] "Great location, walking distance to train station, cafes, restaurants and Lady Robinson's beach. If you need to work or stream, bring your own internet."                                                                                                                                                                                                                                                                                  
 [6] "Good stay overall!"                                                                                                                                                                                                                                                                                                                                                                                                                        
 [7] "Great little single person room, good for short .<br/>It has a kitchenette, clean and usable toilet and shower.<br/>Lots of good food in Rockdale within walking distance"                                                                                                                                                                                                                                                                 
 [8] "Thanks Jason, was perfect for what I needed"                                                                                                                                                                                                                                                                                                                                                                                               
 [9] "Clean & tidy place, Convient location with proximity to shops,airport & train."                                                                                                                                                                                                                                                                                                                                                            
[10] "Very last minute booking due to delayed flight, very close to airport. Thanks for the accomodation Jason! Highly recommend :)"                                                                                                                                                                                                                                                                                                             
[11] "it's a tidy clean home. and Jason is nice . it's just the time i arrived was late and the light is not on. as a girl i'll be a bit scared ,and really suggest jason to change the blind into some thicker material, that blind for girls are not very safe. and could put a inside locker chain. i know it's a female thing. but it will definitely make us feel safer. ty"                                                                
[12] "An older style room with older furniture, but very clean and functional. Great communication from the owner."                                                                                                                                                                                                                                                                                                                              
[13] "Had a great stay. Fantastic location close to the airport and Jason was very accommodating and easy to communicate with."                                                                                                                                                                                                                                                                                                                  
[14] "Everything was perfect, great host"                                                                                                                                                                                                                                                                                                                                                                                                        
[15] "Good basic accommodation for one person needing to make an early start for the airport."                                                                                                                                                                                                                                                                                                                                                   
[16] "Excellent for a quick stay"                                                                                                                                                                                                                                                                                                                                                                                                                
[17] "Stayed here for 2.5 weeks and was really happy with the place. The unit was tiny, but had all the amenities I needed as a solo traveller. 10 min walk to the ocean and beaches, Rockdale train station, Anytime Fitness,  shopping centre and Coles supermarket. I work remotely, so the only thing I wish was better was the wifi, which was pretty spotty most days. I used my wifi hotspot or worked from a café, so that worked for me"
[18] "Super handy to Sydney Airport, very prompt and clear communication, and totally reasonable, simple accommodation with everything that any reasonable maintenance person would happily need."                                                                                                                                                                                                                                               
[19] "thank you, clean convenient and quiet neighbourhood"                                                                                                                                                                                                                                                                                                                                                                                       
[20] "Perfect for one night for solo travellors. All u need in a small bedsit. Can be cold though so either ask for another blanket or take a spare one."                                                                                                                                                                                                                                                                                        

1.1 Sentiment score

sentimenter <- chat_ollama(model = "llama3.1:8b",
                           system_prompt = "Score the sentiment of the input text from 1 to 10 where 1 is the most negative and 10 is the most positive. Just return the sentiment score only.",
                           seed = 1,
                           api_args = list(top_p = 0.1,
                                           temperature = 0))

# for each review, ask the `sentimenter` to score.
sentiment <- map_dbl(reviews, ~as.numeric(sentimenter$chat(.x)))

mean(sentiment)
[1] 8.75

So the mean sentiment of this listing is 8.75 out of 10. You can see the review text and the sentiment score computed by the LLM in Table 1. Do you agree with the scoring? It’s a good idea to check a sample to confirm it’s working as intended before processing your data at scale.

Table 1: The review text and corresponding sentiment score computed by the LLM.
| reviews | sentiment |
|---|---|
| Excellent place to stay for single person travelling. All necessities are there hair dryer, kitchenette, cutlery, clean towels and so on. Highly recommended. | 9 |
| Good privacy and self contained. | 8 |
| Convenient location, east check in and clean space. Would stay again. | 9 |
| 다음날 공항에 가기 전 묵기 좋았습니다<br/>깔끔해서 묵기 편했습니다<br/>하지만 난방기구가 하나도 없어서 저는 밤새 덜덜 떨면서 자야 했습니다…🥲<br/>너무 추운거 빼고는 완벽했습니다 | 6 |
| Great location, walking distance to train station, cafes, restaurants and Lady Robinson’s beach. If you need to work or stream, bring your own internet. | 9 |
| Good stay overall! | 8 |
| Great little single person room, good for short .<br/>It has a kitchenette, clean and usable toilet and shower.<br/>Lots of good food in Rockdale within walking distance | 9 |
| Thanks Jason, was perfect for what I needed | 10 |
| Clean & tidy place, Convient location with proximity to shops,airport & train. | 9 |
| Very last minute booking due to delayed flight, very close to airport. Thanks for the accomodation Jason! Highly recommend :) | 10 |
| it’s a tidy clean home. and Jason is nice . it’s just the time i arrived was late and the light is not on. as a girl i’ll be a bit scared ,and really suggest jason to change the blind into some thicker material, that blind for girls are not very safe. and could put a inside locker chain. i know it’s a female thing. but it will definitely make us feel safer. ty | 7 |
| An older style room with older furniture, but very clean and functional. Great communication from the owner. | 8 |
| Had a great stay. Fantastic location close to the airport and Jason was very accommodating and easy to communicate with. | 10 |
| Everything was perfect, great host | 10 |
| Good basic accommodation for one person needing to make an early start for the airport. | 8 |
| Excellent for a quick stay | 9 |
| Stayed here for 2.5 weeks and was really happy with the place. The unit was tiny, but had all the amenities I needed as a solo traveller. 10 min walk to the ocean and beaches, Rockdale train station, Anytime Fitness, shopping centre and Coles supermarket. I work remotely, so the only thing I wish was better was the wifi, which was pretty spotty most days. I used my wifi hotspot or worked from a café, so that worked for me | 9 |
| Super handy to Sydney Airport, very prompt and clear communication, and totally reasonable, simple accommodation with everything that any reasonable maintenance person would happily need. | 10 |
| thank you, clean convenient and quiet neighbourhood | 9 |
| Perfect for one night for solo travellors. All u need in a small bedsit. Can be cold though so either ask for another blanket or take a spare one. | 8 |
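
One thing a spot check can reveal is a reply that is not a bare number (for example "Score: 8"), which as.numeric() would silently turn into NA with a warning. The sketch below shows one defensive way to parse such replies. The helper parse_score and the example replies are hypothetical illustrations, not part of SAI or elmer, and the replies are made up rather than taken from the model.

```r
# Hypothetical helper: pull a 1-10 sentiment score out of a raw LLM reply.
parse_score <- function(reply) {
  # Find the first run of digits in the reply, if there is one
  found <- regmatches(reply, regexpr("[0-9]+", reply))
  if (length(found) == 0) return(NA_real_)
  score <- as.numeric(found)
  # Scores outside the requested 1-10 range are treated as missing
  if (score < 1 || score > 10) NA_real_ else score
}

# Made-up replies, for illustration only
replies <- c("9", "Score: 8", "10/10", "I cannot score this.", "42")
scores <- vapply(replies, parse_score, numeric(1), USE.NAMES = FALSE)
scores

# Replies that need a manual look
replies[is.na(scores)]
```

Any reply that comes back as NA is a natural candidate for a manual check before computing the mean.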

1.2 Summarise texts

To summarise the text, we need to pass all the reviews to the LLM at once. To do this, we collapse the review texts into a single string while still signalling that the reviews are separate: we quote each review and separate them with commas.

review_all <- paste0(paste0('"', reviews,'"'), collapse = ", ")

summariser <- chat_ollama(model = "llama3.1:8b",
                          system_prompt = "The user will input reviews of a listing. Summarise the input text into one paragraph including the overall sentiment. Just return the summary.",
                          seed = 1,
                          api_args = list(top_p = 0.1,
                                          temperature = 0))

summariser$chat(review_all) |> cat()
The listing is a simple, self-contained accommodation suitable for single travelers, with all the necessary amenities such as kitchenette, clean towels, and hair dryer. The location is convenient, close to public transport, shops, and beaches, making it ideal for short stays or early departures from Sydney Airport. While some guests noted that the room can be cold, especially at night, most reviewers praised the cleanliness, tidiness, and good communication with the owner. Overall, the sentiment is overwhelmingly positive, with many guests recommending the listing to others.

What do you think about the summary it produced? Again, do some spot checks!
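
To see what the LLM actually receives, it can help to run the collapsing step from this section on a few short strings. The sketch below does that with three reviews from the listing, and also shows a numbered, one-per-line layout as a possible alternative separator. The numbered layout is an assumption for illustration, not what this demo uses.

```r
# Three short reviews taken from the listing above
toy_reviews <- c("Good stay overall!",
                 "Excellent for a quick stay",
                 "Everything was perfect, great host")

# Same approach as in this section: quote each review, separate by commas
quoted <- paste0(paste0('"', toy_reviews, '"'), collapse = ", ")
cat(quoted, "\n")

# Alternative layout (assumption): number each review, one per line
numbered <- paste0("Review ", seq_along(toy_reviews), ": ", toy_reviews,
                   collapse = "\n")
cat(numbered, "\n")
```

Note that the third review contains a comma, which is why quoting (or some other explicit separator) matters when reviews are joined into a single string.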

Bonus: you may have noticed that the fourth review is in another language. You can use an LLM to identify the language and translate the text. For convenience, the SAI package does this for you as follows.

We’ll set the model for SAI first.

sai_set_model(model_ollama(model = "llama3.1:8b"))

We can identify the language as:

sai_what_language(reviews[4])
[1] "Korean"

And translate to English (default) as below:

sai_translate(reviews[4]) |> cat()
I stayed here the night before flying out and it was great. The place is very clean and comfortable. However, there were no heating devices so I shivered all night. It was perfect except for the cold.

The functions above are convenience functions that use an LLM under the hood.

2 Salary

The salary data from SAI contains the results of the 2021 Ask a Manager Salary Survey.

We’ll take 400 random responses to reduce the run time of the demonstration.

set.seed(1)
salary_sample <- slice_sample(salary, n = 400)

2.1 Clustering

Respondents entered their job title as free text, so there can be as many job titles as respondents. We may wish to group them thematically, so we’ll cluster them using the LLM as below.

In this process, you may prefer to form the clusters from a sub-sample (200 below), since LLMs have a limited context window and a list of all job titles may make the input text unwieldy.

job_all <- paste0(paste0('"', sample(salary_sample$job_title, 200), '"'), collapse = ", ")

job_grouper <- chat_ollama(model = "llama3.1:8b",
                           system_prompt = "The input text is a list of job title from survey responses. Clean the job titles and group similar job titles together. Just return the groups only as a json object.",
                           seed = 1,
                           api_args = list(top_p = 0.5,
                                           temperature = 0,
                                           response_format = list(type = "json_object")))

job_groups <- job_grouper$chat(job_all) |> 
  jsonlite::fromJSON()

job_groups
$Management
 [1] "Data Engineering Manager"                  
 [2] "Program Director"                          
 [3] "Library director"                          
 [4] "Sr Contracts Manager"                      
 [5] "Director of Customer Experience"           
 [6] "office manager"                            
 [7] "Payor Support Specialist"                  
 [8] "Head of Devops"                            
 [9] "Senior Program Manager"                    
[10] "Digital Account Manager"                   
[11] "Manager"                                   
[12] "Customer Success Manager"                  
[13] "EA"                                        
[14] "Admin"                                     
[15] "Executive assistant"                       
[16] "VP Marketing/Merchandising/Creative"       
[17] "Sr Manager"                                
[18] "Human Resource Manager"                    
[19] "Office Manager"                            
[20] "HR Coordinator"                            
[21] "IT Helpdesk Assistant"                     
[22] "HR Operations Manager"                     
[23] "Account coordinator"                       
[24] "Research scientist"                        
[25] "Government worker"                         
[26] "Service Manager"                           
[27] "Marketing Coordinator"                     
[28] "Consultant"                                
[29] "Administrative Assistant & Systems Analyst"
[30] "Alliance Manager"                          
[31] "Marketing Manager"                         
[32] "Administrative Assistant"                  
[33] "Manager, Access Services"                  
[34] "Sr. Account Relationship Manager"          
[35] "RN coordinator"                            
[36] "Digital print technician"                  
[37] "Benefits & Compensation Manager"           
[38] "Director of Development"                   
[39] "Senior Associate"                          
[40] "Library Director"                          
[41] "Team Lead Inbound Scheduling"              
[42] "Grants Manager"                            
[43] "HR Manager"                                
[44] "Graduate Student"                          
[45] "Director of Software Engineering"          
[46] "Clinical Trainer"                          
[47] "Quality Manager"                           
[48] "Veterinary technician supervisor"          
[49] "Executive Director"                        

$Engineering
 [1] "Data Engineer"                     "Frontend Engineer"                
 [3] "Devops Engineer"                   "Software Engineer"                
 [5] "Senior Software Engineer"          "Cognitive implementation engineer"
 [7] "Water Resources Engineer"          "Staff Engineering Analyst"        
 [9] "Mechanical drafter"                "IT Business Analyst"              
[11] "Web Developer"                     "Security engineer"                
[13] "Software Dev"                      "Database Administrator"           

$Science
 [1] "Postdoc"                        "Research manager"              
 [3] "Psychotherapist"                "Financial Controller"          
 [5] "Director of Research Services"  "Associate Professor"           
 [7] "Geologist"                      "Machine Learning Engineer"     
 [9] "Cytogenetics Technologist"      "Psychologist"                  
[11] "Licensed Clinical Psychologist" "Partner"                       
[13] "Actuary"                        "Chemical Engineer"             

$Education
[1] "Teacher"                                
[2] "World Language teacher"                 
[3] "High school teacher"                    
[4] "EL kindergarten teacher"                
[5] "Adjunct professor and exhibiting artist"
[6] "Associate Director of Admissions"       
[7] "Professor of [Humanities Discipline]"   
[8] "Instructor"                             

$Law
[1] "Attorney"                                  
[2] "Lawyer"                                    
[3] "Patent attorney"                           
[4] "Licensed Mental Health Counselor"          
[5] "Global digital and social media HR manager"

$Finance
[1] "Financial Controller"            "Forecasting Analyst"            
[3] "Billing Manager"                 "Loan Processor"                 
[5] "Accounting Assistant"            "Bookkeeper"                     
[7] "investment advisor"              "Director of Revenue Cycle & EHR"

$Healthcare
[1] "Psychotherapist"                  "RN coordinator"                  
[3] "Veterinary technician supervisor" "Clinical Trainer"                

$IT
[1] "Communications Logistics Analyst"   "Communications Strategist"         
[3] "Digital Account Manager"            "IT Helpdesk Assistant"             
[5] "Web Developer"                      "Database Administrator"            
[7] "Salesforce administrator/developer" "Software Engineer 2"               

$Marketing
[1] "Communications Strategist"           
[2] "Senior Digital Marketing Executive"  
[3] "Marketing and Communications Manager"
[4] "Marketing Coordinator"               
[5] "VP Membership and Programs"          

$Research
[1] "Postdoc"                       "Research manager"             
[3] "Director of Research Services" "Research scientist"           
[5] "Program Coordinator"           "Resource specialist"          
[7] "Graduate Student"             

$Sales
[1] "Sales analyst"              "Solutions Consultant"      
[3] "Lead Product Designer"      "VP Membership and Programs"

$Other
[1] "Communications associate" "stripper"                
[3] "student"                  "Games Specialist"        
[5] "product photographer"     "Waitress"                
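
A quick check on the output above is whether any job title was placed in more than one group. The sketch below uses a small excerpt of the groups printed above (not the full object) and base R only. The names job_groups_excerpt, groups_by_title and overlapping are made up for this illustration.

```r
# A small excerpt of the `job_groups` output shown above
job_groups_excerpt <- list(
  Science    = c("Postdoc", "Psychotherapist", "Financial Controller"),
  Finance    = c("Financial Controller", "Bookkeeper"),
  Healthcare = c("Psychotherapist", "RN coordinator"),
  Research   = c("Postdoc", "Graduate Student")
)

# Flatten to parallel vectors: one entry per (title, group) pair
titles <- unlist(job_groups_excerpt, use.names = FALSE)
groups <- rep(names(job_groups_excerpt), lengths(job_groups_excerpt))

# For each title, list the groups it was assigned to
groups_by_title <- split(groups, titles)

# Keep only titles that appear in more than one group
overlapping <- groups_by_title[lengths(groups_by_title) > 1]
overlapping
```

In this excerpt, "Financial Controller", "Postdoc" and "Psychotherapist" each appear in two groups, so the clustering is not a partition. The same code can be run on the full job_groups list.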

2.2 Classification

Once we have identified the classification groups, as in Section 2.1 or otherwise, we can classify each job title into one of these groups as below. We’ll only classify 20 observations, as classifying more would be too time-consuming for a demonstration.

groups <- paste0(names(job_groups), collapse = ", ")
job_classifier <- chat_ollama(model = "llama3.1:8b",
                              system_prompt = paste0("Classify the input text to [", groups, "]. Just return the group only."),
                              seed = 1,
                              api_args = list(top_p = 0.5,
                                              temperature = 0))
jobs <- map_chr(salary_sample$job_title[1:20], ~job_classifier$chat(.x))

The resulting classification can be seen in Table 2. Do you agree with the classifications? I don’t agree with some (e.g. Technical Writer grouped as Engineering), but most others look good to me.

Table 2: The classification of the job title to a group.
| Job Title | Group |
|---|---|
| Technical Writer | Engineering |
| Executive assistant | Management |
| Archivist | Other |
| Associate Director of Admissions | Education |
| Senior examiner (claims) | Law |
| Library Director | Education |
| Machine Learning Engineer | Engineering |
| Assistant Director, Finance | Finance |
| Librarian | Education |
| Learning Designer | Education |
| Senior policy analyst | Law |
| Operations Manager | Management |
| Lead Marketing Analyst | Marketing |
| Production manager | Engineering |
| Sr. Executive Assistant | Management |
| Regulatory Manager | Law |
| Finance Director | Finance |
| Assistant manager | Management |
| Solicitor | Law |
| Billing Manager | Finance |

It is best to scan all the classifications. If there are only a few issues, you can correct them manually. If there are more, you can give the LLM some examples of correct classifications and reclassify everything.
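
As a rough sketch of the second option, the examples can be appended to the system prompt before reclassifying. The code below only builds the prompt text and checks replies against the allowed groups, so it runs without a model. The corrections in examples and the replies vector are hypothetical and chosen purely for illustration.

```r
# Group names, as returned by names(job_groups) in Section 2.1
group_names <- c("Management", "Engineering", "Science", "Education", "Law",
                 "Finance", "Healthcare", "IT", "Marketing", "Research",
                 "Sales", "Other")

# Hypothetical corrections (title = preferred group), for illustration only
examples <- c("Technical Writer" = "Other", "Library Director" = "Management")

# One line per example, e.g. "Technical Writer" -> Other
example_lines <- paste0('"', names(examples), '" -> ', examples,
                        collapse = "\n")

# Same instruction as in the demo, followed by the worked examples
few_shot_prompt <- paste0(
  "Classify the input text to [", paste0(group_names, collapse = ", "), "]. ",
  "Just return the group only.\nExamples:\n", example_lines
)
cat(few_shot_prompt, "\n")

# Made-up replies: flag anything that is not one of the allowed groups
replies <- c("Engineering", "Law", "Legal")
invalid <- replies[!replies %in% group_names]
invalid
```

The resulting few_shot_prompt could then be passed as the system_prompt when creating a new classifier, and any reply listed in invalid reviewed by hand.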

2.3 Standardise data

In the survey, respondents entered their country as free text. If you look at the entries below, you’ll notice that there are many variations of the same country, e.g. “United States” and “USA”.

salary_sample |> 
  # there is one entry, which is an 
  # answer to another question so remove for now
  filter(nchar(country) < 25) |> 
  count(country, sort = TRUE) |> 
  print(n = Inf)
# A tibble: 29 × 2
   country                      n
   <chr>                    <int>
 1 United States              141
 2 USA                        128
 3 US                          34
 4 Canada                      23
 5 United Kingdom               8
 6 United States of America     8
 7 U.S.                         7
 8 UK                           7
 9 Usa                          5
10 Ireland                      4
11 united states                4
12 usa                          4
13 Australia                    3
14 Germany                      3
15 England                      2
16 Netherlands                  2
17 The Netherlands              2
18 Uk                           2
19 United states                2
20 CANADA                       1
21 China                        1
22 France                       1
23 IS                           1
24 India                        1
25 Israel                       1
26 New zealand                  1
27 Philippines                  1
28 SWITZERLAND                  1
29 U.S                          1
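
Before reaching for an LLM, it is worth seeing how far plain string normalisation gets. The sketch below, which is an aside and not part of the SAI workflow, applies trimming, punctuation removal and upper-casing to some of the United States variants listed above.

```r
# Some of the United States variants from the counts above
entries <- c("United States", "united states", "United states",
             "USA", "usa", "Usa", "U.S.", "U.S", "US")

# Trim whitespace, drop anything that is not a letter or space, upper-case
normalised <- toupper(gsub("[^A-Za-z ]", "", trimws(entries)))

# Nine spellings collapse to three distinct values
unique(normalised)
```

Normalisation merges differences in case and punctuation, but it cannot tell that "US", "USA" and "UNITED STATES" refer to the same country. That remaining step is where a lookup table or an LLM-based matcher is useful.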

You can use a method similar to Section 2.2 to standardise the country entries. However, this has been conveniently wrapped as the function sai_fct_match in the SAI package.

country <- sai_fct_match(
  salary_sample$country,
  levels = c(
    "United States America",
    "Canada",
    "United Kingdom",
    "Ireland",
    "Iceland",
    "India",
    "Israel",
    "New Zealand",
    "Switzerland",
    "Philippines",
    "China",
    "France",
    "Netherlands",
    "Australia",
    "Germany"
  )
)

The above conversion can be seen in Table 3. While most entries are correct, not all were mapped correctly: for example, “The Netherlands” was converted to NA and “IS” to United States America. Since there are only a few of these, you can correct them manually.

Table 3: Conversion of the country to select levels.
| original | converted |
|---|---|
| United States | United States America |
| Canada | Canada |
| USA | United States America |
| US | United States America |
| United states | United States America |
| United States of America | United States America |
| New zealand | New Zealand |
| United Kingdom | United Kingdom |
| UK | United Kingdom |
| Uk | United Kingdom |
| usa | United States America |
| Usa | United States America |
| Germany | Germany |
| Australia | Australia |
| India | India |
| Israel | Israel |
| united states | United States America |
| Netherlands | Netherlands |
| SWITZERLAND | Switzerland |
| U.S | United States America |
| U.S. | United States America |
| The Netherlands | NA |
| I was brought in on this salary to help with the EHR and very quickly was promoted to current position but compensation was not altered. | United States America |
| Ireland | Ireland |
| France | France |
| China | China |
| England | United Kingdom |
| IS | United States America |
| Philippines | Philippines |
| CANADA | Canada |
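
A manual correction can be done with a small lookup keyed on the original entry. The sketch below uses four rows from Table 3. The vector fixes is hypothetical, and reading “IS” as Iceland (its ISO country code) is an assumption about what the respondent meant.

```r
# Four rows taken from Table 3
original  <- c("The Netherlands", "IS", "USA", "England")
converted <- c(NA, "United States America", "United States America",
               "United Kingdom")

# Hypothetical manual corrections, keyed by the original entry.
# Assumption: "IS" is the ISO code for Iceland.
fixes <- c("The Netherlands" = "Netherlands", "IS" = "Iceland")

# Overwrite only the rows whose original entry has a correction
needs_fix <- original %in% names(fixes)
converted[needs_fix] <- unname(fixes[original[needs_fix]])

data.frame(original, converted)
```

Keying the corrections on the original text rather than on row positions means the same fixes can be reapplied if the data are re-sampled or re-converted.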