library(tidyverse)
library(SAI)
library(elmer)
This demo assumes that you have installed all the software requirements.
We’ll use LLMs to help us with the following data processing tasks:
- give a sentiment score from text,
- summarise the text,
- cluster the text entries,
- classify the text into particular categories, and
- standardise the data.
Make sure to validate the results with spot checks or otherwise!
LLMs may propagate the majority view, so use them with caution, particularly for classification tasks.
1 AirBnB reviews data
The airbnb_reviews data from the SAI package contains AirBnB reviews in Sydney, Australia. We look at the listing “Heart of Rockdale 2” (ID 35428495) with host Jason. The host describes the neighbourhood of the listing as below.
good location easy to access all Sydney has to offer..
This listing has 20 reviews shown below.
reviews <- airbnb_reviews |>
  filter(listing_id == 35428495) |>
  pull(review_text)
reviews
[1] "Excellent place to stay for single person travelling. All necessities are there hair dryer, kitchenette, cutlery, clean towels and so on. Highly recommended."
[2] "Good privacy and self contained."
[3] "Convenient location, east check in and clean space. Would stay again."
[4] "다음날 공항에 가기 전 묵기 좋았습니다<br/>깔끔해서 묵기 편했습니다<br/>하지만 난방기구가 하나도 없어서 저는 밤새 덜덜 떨면서 자야 했습니다...🥲<br/>너무 추운거 빼고는 완벽했습니다"
[5] "Great location, walking distance to train station, cafes, restaurants and Lady Robinson's beach. If you need to work or stream, bring your own internet."
[6] "Good stay overall!"
[7] "Great little single person room, good for short .<br/>It has a kitchenette, clean and usable toilet and shower.<br/>Lots of good food in Rockdale within walking distance"
[8] "Thanks Jason, was perfect for what I needed"
[9] "Clean & tidy place, Convient location with proximity to shops,airport & train."
[10] "Very last minute booking due to delayed flight, very close to airport. Thanks for the accomodation Jason! Highly recommend :)"
[11] "it's a tidy clean home. and Jason is nice . it's just the time i arrived was late and the light is not on. as a girl i'll be a bit scared ,and really suggest jason to change the blind into some thicker material, that blind for girls are not very safe. and could put a inside locker chain. i know it's a female thing. but it will definitely make us feel safer. ty"
[12] "An older style room with older furniture, but very clean and functional. Great communication from the owner."
[13] "Had a great stay. Fantastic location close to the airport and Jason was very accommodating and easy to communicate with."
[14] "Everything was perfect, great host"
[15] "Good basic accommodation for one person needing to make an early start for the airport."
[16] "Excellent for a quick stay"
[17] "Stayed here for 2.5 weeks and was really happy with the place. The unit was tiny, but had all the amenities I needed as a solo traveller. 10 min walk to the ocean and beaches, Rockdale train station, Anytime Fitness, shopping centre and Coles supermarket. I work remotely, so the only thing I wish was better was the wifi, which was pretty spotty most days. I used my wifi hotspot or worked from a café, so that worked for me"
[18] "Super handy to Sydney Airport, very prompt and clear communication, and totally reasonable, simple accommodation with everything that any reasonable maintenance person would happily need."
[19] "thank you, clean convenient and quiet neighbourhood"
[20] "Perfect for one night for solo travellors. All u need in a small bedsit. Can be cold though so either ask for another blanket or take a spare one."
1.1 Sentiment score
sentimenter <- chat_ollama(model = "llama3.1:8b",
                           system_prompt = "Score the sentiment of the input text from 1 to 10 where 1 is the most negative and 10 is the most positive. Just return the sentiment score only.",
                           seed = 1,
                           api_args = list(top_p = 0.1,
                                           temperature = 0))
# for each review, ask the `sentimenter` to score.
sentiment <- map_dbl(reviews, ~as.numeric(sentimenter$chat(.x)))
mean(sentiment)
[1] 8.75
So the mean sentiment of this listing is 8.75 out of 10. You can see the review text and the sentiment score computed by the LLM in Table 1. Do you agree with the scoring? It’s a good idea to spot-check a sample to confirm the scoring works as intended before processing your data at scale.
reviews | sentiment |
---|---|
Excellent place to stay for single person travelling. All necessities are there hair dryer, kitchenette, cutlery, clean towels and so on. Highly recommended. | 9 |
Good privacy and self contained. | 8 |
Convenient location, east check in and clean space. Would stay again. | 9 |
다음날 공항에 가기 전 묵기 좋았습니다 깔끔해서 묵기 편했습니다 하지만 난방기구가 하나도 없어서 저는 밤새 덜덜 떨면서 자야 했습니다…🥲 너무 추운거 빼고는 완벽했습니다 | 6 |
Great location, walking distance to train station, cafes, restaurants and Lady Robinson’s beach. If you need to work or stream, bring your own internet. | 9 |
Good stay overall! | 8 |
Great little single person room, good for short . It has a kitchenette, clean and usable toilet and shower. Lots of good food in Rockdale within walking distance | 9 |
Thanks Jason, was perfect for what I needed | 10 |
Clean & tidy place, Convient location with proximity to shops,airport & train. | 9 |
Very last minute booking due to delayed flight, very close to airport. Thanks for the accomodation Jason! Highly recommend :) | 10 |
it’s a tidy clean home. and Jason is nice . it’s just the time i arrived was late and the light is not on. as a girl i’ll be a bit scared ,and really suggest jason to change the blind into some thicker material, that blind for girls are not very safe. and could put a inside locker chain. i know it’s a female thing. but it will definitely make us feel safer. ty | 7 |
An older style room with older furniture, but very clean and functional. Great communication from the owner. | 8 |
Had a great stay. Fantastic location close to the airport and Jason was very accommodating and easy to communicate with. | 10 |
Everything was perfect, great host | 10 |
Good basic accommodation for one person needing to make an early start for the airport. | 8 |
Excellent for a quick stay | 9 |
Stayed here for 2.5 weeks and was really happy with the place. The unit was tiny, but had all the amenities I needed as a solo traveller. 10 min walk to the ocean and beaches, Rockdale train station, Anytime Fitness, shopping centre and Coles supermarket. I work remotely, so the only thing I wish was better was the wifi, which was pretty spotty most days. I used my wifi hotspot or worked from a café, so that worked for me | 9 |
Super handy to Sydney Airport, very prompt and clear communication, and totally reasonable, simple accommodation with everything that any reasonable maintenance person would happily need. | 10 |
thank you, clean convenient and quiet neighbourhood | 9 |
Perfect for one night for solo travellors. All u need in a small bedsit. Can be cold though so either ask for another blanket or take a spare one. | 8 |
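The call as.numeric(sentimenter$chat(.x)) relies on the model replying with a bare number; a reply such as "Score: 9" would become NA with a warning. Below is a minimal sketch of a more defensive parser in base R. The helper parse_score is hypothetical (it is not part of SAI or elmer), it assumes that the first number in the reply is the score, and the replies are made up for illustration.

```r
# Hypothetical helper: pull the first number out of a model reply and
# return NA if no number is found or it falls outside the allowed range.
parse_score <- function(reply, lower = 1, upper = 10) {
  # First run of digits, optionally with a decimal part
  m <- regmatches(reply, regexpr("[0-9]+(\\.[0-9]+)?", reply))
  if (length(m) == 0) return(NA_real_)
  score <- as.numeric(m)
  if (score < lower || score > upper) NA_real_ else score
}

# Made-up replies for illustration (not real model output)
replies <- c("9", "Score: 8", "I cannot score this.", "11")
scores <- vapply(replies, parse_score, numeric(1), USE.NAMES = FALSE)
scores
```

Any NA in the result then marks a reply worth inspecting by hand, for example with which(is.na(scores)).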
1.2 Summarise texts
To summarise the text, we need to pass all the reviews to the model at once. To do this, we collapse the review texts into a single string while still signalling that the reviews are separate: we quote each review and separate them with commas.
review_all <- paste0(paste0('"', reviews, '"'), collapse = ", ")
summariser <- chat_ollama(model = "llama3.1:8b",
                          system_prompt = "The user will input reviews of a listing. Summarise the input text into one paragraph including the overall sentiment. Just return the summary.",
                          seed = 1,
                          api_args = list(top_p = 0.1,
                                          temperature = 0))
summariser$chat(review_all) |> cat()
The listing is a simple, self-contained accommodation suitable for single travelers, with all the necessary amenities such as kitchenette, clean towels, and hair dryer. The location is convenient, close to public transport, shops, and beaches, making it ideal for short stays or early departures from Sydney Airport. While some guests noted that the room can be cold, especially at night, most reviewers praised the cleanliness, tidiness, and good communication with the owner. Overall, the sentiment is overwhelmingly positive, with many guests recommending the listing to others.
What do you think about the summary it produced? Again, do some spot checks!
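Some of the reviews contain literal <br/> tags, and a review could also contain double quotes, which would blur the boundaries between the quoted reviews. Below is a small sketch of tidying each review before quoting and collapsing, using only base R. The helper clean_review is hypothetical and toy_reviews is made-up input; whether this changes the summary is something to check rather than assume.

```r
# Hypothetical helper: tidy one or more reviews before quoting and collapsing.
clean_review <- function(x) {
  x <- gsub("<br\\s*/?>", " ", x, perl = TRUE)  # replace HTML line breaks with a space
  x <- gsub('"', "'", x, fixed = TRUE)          # avoid clashing with the outer quotes
  trimws(gsub("\\s+", " ", x, perl = TRUE))     # squeeze repeated whitespace
}

# Made-up reviews for illustration
toy_reviews <- c('Great room.<br/>Clean shower.', 'Host said "welcome" on arrival.')
toy_all <- paste0('"', clean_review(toy_reviews), '"', collapse = ", ")
cat(toy_all)
```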
Bonus: you may have noticed that the fourth review is in another language. You can use an LLM to identify the language of a text and translate it. For convenience, the SAI package does this for you as follows.
We’ll set the model for SAI first.
sai_set_model(model_ollama(model = "llama3.1:8b"))
We can identify the language as:
sai_what_language(reviews[4])
[1] "Korean"
And translate to English (default) as below:
sai_translate(reviews[4]) |> cat()
I stayed here the night before flying out and it was great. The place is very clean and comfortable. However, there were no heating devices so I shivered all night. It was perfect except for the cold.
The above are convenience functions that use an LLM under the hood.
2 Salary
The salary data from SAI contains the survey results from the 2021 Ask a Manager Salary Survey.
We’ll take 400 random responses to reduce the time for the demonstration.
set.seed(1)
salary_sample <- slice_sample(salary, n = 400)
2.1 Clustering
Respondents entered their job title as free text, so there can be as many distinct job titles as there are respondents. We may wish to group them thematically, so we’ll cluster them using the LLM as below.
Here we cluster based on a sub-sample (200 below), because LLMs have a limited context window and the full list of job titles may make the input text unwieldy.
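If a single sub-sample feels too lossy, one alternative is to send the titles in batches that each fit the context window. The sketch below covers only the batching step, in base R. The helper chunk_titles and the batch size of 200 are illustrative assumptions, and toy_titles stands in for salary_sample$job_title.

```r
# Hypothetical helper: split a character vector into batches of at most `size`.
chunk_titles <- function(titles, size = 200) {
  titles <- unique(trimws(titles))               # drop exact duplicates first
  batch_id <- ceiling(seq_along(titles) / size)  # 1, 1, ..., 2, 2, ...
  unname(split(titles, batch_id))
}

toy_titles <- paste("Job", 1:450)                # stand-in for real job titles
batches <- chunk_titles(toy_titles, size = 200)
lengths(batches)
```

Each batch could then be quoted and collapsed in the same way as the sub-sample in the code that follows, with the group names from different batches reconciled afterwards.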
job_all <- paste0(paste0('"', sample(salary_sample$job_title, 200), '"'), collapse = ", ")
job_grouper <- chat_ollama(model = "llama3.1:8b",
                           system_prompt = "The input text is a list of job title from survey responses. Clean the job titles and group similar job titles together. Just return the groups only as a json object.",
                           seed = 1,
                           api_args = list(top_p = 0.5,
                                           temperature = 0,
                                           response_format = list(type = "json_object")))
job_groups <- job_grouper$chat(job_all) |>
  jsonlite::fromJSON()
job_groups
$Management
[1] "Data Engineering Manager"
[2] "Program Director"
[3] "Library director"
[4] "Sr Contracts Manager"
[5] "Director of Customer Experience"
[6] "office manager"
[7] "Payor Support Specialist"
[8] "Head of Devops"
[9] "Senior Program Manager"
[10] "Digital Account Manager"
[11] "Manager"
[12] "Customer Success Manager"
[13] "EA"
[14] "Admin"
[15] "Executive assistant"
[16] "VP Marketing/Merchandising/Creative"
[17] "Sr Manager"
[18] "Human Resource Manager"
[19] "Office Manager"
[20] "HR Coordinator"
[21] "IT Helpdesk Assistant"
[22] "HR Operations Manager"
[23] "Account coordinator"
[24] "Research scientist"
[25] "Government worker"
[26] "Service Manager"
[27] "Marketing Coordinator"
[28] "Consultant"
[29] "Administrative Assistant & Systems Analyst"
[30] "Alliance Manager"
[31] "Marketing Manager"
[32] "Administrative Assistant"
[33] "Manager, Access Services"
[34] "Sr. Account Relationship Manager"
[35] "RN coordinator"
[36] "Digital print technician"
[37] "Benefits & Compensation Manager"
[38] "Director of Development"
[39] "Senior Associate"
[40] "Library Director"
[41] "Team Lead Inbound Scheduling"
[42] "Grants Manager"
[43] "HR Manager"
[44] "Graduate Student"
[45] "Director of Software Engineering"
[46] "Clinical Trainer"
[47] "Quality Manager"
[48] "Veterinary technician supervisor"
[49] "Executive Director"
$Engineering
[1] "Data Engineer" "Frontend Engineer"
[3] "Devops Engineer" "Software Engineer"
[5] "Senior Software Engineer" "Cognitive implementation engineer"
[7] "Water Resources Engineer" "Staff Engineering Analyst"
[9] "Mechanical drafter" "IT Business Analyst"
[11] "Web Developer" "Security engineer"
[13] "Software Dev" "Database Administrator"
$Science
[1] "Postdoc" "Research manager"
[3] "Psychotherapist" "Financial Controller"
[5] "Director of Research Services" "Associate Professor"
[7] "Geologist" "Machine Learning Engineer"
[9] "Cytogenetics Technologist" "Psychologist"
[11] "Licensed Clinical Psychologist" "Partner"
[13] "Actuary" "Chemical Engineer"
$Education
[1] "Teacher"
[2] "World Language teacher"
[3] "High school teacher"
[4] "EL kindergarten teacher"
[5] "Adjunct professor and exhibiting artist"
[6] "Associate Director of Admissions"
[7] "Professor of [Humanities Discipline]"
[8] "Instructor"
$Law
[1] "Attorney"
[2] "Lawyer"
[3] "Patent attorney"
[4] "Licensed Mental Health Counselor"
[5] "Global digital and social media HR manager"
$Finance
[1] "Financial Controller" "Forecasting Analyst"
[3] "Billing Manager" "Loan Processor"
[5] "Accounting Assistant" "Bookkeeper"
[7] "investment advisor" "Director of Revenue Cycle & EHR"
$Healthcare
[1] "Psychotherapist" "RN coordinator"
[3] "Veterinary technician supervisor" "Clinical Trainer"
$IT
[1] "Communications Logistics Analyst" "Communications Strategist"
[3] "Digital Account Manager" "IT Helpdesk Assistant"
[5] "Web Developer" "Database Administrator"
[7] "Salesforce administrator/developer" "Software Engineer 2"
$Marketing
[1] "Communications Strategist"
[2] "Senior Digital Marketing Executive"
[3] "Marketing and Communications Manager"
[4] "Marketing Coordinator"
[5] "VP Membership and Programs"
$Research
[1] "Postdoc" "Research manager"
[3] "Director of Research Services" "Research scientist"
[5] "Program Coordinator" "Resource specialist"
[7] "Graduate Student"
$Sales
[1] "Sales analyst" "Solutions Consultant"
[3] "Lead Product Designer" "VP Membership and Programs"
$Other
[1] "Communications associate" "stripper"
[3] "student" "Games Specialist"
[5] "product photographer" "Waitress"
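In the output above, some titles appear in more than one group (for example, "Psychotherapist" is listed under both Science and Healthcare), so the result is not a clean partition. Below is a sketch of how such overlaps could be listed in base R. Here toy_groups is a small made-up list with the same shape as job_groups.

```r
# Made-up stand-in with the same structure as `job_groups`
toy_groups <- list(
  Science    = c("Postdoc", "Psychotherapist", "Geologist"),
  Healthcare = c("Psychotherapist", "Clinical Trainer"),
  Research   = c("Postdoc", "Graduate Student")
)

# Long format: one row per (title, group) pair
long <- stack(toy_groups)
names(long) <- c("title", "group")

# Titles assigned to more than one group, with their groups listed
n_groups <- table(long$title)
dupes <- names(n_groups)[n_groups > 1]
overlaps <- lapply(split(as.character(long$group), long$title)[dupes], sort)
overlaps
```

Running the same steps on job_groups would give a short list of titles to assign to a single group before using the groups for classification.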
2.2 Classification
Once we have identified the classification groups, as in Section 2.1 or otherwise, we can classify each job title into one of these groups as below. We’ll only classify 20 observations, as classifying every response would be too time-consuming for a demonstration.
groups <- paste0(names(job_groups), collapse = ", ")
job_classifier <- chat_ollama(model = "llama3.1:8b",
                              system_prompt = paste0("Classify the input text to [", groups, "]. Just return the group only."),
                              seed = 1,
                              api_args = list(top_p = 0.5,
                                              temperature = 0))
jobs <- map_chr(salary_sample$job_title[1:20], ~job_classifier$chat(.x))
The resulting classification can be seen in Table 2. Do you agree with the classifications? Some I don’t agree with (e.g. Technical Writer grouped as Engineering), but most others look good to me.
Job Title | Group |
---|---|
Technical Writer | Engineering |
Executive assistant | Management |
Archivist | Other |
Associate Director of Admissions | Education |
Senior examiner (claims) | Law |
Library Director | Education |
Machine Learning Engineer | Engineering |
Assistant Director, Finance | Finance |
Librarian | Education |
Learning Designer | Education |
Senior policy analyst | Law |
Operations Manager | Management |
Lead Marketing Analyst | Marketing |
Production manager | Engineering |
Sr. Executive Assistant | Management |
Regulatory Manager | Law |
Finance Director | Finance |
Assistant manager | Management |
Solicitor | Law |
Billing Manager | Finance |
It is best to scan all the classifications. If there are only a few issues, you can correct them manually. If there are more, you can give the LLM some examples of correct classifications and reclassify everything.
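The checking and correcting described above can be sketched in base R. Everything here is illustrative: toy_jobs mimics a few rows of Table 2 plus one made-up off-list reply, allowed_groups stands in for names(job_groups), and the corrections and the few-shot wording are assumptions to adapt rather than a recommended prompt.

```r
allowed_groups <- c("Management", "Engineering", "Education", "Law", "Finance", "Other")

# A few rows shaped like Table 2; the last reply is a made-up off-list answer
toy_jobs <- data.frame(
  title = c("Technical Writer", "Librarian", "Solicitor", "Archivist"),
  group = c("Engineering", "Education", "Law", "I think this is Education.")
)

# 1. Flag replies that are not exactly one of the allowed groups
toy_jobs$valid <- toy_jobs$group %in% allowed_groups

# 2. Apply a handful of manual corrections via a named lookup
#    (the chosen corrections are illustrative, not the "right" answers)
corrections <- c("Technical Writer" = "Other", "Archivist" = "Education")
hit <- toy_jobs$title %in% names(corrections)
toy_jobs$group[hit] <- unname(corrections[toy_jobs$title[hit]])

# 3. Or turn the corrections into examples for a revised system prompt
examples <- paste0('"', names(corrections), '" -> ', corrections, collapse = "; ")
few_shot_prompt <- paste0(
  "Classify the input text to [", paste(allowed_groups, collapse = ", "),
  "]. Examples: ", examples, ". Just return the group only."
)
few_shot_prompt
```

The resulting string could be passed as the system_prompt when creating a new classifier, after which all titles are classified again and re-checked.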
2.3 Standardise data
In the survey, respondents entered their country by text entry. If you look at the entries below, you’ll notice that there are many variations of the same country, e.g. “United States” and “USA”.
salary_sample |>
  # there is one entry, which is an
  # answer to another question so remove for now
  filter(nchar(country) < 25) |>
  count(country, sort = TRUE) |>
  print(n = Inf)
# A tibble: 29 × 2
country n
<chr> <int>
1 United States 141
2 USA 128
3 US 34
4 Canada 23
5 United Kingdom 8
6 United States of America 8
7 U.S. 7
8 UK 7
9 Usa 5
10 Ireland 4
11 united states 4
12 usa 4
13 Australia 3
14 Germany 3
15 England 2
16 Netherlands 2
17 The Netherlands 2
18 Uk 2
19 United states 2
20 CANADA 1
21 China 1
22 France 1
23 IS 1
24 India 1
25 Israel 1
26 New zealand 1
27 Philippines 1
28 SWITZERLAND 1
29 U.S 1
You can use a method similar to Section 2.2 to standardise the country entries, but this has been conveniently wrapped as the function sai_fct_match in the SAI package.
country <- sai_fct_match(
  salary_sample$country,
  levels = c(
    "United States America",
    "Canada",
    "United Kingdom",
    "Ireland",
    "Iceland",
    "India",
    "Israel",
    "New Zealand",
    "Switzerland",
    "Philippines",
    "China",
    "France",
    "Netherlands",
    "Australia",
    "Germany"
  )
)
The above conversion can be seen in Table 3. Notice that while most entries are correct, not all were mapped correctly. Since there are only a few of these, you can correct them manually.
original | converted |
---|---|
United States | United States America |
Canada | Canada |
USA | United States America |
US | United States America |
United states | United States America |
United States of America | United States America |
New zealand | New Zealand |
United Kingdom | United Kingdom |
UK | United Kingdom |
Uk | United Kingdom |
usa | United States America |
Usa | United States America |
Germany | Germany |
Australia | Australia |
India | India |
Israel | Israel |
united states | United States America |
Netherlands | Netherlands |
SWITZERLAND | Switzerland |
U.S | United States America |
U.S. | United States America |
The Netherlands | NA |
I was brought in on this salary to help with the EHR and very quickly was promoted to current position but compensation was not altered. | United States America |
Ireland | Ireland |
France | France |
China | China |
England | United Kingdom |
IS | United States America |
Philippines | Philippines |
CANADA | Canada |
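The few wrong mappings in Table 3 can be patched with a small lookup keyed on the original entry. Below is a sketch in base R. Here toy_map reproduces a few rows of Table 3, and the fixes are assumptions to confirm against the raw responses: for instance, whether "IS" means Iceland or Israel is a judgement call, so it is set to NA here rather than guessed.

```r
# A few rows shaped like Table 3
toy_map <- data.frame(
  original  = c("USA", "The Netherlands", "IS", "England"),
  converted = c("United States America", NA, "United States America", "United Kingdom")
)

# Manual fixes keyed on the original entry; NA marks "unknown, do not guess"
fixes <- c("The Netherlands" = "Netherlands", "IS" = NA)

hit <- toy_map$original %in% names(fixes)
toy_map$converted[hit] <- unname(fixes[toy_map$original[hit]])
toy_map
```

The free-text response that answers a different question could be handled the same way, by mapping it to NA.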