Landscape of Large Language Models

Introduction to Large Language Models for Statisticians

Emi Tanaka

15th October 2024

ChatGPT

  • ChatGPT was released to the public on 30th November 2022.
  • ChatGPT gained a staggering 100 million active users within 2 months.

ChatGPT Heralds an Intellectual Revolution

25 February 2023

By Henry Kissinger, Eric Schmidt, and Daniel Huttenlocher

Generative artificial intelligence presents a philosophical and practical challenge on a scale not experienced since the start of the Enlightenment.

A new technology bids to transform the human cognitive process as it has not been shaken up since the invention of printing. The technology that printed the Gutenberg Bible in 1455 made abstract human thought communicable generally and rapidly. But new technology today reverses that process. Whereas the printing press caused a profusion of modern human thought, the new technology achieves its distillation and elaboration. In the process, it creates a gap between human knowledge and human understanding. If we are to navigate this transformation successfully, new concepts of human thought and interaction with machines will need to be developed. This is the essential challenge of the Age of Artificial Intelligence.

Mr. Kissinger served as secretary of state, 1973-77, and White House national security adviser, 1969-75. Mr. Schmidt was CEO of Google, 2001-11 and executive chairman of Google and its successor, Alphabet Inc., 2011-17. Mr. Huttenlocher is dean of the Schwarzman College of Computing at the Massachusetts Institute of Technology. They are authors of “The Age of AI: And Our Human Future.” The authors thank Eleanor Runde for her research.

ChatGPT Demo

https://chatgpt.com/

Chatbot timeline

A browser refresh may be needed for the timeline to render correctly.

Data Source: https://en.wikipedia.org/wiki/List_of_chatbots (Accessed on 11/08/2024)

Rise of large language models



A browser refresh may be needed for the timeline to render correctly.

Data Source: https://en.wikipedia.org/wiki/Large_language_model#List (Accessed on 11/08/2024)

Using an LLM

Vendor API

  • Requires internet access
  • Requires account with vendor
  • Ongoing payment for usage

Local LLM

Example tools: GPT4All, LM Studio, Jan, llama.cpp, llamafile, Ollama, NextChat

  • No internet access required
  • No account required
  • Several GB of hard disk space required
  • At least 16GB RAM required for 7B-parameter LLMs

OpenAI

Models:

  • chatgpt-4o-latest
  • gpt-4o
  • gpt-4o-mini
  • gpt-3.5-turbo
  • dall-e-3
  • text-embedding-ada-002
  • tts-1-hd
  • whisper-1

API access via cURL:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'



  • Pricing ranges from ~US$0.75 to ~US$75.00 per 1 million tokens, depending on the model used.
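  • A back-of-envelope sketch of what this costs, using the common rule of thumb that 1 token ≈ 0.75 words and the cheapest rate above:

words <- 1500                  # length of a document in words
tokens <- words / 0.75         # ~2000 tokens
price_per_1m <- 0.75           # US$ per 1 million tokens (cheapest rate above)
tokens / 1e6 * price_per_1m    # ~US$0.0015 to process the document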

Ollama

Models:

  • llama3.2:1b
  • llama3.2:3b
  • llama3.1:8b
  • llama3.1:70b
  • llama3.1:405b
  • gemma2:2b
  • gemma2:9b
  • gemma2:27b
  • llava:7b

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'


  • Pricing: FREE.
  • The model name suffix gives the number of parameters, e.g. llama3.1:8b has about 8 billion parameters.
  • A larger number of parameters requires more RAM (~16GB for a 7B model).

Which LLM to use?

  • Consult LLM leaderboards, e.g. the Open LLM Leaderboard and LLM Arena.
  • Models with fewer parameters may not perform as well on complex tasks.
  • There are some specialised LLMs, e.g.
    • dall-e-3 generates images from user text input,
    • llava:7b is a multi-modal model that can take image input,
    • mathstral:7b is designed for mathematical reasoning and scientific discovery,
    • deepseek-coder-v2:16b is comparable to gpt-4-turbo on code-specific tasks, and
    • meditron:7b is adapted from llama2 for the medical domain.
  • Currently I mostly use:
    • OpenAI: gpt-4o, gpt-4o-mini and dall-e-3 (US$5 paid so far)
    • Ollama: llama3.1:8b and llava:7b

Demo #1

Predictive model

  • At its core, a large language model predicts the next word (more precisely, the next token) given a sequence of words.

Input: All models are wrong, but some are
LLM predicts: useful

Updated input: All models are wrong, but some are useful
LLM predicts: .

Updated input: All models are wrong, but some are useful.
LLM predicts: <|end|>

Output: useful.
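  • A minimal sketch of this autoregressive loop in R; predict_next_token() is a hypothetical stand-in for a call to a real LLM:

generate <- function(prompt, max_tokens = 100) {
  text <- prompt
  for (i in seq_len(max_tokens)) {
    nxt <- predict_next_token(text)  # hypothetical: returns the next token
    if (nxt == "<|end|>") break      # stop at the end-of-sequence token
    text <- paste0(text, nxt)        # append and feed the text back in
  }
  text
}
# generate("All models are wrong, but some are")
# would return "All models are wrong, but some are useful."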

Tokenization

Input: Where there’s a will, there’s a

Token:    Where | there | ’s  | a   | will | ,  | there | ’s  | a
Token ID: 11977 | 1354  | 802 | 261 | 738  | 11 | 1354  | 802 | 261

The LLM predicts token ID 2006, which decodes to " way".

Tokens are not necessarily words

  • Every LLM has its own tokenizer with varying vocabulary size.
  • Rule of thumb for common English text: 1 token = ~4 characters = ~0.75 words

For example, the words "summary summarise summarize summarising summarizations" may be split into the tokens

summary | summ ar ise | summ ar ize | summ ar ising | summ ar izations

each of which maps to a token ID in the tokenizer’s vocabulary (e.g. 3861, 141249, 277, 1096, 750, 5066, 25434).


Special tokens

  • In practice, summarise may be tokenized as summ ##ar ##ise, where “##” indicates that the token is a continuation of the previous token.
  • With the popularity of chat models since 2023, tokenizers have adapted in a conversational direction with special tokens that indicate the speaker role (a sketch of how these assemble a conversation follows below), such as
    • <|system|> – high-level instructions,
    • <|user|> – user (usually human) queries or prompts,
    • <|assistant|> – typically the model’s response, and
    • others, e.g. <|tool|>.
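  • A minimal sketch of how a conversation might be flattened into a single token stream; the exact special tokens and layout vary by model, and the role markers below are the generic ones listed above:

messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user",   content = "Hello!")
)
prompt <- paste(
  vapply(messages,
         function(m) sprintf("<|%s|>%s", m$role, m$content),
         character(1)),
  collapse = "\n"
)
cat(prompt, "<|assistant|>", sep = "\n")  # the model generates after <|assistant|>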

Token distribution

Input: Where there’s a will, there’s a

  • Output is randomly sampled from likely tokens weighted by their respective probabilities.
  • Original distribution: the candidate next tokens (IDs 2006, 301, 35, 4443 and 4), each with its own probability.
  • Small top_p: sampling is restricted to the smallest set of most likely tokens whose cumulative probability exceeds top_p (here only IDs 2006 and 301 remain).
  • High temperature: the probabilities of the candidate tokens (IDs 2006, 301, 35, 4443 and 4) are flattened, so less likely tokens are sampled more often.
  • seed ensures the same random sample given the same input (important for reproducibility!), but the same seed may not yield the same result across different systems.
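  • A sketch of these sampling controls using the elmer package (introduced in the next demo); temperature, top_p and seed are options accepted by the Ollama API:

# Near-deterministic: temperature 0 concentrates sampling on the likeliest token
chat_greedy <- elmer::chat_ollama(model = "llama3.1:8b",
                                  seed = 1,
                                  api_args = list(temperature = 0))
# More varied: a high temperature flattens the distribution, while a small
# top_p restricts sampling to the most likely tokens
chat_varied <- elmer::chat_ollama(model = "llama3.1:8b",
                                  seed = 1,
                                  api_args = list(temperature = 1.5,
                                                  top_p = 0.5))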

Demo #2

Prompt engineering

  • Prompt engineering is the process of designing and refining prompts so that a language model generates the desired type of output.
chat1 <- elmer::chat_ollama(model = "llama3.1:8b", 
                            seed = 1,
                            api_args = list(temperature = 0),
                            echo = TRUE)
  • Suppose we want to classify the sentiment of the text as positive or negative.
chat1$chat('Classify the text as positive or negative.
           "This is a great book!"') 
I would classify this text as **positive**. The use of the word "great" 
indicates a strong and enthusiastic endorsement of the book.

Instruction-based prompt

  • You can give more specific instructions regarding the format.
chat2 <- elmer::chat_ollama(model = "llama3.1:8b", 
                            system_prompt = "Just give the answer.",
                            seed = 1,
                            api_args = list(temperature = 0),
                            echo = TRUE)
  • The response is now concise.
chat2$chat('Classify the text as positive or negative.
           "This is a great book!"')
Positive.
  • In this case, a similar answer would be generated if the system prompt were included in the user prompt.

Zero-shot prompt

  • When an LLM is given no examples of the task to carry out, the prompt is referred to as a zero-shot prompt.
chat2$chat('Which discipline does machine learning belong to?')
Computer Science.
chat2$chat('Which discipline does logistic regression belong to?')
Statistics and Machine Learning.


  • An LLM can be fine-tuned by training on new labelled data, but this is computationally expensive and out of scope for typical analysts.
  • Prompt engineering with one- or few-shot prompts is a low-cost approach that provides in-context learning.

In-context learning

  • If examples of the expected response are provided in the prompt, it is called a one-shot prompt (if one example) or a few-shot prompt (if more than one example).
chat2$chat('Which discipline does machine learning belong to?
           Examples:

           Technique: Regression
           Discipline: Statistics
           
           Technique: Deep learning
           Discipline: Data Science
           
           Technique: Database Management
           Discipline: Computer Science')
Data Science.
chat2$chat('Which discipline does logistic regression belong to?')
Statistics and Machine Learning, but more specifically, it belongs to the 
discipline of Statistics.

Chain-of-thought

  • Chain-of-thought aims to make the LLM “think” before answering.

  • Reasoning is a core component of human intelligence, and LLMs can mimic “reasoning” via memorisation and pattern matching learned from a large corpus of text.

chat2$chat('Regression is a primary tool for statisticians. 
           Which discipline does logistic regression belong to?')
Statistics.
  • Here is another example where we want to calculate the median:
chat2$chat("What is the median of 1, 4 and 5?")
3.
  • Provide a cue for the reasoning.
chat2$chat("The median is the middle value of the sorted list of numbers. 
           What is the median of 1, 4 and 5?")
Since the numbers are already in order (1, 4, 5), the median is the middle 
number, which is 4.
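  • Another common cue is to ask the model explicitly to reason before answering (a zero-shot chain-of-thought prompt):

chat2$chat("Think step by step, then give the final answer.
           What is the median of 1, 4 and 5?")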

Prompt engineering components

Some common components include (a combined sketch follows the list):

  • Instruction The task (be as specific as possible).
  • Format The response format. E.g. “return the value only” and “return as JSON”.
  • Example Example(s) of input and expected response.
  • Context Additional information about the context of the task.
  • Persona Describe the role of the LLM. E.g. “You are a statistics tutor”.
  • Audience Describe the target audience. E.g. “Explain it to a high school student”.
  • Tone The tone of the text. E.g. “Respond professionally”.
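  • A sketch combining several of these components in one elmer chat; the prompt text is illustrative:

chat3 <- elmer::chat_ollama(
  model = "llama3.1:8b",
  system_prompt = paste(
    "You are a statistics tutor.",           # persona
    "Explain it to a high school student.",  # audience
    "Respond professionally."                # tone
  )
)
chat3$chat(paste(
  "Explain what a p-value is.",              # instruction
  "Answer in no more than two sentences."    # format
))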

Self-consistency

  • To make the responses more reliable, you can generate a response to the same prompt multiple times and select the mode (the most common answer), as sketched after the examples below.
chat2$chat("What is the mean of 1, 9 and 5?")
(1 + 9 + 5) / 3 = 15 / 3 = 5.
chat2$chat("What is the mean of 1, 9 and 5?")
15.
chat2$chat("What is the mean of 1, 9 and 5?")
(1 + 9 + 5) / 3 = 15 / 3 = 5.
chat2$chat("What is the mean of 1, 9 and 5?")
15/3 = 5.
chat2$chat("What is the mean of 1, 9 and 5?")
15.
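  • A minimal sketch of self-consistency, assuming (as the calls above suggest) that $chat() returns the response as a string; a fresh chat is created for each draw, with a non-zero temperature so that answers can vary:

answers <- vapply(seq_len(5), function(i) {
  chat <- elmer::chat_ollama(model = "llama3.1:8b",
                             system_prompt = "Just give the answer.",
                             api_args = list(temperature = 0.7),
                             echo = FALSE)
  chat$chat("What is the mean of 1, 9 and 5?")
}, character(1))
names(which.max(table(answers)))  # select the most common answer (the mode)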







🤔 Pondering (for later)

  • How do LLMs complement or hinder statistical thinking?
  • What role should LLMs play in decision-making processes and research?
  • How will LLMs impact the training and development of future statisticians and data scientists?
  • What are the use cases of LLMs for you (if any)?