class: center, middle, inverse, title-slide

.title[
# Large Language Models (LLMs)
]
.subtitle[
## Machine Learning in Economics
]
.author[
### Hüseyin Taştan
]
.institute[
### Yildiz Technical University
]

---
class: my-medium-font

<style type="text/css">
.remark-slide-content {
  font-size: 25px;
  padding: 1em 4em 1em 4em;
}
.my-large-font {
  font-size: 40px;
}
.my-small-font {
  font-size: 20px;
}
.my-medium-font {
  font-size: 25px;
}
</style>

# Lecture Outline

- [Evolution of AI](#intro)
- [What are Large Language Models?](#what)
- [How LLMs Work](#how)
- [Why Are LLMs Powerful?](#why)
- [Examples and Applications](#examples)
- [LLMs in Economics](#LLMecon)
- [Strengths and Weaknesses](#strweak)
- [Final notes](#final)

---
name: intro

# Evolution of Artificial Intelligence (AI)

- **1950s–1990s**: Rule-based AI (expert systems) using **if-then logic** and predefined knowledge. Example: early chess-playing programs.

- **1990s–2010s**: The rise of the machine learning (ML) approach. AI began learning from data instead of just following rules. Example: decision trees, regression models, and early neural networks.

- **2010s–Present**: Neural networks and deep learning (scaled-up ML). Models became multi-layered, mimicking human-like pattern recognition. Example: convolutional neural networks for image processing.

- **2017–Present**: Transformers and LLMs. The **transformer** architecture revolutionized AI by allowing models to efficiently process entire sequences of data (like sentences or paragraphs) all at once instead of step by step. This led to the development of large language models (LLMs) such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

---
# AI Paradigm Shifts: Why ML Succeeded

- **Logic-Based AI (Expert Systems & Symbolic AI)**
  - Relied on **hardcoded rules** rather than learning from data, which limited their ability to adapt to new data or scenarios.
  - Required manual knowledge encoding. Because they depended on specific rules, these systems often struggled with unexpected inputs or intricate problem-solving.
  - Led to **AI winters** (periods of stalled progress due to unrealistic expectations and lack of scalability).

- **Data-Driven AI (Machine Learning & Deep Learning)**
  - Learned **patterns from data**, making models adaptable and scalable.
  - Powered by statistical methods, regression models, and neural networks, ML systems began outperforming rule-based methods on increasingly complex tasks.
  - Enabled breakthroughs in several domains, including natural language processing (NLP), computer vision, and predictive modeling.

---
# Large Language Models (LLMs)

- **LLMs: The Next Step in ML Evolution**
  - Transformer-based models scale AI even further.
  - Leverage massive datasets through self-supervised learning.
- Self-Supervised Learning:
  - The model teaches itself patterns from raw text—without human-labeled answers.
  - Example: A language model learns to predict missing words in a sentence (e.g., "The economy will ........... next year.") by training on millions of texts.
  - The data itself provides structure, eliminating the need for manual labeling.

| **Learning Type**            | **Data Labeling**                       | **Example**                                      |
|------------------------------|-----------------------------------------|--------------------------------------------------|
| **Supervised Learning**      | Needs labeled data                      | Predicting house prices using labeled sales data |
| **Self-Supervised Learning** | No manual labels; learns from structure | Predicting missing words in sentences            |
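---
# Example: Self-Supervision as Counting

A minimal R sketch of the self-supervision idea (not an actual LLM): the raw text supplies its own "labels." Here we simply count which word follows the phrase "economy will" in a toy corpus; the corpus and phrase are made up for illustration.

```r
library(dplyr)

# Toy corpus: the raw text itself provides the training signal
corpus <- c(
  "The economy will grow next year",
  "The economy will grow rapidly",
  "The economy will decline next year",
  "Analysts say the economy will grow"
)

# The word after "economy will" acts as the label to predict
next_word <- sub(".*economy will (\\w+).*", "\\1", corpus)

# Empirical next-word distribution (no manual labeling required)
tibble(word = next_word) |>
  count(word) |>
  mutate(probability = n / sum(n))
```

A real LLM does the same thing at vastly larger scale, with subword tokens and a neural network in place of raw counts.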
---
name: what

# What are Large Language Models?

- Large Language Models (LLMs) are AI models designed to understand and generate human language through advanced neural network architectures.
- They are trained on huge amounts of text — including books, websites, articles, and more — giving them a broad knowledge base and an understanding of context.
- They can write essays, answer questions, summarize documents, and even generate code.
- Imagine a supercharged version of "autocomplete".
- Given a sentence like:
  - `"The economy will ______ next year."`
  - The LLM analyzes the context and predicts the most likely next word, or even generates a complete thought, such as "grow" or "decline", based on its training.

---
# Intuition

- **How the LLM Fills the Gap**:
  - The model examines the context (words before and after). This allows it to make an informed prediction based on the sentence structure and semantics.
  - It assigns probabilities to possible words by estimating the likelihood of each based on patterns it learned during training, e.g.:
    - `"grow"` → 85% probability
    - `"decline"` → 10% probability
    - `"stagnate"` → 5% probability
  - The highest-probability word ("grow") is chosen.
- **LLMs do not think like humans — they are pattern matchers**
  - Unlike humans, who may use intuition, emotions, and personal experience to interpret language, LLMs operate purely on mathematical patterns and statistical correlations derived from large datasets.

---
# Example

.pull-left[

```r
library(dplyr)
library(ggplot2)

# Define a sentence with a missing word
sentence <- "The economy will ___ next year."

# Simulated token predictions with probabilities
word_predictions <- data.frame(
  word = c("grow", "decline", "stagnate", "fluctuate"),
  probability = c(0.85, 0.10, 0.03, 0.02)  # Simulated probabilities
)

# Plot the token probability distribution
ggplot(word_predictions,
       aes(x = reorder(word, probability), y = probability, fill = word)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Token Prediction Probabilities",
       x = "Predicted Word", y = "Probability") +
  theme_minimal()
```
]

.pull-right[
<!-- plot output: horizontal bar chart of the simulated token probabilities -->
]

---
name: how

# How Do They Work?

- **Context Understanding**
  - LLMs analyze the context of the entire sentence rather than processing words in isolation.
  - They use the surrounding words to infer meaning (e.g., "economy" signals economic vocabulary).
- **Tokenization**
  - The input sentence is broken down into tokens, which can be words, subwords, or even characters.
  - For example:
    - "The economy will `___` next year." might be tokenized into ["The", "economy", "will", "`<mask>`", "next", "year"].
    - Here, `<mask>` is a special placeholder token used by the model to represent the *missing word* it needs to predict.
    - Think of `<mask>` as a blank space the model has to "fill in" by guessing which word fits best given the sentence context.
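---
# Example: A Toy Tokenizer

A word-level tokenizer in a few lines of base R, just to make the idea concrete. Production LLMs instead use learned subword schemes (such as BPE or WordPiece), so this is only a sketch.

```r
sentence <- "The economy will <mask> next year."

# Separate the final period, then split on whitespace
# (real tokenizers handle punctuation and subwords much more carefully)
tokens <- strsplit(gsub("\\.", " .", sentence), "\\s+")[[1]]
tokens
#> [1] "The"     "economy" "will"    "<mask>"  "next"    "year"    "."
```

Each token would then be mapped to an integer ID from the model's vocabulary before entering the network.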
---
# Transformers Architecture

- LLMs use a deep learning model called the **Transformer**.
- Unlike previous models that looked at words one after another, Transformers look at **all the words in a sentence at once** and decide which ones matter most for understanding.
- This ability lets Transformers handle **long texts more efficiently and accurately** than older models like RNNs.
- Transformers combine an **attention mechanism** with layers of neural networks for deeper understanding.
- Every token influences the prediction of the missing word (`___`) by contributing to the model's understanding, based on patterns learned from data.
- After processing, the model produces a list of raw scores for each possible word that could fill the gap.
- The **softmax function** converts these scores into probabilities.

---
# The Softmax Function

**Turning Raw Scores into Probabilities**

- After the Transformer calculates a raw score for each possible token, these scores need to be turned into probabilities.
- The **softmax function** does this by:
  - Exponentiating each score to make it positive,
  - Then dividing by the sum of all these exponentials,
  - Ensuring all resulting numbers are between 0 and 1,
  - And that the probabilities add up to 1.
- Softmax formula: `\(x_i\)` is the raw score for token `\(i\)`. The denominator sums over all tokens `\(j\)`:

`$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$`

---
# Softmax Example

.pull-left[
Suppose the model assigns raw scores to 3 possible words for the missing token:

| Word     | Score (raw) |
|----------|-------------|
| grow     | 2.0         |
| decline  | 1.0         |
| stagnate | 0.1         |

### **Step 1**: Exponentiate each score

- `\(e^{2.0} \approx 7.39\)`, `\(e^{1.0} \approx 2.72\)`, `\(e^{0.1} \approx 1.11\)`
]

.pull-right[
### **Step 2**: Sum the exponentials

- `\(7.39 + 2.72 + 1.11 = 11.22\)`

### **Step 3**: Divide each exponential by the sum

| Word     | Probability               |
|----------|---------------------------|
| grow     | 7.39 / 11.22 = 0.66 (66%) |
| decline  | 2.72 / 11.22 = 0.24 (24%) |
| stagnate | 1.11 / 11.22 = 0.10 (10%) |

Here, `grow` has the highest probability.
]
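---
# Softmax in R

The same calculation in a few lines of R; this `softmax()` is a direct sketch of the formula above and reproduces the worked example.

```r
# Softmax: raw scores -> probabilities
softmax <- function(x) exp(x) / sum(exp(x))

scores <- c(grow = 2.0, decline = 1.0, stagnate = 0.1)
round(softmax(scores), 2)
#>     grow  decline stagnate 
#>     0.66     0.24     0.10
```

Note that softmax depends only on *differences* between scores: adding the same constant to every raw score leaves the probabilities unchanged.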
---
# Why tokenize?

- **Tokens**: LLMs don't process entire sentences as single entities. Instead, they split text into smaller units called tokens (whole words, subwords, or individual characters).
- **Vocabulary**: Each LLM has a fixed list of tokens it recognizes, called its vocabulary. For example, the sentence "I love economics." might be tokenized into `["I", "love", "economics", "."]`. The vocabulary is the complete set of such tokens the model can understand.
- **Efficiency**: Tokenization reduces complexity, making language easier to process. Frequently occurring character sequences, such as "un-" or "-ing", can be stored as single subword tokens.
- **Handling Unknown Words**: If a word is not in the model's vocabulary, it can be broken down into smaller known tokens (e.g., "economics" could be split into "econom" and "ics").

---
# Embedding Representation

- **Word Embeddings**: Each token is represented as a vector in a high-dimensional space (typically 100–300 dimensions).
- **Embeddings capture semantic relationships**: words with similar meanings end up close together.

.pull-left[
- Words that appear in similar contexts are placed closer together; e.g., "supply", "demand", and "growth" appear close together, suggesting similar usage.
- In contrast, "inflation" and "interest" are not close together in this space, indicating distinct contexts of usage.
- Popular embedding models: Word2Vec (Google), GloVe (Stanford), FastText (Facebook)
]

.pull-right[
<img src="img/wordembed1.PNG" width="100%" style="display: block; margin: auto;" />
]

---
# Cosine Similarity

.pull-left[
- Cosine similarity measures the angle between two word vectors:
  - Close to 1: very similar
  - Close to 0: unrelated
  - Close to -1: opposite meanings (rare in word2vec)
- Formula:

`$$\text{cosine similarity}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$$`

where `\(\mathbf{u} \cdot \mathbf{v}\)` is the dot product and `\(\|\mathbf{u}\|\)` is the vector's length, `\(\|\mathbf{u}\|=\sqrt{\sum u_i^2}\)`.
]

.pull-right[
<img src="img/wordembed2.PNG" width="90%" style="display: block; margin: auto;" />

This 2D projection shows clusters that suggest similarity. But to truly measure similarity, we compute cosine similarity in the original embedding space.
]

---
# Example: Cosine similarity

.pull-left[

```r
library(dplyr)

word_vectors <- tibble::tibble(
  word = c("inflation", "interest", "deflation", "banana", "investment"),
  dim1 = c(0.9, 0.8, -0.8, 0.1, 0.85),
  dim2 = c(0.4, 0.6, -0.6, -0.1, 0.7)
)

# Cosine similarity function
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Reference vector for "inflation"
# (unlist, not as.numeric, to coerce the one-row tibble)
ref_vector <- word_vectors[word_vectors$word == "inflation", 2:3] |> unlist()

# Compute similarities to "inflation"
sims <- word_vectors |>
  rowwise() |>
  mutate(similarity = cosine_sim(ref_vector, c(dim1, dim2))) |>
  ungroup()

# Show similarities
sims |> select(word, similarity)
```
]

.pull-right[

```
## # A tibble: 5 × 2
##   word       similarity
##   <chr>           <dbl>
## 1 inflation       1    
## 2 interest        0.975
## 3 deflation      -0.975
## 4 banana          0.359
## 5 investment      0.964
```
]

---
# Why Vector Math in LLMs

**Question**: `"king - man + woman ≈ ?"`

.pull-left[
- The LLM's answer is `"queen"`.
- How does a machine know that **queen** relates to **king** and **woman**, but not to **banana**?
- Remember: LLMs don't "understand" text like humans. Instead, they represent each word or subword as a vector—a list of numbers capturing its meaning.
- Vector math lets the model combine and compare these meanings numerically, revealing relationships like gender and royalty.
]

.pull-right[
<img src="img/wordembed3.PNG" width="80%" style="display: block; margin: auto;" />
]

---
# Vector math in LLMs

- **Similar Words = Similar Vectors**:
  - Words used in similar ways in language will have similar vectors.
  - "interest" and "inflation" are used in economic contexts, so they end up close together in this vector space.
- **Relationships as Directions**: In this space, semantic relationships (like male ↔ female, present ↔ past) correspond to geometric directions.
- **Why vector math works**: The vector space encodes relationships as directions; thus, arithmetic on vectors lets the model infer related words, like `king - man + woman ≈ queen`
- What would `"France" - "Paris" + "Ankara"` give? (A toy version of this arithmetic is sketched two slides ahead.)

---
# Word Embeddings Encode Meaning Geometrically

<img src="img/wordvec1.PNG" width="90%" style="display: block; margin: auto;" />

**Male–Female**: the gender difference is a direction.

**Verb Tense**: Verbs like "walked" and "walking" differ in tense. That difference is also a direction in vector space. The model learns that "walked" is to "walking" as "swam" is to "swimming".

**Country–Capital**: The model sees many examples like "Madrid is the capital of Spain." It learns the country–capital relationship as a vector difference. So `"France" - "Paris" + "Ankara"` lands close to `"Turkey"`.
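---
# Example: Analogies with Toy Vectors

A toy check of the analogy arithmetic. The 2-D vectors below are made up so the geometry works out; real embeddings are learned from data and have hundreds of dimensions. As in word2vec practice, the three query words are excluded from the candidate answers.

```r
# Hand-made 2-D "embeddings" (illustration only)
emb <- rbind(
  king   = c(0.9, 0.3),
  man    = c(0.8, 0.1),
  woman  = c(0.2, 0.7),
  queen  = c(0.3, 0.9),
  banana = c(-0.7, 0.1)
)

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# king - man + woman = ?
target <- emb["king", ] - emb["man", ] + emb["woman", ]

# Rank the remaining words by similarity to the target vector
candidates <- setdiff(rownames(emb), c("king", "man", "woman"))
round(sort(sapply(candidates, function(w) cosine_sim(emb[w, ], target)),
           decreasing = TRUE), 3)
#>  queen banana 
#>  1.000 -0.179
```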
---
# To summarize

- LLMs work with **vectors**, not raw words.
- LLMs use these word vectors at the input layer and then refine them using **layers of neural networks**.
- So when the model predicts the next word in a sentence like `"The king and the ____ went to the castle,"` it "knows" that words like queen or prince are semantically related, based on vectors and context.
- Meaning is encoded geometrically:
  - Similar words = close points
  - Analogies = vector arithmetic
  - Context = dynamically updated vectors inside the model

---
# Contextual Representation

.pull-left[
- Words can have multiple meanings depending on context (e.g., "bank" as a financial institution or a riverbank).
- To handle this, LLMs adjust word representations dynamically based on the surrounding words — they don't treat each word as a fixed vector.
- This dynamic representation is achieved through a mechanism called **self-attention**, which learns how each word relates to every other word in the sentence.
- Link to the article: <https://arxiv.org/abs/1706.03762>
]

.pull-right[
<img src="img/attention.PNG" width="100%" style="display: block; margin: auto;" />
]

---
# Attention is all you need

- **The Problem with Previous Models (recurrent neural networks)**
  - Sequential Processing: Traditional sequence models (RNNs, LSTMs, etc.) process information sequentially, looking at one word at a time, in order.
  - This makes them slow to train because they can't process multiple words simultaneously (no parallelism).
  - Long-Range Dependencies: It's difficult for RNNs to capture relationships between words that are far apart in the sentence, because the information has to flow through many steps.
- **Attention as a Solution**
  - Instead of processing sequentially, attention lets the model look at all words in the sentence at once when understanding each word.
  - This helps the model capture relationships between distant words easily and allows for faster, parallel processing during training.

---
# The Transformer Architecture

.pull-left[
- No Recurrence: Transformers don't process text in order (like RNNs and convolutions); they rely fully on attention mechanisms.
- Encoder-Decoder Structure: The encoder reads the input sentence and creates meaningful representations. The decoder uses these to generate the output (like translating a sentence).
- Core Idea: Self-attention inside both the encoder and decoder helps the model understand context by relating all words to each other.
]

.pull-right[
<img src="img/transformer1.PNG" width="70%" style="display: block; margin: auto;" />
]

---
# Self-Attention

- Each word in a sentence plays three roles:
  - A query: What information am I looking for?
  - A key: What information do I have?
  - A value: The actual information I can share.
- When processing a word (the query), the model compares it to the keys of all other words to see which ones are relevant.
- The more relevant a word's key is to the query, the more its value contributes to the updated representation of that word.
- Weighted Sum: The model calculates a "relevance score" between the query and each key. These scores are used as weights to create a weighted sum of the values.
- This weighted sum is the context-aware representation of the word (a numeric sketch follows on the next slide).

???
Let's break down self-attention. Each word in a sentence plays three important roles here: query, key, and value. Think of the query as the question a word asks — 'What information do I need?' The key represents what information a word holds, and the value is the actual content or info it provides.

When the model processes a word — our query — it compares this to the keys of all other words to see which are relevant. The more relevant a word's key is, the more its value influences the current word's updated meaning.

To do this, the model calculates relevance scores — essentially weights — and uses those to compute a weighted sum of all values. This weighted combination creates a new, context-aware representation of the word that captures its relationship with every other word in the sentence.
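---
# Example: Self-Attention as a Weighted Sum

A bare-bones numeric sketch of the weighted-sum idea, with made-up 2-D query/key/value vectors for three tokens. It omits the scaling factor, masking, and the learned projection matrices of a real Transformer.

```r
softmax <- function(x) exp(x) / sum(exp(x))

# One row per token; columns are vector dimensions (toy numbers)
Q <- matrix(c(1, 0,
              0, 1,
              1, 1), ncol = 2, byrow = TRUE)      # queries
K <- Q                                            # keys (same as queries here)
V <- matrix(c(0.5, 0.1,
              0.2, 0.9,
              0.7, 0.7), ncol = 2, byrow = TRUE)  # values

scores  <- Q %*% t(K)                    # relevance of every token to every token
weights <- t(apply(scores, 1, softmax))  # each row now sums to 1
weights %*% V                            # context-aware representation per token
```

Each output row blends all the value vectors, weighted by how relevant the other tokens are to that token's query.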
---
# Output selection

- Analogy: Think of the model like a group project. When one student (the query word) works on their part, they consider the contributions of all other students (the keys and values) to create a coherent final product—the context-aware representation.
- Using these enriched representations, the model predicts the next word by calculating probabilities for all possible words and selects the one with the highest probability as the most likely continuation of the sentence.
- This process repeats step-by-step, generating the output word-by-word while continuously incorporating context from the entire sequence.

???
Now, thinking about output selection: Imagine the model as a group project. When one student — our current word — works on their part, they need to consider what all the other students have contributed. This means each word's new representation incorporates context from the entire sequence, making it richer and more connected.

Using these enriched representations, the model predicts the next word by calculating probabilities for all possible words. It then chooses the word with the highest probability — in other words, the most likely next word in the sentence.

This happens step-by-step: after each predicted word, the model updates its context and repeats the process, building output word-by-word while always considering the full sentence context.

---
# Multi-Head Attention: Diverse Relationships

- Multiple Perspectives:
  - Instead of a single attention calculation, the Transformer uses multiple "attention heads".
- Different Relationships:
  - Each head learns to focus on different types of relationships between words.
  - For example, one head might focus on syntactic relationships, while another focuses on semantic relationships.
- Analogy: Imagine different experts analyzing a sentence. One expert might focus on the grammar, another on the meaning, and another on the tone. Multi-head attention allows the model to capture these different perspectives.

---
# Long-Distance Dependencies

.pull-left[
- Explanation: Notice how some of the attention heads for "making" focus on the words "more difficult," even though they are separated by several words.
- This indicates that the model is recognizing the relationship between "making" and "more difficult" to understand the complete phrase "making...more difficult".
- The different colors represent different attention heads, each potentially capturing a different aspect of the relationship.

<!-- - Self-attention can effectively model relationships between words regardless of their distance in the sentence, overcoming a limitation of recurrent models.
-->
]

.pull-right[
<img src="img/attention1.PNG" width="110%" style="display: block; margin: auto;" />

- This figure illustrates how the attention mechanism can capture long-range dependencies in a sentence. Specifically, it shows the attention weights for the word "making" in the sentence. (Figure 3 in the article)
]

---
# Example: Understanding "Gül" with Attention

.pull-left[
- In Turkish, the word **"gül"** (or its forms) can mean:
  - A flower 🌹
  - To laugh 😂
  - A person's name 👩
- How does an LLM know which one?
- **Self-attention** compares the word to its **surrounding context**.
- The meaning depends on:
  - Which words it attends to (keys)
  - How much weight it gives them (attention scores)
]

.pull-right[
1. `Bahçede güzel bir **gül** açtı.` ("A beautiful rose bloomed in the garden.")
   👉 attends to **"bahçede"**, **"güzel"**, **"açtı"** → 🌹 (flower)
2. `Komik fıkraya öyle bir **güldü** ki gözlerinden yaş geldi.` ("She laughed so hard at the funny joke that tears came to her eyes.")
   👉 attends to **"komik"**, **"fıkra"**, **"yaş"** → 😂 (laugh)
3. `**Gül** Hanım toplantıya katılamadı.` ("Ms. Gül could not attend the meeting.")
   👉 attends to **"Hanım"**, **"toplantı"**, **"katılamadı"** → 👩 (person's name)
]

---
# From Attention to Language Understanding

- Each word gets a **contextual vector** through attention.
- By **stacking many Transformer layers**, the model builds up:
  - Grammar and syntax
  - Semantic roles (subject, object, etc.)
  - Deeper meaning (reasoning, implication)
- This is how LLMs can:
  - Answer questions based on context
  - Resolve pronouns ("she," "it," "they")
  - Follow complex instructions

---
# From Attention to Language Understanding

.pull-left[
> **Example**:
>
> `Ali met with Ayşe. She shared the report.`
>
> Who does "she" refer to?
>
> ✅ The model uses attention to track entities and **choose the most likely referent**.
]

.pull-right[
> **LLMs don't just look at one word.**
>
> They consider **the entire sentence**, and even earlier context, to assign meaning.
>
> 🔁 Every word attends to every other word — across layers.
>
> This is what gives LLMs their **"understanding"** power.
]

---
# Context Window

- **Large Language Models (LLMs) process a limited chunk of text at any one time.**
- This chunk is called the **context window**.
- The context window is measured in **tokens**.
- The model can only "see" what fits in its window.
- Information *outside* the window is not used—LLMs don't truly "remember" entire documents.
- In **long documents or conversations**, as you add more text, the earliest parts may drop out of the window and be forgotten.

---
# How Large is the Context Window?

.pull-left[

| Model        | Max Context Window    | Architecture |
|:-------------|:----------------------|:-------------|
| **BERT**     | 512 tokens            | Encoder-only |
| **GPT-3**    | ~2,000 tokens         | Decoder-only |
| **GPT-4**    | up to ~128,000 tokens | Decoder-only |
| **Claude 3** | up to ~200,000 tokens | Decoder-only |

- 🧾 One book chapter ≈ 3,000 tokens
- 📄 100 pages ≈ 50,000 tokens
]

.pull-right[
- 🧠 **BERT** = *Bidirectional Encoder Representations from Transformers*
  - Strong for **classification, QA, embeddings**.
  - Cannot generate text beyond the input window.
- ✍️ **GPT models** (Generative Pre-trained Transformers):
  - Process tokens **left-to-right**, suitable for **long-form generation**.
  - Larger models (GPT-4, Claude 3) support **very long documents**.
]

???
Let's talk about **context window** — this is the number of tokens the model can "see" or consider at one time.

- First, **BERT** stands for *Bidirectional Encoder Representations from Transformers*.
It uses only the encoder part of the transformer and looks at the entire input sentence in both directions — left and right — using **masked language modeling**. That means some words are hidden and the model tries to predict them using the surrounding context.
  - Trained using **masked language modeling**.
  - Context is processed **bidirectionally** (both left and right). When we say **bidirectional**, we mean that **BERT processes the entire input sentence at once** — it can see both the **preceding** and **following** words when interpreting a token. For example, to understand the word "bank," BERT looks at the words *before* and *after* it in the sentence to decide whether we mean a riverbank or a financial institution. This makes it very strong for **understanding tasks**, like sentiment classification or question answering.
- However, BERT has a **limited window** — just 512 tokens. That's usually fine for tasks like classification, question answering, or sentence embeddings, but not enough for reading long documents.
- On the other hand, **GPT models** like GPT-3 and GPT-4 are **decoder-only**. They are trained autoregressively — that is, they predict the next word one at a time, only looking at previous tokens. This is better suited for **generating** text.
- **GPT-3** has a relatively small context window, around 2,000 tokens. But GPT-4 has massively expanded that, supporting up to 128,000 tokens. That's the equivalent of reading an entire academic paper — or several — at once.
- **Claude 3** reportedly supports up to **200,000 tokens**, which is huge — about 100 pages of text. This means it can process long-form content like reports, contracts, or books.

The takeaway: **BERT is better for understanding**, GPT models are better for **generating**, and the **context window size** is a key design difference depending on the task you're trying to solve.

---
# Why is this a Limitation?

- **Can't analyze all of a long document at once.**
- May "forget" or lose track of earlier conversation in long chats.
- **Economics example:** If you want to analyze a year of bank reports, it won't all fit in one window.
- **Chunking:** Summarize or split long documents (see the sketch on the next slide).
  - Break up long documents and summarize or analyze them section by section.
  - Focus the prompt on the most important parts.
- **Ongoing Research:**
  - Newer models are expanding window size.
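---
# Example: Chunking a Long Document

A minimal chunking sketch in base R: split a long text into window-sized pieces, using whitespace-separated words as a rough stand-in for tokens (true token counts depend on the model's tokenizer). Each chunk could then be summarized separately and the summaries combined.

```r
# Toy "long document" (illustration only)
long_text <- paste(rep("inflation expectations remain anchored", 200),
                   collapse = " ")

words      <- strsplit(long_text, "\\s+")[[1]]
chunk_size <- 200                  # pretend the window fits 200 words

chunk_id <- ceiling(seq_along(words) / chunk_size)
chunks   <- tapply(words, chunk_id, paste, collapse = " ")

length(chunks)                     # 800 words -> 4 window-sized chunks
```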
---
# How Big Are LLMs?

- **LLMs (like ChatGPT) are called "large" for a reason:** they have **billions** of *parameters*, mainly the weights of the network.
- The more parameters, the more complex the patterns the model can capture.
- **More parameters mean:**
  - The model can remember more patterns and rules about language.
  - But they also require more data, computing power, and energy to train.

| Model        | # Parameters             |
|:-------------|:-------------------------|
| GPT-2        | 1.5 billion              |
| GPT-3        | 175 billion              |
| GPT-4/Claude | Estimates: 500+ billion* |
| Small models | ~100 million             |

> \*Exact sizes for the latest models aren't always public.

---
name: why

# Why Are Transformers So Powerful?

Transformers have become the foundation for modern language models thanks to three key advantages:

- **Scalable:** Performance increases as we add more data and computational resources (leading to models like GPT-4, Gemini, etc.).
- **Self-supervised Learning:** Trained directly on vast raw text—no need for costly labeled datasets.
- **General-purpose:** Once trained, a single model can translate languages, summarize articles, answer questions, generate code, and more.

*This flexibility and power are why Transformers are used in almost every major AI system today.*

---
# Pretraining: Learning Language from Scratch

- LLMs are trained on massive text datasets: Wikipedia, books, news, websites, code — **trillions of words**
- The core task: **Next-token prediction**
> Given a sequence of words, predict the next one
> `"Inflation is likely to..." → "increase"`
- No labels are required — this is **self-supervised learning** — the model trains itself by hiding words and guessing them.
- The model learns:
  - Grammar and sentence structure
  - Factual knowledge
  - Patterns of reasoning and common sense

---
# Fine-Tuning: Adapting to Specific Tasks

- After pretraining, the model is **adapted to follow instructions**
- Two main techniques:
  1. **Supervised fine-tuning**
  > Train on question–answer or input–output examples, summaries, etc.
  2. **Reinforcement Learning from Human Feedback (RLHF)**
  > Humans rank outputs → the model learns to prefer better responses
- This makes the model:
  - More helpful
  - Better aligned with human intent
  - Able to follow user instructions (prompts):
> "Summarize this article."
> "Write an email to my professor."

---
# What Do LLMs Learn During Training?

- Pretraining: Learn to **model language**
> Word patterns, relationships, context, knowledge
- Fine-tuning: Learn to **behave helpfully**
> Answer questions, follow instructions, be polite
- LLMs don't memorize everything.
> They **generalize** patterns from their training data
- They are good at: text generation, question answering, translation, and reasoning (to a degree)

---
name: examples

# Examples of LLMs

| Model  | Creator         | Year | Notes                                |
|--------|-----------------|------|--------------------------------------|
| GPT-4  | OpenAI          | 2023 | Used in ChatGPT, Bing, Copilot       |
| Claude | Anthropic       | 2023 | Prioritizes safe, steerable outputs  |
| Gemini | Google DeepMind | 2023 | Multimodal (text + images, video)    |
| LLaMA  | Meta            | 2023 | Open-source, customizable foundation |

*Many other LLMs exist—these are just some major examples.*

---
# LLMs in Action

- **Summarization:**
  - *E.g.*: Summarize 10-page research papers into key points.
- **Translation:**
  - *E.g.*: Instantly convert business emails between English, Spanish, and Japanese.
- **Question Answering:**
  - *E.g.*: Explain "dark matter" at a 10th-grade level.
- **Code Generation:**
  - *E.g.*: Write a Python script to analyze sales data.
- **Conversation:**
  - *E.g.*: Power chatbots on e-commerce sites to handle customer queries 24/7.

*LLMs streamline tasks, boost productivity, and open up new ways to access and generate information.*

---
# Business Impact of LLMs and AI

- **Automate and scale** customer support, content creation, and business communications.
- **Personalize** marketing, sales, and user experiences—at an unprecedented scale.
- **Unlock insights** from massive volumes of unstructured data (e.g., emails, documents, social media).
- **Accelerate** product development, research, and innovation pipelines.
- **Enable new business models** and unlock entirely new digital services.

*AI and LLMs are not just tools—they are catalysts reshaping how organizations operate and compete.*

---
# Beyond Text: The Rise of Multimodal AI

- **Multimodal models** understand and generate multiple types of data: text, images, audio, and video.
- Examples of capabilities:
  - **Image understanding:** Describe photos, identify objects, or analyze medical scans.
  - **Speech processing:** Convert speech to text, generate natural-sounding voice responses.
  - **Video analysis:** Summarize content, detect events, or generate captions.
- Impact across industries:
  - **Healthcare:** AI-assisted imaging diagnostics and patient interaction.
  - **Media & Entertainment:** Automated video editing, subtitle creation, and content generation.
  - **Education:** Interactive learning with voice and visual aids.
  - **Customer Service:** Chatbots that can "see" images and hear customer voices.

*Multimodal AI is transforming how machines perceive and interact with the world, enabling richer, more natural experiences.*

---
# Prompt Engineering: How to Talk to LLMs

- A **prompt** is the input text or instructions you give an LLM.
- Writing good prompts can dramatically improve output quality.
- Three common prompt styles:
  - **Zero-shot:** Directly ask a question or task, no examples given
> _"Translate this sentence to French."_
  - **Few-shot:** Show the model a few examples first to guide its style or reasoning
> _"English: Hello\nFrench: Bonjour\nEnglish: Thank you\nFrench: Merci\nTranslate: Good morning"_
  - **Chain-of-Thought:** Ask the model to explain its reasoning step by step
> _"Explain step-by-step how to solve this math problem."_

*The way you phrase your prompt affects how the model understands and responds!*

---
# Tips for Crafting Effective Prompts

- Be **clear** and **specific** about what you want.
- Include **examples** when possible (few-shot learning).
- Specify the **desired output format** (e.g., "in bullet points," "as a summary").
- Experiment with prompt **length**—sometimes providing more context improves results.
- Give **explicit instructions** to reduce ambiguity
> Instead of "Summarize this," say "Summarize this article in 3 sentences."
- Pay attention to **tone:** LLMs respond differently to polite, formal, or casual requests.
- Even small changes in wording can lead to very different outputs!

*Prompting is an iterative process—try, refine, and see what works best!*

---
name: LLMecon

# LLMs in Economics

- **Sentiment Analysis**
  Analyze central bank speeches, financial news, and reports to detect tone and market sentiment (a simple pre-LLM baseline is sketched on the next slide).
- **Survey Analysis**
  Process open-ended survey responses at scale to extract themes and opinions.
- **Policy Briefs**
  Automatically summarize large economic reports or complex documents into concise briefs.
- **Synthetic Data**
  Generate realistic, privacy-preserving fake datasets for testing models or simulations.
- **Chat-based Tutoring**
  Provide interactive, on-demand explanations of economic concepts and data to students or analysts.
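---
# Example: Dictionary-Based Sentiment

A minimal dictionary-based baseline for scoring tone, the pre-LLM approach: count positive and negative words. The statements and the tiny lexicon below are made up for illustration; an LLM would instead read each full sentence in context.

```r
library(tidytext)
library(dplyr)

# Hand-made sentiment lexicon (illustration only)
lexicon <- tibble(
  word      = c("strong", "growth", "improving", "weak", "recession", "stress"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative")
)

statements <- tibble(
  doc  = c("statement_1", "statement_2"),
  text = c("Growth remains strong and employment is improving.",
           "Recession risks and financial stress are rising.")
)

# Tokenize, match against the lexicon, and count by document
statements |>
  unnest_tokens(word, text) |>
  inner_join(lexicon, by = "word") |>
  count(doc, sentiment)
```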
---
name: strweak

# Strengths and Weaknesses

.pull-left[
✅ **Good at writing and explaining**
LLMs generate fluent text and clarify concepts clearly.

✅ **Work across many domains**
They handle diverse topics without task-specific retraining.

✅ **Can learn from few examples**
Few-shot prompting lets them adapt quickly to new tasks.

✅ **Very flexible (one model, many tasks)**
A single model can do translation, summarization, coding, and more.
]

.pull-right[
⚠️ **Sometimes make things up ("hallucinate")**
They may produce plausible but incorrect or nonsensical information.

⚠️ **Don't truly understand meaning**
LLMs predict text statistically, lacking real comprehension.

⚠️ **Can reflect biases in the data**
They may amplify stereotypes or biased viewpoints.

⚠️ **Expensive to train and run**
They require significant compute resources and energy.
]

---
# Responsible Use of LLMs

- **Not a source of truth:** Always verify outputs against trusted resources.
- **Bias and hallucination risks:** Models can reflect societal biases and sometimes generate false but convincing information.
- **Privacy matters:** Don't enter sensitive or confidential data.
- **Support, not substitute:** Use LLMs as a tool to **augment** your understanding, not replace your own learning or critical thinking.
- **Stay curious:** Seek to understand the "why" behind answers, and use LLMs to *guide* further study, not as the endpoint.

---
name: final

# Final Notes 🚀

- 🤖 **LLMs and AI are evolving fast**, creating new opportunities and challenges in every field.
- 💡 **Future models will be more powerful and interactive, with stronger reasoning abilities.**
- 🔄 **Jobs and tasks will change**, but new roles are emerging that value creativity, judgment, and adaptability.
- 📈 **Applications in economics and beyond are growing**—from research and analysis to policy-making and education.
- 🧠 **Stay informed and adaptable:** learning about AI is a continuous journey.
- 🤝 **The ability to understand, work with, and guide these technologies will be a key advantage.**
- 🎯 **Keep building your critical thinking, communication, and problem-solving skills**—these remain essential, no matter how AI advances!

---
# Links to Further Reading

- **Deep Learning for Economists**
  Comprehensive notes focusing on economic applications:
  <https://econdl.github.io/>
- **Neural Networks and Deep Learning (3Blue1Brown)**
  Intuitive video series explaining core neural network concepts:
  <https://www.3blue1brown.com/topics/neural-networks>
- **Transformers (Jay Alammar's Illustrated Transformer)**
  Visual and accessible guide to transformer architecture:
  <https://jalammar.github.io/illustrated-transformer/>
- **Hugging Face LLM Course**
  An in-depth, practical course covering large language models and transformers:
  <https://huggingface.co/course>