STAT0007: Introduction to Applied Probability (Level 5) or Stochastic Processes (Level 6)

Dr. Elinor M Jones

Contents

Part I: Week 0
1 Overview of STAT0007
  1.1 Content of the course
  1.2 How this course will be run
  1.3 Useful books

Part II: Week 1
2 Preamble
  2.1 Mathematical/Statistical Writing

Part III: Week 1 (Continued)
3 Preliminaries
  3.1 What is a stochastic process?
  3.2 Random variables
  3.3 Conditioning on events
  3.4 Generating functions

Part IV: Week 2
4 Introduction to discrete time Markov processes
  4.1 The Markov property
  4.2 Introduction to discrete time Markov chains
  4.3 The idea of ‘first step decomposition’

Part V: Week 3
5 Classifying discrete time Markov processes
  5.1 Irreducible classes of intercommunicating states
  5.2 Recurrence and transience
  5.3 Periodicity
  5.4 Class properties
  5.5 Further properties of discrete time Markov chains
  5.6 Closed classes and absorption

Part VI: Week 4
6 Long-run behaviour of discrete time Markov processes
  6.1 Invariant distributions
  6.2 Equilibrium distributions
  6.3 Invariant vs equilibrium distributions
  6.4 Ergodicity and the equilibrium distribution

Part VII: Week 5
7 Checking the Markov property and solving problems
  7.1 The idea of ‘first step decomposition’
  7.2 How can I tell if a process is Markov?

Part VIII: Get ahead material (Reading week)
8 Preliminaries for continuous time Markov processes
  8.1 Introduction to continuous-time Markov processes
  8.2 The exponential distribution and continuous time Markov processes

Part IX: Week 6
9 Holding times, jump chains and transition probabilities for continuous time Markov processes
  9.1 Holding times
  9.2 The jump chain
  9.3 Transition probabilities

Part X: Week 7
10 The generator matrix and long run behaviour
  10.1 Kolmogorov’s equations
  10.2 The generator matrix, 𝑄
  10.3 Limiting behaviour

Part XI: Week 8
11 Poisson processes
  11.1 The Poisson process
  11.2 The generator matrix for the Poisson process
  11.3 Other properties of the Poisson process
  11.4 Superposition and thinning of a Poisson process

Part XII: Week 9
12 Birth-death processes part 1
  12.1 Birth and death processes
  12.2 Transition rates and the jump chain
  12.3 Important types of birth-death processes

Part I

Week 0


Chapter 1

Overview of STAT0007

1.1 Content of the course

This module aims to provide an introduction to the study of systems which

change state stochastically with time and to facilitate the development of skills

in the application of probabilistic ideas. It is primarily intended for second and

third year students registered on the undergraduate degree programmes offered

by the Department of Statistical Science, or jointly with other departments.

On successful completion of the module a student should understand the Markov

property in discrete and continuous time; for discrete-time Markov chains, be

able to find and classify the irreducible classes of intercommunicating states,

calculate absorption or first passage times and probabilities, assess the equilibrium behaviour; for simple examples of continuous-time Markov chains, be

able to write down the forward equations, find and interpret the equilibrium

distribution.

Stochastic processes are vital to applications in finance and insurance, and have

many applications in biology and medicine, and in the social sciences. They

also play a fundamental role in areas such as queueing theory and the study

of system reliability. The material in this module can be applied to simplified

real-world situations, and provides the foundations for further study of more

complex systems.

The course will start by briefly revising conditional probability (however you

are expected to be well-versed in this before starting the course). We’ll then

move on to look at Markov chains (discrete time and states), and in particular

long-run behaviour. The second half of the course looks at Markov processes

in continuous time, but still discrete state space, and again our ultimate aim

is to study the long-run behaviour of these processes. We’ll look at two very

special types of continuous time processes: Poisson processes and birth-death


processes.

1.2 How this course will be run

1.2.1 Course leader

The course will be led by Dr. Elinor Jones, Associate Professor (Teaching) in

the Department of Statistical Science.

1.2.2 Level 5 and 6 versions of the course

The course is offered at two levels, but the chances are you won’t be able to

choose which level (this generally depends on your year of study etc).

The content and teaching are identical between the two levels, but in the main

exam period level 6 students will complete a more challenging paper.

1.2.3 How you’ll learn

You will work on the material in your own time with live sessions to complement

your learning. A typical weekly schedule will look like the following:

• Batch of work uploaded to Moodle each Monday at 9am.

• Finish the work by the following Monday.

• Each week, work will include:

– Watch videos;

– Read course notes;

– Complete guided exercises;

– Complete and submit homework each week (not in the first week);

– Quizzes;

– Other activities, as required.

You will also attend about three live sessions a week (exact number depends on

which week we’re in).

ONLINE large group sessions (all students attend):

• Tuesday 9-10am UK time.

• Friday 9-10am UK time.

These sessions typically include:

• Overview of material;

• Time for questions;

• Going through unseen questions.

FACE-TO-FACE small group sessions:

• Tutorial (attendance is monitored).

• All students must attend their allocated session.

1.2.4 Weekly exercises

On Friday each week I will release an exercise sheet. Note: the exercise sheet is

not released at the start of the week.

• Submit by the following Thursday 10am UK time.

• Solutions will be available at 9am each Friday.

– Those with a SoRA will have until 12:01am on Friday (i.e. an additional 14 hours according to UCL guidelines) to submit.

• NO LATE SUBMISSIONS. This rule will be strictly adhered to: no

exceptions (but see note above for students with a SoRA).

• Your work will be marked and individual feedback provided for SOME

questions.

These weekly exercises will feed into the weekly tutorial (tutorials start in the

third week).

• You should have been allocated to a tutorial group.

• You cannot move tutorial group unless you have a timetable clash.

• If you need to change your tutorial group, please email Ms Karen Leport.

– Please don’t email me: I’ve no control over tutorial allocation.

During the tutorial, your tutor will go through the questions that were not

marked. You should ‘mark’ your own work while they do this.

You will then discuss an unseen question which is connected to the work you’ve

been doing. The question will be more involved than a typical homework question. You may be asked to do short tasks connected to the question, e.g.

• Discuss in small groups;

• Respond to live quiz questions.

Please take part: the tutorials are designed to benefit your learning.

1.2.5 Weekly structure in more depth

Just to make sure you are clear on how the weekly workload will be managed:

Week 1

• Material released on Monday 10th Jan.

• Live session Tuesday 11th Jan (welcome session only).

• Live session Friday 14th Jan (Q&A for week 1 material).

• Exercise sheet 1 released Friday 14th Jan.

Week 2

• Material released on Monday 17th Jan.

• Live session Tuesday 18th Jan (Q&A for week 1 material, continued).

• Submit exercise sheet 1 by 10am on Thursday 20th Jan (unless you have

a SoRA).

• Live session Friday 21st Jan (Q&A for week 1 and 2 material).


• Exercise sheet 2 released Friday 21st Jan.

Week 3

• Material released on Monday 24th Jan.

• Live session Tuesday 25th Jan (Q&A for week 2 material, continued).

• Attend your allocated tutorial.

• Submit exercise sheet 2 by 10am on Thursday 27th Jan (unless you have a SoRA).

• Live session Friday 28th Jan (Q&A for week 2 and 3 material).

• Exercise sheet 3 released Friday 28th Jan.

And so on, and so forth. There will be some exceptions to this, e.g. when you

sit an ICA for the course.

1.2.6 Getting help

You can get help in the following ways:

• Post to the Forum.

• Book an office hour slot.

• If your question is connected to the tutorial material, then ask your tutor

during the tutorial. Your tutor will not answer questions outside of your

tutorial slot.

Please note that I won’t answer questions by email (the exception to this is if

your question is more personal, e.g. organising additional time for your ICA).

1.2.7 Assessment

Assessment is split into four components:

• ICA 1: Tuesday 8th February (7.5% weight)

– Short Moodle quiz.

– Takes place instead of Tuesday live session.

– Mixture of MCQs and calculation questions.

• ICA 2: Tuesday 22nd March (7.5% weight)

– Short Moodle quiz

– Takes place instead of Tuesday live session.

– Mixture of MCQs and calculation questions.

• Participation (10% weight)

– Acceptable quality of weekly exercise submission.

– Peer exercise.

• Open book exam (75% weight)

– Sometime in main exam period.

– No choice of questions.

– Level 6 students will have more challenging questions.

1.3 Useful books

• Introduction to Probability Models by S. Ross;

• Probability and Random Processes by Grimmett and Stirzaker (more

mathematical… be warned).


Part II

Week 1


Chapter 2

Preamble

2.1 Mathematical/Statistical Writing

Mathematics is a language, and you must learn its grammar. Whether you

are handing in a tutorial sheet or writing an exam, you need to present your

work in an appropriate manner. You could start by thinking of the following:

• Think carefully about precisely what you are trying to calculate/ solve

and what information you have available in order to do so.

• Have you clearly defined all your notation? Be precise: are your definitions watertight?

• Is your notation appropriate?

• Does your solution flow logically, that is, does each step follow on logically

from the previous step? Again, be precise. The examiner is not a mind

reader. YOU need to guide the reader through your thought process.

• Have you explained your reasoning, not just written down some formulae?

• It is important to understand what you are doing; don’t just repeat a

list of steps. Always ask yourself: ‘do I know what I’m doing?’ and ‘how

would I explain this to a fellow student?’.

Being able to write clearly will help you to think clearly. It forces you to

understand the material: if you can’t write it well, you don’t understand it well

enough. Learning mathematics is not about repetition, as it was at A-Level,

and so past papers are NOT A GOOD REVISION TOOL. In particular, I write

ICA and exam questions that probe your understanding, and not your ability

to regurgitate methods. This reflects the fact that proper mathematics is not

about learning to apply a set of methods, but using logical reasoning to find

your way through a problem.


Some tips for writing in STAT0007 (based on previous cohorts of students):

• Random variables form the basis for most, if not every, question you will

encounter. Make sure you define each and every random variable you use.

All random variables should be denoted by CAPITAL letters.

• Probabilities should be written out in full – if you want to calculate 𝑃 (𝑋 =

1) for some random variable 𝑋, then write it as such. NEVER use notation

such as 𝑃 (1) to denote this. And, as the previous bullet point states,

writing 𝑃 (𝑥 = 1) is plain wrong.

• The ‘equals sign’ has a very specific meaning. Please don’t abuse it.

• Do a reality check on your answer. For example, a probability should

be between 0 and 1. An expectation of how many time units it takes to

reach a particular point should be at least 1 (assuming moving between

locations takes 1 unit of time, and that you’re not already there!).

• If your answer doesn’t contain written explanation, then it is not a complete solution. Think about guiding the marker through your thought process, rather than expecting the marker to second guess what you meant

to say.

• Use tutorial sheets to practise good mathematical writing.

Part III

Week 1 (Continued)


Chapter 3

Preliminaries

3.1 What is a stochastic process?

3.1.1 A course overview

A stochastic process is a process which evolves randomly over time (or space,

or both):

• ‘stochastic’: equivalent to ‘random’.

• ‘process’: occurring over time (or space, or both).

We will consider some simple stochastic processes which are used to model real-life situations. Our main focus will be on developing mathematical tools to

study their properties and learn about long-term behaviour of such processes.

Though this is a mathematical course, the applications of stochastic processes

are wide-ranging, e.g. in physics, engineering, the environment, social science,

epidemics, genetics and finance. We’ll see examples from some of these areas

during the course.

3.1.2 Definitions and basic properties

A stochastic process is a collection of random variables {𝑋𝑡 , 𝑡 ∈ 𝑇 } taking

values in the state space 𝑆. The parameter, or index, set 𝑇 often represents

time. We will look at both discrete-time processes and continuous-time

processes. We’ll generally use the following notation, though you need to be

flexible in adapting to different notation from time to time.

• Discrete-time processes: {𝑋𝑡 , 𝑡 ∈ 𝑇 }, where 𝑇 = {0, 1, 2, …}. This is often

written in terms of the natural numbers (including zero) rather than 𝑇 ,

i.e. {𝑋𝑛 , 𝑛 ∈ N}, where N = {0, 1, 2, …}

• Continuous-time processes: {𝑋(𝑡), 𝑡 ∈ 𝑇 }, where 𝑇 = R+ (i.e. 𝑡 ≥ 0).


We’ll often write this as {𝑋𝑡 , 𝑡 ∈ 𝑇 }, which is identical to the discrete time

notation above, but note that 𝑡 here can take any value in the positive

reals whereas for discrete time processes the set of values 𝑡 can take is

countable.

• Notice that the state space can be discrete or continuous. In this course

we will only consider discrete state spaces.

3.1.3 Examples of stochastic processes

Below are examples of stochastic processes, categorised according to whether

they are continuous or discrete time processes, and whether the state space is

discrete or continuous. We only consider discrete state space processes in this

course, in both discrete and continuous time.

Discrete time, discrete state space

Here we are tossing a coin and counting the number of excess heads (i.e. how

many more heads than tails we see). The ‘time’ on the x-axis represents the coin

tosses. For example, we see from this particular realisation that the first toss

must have landed tails (because the number of excess heads is now -1), followed

by a head (we’re now back up to 0 excess heads). The state-space is countable

(integers, so 𝑆 = Z) and the time element here is the coin tosses themselves so

we’re in discrete ‘time’ with 𝑇 = N ∪ {0}. Notice here that one coin toss is

equivalent to one unit of time, so we need to be flexible in how we view ‘time’.

Also notice that the sequence 𝑋1 , 𝑋2 , … isn’t independent: if we know 𝑋𝑘 say

then 𝑋𝑘+1 will either be 𝑋𝑘 + 1 (if we see a head at the next toss) or 𝑋𝑘 − 1

otherwise.
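If you want to generate realisations like this yourself, here is a minimal simulation sketch (Python, assuming numpy is available; the variable names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_tosses = 20
# Each toss contributes +1 (head) or -1 (tail), each with probability 1/2
steps = rng.choice([1, -1], size=n_tosses)
# X_n = number of excess heads after n tosses, starting from X_0 = 0
excess = np.concatenate(([0], np.cumsum(steps)))
print(excess)
```

Plotting `excess` against toss number reproduces a picture like Figure 3.1.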

Discrete time, continuous state space

Here we have a gambler who starts off with £10. We’re tracking how much

money they have after each game. The ‘time’ on the x-axis represents the games

that the gambler bets on, so the time index set is 𝑇 = N ∪ {0}. We consider the

state-space to be continuous here, though technically you could consider it to be

discrete. The reason why it’s easier to consider this as continuous state-space

is that there are so many possible ‘values’ that the process could take (think:

each value corresponds to a particular value in pounds and pence). Notice again

here that one game is equivalent to one unit of time, so we need to be flexible in

how we view ‘time’. Also notice that the sequence 𝑋1 , 𝑋2 , … isn’t independent:

if we know 𝑋𝑘 then, although we can’t provide a short list of possible values for 𝑋𝑘+1 as we could in the previous example, we can take an educated guess depending on the game being played.

Continuous time, discrete state space

Here we are counting occurrences of a specified event. At time zero we’ve seen

no events, but the first event happens roughly at time 2.5 (we can see this as

the number of occurrences steps up to 1), and the second event happens shortly

after. By time 5 we have seen 2 events. Notice here that the events can happen

at any instant in time so we think of time as continuous. The state-space is


Figure 3.1: Tracking the number of excess heads in 20 coin tosses

discrete as we’re just counting events (𝑆 = N ∪ {0}). Also notice that if we look

at the value of the process at times, say, 8.3 and 8.8, 𝑋(8.3) and 𝑋(8.8), then

these aren’t independent: if we know 𝑋(8.3) then 𝑋(8.8) must be at least as big

as 𝑋(8.3) as we can’t ‘unsee’ events. Under certain distributional assumptions

which we’ll discuss later in the course, these processes are known as ‘Poisson

processes’.

Continuous time, continuous state space

Finally an example of a stochastic process that we won’t be studying further:

this is a Brownian motion which has continuous state-space and continuous

time. It has continuous paths, but an extension to Brownian motion is the Lévy

process which allows jumps in values to occur. These types of processes are

used in Finance but are beyond the scope of this course.

NOTE: Continuous-time processes can change their value/state (‘jump’) at any

instant of time; discrete-time processes can only do this at a discrete set of time

points. For discrete-time processes, when we use the word ‘time’ we mean

‘number of transitions/steps’.

Figure 3.2: Tracking a gambler’s wealth over 30 games

Figure 3.3: A realisation of a Poisson process

Figure 3.4: A realisation of a Brownian motion

3.2 Random variables

Informally speaking, a random variable is a variable (i.e. can take different

values), whose numerical values are randomly chosen from a set of possible

values, governed by a probability distribution. For example, the time that a

passenger must wait until the next southbound tube train at Goodge Street station is a variable. Now suppose you arrive at Goodge Street station – the

amount of time you must wait until the next southbound tube arrives is a

realisation of the random variable ‘waiting time’.

It is important that you can spot a random variable, and can also define appropriate random variables for a given scenario.

3.2.1 Definitions

Ideally, we should think about random variables as functions. In order to do so,

we must first build up some terminology. We’ll use a coin tossing example to

illustrate: suppose you toss a coin twice and record whether it lands heads or

tails at each toss. Note that the following is a simplification of ideas from Probability Theory and we won’t be using the technicalities in this course. However,

it does no harm to see these ideas.

• An experiment is any situation where there is a set of possible outcomes.

– Our coin-tossing example is an experiment.

– The possible outcomes are HH, HT, TH or TT.

• The set of possible outcomes is called the sample space, Ω.

– In our case, Ω = {𝐻𝐻, 𝐻𝑇 , 𝑇 𝐻, 𝑇 𝑇 }

• An outcome is denoted 𝜔 and is an element of the sample space.

– Suppose we conduct the experiment, and the outcome is TH. This is

our 𝜔.

• An event, 𝐴, is a subset of the sample space.

– e.g. define an event as ‘at least one head’.

– This is satisfied by HH, HT and TH.

– This is a subset of Ω.

Probability model      Real world
Sample space, Ω        ‘Set of all possible outcomes of the experiment.’
Sample point, 𝜔        ‘Outcome of the experiment.’
Event 𝐴 ⊆ Ω            ‘Property that the outcome of the experiment may possess.’

With a finite sample space Ω, we can assign probabilities to individual sample

points 𝜔 via a ‘weight function’, 𝑝 ∶ Ω → [0, 1]. This allocates a value 𝑝(𝜔) for

each possible outcome 𝜔, which we interpret as the probability that 𝜔 occurs.


We need

∑_{𝜔∈Ω} 𝑝(𝜔) = 1,

then we may calculate the probability of an event 𝐴 ⊆ Ω via

P(𝐴) = ∑_{𝜔∈𝐴} 𝑝(𝜔).

Caution: this doesn’t always work with infinite sample spaces, but we won’t

consider this further.

Notice that:

1. P(𝐴) ≥ 0 for all events 𝐴,

2. if 𝐴1, 𝐴2, 𝐴3, … are events with 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 ≠ 𝑗, then P(∪_{𝑖=1}^{∞} 𝐴𝑖) = ∑_{𝑖=1}^{∞} P(𝐴𝑖), i.e. P is countably additive,

3. P(Ω) = 1.

As the outcome of an experiment corresponds to a sample point 𝜔 ∈ Ω, we

can make some numerical measurements whose values depend on the outcome

𝜔. This gives rise to a random variable, let’s call it 𝑋, which is a function

𝑋 ∶ Ω → R. Its value at a sample point, 𝑋(𝜔), represents the numerical value

of the measurement when the outcome of the experiment is 𝜔.

In our set-up, there are different random variables you could define. Let’s define

𝑋 to be the number of heads in the two tosses of a coin. In this case 𝑋 maps

Ω to 0 (if we see TT), 1 (if we see TH or HT) or 2 (if we see HH).

𝑋(𝑇 𝑇 ) = 0

𝑋(𝑇 𝐻) = 𝑋(𝐻𝑇 ) = 1

𝑋(𝐻𝐻) = 2

All outcomes are equally likely here, and so 𝑝(𝜔) = 1/4 for all outcomes 𝜔. If

we define our event 𝐴 =“two heads”, then since the only sample point in 𝐴 is

HH,

∑_{𝜔∈𝐴} 𝑝(𝜔) = 1/4,

and we say that P(𝐴) = 1/4.

Similarly, we can also define another event 𝐵 =“at least one tail”, and this is

given by P(𝑋 ≤ 1), which we can calculate as:

P(𝐵) = P(𝑋 ≤ 1) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 1})   (3.1)

Which 𝜔 satisfy (3.1) in this case? Now you should be able to calculate the

probability required.
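As a sanity check on these definitions, here is a short sketch (plain Python; the names are hypothetical) that enumerates Ω and sums 𝑝(𝜔) over the outcomes satisfying 𝑋(𝜔) ≤ 1:

```python
# Sample space for two coin tosses; all outcomes equally likely
omega = ["HH", "HT", "TH", "TT"]
p = {w: 1/4 for w in omega}

def X(w):
    # X(omega) = number of heads in the outcome
    return w.count("H")

# P(B) = P(X <= 1) = sum of p(w) over {w in Omega : X(w) <= 1}
prob_B = sum(p[w] for w in omega if X(w) <= 1)
print(prob_B)  # HT, TH and TT qualify, giving 3/4
```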

Note: we have defined the distribution function of 𝑋 above. In general, the

distribution function of a random variable 𝑋 is given, for 𝑥 ∈ R, by:

𝐹𝑋 (𝑥) = P(𝑋 ≤ 𝑥) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 𝑥}).

3.2.2 Discrete random variables

𝑋 is a discrete random variable if it takes only finitely many or countably

infinitely many values.

The probability mass function of a discrete random variable is

𝑝𝑋 (𝑥) = P(𝑋 = 𝑥) = P({𝜔 ∶ 𝑋(𝜔) = 𝑥}).

The expectation of 𝑋 is

E(𝑋) = ∑_𝑥 𝑥 P(𝑋 = 𝑥).

For a function 𝑔 we have

E(𝑔(𝑋)) = ∑_𝑥 𝑔(𝑥) P(𝑋 = 𝑥).

The variance of 𝑋 is

Var(𝑋) = E[(𝑋 − E(𝑋))²] = E(𝑋²) − [E(𝑋)]².

3.2.3 Continuous random variables

𝑋 is a continuous random variable if it has a probability density function, i.e., if there exists a non-negative function 𝑓𝑋 with ∫_{−∞}^{∞} 𝑓𝑋(𝑥)𝑑𝑥 = 1, such that for all 𝑥,

𝐹𝑋(𝑥) = ∫_{−∞}^{𝑥} 𝑓𝑋(𝑢)𝑑𝑢 .

Then

𝑓𝑋(𝑥) = 𝑑𝐹𝑋(𝑥)/𝑑𝑥 .

The expectation of 𝑋 is

E(𝑋) = ∫_{−∞}^{∞} 𝑥𝑓𝑋(𝑥)𝑑𝑥

and the expectation of 𝑔(𝑋), for some function 𝑔, is

E(𝑔(𝑋)) = ∫_{−∞}^{∞} 𝑔(𝑥)𝑓𝑋(𝑥)𝑑𝑥 ,

provided that these integrals exist. As before, the variance of 𝑋 is

Var(𝑋) = E[(𝑋 − E(𝑋))²] = E(𝑋²) − [E(𝑋)]².


3.2.4 Important distributions

There are several probability distributions which we will use repeatedly during

the course. Make sure you are very familiar with the following.

Discrete probability distributions (writing 𝑞 = 1 − 𝑝):

• Binomial, 𝑋 ∼ Bin(𝑛, 𝑝): support 0, 1, …, 𝑛; PMF P(𝑋 = 𝑘) = [𝑛!/(𝑘!(𝑛−𝑘)!)] 𝑝^𝑘 𝑞^{𝑛−𝑘}; expectation 𝑛𝑝; variance 𝑛𝑝𝑞.

• Poisson, 𝑋 ∼ Po(𝜆), 𝜆 > 0: support 0, 1, 2, 3, …; PMF P(𝑋 = 𝑘) = exp(−𝜆) 𝜆^𝑘/𝑘!; expectation 𝜆; variance 𝜆.

• Geometric, 𝑋 ∼ Geom(𝑝): support 1, 2, 3, 4, …; PMF P(𝑋 = 𝑘) = (1 − 𝑝)^{𝑘−1} 𝑝; expectation 1/𝑝; variance 𝑞/𝑝².

Continuous probability distributions:

• Exponential, 𝑋 ∼ Exp(𝜆), 𝜆 > 0: support R⁺; PDF 𝜆𝑒^{−𝜆𝑥}; expectation 1/𝜆; variance 1/𝜆².

• Uniform, 𝑋 ∼ Unif(𝑎, 𝑏): support [𝑎, 𝑏]; PDF 1/(𝑏 − 𝑎) for 𝑎 < 𝑥 < 𝑏 and 0 otherwise; expectation (𝑎 + 𝑏)/2; variance (𝑏 − 𝑎)²/12.

• Gamma, 𝑋 ∼ Gamma(𝛼, 𝛽), 𝛼, 𝛽 > 0: support R⁺; PDF 𝛽^𝛼 𝑥^{𝛼−1} exp(−𝛽𝑥)/Γ(𝛼); expectation 𝛼/𝛽; variance 𝛼/𝛽².

3.3 Conditioning on events

Calculating probabilities that are conditional on a particular event or occurrence is a natural requirement. Make sure you understand the rationale behind

conditional probability, don’t just learn the formulae!

Easy exercise

You have a deck of playing cards (52 cards in total, split equally into four suits:

hearts, diamonds, spades and clubs, with the first two considered ‘red’ cards,

and the others considered ‘black’ cards).

• I pick a card at random. What is the probability that the chosen card is

from the diamond suit?

• I pick a card at random and tell you that it is a red card (i.e. either hearts

or diamonds). What is the probability that the chosen card is from the

diamond suit?

The second question requires a conditional probability (it is conditional on knowing that the card is red), but you most likely calculated the probability intuitively without using any formulae. How did you do that? You applied Bayes’s

theorem intuitively, without even realising it.

3.3.1 Intuition behind conditional probability

In fact, Bayes’s theorem for calculating conditional probabilities is perfectly

intuitive. Suppose we are calculating P(𝐴|𝐶) for some events 𝐴 and 𝐶:

• This is asking for the probability of 𝐴 occurring, given that 𝐶 occurs.

• Assuming 𝐶 occurs, the part of the state space Ω which relates to 𝐶 becomes the ‘new’ state space, Ω′ (which is just the event 𝐶 for our

purposes).

• What we are asking here, in effect, is to calculate the probability of 𝐴 in

the ‘new’ sample space Ω′ (i.e. P(𝐴 ∩ 𝐶)), scaled to Ω′ (i.e. divided by

P(𝐶)).

• The final equation is therefore P(𝐴|𝐶) = P(𝐴 ∩ 𝐶)/P(𝐶)

But why scale by P(𝐶)? Simply because the probabilities of all disjoint events

in any sample space must sum to 1. Our sample space is now Ω′ = 𝐶, so the

32

CHAPTER 3. PRELIMINARIES

new (probability) weight function we apply to all events in Ω′ must sum to 1.

We scale all probabilities by P(𝐶) to ensure that this is the case.

The following graphic may help you to visualise why Bayes’s theorem ‘works’.

Figure 3.5: A schematic of conditional probability

3.3.2 Formal version of conditional probability

Let 𝐴 and 𝐶 be events with P(𝐶) > 0. The conditional probability of 𝐴

given 𝐶 is

P(𝐴 | 𝐶) = P(𝐴 ∩ 𝐶) / P(𝐶) .


It is easy to verify that

1. P(𝐴 | 𝐶) ≥ 0 ,

2. if 𝐴1 , 𝐴2 , … are mutually exclusive events, then

P(𝐴1 ∪ 𝐴2 ∪ … | 𝐶) = ∑_𝑖 P(𝐴𝑖 | 𝐶),

(try drawing diagrams like the above to visualise this)

3. P(𝐶 | 𝐶) = 1.

Conditional probability for random variables instead of events works in the same way.

Discrete RVs 𝑋 and 𝑌:

• Joint probability mass function: 𝑝𝑋,𝑌(𝑥, 𝑦) = P(𝑋 = 𝑥, 𝑌 = 𝑦).

• 𝑋 and 𝑌 independent if and only if, for all 𝑥, 𝑦: 𝑝𝑋,𝑌(𝑥, 𝑦) = 𝑝𝑋(𝑥)𝑝𝑌(𝑦) = P(𝑋 = 𝑥)P(𝑌 = 𝑦).

• Marginal probability mass function (of 𝑋): 𝑝𝑋(𝑥) = P(𝑋 = 𝑥) = ∑_𝑦 P(𝑋 = 𝑥, 𝑌 = 𝑦).

• Conditional PMF of 𝑋 given 𝑌 = 𝑦: 𝑝𝑋∣𝑌=𝑦(𝑥) = P(𝑋 = 𝑥 | 𝑌 = 𝑦) = P(𝑋 = 𝑥, 𝑌 = 𝑦) / P(𝑌 = 𝑦).

• Conditional distribution function of 𝑋 given 𝑌 = 𝑦: 𝐹𝑋∣𝑌=𝑦(𝑥) = P(𝑋 ≤ 𝑥 ∣ 𝑌 = 𝑦).

Continuous RVs 𝑋 and 𝑌:

• Joint density function: 𝑓𝑋,𝑌(𝑥, 𝑦).

• 𝑋 and 𝑌 independent if and only if, for all 𝑥, 𝑦: 𝑓𝑋,𝑌(𝑥, 𝑦) = 𝑓𝑋(𝑥)𝑓𝑌(𝑦).

• Marginal density (of 𝑌): 𝑓𝑌(𝑦) = ∫_{−∞}^{∞} 𝑓𝑋,𝑌(𝑥, 𝑦)𝑑𝑥.

• Conditional density of 𝑋 given 𝑌 = 𝑦 (where 𝑓𝑌(𝑦) > 0): 𝑓𝑋∣𝑌=𝑦(𝑥) = 𝑓𝑋,𝑌(𝑥, 𝑦) / 𝑓𝑌(𝑦).

• Conditional distribution function of 𝑋 given 𝑌 = 𝑦: 𝐹𝑋∣𝑌=𝑦(𝑥) = P(𝑋 ≤ 𝑥 ∣ 𝑌 = 𝑦).


3.3.3 Conditional expectation

Just like calculating an expectation of, say, 𝑋, requires knowing the distribution

of 𝑋, the conditional expectation of 𝑋, given that we know 𝑌 = 𝑦 requires

knowing the distribution of (𝑋|𝑌 = 𝑦).

The conditional expectation of 𝑋 given 𝑌 = 𝑦 is

E(𝑋|𝑌 = 𝑦) = { ∑_𝑥 𝑥 P(𝑋 = 𝑥 ∣ 𝑌 = 𝑦) for discrete RVs; ∫_{−∞}^{∞} 𝑥𝑓𝑋∣𝑌=𝑦(𝑥)𝑑𝑥 for continuous RVs.

The conditional expectation of 𝑔(𝑋) given 𝑌 = 𝑦, for some function 𝑔, is

E(𝑔(𝑋)|𝑌 = 𝑦) = { ∑_𝑥 𝑔(𝑥) P(𝑋 = 𝑥 ∣ 𝑌 = 𝑦) for discrete RVs; ∫_{−∞}^{∞} 𝑔(𝑥)𝑓𝑋∣𝑌=𝑦(𝑥)𝑑𝑥 for continuous RVs.

Example 3.1. Suppose that Ω = {𝜔1 , 𝜔2 , 𝜔3 } and P(𝜔𝑖 ) = 1/3 for 𝑖 = 1, 2, 3.

Suppose also that 𝑋 and 𝑌 are random variables with

𝑋(𝜔1 ) = 2, 𝑋(𝜔2 ) = 3, 𝑋(𝜔3 ) = 1

𝑌 (𝜔1 ) = 2, 𝑌 (𝜔2 ) = 2, 𝑌 (𝜔3 ) = 1

Find the conditional pmf 𝑝𝑋∣𝑌 =2 (𝑥) and the conditional expectation E(𝑋 | 𝑌 =

2).

Solution

Possible values of 𝑋 are 1, 2, 3.

𝑝𝑋∣𝑌=2(1) = P(𝑋 = 1 ∣ 𝑌 = 2) = 0.

𝑝𝑋∣𝑌=2(2) = P(𝑋 = 2 ∣ 𝑌 = 2) = P(𝑋 = 2, 𝑌 = 2) / P(𝑌 = 2) = P(𝜔1) / [P(𝜔1) + P(𝜔2)] = 1/2.

𝑝𝑋∣𝑌=2(3) = P(𝑋 = 3 ∣ 𝑌 = 2) = P(𝑋 = 3, 𝑌 = 2) / P(𝑌 = 2) = P(𝜔2) / [P(𝜔1) + P(𝜔2)] = 1/2.

Thus the conditional pmf of 𝑋 given 𝑌 = 2 equals 1/2 for 𝑥 = 2 and for 𝑥 = 3, and is 0 otherwise. The conditional expectation is

E(𝑋 | 𝑌 = 2) = ∑_𝑥 𝑥 P(𝑋 = 𝑥 ∣ 𝑌 = 2) = (1 × 0) + (2 × 1/2) + (3 × 1/2) = 5/2 .
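The same calculation can be checked by brute force. A sketch (plain Python, encoding 𝑋 and 𝑌 as dictionaries over Ω):

```python
# Omega = {w1, w2, w3}, each sample point with probability 1/3
p = {"w1": 1/3, "w2": 1/3, "w3": 1/3}
X = {"w1": 2, "w2": 3, "w3": 1}
Y = {"w1": 2, "w2": 2, "w3": 1}

# Conditional pmf of X given Y = 2: P(X = x, Y = 2) / P(Y = 2)
pY2 = sum(p[w] for w in p if Y[w] == 2)
pmf = {x: sum(p[w] for w in p if X[w] == x and Y[w] == 2) / pY2
       for x in sorted(set(X.values()))}
print(pmf)                                 # {1: 0.0, 2: 0.5, 3: 0.5}
print(sum(x * q for x, q in pmf.items()))  # 2.5 = E(X | Y = 2)
```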


So far, we have only calculated conditional expectations when conditioning on

specific values, e.g. E[𝑋|𝑌 = 2]. We can extend this idea to conditioning on

random variables without equating them to specific values.

The notation E(𝑋 | 𝑌 ) is used to denote a random variable that takes the

value E(𝑋 | 𝑌 = 𝑦) with probability P(𝑌 = 𝑦).

Note that E(𝑋 | 𝑌 ) is a function of the random variable 𝑌 , and is itself a

random variable.

Example 3.2. In Example 3.1 we found that

E[𝑋 | 𝑌 = 2] = 5/2

It can similarly be shown that

E[𝑋 | 𝑌 = 1] = 1

You can also check that

P(𝑌 = 1) = 1/3 and P(𝑌 = 2) = 2/3

Thus, the random variable

E[𝑋 | 𝑌 ]

has two possible values: 1 and 5/2, with probabilities 1/3 and 2/3, respectively. That is,

E[𝑋|𝑌 ] = { 1 with probability 1/3; 5/2 with probability 2/3.

3.3.4 Useful conditioning formulae

Three important formulae:

Law of total probability Let 𝐴 be an event and 𝑌 be ANY random variable.

Then

P(𝐴) = { ∑_𝑦 P(𝐴|𝑌 = 𝑦)P(𝑌 = 𝑦) if 𝑌 discrete; ∫_{−∞}^{∞} P(𝐴 | 𝑌 = 𝑦)𝑓𝑌(𝑦)𝑑𝑦 if 𝑌 continuous.

Modification of the law of total probability: if 𝑍 is a third discrete random

variable,


P(𝑋 = 𝑥 ∣ 𝑍 = 𝑧) = ∑_𝑦 P(𝑋 = 𝑥, 𝑌 = 𝑦 ∣ 𝑍 = 𝑧)

= ∑_𝑦 P(𝑋 = 𝑥, 𝑌 = 𝑦, 𝑍 = 𝑧) / P(𝑍 = 𝑧)

= ∑_𝑦 [P(𝑋 = 𝑥, 𝑌 = 𝑦, 𝑍 = 𝑧) / P(𝑌 = 𝑦, 𝑍 = 𝑧)] × [P(𝑌 = 𝑦, 𝑍 = 𝑧) / P(𝑍 = 𝑧)]

= ∑_𝑦 P(𝑋 = 𝑥 ∣ 𝑌 = 𝑦, 𝑍 = 𝑧) P(𝑌 = 𝑦 ∣ 𝑍 = 𝑧),

assuming the conditional probabilities are all defined.

Law of conditional (iterated) expectations

E[𝑋] = E[E(𝑋 | 𝑌 )] = { ∑_𝑦 E(𝑋 | 𝑌 = 𝑦) P(𝑌 = 𝑦) if 𝑌 discrete; ∫_{−∞}^{∞} E(𝑋 | 𝑌 = 𝑦)𝑓𝑌(𝑦)𝑑𝑦 if 𝑌 continuous.

and, more generally,

E[𝑔(𝑋)] = E(E[𝑔(𝑋) | 𝑌 ])

Note in particular:

• The law of total probability applies to ANY event 𝐴 and ANY random

variable 𝑌 .

• The law of the conditional (iterated) expectation, E[𝑋] = E[E[𝑋|𝑌 ]] applies to ANY random variables 𝑋 and 𝑌 , and we will be using this identity

heavily during the course.

Example 3.3. In Example 3.2 we found the distribution of the random variable

E(𝑋 | 𝑌 ) was

E[𝑋|𝑌 ] = { 1 w.p. 1/3; 5/2 w.p. 2/3.

The expectation of this random variable is

E[E(𝑋|𝑌 )] = 1 × 1/3 + 5/2 × 2/3 = 2.

Also, the expectation of 𝑋 is given by

E(𝑋) = 1 × 1/3 + 2 × 1/3 + 3 × 1/3 = 2,

so that we see that the law of the conditional (iterated) expectation, i.e. that

E[𝑋] = E[E[𝑋|𝑌 ]], holds.


Example 3.4. Let 𝑋 and 𝑌 have joint density

𝑓𝑋,𝑌(𝑥, 𝑦) = (1/𝑦) 𝑒^{−𝑥/𝑦} 𝑒^{−𝑦} , 0 < 𝑥, 𝑦 < ∞ .
Find 𝑓𝑌 (𝑦), E(𝑋 | 𝑌 ) and hence E(𝑋).
Solution
For any 𝑦 > 0,

𝑓𝑌(𝑦) = ∫_{0}^{∞} (1/𝑦) 𝑒^{−𝑥/𝑦} 𝑒^{−𝑦} 𝑑𝑥 = 𝑒^{−𝑦} [−𝑒^{−𝑥/𝑦}]_{0}^{∞} = 𝑒^{−𝑦} ,

so 𝑌 has an exponential distribution with parameter 1.

Also, for any fixed 𝑦 > 0 and any 𝑥 > 0,

𝑓𝑋∣𝑌(𝑥 | 𝑦) = 𝑓𝑋,𝑌(𝑥, 𝑦) / 𝑓𝑌(𝑦) = 𝑒^{−𝑥/𝑦} 𝑒^{−𝑦} / (𝑦𝑒^{−𝑦}) = (1/𝑦) 𝑒^{−𝑥/𝑦} ,

so 𝑋 | 𝑌 = 𝑦 ∼ exp(1/𝑦).

It follows that E(𝑋 | 𝑌 = 𝑦) = 𝑦 and so E(𝑋 | 𝑌 ) = 𝑌 . Using the Law of

Conditional Expectations we then have

E[𝑋] = E[E(𝑋 | 𝑌 )] = E(𝑌 ) = 1 .
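A Monte Carlo sketch (numpy assumed) of this example: sample 𝑌 ∼ Exp(1), then draw 𝑋 given 𝑌 = 𝑦 from an exponential distribution with mean 𝑦, and check that the sample mean of 𝑋 is close to E(𝑋) = 1.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 1_000_000

y = rng.exponential(scale=1.0, size=n)  # Y ~ Exp(1): mean 1
x = rng.exponential(scale=y)            # X | Y = y ~ Exp(1/y): mean y
print(x.mean())                         # close to E(X) = E(Y) = 1
```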

3.4 Generating functions

Generating functions are USEFUL. Don’t think of them as baffling mathematical constructs; instead think of them as marvellous time-saving devices.

Imagine you have a sequence of real numbers, 𝑧1 , 𝑧2 , 𝑧3 , …. It can be hard to

keep track of such numbers and all the information they contain. A generating

function provides a way of wrapping up the entire sequence into one expression.

When we want to extract information about the sequence, we simply interrogate

this one expression in a particular way in order to get what we want.

There are many different types of generating function. These are different in

the sense that we wrap up the sequence in different ways, and so extracting

particular pieces of information from them requires different techniques. The

two types of generating function that you should be familiar with are:

1. The probability generating function (PGF);

2. The moment generating function (MGF)

3.4.1 Probability generating functions

The probability generating function (p.g.f.) of 𝑋 is defined to be

𝐺𝑋(𝑠) = E(𝑠^𝑋) = ∑_{𝑘=0}^{∞} 𝑠^𝑘 P(𝑋 = 𝑘) for |𝑠| ≤ 1.

Probability generating function fundamentals

The random variable 𝑋:   takes values in N0 = {0, 1, 2, …}
Sequence to ‘wrap up’:   P(𝑋 = 0), P(𝑋 = 1), P(𝑋 = 2), P(𝑋 = 3), …
Method:                  ∑_{𝑘=0}^{∞} 𝑠^𝑘 P(𝑋 = 𝑘) (for any 𝑠 ∈ [−1, 1])
Information stored:      the probabilities P(𝑋 = 0), P(𝑋 = 1), P(𝑋 = 2), … and some moments (expectations).

The information stored in probability generating functions can be accessed as

follows.

Calculating moments

E(𝑋) = 𝑑𝐺𝑋(𝑠)/𝑑𝑠 |_{𝑠=1} = 𝐺′𝑋(1),

E[𝑋(𝑋 − 1)] = 𝑑²𝐺𝑋(𝑠)/𝑑𝑠² |_{𝑠=1} = 𝐺″𝑋(1), and so on.

Calculating probabilities

Suppose that 𝑋 takes values in N0 = {0, 1, 2, …}. Then the PGF evaluated at

0 is:

𝐺𝑋(0) = ∑_{𝑘=0}^{∞} 0^𝑘 P(𝑋 = 𝑘) = P(𝑋 = 0).

Furthermore, if we expand 𝐺𝑋(𝑠) in powers of 𝑠, the coefficient of 𝑠^𝑘 is equal to

P(𝑋 = 𝑘), so we can also find P(𝑋 = 𝑘) for all 𝑘.

You can also work ‘backwards’: given a PGF for a random variable 𝑋, what is

the distribution of 𝑋?


Example 3.5. Suppose you are told that 𝐺𝑋 (𝑠) = 𝑝𝑠/[1 − (1 − 𝑝)𝑠]. We

can convert this expression into the summation format of a PGF and spot its

distribution:

𝐺𝑋(𝑠) = 𝑝𝑠 / [1 − (1 − 𝑝)𝑠],

= 𝑝𝑠 ∑_{𝑘=0}^{∞} (1 − 𝑝)^𝑘 𝑠^𝑘,

= ∑_{𝑗=1}^{∞} 𝑠^𝑗 (1 − 𝑝)^{𝑗−1} 𝑝,   where 𝑗 = 𝑘 + 1.

Therefore, P(𝑋 = 𝑗) = (1 − 𝑝)^{𝑗−1} 𝑝, so 𝑋 ∼ Geometric(𝑝).

In fact, different distributions have different PGF formats but random variables

with the same distribution have a PGF of the same format. You can therefore

spot the distribution of a random variable instantly from the format of its PGF.
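Calculations like Example 3.5 can be checked with a computer algebra system. A sketch (sympy assumed): expand the PGF in powers of 𝑠 and read off the coefficients, which should match (1 − 𝑝)^{𝑗−1} 𝑝; differentiating at 𝑠 = 1 recovers E(𝑋) = 1/𝑝.

```python
import sympy as sp

s, p = sp.symbols("s p", positive=True)
G = p * s / (1 - (1 - p) * s)  # PGF from Example 3.5

# Coefficient of s^j is P(X = j); compare with (1 - p)**(j - 1) * p
series = sp.series(G, s, 0, 5).removeO()
for j in range(1, 5):
    print(j, sp.factor(series.coeff(s, j)))

# E(X) = G'(1)
print(sp.simplify(sp.diff(G, s).subs(s, 1)))  # 1/p
```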

3.4.2 Moment generating functions

The moment generating function (m.g.f.) of 𝑋 is defined to be

𝑀𝑋(𝑡) = E[exp(𝑡𝑋)] = ∫_R exp(𝑡𝑥)𝑓𝑋(𝑥)𝑑𝑥

for any 𝑡 ∈ R such that the integral converges.

Moment generating function fundamentals

Random variable:         must be real valued
Quantity to ‘wrap up’:   𝑓𝑋(𝑥)
Method:                  ∫_R exp(𝑡𝑥)𝑓𝑋(𝑥)𝑑𝑥 (for any 𝑡 ∈ R such that the integral converges)
Information stored:      moments

We can access all moments of 𝑋 from 𝑀𝑋 (𝑡):

E[𝑋^𝑛] = 𝑀𝑋^{(𝑛)}(0) = 𝑑^𝑛𝑀𝑋(𝑡)/𝑑𝑡^𝑛 |_{𝑡=0}

Just as for PGFs, you can spot the distribution of a random variable

instantly from the format of its MGF.

Notice that a discrete or continuous random variable can have an MGF, whereas

a PGF is for discrete random variables only (i.e. those with countable state

space).

3.4.3 What else can we do with generating functions?

We’ll concentrate here on (useful) things you can do with probability generating functions, though similar conclusions can be reached using moment generating functions too.

Calculating the distribution of sums of random variables.

Suppose that 𝑋1 , … , 𝑋𝑛 are i.i.d. random variables with common PGF 𝐺𝑋 (𝑠).

The PGF 𝐺𝑌 (𝑠) of 𝑌 = 𝑋1 + ⋯ + 𝑋𝑛 is given by

𝐺𝑌(𝑠) = E(𝑠^{𝑋1+⋯+𝑋𝑛}),

= ∏_{𝑖=1}^{𝑛} E(𝑠^{𝑋𝑖}),   (by independence)

= [𝐺𝑋(𝑠)]^𝑛,

and by looking at [𝐺𝑋(𝑠)]^𝑛, we can spot the distribution of 𝑌, which may otherwise be difficult. Also, if 𝑍1 ⟂⟂ 𝑍2 (that is, 𝑍1 and 𝑍2 are independent but do not necessarily have the same distribution) then

𝐺𝑍1 +𝑍2 (𝑠) = 𝐺𝑍1 (𝑠)𝐺𝑍2 (𝑠).

By looking at the PGF of 𝑍1 + 𝑍2, 𝐺𝑍1+𝑍2(𝑠), we can then deduce the distribution of 𝑍1 + 𝑍2. This can be an easier strategy than trying to deduce its

distribution directly.

Example 3.6. 𝑋 and 𝑌 are both Poisson random variables, with parameters

𝜆1 and 𝜆2 , respectively. Assume that 𝑋 and 𝑌 are also independent of each

other. What is the distribution of 𝑋 + 𝑌 ?

Solution: Use probability generating functions!

The PGF of a Poisson random variable with parameter 𝜆 is

𝐺(𝑠) = 𝑒^{𝜆(𝑠−1)} .

Therefore, the PGF of 𝑋 + 𝑌 is

𝐺𝑋+𝑌(𝑠) = 𝑒^{𝜆1(𝑠−1)} 𝑒^{𝜆2(𝑠−1)} = 𝑒^{(𝜆1+𝜆2)(𝑠−1)} .

We recognise the format of this PGF as the PGF of a Poisson distribution with

parameter (𝜆1 + 𝜆2 ). Therefore,

𝑋 + 𝑌 ∼ Poisson(𝜆1 + 𝜆2 )
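A quick simulation sketch (numpy assumed) supporting this: the sum of independent Poisson(𝜆1) and Poisson(𝜆2) samples has mean and variance both near 𝜆1 + 𝜆2, exactly as a Poisson(𝜆1 + 𝜆2) random variable should.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
lam1, lam2, n = 2.0, 5.0, 1_000_000

z = rng.poisson(lam1, n) + rng.poisson(lam2, n)  # X + Y
print(z.mean(), z.var())  # both close to lam1 + lam2 = 7
```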


Calculating the PGF of a random sum

That is, the sum of a random number of random variables. Suppose that

𝑋1 , … , 𝑋𝑁 are i.i.d. random variables with common PGF 𝐺𝑋 (𝑠) and that 𝑁

has PGF 𝐺𝑁 (𝑠).

The PGF 𝐺𝑌 (𝑠) of 𝑌 = 𝑋1 + ⋯ + 𝑋𝑁 is given by

𝐺𝑌(𝑠) = E[E(𝑠^𝑌 | 𝑁)].

We’ll see this technique of expanding an expectation again later (you have already seen it if you studied STAT0005), and it will prove to be a very useful

technique in solving problems. Using this, we can deduce that

E[E(𝑠^𝑌 | 𝑁)] = E[[𝐺𝑋(𝑠)]^𝑁] = 𝐺𝑁[𝐺𝑋(𝑠)].

Similarly, it can be shown (challenge!) that the m.g.f. 𝑀𝑌 (𝑡) of 𝑌 is given by

𝑀𝑌(𝑡) = E(exp(𝑡𝑌)) = 𝐺𝑁[𝑀𝑋(𝑡)].

Example 3.7. The number of people who enter a particular bookstore on

Gower Street, per day, follows a Poisson distribution with parameter 300. Of

those who visit, 60% will make a purchase. What is the distribution of the

number of people who make a purchase at the bookstore in one day?

Solution: Use probability generating functions!

Notice that whether a visitor makes a purchase is a Bernoulli trial, with probability of success 0.6. Let

𝑋𝑖 = { 1 if the 𝑖th customer makes a purchase; 0 otherwise

so that we are looking to find the distribution of 𝑌 = 𝑋1 + … + 𝑋𝑁 where

𝑁 ∼ Poisson(300). Notice that the PGF of the Bernoulli random variable 𝑋 is

𝐺𝑋(𝑠) = (0.4 + 0.6𝑠), and the PGF of 𝑁 is 𝑒^{300(𝑠−1)} .

By the result above, we know that the PGF of 𝑌 is 𝐺𝑌 (𝑠) = 𝐺𝑁 (𝐺𝑋 (𝑠)). Think

of this as using 𝐺𝑋(𝑠) ‘in place of’ 𝑠 in the PGF of 𝑁:

𝐺𝑌(𝑠) = 𝐺𝑁(𝐺𝑋(𝑠)) = 𝑒^{300(0.4+0.6𝑠−1)} = 𝑒^{300(0.6𝑠−0.6)} = 𝑒^{180(𝑠−1)}

We recognise this as the PGF of a Poisson random variable with mean 180.

Therefore, the distribution of those who make a purchase is Poisson(180).
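To see this random-sum result in action, a simulation sketch (numpy assumed): draw 𝑁 ∼ Poisson(300) for each day, thin the visitors with purchase probability 0.6, and compare against direct Poisson(180) draws.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
days = 100_000

n_visitors = rng.poisson(300, days)        # N ~ Poisson(300)
purchases = rng.binomial(n_visitors, 0.6)  # Y = X_1 + ... + X_N given N
direct = rng.poisson(180, days)            # Poisson(180), for comparison

print(purchases.mean(), purchases.var())   # both close to 180
print(direct.mean(), direct.var())
```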


Part IV

Week 2


Chapter 4

Introduction to discrete time Markov processes

4.1 The Markov property

A stochastic process {𝑋𝑡 , 𝑡 ∈ 𝑇 } is a Markov process if, for any sequence of

times 0 ≤ 𝑡0 < 𝑡1 < ⋯ < 𝑡𝑛 < 𝑡𝑛+1 and any 𝑛 ≥ 0,
P(𝑋𝑡𝑛+1 = 𝑗|𝑋𝑡𝑛 = 𝑖𝑛 , ⋯ , 𝑋𝑡0 = 𝑖0 ) = P(𝑋𝑡𝑛+1 = 𝑗|𝑋𝑡𝑛 = 𝑖𝑛 ),
for any 𝑗, 𝑖0 , … , 𝑖𝑛 in 𝑆.
4.1.1 The Markov property – an intuitive explanation
The Markov property states that the value (or state) of the process at time 𝑡𝑛+1
depends only on its value (state) at time 𝑡𝑛 , and not on its behaviour previous
to time 𝑡𝑛 . That is, the probability of any future event depends only on the
most recent information we have about the process.
Let
• 𝑋𝑛 be the present state;
• 𝐴 be a past event (involving 𝑋0 , … , 𝑋𝑛−1 );
• 𝐵 be a future event (involving 𝑋𝑛+1 , 𝑋𝑛+2 , ….)
The Markov property (MP) states that given the present state (𝑋𝑛 ), the future
𝐵 is conditionally independent of the past 𝐴, denoted
𝐵 ⟂⟂ 𝐴|𝑋𝑛 .
Therefore,
P(𝐴 ∩ 𝐵|𝑋𝑛 = 𝑖) = P(𝐴|𝑋𝑛 = 𝑖) P(𝐵|𝑋𝑛 = 𝑖).
Similarly, on applying the Markov property,
P(𝐵|𝑋𝑛 = 𝑖, 𝐴) = P(𝐵|𝑋𝑛 = 𝑖),
for any 𝑛 ≥ 0, any 𝑖 and any 𝐴 and 𝐵. Notice here that we can use the Markov
property because the most recent information about the chain, its location at
time 𝑛, is given exactly as 𝑋𝑛 = 𝑖.
4.1.2 An important note about the Markov property
Think carefully about the definition of the Markov property. It states that
P(𝑋𝑡𝑛+1 = 𝑗|𝑋𝑡𝑛 = 𝑖𝑛 , ⋯ , 𝑋𝑡0 = 𝑖0 ) = P(𝑋𝑡𝑛+1 = 𝑗|𝑋𝑡𝑛 = 𝑖𝑛 ).
We can only use the Markov property if the most recent information gives us
specific information about the location of the process. We could not apply the
Markov property if, for example, we wanted to simplify
P(𝑋𝑡𝑛+1 = 𝑗|𝑋𝑡𝑛 = 𝑖𝑛 or 𝑗𝑛 , ⋯ , 𝑋𝑡0 = 𝑖0 ),
since the most recent information (the location of the process at time 𝑡𝑛 ) does
not tell us where exactly the Markov chain is at time 𝑡𝑛 .
4.1.3 Why is the Markov property useful?
We’ll be using the Markov property repeatedly throughout the course, so make
sure you are familiar with it!
The Markov property is a strong independence assumption. It is useful because it simplifies probability calculations. It enables joint probabilities
to be expressed as a product of simple conditional probabilities.
Example 4.1. Suppose that 𝑋 is a (discrete time) Markov process taking values
0, 1 and 2. Write
P(𝑋0 = 1, 𝑋1 = 2, 𝑋2 = 0)
as a product of conditional probabilities, simplifying your answer as much as
possible, justifying each step of your argument.
We first use Bayes’s theorem:
P(𝑋0 = 1, 𝑋1 = 2, 𝑋2 = 0) = P(𝑋2 = 0|𝑋1 = 2, 𝑋0 = 1) P(𝑋1 = 2, 𝑋0 = 1)
= P(𝑋2 = 0|𝑋1 = 2, 𝑋0 = 1) P(𝑋1 = 2|𝑋0 = 1)P(𝑋0 = 1)
Notice that we can simplify the first expression on the right using the Markov
property:
P(𝑋2 = 0|𝑋1 = 2, 𝑋0 = 1) P(𝑋1 = 2|𝑋0 = 1)P(𝑋0 = 1)
=P(𝑋2 = 0|𝑋1 = 2) P(𝑋1 = 2|𝑋0 = 1)P(𝑋0 = 1)
Therefore,
P(𝑋0 = 1, 𝑋1 = 2, 𝑋2 = 0) = P(𝑋2 = 0|𝑋1 = 2) P(𝑋1 = 2|𝑋0 = 1) P(𝑋0 = 1)
4.2 Introduction to discrete time Markov chains
Recall: a discrete-time Markov chain, often abbreviated to Markov chain,
is a sequence of random variables 𝑋0 , 𝑋1 , 𝑋2 , … taking values in a finite or
countable state space 𝑆 such that, for all 𝑛, 𝑖, 𝑗, 𝑖0 , 𝑖1 , … , 𝑖𝑛−1
P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖, 𝑋𝑛−1 = 𝑖𝑛−1 , … , 𝑋0 = 𝑖0 ) = P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖),
that is, a discrete-time stochastic process satisfying the Markov property.
4.2.1 Time homogeneity and transition probabilities
A Markov chain is time-homogeneous if
P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖)
does not depend on 𝑛 for all 𝑖, 𝑗 ∈ 𝑆.
For a time homogeneous Markov chain, we have
P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖) = P(𝑋𝑚+1 = 𝑗|𝑋𝑚 = 𝑖)
even if 𝑚 ≠ 𝑛. In effect, this allows us to forget about when the chain moved
from 𝑖 to 𝑗, all that matters is that this occurred in one time step. When a
chain is time homogeneous, we also have that for any integer 𝑟 ≥ 1,
P(𝑋𝑛+𝑟 = 𝑗|𝑋𝑛 = 𝑖) = P(𝑋𝑚+𝑟 = 𝑗|𝑋𝑚 = 𝑖).
That is, the probability of moving from state 𝑖 to state 𝑗 in 𝑟 time steps is the
same regardless of when this happened. In this course, we will assume
that all discrete time Markov processes are time homogeneous unless otherwise
stated.
The probabilities
𝑝𝑖𝑗 = P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖),
for 𝑖, 𝑗 ∈ 𝑆,
are called the (1-step) transition probabilities of the Markov chain. These
are very important and will help us to solve many interesting questions about
the behaviour of Markov chains. Notice that 𝑝𝑖𝑗 does not depend on 𝑛 as we
are assuming that {𝑋𝑛 ; 𝑛 = 0, 1, 2, ...} is time homogeneous.
4.2.2 Transition matrices
The probabilities 𝑝𝑖𝑗 are often collected into a matrix 𝑃 called the transition
matrix. The transition probability 𝑝𝑖𝑗 is the element of 𝑃 in the 𝑖th row and
the 𝑗th column, that is, the (𝑖, 𝑗)th element of 𝑃 .
    ⎛ 𝑝11  𝑝12  𝑝13  ...  𝑝1𝑛 ⎞
    ⎜ 𝑝21  𝑝22  𝑝23  ...  𝑝2𝑛 ⎟
𝑃 = ⎜ 𝑝31  𝑝32  𝑝33  ...  𝑝3𝑛 ⎟
    ⎜  ⋮    ⋮    ⋮    ⋱    ⋮  ⎟
    ⎝ 𝑝𝑛1  𝑝𝑛2  𝑝𝑛3  ...  𝑝𝑛𝑛 ⎠
Important properties of a transition matrix include:
• All entries are non-negative (and, because they are probabilities, are ≤ 1).
• Each of the rows sums to 1, that is, ∑_𝑗 𝑝𝑖𝑗 = 1, for all 𝑖 ∈ 𝑆.
Why must each row sum to 1? Why needn’t each column sum to 1?
4.2.3 The n-step transition probabilities
The 𝑛-step transition probabilities are
𝑝𝑖𝑗^{(𝑛)} = P(𝑋𝑛 = 𝑗|𝑋0 = 𝑖) = P(𝑋𝑚+𝑛 = 𝑗|𝑋𝑚 = 𝑖)

for all 𝑚 if {𝑋𝑛 ; 𝑛 = 0, 1, 2, ...} is time-homogeneous. So 𝑝𝑖𝑗^{(𝑛)} is the probability that a process in state 𝑖 will be in state 𝑗 after 𝑛 ‘steps’. Note that due to time homogeneity, this does not depend on 𝑚.
The 𝑛-step transition matrix 𝑃^{(𝑛)} has 𝑝𝑖𝑗^{(𝑛)} as its (𝑖, 𝑗)th element:

          ⎛ 𝑝11^{(𝑛)}  𝑝12^{(𝑛)}  𝑝13^{(𝑛)}  ...  𝑝1𝑛^{(𝑛)} ⎞
          ⎜ 𝑝21^{(𝑛)}  𝑝22^{(𝑛)}  𝑝23^{(𝑛)}  ...  𝑝2𝑛^{(𝑛)} ⎟
𝑃^{(𝑛)} = ⎜ 𝑝31^{(𝑛)}  𝑝32^{(𝑛)}  𝑝33^{(𝑛)}  ...  𝑝3𝑛^{(𝑛)} ⎟
          ⎜     ⋮          ⋮          ⋮       ⋱       ⋮     ⎟
          ⎝ 𝑝𝑛1^{(𝑛)}  𝑝𝑛2^{(𝑛)}  𝑝𝑛3^{(𝑛)}  ...  𝑝𝑛𝑛^{(𝑛)} ⎠

4.2.4 The Chapman-Kolmogorov equations
The Chapman-Kolmogorov (C-K) equations show us how to compute 𝑝𝑖𝑗^{(𝑛+𝑚)} or 𝑃^{(𝑛+𝑚)} for 𝑚 ≥ 0, 𝑛 ≥ 0. These are useful equations which will enable us to
calculate these quantities quickly and easily.
Consider time points 0, 𝑚 and 𝑚 + 𝑛, where 𝑚 ≥ 0, 𝑛 ≥ 0 and suppose that we
want to compute the probability that we go from state 𝑖 to state 𝑗 in (𝑛 + 𝑚)
steps,
𝑝𝑖𝑗^{(𝑛+𝑚)} = P(𝑋𝑛+𝑚 = 𝑗|𝑋0 = 𝑖).
The C-K equations are derived using the simple observation that at time 𝑚, we
must be in some state (let’s call it 𝑘). By summing over all possibilities for 𝑘,
we derive the C-K equations.
Figure 4.1: Schematic of the CK equations
By conditioning on the state 𝑘 at time 𝑚 we derive 𝑝𝑖𝑗^{(𝑚+𝑛)}:

𝑝𝑖𝑗^{(𝑛+𝑚)} = P(𝑋𝑚+𝑛 = 𝑗|𝑋0 = 𝑖)   (4.1)
= ∑_{𝑘∈𝑆} P(𝑋𝑚+𝑛 = 𝑗, 𝑋𝑚 = 𝑘|𝑋0 = 𝑖)   (4.2)
= ∑_{𝑘∈𝑆} P(𝑋𝑚+𝑛 = 𝑗|𝑋𝑚 = 𝑘, 𝑋0 = 𝑖) P(𝑋𝑚 = 𝑘|𝑋0 = 𝑖)   (4.3)
= ∑_{𝑘∈𝑆} 𝑝𝑖𝑘^{(𝑚)} 𝑝𝑘𝑗^{(𝑛)}   (4.4)
Here we use a version of Bayes’s theorem to find (4.3), and the Markov property and time homogeneity to reach equation (4.4).
The Chapman-Kolmogorov equations are
𝑝𝑖𝑗^{(𝑛+𝑚)} = ∑_{𝑘∈𝑆} 𝑝𝑖𝑘^{(𝑚)} 𝑝𝑘𝑗^{(𝑛)}
Note ‘equations’ not ‘equation’ - there is one equation for each (𝑖, 𝑗) pair.
Since:
• {𝑝𝑖𝑘^{(𝑚)}, 𝑘 ∈ 𝑆} is the 𝑖th row of 𝑃^{(𝑚)}
• {𝑝𝑘𝑗^{(𝑛)}, 𝑘 ∈ 𝑆} is the 𝑗th column of 𝑃^{(𝑛)}
then the Chapman-Kolmogorov equations can be written in matrix form as
𝑃^{(𝑚+𝑛)} = 𝑃^{(𝑚)} ⋅ 𝑃^{(𝑛)}
where ⋅ denotes matrix multiplication. Note that this single equation encodes
all the individual C-K equations above.
This may not seem particularly helpful, but note the following:
• 𝑃^{(1)} is the transition matrix 𝑃.
• 𝑃^{(0)} is the identity matrix 𝐼.
Using this, note the following:
𝑃^{(2)} = 𝑃^{(1+1)} = 𝑃 ⋅ 𝑃 = 𝑃².
𝑃^{(3)} = 𝑃^{(1+2)} = 𝑃 ⋅ 𝑃^{(2)} = 𝑃³.
Notice also that the last line above could have been equivalently written/computed as
𝑃^{(3)} = 𝑃^{(2+1)} = 𝑃^{(2)} ⋅ 𝑃 = 𝑃³.
So what does this tell us? By using the C-K equations repeatedly, we find that:
𝑃^{(𝑛)} = 𝑃^𝑛
That is, the 𝑛-step transition matrix 𝑃^{(𝑛)} is equal to the 𝑛th matrix power of 𝑃.
This is enormously helpful - given a Markov chain with (one-step) transition
matrix 𝑃 , we can now compute its n-step transition matrix by simply multiplying 𝑃 with itself 𝑛 times.
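In code this is a one-liner. A sketch (numpy assumed, with a made-up two-state transition matrix; the numbers are purely illustrative):

```python
import numpy as np

# A hypothetical two-state transition matrix; each row must sum to 1
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

P3 = np.linalg.matrix_power(P, 3)  # 3-step transition matrix P^(3) = P^3
print(P3)
print(P3.sum(axis=1))              # each row of P^(3) still sums to 1
```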
4.2.5 Marginal probabilities
So far, we have only considered conditional probabilities. However, we may
also be interested in marginal probabilities such as P(𝑋𝑛 = 𝑗).
Let 𝑝𝑗^{(𝑛)} = P(𝑋𝑛 = 𝑗). If, for example, 𝑆 = N0 = {0, 1, 2, …}, then

𝑝^{(𝑛)} = (𝑝0^{(𝑛)}, 𝑝1^{(𝑛)}, …) = (P(𝑋𝑛 = 0), P(𝑋𝑛 = 1), …)
is a probability row vector (i.e. a row vector with non-negative entries summing to 1) specifying the distribution of 𝑋𝑛 .
Do not confuse 𝑝𝑖𝑗^{(𝑛)} and 𝑝𝑗^{(𝑛)}. The former is a conditional probability, and the latter a marginal probability.
When 𝑛 = 0, and the state space is 𝑆 = {0, 1, 2, ...}, then
𝑝^{(0)} = (𝑝0^{(0)}, 𝑝1^{(0)}, 𝑝2^{(0)}, …) = (P(𝑋0 = 0), P(𝑋0 = 1), P(𝑋0 = 2), …)
is the distribution of 𝑋0 . This is the initial distribution: the distribution of
states in which we start the Markov chain. This tells us the probability of the
chain starting in any particular state.
If we look at 𝑝𝑗^{(𝑛)}, the probability that the chain is in state 𝑗 at time 𝑛, then

𝑝𝑗^{(𝑛)} = P(𝑋𝑛 = 𝑗) = ∑_𝑖 P(𝑋𝑛 = 𝑗|𝑋0 = 𝑖) P(𝑋0 = 𝑖) = ∑_𝑖 𝑝𝑖^{(0)} 𝑝𝑖𝑗^{(𝑛)} .
Equivalently, in matrix notation,
𝑝^{(𝑛)} = 𝑝^{(0)} ⋅ 𝑃^{(𝑛)} = 𝑝^{(0)} ⋅ 𝑃^𝑛 .
Thus, the initial distribution, 𝑝^{(0)}, and the transition matrix, 𝑃 , together
contain all the probabilistic information about the Markov chain. That
is, this is all we need in order to compute or derive any quantity of interest
related to the Markov chain.
For example,
P(𝑋0 = 𝑖0 , 𝑋1 = 𝑖1 , … , 𝑋𝑛 = 𝑖𝑛 ) = 𝑝𝑖0^{(0)} 𝑝𝑖0𝑖1 𝑝𝑖1𝑖2 ⋯ 𝑝𝑖𝑛−1𝑖𝑛
is a useful identity to compute a joint probability using only the (one-step)
transition probabilities and initial distribution.
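Both identities translate directly into code. Continuing the sketch above (same hypothetical 𝑃, plus a made-up initial distribution):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
p0 = np.array([0.5, 0.5])  # hypothetical initial distribution p^(0)

# Marginal distribution of X_5: p^(5) = p^(0) . P^5
p5 = p0 @ np.linalg.matrix_power(P, 5)
print(p5)

# Joint probability of a path: p_{i0}^(0) * p_{i0 i1} * ... * p_{i_{n-1} i_n}
path = [0, 0, 1, 1]
prob = p0[path[0]] * np.prod([P[a, b] for a, b in zip(path, path[1:])])
print(prob)  # 0.5 * 0.9 * 0.1 * 0.6
```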
4.2.6 Duration of stay in a state
Remember that a Markov chain is memoryless in the sense that its future
is independent of its past, conditional on its current state (or the most recent
information that we have about its location). Therefore, if the chain is currently
in state 𝑖, the further amount of time that it will remain there does not depend
on how long it has already been in state 𝑖.
For any Markov chain the duration of stay in state 𝑖 satisfies
P(stay in state 𝑖 for 𝑛 steps | chain is currently in state 𝑖)
= P(stay in 𝑖 for a further 𝑛 − 1 steps, then leave 𝑖)
= 𝑝𝑖𝑖^{𝑛−1} (1 − 𝑝𝑖𝑖).
Therefore, the duration of stay in state 𝑖 has a geometric distribution with
parameter (1 − 𝑝𝑖𝑖 ), from which we can deduce that the expected amount of
time that the process remains in state 𝑖 is 1/(1 − 𝑝𝑖𝑖).
Where will the Markov chain go when it leaves 𝑖? (Note: we are conditioning
on leaving state 𝑖, and so this isn’t 𝑝𝑖𝑗 unless 𝑝𝑖𝑖 = 0. If we do have 𝑝𝑖𝑖 = 0
then by definition the process will leave state 𝑖 at the next step!)
For 𝑗 ≠ 𝑖,
P(MC moves to 𝑗 | MC leaves state 𝑖) = P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖, 𝑋𝑛+1 ≠ 𝑖)   (4.5)
= P(𝑋𝑛+1 = 𝑗, 𝑋𝑛+1 ≠ 𝑖|𝑋𝑛 = 𝑖) / P(𝑋𝑛+1 ≠ 𝑖|𝑋𝑛 = 𝑖)   (4.6)
= P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖) / P(𝑋𝑛+1 ≠ 𝑖|𝑋𝑛 = 𝑖)   (4.7)
= 𝑝𝑖𝑗 / (1 − 𝑝𝑖𝑖) .   (4.8)
Identity (4.6) comes from using Bayes’s theorem, while identity (4.7) just applies
common sense (if we have that 𝑋𝑛+1 = 𝑗 then 𝑋𝑛+1 ≠ 𝑖 by default!). It’s important to realise that this answer makes sense: the probability is proportional
to 𝑝𝑖𝑗 and is scaled so that
∑_{𝑗≠𝑖} 𝑝𝑖𝑗 / (1 − 𝑝𝑖𝑖) = 1.
To summarise:
• The duration of stay of the Markov chain in state 𝑖 has a geometric distribution with expected value 1/(1 − 𝑝𝑖𝑖) steps.
• When the chain moves from state 𝑖, it goes to state 𝑗 (≠ 𝑖) with probability 𝑝𝑖𝑗/(1 − 𝑝𝑖𝑖).
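A simulation sketch (numpy assumed) of the first bullet: with 𝑝𝑖𝑖 = 0.7, durations of stay are Geometric(1 − 𝑝𝑖𝑖) and should average about 1/(1 − 0.7) ≈ 3.33 steps.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
p_ii, runs = 0.7, 100_000

# Duration of stay ~ Geometric(1 - p_ii), counting the steps spent in i
durations = rng.geometric(1 - p_ii, size=runs)
print(durations.mean())  # close to 1 / (1 - p_ii) = 3.33...
```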
4.3 The idea of ‘first step decomposition’
The ‘first step decomposition’ (FSD) technique is a very useful method for computing certain expectations and probabilities associated with Markov chains, for
example:
1. What’s the probability that the process reaches a particular absorbing
state?
2. What’s the probability that the process visits state 1 exactly once before
visiting state 3?
3. How long, on average, does it take the process to be absorbed?
4. How many visits to state 1 does the process make, on average, before
reaching state 5?
5. How long does it take, on average, to reach a particular state for the first
time?
6. What’s the probability that we ever reach/ never reach a particular point?
We will be using this method extensively throughout the course.
The idea is to use the following results, which you have already met in Section
3.3.4 to simplify calculations drastically.
1. Using the law of conditional (iterated) expectation to find expectations of
interest:
𝐸[𝑋] = 𝐸 [ 𝐸[𝑋|𝑌 ] ]
2. Using the (modification of the) law of total probability to find probabilities
of interest:
𝑃(𝐴) = ∑_𝑦 𝑃(𝐴|𝑌 = 𝑦)𝑃(𝑌 = 𝑦)

𝑃(𝐴|𝑍 = 𝑧) = ∑_𝑦 𝑃(𝐴|𝑌 = 𝑦, 𝑍 = 𝑧)𝑃(𝑌 = 𝑦|𝑍 = 𝑧)
In this course, we are mainly interested in applying this technique by conditioning on ‘where does our Markov process go next?’. That is, in the above, we
choose 𝑌 to be the position of the Markov process at a particular time. This
turns out to be incredibly powerful.
This week, we’ll concentrate just on computation of probabilities concerning
Markov processes using first step decomposition. Later in the course we’ll use
this idea to compute expectations.
4.3.1 FSD and probabilities
Consider the Markov chain that is defined by the following:
Figure 4.2: State-space diagram defining the Markov process
Example 4.2. The state-space here is 𝑆 = {1, 2, 3, 4, 5, 6} and suppose that
𝑋0 = 2 so that the initial distribution is
𝑝(0) = (0, 1, 0, 0, 0, 0).
Notice that if the process eventually reaches state 1 then it will stay there forever.
However, the process may never reach state 1 (e.g. if it reaches state 6 then it
can never visit state 1). What is the probability that the chain eventually reaches state 1?
Solution
A handy way of solving this type of problem is to use first step decomposition.
The idea is to ask ‘where does the process go next’ and condition on that. This
is just the law of total probability, where we decide to condition on the next
step.
In our example, 𝑋0 = 2. From there, where can the process go? In other
words, what value(s) can 𝑋1 take? We can see from the state-space diagram
that 𝑋1 will be either 1, 2, 4 or 5, and each of these possibilities happens with
equal probability.
Using the law of total probability:
𝑃(reach state 1|𝑋0 = 2) = ∑_{𝑖∈𝑆} 𝑃(reach state 1|𝑋1 = 𝑖, 𝑋0 = 2)𝑃(𝑋1 = 𝑖|𝑋0 = 2)
Now, of course, we can simplify the right hand side by using the Markov property
and also use our notation for transition probabilities:
𝑃(reach state 1|𝑋0 = 2) = ∑_{𝑖∈𝑆} 𝑃(reach state 1|𝑋1 = 𝑖)𝑝2𝑖
We also know the possibilities for 𝑋1 , so this is
𝑃(reach state 1|𝑋1 = 1)𝑝21 + 𝑃(reach state 1|𝑋1 = 2)𝑝22   (4.9)
+ 𝑃(reach state 1|𝑋1 = 4)𝑝24 + 𝑃(reach state 1|𝑋1 = 5)𝑝25   (4.10)
and substituting the relevant transition probabilities in we get that this is just:
(1/4) 𝑃(reach state 1|𝑋1 = 1) + (1/4) 𝑃(reach state 1|𝑋1 = 2)   (4.11)
+ (1/4) 𝑃(reach state 1|𝑋1 = 4) + (1/4) 𝑃(reach state 1|𝑋1 = 5)   (4.12)
We can simplify this a little further: we know that 𝑃 (reach state 1|𝑋1 = 1) = 1
of course! So we have
1/4 + (1/4) 𝑃(reach state 1|𝑋1 = 2)   (4.13)
+ (1/4) 𝑃(reach state 1|𝑋1 = 4) + (1/4) 𝑃(reach state 1|𝑋1 = 5)   (4.14)
But how does this help us? Notice the following: it doesn’t really matter
whether we write 𝑋1 or 𝑋0 in the conditioning in (4.14). All this is saying is
“now that we’re in state 𝑖, what’s the probability of getting to state 1”. In other
words,
𝑃 (reach state 1|𝑋1 = 𝑖) = 𝑃 (reach state 1|𝑋0 = 𝑖)
This is a handy observation because we can now use some notation that covers
all these situations. Let
𝑃𝑖 = 𝑃 (reach state 1|𝑋0 = 𝑖)
Now let’s apply this to the question we’re trying to solve (which is to find 𝑃2 ).
𝑃(reach state 1|𝑋0 = 2) ≡ 𝑃2 = 1/4 + (1/4)𝑃2 + (1/4)𝑃4 + (1/4)𝑃5   (4.15)
Let’s take it a step further. If we can write 𝑃4 and 𝑃5 in terms of 𝑃2 then we
can solve for 𝑃2 . This is the purpose of FSD for probabilities: in our case we
try to compute 𝑃2 so we use FSD to write an expression for 𝑃2 wholly in terms
of 𝑃2 which then allows us to solve for 𝑃2 … magic! So let’s try it out. Because
from state 4 we must return to state 2, we have:
𝑃4 = 𝑃2
What about 𝑃5 ? You may want to write it out in full but the end result is the
following:
𝑃5 = (1/4)𝑃2 + (1/4)𝑃3 + (1/4)𝑃5 + (1/4)𝑃6
We can simplify this considerably: 𝑃6 = 0 (why?) and 𝑃3 = 𝑃5 (why?). So

𝑃5 = (1/4)𝑃2 + (1/2)𝑃5

Therefore,

𝑃5 = (1/2)𝑃2

Substituting all this in (4.15) we have:
𝑃2 = 1/4 + (1/4)𝑃2 + (1/4)𝑃2 + (1/4)(1/2)𝑃2
We can now solve to find 𝑃2 = 2/3.
Of course, the above seems like a very long-winded calculation, but once you
get the hang of it it really is very quick and painless. I could have written the
entire calculation as follows.
Example 4.3. Let
𝑃𝑖 = 𝑃 (reach state 1|𝑋0 = 𝑖)
We want to compute 𝑃2 , and we can see from the state-space diagram that
𝑃1 = 1, 𝑃4 = 𝑃2 , 𝑃3 = 𝑃5 , 𝑃6 = 0. Therefore,
𝑃2 = (1/4)𝑃1 + (1/4)𝑃2 + (1/4)𝑃4 + (1/4)𝑃5        (4.16)
   = 1/4 + (1/4)𝑃2 + (1/4)𝑃2 + (1/4)𝑃5        (4.17)
In addition,

𝑃5 = (1/4)𝑃2 + (1/4)𝑃3 + (1/4)𝑃5 + (1/4)𝑃6
   = (1/4)𝑃2 + (1/2)𝑃5
which implies that 𝑃5 = (1/2)𝑃2. Substituting this into (4.17) we have that

𝑃2 = 1/4 + (1/4)𝑃2 + (1/4)𝑃2 + (1/8)𝑃2
Solving yields 𝑃2 = 2/3.
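If you want to check a calculation like this numerically, the first step decomposition equations are just a linear system. Figure 4.2 is not reproduced here, so the transition matrix in the Python sketch below is only an assumption consistent with the relations we derived (𝑃1 = 1, 𝑃4 = 𝑃2, 𝑃3 = 𝑃5, 𝑃6 = 0); in particular the rows for states 3 and 6 are guesses.

```python
import numpy as np

# A transition matrix consistent with the relations derived above.
# Figure 4.2 is not shown here, so the rows for states 3 and 6 are
# assumptions (state 3 moves to state 5; state 6 is absorbing).
P = np.array([
    [1,   0,   0,   0,   0,   0  ],   # state 1: absorbing
    [1/4, 1/4, 0,   1/4, 1/4, 0  ],   # state 2
    [0,   0,   0,   0,   1,   0  ],   # state 3 (assumed)
    [0,   1,   0,   0,   0,   0  ],   # state 4
    [0,   1/4, 1/4, 0,   1/4, 1/4],   # state 5
    [0,   0,   0,   0,   0,   1  ],   # state 6: absorbing (assumed)
])

# First step decomposition gives h = Ph for the hitting probabilities
# h_i = P(reach state 1 | X_0 = i), with boundary values h_1 = 1, h_6 = 0.
A = np.eye(6) - P
b = np.zeros(6)
A[0, :] = 0; A[0, 0] = 1; b[0] = 1   # h_1 = 1
A[5, :] = 0; A[5, 5] = 1; b[5] = 0   # h_6 = 0
h = np.linalg.solve(A, b)
print(h[1])   # P_2: approximately 0.6667 = 2/3
```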
We can play around with this idea quite a bit. In Example 4.2, we were told where
the process starts. What if, instead, we are given an initial distribution?

Example 4.4. Suppose that 𝑝(0) = (0, 1/6, 0, 1/3, 1/2, 0). Find the probability
that the chain eventually reaches state 1.
Using our notation from earlier, we have already established that:
𝑃2 ∶= 𝑃(reach state 1|𝑋0 = 2) = 2/3
𝑃4 ∶= 𝑃(reach state 1|𝑋0 = 4) = 2/3
𝑃5 ∶= 𝑃(reach state 1|𝑋0 = 5) = 1/3

(because we established that 𝑃4 = 𝑃2 and 𝑃5 = (1/2)𝑃2 in Example 4.2). Let 𝑃𝐶
denote the probability of reaching state 1. Then

𝑃𝐶 = ∑_{𝑖=1}^{6} 𝑃(reach state 1|𝑋0 = 𝑖)𝑃(𝑋0 = 𝑖) ≡ ∑_{𝑖=1}^{6} 𝑃𝑖 ⋅ 𝑃(𝑋0 = 𝑖)
Substituting in the probabilities from the initial distribution, and from the
previous example, we have:

𝑃𝐶 = (1/6)𝑃2 + (1/3)𝑃4 + (1/2)𝑃5
and so 𝑃𝐶 = 1/2.
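Continuing the numerical sketch from Example 4.3 (with the same assumed transition matrix), the answer to Example 4.4 is just the dot product of the initial distribution with the vector of hitting probabilities:

```python
p0 = np.array([0, 1/6, 0, 1/3, 1/2, 0])   # initial distribution of Example 4.4
print(p0 @ h)                              # P_C = 0.5
```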
Part V
Week 3
Chapter 5
Classifying discrete time
Markov processes
Suppose we wish to study a particular Markov chain, summarised by its transition matrix 𝑃 and an initial distribution 𝑝(0) . We may be interested in the
long-run probabilistic behaviour of the chain or in the probability that
some future event of interest happens.
Even if nobody has studied this particular Markov chain before, we are able to
use some general results on the behaviour of Markov chains to help us. In order
to use this theory we need to classify each state of the chain as being one
of a number of types, and hence classify the type of Markov chain that we
have.
The first step in classifying Markov chains is to split the state space into non-overlapping groups (classes) of states. We will see later that states in the same
class are of the same type. This simplifies the problem of classifying all the
states in the chain because we only need to classify one state in each class.
5.1 Irreducible classes of intercommunicating states
State 𝑖 communicates with state 𝑗 (we write 𝑖 → 𝑗) if

𝑝𝑖𝑗^(𝑛) > 0 for some 𝑛 ≥ 0,

that is, starting from state 𝑖, it is possible that the chain will eventually enter
state 𝑗.

States 𝑖 and 𝑗 intercommunicate (we write 𝑖 ↔ 𝑗) if 𝑖 → 𝑗 and 𝑗 → 𝑖.

Properties of ↔

(i) 𝑖 ↔ 𝑖 for all states 𝑖.
(ii) if 𝑖 ↔ 𝑗 then 𝑗 ↔ 𝑖.
(iii) if 𝑖 ↔ 𝑗 and 𝑗 ↔ 𝑘 then 𝑖 ↔ 𝑘.

(i)+(ii)+(iii) mean that ↔ is an equivalence relation on 𝑆.

We can partition the state space 𝑆 into non-overlapping classes of intercommunicating states. Within a particular class, every pair of states intercommunicate;

states in different classes do not intercommunicate.

Each class is called an irreducible class of states.

If there is only one irreducible class, then the Markov chain is said to be irreducible.

Example 5.1. For the Markov chain with state space 𝑆 = {0, 1, 2} and transition matrix

𝑃 = ⎛ 1/4  1/4  1/2 ⎞
    ⎜  0    1    0  ⎟ ,
    ⎝ 1/2   0   1/2 ⎠

the state space can be partitioned into the irreducible classes {0, 2} and {1}. In
particular, the Markov chain is not irreducible because there are two irreducible
classes. It might be helpful to see this by drawing a state-space diagram.

Figure 5.1: State-space diagram, which may help in deciding the irreducible classes
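For small chains you can also find the irreducible classes computationally. The sketch below (an optional aside, not part of the course material) builds the reachability relation 𝑖 → 𝑗 from the transition matrix of Example 5.1 and intersects it with its transpose to obtain ↔:

```python
import numpy as np

P = np.array([[1/4, 1/4, 1/2],
              [0,   1,   0  ],
              [1/2, 0,   1/2]])

# i -> j iff p_ij^(n) > 0 for some n >= 0. On a finite state space it is
# enough to look at paths of length at most |S|, so raise (adjacency + I)
# to the power |S| and check for positive entries.
k = len(P)
adj = (P > 0).astype(int) + np.eye(k, dtype=int)
reach = np.linalg.matrix_power(adj, k) > 0   # i -> j
inter = reach & reach.T                      # i <-> j
classes = {frozenset(np.flatnonzero(inter[i])) for i in range(k)}
print([sorted(c) for c in classes])          # the classes {0, 2} and {1}
```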

5.2 Recurrence and transience

Let {𝑋𝑛 } be a Markov chain with state space 𝑆 and let 𝑖 ∈ 𝑆. Let

𝑓𝑖 = P(ever return to 𝑖|𝑋0 = 𝑖) = P(𝑋𝑛 = 𝑖 for some 𝑛 ≥ 1|𝑋0 = 𝑖).

If 𝑓𝑖 = 1 then 𝑖 is recurrent (eventual return to state 𝑖 is certain).

If 𝑓𝑖 < 1 then 𝑖 is transient (eventual return to state 𝑖 is uncertain).
Studying recurrence and transience enables us to ask further interesting questions of our Markov chain, for example, how many times will a Markov
chain hit state 𝑖, given it starts in state 𝑖?
5.2.1 Transience
Suppose that 𝑖 is transient. Let 𝑁 be the number of hits on 𝑖 (including the
hit from the fact that 𝑋0 = 𝑖). Then

P(𝑁 = 𝑛|𝑋0 = 𝑖) = P(return 𝑛 − 1 times to 𝑖, then never return|𝑋0 = 𝑖)
                = 𝑓𝑖^(𝑛−1) (1 − 𝑓𝑖).

So, given 𝑋0 = 𝑖, 𝑁 has a geometric distribution with parameter 1 − 𝑓𝑖. So 𝑁
is finite with probability 1 and

E(𝑁|𝑋0 = 𝑖) = 1/(1 − 𝑓𝑖) (< ∞).

5.2.2 Recurrence
Suppose that 𝑖 is recurrent. With probability 1 the chain will eventually return
to state 𝑖.
By time homogeneity and the Markov property, the chain ‘starts afresh’ on
return to the initial state, so that state 𝑖 will eventually be visited again (with
probability 1).
Repeating the argument shows that the chain will return to 𝑖 infinitely often
(with probability 1).
We also have that

E(𝑁|𝑋0 = 𝑖) is infinite.
5.2.3 First passage times
Another idea connected with recurrence and transience is that of first passage
times. This describes the time it takes for a Markov chain to return to state 𝑖,
given that it started there. Let
𝑇𝑖𝑖 = min{𝑛 ≥ 1 ∶ 𝑋𝑛 = 𝑖|𝑋0 = 𝑖}.
𝑇𝑖𝑖 is the first passage time from state 𝑖 to itself, that is, the number of steps
until the chain first returns to state 𝑖 given that it starts in state 𝑖.
If the chain never returns to state 𝑖 we set 𝑇𝑖𝑖 = ∞. Note that:
𝑓𝑖 = P(ever return to 𝑖|𝑋0 = 𝑖) = P(𝑇𝑖𝑖 < ∞)
1 − 𝑓𝑖 = P(never return to 𝑖|𝑋0 = 𝑖) = P(𝑇𝑖𝑖 = ∞).
Connecting this with the idea of transience and recurrence yields the following
(a small simulation sketch follows the list):

• If 𝑖 is recurrent then P(𝑇𝑖𝑖 < ∞) = 1, that is, 𝑇𝑖𝑖 is finite with probability 1.

• If 𝑖 is transient then P(𝑇𝑖𝑖 < ∞) < 1, or equivalently P(𝑇𝑖𝑖 = ∞) =
1 − P(𝑇𝑖𝑖 < ∞) > 0, that is, 𝑇𝑖𝑖 is infinite with positive probability.
(Therefore, 𝜇𝑖 = 𝐸[𝑇𝑖𝑖] is infinite.)
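The connection between 𝑓𝑖 and 𝑇𝑖𝑖 can be explored by simulation. Here is a rough Monte Carlo estimate of 𝑓0 for the chain of Example 5.1; it works cleanly there because 'never return to 0' is detectable the moment the chain is absorbed in state 1. (A minimal sketch, not course material.)

```python
import random

# Transition probabilities of the Example 5.1 chain.
P = {0: [(0, 1/4), (1, 1/4), (2, 1/2)],
     1: [(1, 1.0)],
     2: [(0, 1/2), (2, 1/2)]}

def step(i):
    u, acc = random.random(), 0.0
    for j, p in P[i]:
        acc += p
        if u < acc:
            return j
    return j                      # guard against rounding at the boundary

trials, returns = 100_000, 0
for _ in range(trials):
    x = step(0)
    while x not in (0, 1):        # run until return to 0 or absorption at 1
        x = step(x)
    returns += (x == 0)
print(returns / trials)           # close to f_0 = 3/4, so state 0 is transient
```

As a check, first step decomposition gives 𝑓0 = 1/4 + (1/2)(1) = 3/4 here: the chain returns immediately with probability 1/4, is absorbed at state 1 with probability 1/4, and from state 2 return to 0 is certain.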

5.2.4 Positive recurrence and null recurrence

In fact, there are two types of recurrent state. Recall that the first passage time

(also called the recurrence time) 𝑇𝑖𝑖 of a state 𝑖 is the number of steps until

the chain first returns to state 𝑖 given that it starts in state 𝑖.

Let 𝜇𝑖 = E(𝑇𝑖𝑖 ) be the mean recurrence time of state 𝑖. We have already

seen that if 𝑖 is transient then 𝜇𝑖 = ∞. We have not computed 𝜇𝑖 when 𝑖 is

recurrent. In fact, the value of 𝜇𝑖 when 𝑖 is recurrent enables us to distinguish

between two types of recurrence: positive recurrence and null recurrence.

A recurrent state 𝑖 is

• positive recurrent if 𝜇𝑖 < ∞ (that is, 𝜇𝑖 is finite),
• null recurrent if 𝜇𝑖 = ∞ (that is, 𝜇𝑖 is infinite).
At first glance it may seem strange that a state 𝑖 can be null recurrent: return
to state 𝑖 is certain, but the expected time (number of steps) to return to 𝑖 is
infinite. In other words, the random variable 𝑇𝑖𝑖 is finite with probability 1, but
its mean, E(𝑇𝑖𝑖 ), is infinite. Such states do exist and we will see examples later
in the course.
5.3 Periodicity
The period of a state 𝑖 is the greatest common divisor of the set of integers
𝑛 ≥ 1 such that 𝑝𝑖𝑖^(𝑛) > 0, that is,

𝑑𝑖 = gcd{𝑛 ≥ 1 ∶ 𝑝𝑖𝑖^(𝑛) > 0}.

(A short computational sketch follows the bullet points below.)

• If 𝑑𝑖 = 1 then state 𝑖 is aperiodic.

• If 𝑝𝑖𝑖^(𝑛) = 0 for all 𝑛 ≥ 1 (i.e. return to state 𝑖 is impossible) then, by
convention, state 𝑖 is also said to be aperiodic.
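For a small chain the gcd in this definition is easy to compute by scanning powers of 𝑃. A minimal sketch, assuming a finite chain and that scanning the first 50 powers is enough:

```python
import math
import numpy as np

def period(P, i, n_max=50):
    """gcd of {n >= 1 : p_ii^(n) > 0}, scanning up to n_max steps.
    Returns 0 if no return was seen (aperiodic by convention)."""
    d, Pn = 0, np.eye(len(P))
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 1e-12:
            d = math.gcd(d, n)
    return d

flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])    # deterministic two-state flip
print(period(flip, 0))           # 2: return is possible only at even times
```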

5.4 Class properties

A class property is a property which is shared by all the states in an

irreducible class. If it can be shown that one of the states in a class has a

certain class property then all the states in the irreducible class must also have

that same property.

I give sketch proofs for some of the following results relating to class properties.

The other proofs can be found in any good textbook on Markov processes.

Theorem 5.1. A pair of intercommunicating states are either both recurrent

or both transient. That is, recurrence/transience is a class property. It is not

possible to have a mixture of recurrent and transient states in an irreducible

class.

Proof. Look in ‘Introduction to Probability Models’ (Sheldon Ross).

Theorem 5.2. Null recurrence is a class property.

Proof. Recall that 𝜇𝑖 = 𝐸(𝑇𝑖𝑖) is the expected first passage time or the mean
recurrence time of state 𝑖, and that

𝜇𝑖 is { finite    if 𝑖 is positive recurrent,
      { infinite  if 𝑖 is null recurrent or transient.

It can be shown (proof omitted) that

𝑖 aperiodic ⇒ 𝑝𝑖𝑖^(𝑛) → 1/𝜇𝑖 as 𝑛 → ∞,
𝑖 periodic, period 𝑑 ⇒ 𝑝𝑖𝑖^(𝑛𝑑) → 𝑑/𝜇𝑖 as 𝑛 → ∞ (and 𝑝𝑖𝑖^(𝑚) = 0 if 𝑚 is not a multiple of 𝑑).

So, as 𝑛 → ∞,

• if 𝑖 is aperiodic, positive recurrent then 𝑝𝑖𝑖^(𝑛) → 1/𝜇𝑖 (positive, finite),

• if 𝑖 is periodic, positive recurrent then 𝑝𝑖𝑖^(𝑛𝑑) → 𝑑/𝜇𝑖 (positive, finite),

• if 𝑖 is null recurrent then 𝑝𝑖𝑖^(𝑛) → 0,

• if 𝑖 is transient then (proof omitted) ∑_{𝑛=0}^{∞} 𝑝𝑖𝑖^(𝑛) < ∞, and 𝑝𝑖𝑖^(𝑛) → 0.
Now let 𝑖 ↔ 𝑗, with 𝑝𝑖𝑗^(𝑛) > 0 and 𝑝𝑗𝑖^(𝑚) > 0. Suppose that 𝑖 is null recurrent. Then

𝑝𝑖𝑖^(𝑚+𝑘+𝑛) ≥ 𝑝𝑖𝑗^(𝑛) 𝑝𝑗𝑗^(𝑘) 𝑝𝑗𝑖^(𝑚).

As 𝑘 → ∞,

𝑝𝑖𝑖^(𝑚+𝑘+𝑛) → 0

(because 𝑖 is null recurrent), and so

𝑝𝑗𝑗^(𝑘) → 0

(because 𝑝𝑖𝑗^(𝑛) > 0 and 𝑝𝑗𝑖^(𝑚) > 0). So 𝑗 is either transient or null recurrent. But 𝑖 is
recurrent and 𝑖 ↔ 𝑗, so 𝑗 must be recurrent as well (because recurrence/transience
is a class property). Hence 𝑗 must be null recurrent.

Theorem 5.3. Positive recurrence is a class property.

Proof. Suppose that 𝑖 ↔ 𝑗 and that 𝑖 is positive recurrent. Then 𝑗 must be

recurrent (because recurrence is a class property). If 𝑗 is null recurrent then

𝑖 is also null recurrent. This is a contradiction. Hence 𝑗 must be positive

recurrent.

Theorem 5.4. Intercommunicating states have the same period. That is, periodicity is a class property.

Proof. Let 𝑖 and 𝑗 be states with 𝑖 ↔ 𝑗. Let 𝑑𝑖 be the period of 𝑖 and 𝑑𝑗 be the
period of 𝑗. Let 𝑛 and 𝑚 be such that 𝑝𝑖𝑗^(𝑛) > 0 and 𝑝𝑗𝑖^(𝑚) > 0. Then

𝑝𝑗𝑗^(𝑛+𝑚) ≥ 𝑝𝑗𝑖^(𝑚) 𝑝𝑖𝑗^(𝑛) > 0,

because there is at least one way of returning to 𝑗 in 𝑛 + 𝑚 steps: 𝑗 to 𝑖 in 𝑚
steps and then 𝑖 to 𝑗 in a further 𝑛 steps. Therefore,

𝑑𝑗 | 𝑛 + 𝑚, (𝑑𝑗 is a divisor of 𝑛 + 𝑚).

Let 𝑘 be such that 𝑝𝑖𝑖^(𝑘) > 0. Then

𝑝𝑗𝑗^(𝑛+𝑘+𝑚) ≥ 𝑝𝑗𝑖^(𝑚) 𝑝𝑖𝑖^(𝑘) 𝑝𝑖𝑗^(𝑛) > 0,

so

𝑑𝑗 | 𝑛 + 𝑘 + 𝑚, (𝑑𝑗 is a divisor of 𝑛 + 𝑘 + 𝑚).

These together imply that

𝑑𝑗 | 𝑘, (𝑑𝑗 is a divisor of 𝑘).

That is, 𝑑𝑗 is a factor of any 𝑘 in {𝑘 ≥ 1 ∶ 𝑝𝑖𝑖^(𝑘) > 0}.

However, by definition 𝑑𝑖 is the greatest common divisor of such 𝑘. Therefore,
𝑑𝑗 ≤ 𝑑𝑖. Reversing the roles of 𝑖 and 𝑗 gives 𝑑𝑖 ≤ 𝑑𝑗. Hence 𝑑𝑖 = 𝑑𝑗.

Summary

The following are class properties:

• Null recurrence;

• Positive recurrence;

• Transience;

• Periodicity.

5.5 Further properties of discrete time Markov chains

Theorem 5.5. A finite irreducible Markov chain cannot be null recurrent.

Proof. Recall that 𝑝𝑗^(𝑛) = P(𝑋𝑛 = 𝑗). Suppose that the chain is null recurrent.
It can be shown that 𝑝𝑗^(𝑛) → 0 as 𝑛 → ∞ for all 𝑗 ∈ 𝑆.

We know that

∑_{𝑗∈𝑆} 𝑝𝑗^(𝑛) = 1 for all 𝑛,

so

lim_{𝑛→∞} ∑_{𝑗∈𝑆} 𝑝𝑗^(𝑛) = 1.

But 𝑆 is finite so if the chain is null recurrent

lim_{𝑛→∞} ∑_{𝑗∈𝑆} 𝑝𝑗^(𝑛) = ∑_{𝑗∈𝑆} lim_{𝑛→∞} 𝑝𝑗^(𝑛) = 0.

This is a contradiction, so the Markov chain cannot be null recurrent.


Theorem 5.6. A finite Markov chain cannot contain any null recurrent states.

Proof. Suppose that the chain contains a state 𝑖 which is null recurrent. Then,

since null recurrence is a class property, state 𝑖 is in some irreducible finite closed

class of null recurrent states. This is not possible (see above).

Theorem 5.7. It is not possible for all states in a finite state space Markov

chain to be transient.

Note that this implies that all states in a finite irreducible Markov chain are

recurrent (because recurrence is a class property). That is, finite and irreducible

⇒ recurrent Markov chain.

Proof. Suppose all states are transient and 𝑆 = {0, 1, … , 𝑀 }. A transient state

will be visited only a finite number of times. Therefore for each state 𝑖 there

exists a time 𝑇𝑖 after which 𝑖 will never be visited again. Therefore, after time

𝑇 = max{𝑇0 , … , 𝑇𝑀 } no state will be visited again. BUT the Markov chain

must be in some state. Therefore we have a contradiction. Therefore, at least

one state must be recurrent.

5.6 Closed classes and absorption

A class 𝐶 of intercommunicating states is closed if 𝑝𝑖𝑗 = 0 for all 𝑖 ∈ 𝐶 and 𝑗 ∉

𝐶, that is, once the chain gets into 𝐶, it never leaves 𝐶.

If a single state forms a closed class then it is called an absorbing state.

Note the following important observations:

• A finite closed class must be recurrent: once entered, the class behaves

like a finite irreducible Markov chain.

• An absorbing state is positive recurrent: its mean recurrence time is 1.

Theorem 5.8. An irreducible class 𝐶 of recurrent states is closed.

Proof. Suppose that 𝐶 is not closed. Then there exist 𝑖 ∈ 𝐶 and 𝑗 ∉ 𝐶 such

that 𝑖 → 𝑗. But 𝑗 does not communicate with 𝑖, otherwise 𝑖 and 𝑗 would

intercommunicate and 𝑗 would be in 𝐶. Thus, there is positive probability that

the chain leaves 𝑖 and never returns to 𝑖. This means that 𝑖 is transient. This

is a contradiction, as 𝑖 ∈ 𝐶 and 𝐶 is recurrent.

5.6.1 Decomposition of the states of a Markov chain

Putting the previous results together shows that the state space 𝑆 of a Markov

chain can be decomposed as

𝑆 = 𝑇 ∪ 𝐶𝑚1 ∪ 𝐶𝑚2 ∪ ⋯ ,

where

• 𝑇 is a set of transient states;

• 𝐶𝑚1 , 𝐶𝑚2 , … are irreducible closed classes of recurrent states.

Each 𝐶𝑚𝑖 is either null recurrent or positive recurrent, and all states in a particular 𝐶𝑚𝑖 have the same period (different 𝐶𝑚𝑖 ’s can have different periods).

The following observations apply:

• If 𝑋0 ∈ 𝐶𝑚𝑖 then the chain stays in 𝐶𝑚𝑖 forever.

• If 𝑋0 ∈ 𝑇 then either:

– the chain always stays in 𝑇 (only possible if the set 𝑇 of states is

infinite); or

– the chain is eventually absorbed in one of the 𝐶𝑚𝑖 ’s (where it stays

forever).

• If 𝑆 is finite then it is impossible for the chain to remain forever in the

(finite) set 𝑇 of transient states.

• If 𝑆 is finite then at least one state must be visited infinitely often and so

there must be at least one recurrent state.

We also know that if 𝑆 is finite then there are no null recurrent states. The
consequences of this are:

• If a Markov chain has a finite state space then there must be at least one

positive recurrent state.

• If a Markov chain has a finite state space and is irreducible, then it must

be positive recurrent.

• A finite closed irreducible class must be positive recurrent.


Part VI

Week 4


Chapter 6

Long-run behaviour of

discrete time Markov

processes

This is the crux of the first half of the course: understanding a Markov chain’s

behaviour in the long run (limiting behaviour).

Our goal here is to find whether the distribution of states, 𝑝(𝑛) , settles down

(i.e. converges) as 𝑛 → ∞. If it does, what is this limit?

We now have enough tools at our disposal. It will turn out that to fully determine its long term behaviour we must classify the Markov chain in terms

of:

• Irreducible classes;

• Transience and positive/null recurrence;

• Periodicity.

Remember: what we’re interested in is

lim_{𝑛→∞} 𝑝(𝑛)

For example, if 𝑆 = {1, 2, 3, 4}, then we’re interested in how the vector

𝑝(𝑛) = (𝑃 (𝑋𝑛 = 1), 𝑃 (𝑋𝑛 = 2), 𝑃 (𝑋𝑛 = 3), 𝑃 (𝑋𝑛 = 4))

changes as we let 𝑛 → ∞. Does this vector settle down as 𝑛 → ∞ (does it
have a limit)? If it does then this is good news: we can use this information to
predict the future behaviour of the chain because we know what 𝑃(𝑋𝑛 = 𝑖) is
for large 𝑛. This allows us to answer questions such as “what’s the probability that


the chain is in state 2 at some large time”. While this might not sound all that
interesting, it opens the door to answering other questions about long-term
behaviour.

6.1 Invariant distributions

We will start with a somewhat simpler question than that of limiting behaviour:

are there particular initial distributions for a Markov chain, such that when the

Markov chain starts in any of these distributions, we can determine the limit of

𝑝(𝑛) as 𝑛 → ∞?

6.1.1 Definition

The probability row vector 𝜋 is an invariant distribution of a discrete-time

process {𝑋𝑡 , 𝑡 ∈ 𝑇 } if

𝜋 = 𝜋 𝑃.

If 𝜋 is an invariant distribution, and the initial distribution of the Markov chain

is 𝜋, why does this imply that we can study its long term behaviour?

The idea is that if the chain starts in 𝜋 (or equivalently, we find at some later

time that the distribution of states is 𝜋), then this tells us that the distribution

of states is 𝜋 forever: we’ve uncovered the long term behaviour of the chain in

this specific case.

This works as follows: suppose that an invariant distribution 𝜋 exists and that
𝑝(0) = 𝜋. Then

𝑝(1) = 𝑝(0) 𝑃 = 𝜋𝑃 = 𝜋;
𝑝(2) = 𝑝(1) 𝑃 = 𝜋𝑃 = 𝜋;
𝑝(3) = 𝑝(2) 𝑃 = 𝜋𝑃 = 𝜋;    etc.

In fact 𝑝(𝑛) = 𝜋 for 𝑛 = 1, 2, …, and we have found the limit of 𝑝(𝑛) as 𝑛 → ∞,
as required!

6.1.2 Consequences

If our initial distribution is 𝜋, with 𝜋 = 𝜋𝑃 , then the distribution of the states

at any given future time 𝑛 is also 𝜋. This is why it is called an invariant

distribution. An invariant distribution always exists if the state space 𝑆 is

finite, but it may not be unique. If 𝑆 is infinite then an invariant distribution

may not exist.

Assuming that the transition matrix 𝑃 is a 𝑘 × 𝑘 matrix, and the state space is

𝑆 = {1, 2, …, 𝑘}, solving 𝜋 = 𝜋𝑃 is a matter of solving a set of 𝑘 simultaneous

equations, each of the form

𝜋𝑖 = 𝜋1 𝑝1𝑖 + 𝜋2 𝑝2𝑖 + … + 𝜋𝑘 𝑝𝑘𝑖 .


If you try this you will find that one equation is redundant (try it!): we are
therefore trying to solve for 𝑘 unknowns (𝜋1, …, 𝜋𝑘) using (𝑘 − 1) equations.
These equations alone cannot determine the 𝜋𝑖 uniquely, but we have the
advantage that since 𝜋 is itself a probability distribution, we must have
𝜋1 + … + 𝜋𝑘 = 1. This added requirement means that we can now solve for
𝜋1, …, 𝜋𝑘.
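In matrix form this is easy to hand to a linear solver: drop one of the redundant equations and replace it with the normalisation 𝜋1 + … + 𝜋𝑘 = 1. A minimal sketch, using the two-state weather chain that appears in Example 6.1 below:

```python
import numpy as np

P = np.array([[0.6, 0.4],
              [0.5, 0.5]])       # weather chain from Example 6.1

# pi = pi P is (P^T - I) pi^T = 0; one equation is redundant, so
# replace it with the normalisation sum(pi) = 1.
k = len(P)
A = np.vstack([(P.T - np.eye(k))[:-1], np.ones(k)])
b = np.zeros(k); b[-1] = 1
pi = np.linalg.solve(A, b)
print(pi)                        # [0.5556 0.4444], i.e. (5/9, 4/9)
```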

6.1.3 How many invariant distributions?

How many invariant distributions can a discrete time Markov process have?

To find an invariant distribution, if one exists, we solve a set of simultaneous

linear equations. Given such a set there are either:

• No solutions – no invariant distribution exists;

• Exactly one solution – exactly one invariant distribution exists;

• An infinite number of solutions – there are an infinite number of invariant

distributions.

If 𝑆 is irreducible then:

An invariant exists ⟺ all states are positive recurrent.

If the invariant distribution exists under these conditions, it must be unique.

Notes:

1. If 𝑆 is irreducible, then all states are of the same type.

2. If 𝑆 is finite (and irreducible), then all states must be positive recurrent,

so an invariant must exist.

3. If 𝑆 is infinite (and irreducible), then all states must be of the same type,

but they aren’t necessarily positive recurrent.

If 𝑆 is NOT irreducible then:

• Finite S:

Chain is guaranteed to enter a positive recurrent class eventually, so invariant exists.

• Infinite S:

– If there exists a positive recurrent class, then there must be an invariant distribution.

– If there does not exist a positive recurrent class, then there cannot
be an invariant distribution.

In any case, if there are two or more positive recurrent classes, then an infinite

number of invariant distributions exist.


6.2 Equilibrium distributions

Simply looking at invariant distributions in order to determine a Markov

chain’s long term behaviour is very restrictive. We want a far more general

idea of long term behaviour. Can we determine whether 𝑝(𝑛) converges as

𝑛 → ∞, regardless of the initial distribution? If this limit exists, what is it?

Example 6.1. Recall the weather example (Exercise 3.2), with state space
𝑆 = {𝑅, 𝐹} and transition matrix 𝑃,

𝑃 = ( 0.6  0.4 )
    ( 0.5  0.5 ) .

We examine how 𝑃(𝑛) = 𝑃^𝑛, the 𝑛-step transition matrix, changes as 𝑛 increases.

𝑃^2 = ( 0.56  0.44 ) ,    𝑃^3 = ( 0.556  0.444 ) ,
      ( 0.55  0.45 )            ( 0.555  0.445 )

𝑃^6 = 𝑃^3 ⋅ 𝑃^3 = ( 0.555556  0.444444 )
                  ( 0.555555  0.444445 ) .
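You can reproduce these powers, and push 𝑛 further, with a couple of lines of Python:

```python
import numpy as np

P = np.array([[0.6, 0.4],
              [0.5, 0.5]])
for n in (2, 3, 6, 20):
    print(n, np.linalg.matrix_power(P, n).round(6))
# By n = 20 both rows agree with (5/9, 4/9) to six decimal places.
```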

Looking at different initial distributions for this Markov chain and what happens
in the ‘long run’ from these starting distributions:

Initial distribution | Distribution of 𝑋1 | Distribution of 𝑋2 | … | Distribution of 𝑋6
(0, 1)      | (0.5, 0.5)   | (0.55, 0.45)   | … | (0.55556, 0.44444)
(1, 0)      | (0.6, 0.4)   | (0.56, 0.44)   | … | (0.55556, 0.44444)
(0.7, 0.3)  | (0.57, 0.43) | (0.557, 0.443) | … | (0.55556, 0.44444)
(0.5, 0.5)  | (0.55, 0.45) | (0.555, 0.445) | … | (0.55556, 0.44444)

Notice that as 𝑛 → ∞, 𝑃 (𝑛) is tending to a limiting matrix whose rows are

identical to each other, and 𝑝(𝑛) seems to be tending to a limiting probability

distribution which is the same regardless of the initial distribution. This
is a very important phenomenon which we describe next.

6.2.1 Definition

A probability row vector 𝜋 = {𝜋𝑗 , 𝑗 ∈ 𝑆} is an equilibrium distribution for the

discrete-time Markov chain {𝑋𝑛 } if

𝑝(𝑛) ⟶ 𝜋

as 𝑛 ⟶ ∞,

independently of the initial distribution 𝑝(0) (i.e. for any initial distribution, the

long-run distribution of 𝑋𝑛 as 𝑛 → ∞ is the same).

6.2.2 Consequences

When an equilibrium distribution exists,

𝜋𝑗 = lim_{𝑛→∞} 𝑝𝑖𝑗^(𝑛), for all 𝑖, 𝑗 ∈ 𝑆.

Notice what this says: 𝜋 is an equilibrium distribution if 𝑝(𝑛) → 𝜋 as 𝑛 → ∞

regardless of the start distribution, 𝑝(0) .

It is important to note that there cannot be two or more of these limits: either

such a limit does not exist, or exactly one exists. Think about this carefully

because it tells you that there is at most one equilibrium distribution!

The following interpretations of the equilibrium distribution, 𝜋, are important:

• 𝜋𝑗 is the limiting probability that the chain is in state 𝑗 at time 𝑛.

• It can be shown that 𝜋𝑗 is also equal to the long-run proportion of time
that the chain spends in state 𝑗 (illustrated in the simulation sketch below).
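A quick simulation makes the ‘long-run proportion of time’ interpretation concrete. A sketch, again using the weather chain of Example 6.1:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.6, 0.4],
              [0.5, 0.5]])       # weather chain from Example 6.1

x, visits, n = 0, 0, 200_000
for _ in range(n):
    x = rng.choice(2, p=P[x])    # take one step of the chain
    visits += (x == 0)
print(visits / n)                # close to pi_0 = 5/9, whatever the start state
```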

6.3 Invariant vs equilibrium distributions

We know that 𝑝(𝑛) = 𝑝(𝑛−1) 𝑃 . If this chain has an equilibrium distribution then

𝑝(𝑛) ⟶ 𝜋 as 𝑛 → ∞, and 𝜋 is the equilibrium distribution. Therefore, letting

𝑛 → ∞ we have 𝜋 = 𝜋𝑃 , so that 𝜋 is also an invariant distribution.

Though the concepts of an invariant distribution and an equilibrium distribution

are therefore closely related, they are NOT the same thing: an equilibrium distribution must also be an invariant distribution, BUT an invariant distribution

is not necessarily an equilibrium distribution.

• An equilibrium implies the existence of an invariant;

• An invariant does NOT guarantee the existence of an equilibrium.

Invariant distribution 𝜋:

• If the chain starts in (or equivalently, gets to) 𝜋, the distribution of states

at all further times is also 𝜋.

Equilibrium distribution 𝜋:

• If the chain is run for long enough (i.e. forever) its probabilistic behaviour

settles down to that of 𝜋.

• This is regardless of the state in which the chain starts.

6.3.1 Important consequences

• No invariant distribution ⇒ no equilibrium distribution.


• Exactly 1 invariant distribution ⇒ there may or may not be an equilibrium

distribution.

• Two or more invariant distributions ⇒ no equilibrium distribution.

Some other important things to note:

• Not all Markov chains have equilibrium distributions.

• If a Markov chain has an equilibrium distribution, then it is unique (cannot

have more than one equilibrium distribution).

• If a Markov chain has an equilibrium distribution, then it also has an

invariant distribution and these distributions are identical.

• If a Markov chain does not have an equilibrium distribution, then it may

or may not have an invariant distribution.

• A Markov chain can have any number of invariant distributions (including

no invariant distribution).

• It is possible for a Markov chain to have more than one invariant distribution but not have an equilibrium distribution.

• It is possible for a Markov chain to have neither an invariant distribution

nor an equilibrium distribution.

6.4 Ergodicity and the equilibrium distribution

An ergodic state is one that is both positive recurrent and aperiodic. The

existence of an equilibrium distribution depends on ergodicity.

Theorem 6.1. An equilibrium distribution exists if and only if there is an

ergodic class 𝐶 and the Markov chain is certain to be absorbed into 𝐶 eventually,

wherever it starts.

We won’t be proving this result. Notice that it gives us a checklist for existence

of an equilibrium distribution:

1. There is a single ergodic class;

2. The process is guaranteed to enter the single ergodic class wherever it

starts.

6.4.1 A special case

What happens when {𝑋𝑛 } is an irreducible ergodic Markov chain? Applying

the theorem:


1. There is a single ergodic class (by definition in this case);

2. We’re guaranteed to enter the ergodic class (because it’s irreducible, we

start there and always stay there!)

Therefore an equilibrium distribution exists.

Without loss of generality we assume that 𝑆 = {0, 1, 2, …}. Then in this special
case (we won’t prove this) the equilibrium distribution 𝜋 satisfies

𝜋𝑗 = 1/𝜇𝑗 , for all 𝑗 ∈ 𝑆,

where 𝜇𝑗 is the mean recurrence time of state 𝑗.

This provides us with an alternative way to show that a Markov chain is positive

recurrent: if the chain is irreducible, aperiodic and has an invariant distribution
(that is, there exists a 𝜋 such that 𝜋 = 𝜋𝑃) then it must be positive recurrent.

Therefore, an irreducible aperiodic Markov chain has an invariant distribution

if and only if the chain is positive recurrent.

6.4.2 Summary

When does an equilibrium distribution exist?

• More than 1 closed class ⇒ no equilibrium distribution exists (the long-term behaviour of the chain depends on 𝑝(0)).

• No closed classes (i.e. all states are transient) ⇒ no equilibrium distribution exists.

• If there exists exactly one closed class 𝐶, is it certain that the Markov

chain will eventually be absorbed into 𝐶?

– If “no” then no equilibrium distribution exists (the long-term behaviour of the chain depends on 𝑝(0) ).

– If “yes” then an equilibrium distribution exists if and only if 𝐶 is

ergodic.


Part VII

Week 5


Chapter 7

Checking the Markov

property and solving

problems

7.1 The idea of ‘first step decomposition’

Recall from Section 4.3.1 that the ‘first step decomposition’ technique is a very

useful method for computing certain expectations and probabilities associated

with Markov chains. You have already seen how to use this idea to compute

probabilities. Here we’ll use the same idea to compute expectations that would

otherwise be tricky to calculate.

The main tool will be the use of the law of conditional (iterated) expectation to

find expectations of interest:

𝐸[𝑋] = 𝐸 [ 𝐸[𝑋|𝑌 ] ]

7.1.1 An example

Roads connecting three locations in a city are shown in the following diagram,
the time taken to cycle along any of the roads being 1 minute. Geraint is a
bicycle courier whose office is located at location A. He has a parcel that he

needs to deliver to location C. However, he is a law abiding cyclist who will only

follow the direction of travel permitted on each road, as indicated by the arrows.

Whenever he comes to a junction, he selects any of the routes available to him

with probabilities as given in the diagram below.

• How long, on average, will it take Geraint to deliver the parcel?

• How many times, on average, will Geraint visit location B before delivering

the parcel?


Figure 7.1: State-space diagram

There are many ways of solving these questions, but a convenient method is to

use ‘first step decomposition’. The idea is the same as you saw in Section 4.3.1:

think about ‘where do we go next’?

It’s important that you can ‘spot’ the types of situations where first step decomposition can be applied.

Example 7.1. How long, on average, will it take Geraint to deliver the parcel?

We’ll apply first step decomposition to solve this question. Let 𝑋𝑛 denote

Geraint’s location at time 𝑛, and let 𝑇 denote the time it takes for Geraint to

deliver the parcel to location C.

We need to calculate 𝐸[𝑇 ], and we will do this by conditioning on Geraint’s

next move. In the first instance, he starts in location A, so 𝑋0 = 𝐴. Using the

iterated law of conditional expectation, we get:

𝐸[𝑇 ] = 𝐸[𝐸[𝑇 |𝑋1 , 𝑋0 = 𝐴]] = 𝐸[𝑇 |𝑋1 = 𝐵, 𝑋0 = 𝐴]𝑝𝐴𝐵 +𝐸[𝑇 |𝑋1 = 𝐶, 𝑋0 = 𝐴]𝑝𝐴𝐶 .

Substituting in the relevant transition probabilities we get:

𝐸[𝑇] = (2/3)𝐸[𝑇|𝑋1 = 𝐵, 𝑋0 = 𝐴] + (1/3)𝐸[𝑇|𝑋1 = 𝐶, 𝑋0 = 𝐴].

We don’t really need to condition on both 𝑋1 and 𝑋0 (remember that the
process is Markov!) so we can write:

𝐸[𝑇] = (2/3)𝐸[𝑇|𝑋1 = 𝐵] + (1/3)𝐸[𝑇|𝑋1 = 𝐶].

Now notice that 𝐸[𝑇|𝑋1 = 𝐶] = 1 because we are conditioning on Geraint
reaching location C for the first time at time 1, so this simplifies to

𝐸[𝑇] = (2/3)𝐸[𝑇|𝑋1 = 𝐵] + 1/3.        (7.1)

Now compute 𝐸[𝑇|𝑋1 = 𝐵] by applying first step decomposition again: from
B, where can Geraint go next?

𝐸[𝑇|𝑋1 = 𝐵] = 𝐸[𝐸[𝑇|𝑋2, 𝑋1 = 𝐵]]
            = (1/2)𝐸[𝑇|𝑋2 = 𝐶, 𝑋1 = 𝐵] + (1/2)𝐸[𝑇|𝑋2 = 𝐴, 𝑋1 = 𝐵].


Now notice that 𝐸[𝑇|𝑋2 = 𝐶, 𝑋1 = 𝐵] = 2 and 𝐸[𝑇|𝑋2 = 𝐴, 𝑋1 = 𝐵] =
2 + 𝐸[𝑇] (the latter because Geraint used 2 minutes going to B and then back
to A, and once back in A it is as if the process starts over from scratch).
Therefore 𝐸[𝑇|𝑋1 = 𝐵] = 1 + (1/2)(2 + 𝐸[𝑇]). Substituting this back into (7.1),
we have:

𝐸[𝑇] = (2/3)𝐸[𝑇|𝑋1 = 𝐵] + 1/3 = (1/3)𝐸[𝑇] + 5/3.

Solving gives 𝐸[𝑇 ] = 2.5 minutes.

You may want to consider simplifying the notation of the solution to Example 7.1
above, by defining new quantities. For example, let 𝐿𝑖 denote the expected time
to reach C, given that we start in state 𝑖. Then, the solution above simplifies to

𝐿𝐴 = 1 + (2/3)𝐿𝐵 + (1/3)𝐿𝐶 .        (7.2)

We must add 1 since we are using one minute to move from state A to another
state. Now we know that 𝐿𝐶 = 0 (if we start in C, it takes no time to get
there!), and 𝐿𝐵 = 1 + 𝐿𝐴/2 + 𝐿𝐶/2 = 1 + 𝐿𝐴/2. Therefore, substituting into
(7.2), we need to solve

𝐿𝐴 = 1 + (2/3)(1 + 𝐿𝐴/2) = (1/3)𝐿𝐴 + 5/3.

Solving gives 𝐿𝐴 = 2.5 minutes, as before.
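The simplified system in 𝐿𝐴 and 𝐿𝐵 (with 𝐿𝐶 = 0) is again just a small linear solve. A sketch, using only the transitions that appear in the worked solution above:

```python
import numpy as np

# L_A = 1 + (2/3) L_B  and  L_B = 1 + (1/2) L_A, with L_C = 0.
A = np.array([[1.0, -2/3],
              [-1/2, 1.0]])
b = np.ones(2)
L_A, L_B = np.linalg.solve(A, b)
print(L_A, L_B)                  # 2.5 and 2.25 minutes
```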

Note that exactly the same technique – first step decomposition – can also be

applied to solve problems for continuous time Markov processes (which we will

cover later in the course).

7.2 How can I tell if a process is Markov?

You will come across many ICA and exam questions asking you to explain

whether or not a particular process is Markov. There are two ways that you

could establish whether a process is Markov.

1. Without using any mathematics, deduce that the process is or isn’t Markov
logically.

2. Checking (mathematically) that the Markov property holds for all states
in the state space at all times.

The second strategy is particularly hard, so we mainly use it when we suspect
that the Markov property does not hold, by finding a counterexample which does
not satisfy the Markov property. To show that the Markov property does not
hold, find one set of states 𝑎, 𝑏, 𝑐 and perhaps 𝑑 such that one of the following
holds (you do not need to show both hold – one will do; a simulation sketch
follows the list):


• 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑐) ≠ 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏) for some 𝑎, 𝑏, 𝑐;

• 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑐) ≠ 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑑) for

some 𝑎, 𝑏, 𝑐, 𝑑 with 𝑐 ≠ 𝑑….
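To see what such a counterexample looks like numerically, here is a sketch with a process that is deliberately not Markov (this construction is my own illustration, not one of the course examples): the 𝑋𝑛 are i.i.d. fair coin flips and 𝑌𝑛 = 𝑋𝑛 + 𝑋𝑛−1. Knowing 𝑌𝑛−1 on top of 𝑌𝑛 changes the conditional distribution of 𝑌𝑛+1:

```python
import random
from collections import Counter

random.seed(1)
X = [random.randint(0, 1) for _ in range(500_000)]
Y = [a + b for a, b in zip(X, X[1:])]      # Y_n = X_n + X_{n-1}: not Markov

counts = Counter()
for y_prev, y_now, y_next in zip(Y, Y[1:], Y[2:]):
    counts[(y_prev, y_now, y_next)] += 1

def cond(a, b, c):
    """Empirical P(Y_{n+1} = a | Y_n = b, Y_{n-1} = c)."""
    total = sum(v for (p, q, _), v in counts.items() if p == c and q == b)
    return counts[(c, b, a)] / total

print(cond(2, 1, 0))   # about 0.5: here Y_n = 1 forces X_n = 1
print(cond(2, 1, 2))   # exactly 0:  here Y_n = 1 forces X_n = 0
```

The two estimates differ for the same present state 𝑏 = 1 but different pasts 𝑐 = 0 and 𝑑 = 2, which is exactly the second bullet point above.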
