# STAT 0007 Discrete Time Markov Process Questions

STAT0007: Introduction to Applied Probability (Level 5) or Stochastic Processes (Level 6)
Dr. Elinor M Jones
## Contents

**Part I: Week 0**

- 1 Overview of STAT0007
  - 1.1 Content of the course
  - 1.2 How this course will be run
  - 1.3 Useful books

**Part II: Week 1**

- 2 Preamble
  - 2.1 Mathematical/Statistical Writing

**Part III: Week 1 (Continued)**

- 3 Preliminaries
  - 3.1 What is a stochastic process?
  - 3.2 Random variables
  - 3.3 Conditioning on events
  - 3.4 Generating functions

**Part IV: Week 2**

- 4 Introduction to discrete time Markov processes
  - 4.1 The Markov property
  - 4.2 Introduction to discrete time Markov chains
  - 4.3 The idea of ‘first step decomposition’

**Part V: Week 3**

- 5 Classifying discrete time Markov processes
  - 5.1 Irreducible classes of intercommunicating states
  - 5.2 Recurrence and transience
  - 5.3 Periodicity
  - 5.4 Class properties
  - 5.5 Further properties of discrete time Markov chains
  - 5.6 Closed classes and absorption

**Part VI: Week 4**

- 6 Long-run behaviour of discrete time Markov processes
  - 6.1 Invariant distributions
  - 6.2 Equilibrium distributions
  - 6.3 Invariant vs equilibrium distributions
  - 6.4 Ergodicity and the equilibrium distribution

**Part VII: Week 5**

- 7 Checking the Markov property and solving problems
  - 7.1 The idea of ‘first step decomposition’
  - 7.2 How can I tell if a process is Markov?

**Part VIII**

- 8 Preliminaries for continuous time Markov processes
  - 8.1 Introduction to continuous-time Markov processes
  - 8.2 The exponential distribution and continuous time Markov processes

**Part IX: Week 6**

- 9 Holding times, jump chains and transition probabilities for continuous time Markov processes
  - 9.1 Holding times
  - 9.2 The jump chain
  - 9.3 Transition probabilities

**Part X: Week 7**

- 10 The generator matrix and long run behaviour
  - 10.1 Kolmogorov’s equations
  - 10.2 The generator matrix, 𝑄
  - 10.3 Limiting behaviour

**Part XI: Week 8**

- 11 Poisson processes
  - 11.1 The Poisson process
  - 11.2 The generator matrix for the Poisson process
  - 11.3 Other properties of the Poisson process
  - 11.4 Superposition and thinning of a Poisson process

**Part XII: Week 9**

- 12 Birth-death processes part 1
  - 12.1 Birth and death processes
  - 12.2 Transition rates and the jump chain
  - 12.3 Important types of birth-death processes
# Part I: Week 0

## Chapter 1: Overview of STAT0007

### 1.1 Content of the course
This module aims to provide an introduction to the study of systems which
change state stochastically with time and to facilitate the development of skills
in the application of probabilistic ideas. It is primarily intended for second and
third year students registered on the undergraduate degree programmes offered
by the Department of Statistical Science, or jointly with other departments.
On successful completion of the module a student should understand the Markov
property in discrete and continuous time; for discrete-time Markov chains, be
able to find and classify the irreducible classes of intercommunicating states,
calculate absorption or first passage times and probabilities, assess the equilibrium behaviour; for simple examples of continuous-time Markov chains, be
able to write down the forward equations, find and interpret the equilibrium
distribution.
Stochastic processes are vital to applications in finance and insurance, and have
many applications in biology and medicine, and in the social sciences. They
also play a fundamental role in areas such as queueing theory and the study
of system reliability. The material in this module can be applied to simplified
real-world situations, and provides the foundations for further study of more
complex systems.
The course will start by briefly revising conditional probability (however, you
are expected to be well-versed in this before starting the course). We’ll then
move on to look at Markov chains (discrete time and states), and in particular
long-run behaviour. The second half of the course looks at Markov processes
in continuous time, but still discrete state space, and again our ultimate aim
is to study the long-run behaviour of these processes. We’ll look at two very
special types of continuous time processes: Poisson processes and birth-death
processes.
### 1.2 How this course will be run

#### 1.2.1
The course will be led by Dr. Elinor Jones, Associate Professor (Teaching) in
the Department of Statistical Science.
#### 1.2.2 Level 5 and 6 versions of the course
The course is offered at two levels, but the chances are you won’t be able to
choose which level (this generally depends on your year of study, etc.).
The content and teaching is identical between the two levels, but in the main
exam period level 6 students will complete a more challenging paper.
#### 1.2.3 How you’ll learn
You will work on the material in your own time with live sessions to complement
your learning. A typical weekly schedule will look like the following:
• Batch of work uploaded to Moodle each Monday at 9am.
• Finish the work by the following Monday.
• Each week, work will include:
– Watch videos;
– Read course notes;
– Complete guided exercises;
– Complete and submit homework each week (not in the first week);
– Quizzes;
– Other activities, as required.
You will also attend about three live sessions a week (exact number depends on
which week we’re in).
ONLINE large group sessions (all students attend):
• Tuesday 9-10am UK time.
• Friday 9-10am UK time.
These sessions typically include:
• Overview of material;
• Time for questions;
• Going through unseen questions.
FACE-TO-FACE small group sessions:
• Tutorial (attendance is monitored).
• All students must attend their allocated session.
#### 1.2.4 Weekly exercises
On Friday each week I will release an exercise sheet. Note: the exercise sheet is
not released at the start of the week.
• Submit by the following Thursday 10am UK time.
• Solutions will be available at 9am each Friday.
– Those with a SoRA will have until 12:01am on Friday (i.e. an additional 14 hours according to UCL guidelines) to submit.
• NO LATE SUBMISSIONS. This rule will be strictly adhered to: no
exceptions (but see note above for students with a SoRA).
• Your work will be marked and individual feedback provided for SOME
questions.
These weekly exercises will feed into the weekly tutorial (tutorials start in the
third week).
• You should have been allocated to a tutorial group.
• You cannot move tutorial group unless you have a timetable clash.
• If you need to change your tutorial group, please email Ms Karen Leport.
– Please don’t email me: I’ve no control over tutorial allocation.
During the tutorial, your tutor will go through the questions that were not
marked. You should ‘mark’ your own work while they do this.
You will then discuss an unseen question which is connected to the work you’ve
been doing. The question will be more involved than a typical homework question. You may be asked to do short tasks connected to the question, e.g.
• Discuss in small groups;
• Respond to live quiz questions.
Please take part: the tutorials are designed to benefit your learning.
#### 1.2.5 Weekly structure in more depth
Just to make sure you are clear on how the weekly workload will be managed:
Week 1

• Material released on Monday 10th Jan.
• Live session Tuesday 11th Jan (welcome session only).
• Live session Friday 14th Jan (Q&A for week 1 material).
• Exercise sheet 1 released Friday 14th Jan.
Week 2
• Material released on Monday 17th Jan.
• Live session Tuesday 18th Jan (Q&A for week 1 material, continued).
• Submit exercise sheet 1 by 10am on Thursday 20th Jan (unless you have
a SoRA).
• Live session Friday 21st Jan (Q&A for week 1 and 2 material).
• Exercise sheet 2 released Friday 21st Jan.
Week 3

• Material released on Monday 24th Jan.
• Live session Tuesday 25th Jan (Q&A for week 2 material, continued).
• Attend your allocated tutorial.
• Submit exercise sheet 2 by 10am on Thursday 27th Jan (unless you have a SoRA).
• Live session Friday 28th Jan (Q&A for week 2 and 3 material).
• Exercise sheet 3 released Friday 28th Jan.
And so on, and so forth. There will be some exceptions to this, e.g. when you
sit an ICA for the course.
#### 1.2.6 Getting help
You can get help in the following ways:
• Post to the Forum.
• Book an office hour slot.
• If your question is connected to the tutorial material, then ask your tutor
during the tutorial. Your tutor will not answer questions outside of your
tutorial slot.
Please note that I won’t answer questions by email (the exception to this is if
your question is more personal, e.g. organising additional time for your ICA).
#### 1.2.7 Assessment
Assessment is split into four components:
• ICA 1: Tuesday 8th February (7.5% weight)
– Short Moodle quiz.
– Takes place instead of Tuesday live session.
– Mixture of MCQs and calculation questions.
• ICA 2: Tuesday 22nd March (7.5% weight)
– Short Moodle quiz.
– Takes place instead of Tuesday live session.
– Mixture of MCQs and calculation questions.
• Participation (10% weight)
– Acceptable quality of weekly exercise submission.
– Peer exercise.
• Open book exam (75% weight)
– Sometime in main exam period.
– No choice of questions.
– Level 6 students will have more challenging questions.
### 1.3 Useful books
• Introduction to Probability Models by S. Ross;
• Probability and Random Processes by Grimmett and Stirzaker (more
mathematical… be warned).
# Part II: Week 1

## Chapter 2: Preamble

### 2.1 Mathematical/Statistical Writing
Mathematics is a language, and you must learn its grammar. Whether you
are handing in a tutorial sheet or writing an exam, you need to present your
work in an appropriate manner. You could start by thinking about the following:
• Think carefully about precisely what you are trying to calculate/ solve
and what information you have available in order to do so.
• Have you clearly defined all your notation? Be precise: are your definitions watertight?
• Is your notation appropriate?
• Does your solution flow logically, that is, does each step follow on logically
from the previous step? Again, be precise. The examiner is not a mind
reader. YOU need to guide the reader through your thought process.
• Have you explained your reasoning, not just written down some formulae?
• It is important to understand what you are doing; don’t just repeat a
list of steps. Always ask yourself: ‘do I know what I’m doing?’ and ‘how
would I explain this to a fellow student?’.
Being able to write clearly will help you to think clearly. It forces you to
understand the material: if you can’t write it well, you don’t understand it well
enough. Learning mathematics is not about repetition, as it was at A-Level,
and so past papers are NOT A GOOD REVISION TOOL. In particular, I write
ICA and exam questions that probe your understanding, and not your ability
to regurgitate methods. This reflects the fact that proper mathematics is not
about learning to apply a set of methods, but using logical reasoning to find
your way through a problem.
Some tips for writing in STAT0007 (based on previous cohorts of students):
• Random variables form the basis for most, if not every, question you will
encounter. Make sure you define each and every random variable you use.
All random variables should be denoted by CAPITAL letters.
• Probabilities should be written out in full – if you want to calculate 𝑃 (𝑋 =
1) for some random variable 𝑋, then write it as such. NEVER use notation
such as 𝑃 (1) to denote this. And, as the previous bullet point states,
writing 𝑃 (𝑥 = 1) is plain wrong.
• The ‘equals sign’ has a very specific meaning. Please don’t abuse it.
• Do a reality check on your answer. For example, a probability should
be between 0 and 1. An expectation of how many time units it takes to
reach a particular point should be at least 1 (assuming moving between
locations takes 1 unit of time, and that you’re not already there!).
• If your answer doesn’t contain written explanation, then it is not a complete solution. Think about guiding the marker through your thought process, rather than expecting the marker to second guess what you meant
to say.
• Use tutorial sheets to practise good mathematical writing.
# Part III: Week 1 (Continued)

## Chapter 3: Preliminaries

### 3.1 What is a stochastic process?

#### 3.1.1 A course overview
A stochastic process is a process which evolves randomly over time (or space,
or both):
• ‘stochastic’: equivalent to ‘random’.
• ‘process’: occurring over time (or space, or both).
We will consider some simple stochastic processes which are used to model real-life situations. Our main focus will be on developing mathematical tools to
study their properties and learn about long-term behaviour of such processes.
Though this is a mathematical course, the applications of stochastic processes
are wide-ranging, e.g. in physics, engineering, the environment, social science,
epidemics, genetics and finance. We’ll see examples from some of these areas
during the course.
#### 3.1.2 Definitions and basic properties
A stochastic process is a collection of random variables {𝑋𝑡 , 𝑡 ∈ 𝑇 } taking
values in the state space 𝑆. The parameter, or index, set 𝑇 often represents
time. We will look at both discrete-time processes and continuous-time
processes. We’ll generally use the following notation, though you need to be
flexible in adapting to different notation from time to time.
• Discrete-time processes: {𝑋𝑡 , 𝑡 ∈ 𝑇 }, where 𝑇 = {0, 1, 2, …}. This is often
written in terms of the natural numbers (including zero) rather than 𝑇 ,
i.e. {𝑋𝑛 , 𝑛 ∈ N}, where N = {0, 1, 2, …}
• Continuous-time processes: {𝑋(𝑡), 𝑡 ∈ 𝑇 }, where 𝑇 = R+ (i.e. 𝑡 ≥ 0).
21
22
CHAPTER 3. PRELIMINARIES
We’ll often write this as {𝑋𝑡 , 𝑡 ∈ 𝑇 }, which is identical to the discrete time
notation above, but note that 𝑡 here can take any value in the positive
reals whereas for discrete time processes the set of values 𝑡 can take is
countable.
• Notice that the state space can be discrete or continuous. In this course
we will only consider discrete state spaces.
#### 3.1.3 Examples of stochastic processes
Below are examples of stochastic processes, categorised according to whether
they are continuous or discrete time processes, and whether the state space is
discrete or continuous. We only consider discrete state space processes in this
course, in both discrete and continuous time.
**Discrete time, discrete state space**
Here we are tossing a coin and counting the number of excess heads (i.e. how
many more heads than tails we see). The ‘time’ on the x-axis represents the coin
tosses. For example, we see from this particular realisation that the first toss
must have landed tails (because the number of excess heads is now -1), followed
by a head (we’re now back up to 0 excess heads). The state-space is countable
(integers, so 𝑆 = Z) and the time element here is the coin tosses themselves so
we’re in discrete ‘time’ with 𝑇 = N ∪ {0}. Notice here that one coin toss is
equivalent to one unit of time, so we need to be flexible in how we view ‘time’.
Also notice that the sequence 𝑋1 , 𝑋2 , … isn’t independent: if we know 𝑋𝑘 say
then 𝑋𝑘+1 will either be 𝑋𝑘 + 1 (if we see a head at the next toss) or 𝑋𝑘 − 1
otherwise.
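For concreteness, the excess-heads process is easy to simulate. The sketch below is illustrative Python (not part of the original notes), taking 𝑋₀ = 0 before any toss and treating each toss as one unit of ‘time’:

```python
import random

def excess_heads(n_tosses, seed=1):
    """Simulate X_n = (# heads) - (# tails) after n fair coin tosses."""
    rng = random.Random(seed)
    x, path = 0, [0]  # X_0 = 0: no tosses yet, so no excess heads
    for _ in range(n_tosses):
        x += 1 if rng.random() < 0.5 else -1  # head: +1 excess; tail: -1
        path.append(x)
    return path

path = excess_heads(20)
print(path)  # one realisation; X_{k+1} is always X_k + 1 or X_k - 1
```

Note that the dependence described above is visible in the code: each 𝑋ₖ₊₁ is built from 𝑋ₖ by adding ±1.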
**Discrete time, continuous state space**
Here we have a gambler who starts off with £10. We’re tracking how much
money they have after each game. The ‘time’ on the x-axis represents the games
that the gambler bets on, so the index set is 𝑇 = N ∪ {0}. We consider the
state-space to be continuous here, though technically you could consider it to be
discrete. The reason why it’s easier to consider this as continuous state-space
is that there are so many possible ‘values’ that the process could take (think:
each value corresponds to a particular value in pounds and pence). Notice again
here that one game is equivalent to one unit of time, so we need to be flexible in
how we view ‘time’. Also notice that the sequence 𝑋1 , 𝑋2 , … isn’t independent:
if we know 𝑋𝑘 then though we can’t provide a short list of possible values for
𝑋𝑘+1 as we could for the previous example, we could take an educated guess
depending on the game being played.
**Continuous time, discrete state space**
Here we are counting occurrences of a specified event. At time zero we’ve seen
no events, but the first event happens roughly at time 2.5 (we can see this as
the number of occurrences steps up to 1), and the second event happens shortly
after. By time 5 we have seen 2 events. Notice here that the events can happen
at any instant in time so we think of time as continuous. The state-space is
discrete as we’re just counting events (𝑆 = N ∪ {0}).

Figure 3.1: Tracking the number of excess heads in 20 coin tosses

Also notice that if we look
at the value of the process at time say 8.3 and 8.8, 𝑋(8.3) and 𝑋(8.8), then
these aren’t independent: if we know 𝑋(8.3) then 𝑋(8.8) must be at least as big
as 𝑋(8.3) as we can’t ‘unsee’ events. Under certain distributional assumptions
which we’ll discuss later in the course, these processes are known as ‘Poisson
processes’.
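Under the exponential inter-arrival assumption discussed later in the course, one way to simulate a realisation like this is to accumulate independent exponential gaps between events. An illustrative Python sketch (the rate and horizon are arbitrary choices, not from the notes):

```python
import random

def poisson_process_times(rate, t_max, seed=7):
    """Simulate event times of a Poisson process of given rate on [0, t_max]
    by summing i.i.d. exponential inter-arrival times."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential gap until the next event
        if t > t_max:
            return times
        times.append(t)

times = poisson_process_times(rate=0.5, t_max=10.0)
# X(t) = number of events in [0, t]; the count can only step upwards,
# which is exactly the 'can't unsee events' property described above
counts = [sum(1 for s in times if s <= t) for t in (2.5, 5.0, 10.0)]
print(times, counts)
```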
**Continuous time, continuous state space**
Finally an example of a stochastic process that we won’t be studying further:
this is a Brownian motion which has continuous state-space and continuous
time. It has continuous paths, but an extension to Brownian motion is the Lévy
process which allows jumps in values to occur. These types of processes are
used in Finance but are beyond the scope of this course.
NOTE: Continuous-time processes can change their value/state (‘jump’) at any
instant of time; discrete-time processes can only do this at a discrete set of time
points. For discrete-time processes, when we use the word ‘time’ we mean
‘number of transitions/steps’.
Figure 3.2: Tracking a gambler’s wealth over 30 games

Figure 3.3: A realisation of a Poisson process

Figure 3.4: A realisation of a Brownian motion
### 3.2 Random variables
Informally speaking, a random variable is a variable (i.e. can take different
values), whose numerical values are randomly chosen from a set of possible
values, governed by a probability distribution. For example, the time that a
passenger must wait until the next southbound tube train at Goodge Street
station is a variable. Now suppose you arrive at Goodge Street station – the
amount of time you must wait until the next southbound tube arrives is a
realisation of the random variable ‘waiting time’.
It is important that you can spot a random variable, and can also define appropriate random variables for a given scenario.
#### 3.2.1 Definitions
Ideally, we should think about random variables as functions. In order to do so,
we must first build up some terminology. We’ll use a coin tossing example to
illustrate: suppose you toss a coin twice and record whether it lands heads or
tails at each toss. Note that the following is a simplification of ideas from Probability Theory and we won’t be using the technicalities in this course. However,
it does no harm to see these ideas.
• An experiment is any situation where there is a set of possible outcomes.
– Our coin-tossing example is an experiment.
– The possible outcomes are HH, HT, TH or TT.
• The set of possible outcomes is called the sample space, Ω.
– In our case, Ω = {𝐻𝐻, 𝐻𝑇 , 𝑇 𝐻, 𝑇 𝑇 }
• An outcome is denoted 𝜔 and is an element of the sample space.
– Suppose we conduct the experiment, and the outcome is TH. This is
our 𝜔.
• An event, 𝐴, is a subset of the sample space.
– e.g. define an event as ‘at least one head’.
– This is satisfied by HH, HT and TH.
– This is a subset of Ω.
| Probability model | Real world |
|---|---|
| Sample space, Ω | ‘Set of all possible outcomes of the experiment.’ |
| Sample point, 𝜔 | ‘Outcome of the experiment.’ |
| Event 𝐴 ⊆ Ω | ‘Property that the outcome of the experiment may possess.’ |
With a finite sample space Ω, we can assign probabilities to individual sample
points 𝜔 via a ‘weight function’, 𝑝 ∶ Ω → [0, 1]. This allocates a value 𝑝(𝜔) for
each possible outcome 𝜔, which we interpret as the probability that 𝜔 occurs.
We need

$$\sum_{\omega \in \Omega} p(\omega) = 1,$$

then we may calculate the probability of an event 𝐴 ⊆ Ω via

$$\mathbb{P}(A) = \sum_{\omega \in A} p(\omega).$$
Caution: this doesn’t always work with infinite sample spaces, but we won’t
consider this further.
Notice that:

1. P(𝐴) ≥ 0 for all events 𝐴;
2. if 𝐴₁, 𝐴₂, 𝐴₃, … are events with 𝐴ᵢ ∩ 𝐴ⱼ = ∅ for all 𝑖 ≠ 𝑗, then $\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)$, i.e. P is countably additive;
3. P(Ω) = 1.
As the outcome of an experiment corresponds to a sample point 𝜔 ∈ Ω, we
can make some numerical measurements whose values depend on the outcome
𝜔. This gives rise to a random variable, let’s call it 𝑋, which is a function
𝑋 ∶ Ω → R. Its value at a sample point, 𝑋(𝜔), represents the numerical value
of the measurement when the outcome of the experiment is 𝜔.
In our set-up, there are different random variables you could define. Let’s define
𝑋 to be the number of heads in the two tosses of a coin. In this case 𝑋 maps
Ω to 0 (if we see TT), 1 (if we see TH or HT) or 2 (if we see HH).
𝑋(𝑇 𝑇 ) = 0
𝑋(𝑇 𝐻) = 𝑋(𝐻𝑇 ) = 1
𝑋(𝐻𝐻) = 2
All outcomes are equally likely here, and so 𝑝(𝜔) = 1/4 for all outcomes 𝜔. If
we define our event 𝐴 =“two heads”, then since the only sample point in 𝐴 is
HH,
$$\sum_{\omega \in A} p(\omega) = 1/4,$$
and we say that P(𝐴) = 1/4.
Similarly, we can also define another event 𝐵 =“at least one tail”, and this is
given by P(𝑋 ≤ 1), which we can calculate as:
P(𝐵) = P(𝑋 ≤ 1) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 1})
(3.1)
Which 𝜔 satisfy (3.1) in this case? Now you should be able to calculate the
probability required.
Note: we have defined the distribution function of 𝑋 above. In general, the
distribution function of 𝑋 of a random variable is given, for 𝑥 ∈ R, by:
𝐹𝑋 (𝑥) = P(𝑋 ≤ 𝑥) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 𝑥}).
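The two-toss set-up can be made completely concrete in code. The illustrative Python sketch below (not part of the original notes) builds Ω, the weight function 𝑝, the random variable 𝑋 as a literal function on Ω, and evaluates the two events discussed above:

```python
from itertools import product

omega = ["".join(w) for w in product("HT", repeat=2)]  # sample space Ω
p = {w: 1 / 4 for w in omega}   # weight function: fair coin, all outcomes equal

def X(w):
    """Random variable: number of heads in the outcome w (a function on Ω)."""
    return w.count("H")

def prob(event):
    """P(A) = sum of p(w) over sample points w in A."""
    return sum(p[w] for w in event)

A = {w for w in omega if X(w) == 2}    # 'two heads'
B = {w for w in omega if X(w) <= 1}    # 'at least one tail', i.e. {X <= 1}
print(prob(A), prob(B))  # 0.25 0.75
```

The set 𝐵 here is exactly {𝜔 ∈ Ω : 𝑋(𝜔) ≤ 1}, so `prob(B)` evaluates the distribution function 𝐹_𝑋(1).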
#### 3.2.2 Discrete random variables
𝑋 is a discrete random variable if it takes only finitely many or countably
infinitely many values.
The probability mass function of a discrete random variable is

$$p_X(x) = \mathbb{P}(X = x) = \mathbb{P}(\{\omega : X(\omega) = x\}).$$

The expectation of 𝑋 is

$$\mathbb{E}(X) = \sum_x x\, \mathbb{P}(X = x).$$

For a function 𝑔 we have

$$\mathbb{E}(g(X)) = \sum_x g(x)\, \mathbb{P}(X = x).$$

The variance of 𝑋 is

$$\operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}(X))^2] = \mathbb{E}(X^2) - [\mathbb{E}(X)]^2.$$
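These formulae can be checked directly for the coin-tossing 𝑋 (number of heads in two fair tosses) from its pmf; a quick illustrative sketch, not part of the notes:

```python
# pmf of X = number of heads in two fair coin tosses
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

mean = sum(x * px for x, px in pmf.items())        # E(X) = sum x P(X = x)
second = sum(x**2 * px for x, px in pmf.items())   # E(X^2), i.e. g(x) = x^2
var = second - mean**2                             # Var(X) = E(X^2) - E(X)^2

print(mean, var)  # 1.0 0.5
```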
#### 3.2.3 Continuous random variables

𝑋 is a continuous random variable if it has a probability density function, i.e., if there exists a non-negative function $f_X$ with $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, such that for all 𝑥,

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.$$

Then

$$f_X(x) = \frac{dF_X(x)}{dx}.$$

The expectation of 𝑋 is

$$\mathbb{E}(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$$

and the expectation of 𝑔(𝑋), for some function 𝑔, is

$$\mathbb{E}(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx,$$

provided that these integrals exist. As before, the variance of 𝑋 is

$$\operatorname{Var}(X) = \mathbb{E}[(X - \mathbb{E}(X))^2] = \mathbb{E}(X^2) - [\mathbb{E}(X)]^2.$$
#### 3.2.4 Important distributions
There are several probability distributions which we will use repeatedly during
the course. Make sure you are very familiar with the following.
Discrete probability distributions:

| Distribution | Support | PMF | Expectation | Variance |
|---|---|---|---|---|
| Binomial, $X \sim \text{Bin}(n, p)$ | $0, 1, \dots, n$ | $\mathbb{P}(X = k) = \frac{n!}{k!(n-k)!} p^k q^{n-k}$ | $np$ | $npq$ |
| Poisson, $X \sim \text{Po}(\lambda)$, $\lambda > 0$ | $0, 1, 2, 3, \dots$ | $\mathbb{P}(X = k) = \exp(-\lambda)\frac{\lambda^k}{k!}$ | $\lambda$ | $\lambda$ |
| Geometric, $X \sim \text{Geom}(p)$ | $1, 2, 3, 4, \dots$ | $\mathbb{P}(X = k) = (1-p)^{k-1} p$ | $1/p$ | $q/p^2$ |

Continuous probability distributions:

| Distribution | Support | PDF | Expectation | Variance |
|---|---|---|---|---|
| Exponential, $X \sim \text{Exp}(\lambda)$, $\lambda > 0$ | $\mathbb{R}^+$ | $\lambda e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ |
| Uniform, $X \sim \text{Unif}(a, b)$ | $[a, b]$ | $\frac{1}{b-a}$ for $a < x < b$, $0$ otherwise | $\frac{1}{2}(a + b)$ | $\frac{1}{12}(b - a)^2$ |
| Gamma, $X \sim \text{Gamma}(\alpha, \beta)$, $\alpha, \beta > 0$ | $\mathbb{R}^+$ | $\frac{\beta^\alpha x^{\alpha - 1} \exp(-\beta x)}{\Gamma(\alpha)}$ | $\alpha/\beta$ | $\alpha/\beta^2$ |
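The tabulated moments can be sanity-checked numerically from the PMFs. For example, for a Binomial(10, 0.3) variable (an illustrative sketch with arbitrary parameter choices, not from the notes):

```python
from math import comb

n, p = 10, 0.3
q = 1 - p
# Binomial PMF: P(X = k) = C(n, k) p^k q^(n-k), for k = 0, ..., n
pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))               # should equal np
var = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2   # should equal npq

print(mean, var)
```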
### 3.3 Conditioning on events
Calculating probabilities that are conditional on a particular event or occurrence is a natural requirement. Make sure you understand the rationale behind
conditional probability, don’t just learn the formulae!
Easy exercise
You have a deck of playing cards (52 cards in total, split equally into four suits:
hearts, diamonds, spades and clubs, with the first two considered ‘red’ cards,
and the others considered ‘black’ cards).
• I pick a card at random. What is the probability that the chosen card is
from the diamond suit?
• I pick a card at random and tell you that it is a red card (i.e. either hearts
or diamonds). What is the probability that the chosen card is from the
diamond suit?
The second question requires a conditional probability (it is conditional on knowing that the card is red), but you most likely calculated the probability intuitively without using any formulae. How did you do that? You applied Bayes’s
theorem intuitively, without even realising it.
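Both answers can be reproduced by counting sample points directly, which is exactly the ‘restrict to the new sample space’ idea. A minimal illustrative sketch (not part of the notes):

```python
suits = {"hearts": "red", "diamonds": "red", "spades": "black", "clubs": "black"}
deck = [(suit, rank) for suit in suits for rank in range(1, 14)]  # 52 cards

diamonds = [c for c in deck if c[0] == "diamonds"]
reds = [c for c in deck if suits[c[0]] == "red"]        # the 'new' sample space
red_diamonds = [c for c in reds if c[0] == "diamonds"]

p_diamond = len(diamonds) / len(deck)                # unconditional: 13/52
p_diamond_given_red = len(red_diamonds) / len(reds)  # conditional: 13/26
print(p_diamond, p_diamond_given_red)  # 0.25 0.5
```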
#### 3.3.1 Intuition behind conditional probability
In fact, Bayes’s theorem for calculating conditional probabilities is perfectly
intuitive. Suppose we are calculating P(𝐴|𝐶) for some events 𝐴 and 𝐶:
• This is asking for the probability of 𝐴 occurring, given that 𝐶 occurs.
• Assuming 𝐶 occurs, the part of the sample space Ω which relates to 𝐶 becomes the ‘new’ sample space, Ω′ (which is just the event 𝐶 for our purposes).
• What we are asking here, in effect, is to calculate the probability of 𝐴 in
the ‘new’ sample space Ω′ (i.e. P(𝐴 ∩ 𝐶)), scaled to Ω′ (i.e. divided by
P(𝐶)).
• The final equation is therefore P(𝐴|𝐶) = P(𝐴 ∩ 𝐶)/P(𝐶).
But why scale by P(𝐶)? Simply because the probability of all disjoint events
in any sample space must sum to 1. Our sample space is now Ω′ = 𝐶, so the
32
CHAPTER 3. PRELIMINARIES
new (probability) weight function we apply to all events in Ω′ must sum to 1.
We scale all probabilities by P(𝐶) to ensure that this is the case.
The following graphic may help you to visualise why Bayes’s theorem ‘works’.
Figure 3.5: A schematic of conditional probability
#### 3.3.2 Formal version of conditional probability
Let 𝐴 and 𝐶 be events with P(𝐶) > 0. The conditional probability of 𝐴 given 𝐶 is

$$\mathbb{P}(A \mid C) = \frac{\mathbb{P}(A \cap C)}{\mathbb{P}(C)}.$$
It is easy to verify that

1. P(𝐴 | 𝐶) ≥ 0;
2. if 𝐴₁, 𝐴₂, … are mutually exclusive events, then
   $$\mathbb{P}(A_1 \cup A_2 \cup \dots \mid C) = \sum_i \mathbb{P}(A_i \mid C)$$
   (try drawing diagrams like the above to visualise this);
3. P(𝐶 | 𝐶) = 1.

Conditional probability for random variables instead of events works in the same way.
| Discrete RVs 𝑋 and 𝑌 | Continuous RVs 𝑋 and 𝑌 |
|---|---|
| Joint probability mass function: $p_{X,Y}(x, y) = \mathbb{P}(X = x, Y = y)$ | Joint density function: $f_{X,Y}(x, y)$ |
| 𝑋 and 𝑌 independent if and only if for all 𝑥, 𝑦: $p_{X,Y}(x, y) = p_X(x)\, p_Y(y) = \mathbb{P}(X = x)\, \mathbb{P}(Y = y)$ | 𝑋 and 𝑌 independent if and only if for all 𝑥, 𝑦: $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$ |
| Marginal probability mass function (of 𝑋): $p_X(x) = \mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y)$ | Marginal density (of 𝑌): $f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$ |
| Conditional PMF of 𝑋 given 𝑌 = 𝑦: $p_{X \mid Y=y}(x) = \mathbb{P}(X = x \mid Y = y) = \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)}$ | Conditional density of 𝑋 given 𝑌 = 𝑦 ($f_Y(y) > 0$): $f_{X \mid Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$ |
| Conditional distribution function of 𝑋 given 𝑌 = 𝑦: $F_{X \mid Y=y}(x) = \mathbb{P}(X \le x \mid Y = y)$ | Conditional distribution function of 𝑋 given 𝑌 = 𝑦: $F_{X \mid Y=y}(x) = \mathbb{P}(X \le x \mid Y = y)$ |
#### 3.3.3 Conditional expectation
Just like calculating an expectation of, say, 𝑋, requires knowing the distribution
of 𝑋, the conditional expectation of 𝑋, given that we know 𝑌 = 𝑦 requires
knowing the distribution of (𝑋|𝑌 = 𝑦).
The conditional expectation of 𝑋 given 𝑌 = 𝑦 is

$$\mathbb{E}(X \mid Y = y) = \begin{cases} \sum_x x\, \mathbb{P}(X = x \mid Y = y) & \text{for discrete RVs} \\ \int_{-\infty}^{\infty} x f_{X \mid Y=y}(x)\,dx & \text{for continuous RVs} \end{cases}$$

The conditional expectation of 𝑔(𝑋) given 𝑌 = 𝑦, for some function 𝑔, is

$$\mathbb{E}(g(X) \mid Y = y) = \begin{cases} \sum_x g(x)\, \mathbb{P}(X = x \mid Y = y) & \text{for discrete RVs} \\ \int_{-\infty}^{\infty} g(x) f_{X \mid Y=y}(x)\,dx & \text{for continuous RVs} \end{cases}$$
Example 3.1. Suppose that Ω = {𝜔1 , 𝜔2 , 𝜔3 } and P(𝜔𝑖 ) = 1/3 for 𝑖 = 1, 2, 3.
Suppose also that 𝑋 and 𝑌 are random variables with
𝑋(𝜔1 ) = 2, 𝑋(𝜔2 ) = 3, 𝑋(𝜔3 ) = 1
𝑌 (𝜔1 ) = 2, 𝑌 (𝜔2 ) = 2, 𝑌 (𝜔3 ) = 1
Find the conditional pmf 𝑝𝑋∣𝑌 =2 (𝑥) and the conditional expectation E(𝑋 | 𝑌 =
2).
Solution

Possible values of 𝑋 are 1, 2, 3.

$$p_{X \mid Y=2}(1) = \mathbb{P}(X = 1 \mid Y = 2) = 0.$$

$$p_{X \mid Y=2}(2) = \mathbb{P}(X = 2 \mid Y = 2) = \frac{\mathbb{P}(X = 2, Y = 2)}{\mathbb{P}(Y = 2)} = \frac{\mathbb{P}(\omega_1)}{\mathbb{P}(\omega_1) + \mathbb{P}(\omega_2)} = \frac{1}{2}.$$

$$p_{X \mid Y=2}(3) = \mathbb{P}(X = 3 \mid Y = 2) = \frac{\mathbb{P}(X = 3, Y = 2)}{\mathbb{P}(Y = 2)} = \frac{\mathbb{P}(\omega_2)}{\mathbb{P}(\omega_1) + \mathbb{P}(\omega_2)} = \frac{1}{2}.$$

Thus the conditional pmf of 𝑋 given 𝑌 = 2 equals $\frac{1}{2}$ for 𝑥 = 2 and for 𝑥 = 3, and is 0 otherwise. The conditional expectation is

$$\mathbb{E}(X \mid Y = 2) = \sum_x x\, \mathbb{P}(X = x \mid Y = 2) = (1 \times 0) + (2 \times 1/2) + (3 \times 1/2) = 5/2.$$
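Example 3.1 can be replayed mechanically by summing weights over sample points; an illustrative Python sketch (not part of the notes):

```python
P = {"w1": 1 / 3, "w2": 1 / 3, "w3": 1 / 3}   # P(omega_i) = 1/3
X = {"w1": 2, "w2": 3, "w3": 1}               # X as a function on Omega
Y = {"w1": 2, "w2": 2, "w3": 1}               # Y as a function on Omega

p_y2 = sum(P[w] for w in P if Y[w] == 2)      # P(Y = 2) = 2/3

def cond_pmf(x):
    """p_{X|Y=2}(x) = P(X = x, Y = 2) / P(Y = 2)."""
    return sum(P[w] for w in P if X[w] == x and Y[w] == 2) / p_y2

cond_exp = sum(x * cond_pmf(x) for x in (1, 2, 3))   # E(X | Y = 2)
print(cond_pmf(2), cond_pmf(3), cond_exp)  # 0.5 0.5 2.5
```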
So far, we have only calculated conditional expectations when conditioning on
specific values, e.g. E[𝑋|𝑌 = 2]. We can extend this idea to conditioning on
random variables without equating them to specific values.
The notation E(𝑋 | 𝑌 ) is used to denote a random variable that takes the
value E(𝑋 | 𝑌 = 𝑦) with probability P(𝑌 = 𝑦).
Note that E(𝑋 | 𝑌 ) is a function of the random variable 𝑌 , and is itself a
random variable.
Example 3.2. In Example 3.1 we found that

$$\mathbb{E}[X \mid Y = 2] = \frac{5}{2}.$$

It can similarly be shown that

$$\mathbb{E}[X \mid Y = 1] = 1.$$

You can also check that

$$\mathbb{P}(Y = 1) = 1/3 \quad \text{and} \quad \mathbb{P}(Y = 2) = 2/3.$$

Thus, the random variable $\mathbb{E}[X \mid Y]$ has two possible values: $1$ and $\frac{5}{2}$, with probabilities $\frac{1}{3}$ and $\frac{2}{3}$, respectively. That is,

$$\mathbb{E}[X \mid Y] = \begin{cases} 1 & \text{with probability } 1/3 \\ 5/2 & \text{with probability } 2/3 \end{cases}$$

#### 3.3.4 Useful conditioning formulae
Three important formulae:
Law of total probability. Let 𝐴 be an event and 𝑌 be ANY random variable. Then

$$\mathbb{P}(A) = \begin{cases} \sum_y \mathbb{P}(A \mid Y = y)\, \mathbb{P}(Y = y) & \text{if } Y \text{ discrete} \\ \int_{-\infty}^{\infty} \mathbb{P}(A \mid Y = y)\, f_Y(y)\,dy & \text{if } Y \text{ continuous} \end{cases}$$
Modification of the law of total probability: if 𝑍 is a third discrete random variable,

$$\begin{aligned}
\mathbb{P}(X = x \mid Z = z) &= \sum_y \mathbb{P}(X = x, Y = y \mid Z = z) \\
&= \sum_y \frac{\mathbb{P}(X = x, Y = y, Z = z)}{\mathbb{P}(Z = z)} \\
&= \sum_y \frac{\mathbb{P}(X = x, Y = y, Z = z)}{\mathbb{P}(Y = y, Z = z)} \cdot \frac{\mathbb{P}(Y = y, Z = z)}{\mathbb{P}(Z = z)} \\
&= \sum_y \mathbb{P}(X = x \mid Y = y, Z = z)\, \mathbb{P}(Y = y \mid Z = z),
\end{aligned}$$

assuming the conditional probabilities are all defined.
Law of conditional (iterated) expectations:

$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}(X \mid Y)] = \begin{cases} \sum_y \mathbb{E}(X \mid Y = y)\, \mathbb{P}(Y = y) & \text{if } Y \text{ discrete} \\ \int_{-\infty}^{\infty} \mathbb{E}(X \mid Y = y)\, f_Y(y)\,dy & \text{if } Y \text{ continuous} \end{cases}$$

and, more generally,

$$\mathbb{E}[g(X)] = \mathbb{E}(\mathbb{E}[g(X) \mid Y]).$$
Note in particular:
• The law of total probability applies to ANY event 𝐴 and ANY random
variable 𝑌 .
• The law of the conditional (iterated) expectation, E[𝑋] = E[E[𝑋|𝑌 ]] applies to ANY random variables 𝑋 and 𝑌 , and we will be using this identity
heavily during the course.
Example 3.3. In Example 3.2 we found the distribution of the random variable E(𝑋 | 𝑌 ) was

E[𝑋|𝑌 ] = { 1    w.p. 1/3,
          { 5/2  w.p. 2/3.

The expectation of this random variable is

E[E(𝑋|𝑌 )] = 1 × 1/3 + 5/2 × 2/3 = 2.

Also, the expectation of 𝑋 is given by

E(𝑋) = 1 × 1/3 + 2 × 1/3 + 3 × 1/3 = 2,

so we see that the law of the conditional (iterated) expectation, i.e. that E[𝑋] = E[E[𝑋|𝑌 ]], holds.
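None of the following code appears in the notes, but the calculation can be verified mechanically. A short Python sketch, using a joint pmf that is consistent with the values quoted above (it may not be the exact table from Example 3.1, which is not reproduced here):

```python
from fractions import Fraction as F

# A joint pmf p(x, y) consistent with the quoted values (a hypothetical
# table, chosen so that the marginals and conditional means match):
joint = {(1, 1): F(1, 3), (2, 2): F(1, 3), (3, 2): F(1, 3)}

# Marginal distribution of Y
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p

# E[X | Y = y] for each value y of Y
e_x_given_y = {
    y: sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]
    for y in p_y
}

# E[E[X | Y]]: weight each conditional mean by P(Y = y)
lhs = sum(e_x_given_y[y] * p_y[y] for y in p_y)
# E[X] computed directly from the joint pmf
rhs = sum(x * p for (x, _), p in joint.items())

print(lhs, rhs)  # 2 2 -- the iterated expectation law holds exactly
```

The exact rational arithmetic reproduces E[𝑋 | 𝑌 = 1] = 1, E[𝑋 | 𝑌 = 2] = 5/2 and E[𝑋] = E[E(𝑋 | 𝑌 )] = 2.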
Example 3.4. Let 𝑋 and 𝑌 have joint density

𝑓𝑋,𝑌(𝑥, 𝑦) = (1/𝑦) 𝑒^(−𝑥/𝑦) 𝑒^(−𝑦), 0 < 𝑥, 𝑦 < ∞.

Find 𝑓𝑌(𝑦), E(𝑋 | 𝑌 ) and hence E(𝑋).

Solution. For any 𝑦 > 0,

𝑓𝑌(𝑦) = ∫0^∞ (1/𝑦) 𝑒^(−𝑥/𝑦) 𝑒^(−𝑦) 𝑑𝑥 = 𝑒^(−𝑦) [−𝑒^(−𝑥/𝑦)]0^∞ = 𝑒^(−𝑦),

so 𝑌 has an exponential distribution with parameter 1.

Also, for any fixed 𝑦 > 0 and any 𝑥 > 0,

𝑓𝑋∣𝑌(𝑥 | 𝑦) = 𝑓𝑋,𝑌(𝑥, 𝑦) / 𝑓𝑌(𝑦) = (1/𝑦) 𝑒^(−𝑥/𝑦) 𝑒^(−𝑦) / 𝑒^(−𝑦) = (1/𝑦) 𝑒^(−𝑥/𝑦),

so 𝑋 | 𝑌 = 𝑦 ∼ Exp(1/𝑦).

It follows that E(𝑋 | 𝑌 = 𝑦) = 𝑦 and so E(𝑋 | 𝑌 ) = 𝑌. Using the Law of Conditional Expectations we then have

E[𝑋] = E[E(𝑋 | 𝑌 )] = E(𝑌 ) = 1.
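A Monte Carlo sanity check of this answer (a sketch, not part of the notes): sample 𝑌 ∼ Exp(1), then sample 𝑋 | 𝑌 = 𝑦 from an exponential distribution with mean 𝑦, and average the 𝑋 values.

```python
import random

random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    y = random.expovariate(1.0)        # Y ~ Exp(1)
    x = random.expovariate(1.0 / y)    # X | Y = y has rate 1/y, i.e. mean y
    total += x

print(total / n)  # should be close to E[X] = 1
```

The sample mean lands close to 1, in agreement with the iterated expectation argument.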
3.4 Generating functions
Generating functions are USEFUL. Don’t think of them as baffling mathematical constructs; instead think of them as marvelous time-saving devices.
Imagine you have a sequence of real numbers, 𝑧1 , 𝑧2 , 𝑧3 , …. It can be hard to
keep track of such numbers and all the information they contain. A generating
function provides a way of wrapping up the entire sequence into one expression.
When we want to extract information about the sequence, we simply interrogate
this one expression in a particular way in order to get what we want.
There are many different types of generating function. These are different in
the sense that we wrap up the sequence in different ways, and so extracting
particular pieces of information from them requires different techniques. The
two types of generating function that you should be familiar with are:
1. The probability generating function (PGF);
2. The moment generating function (MGF)
3.4.1 Probability generating functions

The probability generating function (p.g.f.) of 𝑋 is defined to be

𝐺𝑋(𝑠) = E(𝑠^𝑋) = ∑𝑘=0^∞ 𝑠^𝑘 P(𝑋 = 𝑘) for |𝑠| ≤ 1.
Probability generating function fundamentals:
• The random variable 𝑋: takes values in N0 = {0, 1, 2, …}.
• Sequence to ‘wrap up’: P(𝑋 = 0), P(𝑋 = 1), P(𝑋 = 2), P(𝑋 = 3), …
• Method: ∑𝑘=0^∞ 𝑠^𝑘 P(𝑋 = 𝑘) (for any 𝑠 ∈ [−1, 1]).
• Information stored: the probabilities P(𝑋 = 0), P(𝑋 = 1), P(𝑋 = 2), … and some moments (expectations).
The information stored in probability generating functions can be accessed as
follows.
Calculating moments
E(𝑋) = d𝐺𝑋(𝑠)/d𝑠 evaluated at 𝑠 = 1, i.e. 𝐺′𝑋(1),

E[𝑋(𝑋 − 1)] = d²𝐺𝑋(𝑠)/d𝑠² evaluated at 𝑠 = 1, i.e. 𝐺″𝑋(1), and so on.
Calculating probabilities

Suppose that 𝑋 takes values in N0 = {0, 1, 2, …}. Then the PGF evaluated at 0 is:

𝐺𝑋(0) = ∑𝑘=0^∞ 0^𝑘 P(𝑋 = 𝑘) = P(𝑋 = 0).

Furthermore, if we expand 𝐺𝑋(𝑠) in powers of 𝑠, the coefficient of 𝑠^𝑘 is equal to P(𝑋 = 𝑘), so we can also find P(𝑋 = 𝑘) for all 𝑘.
You can also work ‘backwards’: given a PGF for a random variable 𝑋, what is
the distribution of 𝑋?
Example 3.5. Suppose you are told that 𝐺𝑋(𝑠) = 𝑝𝑠/[1 − (1 − 𝑝)𝑠]. We can convert this expression into the summation format of a PGF and spot its distribution:

𝐺𝑋(𝑠) = 𝑝𝑠 / (1 − (1 − 𝑝)𝑠)
      = 𝑝𝑠 ∑𝑘=0^∞ (1 − 𝑝)^𝑘 𝑠^𝑘
      = ∑𝑗=1^∞ 𝑠^𝑗 (1 − 𝑝)^(𝑗−1) 𝑝, where 𝑗 = 𝑘 + 1.

Therefore, P(𝑋 = 𝑗) = (1 − 𝑝)^(𝑗−1) 𝑝, so 𝑋 ∼ Geometric(𝑝).
In fact, different distributions have different PGF formats but random variables
with the same distribution have a PGF of the same format. You can therefore
spot the distribution of a random variable instantly from the format of its PGF.
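As a sketch (not part of the notes), the geometric example can be checked mechanically with exact rational arithmetic; the value 𝑝 = 1/3 is an arbitrary illustration:

```python
from fractions import Fraction as F

p = F(1, 3)
q = 1 - p   # q = 1 - p

# Power-series coefficients of G_X(s) = p*s / (1 - q*s).
# From G(s)(1 - q*s) = p*s: c_1 = p, and c_j = q * c_(j-1) for j >= 2.
coeff = [F(0)] * 11
for j in range(1, 11):
    coeff[j] = (p if j == 1 else F(0)) + q * coeff[j - 1]

# The coefficient of s^j should be the Geometric(p) pmf (1-p)^(j-1) * p
for j in range(1, 11):
    assert coeff[j] == q**(j - 1) * p

# G'(1) = E[X]: differentiating gives G'(s) = p / (1 - q*s)^2, so
g_prime_1 = p / (1 - q)**2
print(g_prime_1)  # 3, i.e. 1/p as expected for Geometric(p)
```

The recurrence recovers exactly the Geometric(𝑝) probabilities, and the derivative at 𝑠 = 1 recovers the mean 1/𝑝.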
3.4.2 Moment generating functions

The moment generating function (m.g.f.) of 𝑋 is defined to be

𝑀𝑋(𝑡) = E[exp(𝑡𝑋)] = ∫ℝ exp(𝑡𝑥) 𝑓𝑋(𝑥) 𝑑𝑥

for any 𝑡 ∈ ℝ such that the integral converges.
Moment generating function fundamentals:
• Random variable: must be real valued.
• Quantity to ‘wrap up’: the density 𝑓𝑋(𝑥).
• Method: ∫ℝ exp(𝑡𝑥) 𝑓𝑋(𝑥) 𝑑𝑥 (for any 𝑡 ∈ ℝ such that the integral converges).
• Information stored: moments.
We can access all moments of 𝑋 from 𝑀𝑋(𝑡):

E[𝑋^𝑛] = 𝑀𝑋^(𝑛)(0) = d^𝑛𝑀𝑋/d𝑡^𝑛 evaluated at 𝑡 = 0.
Just as with PGFs, you can spot the distribution of a random variable instantly from the format of its MGF.
Notice that a discrete or continuous random variable can have a MGF, whereas
a PGF is for discrete random variables only (i.e. those with countable state
space).
3.4.3 What else can we do with generating functions?

We’ll concentrate here on (useful) things you can do with probability generating functions, though similar conclusions can be reached using moment generating functions too.
Calculating the distribution of sums of random variables.
Suppose that 𝑋1 , … , 𝑋𝑛 are i.i.d. random variables with common PGF 𝐺𝑋 (𝑠).
The PGF 𝐺𝑌 (𝑠) of 𝑌 = 𝑋1 + ⋯ + 𝑋𝑛 is given by
𝐺𝑌(𝑠) = E(𝑠^(𝑋1+⋯+𝑋𝑛)) = ∏𝑖=1^𝑛 E(𝑠^𝑋𝑖) = [𝐺𝑋(𝑠)]^𝑛,

and by looking at [𝐺𝑋(𝑠)]^𝑛, we can spot the distribution of 𝑌, which may otherwise be difficult. Also, if 𝑍1 ⟂⟂ 𝑍2 (that is, 𝑍1 and 𝑍2 are independent but do not necessarily have the same distribution) then

𝐺𝑍1+𝑍2(𝑠) = 𝐺𝑍1(𝑠) 𝐺𝑍2(𝑠).

By looking at the PGF of 𝑍1 + 𝑍2, 𝐺𝑍1+𝑍2(𝑠), we can then deduce the distribution of 𝑍1 + 𝑍2. This can be an easier strategy than trying to deduce its distribution directly.
Example 3.6. 𝑋 and 𝑌 are both Poisson random variables, with parameters
𝜆1 and 𝜆2 , respectively. Assume that 𝑋 and 𝑌 are also independent of each
other. What is the distribution of 𝑋 + 𝑌 ?
Solution: Use probability generating functions!
The PGF of a Poisson random variable with parameter 𝜆 is

𝐺(𝑠) = 𝑒^𝜆(𝑠−1).

Therefore, the PGF of 𝑋 + 𝑌 is

𝐺𝑋+𝑌(𝑠) = 𝑒^𝜆1(𝑠−1) 𝑒^𝜆2(𝑠−1) = 𝑒^(𝜆1+𝜆2)(𝑠−1).

We recognise the format of this PGF as the PGF of a Poisson distribution with parameter (𝜆1 + 𝜆2). Therefore,

𝑋 + 𝑌 ∼ Poisson(𝜆1 + 𝜆2).
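This additivity can also be verified numerically by convolving the two pmfs directly; a short Python sketch, not from the notes (the rates 2.0 and 3.5 are arbitrary choices):

```python
import math

def pois_pmf(k, lam):
    # P(K = k) for K ~ Poisson(lam)
    return math.exp(-lam) * lam**k / math.factorial(k)

lam1, lam2 = 2.0, 3.5
for n in range(15):
    # By independence, P(X + Y = n) = sum_k P(X = k) P(Y = n - k)
    conv = sum(pois_pmf(k, lam1) * pois_pmf(n - k, lam2) for k in range(n + 1))
    assert abs(conv - pois_pmf(n, lam1 + lam2)) < 1e-12
print("convolution matches Poisson(lam1 + lam2)")
```

The convolution of the two Poisson pmfs agrees term by term with the Poisson(𝜆1 + 𝜆2) pmf, just as the PGF argument predicts.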
Calculating the PGF of a random sum
That is, the sum of a random number of random variables. Suppose that
𝑋1 , … , 𝑋𝑁 are i.i.d. random variables with common PGF 𝐺𝑋 (𝑠) and that 𝑁
has PGF 𝐺𝑁 (𝑠).
The PGF 𝐺𝑌(𝑠) of 𝑌 = 𝑋1 + ⋯ + 𝑋𝑁 is given by

𝐺𝑌(𝑠) = E[E(𝑠^𝑌 | 𝑁)].

We’ll see this technique of expanding an expectation again later (you have already seen it if you studied STAT0005), and it will prove to be a very useful technique in solving problems. Using this, we can deduce that

𝐺𝑌(𝑠) = E[E(𝑠^𝑌 | 𝑁)] = E[[𝐺𝑋(𝑠)]^𝑁] = 𝐺𝑁[𝐺𝑋(𝑠)].

Similarly, it can be shown (challenge!) that the m.g.f. 𝑀𝑌(𝑡) of 𝑌 is given by

𝑀𝑌(𝑡) = E[exp(𝑡𝑌)] = 𝐺𝑁[𝑀𝑋(𝑡)].
Example 3.7. The number of people who enter a particular bookstore on
Gower Street, per day, follows a Poisson distribution with parameter 300. Of
those who visit, 60% will make a purchase. What is the distribution of the
number of people who make a purchase at the bookstore in one day?
Solution: Use probability generating functions!
Notice that whether a visitor makes a purchase is a Bernoulli trial, with probability of success 0.6. Let

𝑋𝑖 = { 1 if the 𝑖th customer makes a purchase,
     { 0 otherwise,
so that we are looking to find the distribution of 𝑌 = 𝑋1 + … + 𝑋𝑁 where
𝑁 ∼ Poisson(300). Notice that the PGF of the Bernoulli random variable 𝑋 is 𝐺𝑋(𝑠) = (0.4 + 0.6𝑠), and the PGF of 𝑁 is 𝑒^300(𝑠−1).
By the result above, we know that the PGF of 𝑌 is 𝐺𝑌(𝑠) = 𝐺𝑁(𝐺𝑋(𝑠)). Think of this as substituting 𝐺𝑋(𝑠) in place of 𝑠 in the PGF of 𝑁:
𝐺𝑌(𝑠) = 𝐺𝑁(𝐺𝑋(𝑠)) = 𝑒^300(0.4+0.6𝑠−1) = 𝑒^300(0.6𝑠−0.6) = 𝑒^180(𝑠−1).
We recognise this as the PGF of a Poisson random variable with mean 180.
Therefore, the distribution of those who make a purchase is Poisson(180).
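The general fact being used here — thinning a Poisson(𝜆) count with keep-probability 𝑝 gives a Poisson(𝜆𝑝) count — can be verified numerically by conditioning on 𝑁, exactly as in the law of total probability. A sketch, not from the notes (using a smaller rate than the bookstore's 300, purely for numerical comfort):

```python
import math

def pois_pmf(k, lam):
    # P(K = k) for K ~ Poisson(lam)
    return math.exp(-lam) * lam**k / math.factorial(k)

def binom_pmf(n, k, p):
    # P(Binomial(k, p) = n)
    return math.comb(k, n) * p**n * (1 - p)**(k - n)

lam, p = 30.0, 0.6
for n in range(10, 30):
    # Condition on the number of visitors k:
    # P(Y = n) = sum_k P(N = k) P(Binomial(k, p) = n)
    prob = sum(pois_pmf(k, lam) * binom_pmf(n, k, p) for k in range(n, 150))
    assert abs(prob - pois_pmf(n, lam * p)) < 1e-10
print("thinned Poisson(lam) is Poisson(lam * p)")
```

The truncation of the sum at 150 visitors is harmless here because the Poisson(30) tail beyond that point is negligible.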
Part IV
Week 2
Chapter 4
Introduction to discrete time Markov processes
4.1 The Markov property

A stochastic process {𝑋𝑡, 𝑡 ∈ 𝑇} is a Markov process if, for any sequence of time points 𝑡1 < 𝑡2 < ⋯ < 𝑡𝑛 in 𝑇, the conditional distribution of 𝑋𝑡𝑛 given 𝑋𝑡1, …, 𝑋𝑡𝑛−1 depends only on 𝑋𝑡𝑛−1: given the present state, the future of the process does not depend on its past.

Chapter 5
Classifying discrete time Markov processes

5.1 Irreducible classes of intercommunicating states

State 𝑗 is accessible from state 𝑖 (written 𝑖 → 𝑗) if 𝑝𝑖𝑗^(𝑛) > 0 for some 𝑛 ≥ 0; that is, starting from state 𝑖, it is possible that the chain will eventually enter state 𝑗.
States 𝑖 and 𝑗 intercommunicate (we write 𝑖 ↔ 𝑗) if 𝑖 → 𝑗 and 𝑗 → 𝑖.
Properties of ↔:

(i) 𝑖 ↔ 𝑖 for all states 𝑖;
(ii) if 𝑖 ↔ 𝑗 then 𝑗 ↔ 𝑖;
(iii) if 𝑖 ↔ 𝑗 and 𝑗 ↔ 𝑘 then 𝑖 ↔ 𝑘.

(i)+(ii)+(iii) mean that ↔ is an equivalence relation on 𝑆.
We can partition the state space 𝑆 into non-overlapping classes of intercommunicating states. Within a particular class, every pair of states intercommunicate;
states in different classes do not intercommunicate.
Each class is called an irreducible class of states.
If there is only one irreducible class, then the Markov chain is said to be irreducible.
Example 5.1. For the Markov chain with state space 𝑆 = {0, 1, 2} and transition matrix

𝑃 = ( 1/4  1/4  1/2
       0    1    0
      1/2   0   1/2 ),

the state space can be partitioned into the irreducible classes {0, 2} and {1}. In particular, the Markov chain is not irreducible because there are two irreducible classes. It might be helpful to see this by drawing a state-space diagram.
Figure 5.1: State-space diagram, which may help in deciding the irreducible
classes
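A state-space diagram is one way to find the classes; they can also be computed mechanically. A Python sketch (not part of the notes): 𝑖 ↔ 𝑗 exactly when each state is reachable from the other through positive transition probabilities.

```python
from fractions import Fraction as F

# Transition matrix from Example 5.1
P = [[F(1, 4), F(1, 4), F(1, 2)],
     [F(0),    F(1),    F(0)],
     [F(1, 2), F(0),    F(1, 2)]]

n = len(P)

def reachable(i):
    # States j with i -> j (in >= 0 steps): graph search on positive entries
    seen = {i}
    stack = [i]
    while stack:
        u = stack.pop()
        for v in range(n):
            if P[u][v] > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

reach = [reachable(i) for i in range(n)]
# i <-> j iff each state is reachable from the other
classes = set()
for i in range(n):
    cls = frozenset(j for j in range(n) if i in reach[j] and j in reach[i])
    classes.add(cls)
print(sorted(sorted(c) for c in classes))  # [[0, 2], [1]]
```

The computation reproduces the two irreducible classes {0, 2} and {1} found in the example.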
5.2 Recurrence and transience
Let {𝑋𝑛 } be a Markov chain with state space 𝑆 and let 𝑖 ∈ 𝑆. Let
𝑓𝑖 = P(ever return to 𝑖|𝑋0 = 𝑖) = P(𝑋𝑛 = 𝑖 for some 𝑛 ≥ 1|𝑋0 = 𝑖).
If 𝑓𝑖 = 1 then 𝑖 is recurrent (eventual return to state 𝑖 is certain). If 𝑓𝑖 < 1 then 𝑖 is transient: with positive probability the chain never returns to 𝑖. (Therefore, for a transient state 𝑖, 𝜇𝑖 = E[𝑇𝑖𝑖] is infinite.)
5.2.4 Positive recurrence and null recurrence

In fact, there are two types of recurrent state. Recall that the first passage time (also called the recurrence time) 𝑇𝑖𝑖 of a state 𝑖 is the number of steps until the chain first returns to state 𝑖 given that it starts in state 𝑖.

Let 𝜇𝑖 = E(𝑇𝑖𝑖) be the mean recurrence time of state 𝑖. We have already seen that if 𝑖 is transient then 𝜇𝑖 = ∞. We have not computed 𝜇𝑖 when 𝑖 is recurrent. In fact, the value of 𝜇𝑖 when 𝑖 is recurrent enables us to distinguish between two types of recurrence: positive recurrence and null recurrence.

A recurrent state 𝑖 is
• positive recurrent if 𝜇𝑖 < ∞ (that is, 𝜇𝑖 is finite),
• null recurrent if 𝜇𝑖 = ∞ (that is, 𝜇𝑖 is infinite).

At first glance it may seem strange that a state 𝑖 can be null recurrent: return to state 𝑖 is certain, but the expected time (number of steps) to return to 𝑖 is infinite. In other words, the random variable 𝑇𝑖𝑖 is finite with probability 1, but its mean, E(𝑇𝑖𝑖), is infinite. Such states do exist and we will see examples later in the course.

5.3 Periodicity

The period of a state 𝑖 is the greatest common divisor of the set of integers 𝑛 ≥ 1 such that 𝑝𝑖𝑖^(𝑛) > 0, that is,

𝑑𝑖 = gcd{𝑛 ≥ 1 ∶ 𝑝𝑖𝑖^(𝑛) > 0}.
• If 𝑑𝑖 = 1 then state 𝑖 is aperiodic.
• If 𝑑𝑖 > 1 then state 𝑖 is periodic with period 𝑑𝑖.
• If 𝑝𝑖𝑖^(𝑛) = 0 for all 𝑛 ≥ 1 (i.e. return to state 𝑖 is impossible) then, by convention, state 𝑖 is also said to be aperiodic.
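The definition can be turned into a direct (if inefficient) computation: scan 𝑝𝑖𝑖^(𝑛) for 𝑛 up to some horizon and take the gcd of those 𝑛 with positive return probability. A Python sketch, not from the notes (the cyclic chain is an illustrative choice, and the finite horizon makes this a heuristic check only):

```python
from math import gcd

# A 3-state chain that cycles deterministically 0 -> 1 -> 2 -> 0
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def period(P, i, max_n=24):
    # gcd of all n >= 1 with (P^n)[i][i] > 0, scanned up to max_n steps
    d = 0
    Pn = P
    for n in range(1, max_n + 1):
        if Pn[i][i] > 0:
            d = gcd(d, n)
        Pn = matmul(Pn, P)
    return d

print(period(P, 0))  # 3: return to state 0 is possible only at multiples of 3
```

For a chain with a self-loop (e.g. any row with 𝑝𝑖𝑖 > 0) the same function returns 1, i.e. the state is aperiodic.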
5.4 Class properties
A class property is a property which is shared by all the states in an
irreducible class. If it can be shown that one of the states in a class has a
certain class property then all the states in the irreducible class must also have
that same property.
I give sketch proofs for some of the following results relating to class properties.
The other proofs can be found in any good textbook on Markov processes.
Theorem 5.1. A pair of intercommunicating states are either both recurrent
or both transient. That is, recurrence/ transience is a class property. It is not
possible to have a mixture of recurrent and transient states in an irreducible
class.
Proof. Look in ‘Introduction to Probability Models’ (Sheldon Ross).
Theorem 5.2. Null recurrence is a class property.
Proof. Recall that 𝜇𝑖 = E(𝑇𝑖𝑖) is the expected first passage time or the mean recurrence time of state 𝑖, and that

𝜇𝑖 is { finite    if 𝑖 is positive recurrent,
      { infinite  if 𝑖 is null recurrent or transient.
It can be shown (proof omitted) that

• if 𝑖 is aperiodic, then 𝑝𝑖𝑖^(𝑛) → 1/𝜇𝑖 as 𝑛 → ∞;
• if 𝑖 is periodic with period 𝑑, then 𝑝𝑖𝑖^(𝑛𝑑) → 𝑑/𝜇𝑖 as 𝑛 → ∞ (and 𝑝𝑖𝑖^(𝑚) = 0 if 𝑚 is not a multiple of 𝑑).

So, as 𝑛 → ∞,

• if 𝑖 is aperiodic, positive recurrent then 𝑝𝑖𝑖^(𝑛) → 1/𝜇𝑖 (positive, finite),
• if 𝑖 is periodic, positive recurrent then 𝑝𝑖𝑖^(𝑛𝑑) → 𝑑/𝜇𝑖 (positive, finite),
• if 𝑖 is null recurrent then 𝑝𝑖𝑖^(𝑛) → 0,
• if 𝑖 is transient then (proof omitted) ∑𝑛=0^∞ 𝑝𝑖𝑖^(𝑛) < ∞, and so 𝑝𝑖𝑖^(𝑛) → 0.

Now let 𝑖 ↔ 𝑗, with 𝑝𝑖𝑗^(𝑛) > 0 and 𝑝𝑗𝑖^(𝑚) > 0, and suppose that 𝑖 is null recurrent. Then

𝑝𝑖𝑖^(𝑚+𝑘+𝑛) ≥ 𝑝𝑖𝑗^(𝑛) 𝑝𝑗𝑗^(𝑘) 𝑝𝑗𝑖^(𝑚).

As 𝑘 → ∞,

𝑝𝑖𝑖^(𝑚+𝑘+𝑛) → 0

(because 𝑖 is null recurrent), and so

𝑝𝑗𝑗^(𝑘) → 0

(because 𝑝𝑖𝑗^(𝑛) > 0 and 𝑝𝑗𝑖^(𝑚) > 0). So 𝑗 is either transient or null recurrent. But 𝑖 is recurrent and 𝑖 ↔ 𝑗, so 𝑗 must be recurrent as well (because recurrence/transience is a class property). Hence 𝑗 must be null recurrent.
Theorem 5.3. Positive recurrence is a class property.
Proof. Suppose that 𝑖 ↔ 𝑗 and that 𝑖 is positive recurrent. Then 𝑗 must be
recurrent (because recurrence is a class property). If 𝑗 is null recurrent then
𝑖 is also null recurrent. This is a contradiction. Hence 𝑗 must be positive
recurrent.
Theorem 5.4. Intercommunicating states have the same period. That is, periodicity is a class property.
Proof. Let 𝑖 and 𝑗 be states with 𝑖 ↔ 𝑗. Let 𝑑𝑖 be the period of 𝑖 and 𝑑𝑗 be the period of 𝑗. Let 𝑛 and 𝑚 be such that 𝑝𝑖𝑗^(𝑛) > 0 and 𝑝𝑗𝑖^(𝑚) > 0. Then

𝑝𝑗𝑗^(𝑛+𝑚) ≥ 𝑝𝑗𝑖^(𝑚) 𝑝𝑖𝑗^(𝑛) > 0,

because there is at least one way of returning to 𝑗 in 𝑛 + 𝑚 steps: 𝑗 to 𝑖 in 𝑚 steps and then 𝑖 to 𝑗 in a further 𝑛 steps. Therefore,

𝑑𝑗 | 𝑛 + 𝑚 (𝑑𝑗 is a divisor of 𝑛 + 𝑚).

Let 𝑘 be such that 𝑝𝑖𝑖^(𝑘) > 0. Then

𝑝𝑗𝑗^(𝑛+𝑘+𝑚) ≥ 𝑝𝑗𝑖^(𝑚) 𝑝𝑖𝑖^(𝑘) 𝑝𝑖𝑗^(𝑛) > 0,

so

𝑑𝑗 | 𝑛 + 𝑘 + 𝑚 (𝑑𝑗 is a divisor of 𝑛 + 𝑘 + 𝑚).

These two divisibility facts together imply that

𝑑𝑗 | 𝑘 (𝑑𝑗 is a divisor of 𝑘).

That is, 𝑑𝑗 divides every 𝑘 in {𝑘 ≥ 1 ∶ 𝑝𝑖𝑖^(𝑘) > 0}. However, by definition 𝑑𝑖 is the greatest common divisor of this set. Therefore, 𝑑𝑗 ≤ 𝑑𝑖. Reversing the roles of 𝑖 and 𝑗 gives 𝑑𝑖 ≤ 𝑑𝑗. Hence 𝑑𝑖 = 𝑑𝑗.
Summary
The following are class properties:
• Null recurrence;
• Positive recurrence;
• Transience;
• Periodicity.
5.5 Further properties of discrete time Markov chains
Theorem 5.5. A finite irreducible Markov chain cannot be null recurrent.
Proof. Recall that 𝑝𝑗^(𝑛) = P(𝑋𝑛 = 𝑗). Suppose that the chain is null recurrent. It can be shown that 𝑝𝑗^(𝑛) → 0 as 𝑛 → ∞ for all 𝑗 ∈ 𝑆.

We know that

∑𝑗∈𝑆 𝑝𝑗^(𝑛) = 1 for all 𝑛,

so

lim𝑛→∞ ∑𝑗∈𝑆 𝑝𝑗^(𝑛) = 1.

But 𝑆 is finite, so if the chain is null recurrent

lim𝑛→∞ ∑𝑗∈𝑆 𝑝𝑗^(𝑛) = ∑𝑗∈𝑆 lim𝑛→∞ 𝑝𝑗^(𝑛) = 0.

This is a contradiction, so the Markov chain cannot be null recurrent.
Theorem 5.6. A finite Markov chain cannot contain any null recurrent states.
Proof. Suppose that the chain contains a state 𝑖 which is null recurrent. Then,
since null recurrence is a class property, state 𝑖 is in some irreducible finite closed
class of null recurrent states. This is not possible (see above).
Theorem 5.7. It is not possible for all states in a finite state space Markov
chain to be transient.
Note that this implies that all states in a finite irreducible Markov chain are
recurrent (because recurrence is a class property). That is, finite and irreducible
⇒ recurrent Markov chain.
Proof. Suppose all states are transient and 𝑆 = {0, 1, … , 𝑀 }. A transient state
will be visited only a finite number of times. Therefore for each state 𝑖 there
exists a time 𝑇𝑖 after which 𝑖 will never be visited again. Therefore, after time
𝑇 = max{𝑇0 , … , 𝑇𝑀 } no state will be visited again. BUT the Markov chain
must be in some state. Therefore we have a contradiction. Therefore, at least
one state must be recurrent.
5.6 Closed classes and absorption
A class 𝐶 of intercommunicating states is closed if 𝑝𝑖𝑗 = 0 for all 𝑖 ∈ 𝐶 and 𝑗 ∉
𝐶, that is, once the chain gets into 𝐶, it never leaves 𝐶.
If a single state forms a closed class then it is called an absorbing state.
Note the following important observations:
• A finite closed class must be recurrent: once entered, the class behaves
like a finite irreducible Markov chain.
• An absorbing state is positive recurrent: its mean recurrence time is 1.
Theorem 5.8. An irreducible class 𝐶 of recurrent states is closed.
Proof. Suppose that 𝐶 is not closed. Then there exist 𝑖 ∈ 𝐶 and 𝑗 ∉ 𝐶 such
that 𝑖 → 𝑗. But 𝑗 does not communicate with 𝑖, otherwise 𝑖 and 𝑗 would
intercommunicate and 𝑗 would be in 𝐶. Thus, there is positive probability that
the chain leaves 𝑖 and never returns to 𝑖. This means that 𝑖 is transient. This
is a contradiction, as 𝑖 ∈ 𝐶 and 𝐶 is recurrent.
5.6.1 Decomposition of the states of a Markov chain
Putting the previous results together shows that the state space 𝑆 of a Markov
chain can be decomposed as
𝑆 = 𝑇 ∪ 𝐶𝑚1 ∪ 𝐶𝑚2 ∪ ⋯ ,
where
• 𝑇 is a set of transient states;
• 𝐶𝑚1 , 𝐶𝑚2 , … are irreducible closed classes of recurrent states.
Each 𝐶𝑚𝑖 is either null recurrent or positive recurrent, and all states in a particular 𝐶𝑚𝑖 have the same period (different 𝐶𝑚𝑖 ’s can have different periods).
The following observations apply:
• If 𝑋0 ∈ 𝐶𝑚𝑖 then the chain stays in 𝐶𝑚𝑖 forever.
• If 𝑋0 ∈ 𝑇 then either:
– the chain always stays in 𝑇 (only possible if the set 𝑇 of states is
infinite); or
– the chain is eventually absorbed in one of the 𝐶𝑚𝑖 ’s (where it stays
forever).
• If 𝑆 is finite then it is impossible for the chain to remain forever in the
(finite) set 𝑇 of transient states.
• If 𝑆 is finite then at least one state must be visited infinitely often and so
there must be at least one recurrent state.
We also know that if 𝑆 is finite then there are no null recurrent states. The consequences of this are:
• If a Markov chain has a finite state space then there must be at least one
positive recurrent state.
• If a Markov chain has a finite state space and is irreducible, then it must
be positive recurrent.
• A finite closed irreducible class must be positive recurrent.
Part VI
Week 4
Chapter 6
Long-run behaviour of discrete time Markov processes
This is the crux of the first half of the course: understanding a Markov chain’s
behaviour in the long run (limiting behaviour).
Our goal here is to find whether the distribution of states, 𝑝(𝑛) , settles down
(i.e. converges) as 𝑛 → ∞. If it does, what is this limit?
We now have enough tools at our disposal. It will turn out that to fully determine its long term behaviour we must classify the Markov chain in terms
of:
• Irreducible classes;
• Transience and positive/null recurrence;
• Periodicity.
Remember: what we’re interested in is

lim𝑛→∞ 𝑝^(𝑛).
For example, if 𝑆 = {1, 2, 3, 4}, then we’re interested in how the vector
𝑝(𝑛) = (𝑃 (𝑋𝑛 = 1), 𝑃 (𝑋𝑛 = 2), 𝑃 (𝑋𝑛 = 3), 𝑃 (𝑋𝑛 = 4))
changes as we let 𝑛 → ∞. Does this vector settle down as 𝑛 → ∞ (does it
have a limit)? If is does then this is good news: we can use this information to
predict the future behaviour of the chain because we know what 𝑃 (𝑋𝑛 = 𝑖) is
for large 𝑛. This allows us to answer questions such as “what’s the probability that
the chain is in state 2 at some large time?”. While this might not sound all that interesting, it opens the door to answering other questions about long-term behaviour.
6.1 Invariant distributions
We will start with a somewhat simpler question than that of limiting behaviour:
are there particular initial distributions for a Markov chain, such that when the
Markov chain starts in any of these distributions, we can determine the limit of
𝑝(𝑛) as 𝑛 → ∞?
6.1.1 Definition

The probability row vector 𝜋 is an invariant distribution of a discrete-time process {𝑋𝑡, 𝑡 ∈ 𝑇} if

𝜋 = 𝜋𝑃.
If 𝜋 is an invariant distribution, and the initial distribution of the Markov chain
is 𝜋, why does this imply that we can study its long term behaviour?
The idea is that if the chain starts in 𝜋 (or equivalently, we find at some later
time that the distribution of states is 𝜋), then this tells us that the distribution
of states is 𝜋 forever: we’ve uncovered the long term behaviour of the chain in
this specific case.
This works because, if an invariant distribution 𝜋 exists and 𝑝^(0) = 𝜋, then

𝑝^(1) = 𝑝^(0)𝑃 = 𝜋𝑃 = 𝜋;
𝑝^(2) = 𝑝^(1)𝑃 = 𝜋𝑃 = 𝜋;
𝑝^(3) = 𝑝^(2)𝑃 = 𝜋𝑃 = 𝜋; etc.

In fact 𝑝^(𝑛) = 𝜋 for 𝑛 = 1, 2, …, and we have found the limit of 𝑝^(𝑛) as 𝑛 → ∞, as required!

6.1.2 Consequences
If our initial distribution is 𝜋, with 𝜋 = 𝜋𝑃 , then the distribution of the states
at any given future time 𝑛 is also 𝜋. This is why it is called an invariant
distribution. An invariant distribution always exists if the state space 𝑆 is
finite, but it may not be unique. If 𝑆 is infinite then an invariant distribution
may not exist.
Assuming that the transition matrix 𝑃 is a 𝑘 × 𝑘 matrix, and the state space is
𝑆 = {1, 2, …, 𝑘}, solving 𝜋 = 𝜋𝑃 is a matter of solving a set of 𝑘 simultaneous
equations, each of the form
𝜋𝑖 = 𝜋1 𝑝1𝑖 + 𝜋2 𝑝2𝑖 + … + 𝜋𝑘 𝑝𝑘𝑖 .
If you try this you will find that one equation is redundant (try it!): we are therefore trying to solve for 𝑘 unknowns (𝜋1, …, 𝜋𝑘) using only (𝑘 − 1) independent equations. On their own these cannot pin down a unique solution, but we have the advantage that since 𝜋 is itself a probability distribution, we must have 𝜋1 + … + 𝜋𝑘 = 1. This added requirement means that we can now solve for 𝜋1, …, 𝜋𝑘.
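For a two-state chain this procedure can be carried out exactly by hand or in code. A Python sketch, not from the notes (the matrix is the weather chain used later in Example 6.1): drop the redundant equation, keep 𝜋1(1 − 𝑝11) = 𝜋2 𝑝21, and normalise.

```python
from fractions import Fraction as F

# Two-state illustration
P = [[F(3, 5), F(2, 5)],    # 0.6  0.4
     [F(1, 2), F(1, 2)]]    # 0.5  0.5

# pi_1 = pi_1 p_11 + pi_2 p_21 rearranges to pi_1 (1 - p_11) = pi_2 p_21;
# the second equation of pi = pi P is redundant. Add pi_1 + pi_2 = 1:
ratio = P[1][0] / (1 - P[0][0])   # pi_1 / pi_2
pi2 = 1 / (1 + ratio)
pi1 = 1 - pi2
pi = [pi1, pi2]
print(pi)  # [Fraction(5, 9), Fraction(4, 9)]

# Check invariance exactly: pi P == pi
piP = [pi[0] * P[0][j] + pi[1] * P[1][j] for j in range(2)]
assert piP == pi
```

The exact answer 𝜋 = (5/9, 4/9) matches the limiting values 0.5555…, 0.4444… that appear numerically in Example 6.1.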
6.1.3 How many invariant distributions?
How many invariant distributions can a discrete time Markov process have?
To find an invariant distribution, if one exists, we solve a set of simultaneous
linear equations. Given such a set there are either:
• No solutions – no invariant distribution exists;
• Exactly one solution – exactly one invariant distribution exists;
• An infinite number of solutions – there are an infinite number of invariant
distributions.
If 𝑆 is irreducible then:
An invariant distribution exists ⟺ all states are positive recurrent.
If the invariant distribution exists under these conditions, it must be unique.
Notes:
1. If 𝑆 is irreducible, then all states are of the same type.
2. If 𝑆 is finite (and irreducible), then all states must be positive recurrent,
so an invariant must exist.
3. If 𝑆 is infinite (and irreducible), then all states must be of the same type,
but they aren’t necessarily positive recurrent.
If 𝑆 is NOT irreducible then:
• Finite S:
Chain is guaranteed to enter a positive recurrent class eventually, so invariant exists.
• Infinite S:
– If there exists a positive recurrent class, then there must be an invariant distribution.
– If there does not exist a positive recurrent class, then there cannot
be an invariant distribution
In any case, if there are two or more positive recurrent classes, then an infinite
number of invariant distributions exist.
6.2 Equilibrium distributions
Simply looking at invariant distributions in order to determine a Markov
chain’s long term behaviour is very restrictive. We want a far more general
idea of long term behaviour. Can we determine whether 𝑝(𝑛) converges as
𝑛 → ∞, regardless of the initial distribution? If this limit exists, what is it?
Example 6.1. Recall the weather example (Exercise 3.2), with state space
𝑆 = {𝑅, 𝐹} and transition matrix

𝑃 = ( 0.6  0.4
      0.5  0.5 ).

We examine how 𝑃^(𝑛) = 𝑃^𝑛, the 𝑛-step transition matrix, changes as 𝑛 increases:

𝑃² = ( 0.56  0.44
       0.55  0.45 ),

𝑃³ = ( 0.556  0.444
       0.555  0.445 ),

𝑃⁶ = 𝑃³ ⋅ 𝑃³ = ( 0.555556  0.444444
                 0.555555  0.444445 ).

Looking at different initial distributions for this Markov chain and what happens in the ‘long run’ from these starting distributions:

Initial distribution | Distribution of 𝑋1 | Distribution of 𝑋2 | ⋯ | Distribution of 𝑋6
(0, 1)     | (0.5, 0.5)   | (0.55, 0.45)   | ⋯ | (0.55556, 0.44444)
(1, 0)     | (0.6, 0.4)   | (0.56, 0.44)   | ⋯ | (0.55556, 0.44444)
(0.7, 0.3) | (0.57, 0.43) | (0.557, 0.443) | ⋯ | (0.55556, 0.44444)
(0.5, 0.5) | (0.55, 0.45) | (0.555, 0.445) | ⋯ | (0.55556, 0.44444)
Notice that as 𝑛 → ∞, 𝑃^(𝑛) is tending to a limiting matrix whose rows are identical to each other, and 𝑝^(𝑛) seems to be tending to a limiting probability distribution which is the same regardless of the initial distribution. This is a very important phenomenon, which we describe next.
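You can reproduce this convergence of 𝑃^(𝑛) numerically; a quick Python sketch, not part of the notes:

```python
# Iterate the weather chain's transition matrix and watch P^n converge.
P = [[0.6, 0.4],
     [0.5, 0.5]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pn = P
for _ in range(29):
    Pn = matmul(Pn, P)   # after the loop, Pn = P^30

# Both rows are now essentially (5/9, 4/9) = (0.5555..., 0.4444...)
print(Pn)
```

By 𝑛 = 30 the two rows agree to many decimal places, exactly the pattern visible in the 𝑃⁶ computation above.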
6.2.1 Definition

A probability row vector 𝜋 = {𝜋𝑗, 𝑗 ∈ 𝑆} is an equilibrium distribution for the discrete-time Markov chain {𝑋𝑛} if

𝑝^(𝑛) ⟶ 𝜋 as 𝑛 ⟶ ∞,
independently of the initial distribution 𝑝(0) (i.e. for any initial distribution, the
long-run distribution of 𝑋𝑛 as 𝑛 → ∞ is the same).
6.2.2 Consequences

When an equilibrium distribution exists,

𝜋𝑗 = lim𝑛→∞ 𝑝𝑖𝑗^(𝑛) for all 𝑖, 𝑗 ∈ 𝑆.
Notice what this says: 𝜋 is an equilibrium distribution if 𝑝(𝑛) → 𝜋 as 𝑛 → ∞
regardless of the start distribution, 𝑝(0) .
It is important to note that there cannot be two or more of these limits: either
such a limit does not exist, or exactly one exists. Think about this carefully
because it tells you that there is at most one equilibrium distribution!
The following interpretations of the equilibrium distribution, 𝜋, are important:
• 𝜋𝑗 is the limiting probability that the chain is in state 𝑗 at time 𝑛.
• It can be shown that 𝜋𝑗 is also equal to the long-run proportion of time
that the chain spends in state 𝑗.
6.3 Invariant vs equilibrium distributions
We know that 𝑝(𝑛) = 𝑝(𝑛−1) 𝑃 . If this chain has an equilibrium distribution then
𝑝(𝑛) ⟶ 𝜋 as 𝑛 → ∞, and 𝜋 is the equilibrium distribution. Therefore, letting
𝑛 → ∞ we have 𝜋 = 𝜋𝑃 , so that 𝜋 is also an invariant distribution.
Though the concepts of an invariant distribution and an equilibrium distribution
are therefore closely related, they are NOT the same thing: an equilibrium distribution must also be an invariant distribution, BUT an invariant distribution
is not necessarily an equilibrium distribution.
• An equilibrium implies the existence of an invariant;
• An invariant does NOT guarantee the existence of an equilibrium.
Invariant distribution 𝜋:
• If the chain starts in (or equivalently, gets to) 𝜋, the distribution of states
at all further times is also 𝜋.
Equilibrium distribution 𝜋:
• If the chain is run for long enough (i.e. forever) its probabilistic behaviour
settles down to that of 𝜋.
• This is regardless of the state in which the chain starts.
6.3.1 Important consequences
• No invariant distribution ⇒ no equilibrium distribution.
• Exactly 1 invariant distribution ⇒ there may or may not be an equilibrium
distribution.
• Two or more invariant distributions ⇒ no equilibrium distribution.
Some other important things to note:
• Not all Markov chains have equilibrium distributions.
• If a Markov chain has an equilibrium distribution, then it is unique (cannot
have more than one equilibrium distribution).
• If a Markov chain has an equilibrium distribution, then it also has an
invariant distribution and these distributions are identical.
• If a Markov chain does not have an equilibrium distribution, then it may
or may not have an invariant distribution.
• A Markov chain can have any number of invariant distributions (including
no invariant distribution).
• It is possible for a Markov chain to have more than one invariant distribution but not have an equilibrium distribution.
• It is possible for a Markov chain to have neither an invariant distribution
nor an equilibrium distribution.
6.4 Ergodicity and the equilibrium distribution
An ergodic state is one that is both positive recurrent and aperiodic. The
existence of an equilibrium distribution depends on ergodicity.
Theorem 6.1. An equilibrium distribution exists if and only if there is an
ergodic class 𝐶 and the Markov chain is certain to be absorbed into 𝐶 eventually,
wherever it starts.
We won’t be proving this result. Notice that it gives us a checklist for existence
of an equilibrium distribution:
1. There is a single ergodic class;
2. The process is guaranteed to enter the single ergodic class wherever it
starts.
6.4.1 A special case
What happens when {𝑋𝑛 } is an irreducible ergodic Markov chain? Applying
the theorem:
1. There is a single ergodic class (by definition in this case);
2. We’re guaranteed to enter the ergodic class (because it’s irreducible, we
start there and always stay there!)
Therefore an equilibrium distribution exists.
Without loss of generality we assume that 𝑆 = {0, 1, 2, …}. Then in this special case (we won’t prove this) the equilibrium distribution 𝜋 satisfies

𝜋𝑗 = 1/𝜇𝑗 for all 𝑗 ∈ 𝑆,

where 𝜇𝑗 is the mean recurrence time of state 𝑗.
This provides us with an alternative way to show that a Markov chain is positive recurrent: if the chain is irreducible, aperiodic and has an invariant distribution (that is, there exists a 𝜋 such that 𝜋 = 𝜋𝑃) then it must be positive recurrent.
Therefore, an irreducible aperiodic Markov chain has an invariant distribution
if and only if the chain is positive recurrent.
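The relation 𝜋𝑗 = 1/𝜇𝑗 can be checked by simulation; a Python sketch for the two-state weather chain (not part of the notes). Its equilibrium distribution is approximately (5/9, 4/9), consistent with the limits seen in Example 6.1, so the mean return time to state 0 should be about 9/5 = 1.8.

```python
import random

random.seed(2)
P = [[0.6, 0.4],
     [0.5, 0.5]]

def step(i):
    # One transition of the chain from state i
    return 0 if random.random() < P[i][0] else 1

# Simulate a long run and record the return times to state 0
state = 0
returns, steps_since = [], 0
for _ in range(200_000):
    state = step(state)
    steps_since += 1
    if state == 0:
        returns.append(steps_since)
        steps_since = 0

mu_0 = sum(returns) / len(returns)
print(mu_0)  # close to 1.8 = 1/pi_0
```

The empirical mean recurrence time of state 0 matches 1/𝜋0 closely, as the theorem predicts.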
6.4.2 Summary
When does an equilibrium distribution exist?
• More than 1 closed class ⇒ no equilibrium distribution exists (the long-term behaviour of the chain depends on 𝑝^(0)).
• No closed classes (i.e. all states are transient) ⇒ no equilibrium distribution exists.
• If there exists exactly one closed class 𝐶, is it certain that the Markov
chain will eventually be absorbed into 𝐶?
– If “no” then no equilibrium distribution exists (the long-term behaviour of the chain depends on 𝑝(0) ).
– If “yes” then an equilibrium distribution exists if and only if 𝐶 is
ergodic.
Part VII
Week 5
Chapter 7
Checking the Markov property and solving problems
7.1 The idea of ‘first step decomposition’
Recall from Section 4.3.1 that the ‘first step decomposition’ technique is a very
useful method for computing certain expectations and probabilities associated
with Markov chains. You have already seen how to use this idea to compute
probabilities. Here we’ll use the same idea to compute expectations that would
otherwise be tricky to calculate.
The main tool will be the use of the law of conditional (iterated) expectation to
find expectations of interest:
𝐸[𝑋] = 𝐸 [ 𝐸[𝑋|𝑌 ] ]
7.1.1 An example
Roads connecting three locations in a city are shown in the following diagram, the time taken to cycle along any of the roads being 1 minute. Geraint is a bicycle courier whose office is located at location A. He has a parcel that he needs to deliver to location C. However, he is a law-abiding cyclist who will only follow the direction of travel permitted on each road, as indicated by the arrows. Whenever he comes to a junction, he selects any of the routes available to him with probabilities as given in the diagram below.
• How long, on average, will it take Geraint to deliver the parcel?
• How many times, on average, will Geraint visit location B before delivering
the parcel?
Figure 7.1: State-space diagram of the roads connecting the locations, showing the permitted directions of travel and the probabilities with which Geraint chooses each available road.
There are many ways of solving these questions, but a convenient method is to
use ‘first step decomposition’. The idea is the same as you saw in Section 4.3.1:
think about ‘where do we go next’?
It’s important that you can ‘spot’ the types of situations where first step decomposition can be applied.
Example 7.1. How long, on average, will it take Geraint to deliver
the parcel?
We’ll apply first step decomposition to solve this question. Let 𝑋𝑛 denote
Geraint’s location at time 𝑛, and let 𝑇 denote the time it takes for Geraint to
deliver the parcel to location C.
We need to calculate 𝐸[𝑇 ], and we will do this by conditioning on Geraint’s
next move. In the first instance, he starts in location A, so 𝑋0 = 𝐴. Using the
iterated law of conditional expectation, we get:
𝐸[𝑇 ] = 𝐸[𝐸[𝑇 |𝑋1 , 𝑋0 = 𝐴]] = 𝐸[𝑇 |𝑋1 = 𝐵, 𝑋0 = 𝐴]𝑝𝐴𝐵 +𝐸[𝑇 |𝑋1 = 𝐶, 𝑋0 = 𝐴]𝑝𝐴𝐶 .
Substituting in the relevant transition probabilities we get:
𝐸[𝑇 ] =
2
1
𝐸[𝑇 |𝑋1 = 𝐵, 𝑋0 = 𝐴] + 𝐸[𝑇 |𝑋1 = 𝐶, 𝑋0 = 𝐴].
3
3
We don’t really need to condition on both 𝑋1 and 𝑋0 (remember that the
process is Markov!) so we can write:
𝐸[𝑇 ] =
2
1
𝐸[𝑇 |𝑋1 = 𝐵] + 𝐸[𝑇 |𝑋1 = 𝐶].
3
3
Now notice that 𝐸[𝑇 |𝑋1 = 𝐶] = 1 because we are conditioning on Geraint
reaching location C for the first time at time 1, so this simplifies to
𝐸[𝑇 ] = (2/3) 𝐸[𝑇 |𝑋1 = 𝐵] + 1/3.    (7.1)
Now compute 𝐸[𝑇 |𝑋1 = 𝐵] by applying first step decomposition again: from
B, where can Geraint go next?
𝐸[𝑇 |𝑋1 = 𝐵] = 𝐸[𝐸[𝑇 |𝑋2 , 𝑋1 = 𝐵]] = (1/2) 𝐸[𝑇 |𝑋2 = 𝐶, 𝑋1 = 𝐵] + (1/2) 𝐸[𝑇 |𝑋2 = 𝐴, 𝑋1 = 𝐵].
Now notice that 𝐸[𝑇 |𝑋2 = 𝐶, 𝑋1 = 𝐵] = 2 and 𝐸[𝑇 |𝑋2 = 𝐴, 𝑋1 = 𝐵] =
2 + 𝐸[𝑇 ] (the latter because Geraint uses 2 minutes going to B and then back
to A, and once back in A it is as if the process starts over from scratch).
Therefore 𝐸[𝑇 |𝑋1 = 𝐵] = 1 + (1/2)(2 + 𝐸[𝑇 ]). Substituting this back into (7.1),
we have:
𝐸[𝑇 ] = (2/3) 𝐸[𝑇 |𝑋1 = 𝐵] + 1/3 = (1/3) 𝐸[𝑇 ] + 5/3.
Solving gives 𝐸[𝑇 ] = 2.5 minutes.
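Before moving on, the answer can be sanity-checked by simulation. The sketch below (the function and variable names are mine; the transition probabilities are those read off Figure 7.1) simulates many deliveries and averages the journey time. It also estimates the answer to the second bullet question, the average number of visits to location B.

```python
import random

# Transition probabilities from Figure 7.1 for the two states Geraint
# actually decides from: A -> B w.p. 2/3, A -> C w.p. 1/3;
# B -> A w.p. 1/2, B -> C w.p. 1/2.  C is absorbing (parcel delivered).
TRANSITIONS = {
    "A": [("B", 2 / 3), ("C", 1 / 3)],
    "B": [("A", 1 / 2), ("C", 1 / 2)],
}

def ride_to_C(rng):
    """Simulate one delivery; return (minutes taken, visits to B)."""
    state, minutes, visits_to_B = "A", 0, 0
    while state != "C":
        states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(states, weights=probs)[0]
        minutes += 1                      # each road takes 1 minute
        visits_to_B += state == "B"
    return minutes, visits_to_B

rng = random.Random(7)
samples = [ride_to_C(rng) for _ in range(100_000)]
mean_time = sum(t for t, _ in samples) / len(samples)
mean_visits = sum(v for _, v in samples) / len(samples)
print(mean_time, mean_visits)   # close to 2.5 minutes and 1 visit
```

The estimated delivery time should be close to the exact answer of 2.5 minutes, and the estimated number of visits to B close to 1.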
You may want to simplify the notation of the solution to Example 7.1
by defining new quantities. For example, let 𝐿𝑖 denote the expected time
to reach C, given that we start in state 𝑖. Then the solution above simplifies to
𝐿𝐴 = 1 + (2/3) 𝐿𝐵 + (1/3) 𝐿𝐶 .    (7.2)
We must add 1 since we are using one minute to move from state A to another
state. Now we know that 𝐿𝐶 = 0 (if we start in C, it takes no time to get
there!), and 𝐿𝐵 = 1 + 𝐿𝐴 /2 + 𝐿𝐶 /2 = 1 + 𝐿𝐴 /2. Therefore, substituting into
(7.2), we need to solve
𝐿𝐴 = 1 + (2/3)(1 + 𝐿𝐴 /2) = (1/3) 𝐿𝐴 + 5/3.
Solving gives 𝐿𝐴 = 2.5 minutes, as before.
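The first-step equations can equally be written as a linear system and solved mechanically, which scales better than substitution when there are many transient states. The sketch below uses my own naming, together with the standard fact (not derived in these notes) that the vector of expected hitting times solves (𝐼 − 𝑄)𝐿 = 𝟏, where 𝑄 is the transition matrix restricted to the transient states:

```python
# Solve  L_A = 1 + (2/3) L_B  and  L_B = 1 + (1/2) L_A  as the
# 2x2 linear system (I - Q) L = 1, where Q holds the transition
# probabilities among the transient states {A, B} only
# (rows/columns ordered A then B; C is the absorbing target).
Q = [[0.0, 2 / 3],
     [1 / 2, 0.0]]

# Entries of (I - Q), inverted with the 2x2 (ad - bc) formula.
a, b = 1 - Q[0][0], -Q[0][1]
c, d = -Q[1][0], 1 - Q[1][1]
det = a * d - b * c

# L = (I - Q)^{-1} [1, 1]^T : every step costs one minute.
L_A = (d - b) / det
L_B = (a - c) / det
print(round(L_A, 6), round(L_B, 6))   # 2.5 and 2.25
```

Note that 𝐿𝐵 = 2.25 agrees with the hand calculation 𝐿𝐵 = 1 + 𝐿𝐴 /2.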
Note that exactly the same technique – first step decomposition – can also be
applied to solve problems for continuous time Markov processes (which we will
cover later in the course).
7.2
How can I tell if a process is Markov?
You will come across many ICA and exam questions asking you to explain
whether or not a particular process is Markov. There are two ways in which you
could establish whether a process is Markov.
1. Argue logically, without using any mathematics, that the process is or
isn't Markov.
2. Check mathematically that the Markov property holds for all states
in the state space at all times.
The second strategy is particularly hard to carry out in full, so we mainly use
mathematics when we suspect that the Markov property does not hold: a single
counterexample is enough. To show that the Markov property does not hold,
find one set of states 𝑎, 𝑏, 𝑐 and perhaps 𝑑 such that one of the
following holds (you do not need to show both hold – one will do!):
• 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑐) ≠ 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏) for some 𝑎, 𝑏, 𝑐;
• 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑐) ≠ 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑑) for
some 𝑎, 𝑏, 𝑐, 𝑑 with 𝑐 ≠ 𝑑….
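To see what such a counterexample looks like concretely, here is a toy process of my own (not from the notes): let 𝑋0 , 𝑋1 , … be independent fair coin flips and 𝑌𝑛 = 𝑋𝑛 + 𝑋𝑛−1 . Knowing 𝑌𝑛−1 as well as 𝑌𝑛 pins down 𝑋𝑛 , which changes the distribution of 𝑌𝑛+1 , so 𝑌 is not Markov. The second bullet point can be checked by exact enumeration:

```python
from itertools import product
from fractions import Fraction

# X_i are fair coin flips (0 or 1) and Y_n = X_n + X_{n-1}.
# Enumerate all equally likely flip sequences (X_0, ..., X_3)
# and tabulate conditional probabilities for Y.
def cond_prob(a, b, c):
    """P(Y_3 = a | Y_2 = b, Y_1 = c), by exact enumeration."""
    hits = total = 0
    for xs in product((0, 1), repeat=4):
        y1, y2, y3 = xs[1] + xs[0], xs[2] + xs[1], xs[3] + xs[2]
        if (y2, y1) == (b, c):
            total += 1
            hits += y3 == a
    return Fraction(hits, total)

p_given_0 = cond_prob(2, 1, 0)   # Y_1 = 0 and Y_2 = 1 force X_2 = 1
p_given_2 = cond_prob(2, 1, 2)   # Y_1 = 2 and Y_2 = 1 force X_2 = 0
print(p_given_0, p_given_2)      # 1/2 and 0 -- they differ, so Y is not Markov
```

Here 𝑎 = 2, 𝑏 = 1, 𝑐 = 0 and 𝑑 = 2: the two conditional probabilities are 1/2 and 0, so conditioning on the extra step changes the answer and the Markov property fails.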
