STAT0007 Discrete Time Markov Process Questions

STAT0007: Introduction to Applied Probability (Level 5) or Stochastic Processes (Level 6)
Dr. Elinor M. Jones

Contents

Week 0
1 Overview of STAT0007
1.1 Content of the course
1.2 How this course will be run
1.3 Useful books

Week 1
2 Preamble
2.1 Mathematical/Statistical Writing

Week 1 (Continued)
3 Preliminaries
3.1 What is a stochastic process?
3.2 Random variables
3.3 Conditioning on events
3.4 Generating functions

Week 2
4 Introduction to discrete time Markov processes
4.1 The Markov property
4.2 Introduction to discrete time Markov chains
4.3 The idea of ‘first step decomposition’

Week 3
5 Classifying discrete time Markov processes
5.1 Irreducible classes of intercommunicating states
5.2 Recurrence and transience
5.3 Periodicity
5.4 Class properties
5.5 Further properties of discrete time Markov chains
5.6 Closed classes and absorption

Week 4
6 Long-run behaviour of discrete time Markov chains
6.1 Invariant distributions
6.2 Equilibrium distributions
6.3 Invariant vs equilibrium distributions
6.4 Ergodicity and the equilibrium distribution

Week 5
7 Checking the Markov property and solving problems
7.1 The idea of ‘first step decomposition’
7.2 How can I tell if a process is Markov?

Get ahead material (Reading week)
8 Preliminaries for continuous time Markov processes
8.1 Introduction to continuous-time Markov processes
8.2 The exponential distribution and continuous time Markov processes

Week 6
9 Holding times, jump chains and transition probabilities for continuous time Markov processes
9.1 Holding times
9.2 The jump chain
9.3 Transition probabilities

Week 7
10 The generator matrix and Kolmogorov’s equations
10.1 The generator matrix, 𝑄
10.2 Limiting behaviour in the long run

Week 8
11 Poisson processes
11.1 The Poisson process
11.2 The generator matrix for the Poisson process
11.3 Other properties of the Poisson process
11.4 Superposition and thinning of a Poisson process

Week 9
12 Birth-death processes part 1
12.1 Birth and death processes
12.2 Transition rates and the jump chain
12.3 Important types of birth-death processes
Part I
Week 0
Chapter 1
Overview of STAT0007
1.1 Content of the course
This module aims to provide an introduction to the study of systems which
change state stochastically with time and to facilitate the development of skills
in the application of probabilistic ideas. It is primarily intended for second and
third year students registered on the undergraduate degree programmes offered
by the Department of Statistical Science, or jointly with other departments.
On successful completion of the module a student should understand the Markov
property in discrete and continuous time; for discrete-time Markov chains, be
able to find and classify the irreducible classes of intercommunicating states,
calculate absorption or first passage times and probabilities, and assess the equilibrium behaviour; and for simple examples of continuous-time Markov chains, be able to write down the forward equations, and find and interpret the equilibrium distribution.
Stochastic processes are vital to applications in finance and insurance, and have
many applications in biology and medicine, and in the social sciences. They
also play a fundamental role in areas such as queueing theory and the study
of system reliability. The material in this module can be applied to simplified
real-world situations, and provides the foundations for further study of more
complex systems.
The course will start by briefly revising conditional probability (though you
are expected to be well-versed in this before starting the course). We’ll then
move on to look at Markov chains (discrete time and states), and in particular
long-run behaviour. The second half of the course looks at Markov processes
in continuous time, but still discrete state space, and again our ultimate aim
is to study the long-run behaviour of these processes. We’ll look at two very
special types of continuous time processes: Poisson processes and birth-death processes.
1.2 How this course will be run
Course leader
The course will be led by Dr. Elinor Jones, Associate Professor (Teaching) in
the Department of Statistical Science.
Level 5 and 6 versions of the course
The course is offered at two levels, but the chances are you won’t be able to
choose which level (this generally depends on your year of study etc).
The content and teaching is identical between the two levels, but in the main
exam period level 6 students will complete a more challenging paper.
How you’ll learn
You will work on the material in your own time with live sessions to complement
your learning. A typical weekly schedule will look like the following:
• Batch of work uploaded to Moodle each Monday at 9am.
• Finish the work by the following Monday.
• Each week, work will include:
– Watch videos;
– Read course notes;
– Complete guided exercises;
– Complete and submit homework each week (not in the first week);
– Quizzes;
– Other activities, as required.
You will also attend about three live sessions a week (exact number depends on
which week we’re in).
ONLINE large group sessions (all students attend):
• Tuesday 9-10am UK time.
• Friday 9-10am UK time.
These sessions typically include:
• Overview of material;
• Time for questions;
• Going through unseen questions.
FACE-TO-FACE small group sessions:
• Tutorial (attendance is monitored).
• All students must attend their allocated session.
Weekly exercises
On Friday each week I will release an exercise sheet. Note: the exercise sheet is
not released at the start of the week.
• Submit by the following Thursday 10am UK time.
• Solutions will be available at 9am each Friday.
– Those with a SoRA will have until 12:01am on Friday (i.e. an additional 14 hours according to UCL guidelines) to submit.
• NO LATE SUBMISSIONS. This rule will be strictly adhered to: no
exceptions (but see note above for students with a SoRA).
• Your work will be marked and individual feedback provided for SOME of the questions.
These weekly exercises will feed into the weekly tutorial (tutorials start in the
third week).
• You should have been allocated to a tutorial group.
• You cannot move tutorial group unless you have a timetable clash.
• If you need to change your tutorial group, please email Ms Karen Leport.
– Please don’t email me: I’ve no control over tutorial allocation.
During the tutorial, your tutor will go through the questions that were not
marked. You should ‘mark’ your own work while they do this.
You will then discuss an unseen question which is connected to the work you’ve
been doing. The question will be more involved than a typical homework question. You may be asked to do short tasks connected to the question, e.g.
• Discuss in small groups;
• Respond to live quiz questions.
Please take part: the tutorials are designed to benefit your learning.
Weekly structure in more depth
Just to make sure you are clear on how the weekly workload will be managed:
Week 1
• Material released on Monday 10th Jan.
• Live session Tuesday 11th Jan (welcome session only).
• Live session Friday 14th Jan (Q&A for week 1 material).
• Exercise sheet 1 released Friday 14th Jan.
Week 2
• Material released on Monday 17th Jan.
• Live session Tuesday 18th Jan (Q&A for week 1 material, continued).
• Submit exercise sheet 1 by 10am on Thursday 20th Jan (unless you have
a SoRA).
• Live session Friday 21st Jan (Q&A for week 1 and 2 material).
• Exercise sheet 2 released Friday 21st Jan.
Week 3
• Material released on Monday 24th Jan.
• Live session Tuesday 25th Jan (Q&A for week 2 material, continued).
• Attend your allocated tutorial.
• Submit exercise sheet 2 by 10am on Thursday 27th Jan (unless you have a SoRA).
• Live session Friday 28th Jan (Q&A for week 2 and 3 material).
• Exercise sheet 3 released Friday 28th Jan.
And so on, and so forth. There will be some exceptions to this, e.g. when you
sit an ICA for the course.
Getting help
You can get help in the following ways:
• Post to the Forum.
• Book an office hour slot.
• If your question is connected to the tutorial material, then ask your tutor
during the tutorial. Your tutor will not answer questions outside of your
tutorial slot.
Please note that I won’t answer questions by email (the exception to this is if
your question is more personal, e.g. organising additional time for your ICA).
Assessment
Assessment is split into four components:
• ICA 1: Tuesday 8th February (7.5% weight)
– Short Moodle quiz.
– Takes place instead of Tuesday live session.
– Mixture of MCQs and calculation questions.
• ICA 2: Tuesday 22nd March (7.5% weight)
– Short Moodle quiz
– Takes place instead of Tuesday live session.
– Mixture of MCQs and calculation questions.
• Participation (10% weight)
– Acceptable quality of weekly exercise submission.
– Peer exercise.
• Open book exam (75% weight)
– Sometime in main exam period.
– No choice of questions.
– Level 6 students will have more challenging questions.
1.3 Useful books
• Introduction to Probability Models by S. Ross;
• Probability and Random Processes by Grimmett and Stirzaker (more
mathematical… be warned).
Part II
Week 1
Chapter 2
Preamble
2.1 Mathematical/Statistical Writing
Mathematics is a language, and you must learn its grammar. Whether you
are handing in a tutorial sheet or writing an exam, you need to present your
work in an appropriate manner. You could start by thinking of the following:
• Think carefully about precisely what you are trying to calculate/ solve
and what information you have available in order to do so.
• Have you clearly defined all your notation? Be precise: are your definitions watertight?
• Is your notation appropriate?
• Does your solution flow logically, that is, does each step follow on logically
from the previous step? Again, be precise. The examiner is not a mind
reader. YOU need to guide the reader through your thought process.
• Have you explained your reasoning, not just written down some formulae?
• It is important to understand what you are doing; don’t just repeat a
list of steps. Always ask yourself: ‘do I know what I’m doing?’ and ‘how
would I explain this to a fellow student?’.
Being able to write clearly will help you to think clearly. It forces you to
understand the material: if you can’t write it well, you don’t understand it well
enough. Learning mathematics is not about repetition, as it was at A-Level,
and so past papers are NOT A GOOD REVISION TOOL. In particular, I write
ICA and exam questions that probe your understanding, and not your ability
to regurgitate methods. This reflects the fact that proper mathematics is not
about learning to apply a set of methods, but using logical reasoning to find
your way through a problem.
Some tips for writing in STAT0007 (based on previous cohorts of students):
• Random variables form the basis for most, if not every, question you will
encounter. Make sure you define each and every random variable you use.
All random variables should be denoted by CAPITAL letters.
• Probabilities should be written out in full – if you want to calculate 𝑃 (𝑋 =
1) for some random variable 𝑋, then write it as such. NEVER use notation
such as 𝑃 (1) to denote this. And, as the previous bullet point states,
writing 𝑃 (𝑥 = 1) is plain wrong.
• The ‘equals sign’ has a very specific meaning. Please don’t abuse it.
• Do a reality check on your answer. For example, a probability should
be between 0 and 1. An expectation of how many time units it takes to
reach a particular point should be at least 1 (assuming moving between
locations takes 1 unit of time, and that you’re not already there!).
• If your answer doesn’t contain written explanation, then it is not a complete solution. Think about guiding the marker through your thought process, rather than expecting the marker to second guess what you meant
to say.
• Use tutorial sheets to practise good mathematical writing.
Part III
Week 1 (Continued)
Chapter 3
Preliminaries
3.1 What is a stochastic process?
A course overview
A stochastic process is a process which evolves randomly over time (or space,
or both):
• ‘stochastic’: equivalent to ‘random’.
• ‘process’: occurring over time (or space, or both).
We will consider some simple stochastic processes which are used to model real-life situations. Our main focus will be on developing mathematical tools to
study their properties and learn about long-term behaviour of such processes.
Though this is a mathematical course, the applications of stochastic processes
are wide-ranging, e.g. in physics, engineering, the environment, social science,
epidemics, genetics and finance. We’ll see examples from some of these areas
during the course.
Definitions and basic properties
A stochastic process is a collection of random variables {𝑋𝑡 , 𝑡 ∈ 𝑇 } taking
values in the state space 𝑆. The parameter, or index, set 𝑇 often represents
time. We will look at both discrete-time processes and continuous-time
processes. We’ll generally use the following notation, though you need to be
flexible in adapting to different notation from time to time.
• Discrete-time processes: {𝑋𝑡 , 𝑡 ∈ 𝑇 }, where 𝑇 = {0, 1, 2, …}. This is often
written in terms of the natural numbers (including zero) rather than 𝑇 ,
i.e. {𝑋𝑛 , 𝑛 ∈ N}, where N = {0, 1, 2, …}.
• Continuous-time processes: {𝑋(𝑡), 𝑡 ∈ 𝑇 }, where 𝑇 = R+ (i.e. 𝑡 ≥ 0).
We’ll often write this as {𝑋𝑡 , 𝑡 ∈ 𝑇 }, which is identical to the discrete time notation above, but note that 𝑡 here can take any value in the positive reals, whereas for discrete time processes the set of values 𝑡 can take is countable.
• Notice that the state space can be discrete or continuous. In this course
we will only consider discrete state spaces.
Examples of stochastic processes
Below are examples of stochastic processes, categorised according to whether
they are continuous or discrete time processes, and whether the state space is
discrete or continuous. We only consider discrete state space processes in this
course, in both discrete and continuous time.
Discrete time, discrete state space
Here we are tossing a coin and counting the number of excess heads (i.e. how
many more heads than tails we see). The ‘time’ on the x-axis represents the coin
tosses. For example, we see from this particular realisation that the first toss
must have landed tails (because the number of excess heads is now -1), followed
by a heads (we’re now back up to 0 excess heads). The state-space is countable
(integers, so 𝑆 = Z) and the time element here is the coin tosses themselves so
we’re in discrete ‘time’ with 𝑇 = N ∪ {0}. Notice here that one coin toss is
equivalent to one unit of time, so we need to be flexible in how we view ‘time’.
Also notice that the sequence 𝑋1 , 𝑋2 , … isn’t independent: if we know 𝑋𝑘 , say, then 𝑋𝑘+1 will either be 𝑋𝑘 + 1 (if we see a head at the next toss) or 𝑋𝑘 − 1 (if we see a tail).
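The excess-heads process is easy to simulate. The sketch below (a minimal illustration, not part of the course material; the function name is my own choosing) generates one realisation of the chain by adding +1 for each head and −1 for each tail:

```python
import random

def excess_heads(n_tosses, seed=None):
    """One realisation of the excess-heads process X_0, X_1, ..., X_n:
    X_0 = 0, and each fair toss moves the process up 1 (head) or down 1 (tail)."""
    rng = random.Random(seed)
    path = [0]
    for _ in range(n_tosses):
        step = 1 if rng.random() < 0.5 else -1  # +1 for a head, -1 for a tail
        path.append(path[-1] + step)
    return path

path = excess_heads(20, seed=1)
```

Note how the dependence structure described above appears directly in the code: each 𝑋𝑘+1 is computed from 𝑋𝑘 .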
Discrete time, continuous state space
Here we have a gambler who starts off with £10. We’re tracking how much
money they have after each game. The ‘time’ on the x-axis represents the games
that the gambler bets on, so the index set is 𝑇 = N ∪ {0}. We consider the
state-space to be continuous here, though technically you could consider it to be
discrete. The reason why it’s easier to consider this as continuous state-space
is that there are so many possible ‘values’ that the process could take (think:
each value corresponds to a particular value in pounds and pence). Notice again
here that one game is equivalent to one unit of time, so we need to be flexible in
how we view ‘time’. Also notice that the sequence 𝑋1 , 𝑋2 , … isn’t independent:
if we know 𝑋𝑘 then, although we can’t provide a short list of possible values for 𝑋𝑘+1 as we could in the previous example, we could take an educated guess depending on the game being played.
Continuous time, discrete state space
Here we are counting occurrences of a specified event. At time zero we’ve seen
no events, but the first event happens roughly at time 2.5 (we can see this as
the number of occurrences steps up to 1), and the second event happens shortly
after. By time 5 we have seen 2 events. Notice here that the events can happen
at any instant in time so we think of time as continuous. The state-space is discrete as we’re just counting events (𝑆 = N ∪ {0}). Also notice that if we look at the value of the process at times say 8.3 and 8.8, 𝑋(8.3) and 𝑋(8.8), then these aren’t independent: if we know 𝑋(8.3) then 𝑋(8.8) must be at least as big as 𝑋(8.3), as we can’t ‘unsee’ events. Under certain distributional assumptions, which we’ll discuss later in the course, these processes are known as ‘Poisson processes’.

Figure 3.1: Tracking the number of excess heads in 20 coin tosses
Continuous time, continuous state space
Finally an example of a stochastic process that we won’t be studying further:
this is a Brownian motion which has continuous state-space and continuous
time. It has continuous paths, but an extension to Brownian motion is the Lévy
process which allows jumps in values to occur. These types of processes are
used in Finance but are beyond the scope of this course.
NOTE: Continuous-time processes can change their value/state (‘jump’) at any
instant of time; discrete-time processes can only do this at a discrete set of time
points. For discrete-time processes, when we use the word ‘time’ we mean
‘number of transitions/steps’.
Figure 3.2: Tracking a gambler’s wealth over 30 games
Figure 3.3: A realisation of a Poisson process
Figure 3.4: A realisation of a Brownian motion
3.2 Random variables
Informally speaking, a random variable is a variable (i.e. can take different
values), whose numerical values are randomly chosen from a set of possible
values, governed by a probability distribution. For example, the time that a
passenger must wait until the next southbound tube train at Goodge Street station is a random variable. Now suppose you arrive at Goodge Street station – the
amount of time you must wait until the next southbound tube arrives is a
realisation of the random variable ‘waiting time’.
It is important that you can spot a random variable, and can also define appropriate random variables for a given scenario.
Ideally, we should think about random variables as functions. In order to do so,
we must first build up some terminology. We’ll use a coin tossing example to
illustrate: suppose you toss a coin twice and record whether it lands heads or
tails at each toss. Note that the following is a simplification of ideas from Probability Theory and we won’t be using the technicalities in this course. However,
it does no harm to see these ideas.
• An experiment is any situation where there is a set of possible outcomes.
– Our coin-tossing example is an experiment.
– The possible outcomes are HH, HT, TH or TT.
• The set of possible outcomes is called the sample space, Ω.
– In our case, Ω = {𝐻𝐻, 𝐻𝑇 , 𝑇 𝐻, 𝑇 𝑇 }
• An outcome is denoted 𝜔 and is an element of the sample space.
– Suppose we conduct the experiment, and the outcome is TH. This is
our 𝜔.
• An event, 𝐴, is a subset of the sample space.
– e.g. define an event as ‘at least one head’.
– This is satisfied by HH, HT and TH.
– This is a subset of Ω.
The correspondence between the probability model and the real world:

Probability model | Real world
Sample space, Ω | ‘Set of all possible outcomes of the experiment.’
Sample point, 𝜔 | ‘Outcome of the experiment.’
Event 𝐴 ⊆ Ω | ‘Property that the outcome of the experiment may have.’
With a finite sample space Ω, we can assign probabilities to individual sample
points 𝜔 via a ‘weight function’, 𝑝 ∶ Ω → [0, 1]. This allocates a value 𝑝(𝜔) for
each possible outcome 𝜔, which we interpret as the probability that 𝜔 occurs.
We need
∑_{𝜔∈Ω} 𝑝(𝜔) = 1,
then we may calculate the probability of an event 𝐴 ⊆ Ω via
P(𝐴) = ∑_{𝜔∈𝐴} 𝑝(𝜔).
Caution: this doesn’t always work with infinite sample spaces, but we won’t
consider this further.
Notice that:
1. P(𝐴) ≥ 0 for all events 𝐴,
2. if 𝐴1 , 𝐴2 , 𝐴3 , … are events with 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 ≠ 𝑗, then P(∪_{𝑖=1}^{∞} 𝐴𝑖 ) = ∑_{𝑖=1}^{∞} P(𝐴𝑖 ), i.e. P is countably additive,
3. P(Ω) = 1.
As the outcome of an experiment corresponds to a sample point 𝜔 ∈ Ω, we
can make some numerical measurements whose values depend on the outcome
𝜔. This gives rise to a random variable, let’s call it 𝑋, which is a function
𝑋 ∶ Ω → R. Its value at a sample point, 𝑋(𝜔), represents the numerical value
of the measurement when the outcome of the experiment is 𝜔.
In our set-up, there are different random variables you could define. Let’s define
𝑋 to be the number of heads in the two tosses of a coin. In this case 𝑋 maps
Ω to 0 (if we see TT), 1 (if we see TH or HT) or 2 (if we see HH).
𝑋(𝑇 𝑇 ) = 0
𝑋(𝑇 𝐻) = 𝑋(𝐻𝑇 ) = 1
𝑋(𝐻𝐻) = 2
All outcomes are equally likely here, and so 𝑝(𝜔) = 1/4 for all outcomes 𝜔. If we define our event 𝐴 = “two heads”, then since the only sample point in 𝐴 is HH,
P(𝐴) = ∑_{𝜔∈𝐴} 𝑝(𝜔) = 𝑝(𝐻𝐻) = 1/4,
and we say that P(𝐴) = 1/4.
Similarly, we can also define another event 𝐵 =“at least one tail”, and this is
given by P(𝑋 ≤ 1), which we can calculate as:
P(𝐵) = P(𝑋 ≤ 1) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 1})    (3.1)
Which 𝜔 satisfy (3.1) in this case? Now you should be able to calculate the
probability required.
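One way to check your answer is to enumerate the sample space directly. The sketch below (an illustration, with names of my own choosing) computes P(𝐵) = P(𝑋 ≤ 1) by summing the weight function over the relevant sample points:

```python
from itertools import product
from fractions import Fraction

# Sample space for two fair coin tosses: Omega = {HH, HT, TH, TT},
# with weight p(omega) = 1/4 for every omega.
omega_space = ["".join(t) for t in product("HT", repeat=2)]
p = {w: Fraction(1, 4) for w in omega_space}

def X(w):
    return w.count("H")  # X(omega) = number of heads

# P(B) = P(X <= 1) = sum of p(omega) over {omega : X(omega) <= 1}
prob_B = sum(p[w] for w in omega_space if X(w) <= 1)
```

Three sample points (HT, TH and TT) satisfy 𝑋(𝜔) ≤ 1, so prob_B works out to 3/4.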
Note: we have defined the distribution function of 𝑋 above. In general, the
distribution function of 𝑋 of a random variable is given, for 𝑥 ∈ R, by:
𝐹𝑋 (𝑥) = P(𝑋 ≤ 𝑥) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ≤ 𝑥}).
Discrete random variables
𝑋 is a discrete random variable if it takes only finitely many or countably
infinitely many values.
The probability mass function of a discrete random variable is
𝑝𝑋 (𝑥) = P(𝑋 = 𝑥) = P({𝜔 ∶ 𝑋(𝜔) = 𝑥}).
The expectation of 𝑋 is
E(𝑋) = ∑_𝑥 𝑥 P(𝑋 = 𝑥).
For a function 𝑔 we have
E(𝑔(𝑋)) = ∑_𝑥 𝑔(𝑥) P(𝑋 = 𝑥).
The variance of 𝑋 is
Var(𝑋) = E[(𝑋 − E(𝑋))2 ] = E(𝑋 2 ) − [E(𝑋)]2 .
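For a concrete check of these formulae, the sketch below (an illustration only) computes E(𝑋), E(𝑋 2 ) and Var(𝑋) for the number of heads in two fair tosses, whose pmf is 1/4, 1/2, 1/4:

```python
from fractions import Fraction

# pmf of X = number of heads in two fair coin tosses
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

EX = sum(x * p for x, p in pmf.items())      # E(X)   = sum_x x P(X = x)
EX2 = sum(x**2 * p for x, p in pmf.items())  # E(X^2), taking g(x) = x^2
var = EX2 - EX**2                            # Var(X) = E(X^2) - [E(X)]^2
```

Here E(𝑋) = 1 and Var(𝑋) = 3/2 − 1 = 1/2.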
Continuous random variables
𝑋 is a continuous random variable if it has a probability density function, i.e., if there exists a non-negative function 𝑓𝑋 with ∫_{−∞}^{∞} 𝑓𝑋 (𝑥)𝑑𝑥 = 1, such that for all 𝑥,
𝐹𝑋 (𝑥) = ∫_{−∞}^{𝑥} 𝑓𝑋 (𝑢)𝑑𝑢 .
Where 𝐹𝑋 is differentiable,
𝑓𝑋 (𝑥) = 𝑑𝐹𝑋 (𝑥)/𝑑𝑥 .
The expectation of 𝑋 is
E(𝑋) = ∫_{−∞}^{∞} 𝑥 𝑓𝑋 (𝑥)𝑑𝑥
and the expectation of 𝑔(𝑋), for some function 𝑔, is
E(𝑔(𝑋)) = ∫_{−∞}^{∞} 𝑔(𝑥) 𝑓𝑋 (𝑥)𝑑𝑥 ,
provided that these integrals exist. As before, the variance of 𝑋 is
Var(𝑋) = E[(𝑋 − E(𝑋))2 ] = E(𝑋 2 ) − [E(𝑋)]2 .
Important distributions
There are several probability distributions which we will use repeatedly during
the course. Make sure you are very familiar with the following.
Discrete probability distributions:

Distribution | Values taken | P(𝑋 = 𝑘)
𝑋 ∼ Bin(𝑛, 𝑝) | 0, 1, …, 𝑛 | [𝑛!/(𝑘!(𝑛 − 𝑘)!)] 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘}
𝑋 ∼ Po(𝜆), 𝜆 > 0 | 0, 1, 2, 3, … | exp(−𝜆) 𝜆^𝑘 /𝑘!
𝑋 ∼ Geom(𝑝) | 1, 2, 3, 4, … | (1 − 𝑝)^{𝑘−1} 𝑝

Continuous probability distributions:

Distribution | Support | Density 𝑓𝑋 (𝑥) | Mean | Variance
𝑋 ∼ Unif(𝑎, 𝑏) | [𝑎, 𝑏] | 1/(𝑏 − 𝑎) for 𝑎 < 𝑥 < 𝑏, 0 otherwise | (𝑎 + 𝑏)/2 | (𝑏 − 𝑎)2 /12
𝑋 ∼ Exp(𝜆), 𝜆 > 0 | 𝑥 > 0 | 𝜆 exp(−𝜆𝑥) | 1/𝜆 | 1/𝜆2
𝑋 ∼ Gamma(𝛼, 𝛽), 𝛼, 𝛽 > 0 | 𝑥 > 0 | 𝛽^𝛼 𝑥^{𝛼−1} exp(−𝛽𝑥)/Γ(𝛼) | 𝛼/𝛽 | 𝛼/𝛽2
3.3 Conditioning on events
Calculating probabilities that are conditional on a particular event or occurrence is a natural requirement. Make sure you understand the rationale behind
conditional probability, don’t just learn the formulae!
Easy exercise
You have a deck of playing cards (52 cards in total, split equally into four suits:
hearts, diamonds, spades and clubs, with the first two considered ‘red’ cards,
and the others considered ‘black’ cards).
• I pick a card at random. What is the probability that the chosen card is
from the diamond suit?
• I pick a card at random and tell you that it is a red card (i.e. either hearts
or diamonds). What is the probability that the chosen card is from the
diamond suit?
The second question requires a conditional probability (it is conditional on knowing that the card is red), but you most likely calculated the probability intuitively without using any formulae. How did you do that? You applied Bayes’s
theorem intuitively, without even realising it.
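You can mimic the counting argument you just used in code: conditioning on ‘red’ amounts to shrinking the sample space to the red cards only. A small sketch (the deck representation is my own choosing):

```python
# The 52-card deck as (rank, suit) pairs.
suits = ["hearts", "diamonds", "spades", "clubs"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]

red = [c for c in deck if c[1] in ("hearts", "diamonds")]
diamonds = [c for c in deck if c[1] == "diamonds"]

p_diamond = len(diamonds) / len(deck)            # unconditional: 13/52 = 1/4
# Conditioning on 'red' replaces the sample space by the 26 red cards:
p_diamond_given_red = len(diamonds) / len(red)   # 13/26 = 1/2
```

Dividing by the size of the red sub-deck is exactly the ‘scale by P(𝐶)’ step discussed below.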
Intuition behind conditional probability
In fact, Bayes’s theorem for calculating conditional probabilities is perfectly
intuitive. Suppose we are calculating P(𝐴|𝐶) for some events 𝐴 and 𝐶:
• This is asking for the probability of 𝐴 occurring, given that 𝐶 occurs.
• Assuming 𝐶 occurs, the part of the sample space Ω which relates to 𝐶 becomes the ‘new’ sample space, Ω′ (which is just the event 𝐶 in our case).
• What we are asking here, in effect, is to calculate the probability of 𝐴 in the ‘new’ sample space Ω′ (i.e. P(𝐴 ∩ 𝐶)), scaled to Ω′ (i.e. divided by P(𝐶)).
• The final equation is therefore P(𝐴|𝐶) = P(𝐴 ∩ 𝐶)/P(𝐶).
But why scale by P(𝐶)? Simply because the probability of all disjoint events
in any sample space must sum to 1. Our sample space is now Ω′ = 𝐶, so the
new (probability) weight function we apply to all events in Ω′ must sum to 1.
We scale all probabilities by P(𝐶) to ensure that this is the case.
The following graphic may help you to visualise why Bayes’s theorem ‘works’.
Figure 3.5: A schematic of conditional probability
Formal version of conditional probability
Let 𝐴 and 𝐶 be events with P(𝐶) > 0. The conditional probability of 𝐴
given 𝐶 is
P(𝐴 | 𝐶) = P(𝐴 ∩ 𝐶)/P(𝐶) .
It is easy to verify that
1. P(𝐴 | 𝐶) ≥ 0 ,
2. if 𝐴1 , 𝐴2 , … are mutually exclusive events, then
P(𝐴1 ∪ 𝐴2 ∪ … | 𝐶) = ∑_𝑖 P(𝐴𝑖 | 𝐶),
(try drawing diagrams like the above to visualise this)
3. P(𝐶 | 𝐶) = 1.
Conditional probability for random variables, instead of events, works in the same way.
For discrete RVs 𝑋 and 𝑌 :
• Joint probability mass function: 𝑝_{𝑋,𝑌} (𝑥, 𝑦) = P(𝑋 = 𝑥, 𝑌 = 𝑦).
• 𝑋 and 𝑌 are independent if and only if, for all 𝑥, 𝑦, 𝑝_{𝑋,𝑌} (𝑥, 𝑦) = 𝑝𝑋 (𝑥)𝑝𝑌 (𝑦) = P(𝑋 = 𝑥)P(𝑌 = 𝑦).
• Marginal probability mass function (of 𝑋): 𝑝𝑋 (𝑥) = P(𝑋 = 𝑥) = ∑_𝑦 P(𝑋 = 𝑥, 𝑌 = 𝑦).
• Conditional PMF of 𝑋 given 𝑌 = 𝑦: 𝑝_{𝑋∣𝑌 =𝑦} (𝑥) = P(𝑋 = 𝑥 | 𝑌 = 𝑦) = P(𝑋 = 𝑥, 𝑌 = 𝑦)/P(𝑌 = 𝑦).
• Conditional distribution function of 𝑋 given 𝑌 = 𝑦: 𝐹_{𝑋∣𝑌 =𝑦} (𝑥) = P(𝑋 ≤ 𝑥 ∣ 𝑌 = 𝑦).

For continuous RVs 𝑋 and 𝑌 :
• Joint density function: 𝑓_{𝑋,𝑌} (𝑥, 𝑦).
• 𝑋 and 𝑌 are independent if and only if, for all 𝑥, 𝑦, 𝑓_{𝑋,𝑌} (𝑥, 𝑦) = 𝑓𝑋 (𝑥)𝑓𝑌 (𝑦).
• Marginal density (of 𝑌 ): 𝑓𝑌 (𝑦) = ∫_{−∞}^{∞} 𝑓_{𝑋,𝑌} (𝑥, 𝑦)𝑑𝑥.
• Conditional density of 𝑋 given 𝑌 = 𝑦 (where 𝑓𝑌 (𝑦) > 0): 𝑓_{𝑋∣𝑌 =𝑦} (𝑥) = 𝑓_{𝑋,𝑌} (𝑥, 𝑦)/𝑓𝑌 (𝑦).
• Conditional distribution function of 𝑋 given 𝑌 = 𝑦: 𝐹_{𝑋∣𝑌 =𝑦} (𝑥) = P(𝑋 ≤ 𝑥 ∣ 𝑌 = 𝑦).
Conditional expectation
Just like calculating an expectation of, say, 𝑋, requires knowing the distribution
of 𝑋, the conditional expectation of 𝑋, given that we know 𝑌 = 𝑦 requires
knowing the distribution of (𝑋|𝑌 = 𝑦).
The conditional expectation of 𝑋 given 𝑌 = 𝑦 is
E(𝑋 | 𝑌 = 𝑦) = ∑_𝑥 𝑥 P(𝑋 = 𝑥 ∣ 𝑌 = 𝑦) for discrete RVs, and
E(𝑋 | 𝑌 = 𝑦) = ∫_{−∞}^{∞} 𝑥 𝑓_{𝑋∣𝑌 =𝑦} (𝑥)𝑑𝑥 for continuous RVs.

The conditional expectation of 𝑔(𝑋) given 𝑌 = 𝑦, for some function 𝑔, is

E(𝑔(𝑋) | 𝑌 = 𝑦) = ∑_𝑥 𝑔(𝑥) P(𝑋 = 𝑥 ∣ 𝑌 = 𝑦) for discrete RVs, and
E(𝑔(𝑋) | 𝑌 = 𝑦) = ∫_{−∞}^{∞} 𝑔(𝑥) 𝑓_{𝑋∣𝑌 =𝑦} (𝑥)𝑑𝑥 for continuous RVs.
Example 3.1. Suppose that Ω = {𝜔1 , 𝜔2 , 𝜔3 } and P(𝜔𝑖 ) = 1/3 for 𝑖 = 1, 2, 3.
Suppose also that 𝑋 and 𝑌 are random variables with
𝑋(𝜔1 ) = 2, 𝑋(𝜔2 ) = 3, 𝑋(𝜔3 ) = 1
𝑌 (𝜔1 ) = 2, 𝑌 (𝜔2 ) = 2, 𝑌 (𝜔3 ) = 1
Find the conditional pmf 𝑝_{𝑋∣𝑌 =2} (𝑥) and the conditional expectation E(𝑋 | 𝑌 = 2).
Possible values of 𝑋 are 1, 2, 3.
𝑝𝑋∣𝑌 =2 (1) = P(𝑋 = 1 ∣ 𝑌 = 2) = 0.
𝑝_{𝑋∣𝑌 =2} (2) = P(𝑋 = 2 ∣ 𝑌 = 2) = P(𝑋 = 2, 𝑌 = 2)/P(𝑌 = 2) = P(𝜔1 )/[P(𝜔1 ) + P(𝜔2 )] = 1/2 .
𝑝_{𝑋∣𝑌 =2} (3) = P(𝑋 = 3 ∣ 𝑌 = 2) = P(𝑋 = 3, 𝑌 = 2)/P(𝑌 = 2) = P(𝜔2 )/[P(𝜔1 ) + P(𝜔2 )] = 1/2 .
Thus the conditional pmf of 𝑋 given 𝑌 = 2 equals 1/2 for 𝑥 = 2 and 1/2 for 𝑥 = 3, and is 0 otherwise. The conditional expectation is
E(𝑋 | 𝑌 = 2) = ∑ 𝑥 P(𝑋 = 𝑥 ∣ 𝑌 = 2) = (1 × 0) + (2 × 1/2) + (3 × 1/2) = 5/2 .
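Example 3.1 can also be checked by brute-force enumeration over the three sample points (a sketch for illustration; variable names are my own):

```python
from fractions import Fraction

# Sample points w1, w2, w3, each with probability 1/3, and the values X, Y assign.
p_omega = Fraction(1, 3)
X = {"w1": 2, "w2": 3, "w3": 1}
Y = {"w1": 2, "w2": 2, "w3": 1}

# P(Y = 2) = sum of p(omega) over sample points with Y(omega) = 2
omegas_Y2 = [w for w in Y if Y[w] == 2]
p_Y2 = p_omega * len(omegas_Y2)

# p_{X|Y=2}(x) = P(X = x, Y = 2) / P(Y = 2)
cond_pmf = {}
for w in omegas_Y2:
    cond_pmf[X[w]] = cond_pmf.get(X[w], Fraction(0)) + p_omega / p_Y2

cond_exp = sum(x * q for x, q in cond_pmf.items())  # E(X | Y = 2)
```

This reproduces 𝑝_{𝑋∣𝑌 =2} (2) = 𝑝_{𝑋∣𝑌 =2} (3) = 1/2 and E(𝑋 | 𝑌 = 2) = 5/2.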
So far, we have only calculated conditional expectations when conditioning on
specific values, e.g. E[𝑋|𝑌 = 2]. We can extend this idea to conditioning on
random variables without equating them to specific values.
The notation E(𝑋 | 𝑌 ) is used to denote a random variable that takes the
value E(𝑋 | 𝑌 = 𝑦) with probability P(𝑌 = 𝑦).
Note that E(𝑋 | 𝑌 ) is a function of the random variable 𝑌 , and is itself a
random variable.
Example 3.2. In Example 3.1 we found that E[𝑋 | 𝑌 = 2] = 5/2. It can similarly be shown that E[𝑋 | 𝑌 = 1] = 1. You can also check that P(𝑌 = 1) = 1/3 and P(𝑌 = 2) = 2/3.
Thus, the random variable E[𝑋 | 𝑌 ] has two possible values, 1 and 5/2, with probabilities 1/3 and 2/3 respectively. That is,

E[𝑋 | 𝑌 ] = 1 with probability 1/3,
E[𝑋 | 𝑌 ] = 5/2 with probability 2/3.
Useful conditioning formulae
Three important formulae:
Law of total probability. Let 𝐴 be an event and 𝑌 be ANY random variable. Then

P(𝐴) = ∑_𝑦 P(𝐴 | 𝑌 = 𝑦) P(𝑌 = 𝑦) if 𝑌 is discrete,
P(𝐴) = ∫_{−∞}^{∞} P(𝐴 | 𝑌 = 𝑦) 𝑓𝑌 (𝑦)𝑑𝑦 if 𝑌 is continuous.
Modification of the law of total probability: if 𝑍 is a third discrete random variable, then

P(𝑋 = 𝑥 ∣ 𝑍 = 𝑧) = ∑_𝑦 P(𝑋 = 𝑥, 𝑌 = 𝑦 ∣ 𝑍 = 𝑧)
= ∑_𝑦 P(𝑋 = 𝑥, 𝑌 = 𝑦, 𝑍 = 𝑧)/P(𝑍 = 𝑧)
= ∑_𝑦 [P(𝑋 = 𝑥, 𝑌 = 𝑦, 𝑍 = 𝑧)/P(𝑌 = 𝑦, 𝑍 = 𝑧)] × [P(𝑌 = 𝑦, 𝑍 = 𝑧)/P(𝑍 = 𝑧)]
= ∑_𝑦 P(𝑋 = 𝑥 ∣ 𝑌 = 𝑦, 𝑍 = 𝑧) P(𝑌 = 𝑦 ∣ 𝑍 = 𝑧),

assuming the conditional probabilities are all defined.
Law of conditional (iterated) expectations:

E[𝑋] = E[E(𝑋 | 𝑌 )] = ∑_𝑦 E(𝑋 | 𝑌 = 𝑦) P(𝑌 = 𝑦) if 𝑌 is discrete,
E[𝑋] = E[E(𝑋 | 𝑌 )] = ∫_{−∞}^{∞} E(𝑋 | 𝑌 = 𝑦) 𝑓𝑌 (𝑦)𝑑𝑦 if 𝑌 is continuous,
and, more generally,
E[𝑔(𝑋)] = E(E[𝑔(𝑋) | 𝑌 ])
Note in particular:
• The law of total probability applies to ANY event 𝐴 and ANY random
variable 𝑌 .
• The law of the conditional (iterated) expectation, E[𝑋] = E[E[𝑋|𝑌 ]] applies to ANY random variables 𝑋 and 𝑌 , and we will be using this identity
heavily during the course.
Example 3.3. In Example 3.2 we found the distribution of the random variable E(𝑋 | 𝑌 ) was

E[𝑋 | 𝑌 ] = 1 w.p. 1/3,
E[𝑋 | 𝑌 ] = 5/2 w.p. 2/3.

The expectation of this random variable is

E[E(𝑋 | 𝑌 )] = (1 × 1/3) + (5/2 × 2/3) = 2.
Also, the expectation of 𝑋 is given by

E(𝑋) = (1 × 1/3) + (2 × 1/3) + (3 × 1/3) = 2,
so that we see that the law of the conditional (iterated) expectation, i.e. that
E[𝑋] = E[E[𝑋|𝑌 ]], holds.
Example 3.4. Let 𝑋 and 𝑌 have joint density

𝑓_{𝑋,𝑌} (𝑥, 𝑦) = (1/𝑦) 𝑒^{−𝑥/𝑦} 𝑒^{−𝑦} , 0 < 𝑥, 𝑦 < ∞ .

Find 𝑓𝑌 (𝑦), E(𝑋 | 𝑌 ) and hence E(𝑋).

Solution. For any 𝑦 > 0,

𝑓𝑌 (𝑦) = ∫_{0}^{∞} (1/𝑦) 𝑒^{−𝑥/𝑦} 𝑒^{−𝑦} 𝑑𝑥 = 𝑒^{−𝑦} [−𝑒^{−𝑥/𝑦} ]_{0}^{∞} = 𝑒^{−𝑦} ,

so 𝑌 has an exponential distribution with parameter 1.

Also, for any fixed 𝑦 > 0 and any 𝑥 > 0,

𝑓_{𝑋∣𝑌} (𝑥 | 𝑦) = 𝑓_{𝑋,𝑌} (𝑥, 𝑦)/𝑓𝑌 (𝑦) = (1/𝑦) 𝑒^{−𝑥/𝑦} 𝑒^{−𝑦} / 𝑒^{−𝑦} = (1/𝑦) 𝑒^{−𝑥/𝑦} ,

so 𝑋 | 𝑌 = 𝑦 ∼ Exp(1/𝑦).
It follows that E(𝑋 | 𝑌 = 𝑦) = 𝑦 and so E(𝑋 | 𝑌 ) = 𝑌 . Using the Law of
Conditional Expectations we then have
E[𝑋] = E[E(𝑋 | 𝑌 )] = E(𝑌 ) = 1 .
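Example 3.4 lends itself to a Monte Carlo sanity check: simulate 𝑌 ∼ Exp(1), then draw 𝑋 given 𝑌 = 𝑦 from Exp(1/𝑦) (which has mean 𝑦), and the sample mean of 𝑋 should be close to E(𝑌 ) = 1. A sketch (the seed and sample size are arbitrary choices of mine):

```python
import random

rng = random.Random(2024)
n = 200_000
total = 0.0
for _ in range(n):
    y = rng.expovariate(1.0)       # Y ~ Exp(1), so E(Y) = 1
    x = rng.expovariate(1.0 / y)   # X | Y = y ~ Exp(1/y), so E(X | Y = y) = y
    total += x
estimate = total / n               # should be close to E[X] = E[E(X | Y)] = 1
```

Note that random.expovariate takes the rate, so a rate of 1/𝑦 gives mean 𝑦.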
3.4 Generating functions
Generating functions are USEFUL. Don’t think of them as baffling mathematical constructs; instead think of them as marvellous time-saving devices.
Imagine you have a sequence of real numbers, 𝑧1 , 𝑧2 , 𝑧3 , …. It can be hard to
keep track of such numbers and all the information they contain. A generating
function provides a way of wrapping up the entire sequence into one expression.
When we want to extract information about the sequence, we simply interrogate
this one expression in a particular way in order to get what we want.
There are many different types of generating function. These are different in
the sense that we wrap up the sequence in different ways, and so extracting
particular pieces of information from them requires different techniques. The
two types of generating function that you should be familiar with are:
1. The probability generating function (PGF);
2. The moment generating function (MGF).
Probability generating functions
The probability generating function (p.g.f.) of 𝑋 is defined to be

𝐺𝑋 (𝑠) = E(𝑠^𝑋 ) = ∑_{𝑘=0}^{∞} 𝑠^𝑘 P(𝑋 = 𝑘) for |𝑠| ≤ 1.
Probability generating function fundamentals (the row labels below follow the original table; the ‘generating function’ label is inferred):

The random variable | Takes values in N0 = {0, 1, 2, …}
Sequence to ‘wrap up’ | P(𝑋 = 0), P(𝑋 = 1), P(𝑋 = 2), P(𝑋 = 3), …
The generating function | ∑_{𝑘=0}^{∞} 𝑠^𝑘 P(𝑋 = 𝑘) (for any 𝑠 ∈ [−1, 1])
Information stored | The probabilities P(𝑋 = 0), P(𝑋 = 1), P(𝑋 = 2), … and some moments (expectations)
The information stored in probability generating functions can be accessed as follows.

Calculating moments

E(X) = dG_X(s)/ds |_{s=1} = G'_X(1),

E[X(X − 1)] = d²G_X(s)/ds² |_{s=1} = G''_X(1), and so on.
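These derivative formulas can be exercised symbolically. A small sympy sketch, using the Binomial(n, p) PGF G(s) = (1 − p + ps)^n (a standard result, assumed here rather than taken from the notes):

```python
import sympy as sp

# Moments from a PGF: E(X) = G'(1) and E[X(X-1)] = G''(1),
# illustrated with the Binomial(n, p) PGF G(s) = (1 - p + p s)^n.
s = sp.symbols('s')
n, p = 4, sp.Rational(3, 10)
G = (1 - p + p*s)**n

EX = sp.diff(G, s).subs(s, 1)             # G'(1)  = n p
EX_fact2 = sp.diff(G, s, 2).subs(s, 1)    # G''(1) = n(n-1) p^2
var = sp.simplify(EX_fact2 + EX - EX**2)  # Var(X) = n p (1 - p)

print(EX, var)   # 6/5 21/25
```

Both values agree with the known binomial mean np = 6/5 and variance np(1 − p) = 21/25.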
Calculating probabilities
Suppose that X takes values in N0 = {0, 1, 2, …}. Then the PGF evaluated at 0 is (with the convention 0⁰ = 1):

G_X(0) = ∑_{k=0}^∞ 0^k P(X = k) = P(X = 0).
Furthermore, if we expand 𝐺𝑋 (𝑠) in powers of 𝑠 the coefficient of 𝑠𝑘 is equal to
P(𝑋 = 𝑘), so we can also find P(𝑋 = 𝑘) for all 𝑘.
You can also work ‘backwards’: given a PGF for a random variable 𝑋, what is
the distribution of 𝑋?
Example 3.5. Suppose you are told that G_X(s) = ps/[1 − (1 − p)s]. We
can convert this expression into the summation format of a PGF and spot its
distribution:

G_X(s) = ps / (1 − (1 − p)s) = ps ∑_{k=0}^∞ (1 − p)^k s^k = ∑_{j=1}^∞ s^j (1 − p)^{j−1} p,

where j = k + 1.
Therefore, P(𝑋 = 𝑗) = (1 − 𝑝)𝑗−1 𝑝, so 𝑋 ∼ Geometric(𝑝).
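The series expansion in Example 3.5 can be checked mechanically with sympy (the choice p = 1/3 is arbitrary):

```python
import sympy as sp

# Expanding G(s) = ps / (1 - (1-p)s) in powers of s recovers the
# Geometric(p) probabilities: the coefficient of s^j is (1-p)^(j-1) p.
s = sp.symbols('s')
p = sp.Rational(1, 3)
G = p*s / (1 - (1 - p)*s)

series = sp.series(G, s, 0, 6).removeO()
for j in range(1, 6):
    print(j, series.coeff(s, j))   # matches (1-p)^(j-1) * p
```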
In fact, different distributions have different PGF formats but random variables
with the same distribution have a PGF of the same format. You can therefore
spot the distribution of a random variable instantly from the format of its PGF.
Moment generating functions
The moment generating function (m.g.f.) of X is defined to be

M_X(t) = E[exp(tX)] = ∫ exp(tx) f_X(x) dx

for any t ∈ R such that the integral converges.
Moment generating function fundamentals

• Random variable: must be real valued.
• Quantity to 'wrap up': the density f_X(x).
• Information stored: ∫_R exp(tx) f_X(x) dx (for any t ∈ R such that the integral converges).
We can access all moments of X from M_X(t):

E[X^n] = M_X^{(n)}(0) = d^n M_X(t)/dt^n |_{t=0}.
As with PGFs, you can spot the distribution of a random variable instantly from the format of its MGF.

Notice that a discrete or continuous random variable can have an MGF, whereas a PGF is for discrete random variables only (i.e. those with countable state space).
What else can we do with generating functions?
We’ll concentrate here on (useful) things you can do with probability generating functions, though similar conclusions can be reached using moment generating functions too.
Calculating the distribution of sums of random variables.
Suppose that 𝑋1 , … , 𝑋𝑛 are i.i.d. random variables with common PGF 𝐺𝑋 (𝑠).
The PGF 𝐺𝑌 (𝑠) of 𝑌 = 𝑋1 + ⋯ + 𝑋𝑛 is given by
G_Y(s) = E(s^{X_1 + ⋯ + X_n})
= ∏_{i=1}^n E(s^{X_i})  (by independence)
= [G_X(s)]^n,
and by looking at [G_X(s)]^n, we can spot the distribution of Y, which may
otherwise be difficult. Also, if Z_1 ⟂⟂ Z_2 (that is, Z_1 and Z_2 are independent but
do not necessarily have the same distribution) then

G_{Z_1+Z_2}(s) = G_{Z_1}(s) G_{Z_2}(s).

By looking at the PGF of Z_1 + Z_2, G_{Z_1+Z_2}(s), we can then deduce the
distribution of Z_1 + Z_2. This can be an easier strategy than trying to deduce its
distribution directly.
Example 3.6. 𝑋 and 𝑌 are both Poisson random variables, with parameters
𝜆1 and 𝜆2 , respectively. Assume that 𝑋 and 𝑌 are also independent of each
other. What is the distribution of 𝑋 + 𝑌 ?
Solution: Use probability generating functions!
The PGF of a Poisson random variable with parameter 𝜆 is
𝐺(𝑠) = 𝑒𝜆(𝑠−1) .
Therefore, the PGF of 𝑋 + 𝑌 is
𝐺𝑋+𝑌 (𝑠) = 𝑒𝜆1 (𝑠−1) 𝑒𝜆2 (𝑠−1) = 𝑒(𝜆1 +𝜆2 )(𝑠−1) .
We recognise the format of this PGF as the PGF of a Poisson distribution with
parameter (𝜆1 + 𝜆2 ). Therefore,
𝑋 + 𝑌 ∼ Poisson(𝜆1 + 𝜆2 )
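The algebra behind Example 3.6 is just adding exponents; a short sympy sketch makes the step explicit:

```python
import sympy as sp

# PGF check for Example 3.6: multiplying the Poisson PGFs
# exp(lam1 (s-1)) and exp(lam2 (s-1)) adds the exponents, giving the
# Poisson(lam1 + lam2) PGF exp((lam1 + lam2)(s-1)).
s, lam1, lam2 = sp.symbols('s lambda1 lambda2')
G_X = sp.exp(lam1*(s - 1))
G_Y = sp.exp(lam2*(s - 1))
G_sum = sp.exp(sp.expand((lam1 + lam2)*(s - 1)))

product = sp.powsimp(G_X * G_Y)               # combine the exp factors
product = sp.exp(sp.expand(product.args[0]))  # put the exponent in expanded form
print(product == G_sum)                       # True
```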
Calculating the PGF of a random sum
That is, the sum of a random number of random variables. Suppose that
𝑋1 , … , 𝑋𝑁 are i.i.d. random variables with common PGF 𝐺𝑋 (𝑠) and that 𝑁
has PGF 𝐺𝑁 (𝑠).
The PGF 𝐺𝑌 (𝑠) of 𝑌 = 𝑋1 + ⋯ + 𝑋𝑁 is given by
𝐺𝑌 (𝑠) = E[E (𝑠𝑌 |𝑁 ) ].
We’ll see this technique of conditioning inside an expectation again later (you have already seen it if you studied STAT0005), and it will prove to be a very useful
technique in solving problems. Using this, we can deduce that
E[E (𝑠𝑌 |𝑁 ) ] = E[[𝐺𝑋 (𝑠)]𝑁 ] = 𝐺𝑁 [𝐺𝑋 (𝑠)].
Similarly, it can be shown (challenge!) that the m.g.f. M_Y(t) of Y is given by

M_Y(t) = E[exp(tY)] = G_N[M_X(t)].
Example 3.7. The number of people who enter a particular bookstore on
Gower Street, per day, follows a Poisson distribution with parameter 300. Of
those who visit, 60% will make a purchase. What is the distribution of the
number of people who make a purchase at the bookstore in one day?
Solution: Use probability generating functions!
Notice that whether a visitor makes a purchase is a Bernoulli trial, with probability of success 0.6. Let
X_i = { 1 if the ith customer makes a purchase, 0 otherwise,
so that we are looking to find the distribution of 𝑌 = 𝑋1 + … + 𝑋𝑁 where
𝑁 ∼ Poisson(300). Notice that the PGF of the Bernoulli random variable 𝑋 is
𝐺𝑋 (𝑠) = (0.4 + 0.6𝑠), and the PGF of 𝑁 is 𝑒300(𝑠−1) .
By the result above, we know that the PGF of Y is G_Y(s) = G_N(G_X(s)). Think
of this as using G_X(s) in place of s in the PGF of N:
𝐺𝑌 (𝑠) = 𝐺𝑁 (𝐺𝑋 (𝑠))
= 𝑒300(0.4+0.6𝑠−1) = 𝑒300(0.6𝑠−0.6) = 𝑒180(𝑠−1)
We recognise this as the PGF of a Poisson random variable with mean 180.
Therefore, the distribution of those who make a purchase is Poisson(180).
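Example 3.7 is a Poisson 'thinning' result, and it can be sanity-checked by simulation; a minimal numpy sketch (trial count and seed are arbitrary choices):

```python
import numpy as np

# Simulation check of Example 3.7: thin a Poisson(300) visitor count with
# independent Bernoulli(0.6) purchase decisions; the number of buyers
# should look Poisson(180), so its mean and variance are both near 180.
rng = np.random.default_rng(1)
trials = 100_000
n_visitors = rng.poisson(300, size=trials)   # N ~ Poisson(300)
n_buyers = rng.binomial(n_visitors, 0.6)     # Y | N ~ Binomial(N, 0.6)

print(n_buyers.mean(), n_buyers.var())       # both close to 180
```

Equal sample mean and variance is the tell-tale Poisson signature.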
Part IV
Week 2
Chapter 4
Introduction to discrete
time Markov processes
4.1 The Markov property

A stochastic process {X_t, t ∈ T} is a Markov process if, for any sequence of
times 0 ≤ t_0 < t_1 < ⋯ < t_n < t_{n+1} and any n ≥ 0,

P(X_{t_{n+1}} = j | X_{t_n} = i_n, ⋯, X_{t_0} = i_0) = P(X_{t_{n+1}} = j | X_{t_n} = i_n),

for any j, i_0, …, i_n in S.

4.1.1 The Markov property - an intuitive explanation

The Markov property states that the value (or state) of the process at time t_{n+1} depends only on its value (state) at time t_n, and not on its behaviour previous to time t_n. That is, the probability of any future event depends only on the most recent information we have about the process.

Let

• X_n be the present state;
• A be a past event (involving X_0, …, X_{n−1});
• B be a future event (involving X_{n+1}, X_{n+2}, …).

The Markov property (MP) states that given the present state (X_n), the future B is conditionally independent of the past A, denoted B ⟂⟂ A | X_n. Therefore,

P(A ∩ B | X_n = i) = P(A | X_n = i) P(B | X_n = i).

Similarly, on applying the Markov property,

P(B | X_n = i, A) = P(B | X_n = i),

for any n ≥ 0, any i and any A and B. Notice here that we can use the Markov property because the most recent information about the chain, its location at time n, is given exactly as X_n = i.

4.1.2 An important note about the Markov property

Think carefully about the definition of the Markov property. It states that

P(X_{t_{n+1}} = j | X_{t_n} = i_n, ⋯, X_{t_0} = i_0) = P(X_{t_{n+1}} = j | X_{t_n} = i_n).

We can only use the Markov property if the most recent information gives us specific information about the location of the process. We could not apply the Markov property if, for example, we wanted to simplify

P(X_{t_{n+1}} = j | X_{t_n} = i_n or j_n, ⋯, X_{t_0} = i_0),

since the most recent information (the location of the process at time t_n) does not tell us where exactly the Markov chain is at time t_n.

4.1.3 Why is the Markov property useful?

We’ll be using the Markov property repeatedly throughout the course, so make sure you are familiar with it! The Markov property is a strong independence assumption. It is useful because it simplifies probability calculations.
It enables joint probabilities to be expressed as a product of simple conditional probabilities.

Example 4.1. Suppose that X is a (discrete time) Markov process taking values 0, 1 and 2. Write P(X_0 = 1, X_1 = 2, X_2 = 0) as a product of conditional probabilities, simplifying your answer as much as possible, justifying each step of your argument.

We first use Bayes’s theorem:

P(X_0 = 1, X_1 = 2, X_2 = 0) = P(X_2 = 0 | X_1 = 2, X_0 = 1) P(X_1 = 2, X_0 = 1)
= P(X_2 = 0 | X_1 = 2, X_0 = 1) P(X_1 = 2 | X_0 = 1) P(X_0 = 1).

Notice that we can simplify the first expression on the right using the Markov property:

P(X_2 = 0 | X_1 = 2, X_0 = 1) P(X_1 = 2 | X_0 = 1) P(X_0 = 1)
= P(X_2 = 0 | X_1 = 2) P(X_1 = 2 | X_0 = 1) P(X_0 = 1).

Therefore,

P(X_0 = 1, X_1 = 2, X_2 = 0) = P(X_2 = 0 | X_1 = 2) P(X_1 = 2 | X_0 = 1) P(X_0 = 1).

4.2 Introduction to discrete time Markov chains

Recall: a discrete-time Markov chain, often abbreviated to Markov chain, is a sequence of random variables X_0, X_1, X_2, … taking values in a finite or countable state space S such that, for all n, i, j, i_0, i_1, …, i_{n−1},

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, …, X_0 = i_0) = P(X_{n+1} = j | X_n = i),

that is, a discrete-time stochastic process satisfying the Markov property.

4.2.1 Time homogeneity and transition probabilities

A Markov chain is time-homogeneous if P(X_{n+1} = j | X_n = i) does not depend on n for all i, j ∈ S. For a time homogeneous Markov chain, we have

P(X_{n+1} = j | X_n = i) = P(X_{m+1} = j | X_m = i)

even if m ≠ n. In effect, this allows us to forget about when the chain moved from i to j; all that matters is that this occurred in one time step.

When a chain is time homogeneous, we also have that for any integer r ≥ 1,

P(X_{n+r} = j | X_n = i) = P(X_{m+r} = j | X_m = i).

That is, the probability of moving from state i to state j in r time steps is the same regardless of when this happened.

In this course, we will assume that all discrete time Markov processes are time homogeneous unless otherwise stated.
The probabilities

p_ij = P(X_{n+1} = j | X_n = i), for i, j ∈ S,

are called the (1-step) transition probabilities of the Markov chain. These are very important and will help us to solve many interesting questions about the behaviour of Markov chains. Notice that p_ij does not depend on n as we are assuming that {X_n; n = 0, 1, 2, …} is time homogeneous.

4.2.2 Transition matrices

The probabilities p_ij are often collected into a matrix P called the transition matrix. The transition probability p_ij is the element of P in the ith row and the jth column, that is, the (i, j)th element of P.

P = ⎛ p_11 p_12 p_13 … p_1n ⎞
    ⎜ p_21 p_22 p_23 … p_2n ⎟
    ⎜ p_31 p_32 p_33 … p_3n ⎟
    ⎜  ⋮    ⋮    ⋮   ⋱   ⋮  ⎟
    ⎝ p_n1 p_n2 p_n3 … p_nn ⎠

Important properties of a transition matrix include:

• All entries are non-negative (and, because they are probabilities, are ≤ 1).
• Each of the rows sums to 1, that is, ∑_j p_ij = 1, for all i ∈ S.

Why must each row sum to 1? Why needn’t each column sum to 1?
4.2.3 The n-step transition probabilities

The n-step transition probabilities are

p_ij^(n) = P(X_n = j | X_0 = i) = P(X_{m+n} = j | X_m = i)

for all m if {X_n; n = 0, 1, 2, …} is time-homogeneous. So p_ij^(n) is the probability that a process in state i will be in state j after n ‘steps’. Note that due to time homogeneity, this does not depend on m.

The n-step transition matrix P^(n) has p_ij^(n) as its (i, j)th element:

P^(n) = ⎛ p_11^(n) p_12^(n) p_13^(n) … p_1n^(n) ⎞
        ⎜ p_21^(n) p_22^(n) p_23^(n) … p_2n^(n) ⎟
        ⎜ p_31^(n) p_32^(n) p_33^(n) … p_3n^(n) ⎟
        ⎜    ⋮        ⋮        ⋮     ⋱    ⋮     ⎟
        ⎝ p_n1^(n) p_n2^(n) p_n3^(n) … p_nn^(n) ⎠

4.2.4 The Chapman-Kolmogorov equations

The Chapman-Kolmogorov (C-K) equations show us how to compute p_ij^(n+m) or P^(n+m) for m ≥ 0, n ≥ 0. These are useful equations which will enable us to calculate these quantities quickly and easily.

Consider time points 0, m and m + n, where m ≥ 0, n ≥ 0, and suppose that we want to compute the probability that we go from state i to state j in (n + m) steps,

p_ij^(n+m) = P(X_{n+m} = j | X_0 = i).

The C-K equations are derived using the simple observation that at time m, we must be in some state (let’s call it k). By summing over all possibilities for k, we derive the C-K equations.

Figure 4.1: Schematic of the C-K equations

By conditioning on the state k at time m we derive p_ij^(m+n):

p_ij^(n+m) = P(X_{m+n} = j | X_0 = i)   (4.1)
= ∑_{k∈S} P(X_{m+n} = j, X_m = k | X_0 = i)   (4.2)
= ∑_{k∈S} P(X_{m+n} = j | X_m = k, X_0 = i) P(X_m = k | X_0 = i)   (4.3)
= ∑_{k∈S} p_ik^(m) p_kj^(n),   (4.4)

where we use a version of Bayes’s theorem to find (4.3), and the Markov property and time homogeneity to reach equation (4.4).

The Chapman-Kolmogorov equations are

p_ij^(n+m) = ∑_{k∈S} p_ik^(m) p_kj^(n).

Note ‘equations’ not ‘equation’ - there is one equation for each (i, j) pair.

Since:

• {p_ik^(m), k ∈ S} is the ith row of P^(m),
• {p_kj^(n), k ∈ S} is the jth column of P^(n),

the Chapman-Kolmogorov equations can be written in matrix form as

P^(m+n) = P^(m) ⋅ P^(n)

where ⋅ denotes matrix multiplication. Note that this single equation encodes all the individual C-K equations above.

This may not seem particularly helpful, but note the following:

• P^(1) is the transition matrix P.
• P^(0) is the identity matrix I.

Using this, note the following:

P^(2) = P^(1+1) = P ⋅ P = P².
P^(3) = P^(1+2) = P ⋅ P^(2) = P³.

Notice also that the last line above could have been equivalently written/computed as

P^(3) = P^(2+1) = P^(2) ⋅ P = P³.

So what does this tell us? By using the C-K equations repeatedly, we find that:

P^(n) = Pⁿ

That is, the n-step transition matrix P^(n) is equal to the nth matrix power of P.
This is enormously helpful - given a Markov chain with (one-step) transition matrix P, we can now compute its n-step transition matrix by simply multiplying P with itself n times.

4.2.5 Marginal probabilities

So far, we have only considered conditional probabilities. However, we may also be interested in marginal probabilities such as P(X_n = j).

Let p_j^(n) = P(X_n = j). If, for example, S = N0 = {0, 1, 2, …}, then

p^(n) = (p_0^(n), p_1^(n), …) = (P(X_n = 0), P(X_n = 1), …)

is a probability row vector (i.e. a row vector with non-negative entries summing to 1) specifying the distribution of X_n.

Do not confuse p_ij^(n) and p_j^(n). The former is a conditional probability, and the latter a marginal probability.

When n = 0, and the state space is S = {0, 1, 2, …}, then

p^(0) = (p_0^(0), p_1^(0), p_2^(0), …) = (P(X_0 = 0), P(X_0 = 1), P(X_0 = 2), …)

is the distribution of X_0. This is the initial distribution: the distribution of states in which we start the Markov chain. This tells us the probability of the chain starting in any particular state.

If we look at p_j^(n), the probability that the chain is in state j at time n, then

p_j^(n) = P(X_n = j) = ∑_i P(X_n = j | X_0 = i) P(X_0 = i) = ∑_i p_i^(0) p_ij^(n).

Equivalently, in matrix notation,

p^(n) = p^(0) ⋅ P^(n) = p^(0) ⋅ Pⁿ.

Thus, the initial distribution, p^(0), and the transition matrix, P, together contain all the probabilistic information about the Markov chain. That is, this is all we need in order to compute or derive any quantity of interest related to the Markov chain. For example,

P(X_0 = i_0, X_1 = i_1, …, X_n = i_n) = p_{i_0}^(0) p_{i_0 i_1} p_{i_1 i_2} ⋯ p_{i_{n−1} i_n}

is a useful identity to compute a joint probability using only the (one-step) transition probabilities and initial distribution.
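The relations P^(n) = Pⁿ and p^(n) = p^(0) Pⁿ translate directly into a few lines of numpy; the 2-state chain below is a made-up illustration, not one from the notes:

```python
import numpy as np

# Chapman-Kolmogorov in practice: the n-step transition matrix is the
# n-th matrix power of P, and the marginal distribution is p(n) = p(0) P^n.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
p0 = np.array([1.0, 0.0])          # start in state 0 with certainty

P3 = np.linalg.matrix_power(P, 3)  # 3-step transition matrix P^(3)
p3 = p0 @ P3                       # distribution of X_3

print(P3)
print(p3)   # rows of P3, and p3 itself, each sum to 1
```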
4.2.6 Duration of stay in a state

Remember that a Markov chain is memoryless in the sense that its future is independent of its past, conditional on its current state (or the most recent information that we have about its location). Therefore, if the chain is currently in state i, the further amount of time that it will remain there does not depend on how long it has already been in state i.

For any Markov chain the duration of stay in state i satisfies

P(stay in state i for n steps | chain is currently in state i)
= P(stay in i for a further n − 1 steps, then leave i)
= p_ii^{n−1} (1 − p_ii).

Therefore, the duration of stay in state i has a geometric distribution with parameter (1 − p_ii), from which we can deduce that the expected amount of time that the process remains in state i is 1/(1 − p_ii).

Where will the Markov chain go when it leaves i? (Note: we are conditioning on leaving state i, and so this isn’t p_ij unless p_ii = 0. If we do have p_ii = 0 then by definition the process will leave state i at the next step!)

For j ≠ i,

P(MC moves to j | MC leaves state i) = P(X_{n+1} = j | X_n = i, X_{n+1} ≠ i)   (4.5)
= P(X_{n+1} = j, X_{n+1} ≠ i | X_n = i) / P(X_{n+1} ≠ i | X_n = i)   (4.6)
= P(X_{n+1} = j | X_n = i) / P(X_{n+1} ≠ i | X_n = i)   (4.7)
= p_ij / (1 − p_ii).   (4.8)

Identity (4.6) comes from using Bayes’s theorem, while identity (4.7) just applies common sense (if we have that X_{n+1} = j then X_{n+1} ≠ i by default!). It’s important to realise that this answer makes sense: the probability is proportional to p_ij and is scaled so that

∑_{j≠i} p_ij / (1 − p_ii) = 1.

To summarise:

• The duration of stay of the Markov chain in state i has a geometric distribution with expected value 1/(1 − p_ii) steps.
• When the chain moves from state i, it goes to state j (≠ i) with probability p_ij / (1 − p_ii).
4.3 The idea of ‘first step decomposition’

The ‘first step decomposition’ (FSD) technique is a very useful method for computing certain expectations and probabilities associated with Markov chains, for example:

1. What’s the probability that the process reaches a particular absorbing state?
2. What’s the probability that the process visits state 1 exactly once before visiting state 3?
3. How long, on average, does it take the process to be absorbed?
4. How many visits to state 1 does the process make, on average, before reaching state 5?
5. How long does it take, on average, to reach a particular state for the first time?
6. What’s the probability that we ever reach/never reach a particular point?

We will be using this method extensively throughout the course. The idea is to use the following results, which you have already met in Section 3.3.4, to simplify calculations drastically.

1. Using the law of conditional (iterated) expectation to find expectations of interest:

E[X] = E[ E[X | Y] ].

2. Using the (modification of the) law of total probability to find probabilities of interest:

P(A) = ∑_y P(A | Y = y) P(Y = y),
P(A | Z = z) = ∑_y P(A | Y = y, Z = z) P(Y = y | Z = z).

In this course, we are mainly interested in applying this technique by conditioning on ‘where does our Markov process go next?’. That is, in the above, we choose Y to be the position of the Markov process at a particular time. This turns out to be incredibly powerful.

This week, we’ll concentrate just on computation of probabilities concerning Markov processes using first step decomposition. Later in the course we’ll use this idea to compute expectations.

4.3.1 FSD and probabilities

Consider the Markov chain that is defined by the following:

Figure 4.2: State-space diagram defining the Markov process

Example 4.2.
The state-space here is S = {1, 2, 3, 4, 5, 6} and suppose that X_0 = 2, so that the initial distribution is p^(0) = (0, 1, 0, 0, 0, 0). Notice that if the process eventually reaches state 1 then it will stay there forever. However, the process may never reach state 1 (e.g. if it reaches state 6 then it can never visit state 1). What is the probability that the chain eventually reaches state 1?

Solution. A handy way of solving this type of problem is to use first step decomposition. The idea is to ask ‘where does the process go next’ and condition on that. This is just the law of total probability, where we decide to condition on the next step.

In our example, X_0 = 2. From there, where can the process go? In other words, what value(s) can X_1 take? We can see from the state-space diagram that X_1 will be either 1, 2, 4 or 5, and each of these possibilities happens with equal probability. Using the law of total probability:

P(reach state 1 | X_0 = 2) = ∑_{i∈S} P(reach state 1 | X_1 = i, X_0 = 2) P(X_1 = i | X_0 = 2).

Now, of course, we can simplify the right hand side by using the Markov property and also use our notation for transition probabilities:

P(reach state 1 | X_0 = 2) = ∑_{i∈S} P(reach state 1 | X_1 = i) p_{2i}.

We also know the possibilities for X_1, so this is

P(reach state 1 | X_1 = 1) p_21 + P(reach state 1 | X_1 = 2) p_22   (4.9)
+ P(reach state 1 | X_1 = 4) p_24 + P(reach state 1 | X_1 = 5) p_25   (4.10)

and substituting the relevant transition probabilities in, we get that this is just:

(1/4) P(reach state 1 | X_1 = 1) + (1/4) P(reach state 1 | X_1 = 2)   (4.11)
+ (1/4) P(reach state 1 | X_1 = 4) + (1/4) P(reach state 1 | X_1 = 5).   (4.12)

We can simplify this a little further: we know that P(reach state 1 | X_1 = 1) = 1, of course! So we have

1/4 + (1/4) P(reach state 1 | X_1 = 2)   (4.13)
+ (1/4) P(reach state 1 | X_1 = 4) + (1/4) P(reach state 1 | X_1 = 5).   (4.14)

But how does this help us?
Notice the following: it doesn’t really matter whether we write X_1 or X_0 in the conditioning in (4.14). All this is saying is “now that we’re in state i, what’s the probability of getting to state 1?”. In other words,

P(reach state 1 | X_1 = i) = P(reach state 1 | X_0 = i).

This is a handy observation because we can now use some notation that covers all these situations. Let

P_i = P(reach state 1 | X_0 = i).

Now let’s apply this to the question we’re trying to solve (which is to find P_2):

P(reach state 1 | X_0 = 2) ≡ P_2 = 1/4 + (1/4) P_2 + (1/4) P_4 + (1/4) P_5.   (4.15)

Let’s take it a step further. If we can write P_4 and P_5 in terms of P_2 then we can solve for P_2. This is the purpose of FSD for probabilities: in our case we try to compute P_2, so we use FSD to write an expression for P_2 wholly in terms of P_2, which then allows us to solve for P_2 … magic!

So let’s try it out. Because from state 4 we must return to state 2, we have:

P_4 = P_2.

What about P_5? You may want to write it out in full but the end result is the following:

P_5 = (1/4) P_2 + (1/4) P_3 + (1/4) P_5 + (1/4) P_6.

We can simplify this considerably: P_6 = 0 (why?) and P_3 = P_5 (why?). So

P_5 = (1/4) P_2 + (1/2) P_5.

Therefore, P_5 = (1/2) P_2.

Substituting all this in (4.15) we have:

P_2 = 1/4 + (1/4) P_2 + (1/4) P_2 + (1/4)(1/2) P_2.

We can now solve to find P_2 = 2/3.

Of course, the above seems like a very long-winded calculation, but once you get the hang of it it really is very quick and painless. I could have written the entire calculation as follows.

Example 4.3. Let

P_i = P(reach state 1 | X_0 = i).

We want to compute P_2, and we can see from the state-space diagram that

P_1 = 1,  P_4 = P_2,  P_3 = P_5,  P_6 = 0.

Therefore,

P_2 = (1/4) P_1 + (1/4) P_2 + (1/4) P_4 + (1/4) P_5   (4.16)
= 1/4 + (1/4) P_2 + (1/4) P_2 + (1/4) P_5.   (4.17)

In addition,

P_5 = (1/4) P_2 + (1/4) P_3 + (1/4) P_5 + (1/4) P_6
= (1/4) P_2 + (1/4) P_5 + (1/4) P_5,

which implies that P_5 = (1/2) P_2.
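As a numerical cross-check (not part of the notes’ derivation), the two relations obtained so far, the equation for P_2 together with P_5 = (1/2) P_2, can be written as a linear system and handed to numpy:

```python
import numpy as np

# First step decomposition as a linear system.  With P1 = 1, P4 = P2,
# P3 = P5 and P6 = 0, the remaining unknowns x = (P2, P5) satisfy
#   P2 = 1/4 + 1/4 P2 + 1/4 P2 + 1/4 P5
#   P5 = 1/4 P2 + 1/2 P5
# i.e. A x = b with:
A = np.array([[1 - 0.5, -0.25],
              [-0.25,  1 - 0.5]])
b = np.array([0.25, 0.0])

P2, P5 = np.linalg.solve(A, b)
print(P2, P5)   # 2/3 and 1/3
```

Writing the FSD equations in matrix form like this scales to chains with many states, where solving by hand becomes tedious.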
Substituting this into (4.17) we have that

P_2 = 1/4 + (1/4) P_2 + (1/4) P_2 + (1/8) P_2.

Solving yields P_2 = 2/3.

We can play around with this idea quite a bit. In Example 4.2, we were told where the process starts. What if, instead, we are given an initial distribution?

Example 4.4. Suppose that p^(0) = (0, 1/6, 0, 1/3, 1/2, 0). Find the probability that the chain eventually reaches state 1.

Using our notation from earlier, we have already established that:

P_2 := P(reach state 1 | X_0 = 2) = 2/3
P_4 := P(reach state 1 | X_0 = 4) = 2/3
P_5 := P(reach state 1 | X_0 = 5) = 1/3

(because we established that P_4 = P_2 and P_5 = (1/2) P_2 in Example 4.2). Let P_C denote the probability of reaching state 1. Then

P_C = ∑_{i=1}^6 P(reach state 1 | X_0 = i) P(X_0 = i) ≡ ∑_{i=1}^6 P_i ⋅ P(X_0 = i).

Substituting in the probabilities from the initial distribution, and from the previous example, we have:

P_C = (1/6) P_2 + (1/3) P_4 + (1/2) P_5

and so P_C = 1/2.

Part V

Week 3

Chapter 5

Classifying discrete time Markov processes

Suppose we wish to study a particular Markov chain, summarised by its transition matrix P and an initial distribution p^(0). We may be interested in the long-run probabilistic behaviour of the chain or in the probability that some future event of interest happens. Even if nobody has studied this particular Markov chain before, we are able to use some general results on the behaviour of Markov chains to help us. In order to use this theory we need to classify each state of the chain as being one of a number of types, and hence classify the type of Markov chain that we have.

The first step in classifying Markov chains is to split the state space into non-overlapping groups (classes) of states. We will see later that states in the same class are of the same type. This simplifies the problem of classifying all the states in the chain because we only need to classify one state in each class.
5.1 Irreducible classes of intercommunicating states

State i communicates with state j (we write i → j) if

p_ij^(n) > 0 for some n ≥ 0,
that is, starting from state 𝑖, it is possible that the chain will eventually enter
state 𝑗.
States 𝑖 and 𝑗 intercommunicate (we write 𝑖 ↔ 𝑗) if 𝑖 → 𝑗 and 𝑗 → 𝑖.
Properties of ↔
• 𝑖 ↔ 𝑖 for all states 𝑖.
• if 𝑖 ↔ 𝑗 then 𝑗 ↔ 𝑖.
• if 𝑖 ↔ 𝑗 and 𝑗 ↔ 𝑘 then 𝑖 ↔ 𝑘.
Together, these three properties mean that ↔ is an equivalence relation on S.
We can partition the state space 𝑆 into non-overlapping classes of intercommunicating states. Within a particular class, every pair of states intercommunicate;
states in different classes do not intercommunicate.
Each class is called an irreducible class of states.
If there is only one irreducible class, then the Markov chain is said to be irreducible.
Example 5.1. For the Markov chain with state space S = {0, 1, 2} and transition matrix

P = ⎛ 1/4 1/4 1/2 ⎞
    ⎜  0   1   0  ⎟
    ⎝ 1/2  0  1/2 ⎠

the state space can be partitioned into the irreducible classes {0, 2} and {1}. In
particular, the Markov chain is not irreducible because there are two irreducible
classes. It might be helpful to see this by drawing a state-space diagram.
Figure 5.1: State-space diagram, which may help in deciding the irreducible classes
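Irreducible classes can also be found mechanically from the zero pattern of P: compute which states reach which, then group mutually reachable states. A minimal sketch (the matrix is an illustrative 3-state chain whose classes are {0, 2} and {1}, consistent with Example 5.1):

```python
import numpy as np

# Finding irreducible classes by reachability.
P = np.array([[0.25, 0.25, 0.5],
              [0.0,  1.0,  0.0],
              [0.5,  0.0,  0.5]])

n = len(P)
# reach[i, j] is True iff j can be reached from i in some number of
# steps (including 0); (P + I)^n has a positive (i, j) entry exactly then.
reach = np.linalg.matrix_power(P + np.eye(n), n) > 0
# i and j intercommunicate iff each reaches the other
classes = {frozenset(j for j in range(n)
                     if reach[i, j] and reach[j, i]) for i in range(n)}
print(sorted(map(sorted, classes)))   # [[0, 2], [1]]
```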
5.2 Recurrence and transience
Let {𝑋𝑛 } be a Markov chain with state space 𝑆 and let 𝑖 ∈ 𝑆. Let
𝑓𝑖 = P(ever return to 𝑖|𝑋0 = 𝑖) = P(𝑋𝑛 = 𝑖 for some 𝑛 ≥ 1|𝑋0 = 𝑖).
If 𝑓𝑖 = 1 then 𝑖 is recurrent (eventual return to state 𝑖 is certain).
If f_i < 1 then i is transient (eventual return to state i is uncertain).

Studying recurrence and transience enables us to ask further interesting questions of our Markov chain, for example: how many times will a Markov chain hit state i, given it starts in state i?

5.2.1 Transience

Suppose that i is transient. Let N be the number of hits on i (including the hit from the fact that X_0 = i). Then

P(N = n | X_0 = i) = P(return n − 1 times to i, then never return | X_0 = i) = f_i^{n−1} (1 − f_i).

So, given X_0 = i, N has a geometric distribution with parameter 1 − f_i. So N is finite with probability 1 and

E(N | X_0 = i) = 1/(1 − f_i) (< ∞).

5.2.2 Recurrence

Suppose that i is recurrent. With probability 1 the chain will eventually return to state i. By time homogeneity and the Markov property, the chain ‘starts afresh’ on return to the initial state, so that state i will eventually be visited again (with probability 1). Repeating the argument shows that the chain will return to i infinitely often (with probability 1). We also have that E(N | X_0 = i) is infinite.

5.2.3 First passage times

Another idea connected with recurrence and transience is that of first passage times. This describes the time it takes for a Markov chain to return to state i, given that it started there. Let

T_ii = min{n ≥ 1 : X_n = i | X_0 = i}.

T_ii is the first passage time from state i to itself, that is, the number of steps until the chain first returns to state i given that it starts in state i. If the chain never returns to state i we set T_ii = ∞. Note that:

f_i = P(ever return to i | X_0 = i) = P(T_ii < ∞),
1 − f_i = P(never return to i | X_0 = i) = P(T_ii = ∞).

Connecting this with the idea of transience and recurrence yields:

• If i is recurrent then P(T_ii < ∞) = 1, that is, T_ii is finite with probability 1.
• If i is transient then P(T_ii < ∞) < 1, or equivalently P(T_ii = ∞) = 1 − P(T_ii < ∞) > 0, that is, T_ii is infinite with positive probability.
(Therefore, 𝜇𝑖 = 𝐸[𝑇𝑖𝑖 ] is infinite.)
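The geometric number-of-hits result from Section 5.2.1 can be illustrated by simulation; a minimal sketch on a toy 2-state chain that is not from the notes (from state 0 the chain stays w.p. 1/2 or jumps to an absorbing state, so f_0 = 1/2):

```python
import numpy as np

# Number of hits N on a transient state i is Geometric(1 - f_i) with
# mean 1/(1 - f_i).  Here f_0 = 1/2, so each visit to state 0 is
# followed by another w.p. 1/2, i.e. N ~ Geometric(1/2) on {1, 2, ...}
# and E[N] = 2.
rng = np.random.default_rng(2)
trials = 100_000
hits = rng.geometric(p=0.5, size=trials)   # sample N directly

print(hits.mean())   # close to 2
```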
Positive recurrence and null recurrence
In fact, there are two types of recurrent state. Recall that the first passage time
(also called the recurrence time) 𝑇𝑖𝑖 of a state 𝑖 is the number of steps until
the chain first returns to state 𝑖 given that it starts in state 𝑖.
Let 𝜇𝑖 = E(𝑇𝑖𝑖 ) be the mean recurrence time of state 𝑖. We have already
seen that if 𝑖 is transient then 𝜇𝑖 = ∞. We have not computed 𝜇𝑖 when 𝑖 is
recurrent. In fact, the value of 𝜇𝑖 when 𝑖 is recurrent enables us to distinguish
between two types of recurrence: positive recurrence and null recurrence.
A recurrent state 𝑖 is
• positive recurrent if μ_i < ∞ (that is, μ_i is finite),
• null recurrent if μ_i = ∞ (that is, μ_i is infinite).

At first glance it may seem strange that a state i can be null recurrent: return to state i is certain, but the expected time (number of steps) to return to i is infinite. In other words, the random variable T_ii is finite with probability 1, but its mean, E(T_ii), is infinite. Such states do exist and we will see examples later in the course.

5.3 Periodicity

The period of a state i is the greatest common divisor of the set of integers n ≥ 1 such that p_ii^(n) > 0, that is,
d_i = gcd {n ≥ 1 : p_ii^(n) > 0}.

• If d_i = 1 then state i is aperiodic.
• If p_ii^(n) = 0 for all n ≥ 1 (i.e. return to state i is impossible) then, by convention, state i is also said to be aperiodic.
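The definition of the period translates directly into code: take the gcd of the return times n for which p_ii^(n) > 0. A minimal sketch, checked on the two-state 'flip' chain (an illustrative example, not from the notes), which has period 2:

```python
from math import gcd

import numpy as np

# Period of state i: gcd of { n >= 1 : p_ii^(n) > 0 }, scanning matrix
# powers of P up to n_max.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

def period(P, i, n_max=50):
    d = 0
    Pn = np.eye(len(P))
    for n in range(1, n_max + 1):
        Pn = Pn @ P              # Pn = P^n
        if Pn[i, i] > 0:
            d = gcd(d, n)        # gcd over all observed return times
    return d

print(period(P, 0))   # 2
```

Scanning only up to n_max is a practical truncation; for a finite chain a bound related to the number of states suffices.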
5.4 Class properties
A class property is a property which is shared by all the states in an
irreducible class. If it can be shown that one of the states in a class has a
certain class property then all the states in the irreducible class must also have
that same property.
I give sketch proofs for some of the following results relating to class properties.
The other proofs can be found in any good textbook on Markov processes.
Theorem 5.1. A pair of intercommunicating states are either both recurrent or both transient. That is, recurrence/transience is a class property. It is not possible to have a mixture of recurrent and transient states in an irreducible class.

Proof. Look in ‘Introduction to Probability Models’ (Sheldon Ross).
Theorem 5.2. Null recurrence is a class property.
Proof. Recall that 𝜇𝑖 = 𝐸(𝑇𝑖𝑖 ) is the expected first passage time or the mean
recurrence time of state 𝑖, and that
μ_i is finite if i is positive recurrent, and infinite if i is null recurrent or transient.
It can be shown (proof omitted) that

i aperiodic ⇒ p_ii^(n) → 1/μ_i as n → ∞,
i periodic with period d ⇒ p_ii^(nd) → d/μ_i as n → ∞ (and p_ii^(m) = 0 if m is not a multiple of d).

So, as n → ∞:

• if i is aperiodic, positive recurrent then p_ii^(n) → 1/μ_i (positive, finite),
• if i is periodic, positive recurrent then p_ii^(nd) → d/μ_i (positive, finite),
• if i is null recurrent then p_ii^(n) → 0,
• if i is transient then (proof omitted) ∑_{n=0}^∞ p_ii^(n) < ∞, and p_ii^(n) → 0.

Now let i ↔ j, with p_ij^(n) > 0, p_ji^(m) > 0. Suppose that i is null recurrent. Then

p_ii^(n+k+m) ≥ p_ij^(n) p_jj^(k) p_ji^(m).

As k → ∞,

p_ii^(n+k+m) → 0

(because i is null recurrent), and so

p_jj^(k) → 0

(because p_ij^(n) > 0 and p_ji^(m) > 0). So j is either transient or null recurrent. But i is
recurrent and i ↔ j, so j must be recurrent as well (because recurrence/transience
is a class property). Hence j must be null recurrent.
Theorem 5.3. Positive recurrence is a class property.
Proof. Suppose that i ↔ j and that i is positive recurrent. Then j must be
recurrent (because recurrence is a class property). If j is null recurrent then
i is also null recurrent (because null recurrence is a class property). This is a
contradiction. Hence j must be positive recurrent.
Theorem 5.4. Intercommunicating states have the same period. That is, periodicity is a class property.
Proof. Let i and j be states with i ↔ j. Let d_i be the period of i and d_j be the
period of j. Let n and m be such that p_ij^(n) > 0 and p_ji^(m) > 0. Then

p_jj^(m+n) ≥ p_ji^(m) p_ij^(n) > 0,

because there is at least one way of returning to j in n + m steps: j to i in m
steps and then i to j in a further n steps. Therefore,

d_j | n + m  (d_j is a divisor of n + m).

Let k be such that p_ii^(k) > 0. Then

p_jj^(m+k+n) ≥ p_ji^(m) p_ii^(k) p_ij^(n) > 0,

so that

d_j | n + k + m  (d_j is a divisor of n + k + m).

These two divisibility statements together imply that

d_j | k  (d_j is a divisor of k).

That is, d_j is a factor of any k ∈ {k ≥ 1 : p_ii^(k) > 0}.
However, by definition d_i is the greatest common divisor of such k. Therefore,
d_j ≤ d_i. Reversing the roles of i and j gives d_i ≤ d_j. Hence d_i = d_j.
The following are class properties:
• Null recurrence;
• Positive recurrence;
• Transience;
• Periodicity.
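For a finite chain these class properties can be checked numerically. The following is a sketch (not part of the notes): the period of state 𝑖 is the gcd of {𝑛 ≥ 1 : 𝑝𝑖𝑖^(𝑛) > 0}, which we can approximate by scanning powers of 𝑃 up to a cutoff 𝑛_max (a practical truncation, not part of the definition).

```python
# Numerical sketch (not from the notes): the period of state i is
# gcd{ n >= 1 : p_ii^(n) > 0 }.  We scan powers of P up to a cutoff n_max.
from functools import reduce
from math import gcd

import numpy as np

def periods(P, n_max=50):
    """Period of each state, from the diagonals of P, P^2, ..., P^n_max."""
    k = P.shape[0]
    return_times = [[] for _ in range(k)]
    Pn = np.eye(k)
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        for i in range(k):
            if Pn[i, i] > 0:
                return_times[i].append(n)
    # gcd of the observed return times (None if no return seen by n_max)
    return [reduce(gcd, ts) if ts else None for ts in return_times]

# A chain that flips deterministically between two states has period 2 in
# both states; a chain with p_ii > 0 is aperiodic at state i.
P_flip = np.array([[0.0, 1.0], [1.0, 0.0]])
P_lazy = np.array([[0.6, 0.4], [0.5, 0.5]])
print(periods(P_flip))  # [2, 2]
print(periods(P_lazy))  # [1, 1]
```

Since periodicity is a class property, computing the period of one state per communicating class suffices.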
5.5 Further properties of discrete time Markov chains
Theorem 5.5. A finite irreducible Markov chain cannot be null recurrent.
Proof. Recall that 𝑝𝑗^(𝑛) = P(𝑋𝑛 = 𝑗). Suppose that the chain is null recurrent.
It can be shown that 𝑝𝑗^(𝑛) → 0 as 𝑛 → ∞ for all 𝑗 ∈ 𝑆.
We know that
∑𝑗∈𝑆 𝑝𝑗^(𝑛) = 1 for all 𝑛,
so
lim𝑛→∞ ∑𝑗∈𝑆 𝑝𝑗^(𝑛) = 1.
But 𝑆 is finite, so if the chain is null recurrent we may exchange the limit and
the (finite) sum:
lim𝑛→∞ ∑𝑗∈𝑆 𝑝𝑗^(𝑛) = ∑𝑗∈𝑆 lim𝑛→∞ 𝑝𝑗^(𝑛) = 0.
This is a contradiction, so the Markov chain cannot be null recurrent.
Theorem 5.6. A finite Markov chain cannot contain any null recurrent states.
Proof. Suppose that the chain contains a state 𝑖 which is null recurrent. Then,
since null recurrence is a class property, state 𝑖 is in some irreducible finite closed
class of null recurrent states. This is not possible (see above).
Theorem 5.7. It is not possible for all states in a finite state space Markov
chain to be transient.
Note that this implies that all states in a finite irreducible Markov chain are
recurrent (because recurrence is a class property). That is, finite and irreducible
⇒ recurrent Markov chain.
Proof. Suppose all states are transient and 𝑆 = {0, 1, … , 𝑀 }. A transient state
will be visited only a finite number of times. Therefore for each state 𝑖 there
exists a time 𝑇𝑖 after which 𝑖 will never be visited again. Therefore, after time
𝑇 = max{𝑇0 , … , 𝑇𝑀 } no state will be visited again. BUT the Markov chain
must be in some state. Therefore we have a contradiction. Therefore, at least
one state must be recurrent.
5.6 Closed classes and absorption
A class 𝐶 of intercommunicating states is closed if 𝑝𝑖𝑗 = 0 for all 𝑖 ∈ 𝐶 and 𝑗 ∉
𝐶, that is, once the chain gets into 𝐶, it never leaves 𝐶.
If a single state forms a closed class then it is called an absorbing state.
Note the following important observations:
• A finite closed class must be recurrent: once entered, the class behaves
like a finite irreducible Markov chain.
• An absorbing state is positive recurrent: its mean recurrence time is 1.
Theorem 5.8. An irreducible class 𝐶 of recurrent states is closed.
Proof. Suppose that 𝐶 is not closed. Then there exist 𝑖 ∈ 𝐶 and 𝑗 ∉ 𝐶 such
that 𝑖 → 𝑗. But 𝑗 does not communicate with 𝑖, otherwise 𝑖 and 𝑗 would
intercommunicate and 𝑗 would be in 𝐶. Thus, there is positive probability that
the chain leaves 𝑖 and never returns to 𝑖. This means that 𝑖 is transient. This
is a contradiction, as 𝑖 ∈ 𝐶 and 𝐶 is recurrent.
Decomposition of the states of a Markov chain
Putting the previous results together shows that the state space 𝑆 of a Markov
chain can be decomposed as
𝑆 = 𝑇 ∪ 𝐶𝑚1 ∪ 𝐶𝑚2 ∪ ⋯ , where
• 𝑇 is a set of transient states;
• 𝐶𝑚1 , 𝐶𝑚2 , … are irreducible closed classes of recurrent states.
Each 𝐶𝑚𝑖 is either null recurrent or positive recurrent, and all states in a particular 𝐶𝑚𝑖 have the same period (different 𝐶𝑚𝑖 ’s can have different periods).
The following observations apply:
• If 𝑋0 ∈ 𝐶𝑚𝑖 then the chain stays in 𝐶𝑚𝑖 forever.
• If 𝑋0 ∈ 𝑇 then either:
– the chain always stays in 𝑇 (only possible if the set 𝑇 of states is
infinite); or
– the chain is eventually absorbed in one of the 𝐶𝑚𝑖 ’s (where it stays forever).
• If 𝑆 is finite then it is impossible for the chain to remain forever in the
(finite) set 𝑇 of transient states.
• If 𝑆 is finite then at least one state must be visited infinitely often and so
there must be at least one recurrent state.
We also know that if 𝑆 is finite then there are no null recurrent states. The
consequences of this are:
• If a Markov chain has a finite state space then there must be at least one
positive recurrent state.
• If a Markov chain has a finite state space and is irreducible, then it must
be positive recurrent.
• A finite closed irreducible class must be positive recurrent.
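The decomposition above can be computed mechanically for a finite chain. Here is a sketch (not part of the notes; the example matrix is hypothetical): find the set of states reachable from each state, intersect reachability both ways to obtain the communicating classes, and flag a class as closed if nothing outside it is reachable from inside it.

```python
# Sketch (not from the notes): computing the decomposition
# S = T ∪ C_1 ∪ C_2 ∪ ... for a finite chain.  The matrix is hypothetical.
import numpy as np

def reachable(P, i):
    """States reachable from i in zero or more steps."""
    seen, stack = {i}, [i]
    while stack:
        s = stack.pop()
        for t in range(P.shape[0]):
            if P[s, t] > 0 and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def decompose(P):
    """Communicating classes and, for each, whether it is closed."""
    k = P.shape[0]
    reach = [reachable(P, i) for i in range(k)]
    classes = []
    for i in range(k):
        # i <-> j iff each state reaches the other
        cls = frozenset(j for j in range(k) if j in reach[i] and i in reach[j])
        if cls not in classes:
            classes.append(cls)
    # a class is closed iff no state outside it is reachable from inside it
    closed = [all(reach[i] <= c for i in c) for c in classes]
    return classes, closed

# Hypothetical 3-state chain: state 0 is transient, states 1 and 2 absorbing.
P = np.array([[0.5, 0.25, 0.25],
              [0.0, 1.00, 0.00],
              [0.0, 0.00, 1.00]])
classes, closed = decompose(P)
print(classes)  # [frozenset({0}), frozenset({1}), frozenset({2})]
print(closed)   # [False, True, True]
```

Since this chain is finite, each closed class is positive recurrent, and the non-closed class {0} makes up the transient set 𝑇.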
Part VI
Week 4
Chapter 6
Long-run behaviour of
discrete time Markov chains
This is the crux of the first half of the course: understanding a Markov chain’s
behaviour in the long run (limiting behaviour).
Our goal here is to find whether the distribution of states, 𝑝(𝑛) , settles down
(i.e. converges) as 𝑛 → ∞. If it does, what is this limit?
We now have enough tools at our disposal. It will turn out that to fully determine its long-term behaviour we must classify the Markov chain in terms of:
• Irreducible classes;
• Transience and positive/null recurrence;
• Periodicity.
Remember: what we’re interested in is
lim𝑛→∞ 𝑝(𝑛) .
For example, if 𝑆 = {1, 2, 3, 4}, then we’re interested in how the vector
𝑝(𝑛) = (𝑃 (𝑋𝑛 = 1), 𝑃 (𝑋𝑛 = 2), 𝑃 (𝑋𝑛 = 3), 𝑃 (𝑋𝑛 = 4))
changes as we let 𝑛 → ∞. Does this vector settle down as 𝑛 → ∞ (does it
have a limit)? If it does then this is good news: we can use this information to
predict the future behaviour of the chain, because we know what 𝑃 (𝑋𝑛 = 𝑖) is
for large 𝑛. This allows us to answer questions such as “what’s the probability
that the chain is in state 2 at some large time?”. While this might not sound all
that interesting, it opens the door to answering other questions about long-term
behaviour.
6.1 Invariant distributions
We will start with a somewhat simpler question than that of limiting behaviour:
are there particular initial distributions for a Markov chain, such that when the
Markov chain starts in any of these distributions, we can determine the limit of
𝑝(𝑛) as 𝑛 → ∞?
The probability row vector 𝜋 is an invariant distribution of a discrete-time
Markov chain {𝑋𝑛 } with transition matrix 𝑃 if
𝜋 = 𝜋 𝑃.
If 𝜋 is an invariant distribution, and the initial distribution of the Markov chain
is 𝜋, why does this imply that we can study its long term behaviour?
The idea is that if the chain starts in 𝜋 (or equivalently, we find at some later
time that the distribution of states is 𝜋), then this tells us that the distribution
of states is 𝜋 forever: we’ve uncovered the long term behaviour of the chain in
this specific case.
This works since, suppose that an invariant distribution 𝜋 exists and that 𝑝(0) =
𝜋, then
𝑝(1) = 𝑝(0) 𝑃 = 𝜋 𝑃 = 𝜋;
𝑝(2) = 𝑝(1) 𝑃 = 𝜋 𝑃 = 𝜋;
𝑝(3) = 𝑝(2) 𝑃 = 𝜋 𝑃 = 𝜋;
In fact 𝑝(𝑛) = 𝜋 for 𝑛 = 1, 2, …, and we have found the limit of 𝑝(𝑛) as 𝑛 → ∞,
as required!
If our initial distribution is 𝜋, with 𝜋 = 𝜋𝑃 , then the distribution of the states
at any given future time 𝑛 is also 𝜋. This is why it is called an invariant
distribution. An invariant distribution always exists if the state space 𝑆 is
finite, but it may not be unique. If 𝑆 is infinite then an invariant distribution
may not exist.
Assuming that the transition matrix 𝑃 is a 𝑘 × 𝑘 matrix, and the state space is
𝑆 = {1, 2, …, 𝑘}, solving 𝜋 = 𝜋𝑃 is a matter of solving a set of 𝑘 simultaneous
equations, each of the form
𝜋𝑖 = 𝜋1 𝑝1𝑖 + 𝜋2 𝑝2𝑖 + … + 𝜋𝑘 𝑝𝑘𝑖 .
If you try this you will find that one equation is redundant (try it!): we are
therefore trying to solve for 𝑘 unknowns (𝜋1 , …, 𝜋𝑘 ) using only (𝑘 − 1) independent
equations. On their own these cannot pin down a unique solution, but we have
the advantage that since 𝜋 is itself a probability distribution, we must have
𝜋1 + … + 𝜋𝑘 = 1. This added requirement means that we can now solve for
𝜋1 , …, 𝜋𝑘 .
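Numerically, a convenient way to impose the normalization is to replace one row of the (redundant) linear system by ∑𝜋𝑖 = 1. The following is a sketch (not part of the notes), using for concreteness the two-state weather matrix from Exercise 3.2; it assumes the resulting system is nonsingular, as it is for a finite irreducible chain.

```python
import numpy as np

def invariant_distribution(P):
    """Solve pi = pi P subject to sum(pi) = 1.

    pi = pi P  <=>  (P - I)^T pi^T = 0; one of these k equations is
    redundant, so we overwrite the last one with the normalization.
    """
    k = P.shape[0]
    A = (P - np.eye(k)).T
    A[-1, :] = 1.0          # normalization row: pi_1 + ... + pi_k = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Two-state weather chain (Exercise 3.2)
P = np.array([[0.6, 0.4],
              [0.5, 0.5]])
pi = invariant_distribution(P)
print(pi)                       # approximately (5/9, 4/9)
print(np.allclose(pi @ P, pi))  # True
```

A direct check that 𝜋𝑃 = 𝜋 (as in the last line) is always worth doing after solving.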
How many invariant distributions?
How many invariant distributions can a discrete time Markov process have?
To find an invariant distribution, if one exists, we solve a set of simultaneous
linear equations. Given such a set there are either:
• No solutions – no invariant distribution exists;
• Exactly one solution – exactly one invariant distribution exists;
• An infinite number of solutions – there are an infinite number of invariant distributions.
If 𝑆 is irreducible then:
An invariant distribution exists ⟺ all states are positive recurrent.
If the invariant distribution exists under these conditions, it must be unique.
1. If 𝑆 is irreducible, then all states are of the same type.
2. If 𝑆 is finite (and irreducible), then all states must be positive recurrent,
so an invariant distribution must exist.
3. If 𝑆 is infinite (and irreducible), then all states must be of the same type,
but they aren’t necessarily positive recurrent.
If 𝑆 is NOT irreducible then:
• Finite 𝑆:
the chain is guaranteed to enter a positive recurrent class eventually, so an invariant distribution exists.
• Infinite 𝑆:
– If there exists a positive recurrent class, then there must be an invariant distribution.
– If there does not exist a positive recurrent class, then there cannot
be an invariant distribution.
In any case, if there are two or more positive recurrent classes, then an infinite
number of invariant distributions exist.
6.2 Equilibrium distributions
Simply looking at invariant distributions in order to determine a Markov
chain’s long term behaviour is very restrictive. We want a far more general
idea of long term behaviour. Can we determine whether 𝑝(𝑛) converges as
𝑛 → ∞, regardless of the initial distribution? If this limit exists, what is it?
Example 6.1. Recall the weather example (Exercise 3.2), with state space
𝑆 = { 𝑅, 𝐹 } and transition matrix 𝑃 ,
𝑃 = ( 0.6 0.4 )
    ( 0.5 0.5 )
We examine how 𝑃 (𝑛) = 𝑃 𝑛 , the 𝑛-step transition matrix, changes as 𝑛 increases:
𝑃^2 = ( 0.56 0.44 )    𝑃^3 = ( 0.556 0.444 )    𝑃^6 = 𝑃^3 ⋅ 𝑃^3 = ( 0.555556 0.444444 )
      ( 0.55 0.45 )           ( 0.555 0.445 )                     ( 0.555555 0.444445 )
Looking at different initial distributions for this Markov chain and what happens
in the ‘long run’ from these starting distributions, we would tabulate the distributions
of 𝑋1 , 𝑋2 , …, 𝑋6 for each choice of initial distribution (table not reproduced here).
Whatever initial distribution we pick, 𝑝(6) = 𝑝(0) 𝑃^6 is already very close to
(0.5556, 0.4444).
Notice that as 𝑛 → ∞, 𝑃 (𝑛) tends to a limiting matrix whose rows are
identical to each other, and 𝑝(𝑛) seems to tend to a limiting probability
distribution which is the same regardless of the initial distribution. This
is a very important phenomenon, which we describe next.
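The calculations in Example 6.1 are easy to reproduce numerically; a quick sketch (not part of the notes):

```python
import numpy as np

# Weather chain from Example 6.1
P = np.array([[0.6, 0.4],
              [0.5, 0.5]])

# The rows of P^n become identical as n grows...
for n in (2, 3, 6, 12):
    print(n, np.linalg.matrix_power(P, n))

# ...so p^(n) = p^(0) P^n approaches the same limit whatever p^(0) is.
for p0 in ([1.0, 0.0], [0.0, 1.0], [0.3, 0.7]):
    print(p0, np.array(p0) @ np.linalg.matrix_power(P, 12))
```

All three starting distributions give essentially the same 𝑝(12), foreshadowing the definition of an equilibrium distribution below.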
A probability row vector 𝜋 = {𝜋𝑗 , 𝑗 ∈ 𝑆} is an equilibrium distribution for the
discrete-time Markov chain {𝑋𝑛 } if
𝑝(𝑛) ⟶ 𝜋
as 𝑛 ⟶ ∞,
independently of the initial distribution 𝑝(0) (i.e. for any initial distribution, the
long-run distribution of 𝑋𝑛 as 𝑛 → ∞ is the same).
When an equilibrium distribution exists,
𝜋𝑗 = lim𝑛→∞ 𝑝𝑖𝑗^(𝑛) ,
for all 𝑖, 𝑗 ∈ 𝑆.
Notice what this says: 𝜋 is an equilibrium distribution if 𝑝(𝑛) → 𝜋 as 𝑛 → ∞
regardless of the start distribution, 𝑝(0) .
It is important to note that there cannot be two or more of these limits: either
such a limit does not exist, or exactly one exists. Think about this carefully
because it tells you that there is at most one equilibrium distribution!
The following interpretations of the equilibrium distribution, 𝜋, are important:
• 𝜋𝑗 is the limiting probability that the chain is in state 𝑗 at time 𝑛.
• It can be shown that 𝜋𝑗 is also equal to the long-run proportion of time
that the chain spends in state 𝑗.
6.3 Invariant vs equilibrium distributions
We know that 𝑝(𝑛) = 𝑝(𝑛−1) 𝑃 . If this chain has an equilibrium distribution then
𝑝(𝑛) ⟶ 𝜋 as 𝑛 → ∞, and 𝜋 is the equilibrium distribution. Therefore, letting
𝑛 → ∞ we have 𝜋 = 𝜋𝑃 , so that 𝜋 is also an invariant distribution.
Though the concepts of an invariant distribution and an equilibrium distribution
are therefore closely related, they are NOT the same thing: an equilibrium distribution must also be an invariant distribution, BUT an invariant distribution
is not necessarily an equilibrium distribution.
• An equilibrium implies the existence of an invariant;
• An invariant does NOT guarantee the existence of an equilibrium.
Invariant distribution 𝜋:
• If the chain starts in (or equivalently, gets to) 𝜋, the distribution of states
at all further times is also 𝜋.
Equilibrium distribution 𝜋:
• If the chain is run for long enough (i.e. forever) its probabilistic behaviour
settles down to that of 𝜋.
• This is regardless of the state in which the chain starts.
Important consequences
• No invariant distribution ⇒ no equilibrium distribution.
• Exactly 1 invariant distribution ⇒ there may or may not be an equilibrium distribution.
• Two or more invariant distributions ⇒ no equilibrium distribution.
Some other important things to note:
• Not all Markov chains have equilibrium distributions.
• If a Markov chain has an equilibrium distribution, then it is unique (cannot
have more than one equilibrium distribution).
• If a Markov chain has an equilibrium distribution, then it also has an
invariant distribution and these distributions are identical.
• If a Markov chain does not have an equilibrium distribution, then it may
or may not have an invariant distribution.
• A Markov chain can have any number of invariant distributions (including
no invariant distribution).
• It is possible for a Markov chain to have more than one invariant distribution but not have an equilibrium distribution.
• It is possible for a Markov chain to have neither an invariant distribution
nor an equilibrium distribution.
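A concrete illustration of the gap between the two notions (a sketch, not from the notes): the chain that flips deterministically between two states has the unique invariant distribution (1/2, 1/2), yet no equilibrium distribution, because 𝑝(𝑛) oscillates forever from any other starting distribution.

```python
import numpy as np

# Period-2 chain: always jumps to the other state.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

pi = np.array([0.5, 0.5])
print(np.allclose(pi @ P, pi))  # True: pi is an invariant distribution

# From p^(0) = (1, 0), p^(n) alternates between (1, 0) and (0, 1):
# it has no limit, so no equilibrium distribution exists.
p = np.array([1.0, 0.0])
for n in range(4):
    print(n, p)
    p = p @ P
```

This matches the theory of Section 6.4: the chain is irreducible and positive recurrent but periodic, so it is not ergodic.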
6.4 Ergodicity and the equilibrium distribution
An ergodic state is one that is both positive recurrent and aperiodic. The
existence of an equilibrium distribution depends on ergodicity.
Theorem 6.1. An equilibrium distribution exists if and only if there is an
ergodic class 𝐶 and the Markov chain is certain to be absorbed into 𝐶 eventually,
wherever it starts.
We won’t be proving this result. Notice that it gives us a checklist for existence
of an equilibrium distribution:
1. There is a single ergodic class;
2. The process is guaranteed to enter the single ergodic class wherever it
starts.
A special case
What happens when {𝑋𝑛 } is an irreducible ergodic Markov chain? Applying
the theorem:
1. There is a single ergodic class (by definition in this case);
2. We’re guaranteed to enter the ergodic class (because it’s irreducible, we
start there and always stay there!)
Therefore an equilibrium distribution exists.
Without loss of generality we assume that 𝑆 = {0, 1, 2, …}. Then in this special
case (we won’t prove this) the equilibrium distribution 𝜋 satisfies
𝜋𝑗 = 1/𝜇𝑗
for all 𝑗 ∈ 𝑆 ,
where 𝜇𝑗 is the mean recurrence time of state 𝑗.
This provides us with an alternative way to show that a Markov chain is positive
recurrent: if the chain is irreducible, aperiodic and has an invariant distribution
(that is, there exists a 𝜋 such that 𝜋 = 𝜋𝑃 ) then it must be positive recurrent.
Therefore, an irreducible aperiodic Markov chain has an invariant distribution
if and only if the chain is positive recurrent.
When does an equilibrium distribution exist?
• More than 1 closed class ⇒ no equilibrium distribution exists (the long-term behaviour of the chain depends on 𝑝(0) ).
• No closed classes (i.e. all states are transient) ⇒ no equilibrium distribution exists.
• If there exists exactly one closed class 𝐶, is it certain that the Markov
chain will eventually be absorbed into 𝐶?
– If “no” then no equilibrium distribution exists (the long-term behaviour of the chain depends on 𝑝(0) ).
– If “yes” then an equilibrium distribution exists if and only if 𝐶 is
ergodic.
Part VII
Week 5
Chapter 7
Checking the Markov
property and solving problems
7.1 The idea of ‘first step decomposition’
Recall from Section 4.3.1 that the ‘first step decomposition’ technique is a very
useful method for computing certain expectations and probabilities associated
with Markov chains. You have already seen how to use this idea to compute
probabilities. Here we’ll use the same idea to compute expectations that would
otherwise be tricky to calculate.
The main tool will be the use of the law of conditional (iterated) expectation to
find expectations of interest:
𝐸[𝑋] = 𝐸 [ 𝐸[𝑋|𝑌 ] ]
An example
‘Roads connecting three locations in a city are shown in the following diagram,
the time taken to cycle along any of the roads being 1 minute. Geraint is a
bicycle courier whose office is located at location A. He has a parcel that he
needs to deliver to location C. However, he is a law abiding cyclist who will only
follow the direction of travel permitted on each road, as indicated by the arrows.
Whenever he comes to a junction, he selects any of the routes available to him
with probabilities as given in the diagram below.
• How long, on average, will it take Geraint to deliver the parcel?
• How many times, on average, will Geraint visit location B before delivering
the parcel?
Figure 7.1: State-space diagram
There are many ways of solving these questions, but a convenient method is to
use ‘first step decomposition’. The idea is the same as you saw in Section 4.3.1:
think about ‘where do we go next’?
It’s important that you can ‘spot’ the types of situations where first step decomposition can be applied.
Example 7.1. How long, on average, will it take Geraint to deliver
the parcel?
We’ll apply first step decomposition to solve this question. Let 𝑋𝑛 denote
Geraint’s location at time 𝑛, and let 𝑇 denote the time it takes for Geraint to
deliver the parcel to location C.
We need to calculate 𝐸[𝑇 ], and we will do this by conditioning on Geraint’s
next move. In the first instance, he starts in location A, so 𝑋0 = 𝐴. Using the
iterated law of conditional expectation, we get:
𝐸[𝑇 ] = 𝐸[𝐸[𝑇 |𝑋1 , 𝑋0 = 𝐴]] = 𝐸[𝑇 |𝑋1 = 𝐵, 𝑋0 = 𝐴]𝑝𝐴𝐵 +𝐸[𝑇 |𝑋1 = 𝐶, 𝑋0 = 𝐴]𝑝𝐴𝐶 .
Substituting in the relevant transition probabilities we get:
𝐸[𝑇 ] = (2/3) 𝐸[𝑇 |𝑋1 = 𝐵, 𝑋0 = 𝐴] + (1/3) 𝐸[𝑇 |𝑋1 = 𝐶, 𝑋0 = 𝐴].
We don’t really need to condition on both 𝑋1 and 𝑋0 (remember that the
process is Markov!) so we can write:
𝐸[𝑇 ] = (2/3) 𝐸[𝑇 |𝑋1 = 𝐵] + (1/3) 𝐸[𝑇 |𝑋1 = 𝐶].
Now notice that 𝐸[𝑇 |𝑋1 = 𝐶] = 1 because we are conditioning on Geraint
reaching location C for the first time at time 1, so this simplifies to
𝐸[𝑇 ] = (2/3) 𝐸[𝑇 |𝑋1 = 𝐵] + 1/3.    (7.1)
Now compute 𝐸[𝑇 |𝑋1 = 𝐵] by applying first step decomposition again: from
B, where can Geraint go next?
𝐸[𝑇 |𝑋1 = 𝐵] = 𝐸[𝐸[𝑇 |𝑋2 , 𝑋1 = 𝐵]] = (1/2) 𝐸[𝑇 |𝑋2 = 𝐶, 𝑋1 = 𝐵] + (1/2) 𝐸[𝑇 |𝑋2 = 𝐴, 𝑋1 = 𝐵].
Now notice that 𝐸[𝑇 |𝑋2 = 𝐶, 𝑋1 = 𝐵] = 2 and 𝐸[𝑇 |𝑋2 = 𝐴, 𝑋1 = 𝐵] =
2 + 𝐸[𝑇 ] (the latter because Geraint used 2 minutes going to B and then back
to A, and once back in A it is as if the process starts over from scratch).
Therefore 𝐸[𝑇 |𝑋1 = 𝐵] = 1 + (1/2)(2 + 𝐸[𝑇 ]). Substituting this back in to (7.1),
we have:
𝐸[𝑇 ] = (2/3) 𝐸[𝑇 |𝑋1 = 𝐵] + 1/3 = (1/3) 𝐸[𝑇 ] + 5/3.
Solving gives 𝐸[𝑇 ] = 2.5 minutes.
You may want to consider simplifying the notation of the solution to Example 7.1
above, by defining new quantities. For example, let 𝐿𝑖 denote the expected time
to reach C, given that we start in state 𝑖. Then, the solution above simplifies to
𝐿𝐴 = 1 + (2/3) 𝐿𝐵 + (1/3) 𝐿𝐶 .    (7.2)
We must add 1 since we are using one minute to move from state A to another
state. Now we know that 𝐿𝐶 = 0 (if we start in C, it takes no time to get
there!), and 𝐿𝐵 = 1 + 𝐿𝐴 /2 + 𝐿𝐶 /2 = 1 + 𝐿𝐴 /2. Therefore, substituting into
(7.2), we need to solve
𝐿𝐴 = 1 + (2/3)(1 + 𝐿𝐴 /2) = (1/3) 𝐿𝐴 + 5/3.
Solving gives 𝐿𝐴 = 2.5 minutes, as before.
Note that exactly the same technique – first step decomposition – can also be
applied to solve problems for continuous time Markov processes (which we will
cover later in the course).
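The simplified system in 𝐿𝐴 and 𝐿𝐵 can also be handed to a linear solver. The following is a sketch (not part of the notes); the transition probabilities used — 2/3 and 1/3 out of A, 1/2 and 1/2 out of B — are those implied by the worked numbers, read off Figure 7.1.

```python
import numpy as np

# First-step decomposition for the expected delivery time, as a linear
# system (L_C = 0 has been substituted out):
#   L_A = 1 + (2/3) L_B
#   L_B = 1 + (1/2) L_A
# Assumed probabilities (from the diagram, as the worked solution implies):
# A -> B w.p. 2/3, A -> C w.p. 1/3, B -> A w.p. 1/2, B -> C w.p. 1/2.
A = np.array([[1.0, -2.0 / 3.0],
              [-0.5, 1.0]])
b = np.array([1.0, 1.0])
L_A, L_B = np.linalg.solve(A, b)
print(L_A, L_B)  # approximately 2.5 and 2.25 minutes
```

Writing the hitting-time equations in matrix form like this scales directly to larger state spaces, where substitution by hand becomes tedious.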
7.2 How can I tell if a process is Markov?
You will come across many ICA and exam questions asking you to explain
whether or not a particular process is Markov. There are two ways that you
could establish whether a process is Markov.
1. Without using any mathematics, deduce from the description of the process that it is/ isn’t Markov;
2. Checking (mathematically) that the Markov property holds for all states
in the state space at all times.
The second strategy is particularly hard, so we mainly use this technique when
we suspect that the Markov property does not hold, by finding a counterexample
which does not satisfy the Markov property. To show that the Markov property
does not hold, find one set of states 𝑎, 𝑏, 𝑐 and perhaps 𝑑 such that one of the
following holds (you do not need to show both hold – one will do!):
• 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑐) ≠ 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏) for some 𝑎, 𝑏, 𝑐;
• 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑐) ≠ 𝑃 (𝑌𝑛+1 = 𝑎|𝑌𝑛 = 𝑏, 𝑌𝑛−1 = 𝑑) for
some 𝑎, 𝑏, 𝑐, 𝑑 with 𝑐 ≠ 𝑑….
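As a hypothetical illustration of the counterexample strategy (not one of the course examples): let 𝑋𝑛 be i.i.d. fair 0/1 coin flips and 𝑌𝑛 = 𝑋𝑛 + 𝑋𝑛−1. Enumerating the equally likely flip sequences shows that the law of 𝑌𝑛+1 given 𝑌𝑛 = 1 changes when we also condition on 𝑌𝑛−1, so {𝑌𝑛 } is not Markov.

```python
from itertools import product

def cond_prob(y_prev, y_now, y_next):
    """P(Y3 = y_next | Y2 = y_now, Y1 = y_prev), where Y_n = X_n + X_{n-1}
    and X_0, ..., X_3 are i.i.d. fair 0/1 flips, by exact enumeration."""
    hits = total = 0
    for x in product((0, 1), repeat=4):  # all 16 equally likely sequences
        y1, y2, y3 = x[0] + x[1], x[1] + x[2], x[2] + x[3]
        if (y1, y2) == (y_prev, y_now):
            total += 1
            hits += (y3 == y_next)
    return hits / total

# Same present state Y2 = 1, different histories, different answers:
print(cond_prob(0, 1, 2))  # 0.5 (Y1 = 0 forces X2 = 1)
print(cond_prob(2, 1, 2))  # 0.0 (Y1 = 2 forces X2 = 0)
```

Exhibiting one such pair of unequal conditional probabilities (here with 𝑎 = 2, 𝑏 = 1, 𝑐 = 0, 𝑑 = 2) is exactly the second bullet point above.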
