Three years ago, I was asked to help analyze the kinetic data for another master’s student’s thesis project. The goal was to fit a 3-state Markov model to a set of observations. However, the data was not enough so the dwell time distribution fits badly. At that time, I had probably just learned about the concept of “censored observation”, so I worked on how to recover the information from the censored data as well. We ended up using the msm package that directly estimates the transition rates from state time series. It seemed to work, but I had not justified that choice. Recently, it seems that work is getting prepared for submission, so I found myself revisiting the topic. The findings are surprising, so I wrote them down.

Some background

cyclic_3_state
Consider a reversible reaction of three states. Chemists usually describe the reaction kinetics with the concentration of each component as continuous functions of time. However, a molecule can only exist as one of the components, so the reaction is more like a jump process between discrete states.

In single-molecule studies, it is generally assumed that the jump process is a continuous-time homogeneous Markov chain. I think this is because (1) a Markov chain is a simple stochastic process with nice mathematical properties and (2) Markov chains are memoryless, which is very similar to what we expect from a (pseudo) first-order reaction. This model can be specified by one transition rate matrix \(Q\), with its non-diagonal elements being the transition rate between two states. For a 3-state model,

\[ Q = \begin{pmatrix} -(q_{12}+q_{13}) & q_{12} & q_{13} \\ q_{21} & -(q_{21}+q_{23}) & q_{23} \\ q_{31} & q_{32} & -(q_{31}+q_{32}) \end{pmatrix}, \]

there are 6 parameters. Let the probability of finding a molecule in state \(i\) at time \(t\) be \(\pi_i(t)\), then the row vector \(\pi\) and its rate of change is related by \(Q\).

\[ \left. \frac{\mathrm{d}\pi(\tau)}{\mathrm{d}\tau}\right|_{\tau=t} = \pi(t) Q \]

The transition rates \(q_{ij}\) are usually thought to be the (pseudo) first-order rate constants from state \(i\) to \(j\). Therefore, the goal of single-molecule kinetic analyses is usually to estimate \(Q\) from some time series.

The actual states are generally not directly observed. The experimental observation could be thought of as a (noisy) function of the current state, and there’s should a step that obtains discrete state time series from the noisy data, usually using a discrete-time hidden Markov model. I assume the state time series are obtained in the following discussion.

Traditional dwell time analysis

The dwell times \(T_{ij}\) are the duration between two adjacent transitions if the first transition enters state \(i\) and the second transition leaves \(i\) and enters state \(j\). For example, we have three dwell times in the following observation.

The property of the Markov chain implies that the distribution of \(T_{ij}\) is an exponential distribution with rate \(q_{ij}\). Therefore, the transition rates could be estimated from state time series by collecting all the dwell times and taking the inverse of their arithmetic mean.

However, the problem with this estimator is that the first and the last intervals are not included. Because we only observe one transition for these two intervals, we only have the lower bound of the dwell time, this is called right censoring.

The first interval can be included in the maximum-likelihood estimator of the exponential distribution, but the last interval cannot because we do not know the final state. The information is lost from the analysis. This was why I started trying the msm analysis at that time.

Directly estimate the transition rates

The msm package estimated the transition rates directly from the time series. Given the transition rate matrix \(Q\), it is known that the transition probability matrix \(P(t)\) for time interval \(t\) is the matrix exponential

\[ P(t) = e^{Qt} \]

We can construct the likelihood function of our data if we plug the time intervals between each observed state. More formally, assume that the state time series of molecule \(n\) is observed at \(t_n = (t_{n1}, t_{n2}, \ldots, t_{nL_{n}})\) and can be written as \(s_n = (s_{n1}, s_{n2}, \ldots, s_{nL_{n}})\). The likelihood of the data is then

\begin{equation} L(Q) = \prod_{n = 1}^{N} \prod_{i = 1}^{L_n-1} P_{s_{ni}s_{n(i+1)}}\left(t_{n(i+1)}-t_{ni}\right) \end{equation}

By maximizing the likelihood function numerically, we can find the \(Q\) that best describes the data.

Comparing the two methods

To compare the two estimators, we need to know their sampling distribution, in other words, how the estimates vary for random samples. I set up a model with the following transition rate matrix,

\[ Q = \begin{pmatrix} -0.0274 & 0.025 & 0.0024 \\ 0.01 & -0.017 & 0.007 \\ 0.00085 & 0.021 & -0.02185 \end{pmatrix} \]

which is similar to one experimental result. For each estimate, I generated 200 state time series, assuming the state at time 0 s is all 1 and we observe every 0.05 s during the interval [20 s, 170 s]. The transition rates were estimated with both methods for 1000 such datasets. Then the distributions of the estimates are plotted with the Gaussian kernel density estimator, the bandwidth of which was estimated using the Sheather-Jones method. The source code can be found here.

The msm estimates center around the ground truth, but the peak of the dwell time estimates are generally off to one side. This means, on average, the dwell time estimator differs more from the truth than the msm estimator. We say that the dwell time estimator is more biased.

If we increase the observed duration from 150 s to 600 s, the bias generally decreases. So, the dwell time estimator might perform well given more data. However, the observation duration for single-molecule fluorescence methods is limited by photobleaching, so it would be unlikely to collect data up to 600s. Simply increasing the number of observed molecules may work, but that would be very tedious.

Possible reasons for the bias

Why does the dwell time estimator not perform as well as we expect? Here are my speculations.

As previously explained, dwell time analysis cannot handle the interval after the last transition in the time series, resulting in data loss. This might not be the only reason. We can handle that interval for two-state models, but the bias is also present for a two-state model.

Another reason might be that the transition points are not observed. We observe the state at determined time points. We know transition happened if the states are different for two adjacent observations, but we do not know the exact time point of this transition. The actual transitions are interval censored. For dwell time analysis, I have to assume the transition always happens exactly at the mid-point of the two adjacent observations. This assumption will introduce a small error in the dwell times. The likelihood function we constructed already considers interval censoring, so this assumption is not needed.

Lastly, the fraction of molecules in each state is ignored by the dwell time estimator. Because the transition probability matrix can be derived from the transition rate matrix, the number of molecules in each state at any time point contains the rate information. Also, the number of the observed transitions depends on the state occupancy, so they contain information on the rates as well. By grouping transitions by their initial and final state and analyzing each group separately, the dwell time analysis discards this information, while the msm estimator includes the data as a whole.

Conclusion

Dwell time analysis has been widely used in single-molecule kinetic studies. However, likely due to the loss of some data and the splitting of the data, it might not perform as well as directly estimating the rates from the time series. Researchers might benefit from comparing and reevaluating different analysis methods.