Non-monotonicity of Fano factor in a stochastic model for protein expression with sequesterisation at decoy binding sites

Abstract—We present a stochastic model motivated by gene expression which incorporates unspecific interactions between proteins and binding sites. We focus on characterizing the distribution of free (i.e. unbound) protein molecules in a cell. Although it cannot be expressed in a closed form, we present three different approaches to obtain it: stochastic simulation algorithms, a system of ODEs, and a quasi-steady-state solution. Additionally, we use a large-system-size scaling to derive statistical measures of the approximate distribution of the amount of free protein, such as the Fano factor. Intriguingly, we report that while in the absence or in the excess of decoy binding sites the Fano factor is equal to one (suggestive of Poissonian fluctuations), in the intermediate regimes of moderate levels of binding sites the Fano factor is greater than one (suggestive of super-Poissonian fluctuations). We support and illustrate all of our results with numerical simulations.


I. INTRODUCTION
The numbers of proteins and other species involved in biological processes inside cells, such as gene expression, are usually small [6], [28]. Deterministic modeling of such reactions can therefore be quite inaccurate, and we often turn to stochastic methods [18]. These account for the discrete number of molecules and can easily be simulated through stochastic simulation algorithms, in particular the Gillespie algorithm [10], [11]. Being a very timely topic, gene expression has spurred a revival of interest in Markovian models of chemical kinetics, e.g. [25]. We assume that the protein is produced at a constant rate and that the rate of its decay is proportional to the number of proteins. We study the protein dynamics in the presence of so-called decoy binding sites [17] on the DNA; our model takes into account protein binding/unbinding reactions with these binding sites. Similar models have been studied previously; in particular, [9] investigated a model with protected complexes, i.e. the case when bound proteins were immune to degradation, showing that the steady-state distribution is Poissonian. Our model allows bound proteins to degrade, which introduces additional noise into the model [2], [3]. For simplicity, we ignore the effects of burst-like protein synthesis and transcriptional auto-regulation [2], [4], [26]. In section II we formulate the Master equation for the probability distribution of free protein; unfortunately, we are unable to find its solution in a closed form. However, biochemical reactions often operate on different timescales, as has been thoroughly investigated in works such as [14], [12], [13]. Specifically, in the current context the interactions between the protein and its binding sites occur on a substantially faster timescale than the turnover of protein (by transcription and decay) does [1]. This allows us to successfully use singular perturbation methods [5], [22], [23] to obtain the quasi-steady-state solution to our problem.
Obtaining the quasi-steady-state approximations in our model involves finding an equilibrium of binding/unbinding reaction, which is a specific case of a reversible bimolecular reaction studied by Laurenzi in [16]. In section III we expand the Master equation using the linear noise expansion as proposed in [29].

II. STOCHASTIC APPROACH
The number of proteins expressed from a single gene can often be quite small [31]; it would therefore be inaccurate to use a deterministic approach, which treats reactants as continuous variables. Instead we take a stochastic approach, in which we model each species as a discrete random variable and each reaction as a random event with a given probability of occurring. Some protein species are present at even fewer than 10 copies per E. coli cell [28]; we therefore also work with mean protein numbers below 100 in this paper. As we use discrete variables, it is quite clear that we need to simulate these reactions through a simulation algorithm. Such algorithms gained wider recognition after the work of Gillespie [10].

A. Model Description
In this section we present a minimalistic model of gene expression. We neglect the effect of mRNA translation and the production of proteins in bursts; instead we focus on the interaction between the proteins and the decoy binding sites. Unlike in [26], we assume that bound proteins are subject to degradation processes.
Let us use the following notation for our variables:
- X : protein (free or bound),
- X_f : free protein,
- Y : binding site,
- Y_f : free binding site,
- C : complex (a protein bound to a binding site).
For the sake of simplicity we omit 'decoy' from the binding-site notation, as we do not take into account any other binding sites. We assume that three reactions can take place:
1) Protein production/decay.
2) Protein binding/unbinding.
3) Decay of the complex (whereby a binding site is vacated).
We use upper-case letters in italics to represent the number of molecules of the corresponding species throughout this paper, and reserve the corresponding lower-case letter as notation for the concentration of a given species. In order to avoid confusion with X, we use N instead of X_f as the number of free protein molecules.
Although we mentioned five different variables, the problem is just two-dimensional, with the following straightforward conservation laws holding between the variables:
- Y is a known constant (the number of binding sites),
- C = X − N (the number of complexes is the same as the number of bound proteins),
- Y_f = Y − C (the number of free binding sites is the same as the number of all binding sites without the complexes).
Therefore we can express the Master equation of the system in terms of X and N (with a constant total number of binding sites Y):

Ṗ_{X,N} = k (P_{X−1,N−1} − P_{X,N}) + γ [(N+1) P_{X+1,N+1} − N P_{X,N}] + γ [(X−N+1) P_{X+1,N} − (X−N) P_{X,N}] + k_+ [(N+1)(Y−X+N+1) P_{X,N+1} − N(Y−X+N) P_{X,N}] + k_− [(X−N+1) P_{X,N−1} − (X−N) P_{X,N}],   (1)

where P_{X,N} is the abbreviation of P(X(t) = X, X_f(t) = N). We can use the Gillespie algorithm to simulate this process (the parameters of the reactions are set to Y = 10, k = 3, γ = 0.1, k_+ = 1, k_− = 10, with no proteins at the beginning: P_{X,N}(0) = δ_{X,0} δ_{N,0}). We illustrate the individual components of the process in Figure 1.
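As an illustration of how such a simulation proceeds, the following sketch (plain Python; the function name is ours) implements the direct-method Gillespie algorithm for the five reaction channels of the model, with the parameter values quoted above.

```python
import random

def gillespie_trajectory(T, Y=10, k=3.0, gamma=0.1, kp=1.0, km=10.0, seed=None):
    """Direct-method Gillespie simulation of the decoy-binding-site model.

    State: X = total protein, N = free protein, C = X - N complexes.
    Returns (X, N) at time T, starting with no protein.
    """
    rng = random.Random(seed)
    X, N, t = 0, 0, 0.0
    while True:
        C = X - N
        a = [k,                  # production:           X+1, N+1
             gamma * N,          # free-protein decay:   X-1, N-1
             gamma * C,          # bound-protein decay (site vacated): X-1
             kp * N * (Y - C),   # binding:              N-1
             km * C]             # unbinding:            N+1
        a0 = sum(a)
        t += rng.expovariate(a0)         # time to the next reaction
        if t > T:
            return X, N
        r = rng.uniform(0.0, a0)         # which reaction fires
        if r < a[0]:
            X += 1; N += 1
        elif r < a[0] + a[1]:
            X -= 1; N -= 1
        elif r < a[0] + a[1] + a[2]:
            X -= 1
        elif r < a[0] + a[1] + a[2] + a[3]:
            N -= 1
        else:
            N += 1
```

Averaging X over repeated trajectories at a time well beyond 1/γ approaches ⟨X⟩ = k/γ = 30, consistent with the Poissonian steady state derived below.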

B. Total Protein distribution
Our first goal is to obtain the marginal distribution of the total protein number. It can be obtained as the sum of P_{X,N} through all possible N's:
P_X = Σ_N P_{X,N},
where Σ_N abbreviates the sum through all integers. In order to derive an equation for P_X from (1), let us sum both sides of the equation through all N's. As the sum goes through all integers (if N < 0 the probability is naturally equal to zero), the binding and unbinding terms cancel, and the equation simplifies to
Ṗ_X = k (P_{X−1} − P_X) + γ ((X+1) P_{X+1} − X P_X).   (2)
A system of differential equations in this form can be solved using the method of generating functions [15]. First we multiply both sides by s^X and sum them through all integer X's:
Σ_X s^X Ṗ_X = k Σ_X s^X (P_{X−1} − P_X) + γ Σ_X s^X ((X+1) P_{X+1} − X P_X).   (3)
Now we can use the definition of the generating function G(s,t) = Σ_X s^X P_X and its derivative ∂G/∂s = Σ_X X s^{X−1} P_X in order to transform (3) into a partial differential equation:
∂G/∂t = k (s−1) G − γ (s−1) ∂G/∂s.   (4)
In the steady state we can omit the left-hand side of the equation (as ∂G/∂t = 0) and easily solve it using separation of variables. We come to the solution G(s,∞) = e^{(k/γ)(s−1)}, which can be recognized as the probability generating function of the Poisson distribution parametrized by λ = k/γ. Using the notation ⟨X⟩ = λ for the distribution's mean, we can write
P_X(∞) = ⟨X⟩^X e^{−⟨X⟩} / X!.
Away from the steady state, (4) is an example of a nonhomogeneous first-order linear partial differential equation, which can be solved for suitable initial conditions. Using a common extension of the method of characteristics for (quasi-)linear PDEs (see e.g. [8]), we introduce an auxiliary function u = u(s, t, G), which satisfies
∂u/∂t + γ (s−1) ∂u/∂s + k (s−1) G ∂u/∂G = 0.
The characteristic system of this equation has the form
ṫ = 1,  ṡ = γ (s−1),  Ġ = k (s−1) G.
By finding the functions which are constant on the characteristics, we obtain solutions in the form G(s,t) = ψ(t − ln(s−1)/γ) · e^{ks/γ}. We can specify the results for any particular initial conditions. Let us investigate the case when there is no protein at the beginning, i.e. P(x, 0) = δ_{x,0}, which gives the initial condition G(s, 0) = 1 for the generating function as well. Using the initial condition we come to the solution
G(s,t) = e^{−(k/γ)(e^{−γt}(s−1)+1) + ks/γ} = e^{(k/γ)(s−1)(1−e^{−γt})},
which is again the Poisson distribution, just with a different, time-dependent, parameter, meaning that ⟨X⟩(t) = (k/γ)(1 − e^{−γt}). Taking t → ∞ we can easily see that the time-dependent solution converges to the steady-state one.
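The time-dependent Poisson solution can be cross-checked numerically by integrating a truncated version of equation (2); the sketch below (Heun's method; the truncation level XMAX and the step size are ad-hoc choices) compares the result at t = 10 with the predicted Poisson distribution.

```python
import math

k, gamma = 3.0, 0.1
XMAX = 120                      # ad-hoc truncation of the state space

def poisson_pmf(lam, X):
    return math.exp(-lam + X * math.log(lam) - math.lgamma(X + 1))

def rhs(P):
    """Right-hand side of the truncated equation (2)."""
    dP = [0.0] * (XMAX + 1)
    for X in range(XMAX + 1):
        dP[X] = -(k + gamma * X) * P[X]
        if X > 0:
            dP[X] += k * P[X - 1]
        if X < XMAX:
            dP[X] += gamma * (X + 1) * P[X + 1]
    return dP

def integrate_total_protein(t_end, dt=2e-3):
    """Heun (explicit trapezoid) integration starting from P(0) = delta_{X,0}."""
    P = [0.0] * (XMAX + 1)
    P[0] = 1.0
    for _ in range(int(round(t_end / dt))):
        k1 = rhs(P)
        pred = [p + dt * d for p, d in zip(P, k1)]
        k2 = rhs(pred)
        P = [p + 0.5 * dt * (d1 + d2) for p, d1, d2 in zip(P, k1, k2)]
    return P

P = integrate_total_protein(10.0)
lam = (k / gamma) * (1.0 - math.exp(-gamma * 10.0))   # predicted Poisson parameter
err = max(abs(P[X] - poisson_pmf(lam, X)) for X in range(XMAX + 1))
```

The maximal pointwise difference between the integrated distribution and the Poisson prediction stays at the level of the integration error.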

C. Free Protein distribution
Searching for an exact formula for the free protein distribution is less fruitful: the master equation (1) does not have a closed-form solution (unless Y = 0). Therefore we have to look for alternative ways to obtain it.

1) Stochastic Simulation Algorithms:
The first approach uses the techniques of stochastic simulation algorithms, in particular the Gillespie algorithm (see [7] for a practical guide). With their help we can simulate the time evolution of all species involved in a system of reactions. The simulation technique is based upon the following scheme: we generate two uniformly distributed random numbers, the first of which determines the time to the next reaction and the second of which determines which reaction occurs. With a sufficient number of sample trajectories, we obtain a robust estimate of the distribution of any variable. The main drawback of this approach is its computational cost: in order to obtain a robust distribution of the variables we have to run a substantial number of simulations, which can be very time-consuming, and even unrealistic when calculating results for many different parameter sets.
2) System of ODEs: The second option is to transform the Master equation, which is an infinite system of equations, into a finite system of ordinary differential equations. The most straightforward way to obtain this is to set threshold values for the discrete variables, replacing the unknown probability by zero whenever these thresholds are exceeded (see [20]; in our case we set all P_{X,N} to zero whenever X > 100). In this way, the overall sum of the probability distribution is no longer conserved at one, but in return we get a finite system of differential equations. We use the MATLAB ode15s solver for stiff problems to calculate the solution. This approach is also very time-consuming, as the number of equations and the computational difficulty grow rapidly as the thresholds for the non-zero probabilities P_{X,N} are raised.
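A sketch of how such a truncated right-hand side can be assembled (plain Python rather than MATLAB; the names are ours): for each admissible state (X, N) we collect the outflow and the inflows from the neighboring states, treating probabilities beyond the threshold as zero.

```python
Y, k, gamma, kp, km = 10, 3.0, 0.1, 1.0, 10.0
XMAX = 100   # truncation threshold: P[X][N] is treated as zero for X > XMAX

def masterequation_rhs(P):
    """d/dt of the truncated probability surface P[X][N] (0 <= N <= X <= XMAX)."""
    dP = [[0.0] * (XMAX + 1) for _ in range(XMAX + 1)]
    for X in range(XMAX + 1):
        for N in range(X + 1):
            C = X - N
            if C > Y:                    # more complexes than sites: impossible
                continue
            out = k + gamma * N + gamma * C + kp * N * (Y - C) + km * C
            dP[X][N] -= out * P[X][N]
            if X > 0 and N > 0:          # production from (X-1, N-1)
                dP[X][N] += k * P[X - 1][N - 1]
            if X < XMAX:                 # free-protein decay from (X+1, N+1)
                dP[X][N] += gamma * (N + 1) * P[X + 1][N + 1]
            if X < XMAX:                 # bound-protein decay from (X+1, N)
                dP[X][N] += gamma * (X + 1 - N) * P[X + 1][N]
            if N < X:                    # binding from (X, N+1)
                dP[X][N] += kp * (N + 1) * (Y - (X - N - 1)) * P[X][N + 1]
            if N > 0:                    # unbinding from (X, N-1)
                dP[X][N] += km * (X - N + 1) * P[X][N - 1]
    return dP
```

Starting from all probability concentrated in (0, 0), the only reaction that can fire is production, so the derivative is −k at (0, 0) and +k at (1, 1), and total probability is conserved; this provides a quick sanity check of the assembly.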

3) Quasi-steady-state solution:
In order to obtain a solution in a closed form which is as close to the exact solution as possible, we proceed to obtain the so-called quasi-steady-state solution [24], [27]. We utilize the fact, rooted in the biological background, that binding/unbinding reactions are fast compared to protein production/degradation (k_− ≫ γ). In order to use this fact we perform a singular perturbation reduction [9] using the ratio γ/k_− as a small parameter. Let us introduce a new dimensionless time and parameters:
τ = k_− t,  ε = γ/k_−,  k_b = k_−/k_+.   (5)
The master equation then reads
dP_{X,N}/dτ = ε [(k/γ)(P_{X−1,N−1} − P_{X,N}) + (N+1) P_{X+1,N+1} − N P_{X,N} + (X−N+1) P_{X+1,N} − (X−N) P_{X,N}] + k_b^{−1} [(N+1)(Y−X+N+1) P_{X,N+1} − N(Y−X+N) P_{X,N}] + (X−N+1) P_{X,N−1} − (X−N) P_{X,N}.   (6)
As ε tends to zero, we obtain an equation for the leading-order approximation of P_{X,N}. The number of total protein X is constant in this approximation, so that we can abbreviate P_{X,N} by P_N, writing
(N+1)(Y−X+N+1) P_{N+1} − N(Y−X+N) P_N + k_b [(X−N+1) P_{N−1} − (X−N) P_N] = 0.   (7)
Solving (7) subject to boundary conditions (see e.g. [16]) yields
P_N = C(X) k_b^N / (N! (X−N)! (Y−X+N)!),  max(0, X−Y) ≤ N ≤ X,   (8)
where C(X) is a constant with respect to N, dependent only on the value of X.
The multiplicative constant C(X) can be determined from the normalization Σ_N P_N = 1, and thus
C(X) = (Σ_N k_b^N / (N! (X−N)! (Y−X+N)!))^{−1}.
Since we already know about the Poissonian character of the steady-state solution for P_X, we are ready to calculate the distribution of the number of free protein in the quasi-steady state:
P_N = Σ_X (⟨X⟩^X e^{−⟨X⟩} / X!) · C(X) k_b^N / (N! (X−N)! (Y−X+N)!).   (9)
This represents a major improvement in terms of the computational requirements necessary to get the distribution of the free protein amount.
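The resulting distribution is straightforward to evaluate numerically; in the sketch below (the helper names are ours; the conditional weights are normalized in logarithmic space to avoid factorial overflow) the conditional distribution for each X is mixed over the Poisson weights of X, as in (9).

```python
import math

def poisson_pmf(lam, X):
    return math.exp(-lam + X * math.log(lam) - math.lgamma(X + 1))

def qss_free_protein(Xmean, Y, kb, XMAX=200):
    """Quasi-steady-state distribution of the free protein number N:
    the conditional weights kb^N / (N! (X-N)! (Y-X+N)!) are normalized
    for each X and mixed over the Poisson weights of X."""
    PN = [0.0] * (XMAX + 1)
    for X in range(XMAX + 1):
        pX = poisson_pmf(Xmean, X)
        lo = max(0, X - Y)
        logw = [N * math.log(kb) - math.lgamma(N + 1) - math.lgamma(X - N + 1)
                - math.lgamma(Y - X + N + 1) for N in range(lo, X + 1)]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        Z = sum(w)
        for N, wN in zip(range(lo, X + 1), w):
            PN[N] += pX * wN / Z
    return PN

def fano(P):
    mean = sum(N * p for N, p in enumerate(P))
    var = sum((N - mean) ** 2 * p for N, p in enumerate(P))
    return var / mean
```

With Y = 0 the mixture collapses to the Poisson distribution of section II-B (Fano factor one), while with Y = 10, k_b = 10 and ⟨X⟩ = 30 the Fano factor exceeds one, the super-Poissonian effect discussed in section II-D.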
In order to illustrate the quality of this approximate solution, let us investigate the differences between the simulated distribution (by the Gillespie algorithm) and the approximate quasi-steady state (9) for selected values of ε^{−1} = k_−/γ, which is assumed to be large as justified above. We assumed the following parameter values: Y = 10; k = 3, γ = 0.1 ⇒ ⟨X⟩ = 30; k_+ = a, k_− = 10a ⇒ k_b = 10. We vary the value of a in order to change the value of ε^{−1} while keeping k_b fixed; thus the distribution obtained from (9) remains the same. The results are shown in Figure 2, where the green histogram is estimated from repeated simulations (10^5 trajectories) of the Gillespie algorithm on a sufficiently long timescale, and the blue line is the quasi-steady-state probability distribution of free protein (9). It is easy to see that as k_−/γ decreases, the quasi-steady-state approximation diverges from the exact results estimated by the Gillespie histogram. To evaluate the goodness of fit, we calculate the ℓ² distance between these two realizations of the free protein distribution for all N from 0 to 100 (see TABLE I).

D. Moments of free protein distribution
In this section we focus on the analysis of the free protein probability distribution in the quasi-steady state. We focus on four basic statistical characteristics: the mean (Figure 3), the variance, the Fano factor F = σ²/µ (Figure 4) and the squared coefficient of variation CV² = σ²/µ². We use the number of binding sites as a free parameter, which we vary along the horizontal axis; we sometimes use the abbreviation BS instead of binding site. We study the change in these characteristics for selected choices of the total protein production ⟨X⟩ and the dissociation constant k_b. The values γ = 0.1 and k_− = 10 are fixed; different choices of k and k_+ are used to modify ⟨X⟩ and k_b.
Without any binding sites (Y = 0) the distribution is Poissonian, so that µ = σ² and thus F = 1. Very large dissociation constants imply that proteins bind weakly to the binding sites, and the system again exhibits characteristics consistent with the Poisson distribution. It is apparent that for large values of Y, the mean and variance tend to 0, the Fano factor tends to 1 and CV² diverges to infinity. The convergence of the Fano factor to one suggests an approximately Poissonian distribution in an excess of decoy binding sites. Below, we show with asymptotics that this is indeed the case.
Provided that Y ≫ X, we can use the approximation for the factorial
(Y−X+N)! ≈ Y! · Y^{N−X}.
Substituting this into (9) and using the binomial theorem and basic algebraic manipulation, we obtain a simplified equation for the probability distribution of the free protein. This formula can be rewritten as Σ_X P_X · P_{N|X}, where P_X has a Poisson distribution with parameter ⟨X⟩, and P_{N|X} has a binomial distribution with the probability of a Bernoulli trial given by k_b/(k_b+Y) and the number of trials given by X. Therefore, we can express the free protein number as a random sum
N = Σ_{i=1}^{X} ξ_i,  where P(ξ_i = 1) = k_b/(k_b+Y).
After some algebraic manipulation with probability generating functions (such as the rule for calculating the probability generating function of a random sum of independent random variables) we arrive at
G_N(s) = G_X(G_ξ(s)).   (10)
It is clear now that the probability generating function for free protein is given as a composition of the generating functions G_X (Poisson distribution) and G_ξ (Bernoulli distribution). Substituting the well-known formulae for the generating functions of these distributions into (10), we obtain
G_N(s) = e^{⟨X⟩ k_b/(k_b+Y) (s−1)},   (11)
which again indicates the Poisson distribution, but with a different parameter
⟨N⟩ = ⟨X⟩ k_b/(k_b+Y).   (12)
Computing the probability distribution in the quasi-steady state for large numbers of binding sites is infeasible, as the formula works with terms of the order of Y!; therefore, we use the Gillespie algorithm (with 10^5 repetitions) to estimate the mean and the variance of the distribution. The results are provided in TABLE II (the other parameters used are ⟨X⟩ = 10 and k_b = 10). As the distribution is Poissonian, the Fano factor tends to one and CV² = 1/⟨N⟩, which tends to infinity.
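The thinning argument can be verified numerically: mixing binomial conditionals over a Poisson-distributed X reproduces a Poisson distribution with parameter ⟨X⟩ k_b/(k_b + Y). The sketch below uses the illustrative values k_b = Y = 10 (so the success probability is 1/2) and ⟨X⟩ = 30.

```python
import math

def poisson_pmf(lam, n):
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

Xmean, kb, Y = 30.0, 10.0, 10.0
p = kb / (kb + Y)          # success probability of the Bernoulli thinning

# law of total probability: P(N) = sum_X Poisson(X) * Binomial(N | X, p)
pN = [sum(poisson_pmf(Xmean, X) * math.comb(X, N) * p ** N * (1 - p) ** (X - N)
          for X in range(N, 300))
      for N in range(60)]

predicted = [poisson_pmf(Xmean * p, N) for N in range(60)]
err = max(abs(a - b) for a, b in zip(pN, predicted))
```

The mixed distribution agrees with the Poisson prediction to within floating-point accuracy.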

E. Comparison of the methods
In this section we compare the probability distributions generated by the three methods mentioned earlier: stochastic simulation by the Gillespie algorithm (referred to as Gill in TABLE III), transforming the master equation into a finite system of ODEs (ODE in TABLE III) and calculating the solution in the quasi-steady state (Quasi in TABLE III). Obviously we use the same set of parameters in all three approaches: Y = 10, k = 3, k_+ = 1, k_− = 10 and γ = 0.1. Initially, the system is set to contain no proteins: P_{X,N}(0) = δ_{X,0} δ_{N,0}. The output of each method is the probability surface P_{X,N} at all given time points. In TABLE III we record the differences in the probability surfaces obtained by the three methods. We define the distance between two methods as the ℓ²-matrix (Frobenius) norm of the difference of the two probability surfaces, i.e.
d(P, Q) = (Σ_{X,N} (P_{X,N} − Q_{X,N})²)^{1/2}.

Fig. 4: Fano factor of free protein probability distribution

III. SMALL NOISE APPROXIMATION
In the second part of this paper we focus on the limiting case when the size Ω of the system is large, and we try to obtain the statistical moments of the free protein distribution in a closed form for this case. The noise in the free protein distribution has two sources; the first one is the reversible association of protein and binding site and the second one is new protein production/degradation. Let us first concentrate on the reversible association
X_f + Y_f ⇌ C.
In the course of this reaction, we can treat the values of x (total protein concentration) and y (total binding-site concentration) as constants. We would like to underscore the fact that lowercase letters denote the concentration of the reactant denoted by the corresponding uppercase letter.

A. Deterministic case
If we assume large protein numbers, we can obtain the mean of the distribution by finding the stationary state of the deterministic reaction kinetics. This approach has been used since the inception of chemical kinetics [19]. We know that in the stationary state (see e.g. [21]) we have n · y_f = c, where the concentrations n of free protein, y_f of free binding sites, and c of their complex are measured in units of the dissociation constant. If we combine this equation with the conservation laws, n + c = x for the total protein and y_f + c = y for the total binding-site concentration, we obtain a system of three equations with three unknown variables. This system has a single non-negative solution, which we refer to as n̄, ȳ_f, c̄. We can obtain it by solving the quadratic equation
n̄² + (y − x + 1) n̄ − x = 0.   (13)
For the sake of parsimony, we omit the explicit notation of the dependence of c̄, n̄, ȳ_f on the total concentrations x and y in all non-ambiguous cases.
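Concretely, substituting c̄ = x − n̄ and ȳ_f = y − x + n̄ into n̄ ȳ_f = c̄ gives the quadratic n̄² + (y − x + 1)n̄ − x = 0, whose unique non-negative root can be computed directly (a small sketch; the function name is ours):

```python
import math

def steady_state(x, y):
    """Deterministic steady state: the non-negative root of
    n^2 + (y - x + 1) n - x = 0, plus the derived concentrations
    (all in units of the dissociation constant)."""
    b = y - x + 1.0
    n = (-b + math.sqrt(b * b + 4.0 * x)) / 2.0   # unique non-negative root
    c = x - n                                      # complex concentration
    yf = y - c                                     # free binding sites
    return n, yf, c
```

The returned triple satisfies the mass-action relation n · y_f = c and both conservation laws by construction.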

B. Stochastic component
As we mentioned before, we were not able to obtain the distribution of free protein in a closed form; therefore we have to choose an alternative approach. The number of free protein molecules is affected by the binding and unbinding reactions as well as by the production/degradation of new proteins. As the timescales of these two processes are diametrically different, we can treat them separately. Let us divide the stochastic component into the part corresponding to binding/unbinding and the part corresponding to total protein production/degradation.
1) Constant X (total protein count): First we assume that the number of total protein remains constant, focusing only on the binding/unbinding reactions:
X_f + Y_f ⇌ C.
Using the shift-operator notation from [29] (E^i f(x) = f(x+i)), we can write down the Master equation in the steady state for this reaction as
0 = k_− (E^{+1} − 1)[C P_C] + k_+ (E^{−1} − 1)[X_f Y_f P_C],   (14)
where P_C denotes the probability mass function of C, X_f = X − C and Y_f = Y − C, and the shifting is executed with respect to C. We further assume that the size Ω of the system is large, whereby we identify the system size with the dissociation constant, Ω = k_−/k_+. This will naturally lead to the macroscopic dissociation constant being equal to unity after the system-size reduction is performed. Now we can write down the relation between the total species counts (uppercase letters in italics) and their corresponding concentrations (lowercase letters): N = Ω·n, Y_f = Ω·y_f and C = Ω·c. Inserting this scaling into (14) and dividing by k_− Ω yields
0 = (E^{+1} − 1)[c P] + (E^{−1} − 1)[n y_f P],   (15)
where the shift operators now act on c in steps of ±Ω^{−1}. Using a Taylor expansion together with the fact that C is large (i.e. Ω^{−1} is small), the shift operators can formally be expanded as follows:
E^{±1} = 1 ± Ω^{−1} ∂/∂c + (1/2) Ω^{−2} ∂²/∂c² + … .   (16)
Inserting (16) into (15) and neglecting terms of order higher than Ω^{−2}, we obtain
0 = ∂/∂c [(c − n y_f) P] + (1/(2Ω)) ∂²/∂c² [(c + n y_f) P].   (17)
If we now integrate this equation with respect to c (with zero-flux conditions) we come to
(c − n y_f) P + (1/(2Ω)) d/dc [(c + n y_f) P] = 0.
We denote by A = (c − n y_f) and B = c + n y_f the terms which appear in the above equation.
Since Ω^{−1} ≪ 1, we can use a small-noise approximation (the variance of c is of order Ω^{−1}), which is based on Taylor-expanding A and B around the deterministic mean value c̄:
A(c) ≈ (1 + n̄ + ȳ_f)(c − c̄),  B(c) ≈ 2c̄.   (18)
Substituting the approximations (18) into (17) yields
(1 + n̄ + ȳ_f)(c − c̄) P + (c̄/Ω) dP/dc = 0.   (19)
This equation can be easily solved by separation of variables, which yields
P(c) ∝ exp(−Ω (1 + n̄ + ȳ_f)(c − c̄)² / (2c̄)).   (20)
We have found an approximate distribution of c; to put it in other words, we can say that
c ∼ N(c̄, c̄/(Ω(1 + n̄ + ȳ_f)))  as Ω → ∞.
As n and y_f can be computed directly from c as x − c, resp. y − c, they have the same variance as c.
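The Gaussian result (20) can be checked against the exact stationary distribution of the binding/unbinding chain, which follows from detailed balance: P_{C+1}/P_C = k_+ X_f Y_f/(k_−(C + 1)) = (X − C)(Y − C)/(Ω(C + 1)). The sketch below (illustrative values Ω = 100, x = 2, y = 1) compares the exact variance of C with the prediction Ω c̄/(1 + n̄ + ȳ_f).

```python
import math

Omega = 100                       # system size, Omega = k_- / k_+
x, y = 2.0, 1.0                   # total protein / binding-site concentrations
X, Y = int(x * Omega), int(y * Omega)

# exact stationary distribution of C from detailed balance, in log space
logP = [0.0]
for C in range(min(X, Y)):
    logP.append(logP[-1] + math.log((X - C) * (Y - C)) - math.log(Omega * (C + 1)))
m = max(logP)
w = [math.exp(l - m) for l in logP]
Z = sum(w)
meanC = sum(C * wi for C, wi in enumerate(w)) / Z
varC = sum((C - meanC) ** 2 * wi for C, wi in enumerate(w)) / Z

# LNA prediction: Var(C) = Omega^2 Var(c) = Omega * cbar / (1 + nbar + ybar_f)
b = y - x + 1.0
nbar = (-b + math.sqrt(b * b + 4.0 * x)) / 2.0
cbar = x - nbar
ybarf = y - cbar
varC_lna = Omega * cbar / (1.0 + nbar + ybarf)
```

At this system size the exact and predicted variances agree to within a few percent, as expected for an O(Ω^{−1}) approximation.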
2) Fluctuating X (total protein count): In the second part, instead of taking X to be a constant, we assume that the total protein count fluctuates due to the creation of new protein molecules and the decay of old ones, which is denoted by the reversible pair of chemical reactions
∅ ⇌ X.
We already know that these processes operate on a much slower timescale than the association/dissociation processes. Furthermore, we have already found the steady-state distribution of X (from II-B); the result is that it is Poissonian with ⟨X⟩ = Var(X) = k/γ. We use the system-size scaling k/γ = ⟨x⟩ · Ω.
As Ω → ∞, we can approximate the Poissonian distribution by a small-noise Gaussian distribution
x ∼ N(⟨x⟩, ⟨x⟩/Ω).   (21)
In order to calculate the statistics of n (free protein concentration) we have to combine the results (20) and (21), given that n is expressed in terms of the slowly fluctuating x (total protein concentration).
The free protein concentration also naturally depends on y (the total concentration of binding sites), but as it is constant throughout the whole process, we suppress it in our notation for the sake of simplicity. We can use the variance decomposition theorem in order to solve this problem (see [30]). It expresses the total (unconditional) variance as the sum of the expectation of the conditional variance and the variance of the conditional expectation; in our case, we can write
Var(n) = E(Var(n|x)) + Var(E(n|x)).   (22)
We already know (from (20)) the solution for E(n|x) and Var(n|x); we utilize the fact that x fluctuates much more slowly than n, and thus we can obtain results for n subject to constant x.
Using the large-Ω approximation, we can write
E(n|x) ≈ n̄(x),  Var(n|x) ≈ c̄(x)/(Ω(1 + n̄(x) + ȳ_f(x))).   (23)
Since the variance of x is of order Ω^{−1} and we neglect all terms of order higher than Ω^{−1}, we can use the approximations n̄(x) ≈ n̄(⟨x⟩) and ȳ_f(x) ≈ ȳ_f(⟨x⟩) in the formula for the conditional variance, which yields
E(Var(n|x)) ≈ c̄/(Ω(1 + n̄ + ȳ_f)),   (24)
evaluated at ⟨x⟩ (the mean total protein concentration) and y (the constant concentration of binding sites). The final term left to calculate is Var(n̄(x)).
If we used the approximation n̄(x) = n̄(⟨x⟩), we would end up with zero, which would incorrectly neglect all of this variance. Therefore we also include terms of the next order by using the linear variance approximation, i.e. by writing
n̄(x) ≈ n̄(⟨x⟩) + (dn̄/dx)(⟨x⟩) (x − ⟨x⟩).
From this expression we can find that
Var(n̄(x)) ≈ (dn̄/dx)² Var(x) = (dn̄/dx)² ⟨x⟩/Ω.   (25)
Hence, the last item to calculate is the derivative of n̄ with respect to x. In order to find it, let us differentiate the equations n̄ȳ_f = c̄, n̄ + c̄ = x, ȳ_f + c̄ = y, which define the dependence of n̄, among others, on x, with respect to the variable x; this yields
n̄'ȳ_f + n̄ȳ_f' = c̄',  n̄' + c̄' = 1,  ȳ_f' + c̄' = 0.   (28)
Eliminating c̄' and ȳ_f' from (28) and solving for n̄' gives
dn̄/dx = (1 + n̄)/(1 + n̄ + ȳ_f).   (29)
Thus we are now in a position to find the moments of n. We get the mean by substituting ⟨x⟩ into the deterministic result (13), and we get the variance by substituting the partial results (24), (25) and (29) into the variance decomposition theorem (22). For the mean we obtain the formula
E(n) = n̄(⟨x⟩).   (30)
The variance can be expressed in the form
Var(n) = (1/Ω) [c̄/(1 + n̄ + ȳ_f) + ((1 + n̄)/(1 + n̄ + ȳ_f))² ⟨x⟩].   (31)
Combining these two results gives us the possibility to obtain the Fano factor (of the free protein number N = Ωn) as
F = Ω Var(n)/E(n).   (32)
After some simplifying steps we come to the expression
F = 1 + n̄(1 + n̄)ȳ_f/(1 + n̄ + ȳ_f)²,   (33)
where the 1 can be interpreted as the Poissonian noise and the residual fraction as additional non-Poissonian noise present due to the interaction with the decoy binding sites.
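A direct evaluation of (33) (a small sketch; the function name is ours) confirms the limiting behavior discussed earlier: F = 1 when y = 0, and F > 1 for moderate binding-site concentrations.

```python
import math

def lna_fano(x, y):
    """Fano factor of the free protein number from (33):
    F = 1 + nbar (1 + nbar) ybar_f / (1 + nbar + ybar_f)^2."""
    b = y - x + 1.0
    nbar = (-b + math.sqrt(b * b + 4.0 * x)) / 2.0  # deterministic root of (13)
    cbar = x - nbar
    ybarf = y - cbar
    return 1.0 + nbar * (1.0 + nbar) * ybarf / (1.0 + nbar + ybarf) ** 2
```

With no binding sites the formula gives exactly one, and any positive binding-site concentration pushes it above one.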

C. Numerical simulations
In the current section we perform numerical simulations in order to visualize our results and to confirm the validity of the approximation scheme presented above.
1) Fano factor based on the system-size approach: In the first simulation we investigate the relation between ⟨x⟩ and the Fano factor. In the first figure (Figure 5) we plot the dependence of the Fano factor on the number of binding sites. We calculate the Fano factor for different values of ⟨x⟩. All values are meant as concentrations; therefore, to obtain the corresponding numbers of molecules we have to multiply them by Ω. We clearly see that for larger values of ⟨x⟩ we are able to reach larger values of F. An interesting point to observe in the graphs is the slope near y = 0. We see that for small values of ⟨x⟩ the slope increases with increasing ⟨x⟩, but for larger values of ⟨x⟩ the slope starts to decrease. In order to investigate this phenomenon in further depth, let us calculate the Taylor expansion of F near y = 0:
F(x, y) ≈ 1 + x y/(1 + x)².
We can find the maximum of the slope by differentiating it with respect to x,
d/dx [x/(1 + x)²] = (1 − x)/(1 + x)³,
which confirms that the maximal slope is obtained for x = 1.
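This expansion can be checked with a finite-difference slope of (33) at small y (a sketch; the step size h is an ad-hoc choice):

```python
import math

def lna_fano(x, y):
    # Fano factor (33), with nbar the non-negative root of (13)
    b = y - x + 1.0
    nbar = (-b + math.sqrt(b * b + 4.0 * x)) / 2.0
    cbar = x - nbar
    ybarf = y - cbar
    return 1.0 + nbar * (1.0 + nbar) * ybarf / (1.0 + nbar + ybarf) ** 2

def slope_at_zero(x, h=1e-6):
    """Finite-difference slope of F(x, y) at y = 0; expected x / (1 + x)^2."""
    return (lna_fano(x, h) - 1.0) / h
```

Numerically, the slope at x = 1 dominates the slopes at both smaller and larger x, in agreement with the expansion.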
In Figure 6 we take the ratio y/⟨x⟩ (the number of binding sites divided by the mean total protein number) as the independent variable and again plot the Fano factor for several different choices of ⟨x⟩. We can see that for larger values of ⟨x⟩ the maximum of the Fano factor is achieved near y/⟨x⟩ = 1.
In order to investigate this phenomenon further, let us express the Fano factor in terms of the new variables a = y/⟨x⟩ and b = 1/⟨x⟩, obtaining an expression F(a, b) (34). The problem of finding the maximum of the Fano factor with respect to a is equivalent to finding the solution of ∂F(a,b)/∂a = 0, with F(a, b) from (34). This yields a very complicated implicit function. As we are interested in cases with large values of ⟨x⟩, we want to investigate its solution for small values of b. As we are unable to express the value of b in terms of a from the implicit function in a reasonable way, we have to settle for a numerical and graphical solution of this equation, which is provided in Figure 7. This verifies our hypothesis that for large ⟨x⟩ the maximal Fano factor is achieved near the point where y = ⟨x⟩.
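The location of the maximum can also be found by a brute-force scan over y (a sketch using (33); the grid and the value ⟨x⟩ = 50 are ad-hoc choices):

```python
import math

def lna_fano(x, y):
    # Fano factor (33), with nbar the non-negative root of (13)
    b = y - x + 1.0
    nbar = (-b + math.sqrt(b * b + 4.0 * x)) / 2.0
    cbar = x - nbar
    ybarf = y - cbar
    return 1.0 + nbar * (1.0 + nbar) * ybarf / (1.0 + nbar + ybarf) ** 2

x = 50.0
grid = [0.25 * i for i in range(1, 601)]          # y from 0.25 to 150
best_y = max(grid, key=lambda y: lna_fano(x, y))  # maximizing binding-site level
ratio = best_y / x                                # approaches 1 for large x
```

For this value of x the maximizer lies close to, though slightly below, y = x, consistent with the graphical solution in Figure 7.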
2) Quality of the system-size approach: In this section we investigate whether the linear noise approximation is consistent with the stochastic results for the quasi-steady state. Using formula (9) we can obtain the probability distribution of free protein and therefore calculate its Fano factor. But it is not that straightforward: in order to get the results for big values of N, we have to rewrite the sum in a way that gets around the problems of calculating big factorials and of multiplying numbers close to 0 by extremely large numbers; for this purpose, the terms of (9) can be evaluated in logarithmic space. Now we can proceed with the calculations: we use 1, 5, 10 and 100 as the values for Ω and compare the numerically calculated exact Fano factor with the LNA expression for the Fano factor (33) for values of y between 0 and 10. In the case where Ω = 1 we can only use integer values of y, for Ω = 5 we can use multiples of 0.2 for y, and in the other cases we use multiples of 0.1 for y in order to plot the graphs. We use the same setup for the different values of ⟨x⟩, in particular 0.1, 0.5, 1 and 2. Results for the scenario with ⟨x⟩ = 2 are shown in Figure 8; results corresponding to other values of ⟨x⟩ are analogous. Furthermore, in TABLE IV we present the sum of squared residuals at the points y = 1, …, 10. (If y = 0, then F = 1 always, so we can omit this point.) We see that the differences between the approximations increase with ⟨x⟩, which is no surprise, as the Fano factor achieves higher values in such cases.
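A sketch of this comparison at a single parameter point (Ω = 10, ⟨x⟩ = 2, y = 1, i.e. ⟨X⟩ = 20, Y = 10, k_b = Ω; the helper names are ours): formula (9) is evaluated in logarithmic space and its Fano factor is compared with the LNA expression (33).

```python
import math

def poisson_pmf(lam, X):
    return math.exp(-lam + X * math.log(lam) - math.lgamma(X + 1))

def qss_fano(Xmean, Y, kb, XMAX=200):
    """Fano factor of the free protein number under the quasi-steady state (9),
    with the conditional weights evaluated in logarithmic space."""
    PN = [0.0] * (XMAX + 1)
    for X in range(XMAX + 1):
        pX = poisson_pmf(Xmean, X)
        lo = max(0, X - Y)
        logw = [N * math.log(kb) - math.lgamma(N + 1) - math.lgamma(X - N + 1)
                - math.lgamma(Y - X + N + 1) for N in range(lo, X + 1)]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        Z = sum(w)
        for N, wN in zip(range(lo, X + 1), w):
            PN[N] += pX * wN / Z
    mean = sum(N * p for N, p in enumerate(PN))
    var = sum((N - mean) ** 2 * p for N, p in enumerate(PN))
    return var / mean

def lna_fano(x, y):
    # Fano factor (33), with nbar the non-negative root of (13)
    b = y - x + 1.0
    nbar = (-b + math.sqrt(b * b + 4.0 * x)) / 2.0
    ybarf = y - (x - nbar)
    return 1.0 + nbar * (1.0 + nbar) * ybarf / (1.0 + nbar + ybarf) ** 2

Omega, x, y = 10, 2.0, 1.0
F_exact = qss_fano(x * Omega, int(y * Omega), float(Omega))
F_lna = lna_fano(x, y)
```

At this parameter point the two Fano factors agree to within a few thousandths, in line with the residuals reported in TABLE IV.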

IV. CONCLUSION

TABLE IV: Sum of squared residuals between the exact and LNA Fano factors.
Ω      ⟨x⟩ = 0.1   ⟨x⟩ = 0.5   ⟨x⟩ = 1     ⟨x⟩ = 2
1      1.8·10^-5   2.2·10^-4   8.0·10^-4   0.0032
5      5.6·10^-7   8.9·10^-6   3.3·10^-5   1.4·10^-4
10     1.4·10^-7   2.2·10^-6   8.4·10^-6   3.6·10^-5
100    1.3·10^-9   2.2·10^-8   8.5·10^-8   3.6·10^-7

We introduced three methods to obtain the free protein distribution of our gene expression model. In addition to stochastic simulation and numerical solution of the Master equation, we also employed singular-perturbation reduction techniques to obtain a quasi-steady-state approximation, which was helpful in finding a relatively simple explicit formula for the free protein distribution. Using this formula we were able to observe its statistical moments for many different input parameters. Very interesting results were obtained for the Fano factor, which substantially differed from the Poissonian case. In the second part of the paper we considered the case of large system size, using the dissociation constant as the measure of size. This approach yielded a tractable expression for the Fano factor of the free protein distribution. Through numerical simulations we showed that this solution is consistent with the results from the first part of the paper. While we have employed our methodologies to explore the properties of a relatively simple model, we expect that the same approaches can yield valuable insights into stochastic gene-expression models in particular, as well as into problems of mathematical biology more widely.
ACKNOWLEDGMENT
PB and MH are supported by the Slovak Research and Development Agency grant APVV-14-0378. PB acknowledges the support from the VEGA grant agency (grant no. 1/0319/15). PB would like to thank A. Singh for helpful conversations about questions related to the research presented herein.