Plasmid Design¶
from nupack import *
import pint
u = pint.UnitRegistry()
Plasmid Components¶
Promoter: necessary for transcription initiation on the DNA molecule
Terminator: necessary for transcription termination on the DNA molecule
Ribosome binding site (RBS): necessary for translation initiation on the RNA transcript
Coding sequence (CDS): encodes the amino-acid sequence of the protein
Origin of replication: necessary for replication of the DNA molecule (in particular for vertical gene transfer)
Restriction enzyme (RE) sites: used for editing the plasmid backbone by inserting or deleting DNA fragments
Nucleic Acid Thermodynamics¶
Secondary structure¶
A DNA or RNA molecule can form bonds with itself or other molecules in the form of base-pairing. Base-pairing is a reversible and stochastic process dictated by the rules of thermodynamics. In particular, different copies of the molecule will probably be in different states and a single molecule will probably change states over time. The probability of a given secondary structure $s$ to occur in equilibrium is given by \begin{equation} p(s) = \frac{1}{Q} e^{- \Delta G(s) / kT} \end{equation} where $\Delta G(s)$ is the difference in Gibbs free energy of structure $s$ (usually with respect to the structure without base-pairings), $k$ is the Boltzmann constant, $T$ is the temperature, and $Q$ is the partition function (that is, a normalization term to have all probabilities sum up to $1$). The secondary structure with the highest probability is usually called the MFE (minimum free energy) structure.
The secondary structure can be analyzed with the nupack
library:
temperature = pint.Quantity(23, u.celsius)
my_model = Model(material='rna', celsius=temperature.to(u.celsius).magnitude)
a = Strand('GGCUGGUUUCUGCUCUCUAGUUCGCGAGGUGCAAUCUCCUAUC', name='a')
c1 = Complex([a])
set1 = ComplexSet(strands=[a])
result = complex_analysis(complexes=set1, model=my_model, compute=['pfunc','mfe'])
result[c1]
nupack.analysis.ComplexResult({model: Model('stacking', 'rna06.json', T=296.15 K), strands: (<Strand a>,), pfunc: 1299942011.309280711098976784, free_energy: -12.350025362861416, mfe_stack: -11.470466613769531, mfe: [StructureEnergy(Structure('((..(((...(((((((........)))).)))))).))....'), energy=-11.515907287597656, stack_energy=-11.470466613769531)]})
Partition function, MFE structure¶
The number $T(N)$ of possible pseudoknot-free secondary structures with $N$ nucleotides satisfies the recursion \begin{equation} T(N+1) = T(N) + \sum_{k=1}^{N-2} T(k) T(N-k-1) \end{equation} and has the asymptotic formula (Stein and Waterman 197990033-5)): \begin{equation} T(N) \sim \sqrt{\frac{15 + 7\sqrt{5}}{8\pi}} N^{-3/2} \left( \frac{3+\sqrt{5}}{2} \right)^N \end{equation} That is, the number of possible secondary structures is exponential in the length of the sequence. It is nonetheless possible to calculate the MFE structure and the partition function without pseudoknots (no crossing base pairings in the plane) in polynomial time using dynamic programming (Zuker and Sankoff 1984). If pseudoknots are to be taken into account, the computation becomes NP hard.
To show the principle, we now describe a dynamic program to caclulate the partition function (McCaskill 1990). The basic recurrence for the partition function of the subsequence delimited by indices $i$ and $j$ is: \begin{equation} Q_{i,j} = 1 + \sum_{i\leq d < e\leq j} Q_{i,d-1} Q^b_{d,e} \end{equation} In this sum, the pair $(d,e)$ denotes the rightmost base pairing (that is, $e$ is maximal). The term $Q^b(i,j)$ denotes the partition function of the subsequence between $i$ and $j$ with the additional constraint that $i$ and $j$ are base-paired.
\begin{equation} Q^b_{i,j} = P^\mathrm{hairpin}_{i,j} + \sum_{i<d<e<j} Q^b_{d,e} P^\mathrm{interior}_{i,d,e,j} + Q^m_{i+1,d-1} Q^b_{d,e} P^\mathrm{multi,init}_{i,d,e,j} \end{equation}
Here, $Q^m_{i,j}$ is the partition function of the subsequence between $i$ and $j$ with the additional constraint that it is inside a multiloop and contains at least one base pair.
\begin{equation} Q^m_{i,j} = \sum_{i\leq d < e\leq j} Q^b_{d,e} P^\mathrm{multi,end}_{i,d,e,j} + Q^m_{i,d-1} Q^m_{d,e} P^\mathrm{multi,continue}_{i,d,e,j} \end{equation}
Initially, $Q_{i,i-1} = 1$ and $Q^b_{i,i-1} = Q^m_{i,i-1} = 0$.
Interacting strands¶
Until now, we discussed secondary structure of a single nucleic acid strand. The case of multiple interacting strands is very important, although a bit more difficult (Dirks, Bois, Schaeffer, Winfree, and Pierce 2007). For the secondary structure, the strands in a complex are linearly ordered around a circle. The biggest difference is that this multi-strand sequence now has nicks (between strands). This introduces the need to consider exterior loops between strands. Mathematically, in the computation of the term $Q^b_{i,j}$ of the partition function, we add a new term that accounts for exterior loops:
\begin{equation} \sum_{\substack{i\leq c<j\\\text{nick between $c$ and $c+1$}}} Q_{i+1,c} Q_{c+1,j-1} \end{equation}
The nupack
library can carry out these calculations:
# define physical quantities
temperature = pint.Quantity(23, u.celsius)
concentration_a = 5e-6 * u.molar
concentration_b = 5e-6 * u.molar
# Define physical model
my_model = Model(material='rna', celsius=temperature.to(u.celsius).magnitude)
# Define strand species
a = Strand('GGCUGGUUUCUGCUCUCUAGUUCGCGAGGUGCAAUCUCCUAUC', name='a')
b = Strand('GTCUGGGAUGCUGGAUACUGAACCUAGAGAGCAGAAACCAGCC', name='b')
# Define tube ensemble containing strands at specified concentrations
# interacting to form all complexes up to 4 strands
t1 = Tube(strands={a:concentration_a.to(u.molar).magnitude, b:concentration_b.to(u.molar).magnitude}, complexes=SetSpec(max_size=4), name='Tube t1')
# Analyze the tube ensemble
# Calculate pfunc (default), pairs, mfe, concentration (default) for each complex
# Since pairs is specified, calculate ensemble pair fractions for the tube ensemble
tube_result = tube_analysis(tubes=[t1], compute=['pairs', 'mfe'], model=my_model)
tube_result
Complex | Pfunc | ΔG (kcal/mol) | MFE (kcal/mol) |
---|---|---|---|
(a) | 1.2999e+9 | -12.350 | -11.516 |
(b) | 1.8417e+10 | -13.910 | -13.118 |
(a+a) | 4.9588e+24 | -33.464 | -31.510 |
(a+b) | 7.6453e+48 | -66.241 | -65.130 |
(b+b) | 7.4450e+29 | -40.479 | -39.350 |
(a+a+a) | 3.6462e+38 | -52.254 | -48.841 |
(a+a+b) | 1.9410e+65 | -88.470 | -86.171 |
(a+b+b) | 2.8568e+64 | -87.342 | -84.975 |
(b+b+b) | 3.1631e+49 | -67.076 | -65.394 |
(a+a+a+a) | 2.5495e+53 | -72.370 | -67.667 |
(a+a+a+b) | 1.0730e+79 | -107.092 | -103.502 |
(a+a+b+b) | 8.4355e+97 | -132.697 | -129.756 |
(a+b+a+b) | 2.6622e+105 | -142.859 | -140.833 |
(a+b+b+b) | 1.2438e+84 | -113.954 | -111.019 |
(b+b+b+b) | 1.5135e+69 | -93.744 | -91.626 |
Complex | Tube t1 (M) | |
---|---|---|
(a+b+a+b) | 1.767e-06 | |
(a+b) | 1.466e-06 | |
(a+a+b+b) | 5.599e-14 | |
(b) | 1.915e-16 | |
(a) | 1.327e-18 | |
(a+b+b) | 1.028e-18 | |
(a+a+b) | 6.862e-19 | |
(b+b) | 1.453e-24 | |
(a+b+b+b) | 8.406e-27 | |
(a+a) | 9.338e-32 | |
(b+b+b) | 1.159e-32 | |
(a+a+a+b) | 6.996e-34 | |
(b+b+b+b) | 1.041e-40 | |
(a+a+a) | 1.266e-46 | |
(a+a+a+a) | 1.633e-60 |
Eqilibrium complex concentrations¶
Let $M_s$ be the number of solvent molecules (think water). Further, let $m^0$ be the initial single-strand population and $m_j$ the number of complexes $j\in\Psi$. Set $M = \sum_{j\in\Psi} m_j$. The partition function for the possible configurations in a box is then:
\begin{equation} Q_\mathrm{box} = Q_\mathrm{ref} \sum_{m\in\Lambda} \frac{(M_s + M)!}{M_s! \prod_{j\in\Psi} m_j!} \prod_{j\in\Psi} Q_j^{m_j} \approx Q_\mathrm{ref} \sum_{m\in\Lambda} q(m) \end{equation} where we defined \begin{equation} q(m) = \prod_{j\in\Psi} \frac{M_s^{m_j} Q_j^{m_j}}{m_j!} \end{equation} and used $M_s \gg M$, which gives $(M_s + M)!/M_s! \approx M_s^M$.
The probability of population vector $m$ is $p(m) = Q_\mathrm{box}^{-1}Q_\mathrm{ref} q(m)$. By the Central Limit Theorem, for large populations, this is an approximately multivariate normal distribution with variance in coordinate $j$ proportional to the mean value $\langle m_j \rangle$. Since $q(m)$ is proportional to $p(m)$, it is also approximately a multivariate Gaussian function with height $Q_\mathrm{ref} q(\langle m\rangle)$ and standard deviation in coordinate $j$ proportional to $\langle m_j\rangle^{1/2}$. Its integral is thus proportional to: \begin{equation} Q_\mathrm{box} \approx Q_\mathrm{ref} q(\langle m\rangle) \prod_{j\in\Psi} \langle m_j\rangle^{1/2} \end{equation}
Using Stirling's approximation $\log n! = n\log n - n + O(\log n)$, we get: \begin{equation} \Delta G_\mathrm{box} = -kT \log Q_\mathrm{box} = -kT \log Q_\mathrm{ref} + kT \sum_{j\in\Psi} \left( \langle m_j\rangle \left( \log\frac{\langle m_j \rangle}{M_s} - \log Q_j - 1 \right) + O(\log \langle m_j\rangle) \right) \end{equation} Neglecting the smaller $O(\log \langle m_j \rangle)$ terms, we get \begin{equation} \frac{\Delta G_\mathrm{box}}{M_s kT} \approx g_\mathrm{ref} + \sum_{j\in\Psi} x_j \left( \log x_j - \log Q_j - 1 \right) \end{equation} where we set $x_j = \langle m_j\rangle / (M_s+\sum_{k\in\Psi} \langle m_k\rangle) \approx \langle m_j\rangle/M_s$ and $g_\mathrm{ref} = - M_s^{-1}\log Q_\mathrm{ref}$.
The equilibrium distribution of complexes is thus approximated by the solution of the optimization problem \begin{equation} \min_{x} \sum_{j\in\Psi} x_j(\log x_j - \log Q_j - 1) \qquad \text{s.t.}\quad Ax = x^0 \end{equation} where the matrix $A$ calculates the conservation-of-mass constraints of the number of strands in each complex.
This optimization problem is convex since the feasibility set is an affine space and the Hessian matrix of the objective function is a diagonal matrix with diagonal entries and eigenvalues $1/x_j > 0$ in the open first orthant.
RBS Calculator¶
One benefit of understanding the properties of RNA molecules is the fact that we can predict the transcription and translation processes to some extent. A particulatly interesting application is the RBS Calculator (Salis, Mirsky, and Voigt 2009, Salis 2011), which allows the analysis and design of RBS kinetics. Translation initiation is an important step in the translation process.