Homework: Mechanistic vs. Data-Driven Classification

Question

You've seen two ways to analyze growth curves. In Lecture 10, you extracted features from the time series and trained neural networks on them. In Lecture 11, you fit the logistic growth model (derived from the chemical reaction network, or CRN, $S + X \xrightarrow{k} 2X$) to extract the parameters $k$, $S_0$, and $X_0$. Which approach works better for predicting toxicity? Can combining them help?

Dataset

Use the same dataset as in the lectures: isolate-growth-curves.csv from Figshare, which contains bacterial growth curves measured under chemical stress.

Problem 1: Extract Mechanistic Features

Fit the logistic growth model to every curve in the dataset. Recall from Lecture 11 that the model starts from the CRN:

$$S + X \xrightarrow{k} 2X$$

with conservation $S + X = S_0 + X_0$, you get:

$$\frac{dX}{dt} = kX(S_0 + X_0 - X)$$

The solution is:

$$X(t) = \frac{K}{1 + \left(\frac{K}{X_0} - 1\right)e^{-kKt}}$$

where $K = S_0 + X_0$.
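As a quick sanity check on the closed form, the sketch below evaluates $X(t)$ and compares a finite-difference derivative against the right-hand side of the ODE. The function name `logistic_od` and the parameter values are illustrative assumptions, not values from the dataset or from Lecture 11.

```python
import math

def logistic_od(t, k, S0, X0):
    """Closed-form solution X(t) of dX/dt = k * X * (S0 + X0 - X)."""
    K = S0 + X0  # plateau fixed by the conservation law S + X = S0 + X0
    return K / (1.0 + (K / X0 - 1.0) * math.exp(-k * K * t))

# Illustrative parameter values only (not taken from the dataset).
k, S0, X0 = 0.8, 1.0, 0.01
K = S0 + X0

# A centered finite difference of X(t) should agree with the ODE's
# right-hand side evaluated at the same time point.
t, h = 3.0, 1e-5
dXdt_numeric = (logistic_od(t + h, k, S0, X0) - logistic_od(t - h, k, S0, X0)) / (2 * h)
X_t = logistic_od(t, k, S0, X0)
dXdt_ode = k * X_t * (K - X_t)

print(f"numeric dX/dt = {dXdt_numeric:.6f}, ODE right-hand side = {dXdt_ode:.6f}")
```

Note that the effective growth rate in the exponent is $kK$, so $k$ alone is not the per-hour growth rate unless $K = 1$.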

Extract the three CRN parameters for each curve: $k$ (rate constant), $S_0$ (initial substrate), and $X_0$ (initial cells). The fit is a numerical optimization that starts from initial guesses for the parameters, as in Lecture 11.

Go through all growth curves in the dataset and fit the logistic model to each one. Some fits will fail (numerical issues, irregular curves, etc.). Skip those curves and move on. Save the successfully fitted parameters to a file called mechanistic_features.csv with columns for strain, chemical, replicate, k, S0, X0, and inhibited.
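One possible shape for the fit-and-skip loop is sketched below using `scipy.optimize.curve_fit`. Everything here is an assumption for illustration: the helper `fit_curve`, the heuristic initial guesses, the positivity bounds, and the two synthetic curves standing in for the real data. The Lecture 11 code may set these up differently.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, k, S0, X0):
    """Logistic solution of the CRN model, with K = S0 + X0."""
    K = S0 + X0
    return K / (1.0 + (K / X0 - 1.0) * np.exp(-k * K * t))

def fit_curve(t, od):
    """Fit one curve; return a dict of parameters, or None if the fit fails."""
    t = np.asarray(t, dtype=float)
    od = np.asarray(od, dtype=float)
    if t.size < 4 or not (np.all(np.isfinite(t)) and np.all(np.isfinite(od))):
        return None  # unusable curve, skip it
    # Heuristic initial guesses (an assumption, not the Lecture 11 values).
    X0_guess = max(od[0], 1e-3)
    S0_guess = max(od.max() - od[0], 1e-3)
    try:
        popt, _ = curve_fit(
            logistic, t, od,
            p0=[1.0, S0_guess, X0_guess],
            bounds=(1e-6, np.inf),  # keep all three parameters positive
            max_nfev=5000,
        )
    except (RuntimeError, ValueError):  # raised when the optimization fails
        return None
    k, S0, X0 = popt
    return {"k": float(k), "S0": float(S0), "X0": float(X0)}

# Synthetic stand-ins for real curves: one noisy logistic, one unusable.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 24.0, 49)
curves = {
    "good": logistic(t, 0.5, 1.0, 0.02) + rng.normal(0.0, 0.005, t.size),
    "bad": np.full(t.size, np.nan),
}

rows, n_failed = [], 0
for name, od in curves.items():
    params = fit_curve(t, od)
    if params is None:
        n_failed += 1
        continue
    rows.append({"curve": name, **params})

print(f"fitted {len(rows)} curves, {n_failed} failed")
```

With the real dataset, each row would also carry the strain, chemical, replicate, and inhibited values before being written to mechanistic_features.csv.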

Plot 6 example fits showing both the data points and the fitted curve. Pick 3 control curves and 3 treated curves to show the difference.

Report how many curves you fitted successfully and how many failed.

Problem 2: Compare Three Approaches

Build three binary classifiers to predict whether bacteria are inhibited by a chemical. Use the same train/test split (80/20, random_state=42) for all three approaches so the comparison is fair. Use a neural network classifier with the same architecture for all three.
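One way to guarantee an identical split is to split row indices once and reuse them for every feature matrix. The sketch below assumes scikit-learn's `train_test_split` and uses random placeholder arrays (`X_mech`, `X_data`, and the count `n_curves` are hypothetical); it also assumes that curves whose logistic fit failed have been dropped from all feature sets so that the rows stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_curves = 200  # placeholder count, not the real dataset size

# Split the row indices once, then index every feature matrix with them.
idx = np.arange(n_curves)
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)

# Hypothetical feature matrices, one row per curve, in the same row order.
rng = np.random.default_rng(0)
X_mech = rng.normal(size=(n_curves, 3))    # stands in for k, S0, X0
X_data = rng.normal(size=(n_curves, 12))   # stands in for time-series features
X_hybrid = np.hstack([X_mech, X_data])     # concatenated feature vector
y = rng.integers(0, 2, size=n_curves)      # stands in for the inhibited label

splits = {
    name: (X[train_idx], X[test_idx], y[train_idx], y[test_idx])
    for name, X in [("mechanistic", X_mech), ("data-driven", X_data), ("hybrid", X_hybrid)]
}
```

Every approach is then trained and evaluated on exactly the same curves, so differences in the metrics reflect the features rather than the split.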

2.1 Mechanistic Features Only

Use only the three CRN parameters as features: $k$, $S_0$, $X_0$. Train a neural network classifier (use the same architecture as in Lecture 10: input → 16 neurons (ReLU) → 8 neurons (ReLU) → 1 output (sigmoid)).
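The stated architecture can be written in several frameworks; the sketch below uses scikit-learn's `MLPClassifier` as one option, which may not be the framework used in Lecture 10. The synthetic data, the `StandardScaler` step, and the `max_iter` value are assumptions added for illustration. For a binary target, `MLPClassifier` applies a logistic (sigmoid) activation to its single output unit.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three CRN features and a binary label.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = (X[:, 0] < 0).astype(int)  # toy rule: label 1 when the first feature is negative

# input -> 16 (ReLU) -> 8 (ReLU) -> 1 (sigmoid)
model = make_pipeline(
    StandardScaler(),  # assumption: scale features before the network
    MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu",
                  max_iter=2000, random_state=42),
)
model.fit(X, y)

mlp = model.named_steps["mlpclassifier"]
layer_shapes = [w.shape for w in mlp.coefs_]
print("weight shapes:", layer_shapes, "| output activation:", mlp.out_activation_)
```

Scaling matters here because $k$, $S_0$, and $X_0$ can differ by orders of magnitude, which tends to slow or destabilize training on raw values.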

Answer these questions: Which parameter is most important for prediction? Make a scatter plot of $k$ vs $K = S_0 + X_0$ colored by inhibited/resistant. Do inhibited samples cluster in a particular region? What does this tell you biologically about how chemicals affect growth?

2.2 Data-Driven Features Only

Use features extracted directly from the time-series, like you did in Lecture 10. You can either use the first 12 hours of OD measurements, or engineer features like max OD, area under curve, slopes at different times, etc.
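If you go the engineered-feature route, a minimal sketch could look like the following. The function `engineer_features`, its feature names, and the 12-hour default window are illustrative choices, not the feature set from Lecture 10.

```python
import numpy as np

def engineer_features(t, od, t_max=12.0):
    """Summary features from one growth curve, restricted to t <= t_max hours."""
    t = np.asarray(t, dtype=float)
    od = np.asarray(od, dtype=float)
    keep = t <= t_max
    t, od = t[keep], od[keep]
    # Trapezoidal area under the curve, written out to avoid NumPy version differences.
    auc = float(np.sum(0.5 * (od[1:] + od[:-1]) * np.diff(t)))
    slopes = np.gradient(od, t)  # finite-difference slope at each time point
    return {
        "max_od": float(od.max()),
        "final_od": float(od[-1]),
        "auc": auc,
        "max_slope": float(slopes.max()),
        "t_max_slope": float(t[np.argmax(slopes)]),
    }

# Toy curve: OD rising linearly by 0.1 per hour for 4 hours.
t_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
od_demo = 0.1 * t_demo
features = engineer_features(t_demo, od_demo)
print(features)
```

Stacking the dictionaries for all curves into a table gives the data-driven feature matrix, with one row per curve.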

Use the same neural network architecture as in 2.1 to keep the comparison fair. How does performance compare to the mechanistic approach? What are the trade-offs between these two methods?

2.3 Hybrid

Combine CRN parameters and data-driven features into a single feature vector. Train the same neural network architecture.

Does combining both approaches improve performance? If yes, why do you think that is? If no, why not? Which features end up being most important in the hybrid model?

Create a comparison table showing accuracy, precision, recall, and F1 for all three approaches.
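The four metrics all follow from the confusion-matrix counts. The sketch below computes them in plain Python, treating label 1 (inhibited) as the positive class, which is an assumption about your label encoding. The labels and predictions shown are made-up toy values for demonstration, not results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels and predictions, for demonstration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy = binary_metrics(y_true, y_pred)

# One table row per approach; replace the toy entry with your three models.
print(f"{'approach':<14}{'accuracy':>10}{'precision':>11}{'recall':>8}{'f1':>8}")
for name, m in {"toy example": toy}.items():
    print(f"{name:<14}{m['accuracy']:>10.3f}{m['precision']:>11.3f}"
          f"{m['recall']:>8.3f}{m['f1']:>8.3f}")
```

Reporting precision and recall alongside accuracy is useful if inhibited and non-inhibited curves are unevenly represented, since accuracy alone can look good for a model that mostly predicts the majority class.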

Problem 3: Discussion

Write 1-2 pages discussing your results.

First, talk about performance. Which approach worked best? Were the differences large or small? Under what circumstances would you choose one approach over another?

Second, discuss the interpretability vs accuracy trade-off. CRN parameters like $k$, $S_0$, and $X_0$ have clear biological meanings tied to the underlying chemistry. You can explain to a biologist exactly what they represent. Data-driven features might predict better, but you can't always explain why they work. For a toxicity screening application, which matters more? Give an example of a scenario where interpretability is critical, and another where raw prediction accuracy is what you need.

Finally, what are the limitations of your analysis? What would you do differently with more time or data? What additional features or measurements would help?