Problem 1: Chemical Case Study
Pick one chemical from the dataset and extract all samples treated with it, across all strains. Compute the 5 biological features for each sample, apply PCA to these features, and plot the samples in PC1-PC2 space, colored by strain. Do you see clusters corresponding to resistant versus sensitive strains? Which of the 5 features shows the largest difference between resistant and sensitive strains? Finally, compute the mean AUC (here the area under the growth curve, not a classifier metric) and the mean growth rate for the DMSO controls, and compare them with the same two means for your chosen chemical.
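As a starting point, the PCA and group-comparison steps might be sketched as follows. This is a minimal NumPy-only sketch, not the notebook's own code: the feature names, the group sizes, and the synthetic data are all placeholders, and you would substitute the feature matrix you extracted for your chosen chemical.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Standardize each column, then project onto the top principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    Z = (X - X.mean(axis=0)) / std
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

def standardized_difference(X, is_resistant):
    """Per-feature absolute mean difference between groups, in pooled-SD units."""
    X = np.asarray(X, dtype=float)
    mask = np.asarray(is_resistant, dtype=bool)
    a, b = X[mask], X[~mask]
    pooled = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    pooled[pooled == 0] = np.nan  # undefined when a feature has no spread
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled

# Synthetic stand-in data (assumption): 20 sensitive and 20 resistant samples,
# 5 features with hypothetical names; only "growth_rate" truly differs.
rng = np.random.default_rng(0)
feature_names = ["auc", "growth_rate", "lag_time", "max_od", "final_od"]
n_per_group = 20
sensitive = rng.normal(0.0, 1.0, size=(n_per_group, 5))
resistant = rng.normal(0.0, 1.0, size=(n_per_group, 5))
resistant[:, 1] += 3.0
X = np.vstack([sensitive, resistant])
is_resistant = np.array([False] * n_per_group + [True] * n_per_group)

scores, explained = pca_scores(X)
effect = standardized_difference(X, is_resistant)
top_feature = feature_names[int(np.nanargmax(effect))]
print("explained variance (PC1, PC2):", explained)
print("feature with largest group difference:", top_feature)
```

To produce the requested plot, pass `scores[:, 0]` and `scores[:, 1]` to matplotlib's `plt.scatter` with one color per strain. Standardizing before PCA matters here because the 5 features are on different scales; without it, the feature with the largest raw variance would dominate PC1.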
Problem 2: Representation Learning Showdown
Compare three representation learning approaches on the same classification task: (1) Biological features (the 5 we engineered) fed into a small neural network, (2) First 10 PCA components fed into the same architecture, (3) Raw time series fed into the 1D CNN we built. Train all three with an identical training protocol (same train/test split, same batch size, same number of epochs, same optimizer settings). Report F1 and AUC for each. Which achieves the best performance? Which trains fastest? Which is most data-efficient (retrain each with only 50% of the training data and compare how much performance drops)? If you had to explain predictions to a biologist, which would you choose and why?
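One way to keep the comparison fair is to route all three representations through a single harness that fixes the split, the seed, and the metrics. The sketch below is an assumption-laden illustration: `centroid_scorer` is a deliberately trivial stand-in for the networks you actually train, the representation names and synthetic data are hypothetical, and "AUC" is taken to mean ROC AUC.

```python
import time
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.
    Assumes both classes are present in y_true."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

def centroid_scorer(X_train, y_train):
    """Placeholder model (assumption): score by projection onto the class-mean difference."""
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -0.5 * (mu_pos + mu_neg) @ w
    return lambda X: X @ w + b

def compare(representations, y, train_fraction=1.0, seed=0):
    """Evaluate every representation on the same split; optionally subsample training data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = len(y) // 4
    test_idx, train_idx = order[:n_test], order[n_test:]
    train_idx = train_idx[: max(2, int(len(train_idx) * train_fraction))]
    results = {}
    for name, X in representations.items():
        start = time.perf_counter()
        score_fn = centroid_scorer(X[train_idx], y[train_idx])
        elapsed = time.perf_counter() - start
        s = score_fn(X[test_idx])
        results[name] = {
            "f1": f1_score(y[test_idx], s > 0),
            "auc": roc_auc(y[test_idx], s),
            "train_seconds": elapsed,
        }
    return results

# Synthetic stand-in data (assumption): 200 samples, three representations of different width.
rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, size=n)
def make_rep(d):
    return rng.normal(size=(n, d)) + 0.8 * y[:, None]
representations = {"bio_features": make_rep(5), "pca_10": make_rep(10), "raw_series": make_rep(48)}

results_full = compare(representations, y, train_fraction=1.0)
results_half = compare(representations, y, train_fraction=0.5)
for name in representations:
    print(name, results_full[name], results_half[name])
```

Because the test indices depend only on `seed`, the full-data and half-data runs are scored on the same held-out samples, so any drop in F1 or AUC reflects the smaller training set rather than a different test set.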
Problem 3: How Early Can You Predict?
The entire motivation for this analysis was reducing screening time. We worked with MAX_HOURS=24, but can we predict even earlier? Modify the configuration to test cutoffs at 3h, 6h, 9h, 12h, 18h, and 24h. For each cutoff, train all four approaches and record their F1 scores and AUC values. Create a plot with the cutoff time on the x-axis and the performance metric on the y-axis, with a separate curve for each model. Where is the "elbow" beyond which additional waiting time gives diminishing returns? You'll likely find that 3-6 hours is too early (many cultures are still in lag phase), but that there is a sweet spot somewhere between 12 and 24 hours where performance is nearly as good as using the full 72 hours. To check that last claim, also record the metrics for the full 72-hour curves as a baseline.
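The cutoff sweep boils down to two small pieces: truncating every growth curve at the cutoff, and picking the earliest cutoff whose score is close enough to the best one. The following is a sketch under stated assumptions: the 30-minute sampling interval, the logistic toy curves, and the helper names `truncate_curves` and `earliest_good_cutoff` are illustrative and not taken from the notebook, and the 0.02 tolerance is an arbitrary default you should choose yourself.

```python
import numpy as np

CUTOFFS_HOURS = [3, 6, 9, 12, 18, 24]  # cutoffs listed in the exercise

def truncate_curves(times_h, curves, max_hours):
    """Keep only measurements taken at or before max_hours."""
    times_h = np.asarray(times_h, dtype=float)
    keep = times_h <= max_hours
    return times_h[keep], np.asarray(curves)[:, keep]

def earliest_good_cutoff(score_by_cutoff, tolerance=0.02):
    """Smallest cutoff whose score is within `tolerance` of the best score."""
    best = max(score_by_cutoff.values())
    return min(c for c, s in score_by_cutoff.items() if s >= best - tolerance)

# Toy growth curves (assumption): 72 h sampled every 30 min, three cultures
# with different lag times, modelled as logistic curves.
times_h = np.arange(0, 72.5, 0.5)
lag_h = np.array([8.0, 12.0, 20.0])
curves = 1.0 / (1.0 + np.exp(-(times_h[None, :] - lag_h[:, None]) / 2.0))

shapes = {}
for cutoff in CUTOFFS_HOURS:
    t, c = truncate_curves(times_h, curves, cutoff)
    shapes[cutoff] = c.shape  # this is where you would train and evaluate each model
    print(f"{cutoff:>2} h cutoff -> {c.shape[1]} time points per curve")
```

Once you have filled a dictionary that maps each cutoff to the F1 (or AUC) you measured for one model, `earliest_good_cutoff` gives a simple, explicit definition of the elbow. Reading it off the plot by eye is equally valid, as long as you state the criterion you used.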
After completing this analysis, answer the practical design question: You're designing an automated screening platform. Would you prefer 72-hour measurements or 6-12 hour measurements? Defend your choice considering:
- The cost of false negatives (missing a true hit)
- The cost of false positives (wasting resources on follow-up validation)
- The economic value of increased throughput
- The timeline and budget constraints of your contract
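To make the trade-off between these considerations concrete, it can help to write them down as a back-of-the-envelope cost model. The sketch below is only one possible framing: every parameter name is hypothetical, the cost model is a simplifying assumption, and no numbers are supplied because they depend on your own error rates and on the contract.

```python
HOURS_PER_WEEK = 7 * 24

def runs_per_week(run_hours):
    """How many back-to-back runs one instrument completes per week (ignores setup time)."""
    return HOURS_PER_WEEK / run_hours

def expected_cost_per_compound(prevalence, fn_rate, fp_rate,
                               cost_missed_hit, cost_followup,
                               run_hours, cost_per_instrument_hour,
                               compounds_per_run):
    """Expected cost of screening one compound under a simple additive model (assumption).

    prevalence: fraction of compounds that are true hits
    fn_rate:    probability a true hit is missed at this measurement duration
    fp_rate:    probability a non-hit is flagged and sent to follow-up
    """
    missed = prevalence * fn_rate * cost_missed_hit
    wasted = (1.0 - prevalence) * fp_rate * cost_followup
    instrument = run_hours * cost_per_instrument_hour / compounds_per_run
    return missed + wasted + instrument

# Throughput ratio between a 12 h and a 72 h protocol on the same instrument.
throughput_gain = runs_per_week(12) / runs_per_week(72)
print("throughput gain of 12 h over 72 h runs:", throughput_gain)
```

Plug in the false-negative and false-positive rates you measured at each cutoff in the analysis above; the shorter protocol wins only if its throughput gain outweighs the extra cost of the hits it misses.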
License: © 2025 Matthias Függer and Thomas Nowak. Licensed under CC BY-NC-SA 4.0.