Hello everyone!
It’s been a while since I’ve done one of these (have I ever?). It’s mostly been Jan’s thing. But lately I’m really digging into phage + ML (even though I am a complete ML newb). Some of you may have seen our talks in Cartagena, Colombia last week (hello new South American phage friends who’ve joined us here at Capsid & Tail this week!). Anyway, if you saw them, you won’t be surprised by my picks this week, since I talked about them in my talk. Namely, I think we should pay attention to what is going on with building new ML models with phage data. Hint: it is no longer just phage people. The labs near us at Stanford are now training generative DNA models on phage data, and doing crazy things like generating whole new genomes! What’s next? Can we (should we) design all our experiments so that machines can gobble them up and learn more, faster? How would we even do this? I’ll be exploring this further in the coming weeks/months. But first, let’s look at some of these papers!
I also chose a paper that’s about actually CHOOSING projects. It came up today in conversation with an undergrad student I’m mentoring in the lab, Aaryan, who was wondering what he might do in the lab in January. My answer was: I could come up with something, OR you and I could each read this paper, then meet up to discuss. TLDR: we should all be spending more time thinking about what to do before we start lab work. We currently spend minuscule amounts of time picking a project, and then years living with that choice. Why?
Hope you enjoy!
~ Jess
Sequence modeling and design from molecular to genome scale with Evo
What is it about?
Everyone’s using ChatGPT, but how long until we get ChatGPT for DNA? It was always inevitable, but who knew it would be so soon? This paper introduces Evo, a new generative machine learning model built by a group at the Arc Institute (a new institute next door to Stanford funded by Silicon Valley billionaires who want to give biology a boost — this is one of their first big releases and it’s awesome). A team there trained Evo on about 3 million prokaryotic and PHAGE genomic sequences (just the nucleotide sequences… not annotated or labeled!). It can predict function and GENERATE sequences across DNA, RNA and proteins. For example, it can generate CRISPR-Cas protein-RNA complexes and transposons that don’t exist (but function in the lab!) from scratch, and create plausible phage genome sequences over 1 megabase in length (though they haven’t yet rebooted these genomes, they told me they are close). Here is a seminar Brian Hie, the last author on the Evo paper, gave a little while ago; I watched this before reading the paper and it really made it click for me.
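For fellow ML newbs, here is my own toy sketch (nothing from the paper!) to make the “generative DNA model” idea concrete. A k-mer Markov chain is about the simplest possible version: like Evo, it learns next-nucleotide statistics from raw, unlabeled sequence, then samples new sequence one base at a time — Evo just does this with a vastly more powerful architecture and real genomes.

```python
from collections import defaultdict
import random

# Toy illustration only (my sketch, not Evo's architecture): learn
# which base tends to follow each k-mer, then sample new sequence.

K = 3
random.seed(42)

def train(genomes, k=K):
    """Count which base follows each k-mer across the training genomes."""
    counts = defaultdict(lambda: defaultdict(int))
    for g in genomes:
        for i in range(len(g) - k):
            counts[g[i:i + k]][g[i + k]] += 1
    return counts

def generate(counts, seed, length, k=K):
    """Extend a seed sequence by sampling from the learned statistics."""
    seq = seed
    while len(seq) < length:
        nxt = counts.get(seq[-k:])
        if not nxt:  # k-mer never seen in training: stop
            break
        bases, weights = zip(*nxt.items())
        seq += random.choices(bases, weights=weights)[0]
    return seq

# Two made-up "genomes" stand in for the ~3 million real ones.
genomes = ["ATGCGATATGCGATGCCGATATG", "ATGCGATGCGATATGCCGATGCG"]
model = train(genomes)
sample = generate(model, "ATG", 20)
print(sample)
```

The gap between this and Evo is of course enormous, but the loop — train on raw sequence, then generate — is the same shape.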
Why I’m excited about it:
I am obviously not an AI/ML expert, but even I can sense that this is a huge leap forward in applying AI to biology. The fact that Evo can predict complexes of protein and RNA, AND do this all from nucleotide sequences alone (without even being labeled/annotated)… this seems wild to me. To me it means that a lot more biological information may be encoded in DNA than I thought. And if this was done with just nucleotides, without labeling what all the genes are (’this is a CRISPR sequence, that’s a transposon, that’s a chaperone’), then imagine what it could do once we (somehow?) add in all the knowledge we have amassed as humans about biology!
Secondly, being able to generate genomes would be pretty huge. At first I was somewhat indifferent when I heard this team was ‘trying to create the first synthetic phage’, because I thought it was already done, i.e. synthesizing the DNA from a sequence, using something like Twist, IDT or Genscript. But no, this is generating a phage genome sequence that doesn’t exist, then checking if it would work if it were synthesized (by synthesizing the parts, sticking them together, then putting them into a cell to ‘reboot’ the phage). Does this mean I could finally say ‘I want a phage that goes to the bladder and kills all the UTI bugs once it gets there’, and even though I don’t KNOW the features the phage would need to have (non-immunogenic capsid proteins? some sort of not-yet-discovered tag that makes the kidneys let it pass into the bladder? cross-genus host range?), I could generate a genome that would lead to a phage with these features?! (I asked Brian Hie and he said something like ‘not yet but essentially yes’).
Lastly, the model has been released openly, so researchers anywhere can build on this work and take it in new directions. In fact, a hackathon team (Team PhageBook!) did just that, building a phage-host prediction system using the Evo model as a base — in 10 days! Their Evo-based predictions seem to rival some of the new approaches coming out that are backed by a whole bunch of phage bioinformatics and expert annotation… (More on this in a future Phage Pick, I think!).
~ Jess
Paper: https://www.science.org/doi/10.1126/science.ado9336
Nguyen, E., Poli, M., Durrant, M. G., Kang, B., Katrekar, D., Li, D. B., Bartie, L. J., Thomas, A. W., King, S. H., […], & Hie, B. L. (2024). Sequence modeling and design from molecular to genome scale with Evo. Science, 386(6723).
Prediction of strain level phage–host interactions across the Escherichia genus using only genomic information
What is it about?
More phage machine learning! Now I am hooked; I have started to seek these papers out. This paper is by Aude Bernheim’s group in France, and focuses on understanding and predicting how phages interact with E. coli strains. The team assembled a collection of 403 E. coli strains and 96 phages (I was surprised it wasn’t THAT many; I was thinking there had to be like, thousands? tens of thousands?), systematically tested how they interact, and loaded the data into machine learning models. And voila! They were able to predict interactions with 86% accuracy from genomic data alone. They also developed an algorithm for recommending phage cocktails that could target specific E. coli strains, and their cocktails worked! One super interesting finding: based on which of their modeling experiments worked best, they could infer that phage binding to bacteria matters more for successful infection than the bacteria’s defense systems. This was also recently shown in this paper for two phages, using GWAS/transposon mutagenesis methods. Interesting! How often, and for how many phages and species, is this true, I wonder? And what else about phage biology can building these models teach us?
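If (like me) you’re trying to picture what training these models actually looks like: the core setup is binary classification, where each phage-strain pair becomes a feature vector (e.g. presence/absence of receptor and defense genes) with an infects/doesn’t-infect label. Here’s a toy sketch with made-up synthetic data and a simple perceptron — not the paper’s actual features or models, just the shape of the problem:

```python
import random

# Toy sketch: phage-host prediction as binary classification over
# gene presence/absence features. All data here is synthetic.

random.seed(0)
N_FEATURES = 8  # e.g. receptor genes + defense-system genes, one-hot

def make_pair(rule_weights):
    """Generate a synthetic (features, infects?) example."""
    x = [random.randint(0, 1) for _ in range(N_FEATURES)]
    score = sum(w * xi for w, xi in zip(rule_weights, x))
    return x, 1 if score > 0 else 0

# Hidden "biology": receptor-matching features dominate the outcome.
true_w = [2, 2, 1, 1, -1, -1, 0, 0]
train = [make_pair(true_w) for _ in range(400)]
test = [make_pair(true_w) for _ in range(100)]

# Train a perceptron (a stand-in for the paper's ML models).
w = [0.0] * N_FEATURES
b = 0.0
for _ in range(20):
    for x, y in train:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        err = y - pred
        w = [wi + err * xi for wi, xi in zip(w, x)]
        b += err

acc = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
    for x, y in test
) / len(test)
print(f"test accuracy: {acc:.2f}")
```

The real work, of course, is in the bioinformatics that turns 403 genomes into good features, and in testing 403 × 96 interactions in the lab.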
Why I’m excited about it:
This work is exciting because it shows we can predict phage-bacteria interactions accurately from genomic data alone, and we may not actually need thousands of phages/strains to do it. Their phage cocktail recommender system worked great when tested on 100 pathogenic E. coli strains, which is super exciting for phage therapy. This paper is also cool when compared directly to Dimi Boeckaerts’ paper on Klebsiella strain-level prediction (PhageHostLearn), which we covered in a previous Phage Picks issue. That one used a different species and fewer phage-host pairs, a similar-but-different way of scoring phage killing, and different bioinformatics to pull out the genomic features used to train the model. Comparing the two side by side helped me start to internalize some of the decisions you make when building these models. Overall, I love that it seems each month we get a phage-host prediction model for a new species… soon we’ll have covered them all! And then what? I am excited.
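The cocktail-recommendation step is also fun to think about algorithmically. I don’t know the details of their algorithm, but one natural framing (my own illustrative sketch, with hypothetical phage and strain names) is greedy set cover: repeatedly add the phage predicted to kill the most still-uncovered strains.

```python
# Illustrative sketch only (not the paper's algorithm): cocktail
# design as greedy set cover, where each phage "covers" the strains
# the model predicts it infects.

def greedy_cocktail(predicted_hosts, strains, max_phages=3):
    """predicted_hosts: dict mapping phage -> set of strains it infects."""
    uncovered = set(strains)
    cocktail = []
    while uncovered and len(cocktail) < max_phages:
        # Pick the phage that kills the most still-uncovered strains.
        best = max(predicted_hosts,
                   key=lambda p: len(predicted_hosts[p] & uncovered))
        gained = predicted_hosts[best] & uncovered
        if not gained:
            break  # no remaining phage helps any remaining strain
        cocktail.append(best)
        uncovered -= gained
    return cocktail, uncovered

# Hypothetical model predictions for five strains:
hosts = {
    "phiA": {"s1", "s2", "s3"},
    "phiB": {"s3", "s4"},
    "phiC": {"s4", "s5"},
}
cocktail, missed = greedy_cocktail(hosts, ["s1", "s2", "s3", "s4", "s5"])
print(cocktail, missed)
```

Here the greedy pick is phiA (three strains) then phiC (the remaining two), covering everything with a two-phage cocktail — the fun part in real life is that “covers” comes from a model, so the cocktail is only as good as the predictions feeding it.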
~ Jess
Paper: https://www.nature.com/articles/s41564-024-01832-5
Gaborieau, B., Vaysset, H., Tesson, F. et al. Prediction of strain level phage–host interactions across the Escherichia genus using only genomic information. Nat Microbiol 9, 2847–2861 (2024).
Problem choice and decision trees in science and engineering
What is it about?
Not a phage paper, or even a biology paper, but perhaps even more relevant to us all. This is an excellent commentary by Stanford professor Michael Fischbach tackling an often-overlooked aspect of research: how do we choose which problems to work on? He argues that scientists typically rush into execution without spending enough time selecting the right problem to solve (I would agree! We don’t know enough to pick a project, and our PIs want to get us going in the lab as fast as possible — 3-month PhD rotations at US universities do not help with this pressure!).
Dr. Fischbach’s framework for better problem selection is very practical — he suggests spending more time up front on problem choice, using specific techniques he calls “intuition pumps” to generate ideas while avoiding common pitfalls, and evaluating ideas based on both their chance of success and potential impact. He also emphasizes the importance of analyzing your assumptions, developing ideas systematically by fixing one parameter at a time, and being willing to adjust course through what he calls the “altitude dance.” Basically doing science on the way you pick what science to do. Genius.
Of note — I found this paper because of Julia Bauman’s fantastic explainer tweet about it (and it’s also a course at Stanford!) — this was BEFORE I realized she actually works in my building!
Why I’m excited about it:
As someone who has always struggled with project design and project scope (how BIG should a project be? For me, for an undergrad student, for a new grad student I’m mentoring? How ambitious is too ambitious for an R01 grant?), this piece really resonates with me. I think it could really help researchers at any career stage make better choices about what problems to tackle.
~ Jess
Paper: https://www.cell.com/cell/fulltext/S0092-8674(24)00304-0
Fischbach, M. A. (2024). Problem choice and decision trees in science and engineering. Cell, 187.