MetaPhage: Towards user-friendly phage bioinformatics

Issue 193 | September 16, 2022
Capsid and Tail

Cover image generated by DALL-E 2. MetaPhage logo by MetaPhage. DALL-E 2 prompt: “shiny black pedestal extends from a silver floor, the surface is cold and smooth, coast line beach background, impressionist”. Outpainting prompt (right): “Impressionist painting of rolling hills and mountain landscape”. Outpainting prompt (left): “Impressionist painting of galaxy of colorful swirling starscape”.

MetaPhage is a new pipeline for turning raw reads into phage insights. It’s a massive step towards one-click phage analysis. Here’s my account of trying (and failing) to run MetaPhage on a laptop, plus a short tutorial to get you started.

What’s New

Adaptive Phage Therapeutics announces “The AMR RAPID™ Challenge”: APT will pay $1000 to any individual or institution in the clinical research community who can find a bacterium resistant to their investigational Phage Bank.

AMR, Biotech news, Phage library

Paper: MetaPhage: an automated pipeline for analyzing, annotating, and classifying bacteriophages in metagenomics sequencing data.

Bioinformatics Tool, Research paper, Viral metagenomics

Paper: Isolation and characterization of Brochothrix phage ADU4.

Phage characterization, Phage isolation, Research paper

Paper: Structural atlas of the most abundant human gut virus.

Gut phage, Research paper, crAssphage

Paper: Intestinal phages interact with bacteria and are involved in human diseases.

Gut phage, Phage-host interactions, Review

Latest Jobs

Phage Therapy, Research Assistant
Research Assistant at the National University of Singapore to work on an EDB-IPP funded project that focuses on bacteriophage therapy.
Assistant Professor, Phage-biofilm interactions
Research Assistant Professor in Infection Basic Science Research at The Department of Orthopaedic Surgery, University of Pittsburgh School of Medicine. The Arthritis and Arthroplasty Design Group is a basic science and clinical research group with ongoing projects investigating the high tolerance of biofilm to antibiotics, pathogen and host response, and bacteriophage therapy.
AMR, Microbial ecology, Post Doc
Postdoc at Marquette University (Milwaukee, Wisconsin) to work with multiple research groups in the WQC and in biology focusing on antibiotic resistance and microbial ecology in water reuse. Experience in virus/phage inactivation and/or quantification is highly desirable.
Novel antibacterial targets, PhD project
PhD fellowship in phage-guided discovery of novel antibacterial targets.
RNA biology, Scientist
Research Engineering/Scientist Associate III at Dell Medical School, U Texas at Austin to contribute to research efforts, namely noncoding-RNA biology, RNA-based therapeutics for pediatric neuroblastoma and other cancers, and cancer vaccine technology, and to perform bacterial work, including relevant E. coli and bacteriophage strains.

Community Board

Anyone can post a message to the phage community — and it could be anything from collaboration requests, post-doc searches, sequencing help — just ask!

The 1st hybrid event co-organised by Phage-UK will be held in Oxford, UK on Friday 7th October 2022 from 17:30 to 20:00.

Very generously hosted by the #saidbusinessschool in Oxford, UK, co-organised with Will Battersby, film director, and Diane Shader Smith, mother of the late patient featured in the film. The in-person evening will start at 17:30 with the screening of Salt in My Soul, a powerful and touching film featuring Mallory, a patient with cystic fibrosis, who sadly could not be treated early enough with bacteriophages to survive an antibiotic-resistant super-bug.

The screening will be followed by a panel discussion on bacteriophage therapy in the UK, shared live, online, from 19:15.

If you would like to attend the event in person in Oxford at 17:30, kindly send an email to [email protected] directly.

For those who are further away and would like to attend the event remotely, please register for the online event.

Film screening, Hybrid, Panel, Phage Therapy

MetaPhage: Towards user-friendly phage bioinformatics

Product designer and co-founder of Phage Directory
Iredell Lab, Phage Directory, The Westmead Institute for Medical Research, Sydney, Australia, Phage Australia
Twitter @yawnxyz
Skills

Bioinformatics, Data Science, UX Design, Full-stack Engineering

I am a co-founder of Phage Directory, and have a Master of Human-Computer Interaction degree from Carnegie Mellon University and a computer science and psychology background from UMBC.

For Phage Directory, I take care of the product design, full-stack engineering, and business / operations aspects.

As of Feb 2022, I’ve recently joined Jon Iredell’s group in Sydney, Australia to build informatics systems for Phage Australia. I’m helping get Phage Australia’s phage therapy system up and running here, working to streamline workflows for phage sourcing, biobanking and collection of phage/bacteria/patient matching and monitoring data, and integrating it all with Phage Directory’s phage exchange, phage alerts and phage atlas systems.

I’ve been curious about phage bioinformatics since the beginning of Phage Directory. We’ve hosted several PHAVES talks about it, I’ve met lots of great phage bioinformaticians (shout out to Evelien, Simon Roux, Bas, Ramy, Andy Millard, Nouri, Stephen Pollo, the CPT and TAILOR teams, and tons more people!), and I’ve learned a ton.

I’ll also admit that I find it all super confusing. If you told me to “do some phage bioinformatics” on a fastq file, it might as well be bioinforMagics. I understand the gist though: you take the raw files from the sequencer, assemble them into genomes, run tools to identify “virus-looking parts,” and compare them to similar “virus-y parts” to get the context of the sample. I get the concept, but I can’t put it into practice.

There are a ton of tools out there that tackle all parts of phage analysis, from assembly to virus identification and classification, but they don’t speak the same language (i.e. they don’t read and write the same file formats), so they don’t really work together. They also rely on different dependencies, like different programming languages and software, which can be a headache. To my understanding, bioinformaticians spend a large chunk of time kludging various home-grown tools together into a pipeline. I definitely don’t have the knowledge (or patience) to work with these.

Figure 1. Flowchart of all the fancy tools MetaPhage uses to make sense of a sample. I have no idea what most of these do, but I trust they’ll get the job done.

This is where MetaPhage (MP) comes in. In the authors’ words, MP is “a fully automated computational pipeline for quality control, assembly, and phage detection as well as classification and quantification of these phages in metagenomics data”. All that is to say: MP connects the various tools together so you can just give it a file, and out pops some understanding of the phage. This is a massive step in the right direction, as a tool like this will make it easy for anyone to run a “quick and dirty analysis” of any sequenced phage.

MetaPhage Resources

Paper: https://journals.asm.org/doi/10.1128/msystems.00741-22

Documentation: https://mattiapandolfovr.github.io/MetaPhage/

Github: https://github.com/MattiaPandolfoVR/MetaPhage

Docker Image: docker pull andreatelatin/metaphage:1.0

High-quality generated outputs

MetaPhage uses MultiQC to create an HTML report of the findings. This makes it easy for bioinformaticians to put the results on the web, and for others to access them. Bioinformaticians don’t have to painstakingly assemble graphs, plots and PowerPoint presentations from scratch: upload the HTML output to a server, and the results are online.

Below is an example of the MultiQC output of a phage:

Figure 2. An example of the data that MetaPhage is able to output.

As someone with no background in bioinformatics or biology, most of this output is foreign to me, like reading a company’s quarterly earnings report with no context of the business or its financials. I understand that these all say stuff about the phage, but I can’t really contextualize it. I can’t tell from this report whether a phage is able to infect a certain host, whether it encodes certain toxin-generating genes, or even whether it’s lytic or temperate, but someone more skilled might be able to.

This reminds me of weather apps. Weather data is messy, and weather forecasting is very, very hard. Yet most weather apps and forecasters present the “narrative” of what the weather could be doing, to the best of their abilities, with a decent level of confidence. As long as they also report their level of confidence, I’m going to take their word for it. Of course, I still have the ability to dig into the data like pressure, wind systems, satellite imagery, etc., but most of it is meaningless to me as a non-meteorologist. This is what I think bioinformatics tools could move towards: layers of abstraction, built from scripts with reasonable defaults, for those who are less familiar with the field and want a quick answer. This has the benefit of leaving bioinformaticians more time and energy to answer the harder biological questions (instead of being flooded with simple but time-consuming requests from biologists).

On the path to One-Click phage analysis…

MetaPhage is exciting because it takes all the grunt-work of setting up and connecting the tools out of the picture. It’s like buying a car without needing to assemble the parts. You can just turn the key and drive. Software that’s easy to use is usually referred to as having a good user experience, or good UX. Similarly, “good user experience” also exists for software meant to make software development easier. This is called DX, for developer experience.

The software engineering world is full of tools that make writing software easier. A long time ago I worked with the team behind Microsoft’s Visual Studio, a tool designed to help developers write code more easily. The biology and bioinformatics world is sorely lacking in these tools, but MetaPhage and a few other tools like KBase.org, BV-BRC (formerly PATRIC) and Galaxy are setting out to change that. These are starting to provide a better BX (bioinformatician experience?), and that’s exciting for newbies to the field like me.

…but one step at a time

This article started out as a “How to get started with MetaPhage” tutorial. I was going to get it running on my Macbook Pro, take screenshots along the way, and do a writeup of how I got my first phage analyzed. After six hours of setup and 10+ hours of running on a single 30MB phage sequence file, I was ready to create a successful phage report… only for MetaPhage to crash because I messed up the configuration file!

Figure 3. This error came from a misconfigured config file. 100% my error, but there goes 10 hours of computation time. And it crashed at the very, very end. Oops!

I think I fixed the file, but it seems like MetaPhage will restart the entire process. One of the most annoying aspects of software engineering is making sure all the configurations are correct. You might have seen big web service providers like Amazon AWS and Github go down for several hours at a time, or Atlassian go down for a couple of weeks, often because of configuration mistakes. It seems like bioinformatics sits at the intersection of wrestling with config files, dealing with brittle underlying scripts, and working with the general uncertainty and complexity of biological systems. It’s really hard!
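
On the restart problem: Nextflow keeps the results of finished tasks in its work folder, and nextflow run accepts a -resume option that reuses those cached results instead of recomputing them. Whether a MetaPhage run picks up cleanly after a config change is something I have not verified, so treat this as a sketch. The helper name build_nextflow_command is made up for illustration; the command it builds mirrors the run command used later in this article.

```python
import shlex


def build_nextflow_command(config_file, docker_image, resume=False,
                           nextflow="./nextflow", pipeline="main.nf"):
    """Return the Nextflow command line as a list of arguments.

    build_nextflow_command is a hypothetical helper, not part of MetaPhage.
    """
    cmd = [nextflow, "run", pipeline,
           "-c", config_file,
           "-with-docker", docker_image]
    if resume:
        # -resume asks Nextflow to reuse cached task results from its work folder
        cmd.append("-resume")
    return cmd


if __name__ == "__main__":
    cmd = build_nextflow_command("demo.conf", "andreatelatin/metaphage:1.0",
                                 resume=True)
    # Print the command so it can be reviewed and pasted into Terminal.
    print(shlex.join(cmd))
```

To launch the run from Python instead of pasting the command, the same list can be handed to subprocess.run(cmd, check=True).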

Trying this at home requires patience…

If you do want to try it at home, the requirements for MetaPhage are quite high. Ideally you should have at least 80GB of free space. Nextflow, Docker, and the databases require about 35GB of space, plus another 35GB of swap space, for files to copy over. Each run seems to output around 2-3GB of files. MetaPhage also highly recommends a 16 core CPU with at least 64GB of RAM (the documentation says “required” but you can work around this). I ran this on my 2019 Intel CPU Macbook Pro, with 6 cores and 16GB of RAM, which is woefully inadequate. I think if I got a Macbook with an M2 Pro or Max processor it’d go a lot faster. Of course it’d be much faster on cloud compute, cluster or HPC!

Much technical documentation assumes background knowledge, and more or less ends up being a “draw the rest of the owl” meme. I’ll try to fill in some more context in my writeup.

Here are my notes and files for running MetaPhage on a Mac. Remember, I didn’t get it running correctly, so your mileage may vary here. MetaPhage doesn’t work on Windows, sorry.

But if something doesn’t make sense, remember to Read the Fantastic Manual: https://mattiapandolfovr.github.io/MetaPhage/

OS Requirements: Linux or Mac OS environment; I’m on MacOS Big Sur

Disk Space Requirements: At least 80GB for swap space, DB downloads, and apps such as Docker. I had 50GB initially and ran out of space at the DB download step. Plus, Nextflow seems to create a work folder for file caching, which takes an additional 27GB. I then moved several files off to an external hard drive and freed up 150GB, which seems more than sufficient.

Ideal Hardware Requirements: 16 core CPU, 64GB RAM; I’m on the last Intel CPU, but an M2 Macbook Pro might work. No idea, but let me know if it does.
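
Since I only found out I was short on disk space halfway through the database download, a quick check up front might save some pain. Below is a minimal sketch that compares the machine against the numbers quoted above (80GB free disk, 16 cores, 64GB RAM). The function names preflight and total_ram_gb are my own, not part of MetaPhage, and the RAM lookup is best-effort: it is skipped on systems where Python cannot report it.

```python
import os
import shutil

GB = 1024 ** 3


def total_ram_gb():
    """Best-effort total RAM in GB; returns None where sysconf can't tell us."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB
    except (AttributeError, ValueError, OSError):
        return None


def preflight(path=".", need_disk_gb=80, want_cpus=16, want_ram_gb=64):
    """Return a list of human-readable warnings (empty list means all good)."""
    warnings = []

    # Free space on the disk that holds `path` (the MetaPhage folder).
    free_gb = shutil.disk_usage(path).free / GB
    if free_gb < need_disk_gb:
        warnings.append(f"Only {free_gb:.0f}GB free on disk; want {need_disk_gb}GB")

    # Logical CPU count as seen by Python.
    cpus = os.cpu_count() or 1
    if cpus < want_cpus:
        warnings.append(f"Only {cpus} CPUs; MetaPhage recommends {want_cpus}")

    # RAM check is skipped when the total could not be determined.
    ram_gb = total_ram_gb()
    if ram_gb is not None and ram_gb < want_ram_gb:
        warnings.append(f"Only {ram_gb:.0f}GB RAM; MetaPhage recommends {want_ram_gb}GB")

    return warnings


if __name__ == "__main__":
    for warning in preflight():
        print("WARNING:", warning)
```

On a laptop like mine this prints warnings for CPU and RAM, which is expected. The point is to catch the disk space problem before a 34GB download fails.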

General Steps + Getting Software Requirements for MacOS, using Docker:

  1. Download, install, set up, and run Docker App for Mac: https://docs.docker.com/desktop/install/mac-install

  2. Download and install Java. This is required for Nextflow. I like the Open JDK version here. I chose “OpenJDK 16” for MacOS x64. There’s no ARM option for M1 and M2 processors yet, but this option should still work: https://adoptopenjdk.net/archive.html?variant=openjdk16&jvmVariant=hotspot

  3. Open the Terminal. If this is your first time, go to Spotlight (or CMD+Space) and type Terminal. Hit enter. Homebrew (brew) and wget are useful tools for your Mac: brew is essentially a “free app store” for the command line, and wget downloads files from the web. Together they let you install apps like Nextflow, python, etc.

    1. Install Homebrew by pasting this into Terminal and hitting enter: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    2. Download and install wget using brew by pasting this into Terminal and hitting enter: brew install wget
  4. Now you’re almost ready to get started. Let’s get the folder and Nextflow set up.

    1. Create a new folder (I created a folder on desktop called MetaPhage), and navigate to this folder in terminal. If you created a MetaPhage folder on Desktop, type this to navigate: cd $HOME/Desktop/MetaPhage
    2. Download and install Nextflow, which is a system for writing computational data pipelines for fields like bioinformatics. Copy, paste, “enter”: wget -qO- https://get.nextflow.io | bash. This should result in a nextflow executable appearing in the folder. If this step fails for you like it did for me: go to Apple logo (top left) > System Preferences > Security & Privacy > Full Disk Access > click the lock icon (bottom left) and grant full access to Terminal. If the wget command keeps failing, try curl: curl -s https://get.nextflow.io | bash
    3. Now, check if Nextflow is running properly by running the tutorial demo. If you haven’t worked with text files, I highly recommend downloading Sublime Text: https://www.sublimetext.com/blog/articles/sublime-text-4. This lets you easily create, inspect, and edit text files like scripts and code.
    4. Create a new file in the MetaPhage folder called tutorial.nf and copy the tutorial code as described in “Your first script” from the Nextflow documentation. In terminal, type nextflow run tutorial.nf to run the tutorial. You should see “HELLO WORLD!” appear. If it’s not working, try adding ./ to the commands. This tells the system to find the relevant files from within the same folder, e.g. ./nextflow run ./tutorial.nf
  5. Ok now we’re finally ready for MetaPhage! Let’s download MetaPhage as a Docker image. This is what Nextflow will use to run MetaPhage. It’s a ~2GB download, and ~6GB when installed. Copy, paste and “enter” this in Terminal: docker pull andreatelatin/metaphage:1.0

  6. The next few steps require you to download quite a few things to set up MetaPhage: the Github repo with all the scripts (300MB), the databases required for MetaPhage (~34GB), and, if you want to run the demo, the MetaPhage demo files (~6GB). The demo is way too big of a run for my laptop to handle though, so I created my own smaller test (which failed to run, but my config might get you started).

    Note: to save you (and future me) some time, I’ve created a zip file of the MetaPhage Github folder, with all necessary databases downloaded, and the failed demo run file. (I’ve removed the official demo to save on space). This zip file is 15GB compressed, and the folder is 35GB decompressed.

    Grab it from my Dropbox to save yourself the effort of doing the next few steps: https://www.dropbox.com/s/zdx7m88zwyexd7n/MetaPhage-Starter.zip. Don’t forget to edit the parameters in the configuration files (.conf) to fit your system! Otherwise, here are the steps to reproduce:

    1. Download and install Github CLI (command line interface). Github CLI lets you download code repositories from Github. This is useful if you want to get the source scripts for MetaPhage. Copy, paste, “enter”: brew install gh. Then clone the Github folder into your current MetaPhage folder: gh repo clone MattiaPandolfoVR/MetaPhage .
    2. Download the necessary databases for MetaPhage (~34GB). Remember you need ~70GB for swap space before you download, or it might fail like it did for me, and then you have to restart the download… Here’s how to grab them: ./bin/python/db_manager.py -o ./db/ -m 6
    3. Get the MetaPhage example if you want. It’s 6GB. For reference, a 35MB example took my laptop 12 hours to run. This would take about 240 hours to run. Good luck lol. To get the example:
      1. Download example from script: ./bin/getExample.py --verbose -t 8

      2. Create the config from script:

        python ./bin/newProject.py -i demo \
            -m demo/infant-metadata.csv \
            -v Infant_delivery_type \
            -s demo.conf
        
      3. Run the example with Docker: ./nextflow run main.nf -c demo.conf -with-docker andreatelatin/metaphage:1.0

  7. Now you’re ready to run your own example! I just grabbed a random Klebsiella phage called “FairDinkum” (from Andy Millard’s lab) from NCBI’s Sequence Read Archive: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR17231376&display=download

    1. Create a folder called klebtest and move the fastq file in there. This is probably where I went wrong, so things might not work from here on out: you need to duplicate the fastq file and name the copies SRR17231376_1.fastq.gz and SRR17231376_2.fastq.gz. Here’s where my lack of bioinformatics knowledge shines, as I don’t know why there are two read files, or really how the mechanism works. I also don’t know why SRA only gives you one file to download.

    2. Create a metadata folder in the klebtest folder, containing a csv file with this, just as a test:

      Sample,Source_URL
      SRR17231376,https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR17231376&display=download
      
    3. Create a klebdemo.conf file with the configurations. Notice the max_cpus and max_memory params:

      params {
          max_cpus = 6
          max_memory = 16.GB
          max_time = 72.h
          config_profile_name = 'Mini Klebsiella Phage MetaPhage project'
          config_profile_description = 'MetaPhage analysis configuration'
          // INPUT PATHS
          readPath = "/Users/janzheng/Desktop/projects/metaphage/MetaPhage/klebtest"
          fqpattern = "_{1,2}.fastq.gz"
          metaPath = "/Users/janzheng/Desktop/projects/metaphage/MetaPhage/klebtest/metadata"
          dbPath = "/Users/janzheng/Desktop/projects/metaphage/MetaPhage/db"
          // OUTPUT/WORKING PATHS
          outdir = "/Users/janzheng/Desktop/projects/metaphage/MetaPhage/MetaPhage-klebtest-output"
          temp_dir = "/private/tmp"
          // METADATA 
          metadata = true
          virome_dataset = true
          singleEnd = false
          src_url = "Source_URL" 
      }
      
    4. Change the params setting in the nextflow.config file to also include max_cpus and max_memory. You might also need to change every cpus = # entry to match your number of CPUs. Note that the number of CPUs and the amount of memory should match what is allotted to Docker. To increase that allotment, open the Docker App > Preferences. Do NOT drag the CPU slider all the way to the right; it’ll make things extra painful for you. This step is crucial, as it lets Nextflow and MetaPhage make an exception for my rinky dink machine. It also means you’re going to wait a really long time for MetaPhage to finish, but hey, at least it runs!

    5. Ok now you’re ready to run your own test. Type this: ./nextflow run main.nf -c klebdemo.conf -with-docker andreatelatin/metaphage:1.0 and it should run! Wait 12 hours and hope your run doesn’t crash! If you’ve figured out where I’ve gone wrong, please tell me ([email protected]), so I can improve this writeup later on.
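
Given that my run died at the very end because of a bad config, it seems worth checking the .conf file before committing to a 12-hour wait. Here is a minimal sketch that reads the simple key = value lines from a params block like klebdemo.conf, confirms the input paths exist, and looks for samples that are missing one of their two read files. A note on those two files: in paired-end sequencing each DNA fragment is read from both ends, and the two sets of reads are usually stored as _1 and _2 files. My assumption is that singleEnd = false with fqpattern = "_{1,2}.fastq.gz" expects exactly that. The function names and the hard-coded file suffixes are mine, and this is not a full Nextflow config parser.

```python
import re
from pathlib import Path

# Matches simple "name = value" lines like the ones in the params block.
PARAM_RE = re.compile(r"^(\w+)\s*=\s*(.+)$")


def read_params(conf_text):
    """Pull simple key/value pairs out of a Nextflow-style params block."""
    params = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("//"):  # skip comment lines
            continue
        match = PARAM_RE.match(line)
        if match:
            # Drop surrounding whitespace and quotes from the value.
            params[match.group(1)] = match.group(2).strip().strip("\"'")
    return params


def unpaired_samples(read_dir, r1="_1.fastq.gz", r2="_2.fastq.gz"):
    """Return sample names that have only one of the two read files."""
    read_dir = Path(read_dir)
    first = {f.name[:-len(r1)] for f in read_dir.glob("*" + r1)}
    second = {f.name[:-len(r2)] for f in read_dir.glob("*" + r2)}
    return sorted(first ^ second)


def check_config(conf_text):
    """Return a list of problems found; an empty list means nothing obvious."""
    params = read_params(conf_text)
    problems = []

    # Every input path named in the config should exist before the run starts.
    for key in ("readPath", "metaPath", "dbPath"):
        value = params.get(key)
        if value is None:
            problems.append(f"{key} is not set")
        elif not Path(value).exists():
            problems.append(f"{key} does not exist: {value}")

    # For paired-end runs, each sample needs both a _1 and a _2 file.
    read_path = params.get("readPath")
    if (params.get("singleEnd") == "false"
            and read_path and Path(read_path).is_dir()):
        for sample in unpaired_samples(read_path):
            problems.append(f"{sample} is missing its _1 or _2 read file")

    return problems
```

To use it, call check_config(Path("klebdemo.conf").read_text()) from the MetaPhage folder and print whatever comes back. It only catches missing paths and missing read files. It cannot tell whether two files are a genuine read pair, so duplicating one fastq file (as I did) would still pass this check.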

Parting Thoughts

If MetaPhage makes phage bioinformatics much, much easier, I don’t want to know what it’s like without MetaPhage. It moves the entire field in the direction of one-click phage analysis, where time and energy are spent on analyzing results, not on stringing the tools together. Even though getting MetaPhage and Nextflow running properly isn’t a trivial task today, MetaPhage and tools like KBase.org, BV-BRC (formerly PATRIC) and Galaxy are making great strides towards user-friendly phage analysis and bioinformatics.

With that said, though MetaPhage lets someone like me run these tools, I still lack the deep understanding of biology and bioinformatics to make sense of any of the data output. Easy-to-use bioinformatics tools are still not a replacement for understanding the underlying biology, statistics, and data analysis. Seems like we can’t get rid of our resident bioinformaticians quite yet!

Capsid & Tail

Follow Capsid & Tail, the periodical that reports the latest news from the phage therapy and research community.

We send Phage Alerts to the community when doctors require phages to treat their patients’ infections. If you need phages, please email us.

In collaboration with

Mary Ann Liebert PHAGE

Supported by

Leona M. and Harry B. Helmsley Charitable Trust
