On Data Analytics and Biology

Issue 204 | December 9, 2022
14 min read
Capsid and Tail

This week, Jan takes a detour into data analytics! This follows up on the last informal Data Series post, on designing an auditable, replicable, and accessible phage data system. This time, Jan delves a bit deeper into what data analytics is and what tools are used, and connects it back to the phage realm.

What’s New

In a new paper in PLOS Computational Biology, Quentin J. Leclerc (Institut Pasteur) and colleagues use mathematical modelling to investigate the conditions under which phages and antibiotics act in synergy to remove bacteria or drive antimicrobial resistance evolution.

AMR, Mathematical modelling, Research paper

A new preprint by Madison Stellfox (University of Pittsburgh) and colleagues reports on the use of phage therapy to treat recurrent E. faecium bloodstream infections in a patient.

Phage therapy, Research paper

A new paper in Nature Communications by Lei Tian (McMaster University) and colleagues describes a new method for producing phage microgels, which could be used as sprayable antimicrobials targeting multidrug-resistant bacteria.

Biomaterials, Research paper

The UK Government’s Commons Science and Technology Committee announces the terms of reference for a new inquiry exploring the potential of bacteriophages as an alternative to antibiotics.

Phage therapy, Regulatory

In a new opinion piece in the Boston Globe, Kevin Outterson and Henry Skinner discuss the global threat of antibiotic resistance.

Antibiotic resistance, Opinion

Latest Jobs

Student teaching, Technician
The Department of Genetics & Genome Biology at the University of Leicester is looking to recruit a technician as part of their Professional Services team, providing support to the student teaching programme and research laboratories. In this role, you will oversee the day-to-day operation of the recently created Phage Research Centre.
Baylor College of Medicine is seeking a Research Technician II to work on various projects in the Department of Molecular Virology and Microbiology.
Antimicrobial Resistance, Mathematical modelling, PhD Project
This MRC-funded project, led by Dr Gwen Knight (LSHTM) and Professor Jodi Lindsay (SGUL), will use mathematical modelling and data analysis to quantify the distributions of resistance in the clinically important bacterium methicillin-resistant Staphylococcus aureus (MRSA).
Research Fellow
Monash University is hiring a research fellow in microbiology to work on bacteriophages and bacteria such as Klebsiella and E. coli.
Infectious Disease Dynamics, Post Doc
The Theoretical Biology Group at the Institute of Integrative Biology ETH Zurich is inviting applications for a postdoctoral position in infectious disease dynamics co-supervised by Professors Sebastian Bonhoeffer and Roland Regoes.
Structural biology, Project Assistant
Umeå University offers a position as a Project Assistant for 12 months to study replication of staphylococcal bacteriophages using a multidisciplinary approach: genetics, biochemistry and structural biology, such as electron cryomicroscopy (cryoEM) and X-ray crystallography.

Community Board

Anyone can post a message to the phage community — and it could be anything from collaboration requests, post-doc searches, sequencing help — just ask!

Check out this YouTube video of Rob Lavigne’s Phage Oxford 2022 talk.

Phage talks, Video

On Data Analytics and Biology

Product designer and co-founder of Phage Directory
Co-founder, Product Designer
Iredell Lab, Phage Directory, The Westmead Institute for Medical Research, Sydney, Australia, Phage Australia
Twitter @yawnxyz

Bioinformatics, Data Science, UX Design, Full-stack Engineering

I am a co-founder of Phage Directory, and have a Master of Human-Computer Interaction degree from Carnegie Mellon University and a computer science and psychology background from UMBC.

For Phage Directory, I take care of the product design, full-stack engineering, and business / operations aspects.

As of Feb 2022, I’ve recently joined Jon Iredell’s group in Sydney, Australia to build informatics systems for Phage Australia. I’m helping get Phage Australia’s phage therapy system up and running here, working to streamline workflows for phage sourcing, biobanking and collection of phage/bacteria/patient matching and monitoring data, and integrating it all with Phage Directory’s phage exchange, phage alerts and phage atlas systems.

First of all, what is “data analytics”? Most companies use data analytics (sometimes called product analytics) to understand product usage and user habits. Companies usually have questions like: How popular are our blog posts? (“How to design a phage data system,” the last post in this series, has a 45.9% open rate, meaning 600 people opened it, which is really good!) What part of our site is confusing? What areas of the site are popular? This kind of information helps companies decide what to improve, what to build, and what future posts to write.

Data-driven decisions

Many internet companies that sell a product or service (i.e. not us 🙃) intentionally design their sites for a certain behavior. Users visit their site, poke around, and if they like whatever is being sold, they “convert” by signing up or adding to cart and checking out. This is a “product funnel”, and the “conversion rate” is the percentage of people who visit the website and successfully convert.
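
To make the funnel arithmetic concrete, here is a minimal sketch in Python. The step names and counts are made up for illustration (they are not real numbers from any site), and the function names are my own:

```python
# Hypothetical funnel: (step name, number of users who reached it).
# These counts are invented purely for illustration.
funnel = [
    ("visited site", 1000),
    ("viewed product", 400),
    ("added to cart", 100),
    ("checked out", 50),
]


def step_conversion_rates(steps):
    """Return (from_step, to_step, rate) for each adjacent pair of steps."""
    rates = []
    for (prev_name, prev_count), (name, count) in zip(steps, steps[1:]):
        rate = count / prev_count if prev_count else 0.0
        rates.append((prev_name, name, rate))
    return rates


def overall_conversion_rate(steps):
    """Share of users entering the funnel who reach the final step."""
    first, last = steps[0][1], steps[-1][1]
    return last / first if first else 0.0


for prev_name, name, rate in step_conversion_rates(funnel):
    print(f"{prev_name} -> {name}: {rate:.1%}")
print(f"overall: {overall_conversion_rate(funnel):.1%}")
```

The per-step rates show where users drop off; the overall rate is the single “conversion rate” number described above.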

Fig. 1. A product funnel, with conversion rates. Credit: technically.dev https://technically.substack.com/p/how-do-product-analytics-work

Analytics is at the heart of many companies: it tracks how sales are doing, and helps project either future success (e.g. “growth”) or, in this economy, how much time is left until the clock runs out (e.g. revenue decline vs. how much cash is left in the bank). These numbers generally paint a picture of a company’s health, and executives often make decisions based on them. For example, Netflix has been losing revenue and subscribers, and consequently they’ve cut their animation team, probably because of weak viewership and expensive production costs (despite making some of the best animation series in TV history).

What does data analytics look like? The data takes many shapes, from “raw data” to more “refined data”. Raw data for internet companies usually looks like a list of “events”, e.g. “a user visited the ‘about page’ at 3.05pm on October 11, 2022”. More refined data, usually generated by a tool or with some code (e.g. using a SQL script to pull data from the database), could look like a list of users who visited product pages for that day, week, or month. Or a list of the top 10 pages, broken down by day of the year. If the site sold costumes, it might show an uptick in Halloween-related site visits and activities in the weeks or month prior to Halloween.
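
As a rough sketch of that raw-to-refined step, the Python below turns a handful of made-up events into a “top pages” list and a per-day list of product-page visitors. The event fields (`user`, `page`, `at`) and the sample rows are assumptions for illustration, not a real analytics export:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events, invented for illustration only.
events = [
    {"user": "u1", "page": "/about", "at": "2022-10-11T15:05:00"},
    {"user": "u2", "page": "/products/witch-hat", "at": "2022-10-11T15:07:00"},
    {"user": "u1", "page": "/products/witch-hat", "at": "2022-10-11T15:09:00"},
    {"user": "u3", "page": "/about", "at": "2022-10-12T09:30:00"},
    {"user": "u2", "page": "/products/witch-hat", "at": "2022-10-12T10:00:00"},
]


def top_pages(raw_events, n=10):
    """Refined view 1: the n most-visited pages with their visit counts."""
    counts = Counter(event["page"] for event in raw_events)
    return counts.most_common(n)


def daily_visitors(raw_events, prefix="/products"):
    """Refined view 2: for each day, the users who visited a page under prefix."""
    by_day = {}
    for event in raw_events:
        if event["page"].startswith(prefix):
            day = datetime.fromisoformat(event["at"]).date()
            by_day.setdefault(day, set()).add(event["user"])
    return by_day


print(top_pages(events))
for day, users in sorted(daily_visitors(events).items()):
    print(day, sorted(users))
```

In practice this kind of aggregation is usually done with a SQL query or an analytics tool rather than hand-written loops, but the shape of the transformation is the same.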

Raw data will at some point end up as a series of charts and graphs in a presentation or Excel sheet somewhere, in front of a room of executives or board members. Charts, graphs, and presentations are a kind of data, but what I would consider higher-order data — data that communicates information, but can’t be manipulated with math.

In the “real world”, depending on the company and product, data usually comes in the form of events, database data (like user profiles and interactions), user-generated data like tweets and comments, images, videos, spreadsheets, devices, and a ton more formats.

Who participates in data analytics within an internet company?

  • Executives make long-term decisions based on analytics: “where should we invest $ based on who buys (or uses) X?”
  • Product managers make product decisions: “Who uses X and who doesn’t? Where do we spend our efforts?”
  • Marketing and sales are measured against analytics: “Are sales increasing? Where are conversions coming from, e.g. Instagram ads, Facebook, or Google? What growth strategies are working?” “How much are we spending on ads per new user? (Also called ‘Customer Acquisition Cost’)”
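
Customer Acquisition Cost is simple arithmetic: ad spend divided by the number of new customers it brought in. A minimal Python sketch, with hypothetical spend and customer counts per channel (none of these figures are real):

```python
def customer_acquisition_cost(ad_spend, new_customers):
    """Ad spend per newly acquired customer."""
    if new_customers <= 0:
        raise ValueError("need at least one new customer to compute CAC")
    return ad_spend / new_customers


# Hypothetical monthly figures: channel -> (ad spend in dollars, new customers).
channels = {
    "Instagram": (1200.0, 40),
    "Facebook": (900.0, 20),
    "Google": (1500.0, 60),
}

cac_by_channel = {
    name: customer_acquisition_cost(spend, customers)
    for name, (spend, customers) in channels.items()
}

# List channels from cheapest to most expensive per new customer.
for name, cac in sorted(cac_by_channel.items(), key=lambda item: item[1]):
    print(f"{name}: ${cac:.2f} per new customer")
```

Comparing this number across channels is one way a marketing team might answer “where are conversions coming from, and at what cost?”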

Data teams and the tools behind them

Even though data is at the core of many companies, data roles have often been ill-defined. Some companies have data-focused teams like data science, data engineering, and data analytics, while others have business intelligence, product analyst, and marketing analyst teams.

Ultimately there are two sides to a data team: (1) the ability to collect, build, and manage reliable data sets (Apple stores 8 exabytes, or 8 million TB, of data) and (2) the ability to translate all that data into stories for product teams and leadership, by understanding their needs. These are two different jobs. The first job needs engineering and statistics; the second needs an understanding of core products, business strategy, and leadership needs (but also statistics… everyone needs statistics).

Fig. 2. This is like 1% of the data tools out there a company might use. We don’t really use any of them. Credit: technically.dev https://technically.dev/posts/what-your-data-team-is-using

Collecting data to tell cohesive stories that drive strategy is really hard. That’s why there are so many available data tools. Some of them are small (like Plausible, which we use for site analytics), and others are massive, like Segment (founded in 2012; acquired for $3.2B) and Snowflake (founded 2012; trading as $SNOW with a $50B market cap). These tools all specialize in different parts of the data analytics pipeline, and many are used in conjunction. Together they make up the “data analytics stack”. Plausible collects site analytics; Mailchimp, which we use as a newsletter sender, collects email open rates; Segment connects those events into data storage; dbt takes data from different sources and makes it interchangeable and comparable; Databricks, Snowflake, BigQuery, SQL, Amplitude, Hex, ObservableHQ, etc. all help make sense of the data in different ways. Python, R, and Excel are also used at this stage!
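
To illustrate what “making data interchangeable and comparable” means, here is a small Python sketch that maps records from two differently shaped sources onto one common schema. The field names, date formats, and values are all hypothetical; they are not the real export formats of Plausible, Mailchimp, or any other tool, and this is not how dbt itself is written (dbt models are typically SQL):

```python
from datetime import datetime

# Hypothetical records from two sources with different shapes.
site_rows = [{"url": "/example-post", "visitors": 350, "day": "2022-12-09"}]
email_rows = [{"campaign": "example-post", "opens": 600, "sent_on": "09/12/2022"}]


def normalize_site(row):
    """Map a (hypothetical) site-analytics row onto the common schema."""
    return {
        "source": "site",
        "content_id": row["url"].strip("/"),
        "date": datetime.strptime(row["day"], "%Y-%m-%d").date(),
        "metric": "visitors",
        "value": row["visitors"],
    }


def normalize_email(row):
    """Map a (hypothetical) email-campaign row onto the common schema."""
    return {
        "source": "email",
        "content_id": row["campaign"],
        "date": datetime.strptime(row["sent_on"], "%d/%m/%Y").date(),
        "metric": "opens",
        "value": row["opens"],
    }


combined = [normalize_site(r) for r in site_rows]
combined += [normalize_email(r) for r in email_rows]

for record in combined:
    print(record)
```

Once every source shares the same columns and date type, the records can be stored in one table and compared directly, which is what the later analysis tools rely on.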

There’s also a small but notable difference between data analytics and data science, which could involve exploring data, looking for causality, patterns, and clues in the pursuit of new ideas; and let’s not forget ML engineering… This is how I understand the difference: data analytics uses data to drive organizational and strategic decisions, while data science uses data to gain more understanding of the world.

If you want to go much deeper on the technical stack, read this excellent deep dive on technically.dev!

Data analytics, biology R&D, and tools galore

Despite obvious differences, biology R&D has a few parallels with data analytics. Fundamentally, both fields deal with increasing amounts of data, a growing need to make sense of it, and an eagerness to turn data into decisions. And like data analytics, more tools are becoming available for biology R&D.

Of course, biology data looks different, from LIMS and ELN data, to images of plates, to a mix of outputs created by devices and methods like sequencing, TEM, optical density, and OpenTrons/liquid handlers. The list goes on. But fundamentally the shape of the data looks the same: you have text files of various formats, spreadsheets and tables, images, and sometimes video and code.

And like data analytics, the goal is to turn the data into insights and strategic decisions, like drug candidates for drug discovery. A whole slew of tools for analyzing and predicting data with machine learning are coming online, like DeepMind’s AlphaFold (and its excellent AlphaFold Protein Structure Database).

Fig. 3. A variety of software for the life sciences. This is just a small representation of all the tools out there. Credit: https://vitalsignshealth.substack.com/p/the-changing-world-of-life-sciences

The wave of bio R&D tools isn’t limited to drug discovery, though. Many data analytics and engineering tools are increasingly being adopted by biology. Tools like Nextflow (and MetaPhage) are using containerization tools like Docker to make it easier for bioinformaticians to run jobs. Non-bioinformatics data workflow tools like Luigi, Airflow, and dbt are increasingly used by bioinformaticians to move data around. Data storage tools (also called “buckets”, which I’ll get into in the future) like AWS S3 and the much more affordable Cloudflare R2 are making it easy to dump humongous amounts of raw data into “buckets” for your future self to worry about. And of course, there is a whole slew of tools designed specifically for biotech R&D: ELNs like Benchling, LIMS like Quartzy, and visualization and analytics tools like LatchBio and classic fan favorites Prism and Geneious.

For the lab, the data stack can be broken down into a few key areas: experiment tracking (ELNs), lab operations tracking (LIMS), collaboration and documentation, lab automation, and lab supplies.

With the vast amount of data all of us will be generating, we’ll quickly outgrow our paper and pencil roots. At Phage Australia, we use Notion for all our process documentation and protocol development. We’re also experimenting with Airtable and Retool to prototype a lightweight phage-focused LIMS in order to collect more reliable data.

But let’s not forget: the goal of all these tools is to collect, clean, and shape data. This helps us create cohesive narratives and make sense of the world, which in turn drives our decisions.

Looking ahead: data analytics and phages

So what exactly does data analytics look like for phage therapy?

For Phage Australia, we want to use data analytics to help us lower the time it takes from receiving a request to delivering the therapeutic; ensure ongoing stability and safety in our phage preps; decrease our costs through choosing easier/faster/more reliable processes and automation tools, and lots more. Basically, to make us faster and safer. We’re still exploring all the different ways we’ll use data to guide our decisions, but we’ll be writing about all the phage-related decisions we’ll be making with the data we’re collecting.
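
As one example of the kind of metric this implies, the Python sketch below computes request-to-delivery turnaround from a small request log. The log structure, IDs, and dates are entirely hypothetical and are not Phage Australia data:

```python
from datetime import date
from statistics import median

# Hypothetical request log, invented for illustration only.
requests = [
    {"id": "R1", "received": date(2022, 9, 1), "delivered": date(2022, 9, 29)},
    {"id": "R2", "received": date(2022, 9, 15), "delivered": date(2022, 10, 6)},
    {"id": "R3", "received": date(2022, 10, 3), "delivered": None},  # still open
]


def turnaround_days(log):
    """Days from receiving a request to delivering the therapeutic.

    Requests that have not been delivered yet are skipped.
    """
    return {
        entry["id"]: (entry["delivered"] - entry["received"]).days
        for entry in log
        if entry["delivered"] is not None
    }


days = turnaround_days(requests)
print(days)
print("median turnaround (days):", median(days.values()))
```

Tracking a number like this over time is one way to check whether a process or tooling change actually made things faster.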

Our data narratives are only as good as the data we collect. And our data is as good as the processes and tools we use. Like most labs, we’re transitioning from paper notebooks and spreadsheets to more replicable and auditable data systems.

Right now, we’re mainly focused on the data engineering question: how do we create a robust system that gives us reliable data? Because we’re a small team, we don’t want to spend too much time repeatedly aggregating and cleaning data. To achieve this, we’re combining off-the-shelf products, building our own tools, and using plenty of automation to create a phage data stack that fits our needs.

In the next issues in the Data Series, I’ll go deeper on what I think the data collection pipeline and analytics stack could look like for phage biology.


Capsid & Tail

Follow Capsid & Tail, the periodical that reports the latest news from the phage therapy and research community.

We send Phage Alerts to the community when doctors require phages to treat their patients’ infections. If you need phages, please email us.

Sign up for Phage Alerts

In collaboration with

Mary Ann Liebert PHAGE

Supported by

Leona M. and Harry B. Helmsley Charitable Trust
