First of all, what is “data analytics”? Most companies use data analytics (sometimes called product analytics) to understand product usage and user habits. Companies usually have questions like: How popular are our blog posts (“How to design a phage data system,” the last post in this series, has a 45.9% open rate, meaning 600 people opened it, which is really good!)? What part of our site is confusing? What areas of the site are popular? This kind of information helps companies make decisions on what to improve, what to build, and what future posts to write.
Data-driven decisions
Many internet companies that sell a product or service (i.e. not us 🙃) intentionally design their sites for a certain behavior. Users visit their site, poke around, and if users like whatever they’re selling, they “convert” by signing up or adding to cart and checking out. This is a “product funnel”, and the “conversion rate” is the % of people who visit the website and successfully convert.
Fig. 1. A product funnel, with conversion rates. Credit: technically.dev https://technically.substack.com/p/how-do-product-analytics-work
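To make “conversion rate” concrete, here’s a minimal sketch in Python (the visitor counts are made up) that computes the drop-off at each step of a funnel like the one above:

```
# Hypothetical funnel: how many visitors remain at each step.
funnel = [
    ("visited site", 10_000),
    ("viewed product page", 4_000),
    ("added to cart", 1_200),
    ("checked out", 300),
]

# Step-by-step conversion: how many people made it from the previous step.
for (step, count), (_, prev) in zip(funnel[1:], funnel[:-1]):
    print(f"{step}: {count / prev:.1%} of previous step")

# Overall conversion rate: everyone who checked out vs. everyone who visited.
print(f"overall: {funnel[-1][1] / funnel[0][1]:.1%}")
```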
Analytics is usually at the heart of many companies, as it tracks how sales are doing and helps project either future success (e.g. “growth”) or, in this economy, how much time is left until the clock runs out (e.g. revenue decline vs. how much cash is left in the bank). These numbers generally paint a picture of a company’s health, and executives often make decisions based on them. For example, Netflix has been losing revenue and subscribers, and consequently they’ve cut their animation team, probably because of weak viewership and expensive production costs (despite making some of the best animated series in TV history).
What does data analytics look like? The data takes many shapes, from “raw data” to more “refined data”. Raw data for internet companies usually looks like “events”: e.g. “a user visited the About page at 3:05pm on October 11, 2022”. More refined data, usually generated by a tool or with some code (e.g. a SQL script that pulls data from the database), could look like a list of users who visited product pages that day, week, or month. Or a list of the top 10 pages, broken down by day of the year. If the site sold costumes, it might show an uptick in Halloween-related visits and activity in the weeks or month before Halloween.
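As a toy example (the events and counts are invented), refining raw events into a “top pages” list can be as simple as counting:

```
from collections import Counter

# Raw data: one event per page view, as it might arrive from a tracker.
events = [
    {"user": "u1", "page": "/about",  "ts": "2022-10-11T15:05:00"},
    {"user": "u2", "page": "/about",  "ts": "2022-10-11T15:07:00"},
    {"user": "u1", "page": "/prices", "ts": "2022-10-11T16:12:00"},
]

# Refined data: page views rolled up into a ranked list.
top_pages = Counter(e["page"] for e in events).most_common(10)
print(top_pages)  # [('/about', 2), ('/prices', 1)]
```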
Raw data will at some point end up as a series of charts and graphs in a presentation or Excel sheet somewhere, in front of a room of executives or board members. Charts, graphs, and presentations are a kind of data, but what I would consider higher-order data — data that communicates information, but can’t be manipulated with math.
In the “real world”, depending on the company and product, data usually comes in the form of events, database data (like user profiles and interactions), user-generated data like tweets and comments, images, videos, spreadsheets, device readings, and a ton more formats.
Who participates in data analytics within an internet company?
- Executives make long-term decisions based on analytics: “where should we invest $ based on who buys (or uses) X?”
- Product managers make product decisions: “Who does and doesn’t use X? Where do we spend our efforts?”
- Marketing and sales are measured against analytics: “Are sales increasing? Where are conversions coming from, e.g. Instagram ads, Facebook, or Google? What growth strategies are working? How much are we spending on ads per new user (also called ‘Customer Acquisition Cost’)?” (A quick sketch of that last metric is below.)
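That last metric is just arithmetic. A back-of-the-envelope sketch, with invented spend and signup numbers:

```
# Customer Acquisition Cost = ad spend / new customers acquired, per channel.
spend = {"instagram": 5_000.0, "facebook": 3_000.0, "google": 8_000.0}
new_users = {"instagram": 250, "facebook": 100, "google": 500}

for channel in spend:
    print(f"{channel}: ${spend[channel] / new_users[channel]:.2f} per new user")
```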
Data teams and the tools behind them
Even though data is at the core of many companies, data roles have often been ill-defined. Some companies have data-focused teams like data science, data engineering, and data analytics, while others have business intelligence, product analyst, and marketing analyst teams.
Ultimately there are two sides to a data team: (1) the ability to collect, build, and manage reliable data sets (Apple stores 8 exabytes, or 8 million TB, of data) and (2) the ability to translate all that data into stories for product teams and leadership, by understanding their needs. These are two different jobs. The first needs engineering and statistics; the second needs an understanding of core products, business strategy, and leadership needs (but also statistics… everyone needs statistics).
Fig. 2. This is like 1% of the data tools out there a company might use. We don’t really use any of them. Credit: technically.dev https://technically.dev/posts/what-your-data-team-is-using
Collecting data to tell cohesive stories that drive strategy is really hard. That’s why there are so many data tools available. Some of them are small (like Plausible, which we use for site analytics), and others are massive, like Segment (founded in 2012; acquired for $3.2B) and Snowflake (founded in 2012; trading as $SNOW with a $50B market cap). These tools all specialize in different parts of the data analytics pipeline, and many are used in conjunction. Together they make up the “data analytics stack”. Plausible collects site analytics; Mailchimp, which we use to send our newsletter, collects email open rates; Segment pipes those events into data storage; dbt takes data from different sources and makes it interchangeable and comparable; Databricks, Snowflake, BigQuery, SQL, Amplitude, Hex, ObservableHQ, etc. all help make sense of the data in different ways — Python, R, and Excel are also used at this stage!
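To give a taste of the “collect” end of the stack: Segment’s Python library exposes a track() call that forwards an event into the rest of the pipeline. A minimal sketch, assuming you have a Segment write key (the key, user ID, and event name below are placeholders):

```
import analytics  # Segment's analytics-python library

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # placeholder

# One raw event, on its way to storage and the warehouse via Segment.
analytics.track(
    user_id="u1",
    event="Newsletter Opened",
    properties={"post": "How to design a phage data system"},
)
analytics.flush()  # send queued events before the script exits
```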
There’s also a small but notable difference between data analytics and data science, which can involve exploring data and looking for causality, patterns, and clues in pursuit of new ideas (and let’s not forget ML engineering…). This is how I understand the difference: data analytics uses data to drive organizational and strategic decisions, while data science uses data to gain more understanding of the world.
If you want to go deeper on the technical stack, read this excellent deep dive on technically.dev!
Data analytics, biology R&D, and tools galore
Despite obvious differences, biology R&D has a few parallels with data analytics. Fundamentally, both fields deal with increasing amounts of data, a growing need to make sense of it, and an eagerness to turn data into decisions. And like data analytics, more tools are becoming available for biology R&D.
Of course, biology data looks different, from LIMS and ELN data, to images of plates, to a mix of outputs from devices like sequencers, TEMs, optical density readers, and OpenTrons/liquid handlers — the list goes on. But fundamentally the shape of the data is the same: you have text files of various formats, spreadsheets and tables, images, and sometimes video and code.
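The wrangling looks familiar too. A minimal sketch, assuming a plate reader exported optical density readings as a CSV (the file name and column layout below are invented):

```
import pandas as pd

# Hypothetical plate-reader export: one OD600 reading per well per timepoint.
df = pd.read_csv("od600_export.csv")  # columns: well, hours, od600

# Refine raw readings into a per-well growth summary.
summary = df.groupby("well")["od600"].agg(["min", "max", "mean"])
print(summary.sort_values("max", ascending=False).head())
```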
And like data analytics, the goal is to turn the data into insights and strategic decisions, like drug candidates in drug discovery. A whole slew of tools for analyzing and predicting data with machine learning is coming online, like DeepMind’s AlphaFold (and its excellent AlphaFold Protein Structure Database).
Fig. 3. A variety of software for the life sciences. This is just a small representation of all the tools out there. Credit: https://vitalsignshealth.substack.com/p/the-changing-world-of-life-sciences
The wave of bio R&D tools isn’t limited to drug discovery, though. Many data analytics and engineering tools are increasingly being adopted by biology. Pipeline tools like Nextflow (and MetaPhage) use containerization tools like Docker to make it easier for bioinformaticians to run jobs. Non-bioinformatics data workflow tools like Luigi, Airflow, and dbt are increasingly used by bioinformaticians to move data around. Data storage services like AWS S3 and the much more affordable Cloudflare R2 (also called “buckets”, which I’ll get into in the future) make it easy to dump humongous amounts of raw data into a bucket for your future self to worry about. And of course, there is a whole slew of tools designed specifically for biotech R&D: ELNs like Benchling, LIMS like Quartzy, and visualization and analytics tools like LatchBio and classic fan favorites Prism and Geneious.
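Dumping a raw sequencing run into a bucket really is just a few lines with boto3 (the file, bucket, and key below are hypothetical); and because R2 speaks the S3 API, the same client works if you point it at an R2 endpoint:

```
import boto3

# For Cloudflare R2, pass endpoint_url="https://<account_id>.r2.cloudflarestorage.com"
s3 = boto3.client("s3")

# Dump a raw sequencing run into the bucket for future-us to worry about.
s3.upload_file(
    "run42_reads.fastq.gz",             # local file (hypothetical)
    "phage-raw-data",                   # bucket name (hypothetical)
    "sequencing/run42/reads.fastq.gz",  # object key
)
```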
For the lab, the data stack can be broken down into a few key areas: experiment tracking (ELNs), lab operations tracking (LIMS), collaboration and documentation, lab automation, and lab supplies.
With the vast amount of data all of us will be generating, we’ll quickly outgrow our paper and pencil roots. At Phage Australia, we use Notion for all our process documentation and protocol development. We’re also experimenting with Airtable and Retool to prototype a lightweight phage-focused LIMS in order to collect more reliable data.
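As an illustration of what pulling records out of an Airtable-backed LIMS could look like (the base ID, table, and field names below are all hypothetical), Airtable’s REST API returns plain JSON records:

```
import requests

API_KEY = "YOUR_AIRTABLE_API_KEY"  # placeholder
BASE_ID = "appXXXXXXXXXXXXXX"      # hypothetical base
URL = f"https://api.airtable.com/v0/{BASE_ID}/Phages"

resp = requests.get(URL, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()

# Each record's fields are whatever columns the table defines.
for record in resp.json()["records"]:
    print(record["fields"].get("Name"), record["fields"].get("Host"))
```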
But let’s not forget: the goal of all these tools is to collect, clean, and shape data. In turn, this helps us create cohesive narratives and make sense of the world, which then drives our decisions.
Looking ahead: data analytics and phages
So what exactly does data analytics look like for phage therapy?
For Phage Australia, we want to use data analytics to help us lower the time it takes from receiving a request to delivering the therapeutic; ensure ongoing stability and safety of our phage preps; decrease our costs by choosing easier/faster/more reliable processes and automation tools; and lots more. Basically, to make us faster and safer. We’re still exploring all the different ways we’ll use data to guide our decisions, but we’ll be writing about the phage-related decisions we make with the data we’re collecting.
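That first metric, for example, falls out of two timestamps per case. A minimal sketch, assuming we log request and delivery dates somewhere queryable (the data below is invented):

```
import pandas as pd

# Hypothetical case log: when a request came in, when the prep was delivered.
cases = pd.DataFrame({
    "case": ["P001", "P002", "P003"],
    "requested": pd.to_datetime(["2022-06-01", "2022-07-15", "2022-08-02"]),
    "delivered": pd.to_datetime(["2022-07-20", "2022-08-30", "2022-09-10"]),
})

cases["turnaround_days"] = (cases["delivered"] - cases["requested"]).dt.days
print(cases["turnaround_days"].describe())  # are we getting faster over time?
```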
Our data narratives are only as good as the data we collect, and our data is only as good as the processes and tools we use. Like most labs, we’re transitioning from paper notebooks and spreadsheets to more replicable and auditable data systems.
Right now, we’re mainly focused on the data engineering question: how do we create a robust system that gives us reliable data? Because we’re a small team, we don’t want to spend too much time repeatedly aggregating and cleaning data. To get there, we’re combining off-the-shelf products, building our own tools, and using plenty of automation to create a phage data stack that fits our needs.
In the next issues of the Data Series, I’ll go deeper on what I think the data collection pipeline and analytics stack could look like for phage biology.
Readings