Working with tabular data

Issue 215 | March 17, 2023
19 min read
Capsid and Tail

This week, Jan takes another detour into data-land, playing around with spreadsheets and tabular data, exploring VLOOKUP, pivot tables, and other formulas.

Sponsor

Sponsor Capsid!

What’s New

Armata Pharmaceuticals announces positive topline data from the completed Phase 1b2a SWARM-P.a. trial evaluating AP-PA02, a novel, inhaled multi-phage therapeutic for the treatment of chronic pulmonary Pseudomonas aeruginosa infections in cystic fibrosis patients.

Add a Comment

Biotech newsClinical trials

Boston-based global health startup PhagePro has obtained an exclusive license to develop and market their phage cocktail product to prevent community spread of cholera.

Add a Comment

Biotech newsPhage therapy

A new Transplant Infectious Diseases study by Ghady Haidar (University of Pittsburgh School of Medicine) and colleagues reports the use of nebulized phage therapy to treat Burkholderia multivorans in a CF patient. Unfortunately this was a clinical failure, but highlights the limitations, unknowns, and challenges of phage therapy for resistant infections.

Add a Comment

Cystic fibrosisResearch paperPhage therapy

Amit Rimon (Hebrew University of Jerusalem) and colleagues published a new paper in Nature Communications on topical phage therapy in a mouse model of Cutibacterium acnes-induced lesions. They found that phage therapy may be a promising strategy to combat resistant acne vulgaris. This paper was also featured as an Editors’ Highlight.

Add a Comment

AcneResearch paper

Cyanophages play an important role in controlling the abundance of their hosts, which are globally significant populations of Prochlorococcus and Synechococcus. In a new paper published in Viruses, Emily Dart (Clark University) and colleauges carried out quantitative analyses of T4-like cyanophage community structure and dynamics in combination with host community analysis.

Add a Comment

CyanophagesPhage ecologyResearch paper

Latest Jobs

Post DocPhage therapy
The FDA is looking for a postdoctoral fellow to help assess the potential of bacteriophage therapies against Vancomycin-Resistant Enterococcus species.
AMRMolecular biologyJob posting
The research programme in molecular bacteriology at Imperial College London, led by Prof. José R Penadés, is looking for a Research Associate to analyse the impact of phages in the emergence of antibiotic resistant bacteria.

Community Board

Anyone can post a message to the phage community — and it could be anything from collaboration requests, post-doc searches, sequencing help — just ask!

From the Phagebiotics Research Foundation:

Greetings Phage friends!

The 25th biennial International Evergreen Phage Meeting is set to kick off in person, in full Evergreen fashion, with Dr. Betty Kutter at the Evergreen State College. Mark your calendar for AUG 6-11th 2023! Phagebiotics Research Foundation is looking forward for you to participate and we are counting down these final few days until the launch of the 2023 website!

We ask you to hold tight, while we are hard at work finalizing the details and preparing for a great meeting. Stay tuned for further announcements and please follow us on social media to catch the latest announcements and updates.

P.S. We are going to Mt. Rainier on Betty’s Birthday, Fri. AUG 11th 2023!
Please plan on attending this special day to celebrate and hike with Betty.

Hosted by:
Phagebiotics Research Foundation & Phage Directory
Phage Meeting: Aug 6-11 2023
Multiple Workshops: Aug 6 and TBA
Mt. Rainier: AUG 11 2023

Add a Comment

Conference

Our @PhageFiles podcast is finally here! Tune in on Spotify, Google Podcasts, Deezer and Amazon Music to listen to our first episodes. In our podcast we will discuss all things phage. You will hear from researchers and entrepreneurs interested in different aspects of phage biology. But first, we will tell you about the fascinating history of phage discovery.

We would like to thank @Ellie_Jameson for our beautiful phage mini versions. Check out more of Ellie’s artwork here!

Stay tuned!

@ZuzanaPaleckov1 & @quinnit_2_winit

Add a Comment

Phage historyPodcast

The Good Virus is a new book by Tom Ireland that tells the story of phage research, from its beginnings to the present day. Now available for preorder in the UK and the US!

Add a Comment

Phage researchBook

Working with tabular data

Profile Image
Product designer and co-founder of Phage Directory
Co-founderProduct Designer
Iredell Lab, Phage Directory, The Westmead Institute for Medical Research, Sydney, Australia, Phage Australia
Twitter @yawnxyz
Skills

Bioinformatics, Data Science, UX Design, Full-stack Engineering

I am a co-founder of Phage Directory, and have a Master of Human-Computer Interaction degree from Carnegie Mellon University and a computer science and psychology background from UMBC.

For Phage Directory, I take care of the product design, full-stack engineering, and business / operations aspects.

As of Feb 2022, I’ve recently joined Jon Iredell’s group in Sydney, Australia to build informatics systems for Phage Australia. I’m helping get Phage Australia’s phage therapy system up and running here, working to streamline workflows for phage sourcing, biobanking and collection of phage/bacteria/patient matching and monitoring data, and integrating it all with Phage Directory’s phage exchange, phage alerts and phage atlas systems.

In our previous data issue, we defined some types of data, and explored ways to think about data. This time, we’ll get more hands on, with an example of manipulating and analyzing tabular data in Google Sheets. We’ll mess with Pivot Tables and formulas like VLOOKUPs, and discuss what Spreadsheets are great and not so great at doing.

Working with tabular data and spreadsheets

Any data that can fit into rows and columns is tabular data, which includes lists and measurements, survey responses, and more.

Tabular data is the most common kind of data that labs work with, and spreadsheets are the ideal tool collecting, analyzing, understanding, and visualizing tabular data. That’s why brushing up our spreadsheet skills once in a while is so important.

I’ve been revisiting my spreadsheet and data visualization skills by reading the book “Hands-on Data Visualization” (get it here — it’s free! https://handsondataviz.org/). This post is heavily inspired by Chapter 2 of the book, but instead uses public Phage Directory data.

I highly recommend that everyone — Regardless of your spreadsheet experience — at least skim “Chapter 2: Strengthen Your Spreadsheet Skills”. If at any point, this post gets too confusing, read through Chapter 2 and you should be good.

Fig 1

Fig 1: Hands-On Data Visualization is a great book that covers the entire range from data design to storytelling, cleaning, to charting and publishing data on the web. I would highly suggest to every biologist and data scientist. Get it here, it’s free: https://handsondataviz.org

Everyone reading this blog has probably used a spreadsheet (99% of you probably have at least one spreadsheet open as you’re reading this). But when was the last time you used a Pivot Table or VLOOKUP? It’s been a while for me (I even forgot to use those for our State of Phage series and instead wrote a bunch of custom code 🤦).

Starting with an example

For this post, I built the following example as a way to practice my Google Sheets skills. I’m sharing this is a post to give you a gist of what my process of working in a spreadsheet looks like. It’s written to be as helpful as possible, but might fall short of a full tutorial. As always, if you get lost, please read Chapter 2 of the book to get up to speed!

For this data journey, I decided to look at all public organizations that have filled out phage collection information on Phage Directory.

Get the example sheet on Google Sheets:
https://docs.google.com/spreadsheets/d/1ZqLKWpKLV1Jg_dwfZf6-sas4IqqAzHaClgtxtqz80cg
(go to File > “Make a Copy” to play with it yourself)

If you want to skip ahead, or get really lost, here is the “solutions” sheet (but don’t peek before you’ve tried it yourself!):
https://docs.google.com/spreadsheets/d/1PilL_3hQQ3UOxcvRRABfpaiExH6yjEOMf746bqcDzk4
(go to File > “Make a Copy” to enable pivot tables)

In this example, I’ve created an Organizations sheet to show name, description and organization type of each organization that’s publicly listed on our Phage Directory website. I’ve also created a PhageCollections sheet to show the organizations that have filled out their phages.

As always, before working with any data — make a mental note of the caveats. The Phage Directory data is fairly out of date and biased towards self-reporting. Thus, it might not paint an accurate picture of the current, global state of phage collections. Heck, it probably doesn’t even paint a good picture of the labs we’ve indexed, anymore. But this doesn’t mean it’s not interesting!

Combining two sheets of data with VLOOKUP

(Reminder: This example acts as a refresher of Google Sheets formulas, VLOOKUP, and pivot tables. For a full tutorial, read Chapter 2 of this book)

The first thing I want to know is “how many phage hosts has each organization listed on PD?” To start, we need to combine the columns from the Organizations sheet with the PhageCollections sheet. For this, we’ll use the VLOOKUP formula.

Fig 2

Fig 2: Our first goal is to match the organizations (left sheet) to the phage collections (right sheet), so we can look at the phages that belong to each lab.

To start, copy the column names from PhageCollections (People, Organizations, Hosts) into the Organizations sheet. We now use the VLOOKUP formula, which “finds and pastes matching data from another sheet”. Using the formula =VLOOKUP(D2,PhageCollections!A:F,2,false) will grab the data from the People columns of the PhageCollections sheet into the Organizations sheet. To get the Organizations and Hosts columns, copy the formula into the other columns, and increment the index from “2” to “4” to grab the 4th (People) column. Now do the same for Organizations and Hosts columns. Drag the fill handle (bottom-right corner of a cell) to fill the rest of the rows.

Fig 3

Fig 3: Using the VLOOKUP formula to pull the Name, Description, and other columns from the PhageCollections sheet into the Organizations sheet.

You’ll end up with columns that look like Fig 4. Remember, our goal is to figure out the number of phage hosts per organization. Hosts are stored in a comma-separated text in the Hosts column, so we’ll need to count up the number of items in that text. Let’s create a new column called “Number of Hosts”, and use the following formula to count up the hosts: =COUNTA(split(I2,",")). The split function splits up each comma-separated value, and the “COUNTA” function counts those values and outputs a single number. Drag the fill handle (bottom-right corner of a cell) to fill the rest of the rows.

Fig 4

Fig 4: After counting up your hosts, your data should look something like this. Column J counts up all the hosts listed in the Hosts column. Now we can look at some numbers!

Making sense of numbers with Pivot Tables

The first thing we could do is sort the number of hosts in descending order. We can see that Mikroliz Biotech have the most listed, at 25 phage hosts. The de la Fuente lab has 22 hosts, and the Phage Bank of Korea has 21 hosts.

Fig 5

Fig 5: Phage organizations sorted descending by number of phage hosts.

Ok this is great, but what if we wanted a list of all organizations labeled as “Lab” with the most phage hosts? Let’s use a pivot table. In Google Sheets, go to Insert > Pivot Table, and create it as a new sheet. Add “Org Type” by ascending order and “Number of Hosts”, by descending order as Rows to your pivot table. Under Values, add Name, summarized by “COUNTA”. We’re counting the number of Org Types for each “Number of Hosts.” Your results should look something like Fig 6.

Fig 6

Fig 6: Pivot table counting how many Org Types exist for each Number of Hosts. For example, one “Biotech” is listed with 6 phage hosts, one “Lab” with 22 phage hosts; two “Labs” with 16 hosts, another “Lab” with 15, etc…

If we add Name as a row, we can now see the organizations’ names listed for each Number of Hosts category, as shown in Fig 7.

Fig 7

Fig 7: We now add Name as another row, to see the exact breakdown of each Org Type category. You can keep adding columns and rows in pivot tables to reveal more insights about your data.

We could go further. Notice how in Figure 6, the Org Types are “incorrect” because the Org Types are comma separated and not unique. This means that “Biotech” is treated as a different group than “Biotech, CRO.” Instead, we want all organizations labeled “Biotech” to appear in host counts. To create a proper pivot table, we need to create a list of values that matches each Org Type for each Organization. For example, Creative Biolabs is listed as both a Biotech and a CRO. We need to create two rows, where Creative Biolabs is separately listed as a Biotech and a CRO:

Org Type Org Name
Biotech Creative Biolabs
CRO Creative Biolabs

Start by moving the Org Type column to the right of the Number of Hosts column, so that Org Type is now column J. This is necessary because the next step will create a number of columns based on how many items exist in the comma-separated Org Type. Click on the Org Type column to highlight it, then click on Data > “Split text to Columns.” Now each type is now broken down into five columns, from J to N. Now we need to combine this these separated “Org Types” with “Org Names”

Add two more columns to the left of column J, and call these “Split Org Type” and “Split Org Name.” We create two columns to the left of J, because constantly scrolling to the last column can be annoying; in our example we only generate 5 columns, but in other use cases we could generate hundreds!

We can now create a new list where each organization type is associated with an organization name. To do this, we create a formula that creates an array of filters: ={ filter({L2:L, A2:A}, len(L2:L)); filter({L2:L, A2:A}, len(L2:L)) } which pairs up each org type in column L with the Organization Name in column A. This The results are shown in Fig 8.

Fig 8

Fig 8: We can split up the comma-separated Org Type column into separate rows of Split Org Type, Split Org Name pairs. This lets us use the pivot tables again!

In our example, “Split text to Columns” only created five extra columns (L through P). What if you had many dozens? We need to manually type each “filter” row in the formula, which could become tedious! This is where you can “cheat” and ask ChatGPT to complete the formula for you, as in Fig 9. Handily, ChatGPT even explains what it’s doing, and how the formula works. ChatGPT rarely makes mistakes at these types of clearly-defined, repetitive tasks.

(I actually got the original idea to use filter/len as the formula from ChatGPT in the first place. It’s a great tool, even if it makes a lot of mistakes)

Fig 9

Fig 9: Using ChatGPT to “cheat” and expand our formula from earlier to cover columns L through O. Each line that starts with “filter” covers each column of organization types we’d want in the final output.

Here’s the final array filter formula used to generate the new Split Org Type and Split Org Name lists:

={
  filter({L2:L, A2:A}, len(L2:L)); 
  filter({M2:M, A2:A}, len(M2:M));
  filter({N2:N, A2:A}, len(N2:N));
  filter({O2:O, A2:A}, len(O2:O));
  filter({P2:P, A2:A}, len(P2:P))
}

Using the previous instructions for generating Pivot Tables using the new Split Org Type, Split Org Name as rows in the new pivot table. (I’m using the ”I now leave this as an exercise to the reader” as an excuse to not type this part out)

In this new pivot table, we can see that each organization type is properly broken down, as illustrated in Fig 10. We also have new numbers now, as Theralyx is listed as the “leader” with 10 phage host.

Take a peek at the results and the breakdowns in the Solutions sheet, labeled as “Pivot Table 2”. If you get stuck setting up the pivot table, click the “Edit” button bottom left corner to pull up the settings (it can be a bit hard to find).

Fig 10

Fig 10: A pivot table with the organization types broken down properly into groups. Host and org name counts have been hidden for conciseness.

Spreadsheets are great for analysis, but not great as source of truth

Working with tabular data in spreadsheets is a lot of fun! The flexibility of spreadsheets can get us the information we need very quickly. But notice how many columns I’ve created and moved around in the process.

If everyone in the lab worked off of the same “source” spreadsheet, it’d become an absolute mess.

If everyone copied the spreadsheet every time they analyzed data, their own versions wouldn’t get the latest data updates. They’d have to copy the spreadsheet and redo all the steps, every time there’s new data.

What we need is a spreadsheet that acts as a “main source of truth.” It should always have the latest information. The data should be easy to copy. The data should be easy to analyze and understand — but it should not easily become messy.

This “spreadsheet source of truth” should have plenty of VLOOKUPs — so the Organization sheet can see all the host data — but the PhageCollections sheet could also see data in Organizations. And when a PhageCollection gets updated, the Organization data gets updated, and vice versa. And even better: what if when the source of truth is updated, everyone else’s data analysis would be updated too, without having to copy and re-analyze the data?

Spreadsheets are still my go-to place for playing with and understanding data — but I don’t use it as a source of truth. I share a lot of data across teams, and store most data within databases.

Databases lets you automagically connect sheets (called “tables”) together. You can combine columns from one table into another (lookups), perform calculations (like rollups and counts), and more. You can also create “views” different columns, sorts, and filters, then export them to CSV or copy and paste the data into a spreadsheet for analysis (which is what I did for this exercise).

Databases have a lot of pros and cons which make them great partners with spreadsheets. In the next Data issue, we’ll try to recreate our Organizations and Phage Collections example in Airtable, our favorite database tool — and discuss pros and cons of each tool. Stay tuned!

Capsid & Tail

Follow Capsid & Tail, the periodical that reports the latest news from the phage therapy and research community.

We send Phage Alerts to the community when doctors require phages to treat their patient’s infections. If you need phages, please email us.

Sign up for Phage Alerts

In collaboration with

Mary Ann Liebert PHAGE

Supported by

Leona M. and Harry B. Helmsley Charitable Trust

Crossref Member Badge