Working with Databases

Issue 224 | May 19, 2023
21 min read
Capsid and Tail

This week, Jan continues his casual “Data Series” of posts to show how phage labs can start using relational databases today, with Airtable.

Sponsor

Targeting Phage 2023 Conference banner

Targeting Phage Therapy 2023 Will be Hosted this June in Paris with More than 70 Communications and Innovations

More than 150 international participants will present the most recent advances on phage therapy within different sessions and topics. The most strategic question to discuss is: “how to bring phage therapy up to a new level?”.
For more information about the agenda and the attendees please visit the Targeting Phage Therapy website.

What’s New

Jagdev Singh and colleagues published a clinical trial announcement that will involve a single-arm, open-labelled study investigating the safety and tolerability of intrabronchial and nebulized phage treatment in children with cystic fibrosis and Pseudomonas aeruginosa.

Add a Comment

Clinical TrialCystic fibrosisPhage therapy

Denish Piya and colleagues published a preprint on genome-wide CRISPRi knockdown to map essential elements across whole phage-host genome combinations at once. Their results recapitulate the essential genes determined from decades of analysis for phage lambda, and give new insights into essential genes for phage P1. Twitter thread by Vivek Mutalik.

Add a Comment

Phage-host InteractionsHigh-throughput screenPreprint

David Wishart and colleagues have introduced PHASTEST (PHAge Search Tool with Enhanced Sequence Translation) as a faster and more accurate web server for annotating prophage sequences and bacterial genomes. In standardized tests, PHASTEST outperformed PHASTER, processing a typical bacterial genome in 3.2 minutes and offering improved whole genome annotations and enhanced visualization capabilities.

Add a Comment

Bioinformatics toolResearch paper

A recent preprint by Michele Mutti and colleagues reveals that phage activity against Staphylococcus aureus is severely impaired in plasma and synovial fluid. Coagulation factors binding to the bacteria form clots that protect them from phages, posing fundamental challenges for phage therapy in S. aureus infections.

Add a Comment

Phage stabilityPhage-immune interactionsPreprint

Copenhagen-based SNIPR Biome, a CARB-X portfolio company, has developed SNIPR001, a cocktail of CRISPR-armed phages designed to target E. coli. A recent release by Robin Berghaus of CARB-X highlights the potential of SNIPR001 to prevent bloodstream infections in cancer patients without disrupting their microbiomes.

Add a Comment

Biotech newsClinical TrialPhage therapyPhages and cancer

Latest Jobs

Research ScientistCell Culture
ADM (Cork, Ireland) is seeking a Cell Culture and Bacteriophage Research Scientist to work on various research projects revolving around the support of basic microbiological cell culture platform and the isolation, purification, characterization, design, and production scale-up of novel phages.
Gut microbiomePhD projectPreterm infantsPhage therapy
The Deng lab at the Technical University of Munich is seeking a PhD student to investigate novel therapeutics for preterm infant infections. The team will focus on gut colonisation by pioneer bacterial species and its interruption in preterm babies, which leads to diseases like necrotising enterocolitis (NEC), and test different phage therapy strategies in preterm piglets to fight against NEC-related pathogens.
MicroversePhD projectPhage diversity
Bas Dutilh at Friedrich Schiller University in Jena, Germany is seeking a PhD candidate for their Viral Ecology and Omics Group to study phage diversity and host range across biomes. This position is part of the Cluster of Excellence “Balance of the Microverse” and offers interdisciplinary collaboration opportunities.
OmicsViral Ecology
Bas Dutilh at Friedrich Schiller University in Jena, Germany is inviting applications for a postdoc position in viral ecology.
MicrobiomePhD project
Bas Dutilh at Friedrich Schiller University in Jena, Germany is inviting applications for a PhD position in microbiome structure.
TechnicianClinical samples
Balgrist University Hospital in Zurich, Switzerland is looking to hire a laboratory technician with a background in microbiology; experience with clinical samples or bacteriophages is an advantage but not required.
Bioinformatician
The Jagiellonian University is looking for individuals with a PhD in a quantitative field like biology, chemistry, physics, mathematics, computer science, or similar, who have relevant scientific achievements and experience in designing bioinformatic pipelines.
Assistant professorBiological Sciences
The Department of Bacterial Molecular Genetics of Faculty of Biology at the University of Gdansk seeks applicants for Assistant Profesor position in Biological sciences.

Community Board

Anyone can post a message to the phage community — and it could be anything from collaboration requests, post-doc searches, sequencing help — just ask!

The Africa Phage Forum is organizing a webinar on the tail spike protein diversity of Ackermannviridae phages: From Theory to Practice. Dr. Anders Nørgaard Sørensen, a Post-Doc Researcher at the University of Copenhagen, will be the guest speaker.

Add a Comment

WebinarPhage proteins

BSVoM is hosting a webinar as part of the My Hero and Me webinar series, entitled “From phage components to antimicrobials.”

Add a Comment

WebinarPhage protein therapeutics

Oluwasegun Daramola of the Ibadan Bacteriophage Research Team was interviewed an episode of the Medlab LabCast called Phage Therapy: The Enemy of My Enemy Is My Friend.

Add a Comment

Phage therapyPodcast

Tune in to Science Friday for an interview with Graham Hatfull about phage therapy: “Are Phages a new page in medicine?”

Add a Comment

Phage therapyPodcast

The Leicester Centre for Phage Research celebrated its grand opening/official launch this week!

Add a Comment

Phage researchLaunch

Working with Databases

Profile Image
Product designer and co-founder of Phage Directory
Co-founderProduct Designer
Iredell Lab, Phage Directory, The Westmead Institute for Medical Research, Sydney, Australia, Phage Australia
Twitter @yawnxyz
Skills

Bioinformatics, Data Science, UX Design, Full-stack Engineering

I am a co-founder of Phage Directory, and have a Master of Human-Computer Interaction degree from Carnegie Mellon University and a computer science and psychology background from UMBC.

For Phage Directory, I take care of the product design, full-stack engineering, and business / operations aspects.

As of Feb 2022, I’ve recently joined Jon Iredell’s group in Sydney, Australia to build informatics systems for Phage Australia. I’m helping get Phage Australia’s phage therapy system up and running here, working to streamline workflows for phage sourcing, biobanking and collection of phage/bacteria/patient matching and monitoring data, and integrating it all with Phage Directory’s phage exchange, phage alerts and phage atlas systems.

In the previous issue, Working with tabular data, we looked at some capabilities, pros, and cons of tabular data, and ran through a quick mini “not-a-tutorial” on using basic spreadsheet techniques. We discussed pivot tables, VLOOKUP, and saw how painful it can be to connect data from two different sheets.

Here’s a confession: the example I created was slightly contrived to end up in a situation that required us to look up information from two different tables, so I could have an excuse to go into databases. While spreadsheets are great for analyzing data, relational databases are great at collecting and creating relationships between data. Let’s jump into how that works!

What are databases again?

Recall that “records” are “pieces of information that describes a thing or event, like logs, transactions, and “objects” that consist of one or more pieces of information.

Databases are “spreadsheets” that store records. There are a few large exceptions: each column follows a strict schema. For example, the “Age” column could only store numbers, so any attempt at storing text will be rejected. Databases can have multiple tables, and each table stores a specific type of record.

Databases also allow one column to link to another table, called a relationship. For example, a People table could have a “Driver’s license” column that links to a table of driver’s licenses. This relationship is kind of like an “automatic VLOOKUP”; changes to one table automatically update in all related tables. (Recall that this might be a manual process with spreadsheets, when there’s an update). If a person’s name is changed, that change is reflected in the Driver’s license database.

Unlike VLOOKUPs, information is never duplicated between tables. The name of the person is always stored in the People table. The Driver’s license table merely “looks up” the name from the People column.

Starting with an Example

Let’s use the same example as “Working with tabular data”, where we wanted to break down the number of phage hosts by organization type.

We’ll be using Airtable to demonstrate the power of databases. We use Airtable for pretty much everything (from saving Capsid links to collecting Directory signups to organizing Evergreen!). Sign up to Airtable for free here (we get a referral!): https://airtable.com/invite/r/AHQ9XJCx

In the following example, I use Airtable to recreate the Excel sheet example from Working with Tabular Data.

Airtable example

The first thing I did was copy each tab (Organizations, PhageCollections) from the spreadsheet from the Working with tabular data issue (found here) to a new tab (called a table) in Airtable. I then connected the PhageCollections column in the Organizations tab by turning it into a linked field by clicking “Link to PhageCollections”. In the PhageCollections column, I did the same by connecting the “Name” field to the Organizations tab. By linking these tabs together, any updates to one table (e.g. if an organization changes name) will update the other linked table.

Fig 1

Fig 1. Here’s how the PhageCollections column in Organizations*** is used to connect to the PhageCollections table. In this example, the selected column from the Organizations table is linked to the “Collection ID” column of the PhageCollections table.

I then created two more tables, Hosts and Org Type, and connected the Org Type column from the Organizations table, and Hosts column from the PhageCollections table, as you can see below.

Fig 2

Fig 2. This example shows the PhageCollections table. See how the first column, Collection ID, has the same values as the PhageCollections column in the Organizations table in Fig 1. Also note that the column Hosts has been similarly connected to the Hosts table.

Fig 3

Fig 3. Here’s an example of connecting the Org Type column to Org Type table. Notice that by connecting tables in this way, we can count and display values from connected tabs, like Phage Hosts and Phage Host Count. This feature is called a rollup, and is a much simpler way to connect and display data compared to pivot tables. In this example, we’ve grouped each organization by type, and this screenshot shows a collection of Biotech organizations

By connecting the fields of one table to another table, we’re creating a relationship between the fields and the tables. In Fig. 2, we can see that PhageCollections has a field that shows all the Host items. In Fig. 3, we see the same hosts appear in the Phage Hosts column. We can do this because we’ve rolled up the Host field from PhageCollections, and we’re choosing to show them in the Organizations table. You can think of this as a “window” into the PhageCollections table. The best part is that if I add or remove a phage host (or change its name), the change propagates all the way to the Organizations**** table. As opposed to VLOOKUPs, Airtable relationships are “bidirectional,” which means that updating an organization will update the phage collection, and vice versa.

Fig 4a

Fig. 4a. By turning the Hosts column into a linked field, we now treat each host as a separate item in the database. The “Host Count” column is a calculated value that automatically counts the number of Host items connected to each row. This gives us a quick glance to see how many hosts are connected to each row.

Fig 4b

Fig. 4b. We can have special fields like Count, that can count the number of times Hosts occurs in a linked field. In this case, the Host Count table automatically counts how many items are in each Hosts field, but there are many other tools that could be used instead. For example, a Lookup tool could display one or more columns of the Host directly in the PhageCollections tab, without needing to switch to the Hosts table.

In the above screenshots, we’ve created a fairly complex relational database with just a few clicks, and we can get most of the information that took us a while to figure out with formulas and transformations in Excel. Isn’t that neat?

But wait a minute — from Fig. 3, we can see that in the Biotech category, ExoLytics has the most phage hosts… but that is incorrect. This is because of a peculiarity with Airtable’s rather than relational databases as a whole. In Airtable, an organization with only “Biotech” as an organization type is treated as an organization marked as “Biotech, CRMO, CRO, Phage Manufacturer”. This behavior is similar to the pivot table example from Excel, but unfortunately not something we need.

Fig 5a

Fig 5a. This is a “Grouped view” of the Organization table, where we create groups based on organization types, like “Biotech”. Airtable treats the each grouped category separately (“Biotech” is different from “Biotech, Phage Bank”), which means organizations marked as Biotech might fall into different groups. This makes it really difficult to use a grouped view to look at and compare all organizations tagged “Biotech”

Fig 5b

Fig 5b. To overcome this grouping limitation, we can instead filter the view by showing any items with an Org Type of Biotech. In this list, we can see that Mikroliz has the highest phage host count, at 25.

Airtable is a very powerful yet super simple database interface that lets you create tables, insert items, and connect columns and tables with relationships, like a relational database. Here’s the full Airtable to play around with. Because this is a public list, some features like formulas don’t appear here. Please duplicate this into your own Airtable account to play around with the data:

Here is the link to the Airtable. Please duplicate this into your own Airtable account to play around with the data. (Duplicate the database into your own account to be able to modify it): https://airtable.com/shrd5fppSieNAHONz

New to Airtable? Signup for a new Airtable account here (referral)

Recap: We looked at how to import an Excel spreadsheet into Airtable. We looked at how to insert the data into different tables, and then connect the tables via relationships called linked tables. We then looked at using tools like count to automatically tally the number of linked items in a row, and then used rollup to display that value elsewhere. We also played around with grouping the view by various types, discussed an Airtable design quirk, and created different views and used filters to get the real phage host counts.

Below is a 5-minute run-through of how I created the above Airtable base:

Watch the video here if the above doesn’t work: https://www.loom.com/share/9f980a00edc4438c86b6f72d3f1606cf

Why are databases useful?

As you can see, creating this Airtable database and crunching the numbers took very little time. Using linked records with rollups and calculations helps you quickly build a dashboard of data across many different sets of data. The best part is that we can treat each row of data as the source of truth — there is only one place to store a piece of information, and it informs every other place that needs to know about it. If a lab’s name changes, every other place that references that information is updated automatically.

When multiple tables are connected with relationships, you can easily create new “views” by combining columns from different tables, using features like rollups. These views can be really powerful, as each table could have dozens to hundreds of columns, and your view could calculate and select just the right columns needed for that view. This is one of many reasons websites store data in databases. Most tools and websites we use today are actually just user interfaces on top of databases. From LIMS, to Genbank, to Twitter and Airbnb… they’re all just a handful of pretty (sometimes not so pretty or user-friendly) interfaces that sit on top of databases.

The other neat thing about databases is that each column is “typed” and each table has a “schema of these types”. What this means is that you can set a column to be an integer, another column to be text, and another to be a predefined set of multiple choice options. If you try to set the integer column with text, well, it wouldn’t let you (typing letters wouldn’t do anything). A “schema” is a list that defines the type of each column in a table. This is really useful to have, because a quick glance at a schema can tell us what an item in the database could “look like”. This also prevents incorrect entries and typos!

How could databases be used for phage work?

You can easily use Airtable to track items in your lab and in your biobank (just create a Phage and a Host table and you’re off to a good start!). Using relationships, you can name columns after the relationship itself. A “Infects” column in a Phage table could link to a Hosts table, and every entry shows what the phage infects. You could have another column in the Phage table called “Parent” or “Child”, which would essentially create an ancestor/sibling relationship (this is called a “directed graph”).

There are lots of things you can try with a database! Airtable is the easiest to get started, but there are many out there, some with great user interfaces, some great for software developers (we won’t touch on those here).

Have a try at creating an Airtable for your lab — try tracking your phages, bacteria and inventory!

Some drawbacks of databases

While databases are powerful, they do come with a lot of drawbacks.

The power of schemas also mean they’re very annoying to update and edit: filling out forms is probably the worst part of using databases (soon you can use GPT for this!). It’s also a big hurdle to update many rows of data at the same time, and you end up having to do it repetitively, or writing scripts and code to do it.

Databases aren’t good tracking numbers like measurements and calculations. Databases also aren’t good for doing rapid data analysis. You can’t just copy and paste a bunch of numbers, run a few formulas to crunch some math, and throw it in a graph. This means you still need Excel around for that kind of work! (There are a few new types of database-like tools coming out that can address data analytics, like DuckDB or Datasette, but we’ll definitely not cover those advanced topics in this post). I always think that if you deal with measurement or time-based data (like growth curves or temperature over time, or stock price movements), use Excel. If you deal with lists of distinct items, use a database (Side note: there are time-based databases, but we’ll talk about those in a future issue) Generally, databases are terrible at telling narratives and stories — you’d need a separate tool like Jupyter Notebooks and Observable to do this job.

In addition to measurements, databases have a hard time storing things like images, videos, and documents like Excel or Word or PDF documents. Some databases like Airtable are able to attach files like PDFs, but aren’t able to “see” the content of the attached documents. For example, you can’t search Airtable for items with attachments that only mention lytic phages.

Tracking values over time

Finally, the largest drawback to both databases and spreadsheets, is that they’re both “change in place”. This means that every time you make an edit, you overwrite the previous value, and there’s no way to see what the last value was. For example, if a table of Library Books had “# of books” as column, the number would go up or down as books were borrowed and returned — but you’d never know how many books were borrowed that day. This is because by default, we don’t track the events of the books being borrowed or returned: only the current count of books.

In the next issue, we’ll go over how to set up a database to properly track events, as an event-driven database. Instead of tracking “# of books”, this database tracks “book events” that include “borrow” and “return” events. The aggregate of those events shows how many books have been borrowed and returned, and ultimately how many books are on the shelves. Such a setup has lots of benefits for the work of a phage lab. For example, we could collect phage/host data over time, and then aggregate the growth curve data to describe the host range (as opposed to subjectively declaring a host range). This is like instead of saying “Sydney is a hot weather city” and instead measuring the temperature every day, and letting the observer decide whether an average temperature of 22.6 °C is hot or not. Anyway, that’s a topic for next time — and a topic I’m really excited for… the entire reason for setting up the Data series is so I could start exploring event-driven data models with the few data nerds that are still reading. Can’t wait!

Capsid & Tail

Follow Capsid & Tail, the periodical that reports the latest news from the phage therapy and research community.

We send Phage Alerts to the community when doctors require phages to treat their patient’s infections. If you need phages, please email us.

Sign up for Phage Alerts

In collaboration with

Mary Ann Liebert PHAGE

Supported by

Leona M. and Harry B. Helmsley Charitable Trust

Crossref Member Badge