“Data” is one of those words that everyone uses regularly. But ask two people “what is data?” and you’ll get vastly different answers! That’s because “data” is a broad, catch-all term for many kinds of information.
Previously, I’ve written about data from a higher level, from organizing your biobank data to designing a phage data system. I’ve also introduced the field of data analytics, and hinted at writing about data analytics in biology. Before we do that, let’s break down and define all the different types of data, and explore what they’re used for.
You might have heard of “Big Data” and “data-driven organizations.” The trap many of these orgs fall into is that “the cost of keeping data around is less than the cost of figuring out what to throw away,” so they record as much data as they can. Unfortunately, most of this data is noisy and uninformative, rarely creates new insights, and often only exists to confirm what leadership already suspected (all too common in both academia and industry). Even Elon is guilty.
The different kinds of data
We should be very intentional about what data we collect and generate, and the kinds of insights we want that data to produce. Broadly, I think of data as a tool to both accumulate and communicate organizational knowledge. I tend to categorize data into the following buckets:
Measurements are quantitative pieces of information (inches of rainfall, number of plaques in a plaque assay, PFU/mL) usually recorded by hand or by a machine. These can be either discrete or continuous, and can usually be graphed into trends, compared side by side, and manipulated with math. 3/10/23 note: The general category is “Tabular Data” and also includes any kind of information that fits into tables of rows and columns: this includes lists of names, events, measurements, etc. These are usually represented with spreadsheets and CSVs.
Artifacts are bundles of information that generally need to be extracted and processed before we can make sense of them. I think of these as photos, images, and files (photos of plaque assays, screenshots of a table or graph, PDFs, FASTQ files, Excel and Word documents). Some artifacts are easy to extract data from (CSV files), and others are much harder (# of plaques in a plaque assay photo). Some are proprietary files (GraphPad Prism files) that require the right app or license; others are URLs. Generally, most artifacts are hard for computers to read and parse.
Presentational data are technically artifacts, and are meant to convey and communicate insights, findings, and observations from measurements and artifacts. These include tables and human-readable spreadsheets, pictures of tables, slideshows, graphs and infographics, preprints and manuscripts, and sometimes websites (Airbnb shows a reduced, user-friendly view of its data). Presentational data is meant for people to understand, and is focused on communication rather than completeness.
Records are pieces of information that describe a thing or event, like logs (the package was scanned and shipped at 2pm; 5mL of Pae7 was used; 50 users visited our homepage on July 1), transactions (Heather paid $25 to Mary on July 1), and “objects” (driver’s licenses, license plates, DOIs, SKUs, and the information they describe, commonly also called “documents”).
Records that have relationships are called “relational data” (Heather has Driver’s license number 123456 and license plate ABC123; Mary used 5mL of Pae7). Directional relationships can exist too (a transaction showed that Heather gave $25 to Mary; Heather and Mary love each other; Mary is annoyed at Heather). Records can also describe measurements, artifacts, and metadata.
Metadata are pieces of information that describe how things are described. Technically these are a type of record, but I like to single them out. Metadata usually tracks “meta-information,” which means “information that describes information” (How many versions of a manuscript exist within a lab, and where are they? How many of those are named _final_final_final.doc, and how many versions of those exist? What are the differences and similarities between all those files, and when were they all edited?). Not everyone does this, but I also count processes (How do we publish Capsid & Tail every Friday? How is the team doing plaque assays? Which lab members don’t use sterile tubes for plaque assays?) and settings and configurations (what type of media was used, what settings were used for a sequencing run) as metadata.
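To make these buckets a little more concrete, here’s a rough sketch in Python — all names and values below are made up — of what a few measurements, a record, a relationship, and a piece of metadata might look like side by side:

```python
# Hypothetical examples of the data buckets above (every value is invented).

# Measurements / tabular data: rows and columns, easy to graph and do math on.
plaque_counts = [
    {"sample": "Pae7", "dilution": 1e-6, "plaques": 42},
    {"sample": "Pae7", "dilution": 1e-7, "plaques": 5},
]

# A record: information that describes a thing or event.
usage_log = {"who": "Mary", "phage": "Pae7", "volume_mL": 5, "date": "2023-07-01"}

# Relational data: records that point at other records.
people = {"Heather": {"licence_plate": "ABC123"}, "Mary": {}}
transaction = {"from": "Heather", "to": "Mary", "amount_usd": 25}

# Metadata: information about the information itself.
file_metadata = {
    "filename": "manuscript_final_final.doc",
    "version": 3,
    "last_edited": "2023-03-10",
    "edited_by": "Heather",
}
```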
Raw or cooked?
We use “raw” to describe data that’s close to the source that generated it. For example, a temperature sensor that records a reading every second generates a lot of temperature data; place these sensors all over a city and you’ll have an enormous amount of it. In practice, “rawness” ends up describing how easy the data is to work with. Here’s how I think about it:
Raw data is usually information collected straight from a sensor or machine, like temperature per second, or BAM/SAM/FASTQ files from a sequencer.
Processed data can be filtered or reduced data (e.g. daily temperature), or, in bioinformatics, the output of each step of a pipeline, from assembly to BLASTing. Processing makes data easier to use and easier for humans to understand. It can always be processed further (turn second-by-second temperatures into daily, weekly, or monthly temperatures), or combined with other data (weekly temperature and humidity). Once the data is at the “resolution” you need, you don’t have to go back to the raw data: if all you need is weekly temperature, you only re-process when you need second, minute, or daily figures again. As data is processed further, it loses “resolution.” You can’t un-bake a cake. Ultimately, processed data helps us see patterns, get insights, and communicate findings far more easily than raw data does.
Presentational data, like slideshows, graphs, or tables (or screenshots of tables), is the most processed kind of data. It’s usually reduced from dozens or thousands of data points to just a few, meant to communicate insights and findings. Because presentational data has lost so much resolution, we shouldn’t make a habit of re-processing it, combining it with other data, or relying on it as primary evidence.
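If you like code, here’s a tiny sketch of “cooking” data with pandas, assuming a made-up week of per-second temperature readings: each resampling step trades resolution for readability, and you can keep going down but never back up.

```python
# A hypothetical week of per-second temperature readings (all values invented).
import numpy as np
import pandas as pd

seconds = pd.date_range("2023-07-01", periods=7 * 86_400, freq="s")
raw = pd.Series(20 + 0.5 * np.random.randn(len(seconds)), index=seconds)

# Processing: reduce per-second readings to daily, then weekly, averages.
daily = raw.resample("D").mean()
weekly = daily.resample("W").mean()

# You can always process further (daily -> weekly), but you can't recover the
# per-second readings from the daily averages — you can't un-bake the cake.
print(daily.head())
print(weekly.head())
```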
A potluck data party
Data is at the center of any research lab or clinical trial work. We collect and interpret it to gain insights, make decisions, and build evidence for our work — but it’s almost always consumed individually, and rarely shared among the lab. What if data was shared more often? What would that look like? Let’s use another food analogy!
In a potluck dinner party, everyone brings a dish to be shared. The food is placed easily within reach, and guests can choose to eat whatever they fancy. Everyone wins!
Sharing data is kind of like a potluck. Everyone brings a single dish — no need to share everything in the fridge. All the dishes are easily within reach — you don’t have to ask someone to pull their dish out of a bag every time. If a dish isn’t on the table, it’s assumed it isn’t ready to be shared (or doesn’t exist in the first place). In data science terms, the table is the “single source of truth.” The best part is: we only share what we’re proud of, and only when we’re ready!
Great data parties require great data housekeeping
If you run a lab or do any kind of data science or bioinformatics across team members, you’re now the (un)official host of the data party (lucky you)!
A kitchen full of hungry guests gets messy quickly! Setting up lines, placing plates and cutlery, and separating the food and drinks stations will help you direct traffic. As the host, cleaning up spills and picking up empty cups after guests is also crucial to a great experience. As a party host, your job is to plan and curate all the elements for guests to have a great time.
Hosting dinner parties is hard. Hosting data parties is harder! As the data party host of Phage Australia, here are a few pointers I’m trying to follow, in no particular order:
- Create the communal space: This is the table where all the data (or food) goes, and it should be the single source of truth for the lab. Every piece of data in this space is “real,” “official,” “verified,” and ready to be shared, re-used, combined, and audited. Any published data and publication data should be committed to the communal space. The space itself can be anything — a Notion or Airtable workspace, an MS Access or FileMaker project, a Trello board, Google Drive or Dropbox, a network drive or a single lab computer, or another document management system. There are a ton of options — just make sure to pick one and stick to it for at least a few months before making a switch.
- The space should be easily within reach: Regardless of what tool you use, make sure everyone knows where the space is, and that everyone has access — either physical access or account/password access. Help out anyone who’s lost the link or password or is locked out — if this happens often, write up a set of instructions and email it out or stick it on a wall somewhere in the lab.
- Help your guests if they need anything: Like a responsible party host, if guests need anything, you should help them. If people repeatedly need the same help, create a process document or poster and email/print it out, or make the process easier (e.g. if people keep forgetting the link, email it to everyone; if people keep forgetting their passwords, create a Bitwarden account for secure passwords for the lab). Helping guests out helps you keep things tidy in the long run.
- Build clear processes: Make sure everyone knows how to create, access/read, and update the data. Make these processes easy, and communicate them clearly. If you expect people to provide data, make sure it’s clear how they do that. Do they add their spreadsheets to a folder? How should they name their files? What data should they upload, and in what format? This is tricky to get right, and something we’re still struggling with at Phage Australia. We’re building processes and SOPs that include better data handling practices, including templates, forms, and QR code systems.
- Create convenience: use templates and forms! Is someone tracking a patient case, submitting a catalogue of new phages and bacteria, or putting in a sequencing request? Use forms for simple data, and Excel / Google Sheets templates with properly labeled columns for complex, tabular data. If someone can’t access a computer, as a last resort use paper forms (and later, scan the image and extract the text with OCR). Use checklists!
- Use schedules, reminders, and checklists: On many occasions, the data will fall behind reality, especially for busy teams. Just like a cleaning schedule, check regularly whether the data has fallen behind and bring it up to date. Create checklists to track what data needs updating and how often, and use them to remind people what to record and when — and make sure you know who’s responsible for what!
- Assign ownership: Hand out responsibility to different lab members. Complex projects rely on contributors with different skill sets — make sure the experts share their knowledge and data back with the others. The bare minimum data you should collect is who contributed what, so they can be chased up to fill in the gaps when data is missing or expired.
- Lower expectations through the floor, but strictly enforce them: Make data super simple to submit; do a lot of the heavy lifting, even if it means you have to clean up after others’ data spills. However, make sure they report the minimum data required of the lab or clinical trial. While people need freedom to operate, they still need to meet the bare minimum — don’t let anyone bring a half-baked cake to the party!
- …but also, don’t ask for too much: Most of the data we think we need is actually just “nice to have.” Requiring too much data becomes a burden, and people end up sharing less of it. It’s a tough balance between asking for too little and asking for too much, and it’s a dance that never stops. (Side note: people also don’t respond well when you say “well, it’s your job…”, so don’t try it!)
- Data has expiry dates too! Data can expire in the same way milk expires… if it’s past the expiry date and not sour yet, maybe keep using it? Data needs to pass the smell test too, and someone (usually the host) needs to smell it. Some data lasts longer than other data: expired data means the data no longer reflects reality. If your Airbnb’s canceled, your app had better reflect that news! The same goes for tracking phage therapy patients in the clinic, queued-up sequencing jobs, or isolate shipping requests. Often, you’ll have to chase up the right people to confirm the data still reflects reality, and if not, they’ll have to update it. Set schedules to remind you to smell the data once in a while (there’s a small sketch of what this could look like after this list)!
- Be iterative and experimental: Start early, start small. Try collecting one small piece of data first and grow it from there, and adjust as you go. You’ll be surprised how much you’ll learn from the team if you sit back, observe, and adjust the processes.
- Get everyone on the same page! Just like setting house rules for a party, remind everyone that a clean, single source of truth for data makes the party more fun. (It’s also a requirement for clinical trials.) Yes, it involves more work for everyone. After all, no one likes cleaning the kitchen, but everyone enjoys the outcome!
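On the “data has expiry dates” point above, here’s a small, hypothetical sketch of what scheduled data-smelling could look like: a script that flags any record whose last update is older than an agreed freshness window. The record types, owners, and windows below are all invented.

```python
# Hypothetical freshness check: flag records that haven't been updated recently.
from datetime import date, timedelta

# Made-up freshness windows: how old a record can get before someone smells it.
MAX_AGE = {"patient_case": timedelta(days=7), "sequencing_job": timedelta(days=3)}

records = [
    {"id": "PT-012", "kind": "patient_case", "owner": "Mary", "updated": date(2023, 7, 1)},
    {"id": "SEQ-88", "kind": "sequencing_job", "owner": "Heather", "updated": date(2023, 6, 20)},
]

today = date(2023, 7, 10)
for rec in records:
    if today - rec["updated"] > MAX_AGE[rec["kind"]]:
        # In real life this might be an email or Slack ping to the owner.
        print(f"{rec['id']} looks stale — ask {rec['owner']} to confirm it still reflects reality")
```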
If it’s not in the source of truth, it doesn’t exist!
Keeping the source of truth up to date is time-consuming, and it will always lag behind the real world and the data collected by individuals. It’s OK for people to work “in their own kitchens,” off their local spreadsheets, images, and data — this keeps them productive. Just make sure to periodically remind everyone to share a dish back to the potluck! Because we can’t just hand people’s computers over to auditors, if it’s not in the official source of truth, it doesn’t officially exist!
A plethora of tools and bases for a plethora of data
Recording and tracking data is already hard for individuals. We are limited by our tools: we use Word for written text and Excel for numerical data, but Excel doesn’t handle relational data well (if Heather gets a new license plate, it’s tricky to update every place her plate appears in the spreadsheet). Most of our tools don’t have “multiplayer” features like track changes and version control. It gets even harder for a team to record and track data together — it’s a massive communication challenge.
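To see why relational tools handle this better, here’s a minimal sketch using Python’s built-in sqlite3 (the tables and values are made up): Heather’s plate lives in exactly one row, so a single UPDATE fixes it everywhere it’s referenced.

```python
# Minimal sketch: in a relational store, Heather's plate lives in one row.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, licence_plate TEXT);
    CREATE TABLE usage  (person_id INTEGER, phage TEXT, volume_mL REAL);
    INSERT INTO people VALUES (1, 'Heather', 'ABC123'), (2, 'Mary', NULL);
    INSERT INTO usage  VALUES (2, 'Pae7', 5.0);
""")

# Heather gets a new plate: one UPDATE, instead of hunting through a spreadsheet.
con.execute("UPDATE people SET licence_plate = 'XYZ789' WHERE name = 'Heather'")

# Every query that joins against people now sees the new plate automatically.
for row in con.execute("SELECT name, licence_plate FROM people"):
    print(row)
```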
Unlike a dinner party, there isn’t a table you can just place all the food on and call it a night. There isn’t a single database that can collect and store our collective data: Excel and Sheets store numerical data; Airtable, Notion, and SQLite/Postgres store relational data and records; REDCap stores patient and trial data; loggers and key-value databases store event logs. Every researcher’s laptop (and phone) stores all sorts of images, PDFs, posters, preprints, and presentations. If you want anything shared, you can set up a Dropbox or Google Drive — but these don’t store record data and aren’t API-friendly! You could use an S3-compatible object store, like AWS S3, Cloudflare R2, or your own MinIO instance, but they’re not user-friendly…
With all these ways and places to store everything, how on earth do we keep track of which place stores what, and who’s responsible for it? Do we know for sure that the one file is the actual, final-final version? And how does anyone in the lab find anything? What happens if someone leaves, or a new person joins the team? Is there a handbook for all the rules? Where can you find it?
It would be nice if we had a tool that would let us work on a meta scale:
- Let us create easily accessible, versioned documentation, like tutorials, how-to guides, explanations, and references (following the Diátaxis framework).
- Compare and merge versions of files (when authors insist on emailing separate Word files instead of using track changes in the existing Google Doc).
- Search for that exact piece of information across all our shared Excel sheets, CSVs, and Google Sheets (there’s a rough sketch of what this could look like after this list).
- Allow us to create relationships between a hodgepodge of PDFs, Word Docs, graphs, database records, and images, and search, filter, and extract data from them.
- Upload successive versions of files to replace older ones for when we re-run our bioinformatics pipelines against newer databases.
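As a taste of the “search across all our sheets” wish above, here’s a rough sketch — not a real tool — that scans every CSV and Excel file under a shared folder for a search term. The folder name and search term are placeholders, and reading .xlsx files assumes openpyxl is installed.

```python
# Hypothetical sketch: search every shared CSV/Excel sheet for one term.
from pathlib import Path
import pandas as pd

SHARED = Path("shared_data")  # placeholder for the lab's communal folder
TERM = "Pae7"                 # placeholder search term

for path in list(SHARED.rglob("*.csv")) + list(SHARED.rglob("*.xlsx")):
    # Read every cell as text so we can search numbers and strings alike.
    df = pd.read_csv(path, dtype=str) if path.suffix == ".csv" else pd.read_excel(path, dtype=str)
    found = df.apply(lambda col: col.str.contains(TERM, na=False)).any().any()
    if found:
        print(f"Found '{TERM}' in {path}")
```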
In the next couple of issues in the Data series, I’ll explore some of the underlying ways to think about and structure data for us to be able to create such tools for ourselves. Strap in, as it’ll get more technical!