Perspective: The motivation for proper RDM in image analysis (by Tom Boissonnet)

Data management is made for bioimage analysts

How often have you wondered how to structure your folders, whether to write metadata in a single txt or json file or in one file per image, or how to name your images consistently so that your code can find them?

Well, all these problems disappear when you adopt a research data management (RDM) solution like OMERO. Of course, there is one step that you cannot skip: data annotation. But the effort is outweighed by the benefits you get.

Annotating data without overdoing it

First off, there are different kinds of annotation. For Key-Value pairs, the topic is vast, and you should look into the minimum annotation standards in your area of research. Still, I recommend REMBI (Recommended Metadata for Biological Images); it’s a great place to start for getting a useful description of your images.

Closing that parenthesis, let’s look at another kind of annotation: Tags. Those are very easily confused with Key-Value pairs, but conceptually the two are different. While Key-Value pairs describe in detail the content of the object they annotate, tags are meant to provide ways of finding and filtering the object they tag. You can see Key-Value pairs as formatted lab notebook entries, while tags are like the names of the folders you use to arrange your files. This difference means that while you should be exhaustive with Key-Value pairs to describe in detail what the image is about, tags are only meant to help you find the data: no need to add a tag for everything. As a rule of thumb, when creating tags or tagsets (a category of tags), ask yourself if you would ever use such a folder name in your folder hierarchy. The fewer tags you have, the less likely you are to forget one. And in fact, you can see tags as a dynamic folder hierarchy:

Instead of the classical folder hierarchy…

Condition_A / Experiment_01-01-2022 / genotype_XYZ / nuclei

…you could filter the same data with tags in any order; choose the one that works for you:

Tag_condition_A + Tag_nuclei + Tag_genotype_XYZ

Tag_genotype_XYZ + Tag_condition_A + Tag_nuclei

…

So after uploading datasets to your favorite RDM solution, remember to tag your data and you will see how easy it becomes to find it.
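
To make the dynamic folder idea concrete, here is a minimal Python sketch. It does not talk to any RDM server: the tagged images live in a plain dictionary, and the image names and the `images_with_tags` helper are invented for illustration. The only point is that a tag filter gives the same result in any order, whereas a folder path fixes one order for good.

```python
# Hypothetical in-memory stand-in for tagged images on an RDM server.
# Image names and tags are invented for illustration.
IMAGE_TAGS = {
    "img_001": {"condition_A", "nuclei", "genotype_XYZ"},
    "img_002": {"condition_A", "membrane", "genotype_XYZ"},
    "img_003": {"condition_B", "nuclei", "genotype_XYZ"},
}


def images_with_tags(*tags):
    """Return the names of images carrying every one of the given tags."""
    wanted = set(tags)
    return sorted(name for name, image_tags in IMAGE_TAGS.items()
                  if wanted <= image_tags)


# The order of the tags does not matter, unlike the levels of a folder path.
first = images_with_tags("condition_A", "nuclei", "genotype_XYZ")
second = images_with_tags("genotype_XYZ", "condition_A", "nuclei")
print(first)
print(first == second)
```

Dropping a tag from the call widens the selection, much like moving one level up in a folder tree, except that you can drop any level and not just the last one.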

Image analysis input/output

The inputs of image analysis pipelines are also often an issue: “The code looks great, but what wicked folder structure do I need for the program to recognize my files?” Maybe you have been in this situation before. But that’s not all. Think also about the image file format, the input Regions of Interest (ROIs: should they be a list of points in a text file, a json file, or an image mask with the same filename as the image + “_mask”?), and the additional metadata required for the processing (pixel size, concentration of compound C, …). Each of these is a hard-coded choice in a program. Wouldn’t it be better to have a common way of interacting with these kinds of data?

Well, that’s where a solution like OMERO can help you a lot. Instead of communicating with a stubborn filesystem, you talk to an API (application programming interface). All you need is a connection to the server; then you ask the API for the data associated with a certain image ID: “Give me the planes of image 123”, or “Give me the ROIs of image 123”. You get the idea.

And the nice thing is that it works the other way around, too. Your program detects ROIs? Attach them to the image through the server API. Result tables to keep? Attach them to the image through the server API. No need to worry about how the server stores them; that is exactly what it is there for.
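
Here is a rough sketch of what this change means for analysis code. Everything in it is hypothetical: `ImageStore`, `InMemoryStore` and their methods are names I made up, not the OMERO API, and the toy store keeps everything in dictionaries. What it shows is that the analysis function only ever sees an image ID and an interface, never a folder path or a file naming scheme.

```python
from typing import Protocol


class ImageStore(Protocol):
    """Hypothetical interface; not the API of OMERO or any real server."""

    def get_planes(self, image_id: int) -> list: ...
    def get_rois(self, image_id: int) -> list: ...
    def attach_rois(self, image_id: int, rois: list) -> None: ...


class InMemoryStore:
    """Toy store keeping planes and ROIs in dictionaries, for illustration."""

    def __init__(self, planes_by_image):
        self._planes = planes_by_image
        self._rois = {}

    def get_planes(self, image_id):
        return self._planes[image_id]

    def get_rois(self, image_id):
        return list(self._rois.get(image_id, []))

    def attach_rois(self, image_id, rois):
        self._rois.setdefault(image_id, []).extend(rois)


def detect_bright_pixels(store: ImageStore, image_id: int, threshold: int):
    """Read planes by image ID, then write the detected ROIs back."""
    rois = []
    for z, plane in enumerate(store.get_planes(image_id)):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > threshold:
                    rois.append((z, y, x))  # a point ROI as (z, y, x)
    store.attach_rois(image_id, rois)
    return rois


# One tiny 2x2 plane for the hypothetical image 123.
store = InMemoryStore({123: [[[0, 9], [3, 0]]]})
detect_bright_pixels(store, 123, threshold=5)
print(store.get_rois(123))
```

Swapping the toy store for a class that wraps a real server connection would leave `detect_bright_pixels` untouched, which is the whole appeal.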

But how are you supposed to get the image ID? This is where you get to be creative, and where the tagging comes in handy. Imagine the following situation: we have three tagged datasets on our OMERO server. Inside them are the image stacks we want to process, plus overview images that help to understand the dataset but that we would rather not pass to the image analysis program, to avoid crashing it.

“Talking” to the API, I can ask for the list of datasets that have “TAG_A”. Or I can make my query arbitrarily complex: “Retrieve all datasets that have TAG_A or TAG_C but not TAG_B” (this is called the algebra of sets). And: “For each dataset that is obtained, list all the images it contains, but reject an image if it has TAG_D.”

This is how tags and organized data make the input and output of your image analysis pipeline much clearer and easier to reuse.
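
The query described above can be sketched with Python sets. Again, this is an in-memory toy under stated assumptions, not a call to a real server: the `DATASETS` dictionary, the image IDs and the `select_image_ids` function are all invented, and TAG_D here plays the role of the tag on the overview images we want to skip.

```python
# Hypothetical in-memory stand-in for an RDM server: datasets with tags,
# each holding images with their own tags. All names and IDs are invented.
DATASETS = {
    "dataset_1": {"tags": {"TAG_A"},
                  "images": {101: set(), 102: {"TAG_D"}}},
    "dataset_2": {"tags": {"TAG_A", "TAG_B"},
                  "images": {201: set()}},
    "dataset_3": {"tags": {"TAG_C"},
                  "images": {301: set(), 302: {"TAG_D"}, 303: set()}},
}


def select_image_ids(datasets, any_of, none_of, reject_image_tags):
    """Collect image IDs from the datasets that match the tag query."""
    image_ids = []
    for name in sorted(datasets):
        dataset = datasets[name]
        has_wanted = bool(dataset["tags"] & any_of)      # TAG_A or TAG_C
        has_excluded = bool(dataset["tags"] & none_of)   # but not TAG_B
        if not has_wanted or has_excluded:
            continue
        for image_id, image_tags in sorted(dataset["images"].items()):
            if image_tags & reject_image_tags:           # skip TAG_D images
                continue
            image_ids.append(image_id)
    return image_ids


ids = select_image_ids(DATASETS, any_of={"TAG_A", "TAG_C"},
                       none_of={"TAG_B"}, reject_image_tags={"TAG_D"})
print(ids)
```

The resulting list of image IDs is all the analysis program needs as input.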

Collaborating

I hope that, now that you know how easy it is to select, filter and send back results, you will go back to your own data and start annotating it. And remember, there is no need to overdo it! But what if it isn’t your own data? What if you are collaborating with other research groups and processing their data? Provided that the images can be processed by a single pipeline, all you need from your collaborators is that they upload their data to OMERO and give you a way to obtain the long list of image IDs to process.

Maybe you impose a guideline on how the images should be tagged. Or maybe they tag them whichever way they like and tell you how to filter the image IDs. If that is what allows you to build the best possible analysis pipeline for them, they will hardly object to providing those tags. How they name their data, sort it, organize it in datasets, or who the images belong to: none of it matters anymore. Welcome to the future of data I/O in bioimage analysis.