Bioimaging File Formats explained

The smallest piece of information underlying a computer is the distinction of two states (on or off), which can be represented as a binary digit (bit) with the values 1 or 0. To store more complex information, several bits are combined; for example, 8 bits in a row make up one byte, typically the smallest unit of information storage in computer memory. The discrete combinations of ones and zeros in one byte allow storing up to 256 different states (2^8), and one can assign a specific meaning to each of these states. For example, a byte can represent a letter of the alphabet or a number. An example is the American Standard Code for Information Interchange (ASCII), where the lowercase letter “a” is encoded in binary code by 01100001, while the digit “1” is encoded by 00110001. In contrast, interpreted as an unsigned integer, the same 8-bit binary code can represent any of the 256 decimal numbers from 0 to 255, where 00000001 is the decimal number 1, while 11111111 is the decimal number 255. Larger chunks for storing pieces of information allow denoting many more states, increasing over 12-bit, 16-bit, 32-bit, 64-bit, and so on.
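
As a minimal illustration of this correspondence, the following Python snippet prints the bit patterns mentioned above (it assumes nothing beyond a standard Python 3 interpreter):

    # The letter "a" is stored as the byte 01100001 (decimal 97) in ASCII.
    print(format(ord("a"), "08b"))   # -> 01100001
    print(format(ord("1"), "08b"))   # -> 00110001

    # Interpreting the same 8 bits as an unsigned integer instead of a character:
    print(int("00000001", 2))        # -> 1
    print(int("11111111", 2))        # -> 255, the largest of the 256 values 0-255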

Computers essentially store, read, and transmit information as linear sequences of these bits. However, binary code is not what computer scientists use directly to write programs or store information. Instead, data is structured using programming languages and file specifications. A simple example is that a table consists of rows and columns; hence it is structured in at least two dimensions. Yet, when a computer transmits or stores this information, it must do so as a sequence of bits. “Serialization” describes the process of transforming structured data into a linear series of bits, so that its state can be saved in a form that computers can exchange. To ensure that the information preserved by the data structure is still available after serialization and can be restored on another computer (de-serialized), there are different specifications of how multidimensional structures and relationships are preserved during serialization. These specifications are called “serialization formats”, of which several exist (e.g., XML, JSON, or YAML). Thus, when a program on one computer serializes information according to the JSON specification, another computer can receive the file encoding this information and de-serialize it using the JSON specification.
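
As a sketch of this round trip, the following Python snippet serializes a small table to JSON and restores it; the table contents are made up for illustration:

    import json

    # A two-dimensional table represented as structured data in memory.
    table = {
        "columns": ["filename", "channel", "exposure_ms"],
        "rows": [
            ["img_001.tif", "DAPI", 50],
            ["img_002.tif", "GFP", 120],
        ],
    }

    # Serialization: the structured data becomes one linear sequence of
    # characters (and, on disk, bits) that can be stored or transmitted.
    serialized = json.dumps(table)

    # De-serialization: another program restores the original structure.
    restored = json.loads(serialized)
    assert restored == table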

Data models vs. file formats

The data model specifies how information is subdivided into pieces and how these pieces are semantically structured to best represent the full information content. In bioimaging, a data model is a scheme defining how all necessary pieces of information about a microscopy acquisition result (a single image or multiple images) are structured so that all information about the acquired image(s) can be retrieved. In other words, the data model is the abstract set of rules defining how information from the outside world is represented inside a computer, with specific terms and relationships, in a machine-readable way.

To make use of a data model, it must be defined how its set of rules is implemented in the form of bits and bytes in a computer. This technical implementation of a data model is defined by the file format. In other words, file formats are containers describing how the information is organized so that a computer can read, exchange, store, and visualize it with the appropriate software. Different file formats can hence use the same data model, each encoding it differently for the computer.
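
To illustrate the distinction, the following sketch encodes one and the same minimal “image” data model in two different serializations (JSON and XML); the field names are hypothetical and not taken from any real specification:

    import json
    import xml.etree.ElementTree as ET

    # One made-up data model for a single acquired image.
    image = {"name": "cells_01", "size_x": 512, "size_y": 512, "pixel_type": "uint16"}

    # Format 1: the model encoded as JSON.
    json_repr = json.dumps(image)

    # Format 2: the same model encoded as XML.
    root = ET.Element("Image", name=image["name"],
                      size_x=str(image["size_x"]), size_y=str(image["size_y"]),
                      pixel_type=image["pixel_type"])
    xml_repr = ET.tostring(root, encoding="unicode")

    print(json_repr)  # {"name": "cells_01", "size_x": 512, ...}
    print(xml_repr)   # <Image name="cells_01" size_x="512" ... />

Both representations carry the same data model; only the encoding for the computer differs.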

When a data model or a file format is used by the majority of computer applications within or across a specific field, it may be referred to as a standard.

“Standard” file formats and (meta)data models in bioimaging

Most research microscopes are supplied by industry vendors who build their instruments together with the software necessary to control the microscope and record images. Naturally, each vendor ensures that their system provides optimal performance for recording data with their microscopes. This includes engineering the file formats in which their microscope software saves the recorded images. The multitude of imaging modalities from many different microscope vendors has led to the creation of a vast number of different file formats for microscopy, many of which are “owned” by the companies and are not open to the community (i.e., they are proprietary file formats).

This means there is no universal standard for file formats or metadata in the field of bioimaging.
Converging on a standard, or a set of standard file formats, in bioimaging (Swedlow et al., 2021, Nat Meth) is a declared goal collaboratively pursued by the international bioimaging community, organized under the roof of the international exchange network Global BioImaging.

However, the OME data model, the OME-XML file format (Goldberg et al., 2005, Genome Biol), and the OME-TIFF file format (Linkert et al., 2010, J Cell Biol) have de facto become standard formats for many applications. Since many computer applications can read TIFF, the OME-TIFF file format allows quite broad use of the imaging data in different applications. OME-TIFF belongs to the classical file formats. At present, a new format specification suitable for accessing data in cloud environments is under development: OME-NGFF (Moore et al., 2021, Nat Meth), which can be categorized as a so-called next-generation file format (NGFF). A concrete implementation of OME-NGFF is the file format OME-Zarr (Moore et al., 2023, Histochem Cell Biol). A high-level explanation of the strengths of Zarr is given in an Open Source Science Initiative article by J. Moore (2022).

Microscope vendors often engineer their own specific file formats to work optimally with their instruments during recording, and with their own software for processing and analysis. Such industry-owned, closed formats are called proprietary file formats (PFF); a sketch of how they are commonly opened outside vendor software follows the list below. Examples of proprietary file formats in microscopy are:

.czi (a file format by Zeiss)
.lif (a file format by Leica)
.nd2 (a file format by Nikon)
… and many more.
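
As a sketch of how such proprietary files are commonly opened outside the vendor's own software, the following snippet uses aicsimageio, one of several community Python reader libraries; the file name is hypothetical:

    from aicsimageio import AICSImage

    # "experiment.czi" is a hypothetical file path.
    img = AICSImage("experiment.czi")

    print(img.dims)  # dimension order and sizes, e.g. T, C, Z, Y, X
    # Load one channel of one timepoint as a plain numpy array.
    pixels = img.get_image_data("ZYX", T=0, C=0)
    print(pixels.shape, pixels.dtype)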

As opposed to the PFFs from different vendors, the Open Microscopy Environment (OME) Consortium has developed the OME data model, a specification for storing biological imaging data, together with an open file format, OME-TIFF, which includes an OME-XML metadata header.
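
As a minimal sketch of how this embedded metadata can be accessed, the following snippet reads an OME-TIFF with the Python library tifffile; the file name is hypothetical:

    import tifffile

    with tifffile.TiffFile("acquisition.ome.tiff") as tif:
        pixels = tif.asarray()       # the image planes as a numpy array
        ome_xml = tif.ome_metadata   # the embedded OME-XML document as a string

    print(pixels.shape)
    print(ome_xml[:200])  # the beginning of the OME-XML metadata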

Classical file formats are written to hard drive or flash drive storage in such a way that the information from image planes is organized in one linear sequence of bits. This way of storing information has limitations when it comes to dynamically using only parts of the information stored in image files. Typically, for a computer program to use an image, the computer must load it into its temporary memory (RAM). Even if access is required only for a subset of one plane, or a small region across different planes, all relevant planes have to be loaded into RAM. Very large files take considerable amounts of time to load and require large RAM sizes, which are often not available to all users. Additionally, if data is transferred from one location to another, the full file must be transferred before the information is accessible and usable. Large files take a long time to transfer, impeding their readiness for access and use in shared or remote environments. Next-generation file formats take a different approach to storing the information, which allows relevant parts of a file (chunks of data) to be loaded dynamically. Essentially, this works by breaking the file down into a large number of tiny file pieces that can be loaded in whatever combinations are needed.
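
The chunking idea can be sketched with the Python library zarr; the array shape, chunk size, and file name below are illustrative assumptions, not OME-NGFF specifics:

    import numpy as np
    import zarr

    # Write a hypothetical 100-plane stack as a chunked array: every
    # 256 x 256 tile of every plane is stored as its own small piece.
    z = zarr.open("stack.zarr", mode="w", shape=(100, 1024, 1024),
                  chunks=(1, 256, 256), dtype="uint16")
    z[50] = np.random.randint(0, 4096, size=(1024, 1024), dtype="uint16")

    # Reading a small region touches only the chunks that overlap it;
    # the other 99 planes never have to be loaded into RAM.
    region = zarr.open("stack.zarr", mode="r")[50, 0:256, 0:256]
    print(region.shape)  # (256, 256)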
An excellent introduction to the difference between classical and next-generation file formats in the context of bioimaging is given in a talk by C. Tischer during a Global BioImaging workshop in 2022.