Schematic overview of the KNN algorithm. A: Metabolic neighbors are placed in a multidimensional matrix using balltree algorithm. A Pfam annotation is used as the basis for this calculation.
B: We refer to the distance between neighbors as metabolic distance. The distance is larger the different the Pfam annotations are.
C: Now we want to predict a medium for a new bacterium. To do this, we also perform a Pfam annotation and place it within the precomputed matrix (in orange).
D: Using the k-Nearest Neighbor (KNN) algorithm, we now determine the nearest metabolic neighbors. Only these neighbors are considered for the following steps.
E: Next, all the media of the metabolic neighbors are considered. If multiple media are known for a neighbor, they are all considered.
F: Common components in the media of the neighbors are searched. Based on this, a score is given to each ingredient. In this example, the score for NaCl is highest because it is present in all media. The metabolic distance to said neighbors also plays a role. Finally, media are searched in the database that contain as many of these ingredients as possible and a score is given for each media.
The Score
There are two different scoring functions that lead to completely different results depending on the medium. For this reason, both are displayed side by side in the current phase of development.
Summarized ingredient scores
With this scoring function, the scores of all ingredients found in a medium are summed up as long as the concentration is within the specified range. As a result, this scoring favors media that contain as many of the predicted ingredients as possible. The downside is that there is currently no penalty if there are additional substances in the media that could possibly prevent growth.
Euclidian Distance
As the name implies, this function calculates the Euclidean distance between the ideal medium (which contains all calculated components) and the media in the database. In this process, the score of the ingredients is included in the weighting. This method often seems to favor media with fewer ingredients, and sometimes yields unexpected results.
Why don't we use phylogenetic distance?
Often, phylogenetically closely related species can differ greatly in both their phenotypes and metabolic capacities.
An example is Thorsellia anophelis, a mosquito pathogen. It is related to Edwardsiella anguillarum with a phylogenetic distance of 7.6 %. Nonetheless, the metabolic distance is more than 39 %. On the same note, Orbus hercynius is the closest metabolic neighbor with a distance of 26.3 %, although having a phylogenetic distance of almost 11 %. Remarkably, Orbus hercynius and Thorsellia anophelis use similar media: TRYPTICASE SOY BROTH AGAR (DSMZ Medium 535) and TRYPTONE SOYA BROTH (TSB) (DSMZ Medium 545), respectively.
The following figure illustrates how much the phylogenetic distance can differ from the metabolic distance:
Comparison of metabolic distance with phylogenetic distance using the Gammaproteobacterium Thorsellia anophelis. as an example.
Plotted is the metabolic distance introduced in this method on the y-axis and the phylogenetic distance on the x-axis. The latter is based on 16S rRNA and is from the RDP database. Representatives of 430 different families belonging to 32 different phyla are used for comparison.
It can be clearly seen that although there is a relation of metabolic and phylogenetic distance, the distances scatter significantly. Thus, the metabolic nearest neighbor is not the same as the phylogenetic nearest neighbor.
×Example for a found medium.
Listed are all components that the found medium contains. The concentration is shown in green if the component was predicted by the kNN algorithm and the concentration corresponds to the predicted concentration. Yellow means that the concentration differs from the predicted one. Red marked components are not present in the prediction and should possibly be omitted.
Medium prediction
MediaDive is part of the DiASPora project,
where we want to enrich biodiversity data using computational methods.
As part of this project, we are going to use artificial intelligence
to predict the cultivation conditions
for unculturable bacteria.
Stay tuned!
You will find more information on this project here, as soon as we finished our first prototype.
We are using cookies to improve the user experience. We save the following information: user preferences for language, accessibility, help features and medium favourites. Unfortunately, you cannot use this site if you don't agree.