BaFF-Tutorial

Introduction

The BaFF web server automatically detects the biological features that characterize a set of prokaryotic organisms provided by the user. The core of the system is a searchable database of tens of diverse features compiled for more than 20,000 prokaryotic organisms.
The system locates the features (qualitative or quantitative) that are differentially associated with an input set of organisms with respect to a background; both sets can be provided by the user or result from a database search. This analysis of bacterial features is inspired by the enrichment analysis routinely used to extract the biological annotations that characterize a set of genes/proteins (e.g. those overexpressed in a given experiment).
Such a system can be used to interpret experimental results that come as long lists of organisms, such as those of a metagenomics experiment (where the bacteria present in a complex sample are identified) or the phylogenetic profile of a gene/protein (the set of species where it is present). With these analyses it is possible, for example, to detect that the input set of species is enriched in Gram-positive or pathogenic organisms, or that these species tend to have larger genomes or more genes involved in amino acid metabolism than the average (background).

Citing BaFF

Please cite the following reference when you mention any result or data obtained with this server:

Javier López-Ibáñez, Laura T Martín, Mónica Chagoyen, Florencio Pazos. (2019). Bacterial Feature Finder (BaFF)—a system for extracting features overrepresented in sets of prokaryotic organisms. Bioinformatics, Volume 35, Issue 18, 15 September 2019, Pages 3482–3483.

Database Search

The main interface of this server allows both constructing database searches that combine a number of criteria and performing an enrichment analysis. The result of a search is a paginated list of the organisms matching the search criteria, in which each TAXID links to the corresponding complete database record. The information in these records is hyperlinked to external resources, generally those from which the original data were obtained.

Enrichment Analysis - Examples

In this tutorial, we include some very simple examples that show the main capabilities of BaFF for extracting the biological features characteristic of (i.e. enriched in) a set of bacteria. They try to cover the diverse scenarios for performing such an analysis.

Example 1. Using the result of a database search as input

The simplest case, from a usage point of view, is to use the result of a database search as input and the whole database as background. This retrieves the characteristic features of the organisms matching our search criteria. Apart from the search criterion itself, which should show up as enriched (if it is specific enough), the other enriched features can be interpreted as those correlated with the search criterion. This toy example helps to understand the meaning of the results of an enrichment analysis, as well as the main features of the interface.

For example, let's retrieve all the symbiotic organisms stored in the database. We choose Symbiotic in the biotic relationship field and press Search. The matching organisms are listed at the bottom of the interface; there are 381 of them in the current version of the database.

Now let's perform an enrichment analysis with them, that is, locate the features differentiating these 381 from the background represented by the whole database. For that, we check Use search results in the input set section of the Enrichment analysis panel, leave the Background untouched, and press Get enriched features.

A table with the enriched features is shown. The features are classified in categories, so that particular classes of features can be shown or hidden with the checkboxes at the top of the table. The last two columns contain the p-value of the enrichment test and that of the FDR correction. A color scale is used to highlight the significant cases (green, p-value≤0.001; red, p-value>0.05; or orange for the intermediate values). The table is sorted by these p-values by default. If we are interested in a particular feature or category of features that is not listed at the top because its p-value is not good enough, we can re-sort the table by any other column by clicking the corresponding header. In this case, we can see that many enriched features belong to the Taxonomy class and represent taxa known to be related to symbiosis: Rickettsia, Erwinia, Buchnera... Since terms from all the levels of the taxonomy are evaluated, it is common that many of these terms represent the same final clade (e.g. Chlamydiae, Chlamydiales, Chlamydiaceae...).
To hide that massive taxonomic information and highlight other features, we un-check the Taxonomy (Term name) category. As expected, we see Symbiotic as a differential feature. We also see at the top a number of diseases, as well as pathways and systems commonly associated with these organisms. If we leave the mouse pointer on the p-value of a given feature, information on the number of organisms matching it shows up: for example, for Bronchitis, 22 out of 147 bacteria in our input set match it, which is highly significant (p-value≅0.0) taking into account the proportion of the 23,000 organisms with that particular annotation. The discrepancy between 147 and 381 (the size of our original input set) is due to the fact that only organisms with annotations in this particular category (disease) are used for the calculations. This tries to minimize the problem of counting un-annotated organisms as negatives.
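
The counting behind such a p-value can be sketched as follows. This is only an illustration under stated assumptions, not BaFF's actual implementation: it assumes a one-sided hypergeometric test, a common choice for enrichment analysis, and the background counts (2,000 organisms annotated for disease, 40 of them with the feature) are made-up numbers. Only the 22 out of 147 comes from the example above, and the function names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided P(X >= k) under the hypergeometric distribution.

    k: input organisms with the feature
    n: input organisms annotated in the feature's category
    K: background organisms with the feature
    N: background organisms annotated in the feature's category
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def count_in_category(organisms, annotations, feature):
    """Count matches using only organisms annotated in the category.

    `annotations` maps an organism to its set of terms in ONE category;
    organisms without any term there are left out instead of counted as negatives.
    """
    annotated = [o for o in organisms if annotations.get(o)]
    matching = [o for o in annotated if feature in annotations[o]]
    return len(matching), len(annotated)


# 22 of 147 is taken from the tutorial; K=40 and N=2000 are assumed values.
p = hypergeom_enrichment_p(k=22, n=147, K=40, N=2000)
print(f"p-value = {p:.3g}")
```

Restricting both n and N to annotated organisms mirrors the explanation given for the 147 versus 381 discrepancy.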

All those were qualitative features. We also see a number of quantitative features, such as number of genes, genes in different COG functional classes, or %GC. The deviation of these quantitative features in our input set towards higher or lower values (compared with the background) is indicated by an arrow next to the corresponding value (downwards if the deviation is towards lower values and upwards otherwise). In this case, we can see that all these features are shifted to lower values in our input set. This is confirmed by placing the mouse over the p-values, which shows the average value of the feature in the input and background sets. It is known that the genomes of these symbiotic organisms are reduced.
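
A comparison of this kind for one quantitative feature can be sketched as below. The statistical test BaFF applies is not stated in this tutorial, so the rank-sum (Mann-Whitney U) test with a normal approximation used here is an assumption, and the gene counts are toy values. The arrow direction follows the rule described above (down if the input average is lower, up otherwise).

```python
from math import erfc, sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def rank_sum_p(sample, background):
    """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction)."""
    n1, n2 = len(sample), len(background)
    ranks = average_ranks(list(sample) + list(background))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return erfc(abs((u1 - mu) / sigma) / sqrt(2))


def arrow(sample, background):
    """Direction of the deviation: 'down' if the input mean is lower, else 'up'."""
    return "down" if mean(sample) < mean(background) else "up"


# Toy gene counts (made up): a reduced-genome input set versus a background.
genes_input = [480, 560, 610, 700, 790]
genes_background = [2400, 3100, 3800, 4200, 4700, 5300]
print(arrow(genes_input, genes_background), rank_sum_p(genes_input, genes_background))
```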

The Save results button exports the feature table to a .tsv file so that it can be further processed or imported into a spreadsheet program. The last column of this file shows the information about deviation of quantitative values.
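
As an example of further processing, the sketch below reads a tab-separated table and applies the color thresholds described above. The exact layout of the exported file is not documented in this tutorial, so the column names (feature, category, p_value, fdr, deviation) and the sample rows are hypothetical placeholders; adapt them to the header of your own file.

```python
import csv
import io

# Made-up excerpt standing in for an exported table; real headers may differ.
SAMPLE_TSV = (
    "feature\tcategory\tp_value\tfdr\tdeviation\n"
    "Symbiotic\tBiotic relationship\t0.0\t0.0\t\n"
    "Number of genes\tGenome\t0.0004\t0.002\tdown\n"
    "Rod-shaped\tShape\t0.03\t0.08\t\n"
    "Motile\tMotility\t0.4\t0.6\t\n"
)


def color_for(p_value):
    """Color scale from the tutorial: green <= 0.001, red > 0.05, orange otherwise."""
    if p_value <= 0.001:
        return "green"
    if p_value > 0.05:
        return "red"
    return "orange"


def load_features(handle, p_column="p_value"):
    """Read a tab-separated table and attach a color class to every row."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["color"] = color_for(float(row[p_column]))
        rows.append(row)
    return rows


rows = load_features(io.StringIO(SAMPLE_TSV))
significant = [r["feature"] for r in rows if r["color"] == "green"]
print(significant)
```

For a real file, `open("results.tsv", newline="")` (a hypothetical file name) can be passed in place of the `io.StringIO` object.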

A related search would be that of organisms with small genomes. For that, in the main form we enter, for example, 800 as the maximum number of genes and leave the minimum blank. Then we perform the enrichment analysis for the resulting 673 bacteria as explained above. After hiding the many taxonomic classes known to be related to small bacteria, we see many features known to be associated with them (apart from the obvious number of genes): sphere-shaped, symbiotic, parasite, different diseases...
You can try other searches yourself. For example, what are the characteristic features of organisms with a curved shape? Do they make sense given what you know about the physiology of these bacteria?

Example 2. User provided input against different backgrounds

Now we are going to see a more "real world" scenario for using this system, in which the input is a list of organisms of interest to the user. In this case, these lists comprise the organisms where a given system or gene is present, so that some of the features enriched for them are expected to reflect the function of that gene/system.

The first example is the set of organisms where the Sox system, involved in sulfur oxidation, is present (Ghosh, et al. 2009. Res Microbiol. 160:409-420). In this case, we take all bacteria where the SoxB gene (a constitutive member of that multienzymatic complex) is present. That list of TAXIDs (SoxB_all.taxid) was retrieved from the example data of the Mirrortree Server (Ochoa & Pazos (2010). Bioinformatics 26:1370-1371).
In the Overrepresentation analysis section of the main interface, we choose that file as Input set and leave the background untouched, since we want to know what characterizes the bacteria containing this gene compared with all organisms in the database. Next, we press Get enriched features. The first difference we see with respect to the previous examples is that a message now shows up reporting the number of TAXIDs of our list found in the database (98 out of 100 in this case). This is because a submitted file could contain wrong TAXIDs, or simply organisms not present in the current version of our database. This check is also performed when a file is provided for the background set (see below).
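
The same kind of check can be run locally before submitting a list. The sketch below assumes a plain-text file with one TAXID per line and a locally available set of database TAXIDs; both the file layout and the function names are assumptions for illustration, and the identifiers are placeholders rather than real records.

```python
def parse_taxids(text):
    """Extract TAXIDs from text with one identifier per line, skipping blank lines."""
    taxids = []
    for line in text.splitlines():
        token = line.strip()
        if token:
            taxids.append(token)
    return taxids


def check_against_database(taxids, database_taxids):
    """Split the submitted TAXIDs into those found in the database and those missing."""
    unique = list(dict.fromkeys(taxids))  # assumption: duplicates are counted once
    found = [t for t in unique if t in database_taxids]
    missing = [t for t in unique if t not in database_taxids]
    return found, missing


# Toy database and submission (placeholder identifiers).
database = {"1001", "1002", "1003", "1004"}
submitted = parse_taxids("1001\n1003\n\n9999\n1003\n")
found, missing = check_against_database(submitted, database)
print(f"{len(found)} out of {len(found) + len(missing)} TAXIDs found in the database")
# prints "2 out of 3 TAXIDs found in the database"
```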
We see a number of taxonomic features indicating that our set of organisms is enriched in proteobacteria, more specifically alpha-proteobacteria, a known characteristic of the Sox system (Ghosh et al., 2009). At the top, we also see thiosulfate oxidation by SOX complex, Anoxygenic photosynthesis II, as well as other biosystems related to redox metabolism. So, if the function of this system were unknown, it could be inferred from the differential characteristics of the bacteria containing it.
It is known that the Sox system of certain subgroups of proteobacteria has distinctive subunit compositions (apart from SoxB and other core components) and evolutionary paths (e.g. involving horizontal gene transfer, HGT), although the physiological implications of these differences are not fully understood. To highlight the characteristic features of one of these subgroups with respect to all SoxB-containing bacteria, we can give it as input to BaFF and now use SoxB_all.taxid as background instead of input. We can do this, for example, for the beta-proteobacteria subset of SoxB-containing organisms. In this case, apart from the obvious taxonomic features (e.g. betaproteobacteria), we see a number of redox-related biosystems (e.g. involving copper or nitrogen) that could shed some light on the differential role of this system in beta-proteobacteria, although so far these are just "blind predictions".

Another example is the E. coli gene sfmA. It codes for a protein homologous to a component of the fimbriae, but its specific role in the cell is unknown. The list of organisms where this gene is present (b0530.taxid) was obtained from the orthology tables generated by Ochoa et al (2015). Bioinformatics 31:2166-2173. If we use that file as input and the whole database as background, the enriched features mainly reflect biosystems characteristic of Enterobacteria, the group to which all these organisms, including E. coli, belong. This is not very informative and could be influenced by the bias towards that important group of bacteria present in all databases. Even the biosystem annotations could be biased, since many pathways, complexes, etc. were first described in this well-studied group.
In general, if our input set is biased towards a certain taxonomic group (whether this is known a priori or found a posteriori, after a first enrichment analysis), it is a good idea to redo the analysis using that group as background, so that this bias is partially corrected. In that way we will only find features differentiating our set from the rest of the bacteria in the same group, but those could be precisely the ones we are interested in.
So, in this case, we are going to repeat the analysis using the Enterobacterales group as background set. For that, we search the database for Enterobacterales in the Taxonomy (Term name) field, check Use search results as Background set, and give the same file as before as Input set. Now more specific biosystems show up that could point to a potential role for this protein: we see many pathways related to bacterial virulence and resistance to antibiotics. The role of fimbriae in these kinds of processes has been described previously, and hence it is reasonable to think that sfmA could be involved in pathogenesis and/or drug resistance.
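
The effect of changing the background can be illustrated with a toy calculation. Everything below is made up for illustration (integer identifiers, group sizes and feature assignments), and fold enrichment is used only as a simple stand-in for the statistical test: a feature typical of the whole taxonomic group stops standing out once that group is the background, whereas a feature specific to the input set remains enriched.

```python
def proportion(organisms, has_feature):
    """Fraction of the organisms that carry the feature."""
    return sum(1 for o in organisms if o in has_feature) / len(organisms)


def fold_enrichment(input_set, background, has_feature):
    """Feature frequency in the input divided by its frequency in the background."""
    return proportion(input_set, has_feature) / proportion(background, has_feature)


# Toy universe: 100 organisms, 20 of them in the taxonomic group of interest.
whole_database = set(range(100))
group = set(range(20))
input_set = set(range(5))  # five organisms, all inside the group

group_typical = set(range(18))       # carried by most of the group, absent elsewhere
specific = set(range(4)) | {50, 51}  # frequent in the input set, rare in the group

for name, feature in [("group-typical", group_typical), ("specific", specific)]:
    vs_all = fold_enrichment(input_set, whole_database, feature)
    vs_group = fold_enrichment(input_set, group, feature)
    print(f"{name}: {vs_all:.2f}x vs whole database, {vs_group:.2f}x vs group")
```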

Contact

Florencio Pazos: pazos{at}cnb.csic.es.
Javier López-Ibáñez: jlopezibanez{at}cnb.csic.es.