DiTaxa: Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detectio

Identifying combinations of taxa distinctive for microbiome-associated diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on accuracy of microbiome analysis techniques. We propose subsequence based 16S rRNA data analysis, as a new paradigm for microbiome phenotype classification and biomarker detection. This method and software called DiTaxa substitutes standard OTU-clustering or sequence-level analysis by segmenting 16S rRNA reads into the most frequent variable-length subsequences. These subsequences are then used as data representation for downstream phenotype prediction, biomarker detection and taxonomic analysis. Our proposed sequence segmentation called nucleotide-pair encoding (NPE) is an unsupervised data-driven segmentation inspired by Byte-pair encoding, a data compression algorithm. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host phenotypes. We compared the performance of DiTaxa to the state-of-the-art methods in disease phenotype prediction and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 17 out of 29 taxa with confirmed links to periodontitis (recall=0.59), relative to 3 out of 29 taxa (recall=0.10) by the state-of-the-art method. On synthetic benchmark data, DiTaxa obtained full precision and recall in biomarker detection, compared to $0.91$ and $0.90$, respectively. In addition, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation performed competitively to the state-of-the art using OTUs or k-mers. For the rheumatoid arthritis dataset, DiTaxa substantially outperformed OTU features with a macro-F1 score of 0.76 compared to 0.65. Due to the alignment- and reference free nature, DiTaxa can efficiently run on large datasets. The full analysis of a large 16S rRNA dataset of 1359 samples required ~1.5 hours on 20 cores, while the standard pipeline needed ~6.5 hours in the same setting
Github link: https://github.com/ehsanasgari/DiTaxa
@article {Asgari255018,
@article {Asgari334722,
	author = {Asgari, Ehsaneddin and M{\"u}nch, Philipp C. and Lesker, Till R. 
and McHardy, Alice Carolyn and Mofrad, Mohammad R.K.},
	title = {Nucleotide-pair encoding of 16S rRNA sequences 
for host phenotype and biomarker detection},
	year = {2018},
	doi = {10.1101/334722},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2018/05/30/334722},
	eprint = {https://www.biorxiv.org/content/early/2018/05/30/334722.full.pdf},
	journal = {bioRxiv}
Maintainer Ehsaneddin Asgari (asgari@berkeley.edu)
PIs Professor Alice McHardy and Professor Mohammad Mofrad