We build GlycanML, a comprehensive benchmark for Glycan Machine Learning. The GlycanML benchmark consists of diverse types of tasks including glycan taxonomy prediction, glycan immunogenicity prediction, glycosylation type prediction, and protein-glycan interaction prediction. Glycans can be represented by both sequences and graphs in GlycanML, which enables us to extensively evaluate sequence-based models and graph neural networks (GNNs) on benchmark tasks. In addition, by performing eight glycan taxonomy prediction tasks simultaneously, we set up the GlycanML-MTL testbed for multi-task learning (MTL) algorithms.
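To make the two glycan views concrete, here is a minimal sketch of turning a sequence into a graph. It assumes the glycan is written in IUPAC-condensed notation (for example `Gal(b1-4)GlcNAc`), which is not stated above, and it handles only linear, unbranched glycans. The function name `linear_glycan_to_graph` is hypothetical and is not part of any GlycanML API.

```python
import re


def linear_glycan_to_graph(iupac: str):
    """Split a linear IUPAC-condensed glycan into nodes and labeled edges.

    Assumption: no branching (no square brackets) in the input string.
    Returns (nodes, edges), where nodes are monosaccharide names and each
    edge is (source_index, target_index, glycosidic_bond_label).
    """
    if "[" in iupac or "]" in iupac:
        raise ValueError("branched glycans are outside this sketch")
    # The capturing group keeps the bond text, so the split alternates
    # monosaccharide, bond, monosaccharide, bond, ...
    parts = re.split(r"\(([^()]*)\)", iupac)
    nodes = parts[0::2]
    bonds = parts[1::2]
    if any(not name for name in nodes):
        raise ValueError("malformed glycan string")
    edges = [(i, i + 1, bond) for i, bond in enumerate(bonds)]
    return nodes, edges


nodes, edges = linear_glycan_to_graph("Gal(b1-4)GlcNAc(b1-3)Gal")
```

A sequence model would consume the token order directly, while a GNN would consume `nodes` and `edges`, with the bond label serving as the edge type.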
Glycan Taxonomy Prediction. We study glycan taxonomy prediction at the domain, kingdom, phylum, class, order, family, genus, and species levels, which yields eight individual tasks. These tasks are formulated as classification problems with 4, 11, 39, 101, 210, 415, 922, and 1,737 biological categories, respectively. We report classification accuracy for each task.
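The eight-task setup can be sketched as a mapping from taxonomy level to number of classes, together with a per-task accuracy computation. The class counts are taken from the description above. The names `TAXONOMY_NUM_CLASSES`, `accuracy`, and `per_task_accuracy` are illustrative assumptions, not identifiers from the benchmark code.

```python
# Number of categories per taxonomy level, as listed in the task description.
TAXONOMY_NUM_CLASSES = {
    "domain": 4,
    "kingdom": 11,
    "phylum": 39,
    "class": 101,
    "order": 210,
    "family": 415,
    "genus": 922,
    "species": 1737,
}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_task_accuracy(labels, preds):
    """Compute accuracy separately for every taxonomy level present in labels."""
    return {
        task: accuracy(labels[task], preds[task])
        for task in TAXONOMY_NUM_CLASSES
        if task in labels
    }


scores = per_task_accuracy(
    {"domain": [0, 1, 2, 3], "kingdom": [5, 5]},
    {"domain": [0, 1, 1, 3], "kingdom": [5, 4]},
)
```

A multi-task model would typically attach one output head per level, sized according to this mapping, and report one accuracy per head.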
Glycan Immunogenicity Prediction. We formulate this task as a binary classification problem, i.e., predicting whether a glycan is immunogenic or not. We evaluate with the AUPRC metric to measure the trade-off between precision and recall of a model on immunogenic glycans.
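As an illustration of the metric, the sketch below computes average precision, one common estimator of AUPRC. Whether the benchmark uses this estimator or another one (such as trapezoidal interpolation of the precision-recall curve) is not stated here, so treat the choice as an assumption. The function name `average_precision` is hypothetical, and the sketch assumes there are no tied scores.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels (1 = immunogenic, 0 = not).

    Sorts examples by descending score and averages the precision values
    observed at the rank of each positive example. Assumes no tied scores.
    """
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have equal length")
    n_pos = sum(1 for label in y_true if label == 1)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Here the positives sit at ranks 1 and 3, so the precisions 1/1 and 2/3 average to 5/6.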
Glycosylation Type Prediction. Given a glycan, we aim to predict whether it forms N-glycosylation, forms O-glycosylation, or remains in a free state, formulated as a three-way classification problem. Classification accuracy is used for evaluation.
Protein-Glycan Interaction Prediction. Given a protein and a glycan, this task aims to regress their binding affinity, which is represented by the Z-score-transformed relative fluorescence unit (RFU). For this task, we adopt Spearman's correlation coefficient as the evaluation metric to measure how well a model ranks a set of protein-glycan pairs with different binding affinities.
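The target transform and the metric can be sketched in plain Python as follows. The Z-score here uses the population standard deviation, and ties receive average ranks; both are assumptions, since the exact conventions are not given above. All function names are illustrative.

```python
import math


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("values must not all be identical")
    return [(v - mean) / std for v in values]


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


rfu = [120.0, 300.0, 450.0, 900.0]   # made-up raw fluorescence values
targets = zscore(rfu)                # regression targets
predictions = [0.1, 0.4, 0.3, 0.9]   # made-up model outputs
rho = spearman(targets, predictions)
```

Because Spearman's coefficient depends only on ranks, it is unchanged by the Z-score transform; it rewards a model for ordering protein-glycan pairs correctly even when the predicted magnitudes are off.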
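Leaderboard. Methods are ordered by their mean rank across the 11 benchmark tasks. Per-task ranks are listed in task order, from domain-level taxonomy prediction through protein-glycan interaction prediction.

Rank | Method | Mean Rank | Ranks: Domain → Interaction | Reference |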
---|---|---|---|---|
1 | RGCN | 2.5 | [1, 5, 1, 1, 1, 1, 2, 2, 2, 8, 3] | paper |
2 | CNN | 3.5 | [7, 6, 2, 2, 2, 2, 3, 5, 3, 2, 4] | paper |
3 | CompGCN | 3.9 | [5, 1, 3, 3, 4, 3, 1, 1, 7, 10, 5] | paper |
4 | GIN | 5.1 | [2, 3, 4, 4, 10, 5, 6, 6, 6, 4, 6] | paper |
5 | MPNN | 5.6 | [6, 7, 5, 5, 3, 4, 4, 4, 10, 3, 10] | paper |
6 | ResNet | 6.0 | [8, 8, 7, 6, 5, 8, 8, 9, 4, 1, 2] | paper |
7 | LSTM | 6.3 | [9, 9, 6, 7, 6, 6, 9, 10, 1, 5, 1] | paper |
8 | GAT | 6.6 | [4, 2, 8, 9, 7, 7, 5, 3, 9, 9, 9] | paper |
9 | GCN | 7.2 | [3, 4, 10, 8, 8, 9, 7, 7, 8, 7, 8] | paper |
10 | Transformer | 8.5 | [10, 10, 9, 10, 9, 10, 10, 8, 5, 6, 7] | paper |
@article{xu2024glycanml,
title={GlycanML: A Multi-Task and Multi-Structure Benchmark for Glycan Machine Learning},
author={Minghao Xu and Yunteng Geng and Yihang Zhang and Ling Yang and Jian Tang and Wentao Zhang},
journal={arXiv preprint arXiv:2405.16206},
year={2024},
}