GlycanML

A Multi-Task and Multi-Structure Benchmark for Glycan Machine Learning

1Peking University, 2Mila - Québec AI Institute, 3BioGeometry, 4HEC Montréal, 5CIFAR AI Research Chair
*Equal Contribution, †Corresponding Author
minghao.xu@mila.quebec, wentao.zhang@pku.edu.cn

Figure: Illustration of benchmark tasks. (a) Predicting the biological taxonomy of glycans at eight levels. (b) Judging whether a glycan is immunogenic or not in organisms. (c) Analyzing how a glycan glycosylates its target protein. (d) Given a protein and a glycan, predicting their binding affinity.

Introduction

We build GlycanML, a comprehensive benchmark for Glycan Machine Learning. The benchmark covers four types of tasks: glycan taxonomy prediction, glycan immunogenicity prediction, glycosylation type prediction, and protein-glycan interaction prediction. GlycanML represents each glycan both as a sequence and as a graph, which lets us extensively evaluate sequence-based models and graph neural networks (GNNs) on the benchmark tasks. In addition, by performing the eight glycan taxonomy prediction tasks simultaneously, we set up the GlycanML-MTL testbed for multi-task learning (MTL) algorithms.
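As a rough illustration of the two views, the sketch below turns a glycan sequence into a graph. It assumes IUPAC-condensed notation (linkages in parentheses, branches in square brackets, reducing end on the right), which this page does not specify. The function `glycan_to_graph` and its node/edge layout are hypothetical and are not GlycanML's actual data pipeline.

```python
import re

# Tokens: "[" or "]" (branch delimiters), "(...)" (a linkage), or a residue name.
TOKEN = re.compile(r"\[|\]|\([^()]*\)|[^\[\]()]+")


def glycan_to_graph(sequence):
    """Parse an IUPAC-condensed string into (nodes, edges).

    nodes: monosaccharide names, indexed by node id (node 0 is the reducing end).
    edges: (child, parent, linkage) triples.
    Assumption: the input is well formed; no validation is attempted.
    """
    nodes, edges = [], []
    stack = []        # branch points to return to once a branch is finished
    current = None    # residue that the next residue will attach to
    linkage = None    # most recently read linkage label
    # Read right to left so that each parent is seen before its children.
    for tok in reversed(TOKEN.findall(sequence)):
        if tok == "]":            # a branch starts (seen from the right)
            stack.append(current)
        elif tok == "[":          # the branch ends; resume at the branch point
            current = stack.pop()
        elif tok.startswith("("):
            linkage = tok[1:-1]
        else:
            nodes.append(tok)
            node = len(nodes) - 1
            if current is not None:
                edges.append((node, current, linkage))
            current = node
    return nodes, edges


# The N-glycan core pentasaccharide, written in IUPAC-condensed form.
core = "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc"
nodes, edges = glycan_to_graph(core)
```

A sequence-based model would consume `core` directly (for example, token by token), whereas a GNN would operate on `nodes` and `edges`.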

Benchmark Tasks

Glycan Taxonomy Prediction. We study glycan taxonomy prediction at the domain, kingdom, phylum, class, order, family, genus and species levels, leading to eight individual tasks. These tasks are formulated as classification problems with 4, 11, 39, 101, 210, 415, 922 and 1,737 biological categories, respectively. We report classification accuracy for each task.
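A minimal sketch of per-level accuracy is given below. The class counts come from the description above; the helper names and the toy logits are hypothetical, and the benchmark's own evaluation code may differ.

```python
# Number of categories per taxonomy level, as stated in the task description.
NUM_CLASSES = {
    "domain": 4, "kingdom": 11, "phylum": 39, "class": 101,
    "order": 210, "family": 415, "genus": 922, "species": 1737,
}


def argmax(scores):
    """Index of the largest score (the predicted class)."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(logits, labels):
    """Fraction of samples whose top-scoring class equals the label."""
    correct = sum(argmax(row) == y for row, y in zip(logits, labels))
    return correct / len(labels)


def per_level_accuracy(logits_by_level, labels_by_level):
    """One accuracy value per taxonomy level, i.e. one per task."""
    return {
        level: accuracy(logits_by_level[level], labels_by_level[level])
        for level in labels_by_level
    }


# Toy example (made-up numbers): three glycans scored on the 4-way domain task.
toy_logits = {"domain": [[2.0, 0.1, 0.0, -1.0],
                         [0.0, 1.5, 0.2, 0.1],
                         [0.3, 0.2, 0.1, 0.9]]}
toy_labels = {"domain": [0, 1, 2]}
result = per_level_accuracy(toy_logits, toy_labels)  # two of three are correct
```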

Glycan Immunogenicity Prediction. We formulate this task as a binary classification problem, i.e., predicting whether or not a glycan is immunogenic. We evaluate with the area under the precision-recall curve (AUPRC), which measures a model's trade-off between precision and recall on immunogenic glycans.
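One common way to estimate AUPRC is average precision, sketched below. This is an assumption about the estimator: the benchmark may compute the area differently (for example, by trapezoidal integration). The function name and toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k of the positives.

    labels: 1 for immunogenic, 0 otherwise.
    scores: predicted scores, higher meaning "more likely immunogenic".
    Assumption: scores are distinct; tied scores are not handled specially.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            total += true_pos / k  # precision among the top-k predictions
    return total / n_pos


# Toy example (made-up numbers): positives are ranked 1st and 3rd.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])  # (1/1 + 2/3) / 2
```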

Glycosylation Type Prediction. Given a glycan, we aim to predict whether it forms N-glycosylation, forms O-glycosylation, or remains in a free state, formulated as a three-way classification problem. Classification accuracy is used for evaluation.

Protein-Glycan Interaction Prediction. Given a protein and a glycan, this task aims to regress their binding affinity, which is represented by the Z-score-transformed relative fluorescence unit (RFU). For this task, we adopt Spearman's correlation coefficient as the evaluation metric to measure how well a model ranks a set of protein-glycan pairs with different binding affinities.
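The target transformation and the metric can be sketched as follows. Several details are assumptions rather than facts from this page: the Z-score here uses the population standard deviation over all values, tied values receive average ranks, and the relative fluorescence unit (RFU) readings are made up for illustration.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical RFU readings for four protein-glycan pairs, and model outputs.
rfu = [120.0, 300.0, 50.0, 800.0]
targets = zscore(rfu)
predictions = [0.1, 0.4, -0.3, 2.0]
rho = spearman(predictions, targets)
```

Because Spearman's coefficient depends only on ranks, a model is rewarded for ordering pairs correctly even when its predicted values are on a different scale from the Z-scored targets.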

Leaderboard

| Rank | Method | Mean Rank | Ranks: Domain → Interaction | Reference |
|------|--------|-----------|-----------------------------|-----------|
| 1 | RGCN | 2.5 | [1, 5, 1, 1, 1, 1, 2, 2, 2, 8, 3] | paper |
| 2 | CNN | 3.5 | [7, 6, 2, 2, 2, 2, 3, 5, 3, 2, 4] | paper |
| 3 | CompGCN | 3.9 | [5, 1, 3, 3, 4, 3, 1, 1, 7, 10, 5] | paper |
| 4 | GIN | 5.1 | [2, 3, 4, 4, 10, 5, 6, 6, 6, 4, 6] | paper |
| 5 | MPNN | 5.6 | [6, 7, 5, 5, 3, 4, 4, 4, 10, 3, 10] | paper |
| 6 | ResNet | 6.0 | [8, 8, 7, 6, 5, 8, 8, 9, 4, 1, 2] | paper |
| 7 | LSTM | 6.3 | [9, 9, 6, 7, 6, 6, 9, 10, 1, 5, 1] | paper |
| 8 | GAT | 6.6 | [4, 2, 8, 9, 7, 7, 5, 3, 9, 9, 9] | paper |
| 9 | GCN | 7.2 | [3, 4, 10, 8, 8, 9, 7, 7, 8, 7, 8] | paper |
| 10 | Transformer | 8.5 | [10, 10, 9, 10, 9, 10, 10, 8, 5, 6, 7] | paper |
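The leaderboard ordering can be checked with the short sketch below. It assumes that Mean Rank is the unweighted average of the eleven per-task ranks (presumably the eight taxonomy levels followed by immunogenicity, glycosylation type, and interaction). That assumption reproduces the listed ordering and approximately matches the listed values; the page does not state the exact formula or rounding.

```python
# Per-task ranks copied from the leaderboard (Domain → Interaction).
RANKS = {
    "RGCN":        [1, 5, 1, 1, 1, 1, 2, 2, 2, 8, 3],
    "CNN":         [7, 6, 2, 2, 2, 2, 3, 5, 3, 2, 4],
    "CompGCN":     [5, 1, 3, 3, 4, 3, 1, 1, 7, 10, 5],
    "GIN":         [2, 3, 4, 4, 10, 5, 6, 6, 6, 4, 6],
    "MPNN":        [6, 7, 5, 5, 3, 4, 4, 4, 10, 3, 10],
    "ResNet":      [8, 8, 7, 6, 5, 8, 8, 9, 4, 1, 2],
    "LSTM":        [9, 9, 6, 7, 6, 6, 9, 10, 1, 5, 1],
    "GAT":         [4, 2, 8, 9, 7, 7, 5, 3, 9, 9, 9],
    "GCN":         [3, 4, 10, 8, 8, 9, 7, 7, 8, 7, 8],
    "Transformer": [10, 10, 9, 10, 9, 10, 10, 8, 5, 6, 7],
}


def mean_rank(ranks):
    """Unweighted average of per-task ranks (lower is better)."""
    return sum(ranks) / len(ranks)


# Sort methods from the best (lowest) mean rank to the worst.
leaderboard = sorted(RANKS, key=lambda method: mean_rank(RANKS[method]))

for position, method in enumerate(leaderboard, start=1):
    print(position, method, f"{mean_rank(RANKS[method]):.2f}")
```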

BibTeX

@article{xu2024glycanml,
    title={GlycanML: A Multi-Task and Multi-Structure Benchmark for Glycan Machine Learning}, 
    author={Minghao Xu and Yunteng Geng and Yihang Zhang and Ling Yang and Jian Tang and Wentao Zhang},
    journal={arXiv preprint arXiv:2405.16206},
    year={2024},
}