Finding SeMo: large-scale linear motif detection and protein function annotation
Under review
De novo prediction of Short Linear Motifs (SLiMs) in protein sequences across all species is a challenging task with wide-ranging applications. Around 22% of protein sequences reported on UniProt are uncharacterized, and only about 0.2% of the sequences are manually annotated. Annotating functions across species would help in understanding many biological processes. This article presents Finding SeMo, a statistical method for de novo prediction of SLiMs that can find SLiMs in a large number of sequences without using any evolutionary or structural information. We show that Finding SeMo performs better than or on par with other methods on the ELM database. Furthermore, we use Finding SeMo to find SLiMs across all species and report over 29 million SLiMs. Finally, we demonstrate, through examples, how the motifs reported from the UniRef50 database can be used to annotate uncharacterized protein sequences, annotate motifs, and generalize existing motifs. We have also created a freely accessible webserver for finding motifs and annotating functions, and we have released its source code as open source.
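To illustrate the annotation use case mentioned above, the following minimal sketch scans a protein sequence for motifs written as regular-expression-like patterns, a common way of writing SLiMs. It is not the Finding SeMo algorithm or its webserver API: the `MOTIFS` table, its patterns and labels, the `annotate` function, and the example sequence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical motif table: pattern -> illustrative label.
# These patterns are invented for this example; they are NOT motifs
# reported by Finding SeMo or taken from any database.
MOTIFS = {
    "R..[ST]": "toy phospho-site-like motif",
    "P..P": "toy proline-rich motif",
}


def annotate(sequence, motifs=MOTIFS):
    """Return (start, matched_text, label) for each motif occurrence, sorted by start."""
    hits = []
    for pattern, label in motifs.items():
        # A lookahead with a capture group lets overlapping occurrences be found.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((match.start(), match.group(1), label))
    return sorted(hits)


# Toy sequence, also invented for the example.
for start, text, label in annotate("MKRAASPLVPAAPQ"):
    print(start, text, label)
```

Positions are 0-based. In practice the motif table would be replaced by predicted or curated motifs, and a match alone would be treated as a candidate annotation rather than as evidence of function.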