Tutorial 3: Decoding the grammar of DNA using Natural Language Processing
Time: Nov 28th, 14:00 - 16:00
Presenters: Tyrone Chen, Sonika Tyagi
DNA is the blueprint defining all living organisms. Understanding the nature and function of DNA is therefore at the core of all biological studies. Rapid advances in DNA sequencing and computing technologies over the past few decades have produced large quantities of DNA sequence data from diverse experiments, with growth exceeding that of all major social media platforms and astronomy data combined. However, biological data is both complex and high-dimensional, and is difficult to analyse with conventional methods.
Machine learning is naturally well suited to problems involving large volumes of complex data. In particular, applying Natural Language Processing (NLP) to the genome is intuitive, since DNA can itself be read as a natural language. However, genome NLP poses challenges that do not arise with human languages, including the difficulty of word segmentation (DNA has no spaces or punctuation to mark word boundaries) and of corpus comparison.
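To make the word segmentation problem concrete, the sketch below splits a DNA string into fixed-length k-mers, one common way of producing "words" from a sequence with no natural boundaries. This is an illustrative assumption, not necessarily the segmentation used by the genomeNLP workflow; the function name `kmer_tokenise` and the parameters `k` and `stride` are hypothetical.

```python
def kmer_tokenise(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers (hypothetical helper, for illustration).

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    if k < 1 or stride < 1:
        raise ValueError("k and stride must be positive integers")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# The same sequence yields different "sentences" depending on the choices made.
tokens_overlapping = kmer_tokenise("ATGCGTAC", k=3, stride=1)
tokens_disjoint = kmer_tokenise("ATGCGTAC", k=3, stride=3)
print(tokens_overlapping)
print(tokens_disjoint)
```

The choice of `k` and `stride` changes both the vocabulary and the resulting token sequence, which is one reason segmentation is harder for genomes than for languages with explicit word boundaries.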
To tackle these challenges, we developed the first automated and open-source genomeNLP workflow that enables efficient and accurate knowledge extraction from biological data [1] by automating and abstracting preprocessing steps unique to biology. This lowers the barrier to knowledge extraction for both machine learning practitioners and computational biologists. In this tutorial, we will demonstrate how our workflow can be used to address the above challenges, with implications in fields such as personalised medicine.