Algorithms in Bioinformatics - BC205

Christoforos Nikolaou, Ioannis Tsamardinos, Ioannis Tollis, Kleio Lakiotaki

Description

This course aims to describe some of the most prominent and most widely used algorithms for the analysis of biological data, with emphasis on handling and analyzing biological sequences. The course focuses on a detailed description of algorithms for alignment, the rapid search of short sequences, tracing patterns and finding motifs in sequences, sequence assembly and phylogenetic analysis among others. The main goal of the course is not to describe in detail the most sophisticated implementations but to present the rationale behind the design of algorithms in a constructive and educational manner from both theoretical and practical viewpoints. Classes will include theoretical lectures as well as practical exercises, where students will be required to implement algorithms in the language of their choice.

Course code: TMAPRE102
Category: »

Course Units

Introduction

This is a first introduction in the concept of "algorithms". The basic characteristics of algorithmic techniques such as iteration and recursion will be discussed in the context of simple non-bioinformatic examples such as finding the largest common divisor of two integers and sorting a set of integers.

We will also be discussing the philosophy of different types of algorithms from brute force to branch and bound, dynamic programing etc

Analysis of Sequence Composition

This class will focus on aspects of sequence (primarily genomic) composition and problems related to its analysis, which will serve as an introductory "warm-up" to algorithms. We will be discussing the problem of calculating the GC% of a given sequence, its comparison to that of another species and ways to detect outlier regions in terms of genomic in a given genome.

Algorithms that we will be discussing include:

a.calculating the GC% of a DNA sequence

b. the computation of k-mer frequencies from a DNA sequence

c. locating candidate regions of horizontal gene transfer based on k-mer frequencies

d. the definition of a sequence's origin of replication based on sequence composition.

Analysis of Sequence Motifs

We will be discussing the problem of motif definition, search and quantification in DNA sequences. Starting from the concept of a DNA motif we will be covering the concepts of consensus sequences and the probabilistic definition of motifs as PWM and PSSM tables.We will then discuss the concept of information based on Claude Shannon's theory of Entropy and implement algorithms for:

a. the construction of a motif's PWM and PSSM

b. the search of a sequence for a specific motif given its PWM/PSSM

c. the calculation of a motif's information content through Entropy calculations

d. the representation of a motif as a sequence logo

Motif Discovery

Continuing from the motif definition and search problem, we now turn on to the more difficult task of discovering a hidden motif in a DNA sequence. Starting from a common problem, the extraction of a hidden motif in a set of DNA sequences we will be discussing the following algorithmic approaches:

a. an exhaustive (brute force) search with exponential complexity

b. an exhaustive approach after data transformation with decreased complexity

c. a greedy approach that speeds up the process with a loss in accuracy

d. a randomized algorithm that combines speed with efficiency

In the context of randomized algorithmic search we will be discussing the Gibbs Sampler as a fundamental approach in motif discovery, which we will try to implement in an exercise for the discovery of a hidden motif in a set of DNA sequences.

Wrapping up the problem of motif finding we will be discussing implementations that combine data from different sources (sequence conservation, gene expression etc) that constitute the state of the art in the field, increasing the accuracy of the obtained results.

Rapid Sequence Searches

Picking up from where we left in Sequence Alignment and Comparisons, in this class we will be discussing ways to perform fast pattern matching for:

a. long patterns (i.e. not motifs)

b. against longer sequences (i.e. large sequence chunks, scaffolds or even entire chromosomes)

c. for large numbers of patterns (i.e. not one or two but of the orders of hundreds of sequences)

We will look into the main algorithms used for pattern matching such as the Knuth-Morris-Pratt (KMP) and the Boyer Moore Algorithms as general algorithms before moving on to biology-inspired algorithms for fast searches such as FASTA, BLAST and BLAT.

In essence, this class will serve as an introduction for the following in which we will focus on NGS related algorithms for sequence mapping

NGS Mapping Algorithms

NGS Data Analysis Algorithms

NGS Downstream Analyses