Today, we are concerned with mapping short reads to long genomes. The lecture will be, as usual, online, and the slides are here: wyk12-ngs
Today’s assignments are mostly related to fast text indexes:
1. Write a function compute_BWT(txt) that returns the Burrows-Wheeler Transform of the given text (as a string), together with the arrays C(x) and OCC(i,x) as dictionaries.
2. Apply the function to the following DNA sequence:
3. Using the method described during the lecture, compute the LF (last-to-first) mapping, and write a function that finds the occurrences of a short string in the text. Use it to find the occurrences of the string AAAC in the text above.
4. Write another function that finds occurrences of AAAACTCCGCTGGCATTCACAAAT in the text with at most one error, by splitting the query into two halves and requiring at least one half to match with no errors.
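As a rough illustration of assignments 1, 3 and 4, here is a minimal sketch on a toy text rather than the lecture's sequence. Several things in it are assumptions and not part of the assignment: the text ends with a sentinel `$`, OCC is a dictionary keyed by `(i, x)` tuples counting occurrences of `x` in `bwt[:i]`, "one error" is read as a single mismatch (no insertions or deletions), and the helper names `suffix_array`, `LF`, `find_exact` and `find_one_mismatch` are hypothetical. The naive suffix sort is only suitable for short texts.

```python
from collections import Counter


def suffix_array(txt):
    # Naive suffix sort: fine for short exercise texts, far too
    # memory-hungry for a whole genome (it materialises every suffix).
    return sorted(range(len(txt)), key=lambda i: txt[i:])


def compute_BWT(txt):
    """Return (bwt, C, OCC) for txt, which is assumed to end with '$'."""
    sa = suffix_array(txt)
    bwt = "".join(txt[i - 1] for i in sa)  # i == 0 wraps around to '$'
    # C[x]: number of characters in txt strictly smaller than x.
    counts = Counter(txt)
    C, total = {}, 0
    for x in sorted(counts):
        C[x] = total
        total += counts[x]
    # OCC[(i, x)]: number of occurrences of x in bwt[:i].
    OCC = {(0, x): 0 for x in C}
    for i, ch in enumerate(bwt):
        for x in C:
            OCC[(i + 1, x)] = OCC[(i, x)] + (ch == x)
    return bwt, C, OCC


def LF(i, bwt, C, OCC):
    # Row of the sorted rotations reached by moving bwt[i] to the front.
    return C[bwt[i]] + OCC[(i, bwt[i])]


def find_exact(pattern, C, OCC, sa):
    # Backward search: narrow the suffix-array interval [lo, hi)
    # one pattern character at a time, from the last to the first.
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + OCC[(lo, ch)]
        hi = C[ch] + OCC[(hi, ch)]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def find_one_mismatch(pattern, txt, C, OCC, sa):
    # With at most one mismatch, at least one half matches exactly,
    # so exact hits of either half give candidate start positions.
    half = len(pattern) // 2
    candidates = set(find_exact(pattern[:half], C, OCC, sa))
    candidates |= {p - half for p in find_exact(pattern[half:], C, OCC, sa)}
    n = len(txt) - 1  # text length without the sentinel
    hits = []
    for s in sorted(candidates):
        if s < 0 or s + len(pattern) > n:
            continue
        window = txt[s:s + len(pattern)]
        if sum(a != b for a, b in zip(pattern, window)) <= 1:
            hits.append(s)
    return hits


txt = "banana$"
sa = suffix_array(txt)
bwt, C, OCC = compute_BWT(txt)
print(bwt)                                         # annb$aa
print(find_exact("ana", C, OCC, sa))               # [1, 3]
print(find_one_mismatch("bxna", txt, C, OCC, sa))  # [0]
```

Positions are 0-based. The suffix array is kept alongside the index here only to turn the final interval into text positions; the search itself uses just C and OCC.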
Homework (sixth out of 5): Write code that uses simplified BWT matching, based on today’s assignments, to build the BWT of the E. coli genome (Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa) and find the positions at which the sequences from this file (ecoli_proms.fa) occur in it. Please send the code and the resulting positions to me by e-mail before the next lecture. Note: avoid any method that uses memory quadratic in the genome length (such as generating all suffixes of the genome in memory).
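One possible way to respect the memory note is to sort suffix start positions by integer ranks instead of by the suffix strings themselves. The sketch below uses prefix doubling, which is a swapped-in technique and not necessarily the method from the lecture; the function names `suffix_array_doubling` and `bwt_from_sa` are hypothetical. It keeps only a few integer lists of length n in memory, but in pure Python it may still be slow on a full genome.

```python
def suffix_array_doubling(txt):
    """Suffix array by prefix doubling, using O(n) integers of memory."""
    n = len(txt)
    sa = list(range(n))
    rank = [ord(c) for c in txt]  # rank by first character
    k = 1
    while True:
        def key(i):
            # Rank of the first k characters, then of the next k;
            # -1 marks "past the end", so shorter suffixes sort first.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        # Stop once all ranks are distinct (suffixes fully ordered).
        if n == 0 or rank[sa[-1]] == n - 1:
            return sa
        k *= 2


def bwt_from_sa(txt, sa):
    # Character preceding each sorted suffix; index -1 wraps to the end.
    return "".join(txt[i - 1] for i in sa)


sa = suffix_array_doubling("banana$")
print(sa)                          # [6, 5, 3, 1, 0, 4, 2]
print(bwt_from_sa("banana$", sa))  # annb$aa
```

For a genome-sized text, a full OCC dictionary with one entry per position and character would also be large; storing OCC only at regular checkpoints and counting the remainder on demand is a common compromise.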