WBO – regulatory genomics

WBO – final results

Sprawdziłem Zadanie 2. Wyniki są całkiem niezłe.

Jedynym powtarzającym się błędem jest to, że wielu spośród Państwa nie używało parametru alternative=”greater” w teście dwumianowym, co prowadziło np. do “wzbogacenia” motywu w obu grupach nawzajem. Oczywiście nie jest to dobry wynik i musiałem obciąć po kilka punktów osobom, które tak zrobiły.

W związku z tym, że kolokwium było dość niemiarodajne (można było zgadnąć trochę, albo głupio się pomylić w pośpiechu), to proponuję następującą skalę ocen minimalnych:

55 punktów lub więcej – ocena dobra+ i zaproszenie na egzamin ustny na 5
50 punktów lub więcej – ocena dobra i zaproszenie na egzamin ustny na 4+ lub 5
46 punktów lub więcej – ocena dostateczna+ i zaproszenie na egzamin ustny na 4, lub 4+
42 punkty lub więcej – ocena dostateczna i zaproszenie na egzamin ustny na 3+ lub 4
35 punktów lub więcej – zaproszenie na egzamin ustny na 3 lub 3+

Proponuję egzamin ustny w dniach 22-25 Czerwca (można zapisywać się przez doodle) ale jeśli komuś ten termin nie pasuje, to można się umawiać indywidualnie. Zakres materiału taki jak na kolokwium.

Osoby niedopuszczone do egzaminu (poniżej 35 punktów) mogą zdawać egzamin ustny we wrześniu (podobnie jak osoby, które nie będą zadowolone ze swojej oceny po egz. ustnym).

WBO 13 – Genome assembly and metagenomics

Today’s lecture is making a full circle to the topics from our first meeting. We’re back at DNA sequence identification from short reads, i.e. genome assembly. The slides are here: wyk13-assembly-meta

We do not have any practical assignments for today (we have done simple exercises on k-mers and DeBruijn graphs in the first lab, so you are welcome to look at them again here ), however we will spend some time reviewing the material.

I have prepared some slides with some keywords from this semester here: wyk14-review

Next week, during the lecture, we will have the test. It will be taken online

WBO 12 – Short read mapping

Today, we are concerned with mapping short reads to long genomes. the lecture will be, as usual, online and the slides are here: wyk12-ngs

Today’s assignments are mostly related to fast text indexes:

1. Write a function compute_BWT(txt), that returns the Burrows-Wheeler Transform of the given text (as a string) and the arrays C(x) and OCC(i,x) as dictionaries.

2. Apply the function to the following DNA sequence:
CGAGCCGCTTTCCATATCTATTAACGCATAAAAAACTCTGCTGGCATTCACAAATGCGCAGGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGT

3. Using the method described during the lecture, compute the LTF mapping, and write a method for identifying short strings. Find the occurrences of the string AAAC in the text above using your function.

4. Write another function (that does splitting the query into two and matching at least one half with no errors) to find occurrences of AAAACTCCGCTGGCATTCACAAAT in the text with at most one error

Homework (sixth out of 5 Write a code that uses simplified BWT matching, based on today’s assignments, to create a BWT of the E. coli genome (Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa) and find positions of occurrences of sequences from this file ( ecoli_proms.fa ) in it. Please send the code and the resulting positions to me by e-mail before the next lecture. Note – we should absolutely avoid any methods that use quadratic memory size (like generating all suffixes of the genome in memory).

WBO – Assignment 2 – Zadanie 2

In our second assignment, we will try to combine a few of the problems we have encountered: Searching for related sequences, identifying regulatory regions and motifs and enrichment analysis. All of this will be performed on a provided dataset and the expected result will be a program that solves the problem and a report with your findings.

The assignment will have these components:

1. Identify the closest related protein-coding sequences in the E. coli genome for each of the input protein sequences given in the fasta file (protein_fragments.fa). This should be done by downloading the E. coli protein-coding sequences (from this file – genes_e_coli_new.fa), creating a BLAST database from them and performing a local BLAST search against this database. Importantly, we have protein fragments that need to be matched against DNA sequences, so we have to do something to make this compatible. (Here’s a tutorial on running blast locally.) The result of this step should be a comma-separated list containing: input sequence id, best matching E. coli gene id and the associated e-value. (6 pts)

2. For each of the identified E. coli genes, you should find the associated promoter DNA sequence in the given file (~~proms_e_coli_new.fa,~~ proms_e_coli_fixed.fa). Please note that the input file contained sequences from groups A and B, we speculate (based on the empirical evidence), that these groups of genes should have different regulatory mechanisms. We need to identify 10 sequence motifs present in the promoters associated with group A and 10 motifs associated with group B, independently of each other. All motifs should be of length 15 . The result of this step should be two sets of 10 different motif position-specific matrices in a .pfm format. You can do this with a consensus-like method you implement yourself, or you can use the MEME-suite and process its output. (6 pts)

3. Given the two sets of motifs, we would like to select only the motifs that are specific to group A or group B. For this, we need to compute the occurrences of the motifs from .pfm in promoters from both groups. For this purpose, we should scan the promoter sequences of genes from groups A and B, and obtain a number of positions in these promoter sequences that have a log-odds score higher than 0. Now, given the combined lengths of the promoters in each group and the total number of “hits” in each group, we can use the binomial test to identify the motifs that are significantly enriched in one of the groups against in the other. The result of this step should be a list of all 20 motifs with associated, number of hits in the promoters from group A, and promoters from group B and the associated p-values for enrichment in group A and group B respectively. (4 pts)

All of these results and your implementation, should be described in a short report. (4 pts)

Altogether, you should submit your solution (as a commented .py file, results of tasks 1,2,3 and a report in PDF) in an e-mail to me with [WBO] in the subject line before June 9th.

WBO 11 – sequence motifs

Polish version below

Today in the lecture we were discussing sequence motids. Slides are here: wyk11-motifs-eng

Today’s practicals:

1. Download the promoter sequences from the E.coli genome: ecoli_proms.fa

2. Read about the Bio.motifs module and its implementation of position specific matrices in its documentation

3 . (Homework) Implement the simplified consensus method for motif finding (based on slide 20) and use it on the E.coli promoter sequences for motifs of length 3,4,5,6. The algorithm works as follows

based on any position in the first promoter sequence initialize a PWM
for every following step:
choose the position in one of the remaining sequences that is best matching the current PWM
modify your PWM, by adding the word at the chosen position to your PWM and remove this sequence from further considerations
Perform this procedure for all possible starting points in the first sequence and return a few (chosen by the user, by default 5) resulting PWMs that have the highest information content
Possible improvements can be considered: permuting the input sequence set and/or skipping some sequences that have a particularly bad match

4. Compare your results with the known E. coli promoter motifs

5. Use the MEME motif finding suite on your sequences and compare to your previous results and the real motifs

Homework: This week again you can earn additional 2 points for a homework. This time it is the assignment 3 (consensus method). As usual, please send it before the next week lecture as an attached python (.py) file by e-mail with “[WBO]” in the subject.

Polska wersja: Dziś rozmawialiśmy o reprezentacji i wyszukiwaniu motywów sekwencyjnych: Slajdy są tu: wyk11-motifs-eng

Zadania na dziś:

1. Pobierz listę promotorów e_coli w pliku fasta: ecoli_proms.fa

2. Zapoznaj się z modelem macierzy motywów z modułu Bio.motifs i jego dokumentacją

3. Zaimplementuj prostą, zachłanną metodę consensus dla wyszukiwania motywów, zastosuj ją do promotorów e.coli i obejrzyj wyniki dla motywów długości 3,4,5,6. Schemat algorytmu jest następujący:

zaczynając od dowolnego miejsca w pierwszej sekwencji skonstruuj na jego podstawie model PWM
dla każdej kolejnej sekwencji:
wybieraj sekwencję najlepiej pasującą do obecnego modelu PWM
Po wybraniu sekwencji skonstuuj nowy model PWM, wzbogacony o tę sekwencję
Po wykonaniu tej procedury dla wszystkich punktów startowych z sekwencji 1. zwróć kilka (np. 5) najlepszych (w sensie zawartości informacyjnej) macierzy

Powtórz tę procedurę dla kilku różnych permutacji pliku wejściowego.

4. Porównaj wyniki ze znanymi motywami promotorowymi w E. coli

5. Wykonaj analizę tych samych promotorów przy pomocy MEME i porównaj wyniki z otrzymanymi z consensusa

Praca domowa: W tym tygodniu można dostać 2 punkty za przesłanie do mnie swojej implementacji metody consensus. Jak zwykle mają Państwo tydzień czasu, proszę o wysłanie pliku .py mailem z kodem [WBO] w temacie.

WBO 10 – gene families and functions

Slides for today’s lecture are here: WBO-10-gene-families

English version below

Zadania na dziś:

1. Zapoznaj się z testem Fishera (np w pakiecie scipy.stats.fisher_exact) i metodą GSEA zaimplementowaną w Istytucie Broad’a

2. Zapoznaj się z formatami plików OBO i GAF z projektu gene ontology: opis formatów, przykładowy plik GAF, przykładowy plik OBO

3. Napisz prosty program, który testuje testem Fishera, wraz z poprawką Bonferroniego wzbogacenie funkcji GO w zadanym zbiorze genów.

4. Przetestuj to na połączonym zbiorze histonów i homologów PAH

5 (*). Zaimplementuj uproszczoną (wyliczanie statystyki ES dla kolejnych fragmentów rankingu, wykreślanie krzywej wzbogacenia i istotności, bez poprawki FDR) metodę GSEA na podstawie opisu z suplementu do pracy o GSEA 06580SuppText

English version of today’s assignments:

1. Take a look at the Fisher’s exact test as implemented in the scipy.stats.fisher_exact and with the GSEA method implemented at the Broad Institute

2. Look at the ontology ( OBO ) i and gene annotation ( GAF) file formats from the gene ontology project: format description, sample GAF file, sample OBO file

3. Write a simple program that implements gene set enrichment testing on a given set of gene identifiers for a given set of annotations (assume a GAF file is given, and test all the annotated terms in this file against the given set of gene identifiers, score using FIsher’s exact test with bonferroni correction).

4. Test it on the joined set of PAH and histone genes

5 (*). Implement a simplified GSEA approach (computing the ES score and the enrichment curve for a single gene set and gene ranking, without the FDR)based on supplementary methods to the GSEA paper 06580SuppText

WBO 9 – Tree reconciliation

English version below. Today’s lecture will be online, the link can be found in the chat channel.

Dzisiaj na wykładzie zajmowaliśmy się uzgadnianiem drzew.

Slajdy do wykładu:

Wg Pawła Góreckiego (głównie slajdy 1-20) : dlt
Wg Terry’ego Speed’a (głównie slajdy 14-20): Terry-Speed-Reconciliation

Zadania na lab:

Dzisiaj zajmiemy się uzgadnianiem drzew. Rozważmy sytuację, gdy mamy 1 drzewo gatunków S i drzewo genów G zgodne z tym drzewem gatunków (dla każdego genu g\in G , s(g)\in S). Na początek możemy:

0. Drzewa możemy reprezentować jako słowniki, gdzie każdemu identyfikatorowi wierzchołka przypisujemy listę identyfikatorów jego potomków, w takiej reprezentacji, w pythonie, drzewa ze slajdu 5 z wykładu P. Góreckiego można zapisać tak:

G={"abcde" : ["abcd","e"], "abcd" : ["ab", "cd"], "e" : [], "ab" : ["a","b"], "cd" : ["c","d"], "a" : [], "b" : [],"c" : [],"d" : [] }
S={"abcde": ["abcd","e"], "abcd": ["abd","c"], "abd":["a","bd"], "bd":["b","d"],"a" : [], "b" : [],"c" : [],"d" : [] ,"e":[]}

1. Napisać algorytm LCA(G,S), który znajduje mapowanie LCA dla drzew G,S i zwraca je w postaci słownika

2. Napisać funkcje liczące koszty DC(G,S), D(G,S), a następnie wykorzystać wzór ze slajdu 10, aby wyliczyć koszt strat L(G,S)

3. Napisać funkcję fat_tree(G,S), odtwarzającą scenariusz uzgadniający drzewaG i S (nie zawsze optymalny), w którym w korzeniu występują wszystkie duplikacje, zaś wszystkie straty występują możliwie nisko w drzewie gatunków. Scenariusz, to oczywiście etykietowanie drzewa gatunków zdarzeniami ewolucyjnymi, D i L. Zauważmy, że zarówno straty jak i duplikacje występują na krawędziach drzewa gatunków, co oznacza, że możemy zapisać wynik naszej funkcji, jako słownik, który każdej etykiecie węzła w drzewie S, przypisuje listę zdarzeń ewolucyjnych, które w wynikowym scenariuszu następują na krawędzi prowadzącej do tego węzła od jego rodzica.

4. Napisać funkcję optimal_tree(G,S), która znajduje jeden z optymalnych scenariuszy uzgadniających G i S. W tym celu musimy przenieść w drzewie “grubym”, które zrobiliśmy w zadaniu 3, duplikacje możliwie daleko w dół drzewa, a straty w górę, jeśli to możliwe.

English version:

Today’s lecture was about gene and species tree reconciliation.

We have two sets of slides:

More theoretical approach by Paweł Górecki (slides 1-20) : dlt
More applied approach by Terry Speed (slides 14-20): Terry-Speed-Reconciliation

For today’s practicals:

We will implement some algorithms for tree reconiliation. We will consider the scenario, wherewe have 1 species tree S and one gene tree G, such that \forall g \in G s(g) \in S).

0. We will represent trees in python as dictionaries, with tree node identifies as keys and lists of descendant node identifiers as values. In this representation, the sample trees from P. Górecki’s slide 5 can be written as below:

G={"abcde" : ["abcd","e"], "abcd" : ["ab", "cd"], "e" : [], "ab" : ["a","b"], "cd" : ["c","d"], "a" : [], "b" : [],"c" : [],"d" : [] }
S={"abcde": ["abcd","e"], "abcd": ["abd","c"], "abd":["a","bd"], "bd":["b","d"],"a" : [], "b" : [],"c" : [],"d" : [] ,"e":[]}

You can use these trees to test your implementations of the following assignments.

1. Implement the LCA(G,S) function, that will return the LCA mapping for treees G and S as a dictionary, with keys being identifiers from the G tree

2. Implement the deep coalescence [DC(G,S)] and duplication [D(G,S)] cost functions for trees G and S, then use the formula from slide 10, to write a function calculating the loss cost L(G,S)

3. Write the function fat_tree(G,S), that returns the (not always optimal) reconciliation scenario for trees G and S, that has all the needed duplications at the root node and all the losses, as low as possible. A scenario can be represented as a dictionary, assigning lists of events (of type Duplication or Loss) to edges of the species tree. In practice, it’s easiest to use identifiers of nodes from the S tree as keys and lists of events on the edge leading to this node from its parent as values.

4. Finally, using all the previous functions, write the optimal_tree(G,S) function, returning one of the optimal reconciliation scenarios for the trees G and S. This is done by taking the “fat” tree, from task 3, and moving all duplication events as low as possible, and the lossess as high as possible.

WBO 8 – profile HMMs and protein domains

English version below. Today’s lecture is online again, the link was sent by USOS-mail.

Dzisiaj mówiliśmy o zastosowaniach HMM do reprezentacji profilów uliniowień oraz wyszukiwania genów.

Slajdy są tu: WBO-8-profile_hmm

W ramach ćwiczeń, warto jest zapoznać się z bazą PFAM i możliwością wyszukiwania profilów domen przy pomocy programu HMMER. Mieliśmy też wykorzystać moduł biopythona HmmerIO, ale niestety nie działa on po niedawnych zmianach formatu HMMER.

Możemy jednak wykonać ćwiczenia nie korzystając z python’a ale wykorzystując narzędzia pfam+hmmer do wyszukania domen i znalezienia ich modeli. Proponuję zacząć od białka tramtrack, które posiada 2 istotnie różne wersje TTK69 i TTK88. Proszę znaleźć te dwie sekwencje aminokwasowe w bazie flybase, ściągnąć je na swój komputer i wrzucić do wyszukiwarki domen pfam/hmmer. Następnie porównać struktury domen tych białek i zobaczyć jak to się ma do struktury egzonów genu tramtrack. Warto zobaczyć jak wyglądają opisy domen w postaci profilów hmm.

English version:

Today we discussed HMM application to protein domain modeling and gene finding.

The slides are available here: WBO-8-profile_hmm

As the practical assignments for today, we will look at the PFAM database and the HMMER program functionality. In theory, we could also use the biopython’s HmmerIO, however it is buggy after recent changes in HMMER, so we will not use python this week.

We will just use the web interfaces to the pfam db and hmmer to search for protein domains. I would suggest starting from the Tramtrack protein, that has 2 major protein isoforms: TTK69 and TTK88. Your assignment is to find these protein sequences in the flybase database , download them to your computer and submit it for domain search in PFAM hmmer search. Then you can compare the pfam domain search results with the dna gene exon structure and identify which exons correspond to the differences in the identified domains.

WBO 7 – Hidden Markov Models

Today, we will be talking about Hidden Markov Models. The slides are here: wyk7-hmm. Polish version of the notes can be found below. The lecture will be online again, check the chat.mimuw.edu.pl for the link.

Today’s practicals are here:

0. Take a look at the hmmlearn package. We will be using it to train hidden Markov Models

1.CpG islands are found frequently in diverse genomes, and human genome in particular. Try to define a 2-state HMM (CpG and background) and train it on the cpg.fa input file. You can try and do it with four possible emissions (ACGT), and for 16 di-nucleotide emissions (AA,AC,AG,AT, etc.) Can you interpret the emission matrix and assign the learned states to the CpG and background sequences? Perform the learning process (Baum-Welch) several times and see if the results are similar. Can you interpret the transition matrix probabilities? What does it tell us about the average length of CpG islands?

It might be worth to take a look at the MultinomialHMM usage example

2. We can also use HMMs for quantitative signal segmentation. Let us take a sample file – histone methylation data from A. thaliana plant (H3K9me2-crh6-3-chr1.sgr) giving us methylation quantification for different positions in the genome (this is a text file with a position and a value in each line).Use GaussianHMM class to identify 2-state and 3-state models from this data. What are the mean emission values for these data? You can find an example of GaussianHMM here.

Homework (2 pts):

Use the trained CpG model on the following 30 sequences from a file cpg_test.fa. As a result, send me the code (in .py) and the results (in a text file). This time you have 2 weeks for this task.

Polish version below:

Dziś na wykładzie omówimy ukryte modele Markowa i algorytmy rekonstrukcji parametrów z danych. Slajdy są tu: wyk7-hmm

Zadania na dziś:

0. Zapoznaj się z modułem hmmlearn. Będziemy go wykorzystywać do uczenia modeli z emisjami

1. Wyspy CpG znajdują się często w genomach, w szczególności genomie ludzkim. Spróbuj zdefiniować ukryty model Markowa, który ma 2 stany i spróbuj nauczyć go na sekwncji zawartej w pliku cpg.fa. Zrób to zarówno dla sekwencji liter (ACGT), jak i dla sekwencji dwunukleotydów (AA,AC,AG,AT, itp…) Czy możesz zinterpretować macierze emisji i przypisać jeden ze stanów do wysp CpG? Wykonaj kilkakrotnie proces uczenia (Baum-Welch) i zobacz czy wyniki są podobne. Jak interpretujesz prawdopodobieństwa w macierzy przejść. Czy coś możesz powiedzieć o średniej długości wysp CpG?

Warto przyjrzeć się przykładowi użycia klasy MultinomialHMM

2. czasami potrzebujemy użyć łańcuchów Markowa do segmentacji sygnału. Weżmy przykładowy plik – dane o methylacji histonów w rzodkiewniku (H3K9me2-crh6-3-chr1.sgr) zawierający poziom metylacji w różnych pozycjach (plik jest tekstowy, każda linia zawiera informacje o pozycji i wartośći). Wykorzystaj HMM z emisjami Gaussowskimi do segmentacji genomu na 2 lub 3 stany. Jakie są wysetymowane wartości średnie dla różnych stanów? Przykład wykorzystania HMMów z emisjami Gaussowskimi można znaleźć tu

praca domowa (2 pkt):

Wykorzystaj model nauczony na danych o CpG i przetestuj które z 30 sekwencji w pliku cpg_test.fa są naprawdę wyspami CpG. Jako wynik proszę przysłać program w pythonie i wynik w pliku tekstowym.

WBO 6 – Searching for distant relatives- BLAST

The lecture is again online. I will send the link before. Labs are again after the lecture. English version below:

Dziś rozmawialiśmy o algorytmach przybliżonego wyszukiwania sekwencji w bazach danych. Slajdy są tu: wyk6-blast.

Na ćwiczeniach spróbujemy zająć się wykorzystaniem programu blast: 0. Wczytaj plik w formacie FastQ microbial_reads.fastq (jest już na naszym serwerze jupyter w katalogu WBO, nie trzeba go tam ładować), przy pomocy SeqIO.parse(). Są to odczyty z mikrobiomu jelitowego myszy.

Continue reading “WBO 6 – Searching for distant relatives- BLAST”