WBO 6 – Searching for distant relatives: BLAST

The lecture is online again. I will send the link beforehand. As before, the labs take place after the lecture. An English version is below.

Dziś rozmawialiśmy o algorytmach przybliżonego wyszukiwania sekwencji w bazach danych. Slajdy są tu: wyk6-blast.

Na ćwiczeniach spróbujemy zająć się wykorzystaniem programu BLAST:

0. Wczytaj plik w formacie FastQ microbial_reads.fastq (jest już na naszym serwerze jupyter w katalogu WBO, nie trzeba go tam ładować), przy pomocy SeqIO.parse(). Są to odczyty z mikrobiomu jelitowego myszy.

1. Wybierz kilka losowych, dość długich sekwencji DNA i uruchom dla nich program BLAST online, obejrzyj wyniki (jeśli nic się nie “trafiło”, możesz wybrać inne, dłuższe sekwencje, np. powyżej 200 par zasad)

2. Wykonaj wyszukiwanie dla tych samych sekwencji przy pomocy interfejsu API biopython’a do blasta online (NCBIWWW) i parsera xml (NCBIXML). Pamiętaj o podaniu swojego adresu e-mail:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"

3. Znajdź, która z twoich wybranych sekwencji ma najistotniejsze trafienia do bazy NCBI (w sensie najmniejszego e-value).

4. Przeanalizuj “trafioną” sekwencję. Czy to możliwe, aby ta sekwencja była dokładnie z tego organizmu, który był badany? Obejrzyj wyniki algorytmu Smith’a-Waterman’a (localxx z pakietu Bio.pairwise2) porównania obu sekwencji. Jak dużo błędów dopasowania widzisz?

English version:

Today we discussed algorithms for approximate sequence search in databases. The lecture slides are here: wyk6-blast.

During the practicals, we will practice using the BLAST program:

0. Read the FastQ file containing short microbial sequences (you can download it here: microbial_reads.fastq, but it is also available on our jupyter server in the WBO folder, so you don’t need to upload it there). You can use the SeqIO.parse() function. The data comes from a study of the mouse gut microbiome.

1. Choose a few (e.g. 5) random sequences longer than 100 bp from this file and submit them manually to BLAST online. If you don’t get any hits, try longer reads (e.g. above 200 bp).

2. For the sequences that produced hits, repeat the search using the biopython API for BLAST online (NCBIWWW) and the associated XML parser (NCBIXML). Remember to provide your e-mail address:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"

3. Choose the sequence that gave you the most significant hit (the smallest e-value).

4. Analyze the alignment for your best hit. Do you think your sequence actually came from the organism found in the database? Use the Smith-Waterman algorithm (localxx from Bio.pairwise2) to compare your sequence with the hit. How many mismatches are there?
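
Steps 0 and 1 could be sketched as follows. This is only a minimal sketch, assuming Biopython is installed and that microbial_reads.fastq is in the working directory; the helper name pick_long_reads, the fixed seed and the default thresholds are illustrative choices, not part of the course material.

```python
import random

from Bio import SeqIO


def pick_long_reads(fastq_source, min_length=100, how_many=5, seed=0):
    """Return up to `how_many` random reads longer than `min_length` bp.

    `fastq_source` may be a file name or an open text handle.
    (Hypothetical helper, written for illustration only.)
    """
    # SeqIO.parse() yields one SeqRecord per FASTQ entry.
    long_reads = [
        record
        for record in SeqIO.parse(fastq_source, "fastq")
        if len(record.seq) > min_length
    ]
    # A fixed seed makes the random choice reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(long_reads, min(how_many, len(long_reads)))


# Example usage (assumes the course file is present):
# for read in pick_long_reads("microbial_reads.fastq"):
#     print(read.id, len(read.seq))
#     print(str(read.seq))  # this string can be pasted into the BLAST web form
```

Raising min_length (for example to 200) is an easy way to retry with longer reads when the first selection gives no hits.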

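Steps 2 to 4 could then be chained roughly as shown below. Again, this is a sketch under stated assumptions: the helper names (blast_online, best_hsp, local_alignment_counts) are hypothetical, blastn against the nt database is only one possible choice, and blast_online sends a real request to NCBI that may take a while to return.

```python
from Bio import Entrez, pairwise2
from Bio.Blast import NCBIWWW, NCBIXML

Entrez.email = "A.N.Other@example.com"  # replace with your own address


def blast_online(sequence, program="blastn", database="nt"):
    """Submit one sequence to NCBI BLAST and return the parsed record.

    Performs a network request; program/database defaults are assumptions.
    """
    handle = NCBIWWW.qblast(program, database, sequence)
    try:
        return NCBIXML.read(handle)  # one query gives one BLAST record
    finally:
        handle.close()


def best_hsp(blast_record):
    """Return (alignment, hsp) with the smallest e-value, or None if no hits."""
    pairs = [
        (alignment, hsp)
        for alignment in blast_record.alignments
        for hsp in alignment.hsps
    ]
    if not pairs:
        return None
    return min(pairs, key=lambda pair: pair[1].expect)


def local_alignment_counts(query, subject):
    """Align two strings with localxx and count columns in the aligned region."""
    alignments = pairwise2.align.localxx(query, subject, one_alignment_only=True)
    if not alignments:
        return None
    best = alignments[0]
    # start/end delimit the locally aligned part of the gapped sequences.
    columns = list(zip(best.seqA, best.seqB))[best.start:best.end]
    counts = {"score": best.score, "matches": 0, "mismatches": 0, "gaps": 0}
    for a, b in columns:
        if a == "-" or b == "-":
            counts["gaps"] += 1
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


# Example usage (my_sequence is a placeholder for one of your reads):
# record = blast_online(my_sequence)
# hit = best_hsp(record)
# if hit is not None:
#     alignment, hsp = hit
#     print(alignment.title, hsp.expect)
#     # hsp.sbjct is the matched database fragment; BLAST gap characters are removed
#     print(local_alignment_counts(my_sequence, hsp.sbjct.replace("-", "")))
```

Note that localxx scores each match as 1 and applies no penalty to mismatches or gaps, so differences between the two sequences may show up as gaps rather than as mismatched columns. If that makes the result hard to interpret, localms with explicit mismatch and gap penalties is one possible alternative.
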