Add new sequences with gui.py

Ariel Vina-Rodriguez edited this page Sep 27, 2019 · 6 revisions

gui.py is a Python script that will help you select and add sequences with a few clicks and an internet connection.

Prerequisite: Python 3 with Biopython installed on your machine. Both are free to download, and neither needs admin permissions to install.
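A quick way to confirm the prerequisites before launching gui.py is sketched below. The function name `check_prerequisites` is hypothetical and not part of gui.py; the `pip install --user` hint is one common way to install a package without admin rights, not necessarily the only one that works on your system.

```python
import sys


def check_prerequisites():
    """Return a list of problems found; an empty list means everything is in place."""
    problems = []
    # gui.py is described as a Python 3 script.
    if sys.version_info[0] < 3:
        problems.append("Python 3 is required, found %d.%d" % sys.version_info[:2])
    # Biopython is imported under the package name "Bio".
    try:
        import Bio  # noqa: F401
    except ImportError:
        problems.append("Biopython is missing (try: pip install --user biopython)")
    return problems


for problem in check_prerequisites():
    print(problem)
```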

Let's assume you want to subtype some HEV sequences and all you have is a list of GenBank IDs. You will need to compare these IDs with the IDs already in the alignment (which are also in the workbook). You will probably also want to collect from GenBank all the sequences that are related to these new sequences. At the end you want to have:

  • a list of all the new and unique IDs, including those of the sequences that are similar to the originals.
  • a fasta file with the sequences to be added to the alignment (*.fasta).
  • a csv file with the related data to be included in the workbook (*.csv).
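The comparison of new IDs against those already in the alignment boils down to a de-duplicating set difference. The sketch below is not taken from gui.py: the function names are hypothetical, the IDs are placeholders, and dropping the version suffix (".1") before comparing is an assumption about how you want IDs matched.

```python
def normalise_id(raw_id):
    """Trim whitespace, upper-case, and drop a version suffix such as '.1' (assumption)."""
    return raw_id.strip().upper().split(".")[0]


def new_unique_ids(candidate_ids, known_ids):
    """Return the candidate IDs that are not in known_ids, without duplicates, in first-seen order."""
    seen = {normalise_id(k) for k in known_ids}
    fresh = []
    for raw in candidate_ids:
        acc = normalise_id(raw)
        if acc and acc not in seen:  # skip blank lines and anything already seen
            seen.add(acc)
            fresh.append(acc)
    return fresh


# Placeholder IDs, not real accessions.
in_alignment = ["XX000002.2"]
wanted = ["XX000001.1", "xx000001", " XX000002 ", "", "XX000003"]
print(new_unique_ids(wanted, in_alignment))  # ['XX000001', 'XX000003']
```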

Launch gui.py.

[Screenshot: the HEV gui.py window]

You will be presented with a window with three columns for IDs, where you can cut, copy, paste, delete or edit IDs as you wish. At the top of each list you will find a Load button that asks for a file of IDs and ADDS those IDs to the list. At the bottom of each column you have three buttons: clear, save, and get.

  • clear will ... clear the list.
  • save will ask for a text file where you want to have a copy of the list.
  • get connects to the NCBI site, downloads from GenBank the sequences corresponding to the list of IDs, asks for a file in which to save them in GenBank flat format (*.gb), and calls the parser to generate a file in fasta format (*.fasta) and a file with sequence information in CSV format (*.csv). Internally, the sequences are downloaded in chunks of 100 to avoid problems with the connection to the NCBI, but only one big *.gb file is created and parsed. If you first parsed an alignment and/or a BLAST file (online or offline), it will try to locate the approximate coordinates of each sequence in the alignment and will add '-' characters before and after each downloaded sequence. This will (hopefully) help you to align the sequences manually more quickly.
  • BLAST takes the set of IDs in the "add" list (the center list) and runs an online NCBI BLAST. It asks for a file in which to save the BLAST results and then calls Load BLAST to parse them.
  • Load BLAST loads a BLAST result in XML format. It uses NCBIXML.parse to parse the result and identify each sequence in it: each sequence is looked up in the self.ref_seq dictionary to find out whether it is a reference sequence (one of those added with parse alignment), and is added to self.new_seq if it is not.
  • Seq from GB file loads and parses a GenBank flat file and generates a file in fasta format (*.fasta) and a file with sequence information in CSV format (*.csv).
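The chunked download and the gap padding described for the get button can be sketched roughly as follows. This is not the actual gui.py code: `chunks`, `entrez_fetch`, `download_genbank` and `pad_to_alignment` are hypothetical names, the e-mail address is a placeholder, and the padding assumes the approximate start coordinate and the alignment length are already known. Only `Bio.Entrez.efetch` is a real Biopython call, and it is imported lazily so the rest can be tried without a connection.

```python
def chunks(ids, size=100):
    """Yield consecutive slices of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def entrez_fetch(batch):
    """Fetch one batch of records from GenBank as flat-file text (needs Biopython and a connection)."""
    from Bio import Entrez  # lazy import: only needed for real downloads
    Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
    handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                           rettype="gb", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def download_genbank(ids, out_path, fetch=entrez_fetch, size=100):
    """Download `ids` batch by batch into a single *.gb file; return how many IDs were requested."""
    requested = 0
    with open(out_path, "w") as out:
        for batch in chunks(ids, size):
            out.write(fetch(batch))  # every batch is appended to the same file
            requested += len(batch)
    return requested


def pad_to_alignment(seq, start, alignment_length):
    """Surround `seq` with '-' so that it begins at column `start` (0-based) of the alignment."""
    trailing = max(0, alignment_length - start - len(seq))
    return "-" * start + seq + "-" * trailing


print(pad_to_alignment("ACGT", 3, 10))  # ---ACGT---
```

Passing the fetch function as a parameter is a design choice of this sketch: it lets you try the batching logic offline with a stand-in function before pointing it at the NCBI.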