FAQ
Questions
- Sequencing
- What is whole-genome shotgun sequencing?
- What is an assembly?
- What does the name "Contig 1.XXX" mean?
- What is a sequence contig?
- Are the contigs ordered?
- What is a sequence supercontig?
- Are the supercontigs ordered?
- How big is the Fusarium genome?
- What strain was sequenced?
- What is the current state of the assembly?
- How complete is the current assembly?
- Are the contigs ordered? For example, is contig 1.5 flanked by contigs 1.4 and 1.6?
- How has the sequence been generated for the Fusarium project?
- Will the genome be finished?
- How will we know the assembly is correct?
- What data are available?
- Are the clones being sequenced available to Fusarium investigators?
- What about fosmid end sequences?
- Downloading
- What format is the download file in?
- Why does gunzip tell me the file is not in gzip format?
- The download fails. What should I do?
- BLASTing
- Why is my BLAST job taking so long?
- Why are my BLAST results split into multiple email messages?
- What sequences can I BLAST against?
- Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
- After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
- What is low-complexity sequence?
- Misc
Answers
- Sequencing
-
What is whole-genome shotgun sequencing?
Whole-genome shotgun sequencing is a technique for determining the DNA sequence of a genome by randomly shearing the DNA, sequencing many overlapping fragments, and inferring the original sequence from the overlaps. This method has been used successfully for bacterial genomes and for subclones such as Fosmids. See Assembly for details.
-
What is an assembly?
An assembly is a representation of the computationally derived relative positions of a set of sequenced fragments. When these individual sequences overlap, a consensus sequence is derived representing the most likely base at each position in the assembly. In this way, increased sequence redundancy improves the quality of the assembly and the confidence in the consensus. See Assembly for details.
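As a rough illustration of how a consensus is derived from overlapping reads, the sketch below takes reads that are already aligned (padded with "-" where a read does not cover a position) and picks the most common base in each column. The function name and the padding convention are assumptions made for this example; real assemblers also weigh base quality scores and are considerably more involved.

```python
from collections import Counter

def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    Positions a read does not cover are marked with `gap` and ignored.
    This is an illustrative sketch, not the assembler used for the project.
    """
    result = []
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != gap]
        if bases:
            # Most frequent base in this column.
            result.append(Counter(bases).most_common(1)[0][0])
        else:
            # No read covers this position.
            result.append(gap)
    return "".join(result)

reads = [
    "ACGT--",
    "-CGTA-",
    "--GAAC",
]
print(consensus(reads))
```

In the fourth column two reads say T and one says A, so the vote settles on T. This is the sense in which greater redundancy raises confidence in the consensus.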
-
What does the name "Contig 1.XXX" mean?
A contig is a sequence fragment created by assembling whole-genome shotgun reads. See Assembly for details.
Every assembly contains multiple contigs. Each assembly is numbered sequentially. The number preceding the decimal point indicates the assembly number. Contigs within an assembly are also numbered sequentially. Thus "Contig 1.177" indicates contig #177 within assembly 1.
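If you process contig names in a script, the naming scheme above can be split into its two parts along these lines. This is a minimal sketch; the function name and the accepted spellings ("Contig 1.177" or just "1.177") are assumptions for illustration.

```python
import re

# Matches names such as "Contig 1.177" or "1.177" (format as described above).
_CONTIG_NAME = re.compile(r"^(?:contig\s*)?(\d+)\.(\d+)$", re.IGNORECASE)

def parse_contig_name(name):
    """Return (assembly_number, contig_number) for a name like 'Contig 1.177'."""
    match = _CONTIG_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a contig name: {name!r}")
    return int(match.group(1)), int(match.group(2))

assembly, contig = parse_contig_name("Contig 1.177")
print(assembly, contig)
```

The two numbers are parsed as separate integers rather than as one decimal number, because "1.5" and "1.50" would be different contigs.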
-
What is a sequence contig?
A sequence contig is the extended contiguous sequence that is produced by the assembly process that joins overlapping sequences. See Assembly for details.
-
Are the contigs ordered?
Contigs within the same supercontig are ordered. See Assembly for details.
-
What is a sequence supercontig?
A supercontig consists of one or more sequence contigs known to occur in a specific order and orientation. Because we sequence each end of the subclones of plasmids, Fosmids, and BACs, we can recognize that when one end of a clone lies in one sequence contig and the other end of the clone lies in a different sequence contig, these two contigs probably lie close to each other. To create supercontigs we require that two or more such linking clones join two sequence contigs. See Assembly for details.
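The "two or more linking clones" rule can be sketched as a simple count. The example below assumes each clone is described by the pair of contigs in which its two end reads were placed; that input layout and the function name are hypothetical. It ignores orientation and insert size, which a real assembler would also check.

```python
from collections import Counter

def supported_links(clone_ends, min_links=2):
    """Return contig pairs joined by at least `min_links` clones.

    clone_ends: iterable of (contig_of_one_end, contig_of_other_end),
    one tuple per clone. Clones with both ends in the same contig do not
    link anything and are skipped.
    """
    counts = Counter()
    for end_a, end_b in clone_ends:
        if end_a != end_b:
            # Sort so that (a, b) and (b, a) count as the same link.
            counts[tuple(sorted((end_a, end_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}

clones = [
    ("1.4", "1.5"),
    ("1.5", "1.4"),
    ("1.5", "1.6"),  # only one clone supports this join
    ("1.7", "1.7"),  # both ends in the same contig
]
print(supported_links(clones))
```

Here only the join between contigs 1.4 and 1.5 is supported by two clones; the single clone spanning 1.5 and 1.6 is not enough on its own.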
-
Are the supercontigs ordered?
No, the supercontigs are not ordered by number.
-
How big is the Fusarium genome?
Our current total unique contig length is 36 Mb.
-
What strain was sequenced?
The strain chosen for sequencing by the International Gibberella zeae Genomics Consortium (IGGR) is designated PH-1 (NRRL 31084) and is a member of lineage 7 of Fusarium graminearum (Gibberella zeae). Lineage 7 is the predominant population of the wheat and barley scab fungus found in North America and Europe and is distributed worldwide (O'Donnell et al., 2000).
-
What is the current state of the assembly?
The current assembly contains 511 sequence contigs >2 kb.
-
How complete is the current assembly?
Since the estimated genome size of Fusarium graminearum is ~40 Mb, the current release represents about 90% of the Fusarium genome and is covered to a depth of ~10X. It excludes very highly conserved repetitive sequence and ribosomal RNA genes.
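The completeness figure follows from the two numbers quoted in this FAQ: 36 Mb of unique contig length against an estimated genome size of ~40 Mb. A small worked example follows; the function name is made up for illustration.

```python
def fraction_assembled(assembled_mb, estimated_genome_mb):
    """Fraction of the estimated genome represented by the assembly."""
    return assembled_mb / estimated_genome_mb

# 36 Mb of contigs and a ~40 Mb genome estimate, both taken from this FAQ.
print(f"{fraction_assembled(36, 40):.0%}")
```

Because the genome size is itself an estimate, the resulting percentage is approximate.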
-
Are the contigs ordered? For example, is contig 1.5 flanked by contigs 1.4 and 1.6?
The contigs are numbered sequentially within larger supercontig fragments. Contigs within the same supercontig are positionally ordered. See Fusarium Contig Numbering for details.
-
How has the sequence been generated for the Fusarium project?
Our data consist of over 0.6 million individual sequencing reads obtained by sequencing each end of plasmids, Fosmids and BACs from libraries containing randomly sheared fragments of 4 kb, 40 kb and 110 kb average insert size respectively. See Assembly for details.
-
Will the genome be finished?
Unfortunately there are no plans to finish the genome.
-
How will we know the assembly is correct?
The quality of the assembly will be assessed in several ways. In addition to requiring that the paired plasmid and Fosmid ends occur in a logical manner, our assembly of the Fusarium genome will be verified through: 1) integration of BAC end sequences, 2) comparison with available genomic sequences, and 3) correlation with the genetic map, which is in progress.
-
What data are available?
In this version of our data release, all sequence contigs over 2 kb are available. Smaller contigs are sparsely covered and often include poor quality or contaminated DNA. Sequence contig data can be accessed in two ways: through a BLASTN or TBLASTN search with an option for contig subsequence retrieval, or through FTP download of the entire genome. Contig sequences are subject to change throughout this project, so the data release version number is added to each contig number as a prefix (e.g. 1.235 denotes assembly version 1, contig #235).
-
Are the clones being sequenced available to Fusarium investigators?
The BAC clones will be available from the FGSC. You can find clones that overlap a region of interest by using the Region search link.
We do not have the resources necessary to make available the 4 kb plasmid and 40 kb Fosmid clones.
-
What about fosmid end sequences?
These sequences were crucial for ordering and orienting the genome as well as providing templates for gaps that are not captured by plasmids. They can be accessed using the file fusarium_graminearum_1_endreads.csv.gz.
- Downloading
-
What format is the download file in?
The genome data is plain text in multi-FASTA format. The text file has been compressed using gzip. To uncompress the file:
gunzip fusarium_1.fasta.gz
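If you would rather read the file from a script without uncompressing it first, Python's standard gzip module can open it directly. The reader below is a minimal sketch; the function name is an assumption, and it treats any path ending in .gz as compressed.

```python
import gzip

def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed multi-FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

# Example use (assumes the download is in the current directory):
# for header, sequence in read_fasta("fusarium_1.fasta.gz"):
#     print(header, len(sequence))
```

Yielding one record at a time keeps memory use low, which helps when only a few contigs are of interest.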
-
Why does gunzip tell me the file is not in gzip format?
Some browsers (like newer versions of Netscape) automatically unzip files after download. If this is the case, the file should be 36 MB (rather than the 11 MB of the compressed file). You can just rename the file to remove the .gz suffix.
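Besides comparing file sizes, you can check whether a file is still compressed by looking at its first two bytes: gzip files begin with the bytes 0x1f 0x8b. A minimal sketch of that check follows; the function name is an assumption.

```python
def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"

# Example use (assumes the download is in the current directory):
# if not is_gzip("fusarium_1.fasta.gz"):
#     print("Already uncompressed; rename the file to drop the .gz suffix.")
```

A plain FASTA file starts with ">" instead, so the check distinguishes the two cases reliably.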
-
The download fails. What should I do?
Downloading through the browser uses HTTP. You can also try accessing the FTP site directly via the URL:
- BLASTing
-
Why is my BLAST job taking so long?
BLAST jobs are queued and handled with other internal Broad processes in a general Load Sharing Facility. The delay for receiving your BLAST results depends on the current load.
-
Why are my BLAST results split into multiple email messages?
Some email programs are configured with a maximum message size and will automatically split large files into smaller pieces. If this is undesirable, you will need to reconfigure your email program.
-
What sequences can I BLAST against?
You can BLAST your query sequence against our entire assembly or against a special set of sequences that were excluded from the assembly.
-
Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
From the NCBI Blast FAQ:
This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the sections of the BLAST FAQs on Q: What is low-complexity sequence? and also Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
-
After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
From the NCBI Blast FAQ:
You are seeing the result of automatic filtering of your query for low-complexity sequence that is performed to prevent artifactual hits. The filter substitutes any low-complexity sequence that it finds with the letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the blast reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.
-
What is low-complexity sequence?
From the NCBI Blast FAQ:
Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits (please also see Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?)
In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
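To give a feel for what such a filter does, the sketch below masks windows of a nucleotide sequence whose Shannon entropy falls below a threshold. This is not the DUST or SEG algorithm; the window size, threshold, and function names are arbitrary choices made for illustration only.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the characters in `window`."""
    total = len(window)
    counts = Counter(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window with entropy below `threshold` by `mask_char`.

    Illustrative only: real BLAST filtering uses DUST or SEG, which work
    differently. Sequences shorter than `window` are returned unchanged.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

print(mask_low_complexity("AAATAAAAAAAATAAAAAAT"))  # example sequence from the answer above
print(mask_low_complexity("ACGTACGTACGTACGT"))
```

The A-rich example from the answer above is masked along its whole length, while a sequence that uses all four bases evenly is left untouched.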
- Misc
-
What's the Broad Institute?
The Eli and Edythe L. Broad Institute is a partnership among MIT, Harvard and affiliated hospitals and the Whitehead Institute for Biomedical Research. Its mission is to create the tools for genomic medicine and make them freely available to the world and to pioneer their application to the study and treatment of disease.
-
What's FGI?
Fungal Genome Initiative, http://www.broad.mit.edu/annotation/fungi/fgi/
-
How do I cite the sequence for publication?
Publications should include the following citation:
Fusarium graminearum Sequencing Project. Broad Institute of MIT and Harvard (http://www.broad.mit.edu)
-
Who do I contact with questions about the sequencing?
For additional help or to send feedback about the website, please email annotation-webmaster@broad.mit.edu.
-
Where are the beautiful photos from?
The photos on the Fusarium graminearum home page come courtesy of (top to bottom):
- Frances Trail, Department of Plant Biology at Michigan State University
- H. Corby Kistler, USDA, ARS Cereal Disease Lab and Department of Plant Pathology, University of Minnesota
- Frances Trail, Department of Plant Biology at Michigan State University
- Jin-Rong Xu, Department of Botany and Plant Pathology, Purdue University
- H. Corby Kistler, USDA, ARS Cereal Disease Lab and Department of Plant Pathology, University of Minnesota
