FAQ
Questions
- Sequencing
- What is whole-genome shotgun sequencing?
- What is an assembly?
- What does the name "Contig 7.XXX" mean?
- What is a sequence contig?
- Are the contigs ordered?
- What is a sequence supercontig?
- Are the supercontigs ordered?
- Are supercontig (contig) numbers preserved between different assemblies?
- How big is the Neurospora genome?
- What strain was sequenced?
- What is the current state of the assembly?
- How complete is the current assembly?
- Are the contigs ordered? For example, is contig 3.5 flanked by contigs 3.4 and 3.6?
- What contig in release 7 corresponds to my contig 3.XXX from release 3?
- How has the sequence been generated for the Neurospora project?
- Will the genome be finished?
- How will we know the assembly is correct?
- What data are available?
- Clones
- Are the clones being sequenced available to Neurospora investigators?
- What about cosmid end sequences?
- There are no clones covering my region of interest. How can this be?
- How do I order clones that contain my gene/region of interest?
- Chromosomes, Genes, and Regions of Interest
- Which chromosome does supercontig XXX reside on?
- Has my favorite gene/region XXX been sequenced?
- What additional information do you have on my favorite gene/region XXX?
- Is contig XXX near any known marker genes?
- How can I see the features neighboring my gene of interest?
- Annotated Genes
- Is gene XXX annotated in the sequence?
- Gene XXX is annotated incorrectly in your sequence - can I submit an update to your gene name?
- How were the genes annotated?
- There seem to be two different ways of seeing my gene in the FeatureMap - what's going on here?
- Why do I see different PFAM domains in the "Feature Detail" summary and the graphical view?
- What changed between release 3 and release 7?
- Genetic Markers
- Why don't I see my marker when I do a Marker Feature Search?
- Why isn't my marker shown on the Linkage Group Map?
- Downloading
- BLASTing
- Why is my BLAST job taking so long?
- Why are my BLAST results split into multiple email messages?
- What sequences can I BLAST against?
- Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
- After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
- What is low-complexity sequence?
- Genome Browser
- Misc.
Answers
- Sequencing
-
What is whole-genome shotgun sequencing?
Whole genome shotgun sequencing is a technique for determining the DNA sequence of a genome by randomly shearing the DNA, sequencing multiple overlapping fragments, and inferring the original sequence from fragments that overlap. This method is often used for bacterial genomes or subclones, like cosmids. Additional information from paired end reads, cosmid ends, and other linkage information will be added in future releases. See Assembly for details.
-
What is an assembly?
An assembly is a representation of the computationally derived relative positions of a set of sequenced fragments. When these individual sequences overlap, a consensus sequence is derived representing the most likely base at each position in the assembly. In this way, increased sequence redundancy improves the quality of the assembly and the confidence in the consensus. See Assembly for details.
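As a rough illustration of how a consensus base is chosen, the sketch below takes a simple majority vote at each position. It assumes the reads have already been placed in a common coordinate frame, with "-" marking positions a read does not cover. The function name and the toy reads are hypothetical, and real assemblers also weigh base quality scores, which this sketch ignores.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over reads padded to the same length.

    '-' marks positions that a read does not cover (an assumption of this sketch).
    """
    length = len(aligned_reads[0])
    result = []
    for i in range(length):
        # Collect the bases observed at this position across all covering reads.
        column = [read[i] for read in aligned_reads if read[i] != "-"]
        if not column:
            result.append("N")  # no read covers this position
        else:
            result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)

# Toy data: the last read carries a single sequencing error (A instead of T).
reads = [
    "ACGTAC----",
    "--GTACGG--",
    "----ACGGTT",
    "--GAACGG--",
]
print(consensus(reads))
```

With more reads covering a position, a single erroneous base is outvoted, which is why deeper coverage raises confidence in the consensus.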
-
What does the name "Contig 7.XXX" mean?
A contig is a sequence fragment created by assembling whole-genome shotgun reads. See Assembly for details.
Every assembly contains multiple contigs. Each assembly is numbered sequentially, and the number preceding the decimal point indicates the assembly number. Contigs within an assembly are also numbered sequentially. Thus "Contig 7.177" indicates contig #177 within release 7. Contig numbers are not conserved between assemblies, so "Contig 7.177" bears no relationship to "Contig 3.177".
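If you handle contig names in your own scripts, it helps to split them into a release number and a contig number rather than treating "7.177" as a decimal (as a number, 7.10 and 7.1 would collapse into the same value). The following is a minimal sketch; the function name and the accepted spellings are assumptions, not part of any tool provided on this site.

```python
import re

# Accepts "Contig 7.177" or "Supercontig 7.5" (case-insensitive).
_NAME_RE = re.compile(r"^(?:super)?contig\s+(\d+)\.(\d+)$", re.IGNORECASE)

def parse_contig_name(name):
    """Return (release, number) for a name such as 'Contig 7.177'."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a contig name: {name!r}")
    return int(match.group(1)), int(match.group(2))

release, number = parse_contig_name("Contig 7.177")
print(release, number)
```

Two names refer to the same sequence only when both the release and the contig number match.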
-
What is a sequence contig?
A sequence contig is the extended contiguous sequence that is produced by the assembly process that joins overlapping sequences. See Assembly for details.
-
Are the contigs ordered?
Contigs within the same supercontig are ordered. See Assembly for details.
-
What is a sequence supercontig?
A supercontig consists of one or more sequence contigs known to occur in a specific order and orientation. Because we sequence each end of the plasmid (or cosmid) subclones, we can recognize that when one end of a clone lies in one sequence contig and the other end of the clone lies in a different sequence contig, these two contigs probably lie close to each other. To create supercontigs we require that two or more such linking clones join two sequence contigs. See Assembly for details.
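The linking rule can be sketched as follows: count, for every pair of contigs, how many clones have one end in each, and keep the pairs supported by at least two clones. The input format and function name are hypothetical, and the real assembler also checks the orientation and spacing of the clone ends, which this sketch leaves out.

```python
from collections import Counter

def linked_contig_pairs(clone_ends, min_links=2):
    """Return contig pairs joined by at least `min_links` clones.

    clone_ends: iterable of (contig_of_one_end, contig_of_other_end), one per clone.
    """
    links = Counter()
    for contig_a, contig_b in clone_ends:
        if contig_a != contig_b:  # both ends in one contig: no link
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair for pair, count in links.items() if count >= min_links}

# Toy data: two clones link c1 and c2; only one links c2 and c3.
clone_ends = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c1", "c1")]
print(linked_contig_pairs(clone_ends))
```

Requiring two or more links guards against a single chimeric or misplaced clone joining contigs that are not actually neighbors.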
-
Are the supercontigs ordered?
No, supercontigs are not ordered. We have ascertained the relative positions of supercontigs using genetic and optical mapping techniques, but the numbering of the supercontigs does not reflect these positions. See Maps for more information.
-
Are supercontig (contig) numbers preserved between different assemblies?
No. Supercontig 3.5 (supercontig 5 in release 3) bears no relation to supercontig 7.5 (supercontig 5 in release 7). Similarly for contig numbers.
-
How big is the Neurospora genome?
Our current total unique contig length of 39 megabases (Mb) suggests the genome is approximately 40 Mb. Earlier estimates based on pulsed-field gel separation of Neurospora chromosomes suggested ~43 Mb.
-
What strain was sequenced?
The standard wild-type strain OR74A (mating type A) was sequenced. The specific strain was 74-OR23-1VA (FGSC #2489). If you are interested in the lineage, see Fungal Genetics Newsletter 34, 46-51 (1987).
-
What is the current state of the assembly?
In Release 7 all but 37 of the original 674 sequence gaps have been closed and, with the exception of a few regions containing ambiguous bases, the entire sequence has been brought to "finished" standards. This sequence release will serve as the basis for future releases and gene predictions.
-
How complete is the current assembly?
We estimate that the current release represents 97% of the Neurospora genome and is covered to a depth of > 10X. It excludes very highly conserved repetitive sequence, and ribosomal RNA genes.
-
Are the contigs ordered? For example, is contig 3.5 flanked by contigs 3.4 and 3.6?
The contigs are numbered sequentially within larger supercontig fragments. Contigs within the same supercontig are positionally ordered. See Neurospora Contig Numbering for details.
-
What contig in release 7 corresponds to my contig 3.XXX from release 3?
Unfortunately there is no automatic way of correlating contig numbers across different assemblies. You can always BLAST your region of interest against the new assembly to get the contig numbers within the latest assembly.
-
How has the sequence been generated for the Neurospora project?
Our data consist of over 1 million individual sequencing reads obtained by sequencing each end of plasmids from a library containing randomly sheared fragments of 4 kb average size. These sequences do not include sequences from the German consortium or sequences previously deposited in Genbank. Future sequence data to be assembled will include cosmid and BAC end sequences as well as finishing data. See Assembly for details.
-
Will the genome be finished?
The goal of this project is a finished Neurospora genome. The finishing process occurs in 3 stages: 1) high-throughput automated prefinishing; 2) a second round of high-throughput finishing that resolves most remaining gaps; and 3) a final closure phase that requires experienced personnel using a variety of customized techniques.
-
How will we know the assembly is correct?
The quality of the assembly will be assessed in several ways. In addition to requiring that the paired plasmid ends occur in a logical manner, our assembly of the Neurospora genome will be verified through: 1) integration of cosmid and BAC end sequences, 2) comparison with available genomic sequences, and 3) correlation with the genetic map.
-
What data are available?
In this version of our data release, all sequence contigs over 2 kb are available. Smaller contigs are sparsely covered and often include poor quality or contaminated DNA. Sequence contig data can be accessed in several ways: either through a BLASTN or TBLASTN search with an option for contig subsequence retrieval, or through FTP download of the entire genome. Contig sequences are subject to change throughout this project, so each data release version number will be appended to the contig number as a prefix (e.g. 2.235 denotes assembly version 2, contig #235).
We also provide precomputed BLAST results against NT and NR. These sequence similarity results can be searched (based on name, GI, species name, etc) and viewed graphically along with the underlying Neurospora sequence.
BAC and cosmid clones have been integrated into the current assembly, and you can search and view the locations of these clones within the sequence contigs.
The current assembly has been correlated with the genetic map and over 80% of the assembly is anchored to a linkage group. You can view the physical and genetic maps by using the "Genetic Map" link above. You may also search for particular genetic markers which have been located in the current assembly, using the "Features" search link.
We have annotated the current sequence with putative genes, based on gene prediction tools and similarity to known genes. These genes are available for download, search by name/locus, and BLASTX and BLASTP searches.
- Clones
-
Are the clones being sequenced available to Neurospora investigators?
The BAC and cosmid clones are available from the FGSC. You can find clones that overlap a region of interest by using the Region search link.
We do not have the resources necessary to make available the 500,000 4 kb plasmid clones. At the completion of this project, these clones will be given to the FGSC.
-
What about cosmid end sequences?
As part of this project, we are sequencing cosmid and BAC ends from three different libraries: pMOcosX, pLORIST and pBeloBAC-KAN. These sequences will be crucial for ordering and orienting the genome as well as providing templates for gaps that are not captured by plasmids. All of these clones are currently available through the Fungal Genetics Stock Center.
-
There are no clones covering my region of interest. How can this be?
The contigs are created with sequence reads from small insert plasmids (around 4000bp) along with larger insert cosmids and BACs. The cosmid and BAC libraries are available from the FGSC, but the plasmid libraries are not. If your region of interest is made up only of DNA sequenced from plasmid clones then there may be no cosmids or BACs containing this sequence region. Unfortunately we cannot provide clones to order for these regions.
-
How do I order clones that contain my gene/region of interest?
You can find the BAC or cosmid clones overlapping a particular region of interest by using the Regions link above.
Type in the contig name, and start/stop position if available, and then click the Clones button.
This will return a list of clones overlapping the region of interest. There's a link from this search result page to allow you to order clones from the FGSC.
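Conceptually, the Clones search is an interval-overlap query: a clone is reported when it lies on the same contig and its span intersects the region you entered. The sketch below illustrates that idea only. The clone names, coordinates, and function name are made up for the example and do not reflect how the site stores or queries its data.

```python
def overlapping_clones(clones, contig, start, stop):
    """Return names of clones on `contig` whose span intersects [start, stop].

    clones: iterable of (name, contig, clone_start, clone_stop) tuples.
    """
    return [
        name
        for name, clone_contig, clone_start, clone_stop in clones
        if clone_contig == contig and clone_start <= stop and clone_stop >= start
    ]

# Hypothetical clone placements, for illustration only.
clones = [
    ("cloneA", "7.12", 1_000, 38_000),
    ("cloneB", "7.12", 40_000, 120_000),
    ("cloneC", "7.13", 5_000, 45_000),
]
print(overlapping_clones(clones, "7.12", 35_000, 42_000))
```

A region covered only by plasmid-derived reads would return an empty list, which is the situation described in "There are no clones covering my region of interest."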
- Chromosomes, Genes, and Regions of Interest
-
Which chromosome does supercontig XXX reside on?
Some of our supercontigs have been anchored to one of the seven linkage groups. Use the Assembly Structure Table to see if your contig or supercontig has been assigned to a linkage group. You can also use the Region search to look up Linkage Group information for a particular contig.
-
Has my favorite gene/region XXX been sequenced?
The whole genome has been shotgun sequenced to greater than 10X depth and therefore we expect 97% of the genome to be represented in our assembly.
You can use the Features search to search for BLASTN or BLASTX alignments containing the name of your gene. You can also search for all BLAST alignments to a particular species of interest.
We have annotated the sequence with automatic gene prediction, and you can search for a particular gene using the "Features" link. You can also use the Linkage Group Genetic Maps to look for regions containing markers of interest.
-
What additional information do you have on my favorite gene/region XXX?
Use the Region search to view a particular region in one of two graphical viewers. These viewers will display all the genetic markers, blast alignments, and clone ends within the region of interest.
You can also look for your region in the correlation between the genetic and physical maps.
-
Is contig XXX near any known marker genes?
You can use the Features Search to look for markers on your contig of interest. Use the Features link, and then choose "Advanced Search" for the MARKER type and enter your contig number into the search form.
You can search for marker genes on nearby contigs within the same supercontig by viewing the appropriate Linkage Group Genetic Map. Use the Assembly Structure Page to see if your contig has been assigned to a linkage group.
-
How can I see the features neighboring my gene of interest?
Using the FeatureMap or GenomeBrowser you can visually see the features in a region neighboring your gene of interest:
- You can bring up any region of a contig using the Region search.
- You can also expand any currently viewed region by modifying the Start and Stop coordinates below the display panel and clicking "Redraw".
- You can also search for features in your region of interest by using the Advanced Search link from the Features link above. The Advanced Search lets you narrow your search by entering start and stop positions on a contig.
- Annotated Genes
-
Is gene XXX annotated in the sequence?
Maybe. We have run automated tools for finding putative genes, relying on ab initio gene finders and sequence similarity to known proteins.
You can search for a gene by name, or by a blastx hit to a known gene. However the gene names are extremely preliminary, and you will find most genes are either named 'predicted protein' (meaning no or weak homology to known genes), 'hypothetical protein' (indicating weak homology to known genes), or 'hypothetical protein (name)' (indicating strong homology).
The genome has not been manually annotated yet, and this process is still premature since we are continuing to sequence.
The Munich Information Center for Protein Sequences (MIPS) has independently sequenced and annotated two of the chromosomes, and their annotations are available from the MIPS website: http://www.mips.biochem.mpg.de/proj/neurospora/
We are not yet in a position to curate manual annotations.
-
Gene XXX is annotated incorrectly in your sequence - can I submit an update to your gene name?
Unfortunately, we are not yet in a position to curate manual annotations. We are currently still discussing future annotation plans.
-
How were the genes annotated?
-
There seem to be two different ways of seeing my gene in the FeatureMap - what's going on here?
You are right. There are two different ways of looking at a gene in the FeatureMap or GenomeBrowser.
- Gene within a contig (e.g. title "Contig 3.77")
If you bring up the FeatureMap or GenomeBrowser on a region of a contig, then you are seeing the result of DNA-based analyses. You'll know you are in this mode if the title of the FeatureMap gives a contig number.
This graphical view shows the results of analyses performed on the nucleotide sequence. For example:
- De novo gene prediction programs: Fgenesh, Genscan
- Blastn searches against NT
- Blastn searches against ESTs
- Blastx searches of the translated nucleotide sequences against proteins in NR
- HMMER searches of the translated DNA against PFAM
You can get to these FeatureMaps by using the Search Regions page.
- A single gene by itself (e.g. title "NCU#####")
You can also bring up the FeatureMap or GenomeBrowser on a protein sequence corresponding to a particular gene. In this view you will see the results of protein-based analyses on the amino-acid sequence. For example:
- Blastp searches of the protein against proteins in NR
- HMMER searches of the protein against PFAM
You can get the FeatureMap of a particular gene from the Feature Detail page corresponding to that locus.
-
Why do I see different PFAM domains in the "Feature Detail" summary and the graphical view?
The graphical view shows the results of analyses performed on the nucleotide sequence. For example:
- De novo gene prediction programs: Fgenesh, Genscan
- Blastn searches against NT
- Blastn searches against ESTs
- Blastx searches of the translated nucleotide sequences against proteins in NR
- HMMER searches of the translated DNA against PFAM
However the HMMER searches found at the DNA level can be misleading, since they do not take the exon structure of the gene into account.
In addition to the HMMER searches of the DNA, we also perform HMMER searches against our predicted gene set. These protein-based HMMER searches are likely to be more accurate, so we present the protein-based PFAM results in the "Feature Detail" summary. We also use the protein-based PFAM results when searching for genes by PFAM domain, in the Advanced Search for Annotated Genes and in the gene index of Genes by PFAM.
The Feature Search mechanism provides access to the results of the DNA analyses, thus the HMMER Feature Search will show the results of the DNA-based HMMER program.
DNA-based HMMER results:
- HMMER Feature Search
- pink arrows in the FeatureMap and GenomeBrowser on contigs
Protein-based HMMER results:
- Genes by PFAM
- Advanced Search for Annotated Genes
- pink arrows in the FeatureMap and GenomeBrowser on genes
-
What changed between release 3 and release 7?
A list of all the locus changes between the releases is available for download. Additionally, the details for each gene indicate what changes, if any, occurred since the previous release.
- Genetic Markers
-
Why don't I see my marker when I do a Marker Feature Search?
The Marker Feature Search will only show markers that have been found in the current assembly.
Your marker may not be located in a contig for several reasons. Most likely, there is no known sequence for your marker and thus we cannot find it in our assembly. If the marker does have associated sequence, the sequence may fall into a gap within the current, unfinished assembly.
You can use the file markers.csv to view the GenBank accession number(s) of sequences associated with each marker (see Downloads Page for data details).
-
Why isn't my marker shown on the Linkage Group Map?
The Linkage Group map has three panels, and all markers in that linkage group should be displayed in one of those three panels.
The left-most panel shows markers that are either:
- on the well-ordered linkage group map (marker names to the right of the genetic map ruler), or
- on supercontigs that are anchored to the well-ordered linkage group map (marker names to the left of the supercontig ruler)
The lower panel shows markers that are not on the well-ordered linkage group map but are found on contigs.
The bottom-most panel shows the remaining markers in the linkage group; those markers not on the well-ordered map which have not been located in the current assembly.
If you cannot find your marker in any of these three panels, please check the 2000 Compendium FGSC genetic maps, and let us know about missing markers.
-
In the genetic map, why is the length of the contigs longer than the length of the associated supercontig?
The lengths of the supercontigs and contigs depicted in the Genetic Map images are to scale relative to the number of base pairs. The supercontig bars appear shorter than the associated contig bars since the contigs are separated by spaces.
-
I know my gene is between marker X and Y, how do I order clones for this region?
Use the genetic map to determine the contig or supercontig corresponding to markers X and Y. It is possible that neither X nor Y has been located on the physical map, and if this is the case then you're out of luck. If you find the contig numbers for genetic markers X and Y, then you can find the overlapping BAC or cosmid clones by using the Regions Search.
Type in the contig name, and start/stop position if available, and then click the Clones button.
This will return a list of clones overlapping the region of interest. There's a link from this search result page to allow you to order clones from the FGSC.
- Downloading
-
What format is the download file in?
The genome data is plain text in multi-FASTA format. The text file has been compressed using gzip. To uncompress the file:
gunzip neurospora_crassa_7.fasta.gz
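If you would rather read the download from a script, the file can also be parsed without uncompressing it on disk. The sketch below uses only Python's standard library; the function name is hypothetical, and it assumes the usual FASTA layout of a ">" header line followed by one or more sequence lines per record.

```python
import gzip

def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)  # last record in the file

# Example usage, once the download is in the working directory:
# for header, sequence in read_fasta("neurospora_crassa_7.fasta.gz"):
#     print(header, len(sequence))
```

Because the function is a generator, only one contig is held in memory at a time, which is convenient for a whole-genome file.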
-
The download fails. What should I do?
Downloading through the browser uses the http protocol. You can also try accessing the ftp site directly via the URL:
- BLASTing
-
Why is my BLAST job taking so long?
BLAST jobs are queued and handled with other internal Broad processes in a general Load Sharing Facility. The delay for receiving your BLAST results depends on the current load.
-
Why are my BLAST results split into multiple email messages?
Some email programs are configured with a maximum message size and will automatically split large files into smaller pieces. If this is undesirable, you will need to reconfigure your email program.
-
What sequences can I BLAST against?
You can BLAST your query sequence against our entire assembly, including putative genes and proteins.
-
Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
From the NCBI Blast FAQ:
This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the sections of the BLAST FAQs on Q: What is low-complexity sequence? and also Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
-
After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?
From the NCBI Blast FAQ:
You are seeing the result of automatic filtering of your query for low-complexity sequence that is performed to prevent artifactual hits. The filter substitutes any low-complexity sequence that it finds with the letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the blast reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.
-
What is low-complexity sequence?
From the NCBI Blast FAQ:
Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits.
In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
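To get a feel for what "low complexity" means, one simple measure is the Shannon entropy of a sequence's letter composition: a sequence dominated by one or two letters scores low. The sketch below is for illustration only. It is not the algorithm used by DUST or SEG, which score local windows with their own statistics, and the function name is an assumption.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the letter composition of `seq`."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# The nucleotide example from the answer above, next to a balanced sequence.
low = "AAATAAAAAAAATAAAAAAT"
balanced = "ACGTTGCAGTCAACGTGCTA"
print(round(shannon_entropy(low), 2), round(shannon_entropy(balanced), 2))
```

For DNA the maximum is 2 bits per base (all four bases equally frequent), so the A-rich example scores well below a balanced sequence. Composition alone does not catch every case, for instance short tandem repeats with balanced composition, which is one reason the real filters are more elaborate.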
- Genome Browser
-
Does the Genome Browser Java applet run on Macintosh computers?
We are pleased to announce that the Genome Browser can now run on both Windows and Macintosh platforms.
Requirements for Windows:
Windows 9x & NT platforms or better
Java 1.4
Netscape Navigator 4+, Internet Explorer 5+, Mozilla 1.* or other browser that can display Java applets
Requirements for Macintosh:
OS X
Java 1.4 (Software Update)
Safari
- Misc.
-
What's the Broad Institute?
The Eli and Edythe L. Broad Institute is a partnership among MIT, Harvard and affiliated hospitals and the Whitehead Institute for Biomedical Research. Its mission is to create the tools for genomic medicine and make them freely available to the world and to pioneer their application to the study and treatment of disease.
-
How do I cite the sequence for publication?
Publications should refer to the specific version of the data release (e.g. "release 7") and include the following citation:
Galagan, James E. et al. 2003. The genome sequence of the filamentous fungus Neurospora crassa. Nature 422, 859-868.
-
Where are the beautiful photos from?
The lovely Neurospora crassa electron micrographs are courtesy of Matt Springer, Stanford University, and the Fungal Genetics Stock Center. If you have images that you are willing to share, please email annotation-webmaster@broad.mit.edu
gunzip neurospora_crassa_7.fasta.gz
