(From "Single Nucleotide Differences (SNDs) in the dbSNP Database May Lead to Errors in Genotyping and Haplotyping Studies", Lucia Musumeci et al., 2010)
The creation of single-nucleotide polymorphism (SNP) databases (such as NCBI dbSNP) has facilitated scientific research in many fields.
SNP discovery and detection has improved to the extent that there are over 17 million human reference (rs) SNPs reported to date (Build 129 of dbSNP).
SNP databases are unfortunately not always complete and/or accurate.
In fact, half of the reported SNPs are still only candidate SNPs and are not validated in a population.
We describe the identification of SNDs (Single Nucleotide Differences) in humans that may contaminate the dbSNP database.
These SNDs, reported as real SNPs in the database, do not exist as such, but are merely artifacts due to the presence of a paralogue (highly similar duplicated) sequence in the genome.
Using sequencing we showed how SNDs could originate in two paralogous genes and evaluated samples from a population of 100 individuals for the presence/absence of SNPs.
Moreover using bioinformatics, we predicted as many as 8.32% of the biallelic, coding SNPs in the dbSNP database to be SNDs.
Our identification of SNDs in the database will allow researchers to not only select truly informative SNPs for association studies, but also aid in determining accurate SNP genotypes and haplotypes.
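To make the idea of an SND concrete, here is a minimal sketch of how a difference between two paralogous copies can look like a SNP. It is not the authors' pipeline: the two sequences are hypothetical toy data, and they are assumed to be already aligned and of equal length.

```python
def single_nucleotide_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch between two
    already-aligned, equal-length sequences (positions are 0-based)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical paralogous copies that differ at a single position.
gene_copy_1 = "ATGGCCTTAGGC"
gene_copy_2 = "ATGGCATTAGGC"

snds = single_nucleotide_differences(gene_copy_1, gene_copy_2)
print(snds)
```

If reads from both copies are mapped to a single locus, the reported position would appear to carry two alleles in every individual, even though neither copy actually varies in the population.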
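The following figure, taken from seminar material‡ provided in the public domain (Courtesy: NHGRI), shows how dbSNP grew as more and more SNPs were registered. It only covers 1998 through 2005, but I could not find a newer version in the public domain. It does seem that validated SNPs ("Validated") account for roughly half of the total SNPs. In the middle of the legend, "2-hit SNPs" is shown with an orange line. I look into that after quoting and translating an overview of validation in dbSNP.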
(From "What exactly does it mean when a SNP is validated? Could you explain what validation is?", NCBI, accessed October 18, 2014; provided as public domain)
What exactly does it mean when a SNP is validated? Could you explain what validation is?
In order for a RefSNP (rs) to be validated, at least one of its clustered submitted SNPs (ss) must either have been ascertained using a non-computational method or have frequency information associated with it.
When an ss is withdrawn from a validated rs cluster, and the withdrawn ss was the only ss in that cluster to have frequency information or to be ascertained using a non-computational method, then the rs cluster changes to "non-validated" status.
For example, the submitter "SNP500CANCER" found all their SNPs using non-computational methods, and routinely withdrew SNPs during their quality control cycles.
So when "SNP500CANCER" submitted a ss into dbSNP and it clustered into a non-validated rs, that rs became validated.
When "SNP500CANCER" later withdrew the same ss, the rs cluster it was associated with lost its validation status.
You can also find information on variation validation by going to the dbSNP Handbook and searching for the text "Validation" (scroll to the bottom of the page).
You will find the following statement:
“dbSNP accepts individual assay records (ss numbers) without validation evidence. When possible, however, we try to distinguish high-quality validated data from unconfirmed (usually computational) variation reports.
Assays validated directly by the submitter through the VALIDATION section show the type of evidence used to confirm the variation.
Additionally, dbSNP will flag an assay as validated (Table 4) when we observe frequency or genotype data for the record.” (04/21/08)
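The validation rule in the FAQ answer above can be sketched as a small model. This is only an illustration of the stated rule, not dbSNP's actual implementation; the class, field names, and ss identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubmittedSNP:
    """Hypothetical model of one submitted SNP (ss) in an rs cluster."""
    ss_id: str
    non_computational: bool = False  # ascertained by a non-computational method
    has_frequency: bool = False      # frequency information is attached
    withdrawn: bool = False


def is_validated(cluster):
    """An rs cluster counts as validated if at least one non-withdrawn ss was
    ascertained non-computationally or carries frequency information."""
    return any(
        not ss.withdrawn and (ss.non_computational or ss.has_frequency)
        for ss in cluster
    )


# Mirrors the SNP500CANCER example: one computational ss plus one
# non-computational ss that is later withdrawn.
cluster = [
    SubmittedSNP("ss_computational"),
    SubmittedSNP("ss_snp500cancer", non_computational=True),
]
before = is_validated(cluster)
cluster[1].withdrawn = True
after = is_validated(cluster)
print(before, after)
```

The status is recomputed from the current members of the cluster rather than stored, which is why withdrawing the only supporting ss flips the cluster back to non-validated.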
Some parts are unclear to me, but in any case it seems that validated and non-validated entries are clearly distinguished. Next I look into "2-hit SNPs".
(From "Double Hit SNP Computation", NCBI)
What criteria does dbSNP use to determine double-hit SNPs independently of Dr. Jim Mullikin's algorithm?
I have not made the double-hit, two-allele computation for some time now.
Currently, we rely exclusively on Dr. Mullikin's data AFAIK.
As for my double-hit computation, I made the initial calculation of double-hit SNPs based on submitter-supplied clone accessions.
If we can establish that two different submitters working with different clone libraries had independently identified each allele, we confirmed the SNP as a double hit.
I believe this mined something on the order of 10 K of double-hit SNPs.
We also knew that we had a bolus of TSC SNPs mined from traces known to be from sources other than the clone libraries used in the human genome.
I'm a bit fuzzy on the details now;
I do, however, recall that the individuals supplying the TSC traces were pooled together, but that a statistical argument, based on the number of individuals in the pool, allowed us to consider each trace as an independent sample with high confidence.
Additionally, the allele appearing on the genome itself constituted one hit of that allele.
If we found at least one other trace with the genomic variant and two traces of the variant not on the genome in the TSC dataset, the SNP was classified as double-hit, two-allele.
I believe we classified about 100 K double-hit SNPs by this method.
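The two double-hit criteria recalled in this answer can be sketched as follows. This is my reading of the description, not dbSNP's code: the function names and input formats are hypothetical, and the clone-based rule is interpreted as "each of at least two alleles was seen by two different submitters using two different clone libraries".

```python
from collections import Counter


def _has_independent_pair(sources):
    """True if two (submitter, library) pairs differ in both fields."""
    sources = list(sources)
    return any(
        s1[0] != s2[0] and s1[1] != s2[1]
        for i, s1 in enumerate(sources)
        for s2 in sources[i + 1:]
    )


def double_hit_by_clones(observations):
    """observations: iterable of (allele, submitter, clone_library) tuples."""
    by_allele = {}
    for allele, submitter, library in observations:
        by_allele.setdefault(allele, set()).add((submitter, library))
    # Assumption: at least two alleles, each independently confirmed.
    return len(by_allele) >= 2 and all(
        _has_independent_pair(sources) for sources in by_allele.values()
    )


def double_hit_by_traces(genome_allele, trace_alleles):
    """TSC-style rule: the genome assembly counts as one hit of its allele;
    require one more trace of that allele and two traces of another allele."""
    genomic_hits = 1 + sum(1 for a in trace_alleles if a == genome_allele)
    other_hits = Counter(a for a in trace_alleles if a != genome_allele)
    return genomic_hits >= 2 and any(n >= 2 for n in other_hits.values())


print(double_hit_by_traces("A", ["A", "G", "G"]))
print(double_hit_by_traces("A", ["G", "G"]))
```

In the second call the non-genomic allele is seen twice, but the genomic allele is supported only by the assembly itself, so the rule as described is not met.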
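As of October 2014, the simplest official guide to dbSNP is probably this four-page Fact Sheet‡. I am not sure why it is distributed via FTP, but it appears to be handed out as introductory educational material. The clearest explanation I found in Japanese was this presentation§. Below I excerpt from the official Fact Sheet, which is in the public domain† for copyright purposes. (Courtesy: National Library of Medicine)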
The NCBI Short Genetic Variations database, commonly known as dbSNP, catalogs short variations in nucleotide sequences from a wide range of organisms.
These variations include single nucleotide variations, short nucleotide insertions and deletions, short tandem repeats and microsatellites.
Short Genetic Variations may be common, thus representing true polymorphisms, or they may be rare.
Some of these rare human entries have additional information associated with them, including disease associations, genotype information and allele origin, as some variations are somatic rather than germline events.
Searching for and displaying SNP records
Searches can be performed from the homepage by typing a query term in the search box and clicking the Search button (A).
The Limits (B) page has an extensive list of options that restrict search results to desired categories, while the Advanced (C) page provides a query construction function for use in creating complex queries to produce more precise results.
The search below, “hfe[gene] AND human[orgn] AND utr_5[fxn]”, retrieves variations mapped to the 5’-UTR of the human HFE gene.
Options in the Display settings popup can be used to show SNPs in other formats, such as FlatFile (D), or sort retrieved variations in a different order, such as chromosome base position (E).
Retrieved variations can be saved to a local file using the Send to (F) option.
Links to separate displays to highlight specific aspects, such as gene-centric listing (GeneView, G) and graphical presentation under the context of genome or mRNA sequences through the HGVS names (H), are also provided.
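The fielded query quoted in this section is typed into the web search box, but the same string can in principle be sent through NCBI's E-utilities `esearch` endpoint. The sketch below only builds the query term and the request URL and performs no network call. The helper names are hypothetical, and it is an assumption that the field tags from the 2014 Fact Sheet (`[gene]`, `[orgn]`, `[fxn]`) are accepted unchanged by the current service.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(gene, organism, function_class):
    """Compose a fielded dbSNP query in the style shown in the Fact Sheet."""
    return f"{gene}[gene] AND {organism}[orgn] AND {function_class}[fxn]"


def esearch_url(term, retmax=20):
    """Return an esearch URL for the snp database (no request is made)."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


term = build_query("hfe", "human", "utr_5")
url = esearch_url(term)
print(term)
print(url)
```

Keeping the term builder separate from the URL builder makes it easy to paste the same term into the web search box for comparison.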
The reference SNP cluster report
Details of a variation record are given in the Reference SNP Cluster Report (shown in sections below and on the next page).
This display is linked from the rsID (rs1800730) and provides a summary of the allele (A) and mapping information in Human Genome Variation Society (HGVS) nomenclature (B).
The VarView icon (C) links to a new gene-centric display (see p.4).
The detailed genome mapping information is summarized in the table below (D).
The chromosomal coordinates (E) link to the same gene-centric display as the VarView icon.
The magnifying glass (F) links to the 1000 Genomes Browser providing genotyping details, if the rsID is also reported by that project.
Clicking the Go button (G) in the GeneView section (right) activates the SNP:GeneView (p.4) display, detailing the variations mapped to the gene.
The mapping coordinates and protein coding changes are summarized in the tables (H) below, which are followed by a graphical display of the variation on the genome assembly (I).
Variations with different characteristics are listed in different tracks (J) and hyperlinked to provide additional details in popup.
Alleles and flanking sequences from submitter SNPs (K) included in the reference SNP cluster are summarized in a table below the graphical display.
The ssIDs (L) link to submitter records providing additional details.