--  Author: admin
--  Posted: 9/23/2004 2:05:00 AM

--  Biochips (5)




From: teddy (沈小聪聪), Board: Bioinformatics
Subject: Biochips (5)
Site: 北大未名站 (2001-03-14 16:01:19, Wednesday), internal mail

From: Teddy (real), Board: Electronics
Subject: Data analysis (repost)
Site: 大话西游站 (2001-03-14 15:12:05, Wednesday), internal mail

[The following text is reposted from the Classroom board]
[Originally posted by Teddy]

Olga Ermolaeva, Mohit Rastogi, Kim D. Pruitt, Gregory D. Schuler,
Michael L. Bittner, Yidong Chen, Richard Simon, Paul Meltzer,
Jeffrey M. Trent & Mark S. Boguski

Microarray technology makes it possible to simultaneously study the
expression of thousands of genes during a single experiment. We have
developed an information system, ArrayDB, to manage and analyse
large-scale expression data. The underlying relational database was
designed to allow flexibility in the nature and structure of data
input and also in the generation of standard or customized reports
through a web-browser interface. ArrayDB provides varied options for
data retrieval and analysis tools that should facilitate the
interpretation of complex hybridization results. A sampling of ArrayDB
storage, retrieval and analysis capabilities is available
(www.nhgri.nih.gov/DIR/LCG/15K/HTML), along with information on a set
of approximately 15,000 genes used to fabricate several widely used
microarrays. Information stored in ArrayDB is used to provide
integrated gene expression reports by linking array target sequences
with NCBI's Entrez retrieval system, UniGene and KEGG pathway views.
The integration of external information resources is essential in
interpreting intrinsic patterns and relationships in large-scale gene
expression data.

Our modern concept of gene expression dates to 1961, when messenger RNA
was discovered, the genetic code deciphered and the theory of genetic
regulation of protein synthesis described (refs 1-3). The first
attempts at global surveys of gene expression were undertaken in the
mid-1970s. Kinetic studies of the hybridization of mRNA pools with
radioactively labelled cDNA produced the general concepts of varying
mRNA abundance classes that are related to the functional class
(structural, catalytic and so on) of the translated proteins (refs
4,5). These experiments also provided insight into: (i) the number of
members of these classes; (ii) the presence of a large number of
ubiquitously expressed ('house-keeping') genes thought to be necessary
for the structural and functional integrity of all cell types; and
(iii) the existence of significant numbers of genes that are
apparently cell-type-specific. This period coincided with the
establishment and popularization of the phrase 'gene expression'
through its usage in the titles of a series of influential books
(refs 6-9). Interest in gene expression increased steadily during the
1980s, as shown by the fact that the frequency of usage of the phrase
increased more than 10-fold in the titles of publications over this
decade (unpub. obs.).

In the 1990s, a new era of gene expression studies has unfolded as a
result of data sufficiency (that is, complete genomes or comprehensive
cDNA surveys) and technological advances (refs 10-12). As a
consequence of large-scale DNA sequencing activities, there are now
more DNA sequences in GenBank than there are related publications in
the literature (Fig. 1).

Thus, we have reached a turning point in biomedical research: in the
past we have had many publications about a relatively small number of
genes, whereas now, and in the future, single publications will begin
to encompass aspects of thousands of genes (refs 12-17). Large-scale
study of gene expression is a hallmark of the transition from
'structural' to 'functional' genomics (ref. 18), where knowing the
complete sequence of a genome is only the first step in understanding
how it works.

There are several new technologies for studying the simultaneous
expression of large numbers of genes. These technologies may be
generally divided into serial and parallel methods. The serial methods
involve direct, large-scale sequencing of cDNA (for review, see ref.
19); the parallel approaches are based on hybridization to cDNA
immobilized on glass (termed 'microarrays'; ref. 11) or to synthetic
oligonucleotides immobilized on silica wafers or 'chips' (termed
'probe arrays'; refs 10,20). In both parallel methods, hybridized
probes are detected using incorporated fluorescent nucleotide analogs.
These methods are the conceptual descendants of filter-immobilized
targets detected by radioactive probes (refs 21,22), and filter-based
technology is undergoing a renaissance as a low-cost alternative to
the newer methods. Regardless, arrays of hybridization targets,
generated at high density in small areas (for example, 10,000 cDNAs on
a 2×2 cm filter or glass slide) are now commonly referred to as
microarrays. Detailed discussion of these technologies is beyond the
scope of this article (see
http://www.ncbi.nlm.nih.gov/ncicgap/expression_tech_info.html and
http://www.nhgri.nih.gov/), but we note that bioinformatics needs are
similar and equally essential for all methods.

Fig. 1 Cumulative growth of molecular biology and genetics literature
(blue) compared with DNA sequences (green). Articles in the "G5"
(molecular biology and genetics) subset of MEDLINE are plotted
alongside DNA sequence records in GenBank over the same time period.
The former data was obtained with the help of R.M. Woodsmall of NCBI
and the latter data is available
(ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt). No attempt has been made
to eliminate data redundancy among either the DNA sequence records or
information contained in the literature.

Box 1 · The 10K/15K gene sets

The initial resources required to design and fabricate gene expression
microarrays include cDNA sequence data, cDNA clones, or both.
Identification of genes and clones of interest is problematic due to
the quantity and redundancy of sequence data available. Some problems
associated with the large-scale application of genome resources have
been faced before in the context of building a transcript map of the
human genome (ref. 24), and databases consisting of non-redundant
collections of human and mouse genes and ESTs have been developed
(ref. 25). The UniGene collection of human sequences
(http://www.ncbi.nlm.nih.gov/UniGene/) currently represents more than
45,000 genes and it is possible to fabricate arrays containing this
entire collection. Initial work in our laboratories focused on a
smaller, but still significant subset of approximately 10,000-15,000
transcribed human sequences referred to as the 10K and 15K sets,
originally conceived by P. Brown, J.M.T. and M.L.B., developed by G.S.
and arrayed by J. Hudson. Detailed information on the composition of
these sets is available (http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/).
Briefly, the sets were designed to include a selection of human genes
of known function, ESTs on the human transcript map (ref. 24), ESTs
with significant similarities to genes in other organisms and some
handpicked genes of specific research interest.

Although a great deal of effort has gone into the development of the
enabling technologies, relatively little attention has been paid to
the computational biology underlying data analysis and interpretation.
We describe here some general aspects of gene expression informatics
as well as our specific implementation of an integrated data
management and analysis system (ArrayDB), designed as a
database-backed web site (ref. 23). Informatics plays an important
role at every step in the process, from the design of arrays through
laboratory information management, to the processing and
interpretation of experimental results. We also discuss the role of
the public database in this new era of biomedical research.

Design considerations

Array-based experiments aim to simultaneously catalogue the expression
behaviour of thousands of genes in a single experiment. It is also
expected that comparisons will be carried out across tissues,
developmental and pathological states, or as temporal responses
following a defined alteration to cells or their environment. Such
experiments require the ability to manage large quantities of data
both before and after the experiment. The design and construction of
arrays that will detect gene expression requires direct access to all
sequences, annotations and physical DNA resources for genes of an
organism (Box 1).

Following hybridization and readout of relative expression levels
observed in various sites on an array, the data collected must be
stored and preserved in a way that makes it readily available for
image processing (ref. 26) and statistical and biological analysis.
The latter includes identifying the changing and unchanging levels of
expression and correlating these changes to identify sets of genes
with similar profiles. Easy access to existing biological knowledge of
gene function and interaction is necessary to fully interpret the
biological implications of the observed patterns. An information
system must also be flexible enough to accommodate new statistical
data mining tools as they become available.

Laboratory information management systems (LIMS)

The successful use of large-scale functional genomics technologies
depends on robust and efficient systems for tracking and managing
material and information flow. An overview of the types of practical
problems addressed by our LIMS is shown (Fig. 2). The individual
components and detailed design of a LIMS are connected with specific
laboratory environments, particularly for those technologies still
under development, but some general principles have guided our work.
These include the use of an industry-standard relational database
management system combined with platform-independent web browser
interfaces for data entry and retrieval (ref. 23).

The microarray LIMS, ArrayDB, was developed to store, retrieve and
analyse microarray experiment information. The ArrayDB system
integrates the multiple processes involved in microarray expression
experiments, including data management, user interface, robotic
printing, array scanning and image processing (ref. 26). Data stored
in the ArrayDB system includes information about the experimental
resources, experimental parameters and conditions, and raw and
processed data.

Fig. 2 Schematic overview of the ArrayDB information management
system. The basic information in the database consists of arrays,
'probes' and images. Arrays of specific cDNA clone inserts (and
accompanying annotation) are as described in the text, Box 1 and the
legend to Figure 3. The section labelled 'probes' signifies details of
a particular experiment as described in the text. Details regarding
image processing are provided elsewhere (ref. 26). An ad hoc raw image
format is used for processing, but this is converted to standard
formats (JPEG, GIF) for subsequent analysis and display. A complete
relational schema of the database is available on request.

Fig. 3 Screen captures of various data retrieval and analysis tools
within ArrayDB. a, ArrayViewer histogram (additional details in text).
b, ArrayViewer image and results. The ArrayViewer Java applet displays
the scanned array image in the top window. Boxes and the ranking
number are overlaid on the image for clones that have satisfied the
query criteria; clones are ranked according to ascending ratio value.
The boxed clones, and related quantitative data, are listed under the
image. Quantitative data presented in the lower window include: the
ranking number, IMAGE clone ID, the ratio, probe A intensity, probe B
intensity, probe size, probe B pixel size and the clone title. c,
ArrayViewer cluster report. The example shows a report for
Tryptophanyl-tRNA synthetase. 'Cl_id' is an internal database
identifier. The 'Clone' field contains the IMAGE clone identifier and
is hyperlinked to the dbEST records containing the sequences of this
clone. 'Flags' summarizes the criteria by which this sequence was
included in the 10K/15K sets. 'Txmap' refers to the location of an STS
derived from this sequence on the human transcript map (ref. 24) and
'Clust' indicates the UniGene cluster containing this sequence
(http://www.ncbi.nlm.nih.gov/UniGene). 'EC' contains the enzyme
commission nomenclature number for this enzyme and 'KEGG' links it to
the biochemical pathway reports available through the KEGG web site
(http://www.genome.ad.jp/kegg/; ref. 27). 'Pl/Row/Col' refers to the
microtitre plate and well from which the original clone was obtained.
The 'Genes' field contains GenBank accession numbers for annotated
(non-EST) versions of the sequence and the "3'EST" and "5'EST" fields
contain GenBank accession numbers for all ESTs corresponding to the
cDNA sequence. Lastly, the 'Sequence' field contains only those
accession numbers referring to those EST sequences derived from the
actual IMAGE clone insert selected for inclusion in the array.

The design of ArrayDB allows for flexibility in the exact nature of
the data stored. This design strategy permits data input from
different sources. Most clone information stored in ArrayDB is
extracted from UniGene (for example, sequence definition and accession
number). However, the design accommodates addition of newly isolated
clones for which accession numbers or meaningful names are not yet
available. Many data input and processing tasks are automated.
Software automatically scans a directory for new intensity data, which
are uploaded into the database without requiring an operator's
assistance. Additional automated processes were developed to
facilitate integration of intensity data with clone data; for example,
ArrayDB maintains the association between a spot on an image and all
the data related to the clone located at that position on the
microarray.
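
As a rough illustration of how a relational layout can keep a spot
tied to its clone, here is a minimal sketch using Python's built-in
sqlite3 module. The table and column names, the sample rows and the
helpers open_demo_db and clone_at are all assumptions made for this
example; the real ArrayDB schema is not given in the text, which only
says it is available on request.

```python
import sqlite3

# Hypothetical, simplified schema; NOT the real ArrayDB schema.
SCHEMA = """
CREATE TABLE clone (
    clone_id  INTEGER PRIMARY KEY,  -- e.g. an IMAGE clone ID
    title     TEXT,
    accession TEXT                  -- NULL for newly isolated clones
);
CREATE TABLE spot (
    array_id  INTEGER NOT NULL,
    spot_row  INTEGER NOT NULL,
    spot_col  INTEGER NOT NULL,
    clone_id  INTEGER NOT NULL REFERENCES clone(clone_id),
    PRIMARY KEY (array_id, spot_row, spot_col)
);
"""


def open_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO clone VALUES (?, ?, ?)",
        [
            # The clone ID and accession below are invented placeholders.
            (1001, "Tryptophanyl-tRNA synthetase", "ACC0001"),
            (1002, "newly isolated clone", None),  # no accession yet
        ],
    )
    conn.executemany(
        "INSERT INTO spot VALUES (?, ?, ?, ?)",
        [(1, 0, 0, 1001), (1, 0, 1, 1002)],
    )
    return conn


def clone_at(conn, array_id, spot_row, spot_col):
    """Return (clone_id, title, accession) for a spot, or None if no clone is recorded there."""
    return conn.execute(
        "SELECT c.clone_id, c.title, c.accession "
        "FROM spot AS s JOIN clone AS c ON c.clone_id = s.clone_id "
        "WHERE s.array_id = ? AND s.spot_row = ? AND s.spot_col = ?",
        (array_id, spot_row, spot_col),
    ).fetchone()


conn = open_demo_db()
print(clone_at(conn, 1, 0, 0))
print(clone_at(conn, 1, 0, 1))
```

Keeping the accession column nullable is one simple way to admit
clones that have no accession number yet, as the text requires; a
production schema would likely differ.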

The web-based user interface to the ArrayDB system allows convenient
retrieval of distinct types of information, ranging from clone data to
intensity data to analysis results. ArrayDB supports database queries
by different fields, such as clone ID, title, experiment number,
sequence accession number, or microtiter plate number, with a
resulting report of the relevant clone(s). Additional information
about each clone is available through hypertext links to other
databases such as dbEST, GenBank or UniGene. Furthermore, metabolic
pathway information is also available through links to the Kyoto
Encyclopedia of Genes and Genomes (KEGG) web site (ref. 27).

The inconsistency in gene nomenclature makes it more efficient and
accurate to search for a gene of interest by doing a sequence
similarity search. ArrayDB supports BLASTN searches against the
10K/15K set so that anyone can quickly determine if a gene of interest
is included on our arrays. Matches against individual sequences are
linked to a 'cluster report' (Fig. 3c), and from there to further
annotation in external databases via hypertext links as described
above.

 

Data analysis

The ultimate goal of ArrayDB is to identify patterns and relationships
among intensity ratios both in individual experiments and across
multiple experiments. The ArrayViewer tool supports retrieval and
analysis of single experiments; MultiExperiment viewer supports
analysis of data from multiple experiments. In addition, the option to
download intensity data, images and some analysis results to a local
disk adds flexibility to the end-user's analysis options: once
downloaded, intensity data can be imported into other software
packages for analysis.

Fig. 4 MultiExperiment viewer window. a, The main panel of the
MultiExperiment viewer is divided into three sections. The left side
is composed of the control panel where the query criteria are
selected. One also selects the experiments to analyse and other
filters such as keywords, minimum intensities and minimum pixel sizes.
The data returned from a query can be downloaded in a tab-delimited
text file by selecting 'download list' in this panel. The control
panel can also be used to alter the y-axis format and scale of the
data represented in the window on the right side. This window is a dot
plot of the experimental data returned from the query. Selecting
particular 'dots' with a mouse highlights the ratio data for that
clone across all selected experiments in both the dot plot and the
quantitative data in the lower right window. The lower right window
displays the calibrated ratio of the returned genes (clones).
Selecting the ranking number highlights that data in the dot plot. The
IMAGE consortium clone ID is linked to the cluster reports (Fig. 3,
legend). Selecting ratio and title will launch a new window (b) that
displays the red and green intensities and sizes for that clone. By
selecting advanced options in the control panel, a new window (c) is
launched that allows greater flexibility and control in defining a
query. Greater precision is achieved by allowing one to specify
experiments where only up-regulated clones or only down-regulated
clones are of interest.

ArrayViewer facilitates identification of statistically significant
hybridization results in single experiments. The data set for a single
experiment includes intensity ratio data for two fluorescent
hybridization probes. However, the inherent flexibility in the ArrayDB
design strategy is compatible with results derived from single
intensity (for example, radioactive probe) data. In the case of
radioactive probes, a single image consists of the intensity data from
two separate hybridization experiments using two different probes.
Ratios of the intensity values obtained with each probe, for each
clone, are determined and stored in the database. (The mathematical
basis for our image analysis approach is reported elsewhere; ref. 26.)
The basic premise of ArrayViewer is that significant hybridization
results can be determined from the ratio values. Therefore,
ArrayViewer initially displays a histogram that is created on demand
using the ratios stored in ArrayDB (Fig. 3a).

From the ArrayViewer histogram, there are three basic ways to query
the data and return information on the nature and expression of
specific genes. The first method uses a confidence algorithm (ref.
26). Querying by confidence values will return a list of those genes
with statistically significant ratio values that are less than a lower
confidence limit or greater than an upper confidence limit. The
default confidence value is 99%, but this can be changed and the lower
and upper confidence limits re-calculated. The second method allows
the user to select a range of ratios on the histogram and will return
information on genes with expression ratios in this range. The last
method is to simply view the image of the hybridization results and
select spots in the array using a computer mouse or other pointing
device. One can further refine the ArrayViewer query by adjusting
optional filters for minimum intensity, maximum intensity, minimum
size, or keyword.
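
The first two query styles can be sketched in a few lines of Python.
This is only an illustration: the confidence algorithm actually used
is the one in ref. 26, which is not reproduced in this text, so the
sketch substitutes a plain normal approximation on log2 ratios as a
stand-in. The function names, the dictionary keys ("title", "a", "b",
"ratio") and the sample spots are all invented for the example.

```python
import math
import statistics
from statistics import NormalDist


def confidence_limits(ratios, confidence=0.99):
    """Lower/upper ratio limits from a normal fit to log2 ratios.

    Stand-in only: NOT the confidence algorithm of ref. 26.
    """
    logs = [math.log2(r) for r in ratios]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    return 2.0 ** (mu - z * sigma), 2.0 ** (mu + z * sigma)


def outside_limits(spots, lower, upper, min_intensity=0.0, keyword=None):
    """Spots with ratio < lower or > upper, after optional filters, by ascending ratio."""
    hits = []
    for s in spots:
        if lower <= s["ratio"] <= upper:
            continue  # not significant under this stand-in rule
        if min(s["a"], s["b"]) < min_intensity:
            continue  # too dim to trust
        if keyword and keyword.lower() not in s["title"].lower():
            continue
        hits.append(s)
    return sorted(hits, key=lambda s: s["ratio"])


def within_range(spots, low, high):
    """Second query style: spots whose ratio falls inside a chosen range."""
    return sorted((s for s in spots if low <= s["ratio"] <= high),
                  key=lambda s: s["ratio"])


# Made-up data: 40 unchanged spots plus one strong change in each direction.
background = [0.9, 1.0, 1.1, 0.95, 1.05] * 8
spots = [{"title": f"background {i}", "a": 1000.0 * r, "b": 1000.0, "ratio": r}
         for i, r in enumerate(background)]
spots.append({"title": "up-regulated example", "a": 8000.0, "b": 1000.0, "ratio": 8.0})
spots.append({"title": "down-regulated example", "a": 125.0, "b": 1000.0, "ratio": 0.125})

lower, upper = confidence_limits([s["ratio"] for s in spots])
for hit in outside_limits(spots, lower, upper):
    print(hit["title"], hit["ratio"])
```

Working on log2 ratios treats a two-fold increase and a two-fold
decrease symmetrically, which is why the limits are computed in log
space and converted back.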

Query results are provided in a new window that displays the array
image and a list of clones with their associated intensity data (Fig.
3b). Additional information about each data point or clone can be
obtained by clicking on the ranking number (red) or the clone ID
number (blue), respectively. Selecting the ranking number opens a new
window presenting a ×10 magnification of the hybridized target spot
plus a reiteration of the hybridization result. Selecting the clone ID
number opens a new window containing a cluster report for that clone
(Fig. 3c). Lastly, the data in the results window can be downloaded to
a tab-delimited text file by clicking on 'Download List'.
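
For readers who want to reproduce such an export from their own
results, a tab-delimited list ranked by ascending ratio might be
written as below. This is a minimal sketch with Python's csv module;
the column names, the function to_tab_delimited and the sample rows
are assumptions, not the actual ArrayDB download format.

```python
import csv
import io

# Illustrative column names, loosely following the fields listed for Fig. 3b.
COLUMNS = ["rank", "clone_id", "ratio", "probe_a", "probe_b", "title"]


def to_tab_delimited(rows):
    """Return the rows as tab-delimited text, ranked by ascending ratio."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    ranked = sorted(rows, key=lambda r: r["ratio"])
    for rank, row in enumerate(ranked, start=1):
        writer.writerow([rank, row["clone_id"], row["ratio"],
                         row["probe_a"], row["probe_b"], row["title"]])
    return buffer.getvalue()


# Made-up rows for demonstration.
rows = [
    {"clone_id": 1001, "ratio": 4.0, "probe_a": 4000, "probe_b": 1000, "title": "clone A"},
    {"clone_id": 1002, "ratio": 0.25, "probe_a": 250, "probe_b": 1000, "title": "clone B"},
]
text = to_tab_delimited(rows)
print(text)
```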

To realize the full potential of microarray expression analysis,
MultiExperiment viewer was developed. This web-based tool enables
users to query the database across multiple experiments to identify
clones that share some pattern of expression across those experiments.
For example, one can use this tool to identify genes that are
up-regulated or down-regulated across a series of experiments. In
addition, the user can track the behaviour of a particular gene or
genes of interest by specifying key words from gene descriptions in
the 15K set. Analysis results are presented in both a graphical and
tabular format. Also provided is a download option for the result
table to facilitate storage of results for future reference and/or
additional analysis.

The MultiExperiment viewer window (Fig. 4) provides a control panel
for selecting the query criteria, an area to display a dot plot of the
query results and a section where the table of quantitative
information is displayed. To develop the query, one must first select
the experiments from the list in the upper left corner; several
filters are also provided which enable the user to 'fine-tune' the
query. The MultiExperiment viewer then queries the database to
identify clones exhibiting ratios that meet the query requirements,
returns the ratio for each clone and draws a dot plot of the results
for each experiment selected. This provides a convenient method to
identify clones with particularly high or low ratios in an
experimental series, such as a time course. There are two ways to
visualize the expression pattern shown by an individual clone across
the selected set of experiments. The position of the clone is
highlighted in the dot plot diagram (Fig. 4, red boxes) for each
experiment by either clicking on a desired spot in the diagram or by
clicking on the ranking number (left column) of a clone with
interesting quantitative data. As previously described, additional
information about each gene product is readily available in the
clone's cluster report (Fig. 3c) via the hyperlinked clone ID column.
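
A query of this kind, finding clones that are up-regulated (or
down-regulated) in every selected experiment, can be sketched as
follows. The function name, the two-fold default threshold, the data
layout and the sample values are assumptions for illustration; the
text does not specify how MultiExperiment viewer implements its
queries.

```python
def consistently_regulated(ratios_by_clone, experiments, threshold=2.0, direction="up"):
    """Clones whose ratio passes the threshold in every selected experiment.

    ratios_by_clone maps clone name -> {experiment name: ratio}.
    For direction="down" the cutoff is 1/threshold.
    """
    hits = {}
    for clone, by_experiment in ratios_by_clone.items():
        values = [by_experiment.get(e) for e in experiments]
        if any(v is None for v in values):
            continue  # clone not measured in every selected experiment
        if direction == "up":
            passed = all(v >= threshold for v in values)
        else:
            passed = all(v <= 1.0 / threshold for v in values)
        if passed:
            hits[clone] = values
    return hits


# Made-up ratios for three experiments of a time course.
ratios_by_clone = {
    "clone A": {"t1": 2.5, "t2": 3.0, "t3": 4.0},
    "clone B": {"t1": 0.4, "t2": 0.3, "t3": 0.2},
    "clone C": {"t1": 2.5, "t2": 0.9, "t3": 3.0},
    "clone D": {"t1": 5.0, "t2": 6.0},  # missing t3
}
selected = ["t1", "t2", "t3"]
print(consistently_regulated(ratios_by_clone, selected))
print(consistently_regulated(ratios_by_clone, selected, direction="down"))
```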

Box 2 · Public access to expression array data

As large-scale gene expression data accumulates, public data access
becomes a critical issue. What is the best forum for making the data
accessible? Summaries and conclusions of individual experiments will,
of course, be published in traditional peer-reviewed journals, but
electronic access to full data sets is essential. There are three
models for data publication: first, authors can make data available on
their own web sites (for example,
http://cmgm.stanford.edu/pbrown/explore; ref. 14); second, journals
that publish the results of these studies can provide the complete
data sets as electronic supplements (this approach fulfills the
traditional archival responsibility of the literature); and the third
approach is to submit the data to a centralized public data repository
such as GenBank. The primary disadvantages of the first two models are
that data is widely dispersed and lacks uniform structure and
retrieval modalities. In addition, the first case is complicated
further by an uncertain life span for the data and the second case
incurs new expenses for curating and maintaining this data that
journals may not wish to bear. Clearly, the successful history of
public sequence databases provides an attractive model for the most
efficient management of and convenient access to large-scale
expression data. However, it would be highly desirable to arrive at
some type of data format standards that are independent of particular
expression technology.

The comparison of data across multiple experiments requires a way of
normalizing ratio results between experiments; to date, this has only
been possible by using a single reference state as the source of one
of the hybridization probe mixtures for all of the experiments to be
compared. For example, such an approach has been used in comparing
points along a time course, and in comparing multiple samples of a
particular type of tumour (unpublished observations). In diauxic shift
experiments (ref. 14), the reference sample was cDNA prepared from
yeast cells harvested at the first interval after inoculation.
Although the use of such a reference comparator allows ratio
comparisons within a series of experiments, there is clearly a need
for a more broadly applicable reference standard to serve as a
benchmark for all expression experiments. A number of microarray
laboratories have given thought to formulating such a standard. An
ideal standard would provide modest signals for every human gene, so
that expression of any gene in the experimental probe could be
assigned a reliable ratio value. The standard would also need to be
readily and reproducibly generated and easily disseminated. Efforts to
produce such a reference standard are underway.
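
The arithmetic behind a shared reference is simple: if every
experiment reports sample/reference, dividing two such ratios cancels
the reference, since (S_i / R) / (S_j / R) = S_i / S_j. A minimal
sketch follows; the function name rebase, the data layout and the
numbers are invented for illustration and do not come from the
article.

```python
def rebase(ratios_vs_reference, baseline):
    """Convert sample/reference ratios into sample/baseline ratios.

    ratios_vs_reference maps experiment -> {clone: sample/reference ratio}.
    Clones missing from (or zero in) the baseline experiment are skipped.
    """
    base = ratios_vs_reference[baseline]
    rebased = {}
    for experiment, ratios in ratios_vs_reference.items():
        rebased[experiment] = {
            clone: ratio / base[clone]
            for clone, ratio in ratios.items()
            if base.get(clone)  # present and non-zero in the baseline
        }
    return rebased


# Made-up time course, each point hybridized against the same reference.
time_course = {
    "t1": {"gene A": 2.0, "gene B": 0.5},
    "t2": {"gene A": 4.0, "gene B": 0.25},
    "t3": {"gene A": 1.0, "gene C": 3.0},
}
relative_to_t1 = rebase(time_course, "t1")
print(relative_to_t1)
```

Note that "gene C" drops out because it has no baseline measurement,
which mirrors the limitation described above: comparisons are only
possible where the common comparator gives a usable signal.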

Discussion

Given the great potential of large-scale expression analysis, and
biologists' desire to exploit this new technology, we anticipate a
deluge of data soon. The capacity to ask questions and perform
analyses across hundreds, thousands, or tens of thousands of
experiments should dramatically enhance our ability to identify
'fingerprints' of gene expression that exemplify particular diseases
or other biological states. But first we will need to empirically
define 'housekeeping' genes, identify reproducible artifacts and
detect subtle patterns through the application of powerful statistical
analysis techniques.

This potential cannot be fully realized without efficient data
management and analysis systems. ArrayDB provides a first-generation,
convenient, flexible and extendable microarray data management and
analysis system. Planned future extensions to ArrayDB include more
sophisticated links between the database and external data sources and
more powerful data mining capabilities. Currently, querying multiple
databases such as NCBI's PubMed, GenBank, or dbEST can assemble a
great deal of valuable information, but it can be a tedious and
time-consuming process to repeatedly query each database for
information on even a small number of genes. However, by fully
exploiting the application programming interfaces in the Entrez
system, sophisticated 'executive summaries' can, in principle, be
generated.

Although these types of reports can be generated by the thoughtful
integration of external data resources, the larger problem [...]
itself. In the world outside of biological databases, the term 'data
mining' has been applied to this type of knowledge discovery (ref.
28). Because of the complexity of the data, data mining tools are
essential to fully exploit the power of microarray expression
analysis. Data mining tools, similar to mathematical techniques that
identify patterns in complex data sets, will enable identification of
multiple expression profiles in complex biological processes. This
will provide a means to identify genes that share an expression
profile, genes that are expressed in succession, or genes showing
opposing expression profiles. For instance, cluster analysis (ref. 29)
of a time course experiment can identify different expression profiles
exhibited by groups of genes. We are currently developing a data
mining tool for the ArrayDB system to help address this need.
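
To make the idea of shared and opposing profiles concrete, here is a
deliberately simple sketch that groups time-course profiles by Pearson
correlation. It is a greedy stand-in, not the cluster analysis method
of ref. 29 and not the ArrayDB data mining tool; the function names,
the 0.9 cutoffs and the sample profiles are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def group_profiles(profiles, min_r=0.9):
    """Greedy grouping: a gene joins the first group whose seed it correlates with."""
    groups = []  # list of (seed gene, member list)
    for gene, profile in profiles.items():
        for seed, members in groups:
            if pearson(profiles[seed], profile) >= min_r:
                members.append(gene)
                break
        else:
            groups.append((gene, [gene]))  # start a new group seeded by this gene
    return [members for _, members in groups]


def opposing_pairs(profiles, max_r=-0.9):
    """Pairs of genes whose profiles are strongly anti-correlated."""
    genes = list(profiles)
    return [(a, b)
            for i, a in enumerate(genes)
            for b in genes[i + 1:]
            if pearson(profiles[a], profiles[b]) <= max_r]


# Made-up log2 ratios at four time points.
profiles = {
    "g1": [0.0, 1.0, 2.0, 3.0],
    "g2": [0.1, 1.1, 2.0, 3.2],
    "g3": [3.0, 2.0, 1.0, 0.0],
    "g4": [0.0, 2.0, 0.0, 2.0],
}
print(group_profiles(profiles))
print(opposing_pairs(profiles))
```

The result of greedy grouping depends on input order, so this is best
read as a way to see the concept rather than as a recommended
clustering procedure.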
--
※ Source: 大话西游站 dhxy.dhs.org [FROM: 大话西游站]
--
※ Reposted: 大话西游站 dhxy.dhs.org [FROM: 大话西游站]

--
※ Source: 北大未名站 bbs.pku.edu.cn [FROM: 159.226.61.251]


