GAAS - Technical References

 

The GAAS package comprises two programs: Gene Array Assembler Software and Gene Array Analyzer Software.


The Assembler pre-processes gene expression data, transforming any input data structure in MS-Excel format into the built-in database data structure in MS-Access format.


The Analyzer uses a built-in database-based gene expression data structure to perform fast differential gene expression analyses across multiple replica experiments. It is structured in the following sections.

Management section
The management framework is based on the relational MasterDB system database, which is accessed and administered through software tools integrated in GAAS. MasterDB is composed of several tables; the most relevant are described below. All tables can be accessed and managed through the MasterDB management window of the Gene Array Analyzer Software.

The MasterDB InputStructure table contains in each entry a template specifying the input data type analyzable with GAAS. In each template, the characteristics of the represented data type are described: acquisition technology (e.g. nylon filters, microarrays), the variables represented by the table columns of the input raw dataset file, and how spot background values have been coded in the input dataset (for each spot a value in a specific background column, or for a group of spots a value in a specifically coded background line). The structure (i.e. maximum number and distribution of spots) of the arrays used to generate the input data is defined in the MasterDB ArrayStructure table.
The pre-processing program Gene Array Assembler uses the templates stored in MasterDB to transform the input data structures into the GAAS data structure, which is then stored in a database. When no template in MasterDB fits the input data, the system asks the user to define a new template specifying the meaning of the columns in the input data structure. This can be done as specified here. A dedicated graphical user interface in the "Input Structure" section of the Gene Array Analyzer Software MasterDB management window allows the user to manage input structure templates.

The InputStructure table is directly linked to the MasterDB DataInfo table, which contains information on each dataset processed in GAAS (by either the Gene Array Assembler or the Gene Array Analyzer). The first time a new dataset is entered in this table, the following are associated with it: a specific input structure; a label database file containing information on the genes whose expression values the dataset contains (e.g. Accession number, Clone ID, gene description, clone type, localization on the array); a default analysis parameter set; the current GAAS processing status; a default normalization type; and a default output data structure.

The MasterDB Labels and LabelStructure tables contain information on the types and templates of the label databases associated with input datasets. In particular, these tables have been designed to deal with two different label database types: databases in which all information about the clones spotted on the array is listed in a single table, and databases that contain, in several tables, the information on all clones available in the clone plate library used to spot the array the label database refers to. In the latter case, only the information on the genes actually spotted on the array is selected at run-time, by using a list of the plates used to spot that array. Accordingly, GAAS was developed to discriminate automatically between the two label database types.
See here how to create a LabelStructure template. Label types and LabelStructure templates can be managed through a dedicated graphical user interface in the "Labels" and "Label Structure" sections, respectively, of the Gene Array Analyzer Software MasterDB management window.

The MasterDB Parameters table contains the user analysis parameter sets (e.g. thresholds applied to determine background and spot quality, ranges used to threshold either noisy or saturated raw expression data, adopted normalization methods, selected regulation thresholds). Each user can therefore store multiple different analysis parameter sets and retrieve them in subsequent analysis sessions.

The management framework also lets the user select which spot values to use in the data normalization processing. The MasterDB FilterCloneType table contains the SQL queries used to filter and select the spot values considered in the data normalization. Such filtering SQL queries can be defined as described here. Any SQL filter can be built in terms of several characteristics of the spotted clones (e.g. name (such as GAPDH), Accession number, Clone ID, words in the clone description), considered individually or in combination. To allow elaborate selections of normalization spot values, the elementary parts of a compound SQL filter query are defined separately, in single rows of the MasterDB FilterCloneType table (see here an example). Starting from this multi-row representation, a specifically developed algorithm reconstructs the compound, nested SQL queries at run-time. Defined clone filters can be managed interactively through a dedicated interface in the "Clone Filter" section of the Gene Array Analyzer Software MasterDB management window.
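The reconstruction of a nested filter from its multi-row representation can be illustrated with a minimal Python sketch. The actual layout of the FilterCloneType table is not given in this document, so the row fields used here (row_id, parent_id, connector, condition) and the column names inside the conditions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterRow:
    """One elementary part of a filter (hypothetical row layout)."""
    row_id: int
    parent_id: Optional[int]  # None marks the root row
    connector: str            # "AND" / "OR" for group rows, "" for leaf rows
    condition: str            # SQL condition for leaf rows, "" for group rows


def build_where(rows):
    """Rebuild one nested WHERE expression from its single-row parts."""
    children = {}
    root = None
    for row in rows:
        if row.parent_id is None:
            root = row
        else:
            children.setdefault(row.parent_id, []).append(row)
    if root is None:
        raise ValueError("no root row found")

    def render(row):
        if not row.connector:          # leaf: elementary condition
            return row.condition
        parts = [render(child) for child in children.get(row.row_id, [])]
        return "(" + f" {row.connector} ".join(parts) + ")"

    return render(root)


# Hypothetical filter: GAPDH clones, or control clones described as actin.
rows = [
    FilterRow(1, None, "OR", ""),
    FilterRow(2, 1, "", "CloneName = 'GAPDH'"),
    FilterRow(3, 1, "AND", ""),
    FilterRow(4, 3, "", "Description LIKE '%actin%'"),
    FilterRow(5, 3, "", "CloneType = 'control'"),
]
where_clause = build_where(rows)
```

In this sketch each group row joins its children with its connector and wraps them in parentheses, so arbitrarily deep nesting is rebuilt by simple recursion.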

The flexibility of the multi-user environment allows each user to store in the MasterDB ViewStructure table an updateable custom data visualization configuration.

The MasterDB OutputStructure table contains, for each user, specific output data structures for the analysis results. Analysis results can be formatted interactively by the user and then stored in external databases, providing maximal flexibility for further statistical and gene clustering analyses.

The MasterDB Users table contains user identification data (i.e. name, password, registration date, time and date of last login). This table is used to manage the multi-user environment, in which each user can store their own settings.


Analysis section
The analysis framework enables management and customization of all implemented data processing procedures, which are subdivided into background, normalization and gene differential expression analysis steps.

Background analysis and quality evaluation
Input data contain raw values for each spot, measured by an image analyzer on an array image. These values correspond to the grey scale intensity of each spot in the array image and to the spot area on which the intensity is computed. Each spot is also characterized by an associated background intensity, generally computed in an area surrounding the spot. The background intensity is a key measurement for evaluating the quality of the hybridized clone because it represents a signal threshold. Depending on the experimental condition and the technology used to produce the array, a single background value can refer either to a set of spots or to a single spot.
On nylon filters, clones are generally spotted in regular blocks named primaries, and a primary background intensity, computed as the mean intensity per background unit area, is associated with each spot in the primary block. Conversely, in microarrays each spot is generally associated with a background intensity computed as the mean intensity of the pixels in one or more areas surrounding the spot.
The approach adopted for background correction consists in subtracting the background intensity from the corresponding spot intensity value. However, each user can decide whether to apply background correction (Exclude background correction) to the specific dataset analyzed. This procedure has been enhanced with an a priori evaluation of the background intensity quality, determined through statistical comparison with neighbor background intensities. Each background value is compared to a robust linear regression (M-estimator) computed on the homogeneous intensity values of the neighbor backgrounds. Recursively, if according to a user-defined threshold (Background Std Threshold) a neighbor background value is significantly different from the mean of the neighbor background intensity values, it is excluded from the calculation of the mean neighbor background intensity. When the background value differs from the mean of its homogeneous neighbor backgrounds by more than a user-defined threshold (Background Ratio Reference), a background quality label is set to low quality.
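The background evaluation can be sketched in Python as follows. This is a simplified illustration, not the GAAS implementation: it replaces the robust linear regression (M-estimator) with a plain mean of the retained neighbors, and it assumes that Background Std Threshold is expressed in standard deviations and that Background Ratio Reference is a maximum ratio between a background and its neighbor mean. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def homogeneous_neighbors(neighbors, std_threshold):
    """Repeatedly drop the neighbor background farthest from the mean
    while it lies more than std_threshold standard deviations away."""
    values = list(neighbors)
    while len(values) > 2:
        m, s = mean(values), pstdev(values)
        if s == 0:
            break
        worst = max(values, key=lambda v: abs(v - m))
        if abs(worst - m) <= std_threshold * s:
            break
        values.remove(worst)
    return values


def background_is_low_quality(background, neighbors,
                              std_threshold, ratio_reference):
    """Label a background as low quality when it differs from the mean
    of its homogeneous neighbors by more than the ratio reference."""
    reference = mean(homogeneous_neighbors(neighbors, std_threshold))
    if reference == 0:
        return background != 0
    ratio = background / reference
    return ratio > ratio_reference or ratio < 1.0 / ratio_reference


def corrected_intensity(spot, background, exclude_correction=False):
    """Subtract the background unless the user excluded the correction."""
    return spot if exclude_correction else spot - background


neighbors = [100, 102, 98, 101, 99, 100, 103, 300]
kept = homogeneous_neighbors(neighbors, std_threshold=2.0)
```

With these example values the outlying neighbor background of 300 is discarded before the mean is computed, so it does not distort the reference against which each background is judged.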
A spot quality label has been introduced to account for differences among the intensities of the spots of the same clone on the array (generally two or more spots per clone). If this difference is greater than a user-defined threshold (Spot Perc. Diff. Reference), the spot quality label of the clone spots is set to low quality. The mean background-subtracted intensity of the spots of a clone that are not labeled low quality is calculated and defined as the clone expression intensity.
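A minimal Python sketch of this step is given below. How GAAS measures the difference among replicate spots is not stated here, so the sketch assumes a percentage difference defined as the range of the background-subtracted intensities relative to their mean; the function name and arguments are hypothetical.

```python
from statistics import mean


def clone_expression(corrected, background_low_quality, perc_diff_reference):
    """Return (expression_intensity, spot_low_quality) for one clone.

    corrected: background-subtracted intensities of the clone's spots.
    background_low_quality: one flag per spot from the background check.
    perc_diff_reference: maximum tolerated percentage difference (assumed
    to be range / mean * 100).
    """
    avg = mean(corrected)
    spread = max(corrected) - min(corrected)
    perc_diff = 100.0 * spread / avg if avg > 0 else float("inf")
    spot_low_quality = perc_diff > perc_diff_reference

    # Only spots without any low quality label contribute to the mean.
    usable = [v for v, bad in zip(corrected, background_low_quality)
              if not bad and not spot_low_quality]
    intensity = mean(usable) if usable else None
    return intensity, spot_low_quality
```

For example, two spots at 400 and 420 with a 20% reference yield an expression intensity of 410, whereas spots at 400 and 900 are both labeled low quality and the clone receives no expression intensity.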
A clone quality label is also computed as the intersection of the background and spot quality labels. Moreover, according to user-defined thresholds (see Gene expression bound window), if a clone expression intensity is too low or too high in comparison with the whole array clone intensity distribution, or with the background mean intensity and standard deviation (noise affected or saturated clone intensity), the clone is not considered in the differential expression analysis.
All quality labels are mapped to a graphical appearance that shows the user the specific reason why each excluded clone was removed from the analysis.
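The way the quality labels and the expression bounds combine into an exclusion reason can be sketched as follows. The sketch assumes that a clone is of good quality only when both its background and spot labels are good, and that the background-based lower bound has the form mean plus k standard deviations; the names and the multiplier k are hypothetical.

```python
from statistics import mean, pstdev


def lower_bound_from_background(backgrounds, k=2.0):
    """Assumed form of the background-based bound: mean + k * std."""
    return mean(backgrounds) + k * pstdev(backgrounds)


def clone_status(intensity, background_ok, spot_ok, lower, upper):
    """Return the reason a clone is excluded, or 'analyzable'."""
    if not (background_ok and spot_ok):
        return "low quality"
    if intensity < lower:
        return "noise affected"
    if intensity > upper:
        return "saturated"
    return "analyzable"
```

Returning a distinct reason for each case mirrors the mapping of quality labels to a graphical appearance: each string could be associated with a different color or symbol in the result tables.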

Data normalization
Because a number of sources of systematic variation can alter expression levels, especially in across-array experiments (inter-array variability), the evaluation of gene differential expression can be biased. Therefore, gene expression levels need to be normalized to smooth out the effects of systematic errors. In GAAS a within-array normalization has been implemented, based on two alternative procedures. The first adopts a global approach: a normalization factor is computed for each analyzed array from all clone intensities in the array, although clones evaluated as low quality can be excluded. The second uses a subset of clone expression intensities (e.g. housekeeping genes or control spikes of heterologous genes) assumed not to vary significantly across the analyzed conditions. Either the mean or the median clone expression intensity can be used as the normalization factor for the array data (Normalization option window). In addition, for both procedures, the user can define an expression level window that excludes low (noise affected) or high (saturated) clone intensity levels (Gene expression bound window). Setting the window involves selecting upper and lower bounds as percentages of the distribution of the array clone intensity values. The lower bound can also be computed as a value proportional to the mean and standard deviation of the array background intensity levels.
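A within-array normalization of this kind can be sketched in Python as follows. The sketch assumes a simple rank-based percentile for the window bounds and assumes that normalizing means dividing each intensity by the factor; neither detail is stated in this document, and all names are hypothetical.

```python
from statistics import mean, median


def percentile_bounds(values, lower_pct, upper_pct):
    """Rank-based bounds taken from the sorted intensity distribution."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int((n - 1) * lower_pct / 100.0)]
    hi = ordered[int((n - 1) * upper_pct / 100.0)]
    return lo, hi


def normalization_factor(intensities, use_median=False,
                         lower_pct=0.0, upper_pct=100.0):
    """Mean or median of the intensities that fall inside the window."""
    lo, hi = percentile_bounds(intensities, lower_pct, upper_pct)
    windowed = [v for v in intensities if lo <= v <= hi]
    return median(windowed) if use_median else mean(windowed)


def normalize(intensities, factor):
    """Scale every clone intensity of the array by the array factor."""
    return [v / factor for v in intensities]


array = [1.0, 50.0, 100.0, 150.0, 200.0, 5000.0]
factor = normalization_factor(array, lower_pct=20, upper_pct=80)
normalized = normalize(array, factor)
```

The global procedure corresponds to passing all clone intensities of the array; the second procedure corresponds to passing only the intensities of the chosen reference clones (e.g. housekeeping genes) to normalization_factor and then applying the resulting factor to the whole array.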

Gene differential expression evaluation
Gene regulations (statistically significant expression differences) in a single experiment (test vs. control condition) are determined by detecting relevant changes of expression ratios with respect to automatically determined confidence intervals or a user-defined folding threshold (Confidence level for regulation: Folding). Confidence intervals are automatically determined on the log-ratio distribution according to a user-defined regulation significance (Confidence level for regulation: Significance).
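The two single-experiment criteria can be sketched in Python as follows. The sketch assumes base-2 log-ratios, a normal distribution of the log-ratios, and a two-sided confidence interval of the form mean ± z·sd; GAAS may determine its intervals differently. Function names are hypothetical, and all intensities are assumed to be positive.

```python
from math import log2
from statistics import NormalDist, mean, stdev


def log_ratios(test, control):
    """Base-2 log-ratios of test vs. control expression intensities."""
    return [log2(t / c) for t, c in zip(test, control)]


def regulations_by_significance(test, control, significance=0.05):
    """+1 / -1 / 0 for ratios above / below / inside the interval."""
    ratios = log_ratios(test, control)
    mu, sigma = mean(ratios), stdev(ratios)
    z = NormalDist().inv_cdf(1.0 - significance / 2.0)
    lo, hi = mu - z * sigma, mu + z * sigma
    return [1 if r > hi else -1 if r < lo else 0 for r in ratios]


def regulations_by_folding(test, control, fold=2.0):
    """+1 / -1 / 0 using a fixed fold-change threshold."""
    limit = log2(fold)
    return [1 if r >= limit else -1 if r <= -limit else 0
            for r in log_ratios(test, control)]


calls = regulations_by_folding([200.0, 100.0, 50.0],
                               [100.0, 100.0, 100.0], fold=2.0)
```

The folding criterion needs no distributional assumption, whereas the significance criterion adapts the interval to the spread of the log-ratios observed in the experiment.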
Gene differential expression across multiple replica experiments is evaluated on the basis of the single-experiment gene regulations, according to either a conditional probability model or a regulation reproducibility cut-off. In the first case, for a given gene, the regulation probability determined for each experiment is used to calculate the regulation conditional probability across the considered replica experiments. The latter is compared with a user-defined replica regulation probability threshold (Regulation Probability Cut-Off) to define the differential expression of the given gene across the considered replica experiments.
When the log-normal distribution of the expression intensity ratios cannot be verified, the user-defined folding threshold (Confidence level for regulation: Folding) and the regulation reproducibility cut-off (Majority Reproducibility Cut-Off) provide acceptable results without any a priori assumption on the differential expression distribution.
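The reproducibility criterion can be illustrated with a short Python sketch. It assumes that the Majority Reproducibility Cut-Off is the minimum fraction of replica experiments in which a gene must be regulated in the same direction; the exact definition used by GAAS is not given here, and the conditional probability model is not sketched because its formula is not specified in this document. The function name is hypothetical.

```python
def replica_regulation(calls, reproducibility_cutoff=0.5):
    """Combine per-replica regulations (+1, -1, 0) into one call.

    A gene is reported as regulated when the fraction of replicas that
    agree on one direction reaches the cut-off and exceeds the fraction
    of replicas showing the opposite direction.
    """
    n = len(calls)
    up = sum(1 for c in calls if c > 0) / n
    down = sum(1 for c in calls if c < 0) / n
    if up >= reproducibility_cutoff and up > down:
        return 1
    if down >= reproducibility_cutoff and down > up:
        return -1
    return 0
```

For example, a gene up-regulated in two of three replicas is reported as up-regulated with a cut-off of 0.6, while a gene regulated in opposite directions in equal numbers of replicas is never reported.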

Parameter settings
All data processing algorithms of the analysis framework can be customized by setting appropriate parameter values in the Parameter setting window of the Gene Array Analyzer Software. In this window, parameters are subdivided into single, pair and replica sections, according to the array analysis processing step they refer to, as follows:

Single:

Pair:

Replica:


Visualization section

The visualization framework enables visual navigation, in both tabular and graphical format, of the data analysis results. The tabular format (Input data window, Background, Single, Pair, and Replica panels) presents input values, quality labels, expression levels, and regulation results along with several gene identifiers (e.g. GenBank accession number, clone ID, gene description). The graphical format visualizes expression level distributions in Histogram and Scatter Plot panels. Histogram plots allow easy comparison of the intensity distributions of multiple experiments in order to visualize systematic noise. Scatter plots of expression levels give insights into gene regulation. In the Scatter Plot panel, users can display information about each plotted gene by clicking on its graphic representation. Moreover, by using the implemented clone search and navigation procedures, the user can move interactively from input data to tabular results and plots (and vice versa) to better investigate the behavior of single genes.

 

 


© Marco Masseroli, PhD E-mail masseroli@elet.polimi.it - Last update on .