Extracting a low-dimensional description of multiple gene expression datasets reveals a potential driver for tumor-associated stroma in ovarian cancer

Table 1 Methods we compared with the INSPIRE framework. To our knowledge, no published method for learning modules and their dependencies can handle variable discrepancy. We therefore adapted the following five state-of-the-art methods, each of which runs on a single dataset: GLasso, the standard graphical lasso [54]; UGL, unknown group L₁ regularization [62]; SLFA, structured latent factor analysis [22]; WGCNA, weighted gene co-expression network analysis [8]; and MGL, module graphical lasso [11] (see “Methods” for details). We adapted the input datasets in three ways so that these methods can be applied to datasets with variable discrepancy (Additional file 2: Figure S1B): “---1”, learning a model from Dataset1 only, which contains all genes; “Inter---”, learning a model from the data on the overlapping genes (blue-shaded region in Fig. 1) and assigning the remaining genes to the learned modules with the k-nearest neighbor approach (i.e., based on the Euclidean distance between the gene’s expression and the expression of each module); and “Imp---”, imputing the missing values in Dataset2 and learning a model from the imputed data (see “Methods” for details on imputation). These adaptations yield 13 competitors: (1) GLasso1; (2) ImpGLasso; (3) UGL1; (4) ImpUGL; (5) WGCNA1; (6) InterWGCNA; (7) ImpWGCNA; (8) SLFA1; (9) InterSLFA; (10) ImpSLFA; (11) MGL1; (12) InterMGL; and (13) ImpMGL. In the experiments on synthetic data, we compared INSPIRE with all 13 methods; in the experiments with two genome-wide ovarian cancer gene expression datasets, discussed in the subsequent sections, we used only the scalable methods (see Additional file 3: Figure S2). These methods are indicated by the purple-shaded region in the table.
The “Inter---” adaptation is not applicable to GLasso and UGL because they learn a network of genes, not modules, and it is not obvious how to connect the genes present only in Dataset1 to the learned network. We do not consider an adaptation that applies the methods to Dataset2 only (“---2”). Outside the overlap, Dataset2 has no genes (in the synthetic data experiments) or very few genes (in the experiments with genome-wide expression data), so “---2”, which uses only the samples from Dataset2, is unlikely to outperform “Inter---”, which uses all samples.
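
The gene-assignment step of the “Inter---” adaptation can be illustrated with a minimal sketch. It assumes that the "expression of a module" is the mean expression profile of the genes already in that module and that each remaining gene goes to the single nearest module (k = 1); the caption fixes neither choice. The function names and the toy data are hypothetical.

```python
from math import dist
from statistics import fmean


def module_profiles(expr, module_of):
    """Mean expression profile of each module (an assumed summary, not the authors')."""
    members = {}
    for gene, module in module_of.items():
        members.setdefault(module, []).append(expr[gene])
    # Average the member genes sample by sample.
    return {m: [fmean(col) for col in zip(*rows)] for m, rows in members.items()}


def assign_nearest_module(expr, module_of, new_genes):
    """Assign each new gene to the module whose profile is closest in Euclidean distance."""
    profiles = module_profiles(expr, module_of)
    return {
        g: min(profiles, key=lambda m: dist(expr[g], profiles[m]))
        for g in new_genes
    }


# Toy data: gene -> expression across three samples.
expr = {
    "g1": [0.0, 0.1, 0.0], "g2": [0.2, 0.0, 0.1],  # overlapping genes in module 0
    "g3": [5.0, 5.1, 4.9], "g4": [4.8, 5.0, 5.2],  # overlapping genes in module 1
    "g5": [0.1, 0.2, 0.0], "g6": [5.1, 4.9, 5.0],  # genes outside the overlap
}
module_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
assigned = assign_nearest_module(expr, module_of, ["g5", "g6"])
```

In this toy case `g5` falls into module 0 and `g6` into module 1. For a gene present only in Dataset1, the distance would presumably be computed over the Dataset1 samples alone, since those are the only samples in which that gene is measured.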

Method	Description	Missing-data adaptation “---1”	Missing-data adaptation “Inter---”	Missing-data adaptation “Imp---”	Scalability (see Additional file 3: Figure S2)
GLasso	Standard graphical lasso [54]	GLasso1	X	ImpGLasso	No
UGL	Unknown group L₁ regularization [62]	UGL1	X	ImpUGL	No
SLFA	Structured latent factor analysis [22]	SLFA1	InterSLFA	ImpSLFA	No
WGCNA	Weighted gene co-expression network analysis [8]	WGCNA1	InterWGCNA	ImpWGCNA	Yes
MGL	Module graphical lasso [11]	MGL1	InterMGL	ImpMGL	Yes
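
As a cross-check on the count of 13 competitors, the rows above can be enumerated programmatically. This is only an illustrative sketch: the constant and function names are hypothetical, and the rule it encodes (no “Inter---” variant for the two network-only methods) is taken from the caption.

```python
# Base methods in the order used by the caption's numbered list.
METHODS = ["GLasso", "UGL", "WGCNA", "SLFA", "MGL"]
# Methods that learn a gene network rather than modules, so "Inter---" does not apply.
NETWORK_ONLY = {"GLasso", "UGL"}
# Methods marked "Yes" in the Scalability column.
SCALABLE = {"WGCNA", "MGL"}


def competitors(methods=METHODS):
    """List the adapted variants of each base method."""
    names = []
    for method in methods:
        names.append(f"{method}1")          # "---1": Dataset1 only
        if method not in NETWORK_ONLY:
            names.append(f"Inter{method}")  # "Inter---": overlapping genes
        names.append(f"Imp{method}")        # "Imp---": imputed Dataset2
    return names


all_competitors = competitors()
scalable_competitors = competitors([m for m in METHODS if m in SCALABLE])
```

`all_competitors` has 13 entries, in the same order as the caption's numbered list, and `scalable_competitors` holds the six WGCNA and MGL variants that the table marks as scalable.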

ISSN: 1756-994X