<p>This function obtains the Geuvadis SNP data. It downloads missing genotype data from ArrayExpress, converts files from variant call format (VCF) to binary files, removes SNPs with a low minor allele frequency, labels SNPs in the format “chromosome:position”, and changes sample identifiers.</p>
<p>This function obtains the BBMRI SNP data. It limits the analysis to specified biobanks, reads in genotype data in chunks, removes SNPs with missing values (multiple biobanks/technologies), removes SNPs with a low minor allele frequency, and fuses data from multiple biobanks/technologies.</p>
<p>This function obtains the Geuvadis exon data. It retains exons on the autosomes, labels exons in the format “chromosome_start_end”, and extracts the corresponding gene names.</p>
<p>This function obtains the BBMRI exon data. It loads quality controlled gene expression data, extracts sample identifiers, removes samples without SNP data, loads exon expression data, extracts sample identifiers, retains samples that passed quality control, and retains exons on the autosomes.</p>
<p>This function removes duplicate samples from each matrix, only retains samples appearing in all matrices, and brings the samples into the same order.</p>
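<p>The three steps described above (drop duplicates, keep shared samples, align the order) can be sketched in base R as follows. The helper name <code>match_samples</code> and the toy matrices are hypothetical and only illustrate the idea; they are not the package's actual interface.</p>

```r
# Hypothetical helper: the name and arguments are illustrative only.
match_samples <- function(...) {
  mats <- list(...)
  # 1. drop duplicated sample identifiers (keep the first occurrence)
  mats <- lapply(mats, function(x) x[!duplicated(rownames(x)), , drop = FALSE])
  # 2. identifiers that appear in every matrix
  common <- Reduce(intersect, lapply(mats, rownames))
  # 3. bring all matrices into the same row order
  lapply(mats, function(x) x[common, , drop = FALSE])
}

# Toy data: "s2" is duplicated in Y, "s4" only appears in X
Y <- matrix(1:8, nrow = 4,
            dimnames = list(c("s1", "s2", "s2", "s3"), c("e1", "e2")))
X <- matrix(1:6, nrow = 3,
            dimnames = list(c("s3", "s1", "s4"), c("snp1", "snp2")))
out <- match_samples(Y = Y, X = X)
```

<p>After this call, both <code>out$Y</code> and <code>out$X</code> contain the samples <code>s1</code> and <code>s3</code>, in that order.</p>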
<p>The <span class="math inline">\(n \times q\)</span> matrix <span class="math inline">\(\boldsymbol{Y}\)</span> represents the exons, and the <span class="math inline">\(n \times p_{chr}\)</span> matrices <span class="math inline">\(\boldsymbol{X}_{chr}\)</span> represent the SNPs, where <span class="math inline">\(chr \in \{1,\ldots,22\}\)</span>. The row names contain the sample identifiers, and the column names indicate the genomic location of the variables.</p>
<p>This function adjusts RNA-seq expression data for different library sizes. The <span class="math inline">\(n \times q\)</span> matrix <span class="math inline">\(\boldsymbol{Y}\)</span> contains the exon data. The library sizes are <span class="math inline">\(\boldsymbol{s}=(s_1,\ldots,s_n)^T\)</span>, where <span class="math inline">\(s_i=\sum_{j=1}^q Y_{ij}\)</span> for all <span class="math inline">\(i\)</span>. The mean library size is <span class="math inline">\(\bar{s}=\sum_{i=1}^n s_i / n\)</span>. We use edgeR to compute the normalisation factors <span class="math inline">\(\boldsymbol{\eta}=(\eta_1,\ldots,\eta_n)^T\)</span>. We then calculate the adjusted normalisation factors <span class="math inline">\(\boldsymbol{\gamma}=(\gamma_1,\ldots,\gamma_n)^T\)</span>, where <span class="math inline">\(\gamma_i=\eta_i \cdot s_i / \bar{s}\)</span> for all <span class="math inline">\(i\)</span>. The adjusted value equals <span class="math inline">\(Y_{ij}/\gamma_i\)</span> for all samples <span class="math inline">\(i\)</span> and all exons <span class="math inline">\(j\)</span>.</p>
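<p>A minimal base R sketch of this adjustment is given below. The function name <code>adjust_library_size</code> is hypothetical, and the normalisation factors <code>eta</code> are assumed to have been computed beforehand (the text uses edgeR for this); here they default to one so that the example runs without additional packages.</p>

```r
# Hypothetical helper; 'eta' stands in for normalisation factors that
# would be computed separately (the text uses edgeR for this step).
adjust_library_size <- function(Y, eta = rep(1, nrow(Y))) {
  s <- rowSums(Y)              # library sizes s_i (one per sample)
  gamma <- eta * s / mean(s)   # adjusted normalisation factors gamma_i
  Y / gamma                    # row i is divided by gamma_i
}

# Toy data: two samples (rows), two exons (columns)
Y <- matrix(c(10, 20, 30, 40), nrow = 2,
            dimnames = list(c("s1", "s2"), c("e1", "e2")))
A <- adjust_library_size(Y)
```

<p>With all normalisation factors equal to one, every adjusted library size equals the mean library size, which is a quick sanity check on the formula.</p>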
<p>This function adjusts exon expression data for different exon lengths. We do this separately for each chromosome to decrease memory usage. For this adjustment, we temporarily transform matrices to vectors. An <span class="math inline">\(n \times p\)</span> matrix becomes a vector of length <span class="math inline">\(n \times p\)</span>, with the first <span class="math inline">\(n\)</span> entries corresponding to covariate <span class="math inline">\(1\)</span> and samples <span class="math inline">\(1\)</span> to <span class="math inline">\(n\)</span>, and the last <span class="math inline">\(n\)</span> entries corresponding to covariate <span class="math inline">\(p\)</span> and samples <span class="math inline">\(1\)</span> to <span class="math inline">\(n\)</span>. Let the vector <span class="math inline">\(\boldsymbol{y}=(Y_{11},\ldots,Y_{n1} \boldsymbol{,} \ldots \boldsymbol{,} Y_{1q},\ldots,Y_{nq})^T\)</span> represent exon expression. Let <span class="math inline">\(\boldsymbol{\gamma}=(\gamma_1,\ldots,\gamma_1 \boldsymbol{,} \ldots \boldsymbol{,} \gamma_q,\ldots,\gamma_q)^T\)</span> represent exon lengths. And let <span class="math inline">\(\boldsymbol{k}=(k_1,\ldots,k_1 \boldsymbol{,} \ldots \boldsymbol{,} k_q,\ldots,k_q)^T\)</span> represent gene names. So, <span class="math inline">\(\boldsymbol{\gamma}\)</span> and <span class="math inline">\(\boldsymbol{k}\)</span> contain <span class="math inline">\(q\)</span> blocks of <span class="math inline">\(n\)</span> equal entries. We regress <span class="math inline">\(\boldsymbol{y}\)</span> (exon expression) on a fixed effect for <span class="math inline">\(\boldsymbol{\gamma}\)</span> (exon length) and a random effect for <span class="math inline">\(\boldsymbol{k}\)</span> (gene name). The residuals from this mixed model become our adjusted exon data.</p>
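<p>The matrix-to-vector layout can be illustrated with a small base R sketch. All object names and values below are made up for illustration. The mixed model itself is not fitted here; it could, for instance, be fitted with a mixed-model package such as lme4 (<code>lmer(y ~ gamma + (1 | k))</code>), which is an assumption about tooling rather than something stated in the text.</p>

```r
n <- 3  # samples
q <- 2  # exons

# Toy exon expression matrix (rows = samples, columns = exons)
Y <- matrix(c(5, 6, 7, 1, 2, 3), nrow = n, ncol = q,
            dimnames = list(paste0("sample", 1:n), c("exon1", "exon2")))
exon_length <- c(100, 250)        # hypothetical exon lengths
exon_gene <- c("geneA", "geneA")  # hypothetical gene names

# R stores matrices column by column, so as.numeric() yields
# (Y_11, ..., Y_n1, ..., Y_1q, ..., Y_nq)
long <- data.frame(
  y = as.numeric(Y),
  gamma = rep(exon_length, each = n),  # q blocks of n equal entries
  k = rep(exon_gene, each = n)         # q blocks of n equal entries
)

# Residuals of a model fitted on 'long' would be reshaped in the same
# way; here the raw values are reshaped to show the round trip.
Y_back <- matrix(long$y, nrow = n, ncol = q, dimnames = dimnames(Y))
```

<p>The first <code>n</code> rows of <code>long</code> belong to the first exon, and reshaping with <code>matrix()</code> restores the original layout.</p>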
<p>These functions select the variables for the spliceQTL test. First, we retrieve all protein-coding genes, excluding pseudogenes and other transcripts. Second, we attribute exons to genes, including exons within the gene. Third, we attribute SNPs to genes, including SNPs between (1) <span class="math inline">\(10\,000\)</span> base pairs before the start position of the gene, and (2) the end position of the gene. Although this might not occur in practice, exons or SNPs may be attributed to more than one gene. Finally, we exclude genes without any SNPs or with a single exon. It does not make sense to test whether these genes show alternative splicing.</p>
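<p>The attribution rules can be sketched in base R for a single chromosome. The gene, exon and SNP coordinates below are invented, and treating the interval boundaries as inclusive is an assumption of this sketch.</p>

```r
# Toy annotation for one chromosome (all coordinates are made up)
genes <- data.frame(gene = c("g1", "g2"),
                    start = c(50000, 120000),
                    end = c(60000, 125000))
exons <- data.frame(start = c(50000, 55000, 120000),
                    end = c(51000, 56000, 121000))
snp_pos <- c(39000, 41000, 55000, 60001, 111000, 125000)
window <- 10000  # base pairs before the gene start

idx <- seq_len(nrow(genes))

# Exons that lie within the gene
exons_by_gene <- lapply(idx, function(i) {
  which(exons$start >= genes$start[i] & exons$end <= genes$end[i])
})

# SNPs between (gene start - window) and gene end (boundaries inclusive)
snps_by_gene <- lapply(idx, function(i) {
  which(snp_pos >= genes$start[i] - window & snp_pos <= genes$end[i])
})
names(exons_by_gene) <- names(snps_by_gene) <- genes$gene

# Keep genes with at least one SNP and at least two exons
keep <- lengths(snps_by_gene) > 0 & lengths(exons_by_gene) >= 2
```

<p>In this toy example, <code>g2</code> has two SNPs but only one exon, so only <code>g1</code> is retained for testing.</p>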
<p>We want to test for alternative splicing along the whole genome. We do not calculate <span class="math inline">\(p\)</span>-values from an asymptotic distribution, but estimate them by permutation. If we tested a single gene, we could use a large number of permutations and obtain a precise estimate. We need at least <span class="math inline">\(21\)</span> permutations (including the identity) to reach the <span class="math inline">\(5\%\)</span> significance level. If one or two of the resulting test statistics (the identity always counts as one) are at least as large as the one for the observed data, the estimated <span class="math inline">\(p\)</span>-value equals <span class="math inline">\(1/21 \approx 0.0476\)</span> (<span class="math inline">\(<0.05\)</span>) or <span class="math inline">\(2/21 \approx 0.0952\)</span> (<span class="math inline">\(>0.05\)</span>), respectively. If we test multiple genes, we need more permutations to reach Bonferroni-significance. Using the same large number of permutations for every gene would be computationally too expensive. This is why we invest less in genes with large <span class="math inline">\(p\)</span>-values and more in genes with small <span class="math inline">\(p\)</span>-values. For each gene, we use between <span class="math inline">\(100\)</span> and <span class="math inline">\(p/0.05+1\)</span> permutations, where <span class="math inline">\(p\)</span> is the number of genes. From <span class="math inline">\(100\)</span> permutations onwards, we repeatedly check whether two or more test statistics (again counting the identity) are at least as large as the one for the observed data. If yes, we interrupt permutation for this gene, because it can no longer reach Bonferroni-significance. After all <span class="math inline">\(p/0.05+1\)</span> permutations, if one or two test statistics are at least as large as the one for the observed data, the Bonferroni-adjusted estimated <span class="math inline">\(p\)</span>-value equals <span class="math inline">\(0.05 \cdot p/(p+0.05)\)</span> (<span class="math inline">\(<0.05\)</span>) or <span class="math inline">\(0.1 \cdot p/(p+0.05)\)</span> (<span class="math inline">\(>0.05\)</span>), respectively. These values converge to <span class="math inline">\(0.05\)</span> and <span class="math inline">\(0.1\)</span> when <span class="math inline">\(p\)</span> tends to infinity. Bonferroni-significance requires between <span class="math inline">\(8\,000\)</span> and <span class="math inline">\(60\,000\)</span> permutations on the chromosome level, depending on the number of genes, and about <span class="math inline">\(400\,000\)</span> permutations on the genome level. We therefore adjust for multiple testing for each chromosome, and not for the whole genome.</p>
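<p>The stopping rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the function <code>adaptive_pvalue</code> and its arguments are hypothetical, the identity permutation is counted as one statistic that is at least as large as the observed one, and the check is performed after every permutation from the minimum onwards.</p>

```r
# Hypothetical sketch of adaptive permutation for one gene.
# stat_obs : observed test statistic
# stat_perm: function without arguments returning one permuted statistic
# n_genes  : number of genes tested (Bonferroni factor)
adaptive_pvalue <- function(stat_obs, stat_perm, n_genes,
                            alpha = 0.05, min_perm = 100) {
  max_perm <- round(n_genes / alpha) + 1  # includes the identity
  count <- 1  # the identity is at least as large as the observed statistic
  used <- 1   # permutations used so far (identity included)
  while (used < max_perm) {
    count <- count + (stat_perm() >= stat_obs)
    used <- used + 1
    # interrupt once Bonferroni-significance is out of reach
    if (used >= min_perm && count >= 2) break
  }
  p_value <- count / used
  list(n_perm = used, p_value = p_value,
       p_adjusted = min(1, p_value * n_genes))
}

# Deterministic toy cases with 10 genes (at most 201 permutations):
# permuted statistics never reach the observed one
strong <- adaptive_pvalue(stat_obs = 1, stat_perm = function() 0, n_genes = 10)
# permuted statistics always exceed the observed one
weak <- adaptive_pvalue(stat_obs = 1, stat_perm = function() 2, n_genes = 10)
```

<p>In the first case all <span class="math inline">\(201\)</span> permutations are used and the adjusted <span class="math inline">\(p\)</span>-value is <span class="math inline">\(10/201\)</span>, just below <span class="math inline">\(0.05\)</span>; in the second case permutation stops at the minimum of <span class="math inline">\(100\)</span>.</p>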
<a class="sourceLine" id="cb1-24" data-line-number="24"> y <-<span class="st"> </span><span class="kw">as.numeric</span>(wheat.Y[nsel,rep]) <span class="co"># try different phenotypes</span></a>
<a class="sourceLine" id="cb1-25" data-line-number="25"> X <-<span class="st"> </span>wheat.X[nsel,psel]</a>
<a class="sourceLine" id="cb1-29" data-line-number="29"> loss <-<span class="st"> </span><span class="kw"><a href="../reference/colasso_compare.html">colasso_compare</a></span>(<span class="dt">y=</span>y,<span class="dt">X=</span>X)</a>
<a class="sourceLine" id="cb2-14" data-line-number="14"> y <-<span class="st"> </span>y[cond]</a>
<a class="sourceLine" id="cb2-15" data-line-number="15"> X <-<span class="st"> </span>X[cond,]</a>
<a class="sourceLine" id="cb2-16" data-line-number="16"> loss <-<span class="st"> </span><span class="kw">rbind</span>(loss,<span class="kw"><a href="../reference/colasso.html">colasso</a></span>(<span class="dt">y=</span>y,<span class="dt">X=</span>X))</a>