
Commit bd11feca authored by Sascha Herzinger

Merge branch 'beta'

parents b64b5cf8 45988b82
variables:
PYPI_USER: SECURE
PYPI_PASS: SECURE
DOCKER_USER: SECURE
DOCKER_PASS: SECURE
PYPI_USER: secure
PYPI_PASS: secure
DOCKER_USER: secure
DOCKER_PASS: secure
DOCKER_DRIVER: overlay2
before_script:
......@@ -62,7 +62,7 @@ test:
> /config.py
&& export FRACTALIS_CONFIG=/config.py
&& celery worker -D -A fractalis:celery -l debug --concurrency=1
&& pytest --color=yes --verbose --capture=no --cov=/usr/lib/python3.4/site-packages/fractalis tests/
&& pytest --color=yes --verbose --capture=no --cov=/usr/lib/python3.6/site-packages/fractalis tests/
"
dependencies:
- build:image
......
......@@ -9,7 +9,7 @@ Please have a look at this playlist to see a demo of the visual aspects of Fract
### Installation (Docker)
The easiest and most convenient way to deploy Fractalis is using Docker.
All necessary information can be found in the docker folder.
All necessary information can be found [here](docker).
### Installation (Manual)
If you do not want to use Docker, or you want a higher level of control over the individual components, that's fine. In fact, it isn't difficult to set up Fractalis manually:
......@@ -21,18 +21,18 @@ If you do not want to use docker or want a higher level of control of the severa
- Run and expose the Fractalis web service with whatever tools you want. We recommend **gunicorn** and **nginx**, but others should work, too.
- Run the celery workers on any machine that you want within the same network. (For a simple setup this can be the very same machine that the web service runs on).
Note: The [docker-compose.yml](https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/docker/docker-compose.yml) describes how the different services are started and how they connect with each other.
Note: The [docker-compose.yml](docker/docker-compose.yml) describes how the different services are started and how they connect with each other.
### Configuration
Use the environment variable `FRACTALIS_CONFIG` to define the configuration file path.
The file it points to must a) be a valid Python file (.py) and b) be available on all instances that host a Fractalis web service or a Fractalis worker.
Tip: Use the [default settings](https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/fractalis/config.py) as an example for your own configuration file.
Tip: Use the [default settings](fractalis/config.py) as an example for your own configuration file.
Please note that this file combines [Flask settings](http://flask.pocoo.org/docs/0.12/config/), [Celery settings](http://docs.celeryproject.org/en/latest/userguide/configuration.html), and Fractalis settings, all of which are listed and documented within it.
Please don't override default settings unless you know what you are doing; doing so can have severe security implications or cause Fractalis to stop working correctly.
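Because the configuration file is plain Python, a minimal sketch could look like the following. `SECRET_KEY` and `CELERY_BROKER_URL` are standard Flask/Celery setting names; `FRACTALIS_DATA_LIFETIME` is a hypothetical placeholder, so take the real Fractalis setting names from the default config:

```python
# config.py -- illustrative sketch only, not the documented defaults.
SECRET_KEY = 'change-me-in-production'          # standard Flask setting
CELERY_BROKER_URL = 'redis://localhost:6379/0'  # standard Celery setting
FRACTALIS_DATA_LIFETIME = 60 * 60 * 24          # hypothetical placeholder, seconds
```

Point every web and worker host at the file with `export FRACTALIS_CONFIG=/path/to/config.py`.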
### Add support for new services
Support for other services is exclusively implemented within [this folder](https://git-r3lab.uni.lu/Fractalis/fractalis/tree/master/fractalis/data/etls). We recommend looking at the *ada* example implementation. Just make sure you use the class inheritance (ETL, ETLHandler) correctly, and you will get readable error messages if something goes wrong.
Please refer to [this document](fractalis/data).
### Citation
A manuscript is in preparation.
......@@ -7,12 +7,11 @@ RUN mkdir /app/ \
&& printf "[lcsb-base]\nname=LCSB Base Repo\nbaseurl=http://lcsb-cent-mirr-server.uni.lu/CentOS/7/os/x86_64/\nenabled=1" > /etc/yum.repos.d/lcsb-base.repo \
&& rm -rf /var/cache/yum \
&& yum clean all \
&& yum install --nogpg -y python34 python34-pip python34-devel readline-devel R wget \
&& wget https://bioconductor.org/packages/release/bioc/src/contrib/limma_3.36.1.tar.gz \
&& R CMD INSTALL limma_*.tar.gz \
&& rm limma_*.tar.gz
&& yum install --nogpg -y https://centos7.iuscommunity.org/ius-release.rpm \
&& yum install --nogpg -y python36u python36u-pip python36u-devel readline-devel libcurl-devel libxml2-devel R wget \
&& R -e 'source("https://bioconductor.org/biocLite.R"); biocLite(); biocLite(c("limma", "DESeq2"))'
COPY tests/ /app/tests/
WORKDIR /app/
ARG SDIST
COPY $SDIST /app/
RUN pip3 install -i https://pypi.lcsb.uni.lu/simple fractalis-*.tar.gz gunicorn
\ No newline at end of file
RUN pip3.6 install -i https://pypi.lcsb.uni.lu/simple fractalis-*.tar.gz gunicorn
......@@ -2,16 +2,44 @@
This folder contains all files necessary to setup the Fractalis service in a production environment.
### Usage
`docker-compose up`. That's all! This will expose the service on ports 80 and 443 by default.
This behavior can be changed by setting the environment variables `FRACTALIS_HTTP_PORT` and `FRACTALIS_HTTPS_PORT`.
For more detailed information, please look into the files themselves. They are rather self-explanatory and a good place to make your own modifications.
We assume that Docker and Docker-Compose are already installed on the system and
are up-to-date. It is possible to check this by opening a terminal and running
the following commands:
```
> docker --version
Docker version 18.03.0-ce, build 0520e24
> docker-compose --version
docker-compose version 1.20.1, build 5d8c71b
```
If these commands fail, or if the versions are much older than the ones displayed
above, please consult https://docs.docker.com/install/ and
https://docs.docker.com/compose/install/.
If Docker is properly installed on your system, please run the following commands:
```
> git clone https://git-r3lab.uni.lu/Fractalis/fractalis.git
> cd fractalis/docker
> docker-compose up
```
The last command might require root access to connect with the Docker engine.
Depending on your network connection, this step will take a few minutes. Once
all the services are up and running you can open Chrome, Firefox, or Safari and
navigate to `http://localhost` or, if you use docker-machine, to http://`docker-machine ip`.
If you see the
Fractalis welcome screen, your system just became a Fractalis node that can be used
for statistical computation, as long as Docker is running. If this fails for you,
make sure Docker is properly installed and that ports 80 and 443 are not in use by
other services on your system. If they are, either stop those services or use the
environment variables `FRACTALIS_HTTP_PORT` and `FRACTALIS_HTTPS_PORT` to change
the ports used by Fractalis.
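For example, assuming ports 80 and 443 are already taken, the stack could be started on alternative ports (the port numbers below are arbitrary examples):

```shell
# use non-default ports, e.g. when 80/443 are occupied
export FRACTALIS_HTTP_PORT=8080
export FRACTALIS_HTTPS_PORT=8443
docker-compose up
```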
### Configuration (Fractalis / Celery / Flask)
1. Modify [docker/fractalis/config.py](https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/docker/config/fractalis/config.py) before running `docker-compose up`.
1. Modify [docker/fractalis/config.py](config/fractalis/config.py) before running `docker-compose up`.
2. Replace the [dummy certificates](https://git-r3lab.uni.lu/Fractalis/fractalis/tree/master/docker/config/nginx/certs) with your own. The included ones are for development purposes only.
2. Replace the [dummy certificates](config/nginx/certs) with your own. The included ones are for development purposes only.
Tip: Use the [default settings](https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/fractalis/config.py) as an example for your own configuration file.
Tip: Use the [default settings](../fractalis/config.py) as an example for your own configuration file.
Please note that this file combines [Flask settings](http://flask.pocoo.org/docs/0.12/config/), [Celery settings](http://docs.celeryproject.org/en/latest/userguide/configuration.html), and Fractalis settings, all of which are listed and documented within it.
Please don't override default settings unless you know what you are doing; doing so can have severe security implications or cause Fractalis to stop working correctly.
......
......@@ -68,12 +68,13 @@
<br/>
<br/>
<div class="links">
<div><a href="demo.html"><i class="material-icons">apps</i></a><span>Demo</span></div>
<div><a href="https://www.youtube.com/playlist?list=PLNvp9GB9uBmH1NNAf-qTyj_jN2aCPISFU"><i class="material-icons">video_library</i></a><span>Videos</span></div>
<div><a href="demo.html"><i class="material-icons">apps</i></a><span>Hands-on</span></div>
<div><a href="https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/README.md"><i class="material-icons">library_books</i></a><span>Fractalis</span></div>
<div><a href="https://git-r3lab.uni.lu/Fractalis/fractal.js/blob/master/README.md"><i class="material-icons">library_books</i></a><span>Fractal.js</span></div>
<div><a href="https://git-r3lab.uni.lu/Fractalis"><i class="material-icons">code</i></a><span>Repository</span></div>
<div><a href="https://www.apache.org/licenses/LICENSE-2.0"><i class="material-icons">book</i></a><span>Apache 2.0</span></div>
<div><a href="mailto:sascha.herzinger@uni.lu"><i class="material-icons">mail</i></a><span>Contact</span></div>
</div>
<footer>
<a href="http://lcsb.uni.lu"><img class="footer-logo" src="LCSB_UL_Logo.png"/></a>
......
......@@ -8,8 +8,7 @@ import numpy as np
import scipy.stats
from fractalis.analytics.task import AnalyticTask
from fractalis.analytics.tasks.shared.utils import \
apply_subsets, apply_categories
from fractalis.analytics.tasks.shared import utils
T = TypeVar('T')
......@@ -25,6 +24,7 @@ class BoxplotTask(AnalyticTask):
features: List[pd.DataFrame],
categories: List[pd.DataFrame],
id_filter: List[T],
transformation: str,
subsets: List[List[T]]) -> dict:
""" Compute boxplot statistics for the given parameters.
:param features: List of numerical features
......@@ -32,6 +32,7 @@ class BoxplotTask(AnalyticTask):
features.
:param id_filter: List of ids that will be considered for analysis. If
empty all ids will be used.
:param transformation: Transformation that will be applied to the data.
:param subsets: List of subsets used as another way to group the
numerical features.
"""
......@@ -40,10 +41,12 @@ class BoxplotTask(AnalyticTask):
"non empty numerical feature.")
# merge dfs into single one
df = reduce(lambda l, r: l.append(r), features)
df = utils.apply_transformation(df=df, transformation=transformation)
df.dropna(inplace=True)
if id_filter:
df = df[df['id'].isin(id_filter)]
df = apply_subsets(df=df, subsets=subsets)
df = apply_categories(df=df, categories=categories)
df = utils.apply_subsets(df=df, subsets=subsets)
df = utils.apply_categories(df=df, categories=categories)
df['outlier'] = None
results = {
'statistics': {},
......@@ -60,6 +63,7 @@ class BoxplotTask(AnalyticTask):
(df['feature'] == feature)]['value'].tolist()
if len(values) < 2:
continue
# FIXME: v This is ugly. Look at kaplan_meier_survival.py
label = '{}//{}//s{}'.format(feature, category, subset + 1)
group_values.append(values)
stats = self.boxplot_statistics(values)
......
"""Module containing the Celery Task for the Correlation Analysis."""
import logging
from typing import List, TypeVar
from typing import List
import pandas as pd
import numpy as np
from scipy import stats
from fractalis.analytics.task import AnalyticTask
from fractalis.analytics.tasks.shared.utils import \
apply_subsets, apply_categories
from fractalis.analytics.tasks.shared import utils
logger = logging.getLogger(__name__)
T = TypeVar('T')
class CorrelationTask(AnalyticTask):
......@@ -25,9 +23,9 @@ class CorrelationTask(AnalyticTask):
def main(self,
x: pd.DataFrame,
y: pd.DataFrame,
id_filter: List[T],
id_filter: List[str],
method: str,
subsets: List[List[T]],
subsets: List[List[str]],
categories: List[pd.DataFrame]) -> dict:
"""Compute correlation statistics for the given parameters.
:param x: DataFrame containing x axis values.
......@@ -48,11 +46,12 @@ class CorrelationTask(AnalyticTask):
raise ValueError("Unknown method '{}'".format(method))
df = self.merge_x_y(x, y)
(x_label, y_label) = (df['feature_x'][0], df['feature_y'][0])
x_label = list(df['feature_x'])[0]
y_label = list(df['feature_y'])[0]
if id_filter:
df = df[df['id'].isin(id_filter)]
df = apply_subsets(df=df, subsets=subsets)
df = apply_categories(df=df, categories=categories)
df = utils.apply_subsets(df=df, subsets=subsets)
df = utils.apply_categories(df=df, categories=categories)
global_stats = self.compute_stats(df, method)
output = global_stats
output['method'] = method
......
......@@ -7,8 +7,7 @@ import logging
import pandas as pd
from fractalis.analytics.task import AnalyticTask
from fractalis.analytics.tasks.heatmap.stats import StatisticTask
from fractalis.analytics.tasks.shared import utils
from fractalis.analytics.tasks.shared import utils, array_stats
T = TypeVar('T')
......@@ -20,12 +19,12 @@ class HeatmapTask(AnalyticTask):
submittable celery task."""
name = 'compute-heatmap'
stat_task = StatisticTask()
def main(self, numerical_arrays: List[pd.DataFrame],
numericals: List[pd.DataFrame],
categoricals: List[pd.DataFrame],
ranking_method: str,
params: dict,
id_filter: List[T],
max_rows: int,
subsets: List[List[T]]) -> dict:
......@@ -58,9 +57,13 @@ class HeatmapTask(AnalyticTask):
for i in range(df.shape[0])]
z_df = pd.DataFrame(z_df, columns=df.columns, index=df.index)
method = 'limma'
if ranking_method in ['mean', 'median', 'variance']:
method = ranking_method
# compute statistic for ranking
stats = self.stat_task.main(df=df, subsets=subsets,
ranking_method=ranking_method)
stats = array_stats.get_stats(df=df, subsets=subsets,
params=params,
ranking_method=method)
# sort by ranking_value
self.sort(df, stats[ranking_method], ranking_method)
......
"""This module provides row-wise statistics for the heat map."""
from copy import deepcopy
from typing import List, TypeVar
import logging
import pandas as pd
from rpy2 import robjects as R
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr
from fractalis.analytics.task import AnalyticTask
T = TypeVar('T')
importr('limma')
pandas2ri.activate()
logger = logging.getLogger(__name__)
class StatisticTask(AnalyticTask):
name = 'expression-stats-task'
def main(self, df: pd.DataFrame, subsets: List[List[T]],
ranking_method: str) -> pd.DataFrame:
if ranking_method == 'mean':
stats = self.get_mean_stats(df)
elif ranking_method == 'median':
stats = self.get_median_stats(df)
elif ranking_method == 'variance':
stats = self.get_variance_stats(df)
else:
stats = self.get_limma_stats(df, subsets)
return stats
@staticmethod
def get_mean_stats(df: pd.DataFrame) -> pd.DataFrame:
means = [row.mean() for row in df.values]
stats = pd.DataFrame(means, columns=['mean'])
stats['feature'] = df.index
return stats
@staticmethod
def get_median_stats(df: pd.DataFrame) -> pd.DataFrame:
medians = df.median(axis=1).tolist()
stats = pd.DataFrame(medians, columns=['median'])
stats['feature'] = df.index
return stats
@staticmethod
def get_variance_stats(df: pd.DataFrame) -> pd.DataFrame:
variances = [row.var() for row in df.values]
stats = pd.DataFrame(variances, columns=['var'])
stats['feature'] = df.index
return stats
@staticmethod
def get_limma_stats(df: pd.DataFrame,
subsets: List[List[T]]) -> pd.DataFrame:
"""Use the R bioconductor package 'limma' to perform a differential
gene expression analysis on the given data frame.
:param df: Matrix of measurements where each column represents a sample
and each row a gene/probe.
:param subsets: Groups to compare with each other.
:return: Results of limma analysis. More than 2 subsets will result in
a differently structured result data frame. See ?topTableF in R.
"""
# prepare the df in case an id exists in more than one subset
if len(subsets) < 2:
error = "Limma analysis requires at least " \
"two non-empty groups for comparison."
logger.error(error)
raise ValueError(error)
if df.shape[0] < 1 or df.shape[1] < 2:
error = "Limma analysis requires a " \
"data frame with dimension 1x2 or more."
logger.error(error)
raise ValueError(error)
flattened_subsets = [x for subset in subsets for x in subset]
df = df[flattened_subsets]
ids = list(df)
features = df.index
# creating the design vector according to the subsets
design_vector = [''] * len(ids)
subsets_copy = deepcopy(subsets)
for i, id in enumerate(ids):
for j, subset in enumerate(subsets_copy):
try:
subset.index(id) # raises an Exception if not found
subset.remove(id)
design_vector[i] = str(j + 1)
break
except ValueError:
assert j != len(subsets_copy) - 1
assert '' not in design_vector
# create group names
groups = ['group{}'.format(i + 1) for i in list(range(len(subsets)))]
# create a string for each pairwise comparison of the groups
comparisons = []
for i in reversed(range(len(subsets))):
for j in range(i):
comparisons.append('group{}-group{}'.format(i+1, j+1))
# fitting according to limma doc Chapter 8: Linear Models Overview
r_form = R.Formula('~ 0+factor(c({}))'.format(','.join(design_vector)))
r_design = r['model.matrix'](r_form)
r_design.colnames = R.StrVector(groups)
r_data = pandas2ri.py2ri(df)
# the next two lines are necessary if column ids are not unique,
# because the python to r transformation drops those columns otherwise
r_ids = R.StrVector(['X{}'.format(id) for id in ids])
r_data = r_data.rx(r_ids)
r_fit = r['lmFit'](r_data, r_design)
r_contrast_matrix = r['makeContrasts'](*comparisons, levels=r_design)
r_fit_2 = r['contrasts.fit'](r_fit, r_contrast_matrix)
r_fit_2 = r['eBayes'](r_fit_2)
r_results = r['topTable'](r_fit_2, number=float('inf'),
sort='none', genelist=features)
results = pandas2ri.ri2py(r_results)
# let's give the gene list column an appropriate name
colnames = results.columns.values
colnames[0] = 'feature'
results.columns = colnames
return results
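The design-vector and pairwise-contrast construction above is pure Python (no R objects involved) and can be sketched in isolation. `build_design_and_contrasts` is our own illustrative name, not part of the module:

```python
from copy import deepcopy

def build_design_and_contrasts(ids, subsets):
    """Assign each sample id to a group and name all pairwise contrasts.

    Illustrative re-implementation of the pure-Python part of
    get_limma_stats; the R fitting steps are deliberately left out.
    """
    design_vector = [''] * len(ids)
    pools = deepcopy(subsets)  # consume memberships so duplicate ids work
    for i, sample_id in enumerate(ids):
        for j, subset in enumerate(pools):
            if sample_id in subset:
                subset.remove(sample_id)
                design_vector[i] = str(j + 1)
                break
    groups = ['group{}'.format(k + 1) for k in range(len(subsets))]
    comparisons = ['group{}-group{}'.format(i + 1, j + 1)
                   for i in reversed(range(len(subsets)))
                   for j in range(i)]
    return design_vector, groups, comparisons

design, groups, comps = build_design_and_contrasts(
    ['s1', 's2', 's3'], [['s1', 's2'], ['s3']])
# design -> ['1', '1', '2'], comps -> ['group2-group1']
```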
"""This module provides statistics for mRNA and miRNA data."""
from copy import deepcopy
from typing import List, TypeVar
from collections import OrderedDict
import logging
import pandas as pd
import numpy as np
from rpy2 import robjects as robj
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr
T = TypeVar('T')
importr('limma')
importr('DESeq2')
pandas2ri.activate()
logger = logging.getLogger(__name__)
def get_stats(df: pd.DataFrame, subsets: List[List[T]],
params: dict, ranking_method: str) -> pd.DataFrame:
if ranking_method == 'mean':
stats = get_mean_stats(df)
elif ranking_method == 'median':
stats = get_median_stats(df)
elif ranking_method == 'variance':
stats = get_variance_stats(df)
elif ranking_method == 'limma':
stats = get_limma_stats(df, subsets)
elif ranking_method == 'DESeq2':
stats = get_deseq2_stats(df, subsets, **params)
else:
error = "Unknown ranking method: {}".format(ranking_method)
logger.exception(error)
raise NotImplementedError(error)
return stats
def get_mean_stats(df: pd.DataFrame) -> pd.DataFrame:
means = np.mean(df, axis=1)
stats = pd.DataFrame(means, columns=['mean'])
stats['feature'] = df.index
return stats
def get_median_stats(df: pd.DataFrame) -> pd.DataFrame:
medians = np.median(df, axis=1)
stats = pd.DataFrame(medians, columns=['median'])
stats['feature'] = df.index
return stats
def get_variance_stats(df: pd.DataFrame) -> pd.DataFrame:
variances = np.var(df, axis=1)
stats = pd.DataFrame(variances, columns=['variance'])
stats['feature'] = df.index
return stats
def get_limma_stats(df: pd.DataFrame, subsets: List[List[T]]) -> pd.DataFrame:
"""Use the R bioconductor package 'limma' to perform a differential
gene expression analysis on the given data frame.
:param df: Matrix of measurements where each column represents a sample
and each row a gene/probe.
:param subsets: Groups to compare with each other.
:return: Results of limma analysis. More than 2 subsets will result in
a differently structured result data frame. See ?topTableF in R.
"""
logger.debug("Computing limma stats")
# prepare the df in case an id exists in more than one subset
if len(subsets) < 2:
error = "Limma analysis requires at least " \
"two non-empty groups for comparison."
logger.error(error)
raise ValueError(error)
if df.shape[0] < 1 or df.shape[1] < 2:
error = "Limma analysis requires a " \
"data frame with dimension 1x2 or more."
logger.error(error)
raise ValueError(error)
flattened_subsets = [x for subset in subsets for x in subset]
df = df[flattened_subsets]
ids = list(df)
features = df.index
# creating the design vector according to the subsets
design_vector = [''] * len(ids)
subsets_copy = deepcopy(subsets)
for i, sample_id in enumerate(ids):
for j, subset in enumerate(subsets_copy):
try:
subset.index(sample_id) # raises an Exception if not found
subset.remove(sample_id)
design_vector[i] = str(j + 1)
break
except ValueError:
assert j != len(subsets_copy) - 1
assert '' not in design_vector
# create group names
groups = ['group{}'.format(i + 1) for i in list(range(len(subsets)))]
# create a string for each pairwise comparison of the groups
comparisons = []
for i in reversed(range(len(subsets))):
for j in range(i):
comparisons.append('group{}-group{}'.format(i+1, j+1))
# fitting according to limma doc Chapter 8: Linear Models Overview
r_form = robj.Formula('~ 0+factor(c({}))'.format(','.join(design_vector)))
r_design = r['model.matrix'](r_form)
r_design.colnames = robj.StrVector(groups)
r_data = pandas2ri.py2ri(df)
# py2ri makes assumptions that can reorder or drop columns;
# these two lines restore the original column order
r_data.colnames = list(OrderedDict.fromkeys(ids))
r_data = r_data.rx(robj.StrVector(ids))
r_fit = r['lmFit'](r_data, r_design)
r_contrast_matrix = r['makeContrasts'](*comparisons, levels=r_design)
r_fit_2 = r['contrasts.fit'](r_fit, r_contrast_matrix)
r_fit_2 = r['eBayes'](r_fit_2)
r_results = r['topTable'](r_fit_2, number=float('inf'),
sort='none', genelist=features)
results = pandas2ri.ri2py(r_results)
# let's give the gene list column an appropriate name
colnames = results.columns.values
colnames[0] = 'feature'
results.columns = colnames
return results
# FIXME: Add more parameters (e.g. for filtering low count rows)
def get_deseq2_stats(df: pd.DataFrame,
subsets: List[List[T]],
min_total_row_count: int = 0) -> pd.DataFrame:
"""Use the R bioconductor package 'DESeq2' to perform a differential
expression analysis of count-like data (e.g. miRNA). See the package
documentation for more details.
:param df: Matrix of counts, where each column is a sample and each row
a feature.
:param subsets: The two subsets to compare with each other.
:param min_total_row_count: Drop rows that have fewer than
min_total_row_count reads in total.
:return: Results of the analysis in the form of a DataFrame (p, logFC, ...)
"""
logger.debug("Computing deseq2 stats")
if len(subsets) != 2:
error = "This method currently only supports exactly two " \
"subsets as this is the most common use case. Support " \
"for more subsets will be added later."
logger.exception(error)
raise ValueError(error)
# flatten subset
flattened_subsets = [x for subset in subsets for x in subset]
# discard columns that are not in a subset
df = df[flattened_subsets]
# filter rows with too few reads
total_row_counts = df.sum(axis=1)
keep = total_row_counts[total_row_counts >= min_total_row_count].index
df = df.loc[keep]
# pandas df -> R df
r_count_data = pandas2ri.py2ri(df)
# py2ri makes assumptions that can reorder or drop columns;
# these two lines restore the original column order
r_count_data.colnames = list(OrderedDict.fromkeys(flattened_subsets))
r_count_data = r_count_data.rx(robj.StrVector(flattened_subsets))
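The low-count row filter used by `get_deseq2_stats` can be exercised in isolation (`filter_low_counts` is our own name for illustration):

```python
import pandas as pd

def filter_low_counts(df: pd.DataFrame, min_total_row_count: int) -> pd.DataFrame:
    """Drop features whose total read count across samples is too low."""
    total_row_counts = df.sum(axis=1)
    keep = total_row_counts[total_row_counts >= min_total_row_count].index
    return df.loc[keep]

counts = pd.DataFrame({'s1': [0, 5, 100], 's2': [1, 5, 50]},
                      index=['f1', 'f2', 'f3'])
filtered = filter_low_counts(counts, min_total_row_count=10)
# keeps f2 (total 10) and f3 (total 150); drops f1 (total 1)
```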