Fractalis / fractalis, commit 421d2bc1
Authored Jun 18, 2018 by Sascha Herzinger
Added ETL readme
Parent: 212b372d
Pipeline #5435 failed with stages in 37 minutes and 13 seconds
Changes: 2, Pipelines: 1
fractalis/data/etl.py @ 421d2bc1
@@ -6,6 +6,7 @@ import logging
 import os
 from Cryptodome.Cipher import AES
+# noinspection PyProtectedMember
 from celery import Task
 from pandas import DataFrame
@@ -100,7 +101,7 @@ class ETL(Task, metaclass=abc.ABCMeta):
         """
         pass
-    def sanityCheck(self):
+    def sanity_check(self):
         """Check whether ETL is still sane and should be continued. E.g. if
         redis has been cleared it does not make sense to proceed. Raise an
         exception if not sane."""
@@ -170,14 +171,14 @@ class ETL(Task, metaclass=abc.ABCMeta):
         logger.info("Starting ETL process ...")
         logger.info("(E)xtracting data from server '{}'.".format(server))
         try:
-            self.sanityCheck()
+            self.sanity_check()
             raw_data = self.extract(server, token, descriptor)
         except Exception as e:
             logger.exception(e)
             raise RuntimeError("Data extraction failed. {}".format(e))
         logger.info("(T)ransforming data to Fractalis format.")
         try:
-            self.sanityCheck()
+            self.sanity_check()
             data_frame = self.transform(raw_data, descriptor)
             checker = IntegrityCheck.factory(self.produces)
             checker.check(data_frame)
@@ -190,7 +191,7 @@ class ETL(Task, metaclass=abc.ABCMeta):
logging
.
error
(
error
,
exc_info
=
1
)
raise
TypeError
(
error
)
try
:
self
.
sanity
C
heck
()
self
.
sanity
_c
heck
()
if
encrypt
:
self
.
secure_load
(
data_frame
,
file_path
)
else
:
...
...
fractalis/data/etls/README.md (new file, 0 → 100644) @ 421d2bc1
### About
This page contains instructions on how data are loaded into Fractalis.
### General
First, it is important to understand that Fractalis, unlike other analytical
platforms, does not have a persistent database in the traditional sense.
Data are "imported" on-demand into the analysis cache. Whether that happens via
REST API, some sort of data stream, or file import is entirely up to the
MicroETL.
### MicroETLs
MicroETLs in Fractalis are submittable jobs that are responsible for the data
(E)xtraction from the target service, the (T)ransformation into an internal
standard format, and the (L)oading into the analysis cache. MicroETLs can be
very simple or very complex, depending on how easy it is to extract data into
a workable format, but a basic implementation should generally take only a few
hours. The
[Ada Integer ETL](https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/fractalis/data/etls/ada/etl_integer.py)
is a good example of a simple MicroETL.
### Implementation
There are very few restrictions on what a MicroETL should look like. It is
entirely up to you how to extract data from the service you want
to support. If your service offers a REST API, we recommend using the Python
requests module. Nothing stops you from directly accessing the database or some
files, though. Inspiration can be found
[here](https://git-r3lab.uni.lu/Fractalis/fractalis/tree/master/fractalis/data/etls).
The only real requirement is that your MicroETL must inherit from the
[ETL class](https://git-r3lab.uni.lu/Fractalis/fractalis/blob/master/fractalis/data/etl.py).
This class is responsible for making your MicroETL a submittable celery job and
for ensuring that it produces the correct internal format, among other things.
You don't have to understand the ETL class in order to inherit from it. It is
designed in a way that you should always get a readable error if you do something
wrong. It won't hurt to have a look, though.
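
As a rough illustration of the inheritance requirement, here is a minimal sketch. It uses a hypothetical stand-in base class (`ETLBase`) instead of the real `fractalis.data.etl.ETL`, so it runs without Fractalis, celery, or pandas. Only the call shapes `extract(server, token, descriptor)` and `transform(raw_data, descriptor)` are taken from `etl.py`; the class names, the in-memory "service", and the `id`/`feature`/`value` row layout are assumptions.

```python
import abc


class ETLBase(abc.ABC):
    """Hypothetical stand-in for fractalis.data.etl.ETL (not the real class)."""

    @abc.abstractmethod
    def extract(self, server, token, descriptor):
        """Fetch raw data for the requested descriptor from the server."""

    @abc.abstractmethod
    def transform(self, raw_data, descriptor):
        """Convert raw data into the internal format."""


# Assumption: a dict standing in for a remote service, keyed by server URL.
FAKE_SERVICE = {
    "https://localhost": {"Age": {"patient-1": 42, "patient-2": 37}},
}


class FakeIntegerETL(ETLBase):
    """Illustrative MicroETL; a real one would e.g. call a REST API."""

    def extract(self, server, token, descriptor):
        # A real implementation would authenticate with `token` here.
        return FAKE_SERVICE[server][descriptor["field"]]

    def transform(self, raw_data, descriptor):
        # The real ETL works with a pandas DataFrame; plain dicts are used here.
        return [
            {"id": key, "feature": descriptor["field"], "value": value}
            for key, value in sorted(raw_data.items())
        ]


etl = FakeIntegerETL()
descriptor = {"data-type": "integer", "field": "Age"}
raw = etl.extract("https://localhost", "123345", descriptor)
rows = etl.transform(raw, descriptor)
print(rows)
```

Because both methods are abstract in the stand-in, forgetting one of them fails at instantiation time, which is in the spirit of the "readable error" behaviour described above.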
### Variables
They are all over the ETL code: `descriptor` (dict), `handler` (str), `server` (str), `auth` (dict).
It is up to the front-end to decide what they contain. They are used like this:
- Fractalis decides which MicroETL group/handler to use based on the `handler` (e.g. `ada`)
- The MicroETLs in that group decide whether they can handle the request based
  on the information in `descriptor` (e.g. `{'data-type': 'image', ...}`)
- The data are extracted from the `server` (e.g. `https://localhost`)
- The ETL authenticates itself using `auth` (e.g. `{'token': 123345}`)
- The ETL decides what to download from the server based on the `descriptor`
  (e.g. `{..., 'field': 'Age'}`)
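
The selection steps above could look roughly like the following sketch. The registry, the `can_handle` method, and the ETL class names are hypothetical and chosen for illustration only; they are not the Fractalis API.

```python
class AdaETLSketch:
    """Hypothetical base for all MicroETLs in the 'ada' group."""

    handler = "ada"
    data_type = None  # set by subclasses

    @classmethod
    def can_handle(cls, handler, descriptor):
        # Step 1: the handler selects the group.
        # Step 2: the descriptor selects the MicroETL within that group.
        return (handler == cls.handler
                and descriptor.get("data-type") == cls.data_type)


class AdaIntegerETLSketch(AdaETLSketch):
    data_type = "integer"


class AdaImageETLSketch(AdaETLSketch):
    data_type = "image"


# Assumption: a flat list of known MicroETL classes.
REGISTRY = [AdaIntegerETLSketch, AdaImageETLSketch]


def select_etl(handler, descriptor):
    """Return the first registered MicroETL that accepts the request."""
    for etl_class in REGISTRY:
        if etl_class.can_handle(handler, descriptor):
            return etl_class
    raise NotImplementedError(
        "No MicroETL for handler '{}' and descriptor {}".format(
            handler, descriptor))


chosen = select_etl("ada", {"data-type": "image", "field": "Scan"})
print(chosen.__name__)
```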
### Internal Formats
Fractalis technically supports all formats. Yes, all. On a very basic level, Fractalis
is a distributed job framework with MicroETLs that executes Python/R scripts on
extracted data. Nothing stops you from loading brain image data, genomic data, or
financial data into Fractalis and coding a visualisation for it. That doesn't mean you
should, though. There are two factors that should be taken into account:
1. **The data size.** It wouldn't be a good idea to move 50GB of genomic data into
   Fractalis on a regular basis, although it is not impossible.
   Instead you might want to consider connecting analyses or ETLs with other systems
   like [Hail](https://github.com/hail-is/hail) to merge analysis results or data
   from different sources into a single visualisation.
2. **How much time you want to spend coding your own visualisation.** You can of
   course import financial or weather data into Fractalis, but you will likely not
   profit much from the existing analysis scripts or visualisations. Fractalis' focus
   is explorative analysis in the field of translational research, so you should
   consider this when thinking about adding a new format.

TL;DR: To see which formats are currently used and how they are defined, please
look at the
[integrity check modules](https://git-r3lab.uni.lu/Fractalis/fractalis/tree/master/fractalis/data/integrity).
If you want to add a new data type, this is the only place you have to touch.
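
To give an idea of what defining a format through an integrity check might involve, here is a minimal sketch. The `factory(...)` and `check(...)` call shapes mirror how `etl.py` uses `IntegrityCheck`; everything else (the class names, the `numerical` type name, the required `id`/`feature`/`value` keys, and the use of plain dicts instead of a pandas DataFrame) is an assumption made for illustration.

```python
class IntegrityCheckSketch:
    """Hypothetical stand-in for the real integrity check base class."""

    data_type = None  # set by subclasses

    @staticmethod
    def factory(data_type):
        # Pick the check whose data_type matches what the ETL produces.
        for check_class in IntegrityCheckSketch.__subclasses__():
            if check_class.data_type == data_type:
                return check_class()
        raise NotImplementedError(
            "No integrity check for data type '{}'".format(data_type))

    def check(self, rows):
        raise NotImplementedError


class NumericalCheckSketch(IntegrityCheckSketch):
    data_type = "numerical"
    required_keys = {"id", "feature", "value"}  # assumed layout

    def check(self, rows):
        for row in rows:
            if set(row) != self.required_keys:
                raise ValueError("Unexpected keys: {}".format(sorted(row)))
            value = row["value"]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("Non-numerical value: {!r}".format(value))


checker = IntegrityCheckSketch.factory("numerical")
checker.check([{"id": "patient-1", "feature": "Age", "value": 42}])
print("data passed the integrity check")
```

In this sketch, adding a new data type means adding one more subclass, which matches the statement above that the integrity checks are the only place you have to touch.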
### FAQ
> Why is there no `load` method in the MicroETLs?
- There is, but you don't have to add it yourself. That's because the format
  returned by `transform` is an internal standard format (after passing integrity checks),
  so the loading step is the same for all MicroETLs of that type.
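
The idea behind this answer can be sketched as follows: two MicroETLs with different `transform` steps inherit one shared `load`. The class names, the `run` helper, and the JSON file format are assumptions for illustration; the real loading code lives in the ETL class.

```python
import json
import os
import tempfile


class SharedLoadETLSketch:
    """Hypothetical base class that owns the loading step."""

    def transform(self, raw_data):
        raise NotImplementedError

    def load(self, rows, file_path):
        # Shared by every subclass: rows are already in the standard format.
        with open(file_path, "w") as handle:
            json.dump(rows, handle)

    def run(self, raw_data, file_path):
        self.load(self.transform(raw_data), file_path)


class DictSourceETL(SharedLoadETLSketch):
    def transform(self, raw_data):
        # Source delivers a mapping of id -> value.
        return [{"id": k, "value": v} for k, v in sorted(raw_data.items())]


class PairSourceETL(SharedLoadETLSketch):
    def transform(self, raw_data):
        # Source delivers (id, value) pairs in arbitrary order.
        return [{"id": k, "value": v} for k, v in sorted(raw_data)]


with tempfile.TemporaryDirectory() as tmp_dir:
    path_a = os.path.join(tmp_dir, "a.json")
    path_b = os.path.join(tmp_dir, "b.json")
    DictSourceETL().run({"p1": 1, "p2": 2}, path_a)
    PairSourceETL().run([("p2", 2), ("p1", 1)], path_b)
    with open(path_a) as file_a, open(path_b) as file_b:
        loaded_a = json.load(file_a)
        loaded_b = json.load(file_b)

print(loaded_a == loaded_b)
```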