Commit 75c75f3

Merge pull request #33 from Hector-hedb12/new-docs
New docs
2 parents 02d3bcd + 035c0fc commit 75c75f3

29 files changed

Lines changed: 776 additions & 1508 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -64,6 +64,10 @@ instance/
6464

6565
# Sphinx documentation
6666
docs/_build/
67+
docs/cardea.rst
68+
docs/cardea.*.rst
69+
docs/modules.rst
70+
docs/api
6771

6872
# PyBuilder
6973
target/

Makefile

Lines changed: 2 additions & 4 deletions
@@ -49,9 +49,7 @@ clean-pyc: ## remove Python file artifacts
4949
find . -name '__pycache__' -exec rm -fr {} +
5050

5151
clean-docs: ## remove previously built docs
52-
rm -f docs/cardea.rst
53-
rm -f docs/cardea.*.rst
54-
rm -f docs/modules.rst
52+
rm -f docs/api/*.rst
5553
$(MAKE) -C docs clean
5654

5755
clean-coverage: ## remove coverage artifacts
@@ -101,7 +99,7 @@ coverage: clean-coverage ## check code coverage quickly with the default Python
10199

102100

103101
docs: clean-docs ## generate Sphinx HTML documentation, including API docs
104-
sphinx-apidoc -o docs/ cardea
102+
sphinx-apidoc --module-first --separate --no-toc --output-dir docs/api/ cardea
105103
$(MAKE) -C docs html
106104
touch docs/_build/html/.nojekyll
107105

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
Advanced use
2+
============
3+
4+
How to define a new machine learning task?
5+
------------------------------------------
6+
7+
A new Machine Learning task can be defined in Cardea in four steps:
8+
9+
1. Go to the `problem_definition`_ directory and create a file with a class specifically for
10+
your problem. This class should extend the `ProblemDefinition`_ class and override
11+
the necessary attributes and methods. Usually, you should pay special
12+
attention to the ``generate_target_label(...)`` and ``generate_cutoff_times(...)`` methods,
13+
as you might need to extend or re-implement them in some cases.
14+
15+
2. Expose your new class definition in the `init`_ file inside the `problem_definition`_ directory.
16+
17+
3. If your dataset is in a different format than the one expected by Cardea (CSV files),
18+
you will need to provide a dataset-loading method for your data in the
19+
`EntitySetLoader`_ class, where you will be creating your collection of entities and
20+
relationships between them using the `featuretools.EntitySet`_ class.
21+
22+
4. Finally, update the `Cardea`_ class to support the new problem definition, so that the
23+
``Cardea.select_problem(...)`` method can instantiate the proper class when it is
24+
needed.
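
The first and last steps can be sketched as follows. This is only an illustration that runs without Cardea installed: the ``ProblemDefinition`` class below is a stand-in for the real base class, the ``LateArrival`` task, the record fields, and the ``PROBLEMS`` registry are hypothetical, and the method arguments are simplified assumptions rather than Cardea's actual signatures.

```python
from datetime import datetime


class ProblemDefinition:
    """Stand-in for Cardea's base class (assumption, not the real implementation)."""

    def generate_target_label(self, records):
        raise NotImplementedError

    def generate_cutoff_times(self, records):
        raise NotImplementedError


class LateArrival(ProblemDefinition):
    """Hypothetical task: predict whether a patient arrives late."""

    target_label = "late"

    def generate_target_label(self, records):
        # Derive the label from raw columns when the dataset does not provide it.
        return [dict(r, late=r["arrived"] > r["scheduled"]) for r in records]

    def generate_cutoff_times(self, records):
        # One (instance id, cutoff time, label) row per record.
        labeled = self.generate_target_label(records)
        return [(r["id"], r["scheduled"], r["late"]) for r in labeled]


# Hypothetical registry: maps a problem name to the class to instantiate.
PROBLEMS = {"LateArrival": LateArrival}

records = [
    {"id": 1, "scheduled": datetime(2019, 1, 7, 9, 0), "arrived": datetime(2019, 1, 7, 9, 20)},
    {"id": 2, "scheduled": datetime(2019, 1, 7, 10, 0), "arrived": datetime(2019, 1, 7, 9, 55)},
]
cutoffs = PROBLEMS["LateArrival"]().generate_cutoff_times(records)
```

Keeping the name-to-class mapping in one place is one way to let a method such as ``select_problem`` pick the right class from a string; the real `Cardea`_ class may organize this differently.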
25+
26+
Features, primitives and AutoML integration
27+
-------------------------------------------
28+
29+
Once you have defined your problem, following the four steps in the previous section, you will be
30+
able to perform featurization and run different primitives using the AutoML tool as follows:
31+
32+
.. code-block:: python
33+
34+
from cardea import Cardea
35+
cardea = Cardea()
36+
cardea.load_your_custom_data()
37+
problem = cardea.select_problem('YourCustomProblemDefinition')
38+
feature_matrix = cardea.generate_features(problem[:1000]) # a subset
39+
feature_matrix = feature_matrix.sample(frac=1) # shuffle
40+
y = list(feature_matrix.pop('label'))
41+
X = feature_matrix.values
42+
pipeline = [
43+
['sklearn.ensemble.RandomForestClassifier'],
44+
['sklearn.naive_bayes.MultinomialNB'],
45+
['sklearn.neighbors.KNeighborsClassifier']
46+
]
47+
result = cardea.execute_model(feature_matrix=X, target=y, primitives=pipeline)
48+
49+
50+
.. _featuretools.EntitySet: https://docs.featuretools.com/generated/featuretools.EntitySet.html#featuretools.EntitySet
51+
.. _problem_definition: https://github.com/D3-AI/Cardea/tree/master/cardea/problem_definition
52+
.. _ProblemDefinition: https://github.com/D3-AI/Cardea/blob/master/cardea/problem_definition/definition.py
53+
.. _init: https://github.com/D3-AI/Cardea/blob/master/cardea/problem_definition/__init__.py
54+
.. _EntitySetLoader: https://github.com/D3-AI/Cardea/blob/master/cardea/data_loader/entityset_loader.py#L9
55+
.. _Cardea: https://github.com/D3-AI/Cardea/blob/master/cardea/cardea.py

docs/basic_concepts/auditing.rst

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
Auditing
2+
========
3+
4+
One element that is essential to prediction problems is the evaluation of the prediction results,
5+
but this might come in various forms and users rely on different metrics to identify the best
6+
model for a specific problem. Commonly, some metrics might be more representative than others
7+
depending on the problem.
8+
9+
Therefore, to facilitate the auditing process, Cardea has two components designed specifically
10+
to cover both data and model auditing, given that prediction problems rely mainly on the data
11+
that is being used. While Cardea provides a set of metrics that can be used as default metrics
12+
for certain prediction problems, it also provides the means to expand them and allow users to
13+
introduce new kinds of metrics.
14+
15+
With Cardea, users can generate a summary report describing the data through
16+
the Data Auditor, improving their understanding of and engagement with the data. Although the system includes
17+
a set of predefined audits that are commonly applied in the literature, users can also specify special
18+
types of audits to apply to their dataset, using a dictionary of all the checks
19+
that must be reported.
20+
21+
These checks are divided into two categories: **data quality checks** and **data representation checks**. The
22+
data quality checks identify missing information in the data, while the data representation checks
23+
identify whether the data matches the user's assumptions.
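
To make the two categories concrete, here is a minimal sketch of a dictionary of checks applied to one column. The function names, the ``age`` range, and the sample values are hypothetical; this is not the Data Auditor's actual interface.

```python
def missing_fraction(column):
    """Data quality check: share of missing (None) values in a column."""
    return sum(value is None for value in column) / len(column)


def in_expected_range(column, low, high):
    """Data representation check: do all present values fall in the assumed range?"""
    return all(low <= value <= high for value in column if value is not None)


# Hypothetical dictionary of checks to report for an "age" column.
checks = {
    "missing_fraction": missing_fraction,
    "age_in_range": lambda column: in_expected_range(column, 0, 120),
}

ages = [34, None, 51, 130]
report = {name: check(ages) for name, check in checks.items()}
```

Here one of the four values is missing, and the value ``130`` violates the assumed range, so the report flags both a quality issue and a representation issue.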
24+
25+
Similarly, Cardea provides a full report to users describing the performance and behavior of the model with
26+
the `Model Auditor`_ component, aiming to give users more interpretability and understanding of the machine
27+
learning model.
28+
29+
Currently, prediction problems are categorized as regression or classification problems, and each of them
30+
has a wide range of metrics (e.g., accuracy, F1 score, precision, recall, and AUC for classification, and
31+
mean squared error, mean absolute error, and R-squared for regression).
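
For reference, several of these metrics can be computed in a few lines of plain Python. The sketch below uses the standard textbook definitions and made-up predictions; it is not the Model Auditor's code, and the function names are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error and R squared."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": ss_res / n,
        "mae": sum(abs(r) for r in residuals) / n,
        # R squared is undefined when every true value is identical.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


clf = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```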
32+
33+
Additionally, given that Cardea provides the ability to run different pipelines composed of different
34+
types of machine learning algorithms, the Model Auditor allows users to compare multiple prediction
35+
pipelines and evaluate changes in their behavior using different training and testing data sets.
36+
37+
.. _Model Auditor: https://github.com/HDI-Project/ModelAudit
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
Auto - Featurization
2+
====================
3+
4+
Cardea automatically generates features using the `Featuretools`_ package, specifically
5+
the `Deep Feature Synthesis (DFS)`_ algorithm, which generates a feature matrix from a given dataset.
6+
To fully automate this process, Cardea determines the focus values of the automated feature engineering
7+
task using the **target entity**, **cutoff times**, and **label** of the prediction problem.
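
The role of the cutoff time can be illustrated without Featuretools. The sketch below computes two aggregation features for a target patient using only the appointments recorded before a cutoff; the data, the feature names, and the strict "before" comparison are simplifying assumptions, not the DFS implementation.

```python
from datetime import datetime

# Hypothetical child entity: one row per appointment.
appointments = [
    {"patient": "p1", "time": datetime(2019, 1, 1), "no_show": True},
    {"patient": "p1", "time": datetime(2019, 2, 1), "no_show": False},
    {"patient": "p1", "time": datetime(2019, 3, 1), "no_show": True},
    {"patient": "p2", "time": datetime(2019, 1, 15), "no_show": False},
]


def features_at(patient, cutoff):
    """Aggregate a patient's history, ignoring anything at or after the cutoff."""
    history = [
        a for a in appointments
        if a["patient"] == patient and a["time"] < cutoff
    ]
    return {
        "COUNT(appointments)": len(history),
        "SUM(appointments.no_show)": sum(a["no_show"] for a in history),
    }


row = features_at("p1", datetime(2019, 2, 15))
```

The March appointment is excluded from the row because it happens after the cutoff, which is what keeps information from the future out of the features.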
8+
9+
.. _Featuretools: https://www.featuretools.com/
10+
.. _Deep Feature Synthesis (DFS): https://docs.featuretools.com/automated_feature_engineering/afe.html#deep-feature-synthesis

docs/basic_concepts/auto_ml.rst

Lines changed: 96 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,96 @@
1+
Auto - ML
2+
=========
3+
4+
Cardea makes use of two packages to automate and simplify the modeling step in the Machine
5+
Learning tasks: `MLPrimitives`_ and `MLBlocks`_.
6+
7+
MLBlocks is a simple framework for seamlessly combining any possible set of Machine Learning
8+
tools developed in Python, whether they are custom developments or belong to third party
9+
libraries, and build Pipelines out of them that can be fitted and then used to make predictions.
10+
This is achieved by providing a simple and intuitive annotation language that allows the user to
11+
specify how to integrate with each of these tools, called **primitives**, in order to provide a common uniform
12+
interface to each one of them.
13+
14+
On the other hand, MLPrimitives is a repository containing primitive annotations to be used by the
15+
MLBlocks library.
16+
17+
Thanks to the use of these two packages, the Machine Learning algorithm selection and the
18+
hyper-parameter tuning steps can be done easily using JSON annotations as follows:
19+
20+
.. code-block:: python
21+
22+
pipeline = [
23+
['sklearn.ensemble.RandomForestClassifier'],
24+
['sklearn.naive_bayes.MultinomialNB'],
25+
['sklearn.neighbors.KNeighborsClassifier']
26+
]
27+
result = cardea.execute_model(..., primitives=pipeline)
28+
29+
Where, for example, the ``sklearn.naive_bayes.MultinomialNB`` primitive is defined in the
30+
`MLPrimitives`_ package, with the following structure:
31+
32+
.. code-block:: python
33+
34+
{
35+
"name": "sklearn.naive_bayes.MultinomialNB",
36+
"contributors": [...],
37+
"documentation": "...",
38+
"description": "...",
39+
"classifiers": {
40+
"type": "estimator",
41+
"subtype": "classifier"
42+
},
43+
"modalities": ["text"],
44+
"primitive": "sklearn.naive_bayes.MultinomialNB",
45+
"fit": {
46+
"method": "fit",
47+
"args": [
48+
{
49+
"name": "X",
50+
"type": "ndarray"
51+
},
52+
{
53+
"name": "y",
54+
"type": "array"
55+
}
56+
]
57+
},
58+
"produce": {
59+
"method": "predict",
60+
"args": [
61+
{
62+
"name": "X",
63+
"type": "ndarray"
64+
}
65+
],
66+
"output": [
67+
{
68+
"name": "y",
69+
"type": "array"
70+
}
71+
]
72+
},
73+
"hyperparameters": {
74+
"fixed": {
75+
"fit_prior": {
76+
"type": "bool",
77+
"default": true
78+
},
79+
"class_prior": {
80+
"type": "iterable",
81+
"default": null
82+
}
83+
},
84+
"tunable": {
85+
"alpha": {
86+
"type": "float",
87+
"default": 1.0,
88+
"range": [0.0, 1.0]
89+
}
90+
}
91+
}
92+
}
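
Because the annotation is plain JSON, it can be inspected with the standard library alone. The sketch below parses a trimmed copy of the annotation above and collects the default values and the tunable search space; it does not use MLBlocks, and the variable names are illustrative.

```python
import json

# Trimmed copy of the MultinomialNB annotation shown above.
annotation = json.loads("""
{
    "name": "sklearn.naive_bayes.MultinomialNB",
    "primitive": "sklearn.naive_bayes.MultinomialNB",
    "hyperparameters": {
        "fixed": {"fit_prior": {"type": "bool", "default": true}},
        "tunable": {"alpha": {"type": "float", "default": 1.0, "range": [0.0, 1.0]}}
    }
}
""")

hyperparameters = annotation["hyperparameters"]

# Default value of every hyperparameter, fixed or tunable.
defaults = {
    name: spec["default"]
    for group in hyperparameters.values()
    for name, spec in group.items()
}

# Only tunable hyperparameters declare a range to search over.
search_space = {
    name: tuple(spec["range"])
    for name, spec in hyperparameters["tunable"].items()
}
```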
93+
94+
95+
.. _MLPrimitives: https://hdi-project.github.io/MLPrimitives/
96+
.. _MLBlocks: https://hdi-project.github.io/MLBlocks/

docs/basic_concepts/concepts.rst

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
.. _concepts:
2+
3+
Basic Concepts
4+
==============
5+
6+
Before diving into advanced usage and contributions, let's review the basic concepts of the
7+
library to help you get started.
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
Data Loading
2+
============
3+
4+
Cardea makes use of a module to plug in the user's data and automatically organize it into the framework.
5+
It expects data in the Fast Healthcare Interoperability Resources (FHIR) format, a standard for health care data
6+
exchange, published by HL7®. Among the advantages of FHIR over other standards are:
7+
8+
* Fast and easy to implement
9+
* Specification is free for use with no restrictions
10+
* Strong foundation in Web standards: XML, JSON, HTTP, OAuth, etc.
11+
* Support for RESTful architectures
12+
* Concise and easily understood specifications
13+
* A human-readable serialization format for ease of use by developers
14+
15+
By default, Cardea loads a dataset hosted in `Amazon S3`_, representing a formatted version of the
16+
Kaggle dataset `Medical Appointment No Shows`_, but it also allows users to load datasets by providing a
17+
local path with CSV files, using the ``load_data_entityset(...)`` method. As an example, the following piece
18+
of code will load the default Kaggle dataset:
19+
20+
.. code-block:: python
21+
22+
from cardea import Cardea
23+
cardea = Cardea()
24+
cardea.load_data_entityset()
25+
26+
Local files can be loaded using the same method with the ``folder_path`` parameter:
27+
28+
.. code-block:: python
29+
30+
cardea.load_data_entityset(folder_path="your/local/path/")
31+
32+
Cardea handles datasets as a collection of entities and the relationships between them because they
33+
are useful for preparing raw, structured datasets for feature engineering. For this, it uses
34+
the `featuretools.EntitySet`_ class.
35+
36+
Using the following command, you will be able to summarize the dataset:
37+
38+
.. code-block:: python
39+
40+
cardea.es
41+
Entityset: fhir
42+
Entities:
43+
Address [Rows: 81, Columns: 2]
44+
Appointment_Participant [Rows: 6100, Columns: 2]
45+
Appointment [Rows: 110527, Columns: 5]
46+
CodeableConcept [Rows: 4, Columns: 2]
47+
Coding [Rows: 3, Columns: 2]
48+
Identifier [Rows: 227151, Columns: 1]
49+
Observation [Rows: 110527, Columns: 3]
50+
Patient [Rows: 6100, Columns: 4]
51+
Reference [Rows: 6100, Columns: 1]
52+
Relationships:
53+
Appointment_Participant.actor -> Reference.identifier
54+
Appointment.participant -> Appointment_Participant.object_id
55+
CodeableConcept.coding -> Coding.object_id
56+
Observation.code -> CodeableConcept.object_id
57+
Observation.subject -> Reference.identifier
58+
Patient.address -> Address.object_id
59+
60+
This output shows the resources that were loaded into the framework (**Entities** section)
61+
and the relationships between them (**Relationships** section).
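
A relationship such as ``Patient.address -> Address.object_id`` simply says that a column in the child entity refers to a key in the parent entity. The sketch below shows that idea with plain dictionaries; the rows and the ``city`` column are made up, and this is not how ``featuretools.EntitySet`` stores its data.

```python
# Hypothetical rows for two of the entities listed above.
entities = {
    "Address": [{"object_id": 10, "city": "Boston"}],
    "Patient": [
        {"identifier": 1, "address": 10},
        {"identifier": 2, "address": 10},
    ],
}

# (child entity, child column, parent entity, parent column)
relationships = [("Patient", "address", "Address", "object_id")]


def parent_of(child_row, relationship):
    """Follow a relationship from a child row to its parent row, if any."""
    _, child_key, parent_entity, parent_key = relationship
    for parent_row in entities[parent_entity]:
        if parent_row[parent_key] == child_row[child_key]:
            return parent_row
    return None


city = parent_of(entities["Patient"][0], relationships[0])["city"]
```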
62+
63+
64+
.. _Amazon S3: https://s3.amazonaws.com/dai-cardea/
65+
.. _Medical Appointment No Shows: https://www.kaggle.com/joniarroba/noshowappointments
66+
.. _featuretools.EntitySet: https://docs.featuretools.com/generated/featuretools.EntitySet.html#featuretools.EntitySet
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
Machine Learning Tasks
2+
======================
3+
4+
The Problem Definition is considered a fundamental component that formulates the task for
5+
Machine Learning models. It includes generating and identifying two main concepts:
6+
the **target variable** and the **cutoff times**.
7+
8+
Therefore, the first step to work with Cardea is defining a Machine Learning Task (or using one
9+
of the already defined tasks). For example, **Missed Appointment** is a common task that aims
10+
to predict whether a patient will show up to an appointment, helping hospitals optimize
11+
their scheduling policies and use resources efficiently.
12+
13+
Outcome to predict
14+
------------------
15+
16+
Continuing with the previous example, the **Missed Appointment** task is currently defined as
17+
a binary classification task in the system, determining whether a patient will show up to the appointment
18+
or not, as seen from the time the appointment is scheduled.
19+
20+
Usually, the outcome is defined over the FHIR data schema, using the resource id values for
21+
references between instances.
22+
23+
Cutoff times and Labels
24+
-----------------------
25+
26+
As stated before, the success of the Problem Definition step and its outcome depend on
27+
two main concepts: the **target variable** and the **cutoff times**. The target variable is
28+
generated automatically by Cardea if it does not exist in the dataset and its objective is to
29+
set the definition of the model output. On the other hand, the objective of cutoff times is to
30+
split the data in such a manner that any events before the cutoff time are used for training while
31+
events after the cutoff time are used for testing. The following code shows the format for these
32+
values in the **Missed Appointment** task:
33+
34+
.. ipython:: python
35+
36+
from cardea import Cardea
37+
cardea = Cardea()
38+
cardea.load_data_entityset()
39+
cardea.select_problem('MissedAppointmentProblemDefinition')
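
As a rough picture of the idea, cutoff times and labels can be thought of as one row per instance holding an id, a time, and a label, which can then be split by time. The sketch below uses made-up appointments and a hypothetical ``split_time``; it does not reproduce the output of ``select_problem``.

```python
from datetime import datetime

# Hypothetical appointments with the time they were scheduled and the outcome.
appointments = [
    {"id": 1, "scheduled_on": datetime(2016, 4, 1), "no_show": False},
    {"id": 2, "scheduled_on": datetime(2016, 4, 20), "no_show": True},
    {"id": 3, "scheduled_on": datetime(2016, 5, 3), "no_show": False},
]

# One row per instance: (instance id, cutoff time, label).
cutoff_times = [(a["id"], a["scheduled_on"], a["no_show"]) for a in appointments]

# Rows before the split time go to training, the rest to testing.
split_time = datetime(2016, 5, 1)
train = [row for row in cutoff_times if row[1] < split_time]
test = [row for row in cutoff_times if row[1] >= split_time]
```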
40+
41+
Current Prediction Problems
42+
---------------------------
43+
44+
Cardea encapsulates six different prediction problems for users to explore easily;
45+
these are described as follows:
46+
47+
1. Diagnosis Prediction:
48+
a. Predicts whether a patient will be diagnosed with a specified diagnosis.
49+
2. Length of Stay:
50+
a. Predicts how many days the patient will be in the hospital.
51+
3. Missed Appointment:
52+
a. Predicts whether the patient showed up to the appointment or not.
53+
4. Mortality Prediction:
54+
a. Predicts whether a patient will die.
55+
5. Prolonged Length of Stay:
56+
a. Predicts whether a patient stayed in the hospital longer than a given period of time (a week by default).
57+
6. Readmission:
58+
a. Predicts whether a patient will revisit the hospital within a certain period of time (a month by default).
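
The last two problems reduce to comparing a duration against a threshold. A minimal sketch of how such labels could be derived is shown below; the function names are hypothetical, and treating "a month" as 30 days is an assumption rather than Cardea's definition.

```python
from datetime import date, timedelta


def prolonged_stay(admitted, discharged, threshold=timedelta(days=7)):
    """Label: did the stay last longer than the threshold (a week by default)?"""
    return (discharged - admitted) > threshold


def readmitted(discharged, next_admission, window=timedelta(days=30)):
    """Label: did the patient come back within the window (30 days assumed)?"""
    if next_admission is None:
        return False
    return (next_admission - discharged) <= window


long_stay = prolonged_stay(date(2019, 1, 1), date(2019, 1, 10))
short_stay = prolonged_stay(date(2019, 1, 1), date(2019, 1, 5))
came_back = readmitted(date(2019, 1, 10), date(2019, 1, 25))
```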
59+
60+
You can see the list of problems using the ``list_problems(...)`` method, for example:
61+
62+
.. ipython:: python
63+
64+
from cardea import Cardea
65+
cardea = Cardea()
66+
cardea.list_problems()
