fabridamicelli
diff --git a/book/lab_explore_data.ipynb b/book/lab_explore_data.ipynb
Lines changed: 63 additions & 18 deletions
@@ -17,7 +17,7 @@
    "id": "86550260-7939-4970-a420-5eb24e8158c2",
    "metadata": {},
    "source": [
-    "We want to take a look at this real-world dataset: [https://github.com/OpenNeuroDatasets/ds005420](https://github.com/OpenNeuroDatasets/ds005420)"
+    "We will work with this real-world dataset of resting-state EEG signals: [https://github.com/OpenNeuroDatasets/ds005420](https://github.com/OpenNeuroDatasets/ds005420)."
    ]
   },
   {
@@ -80,9 +80,7 @@
     "\n",
     "1) Write a unit test (inside `/pycourse/tests/test_data.py`) to make sure the number of subject sub-directories corresponds to the actual number of subjects. **Hint:** Look at the metadata.\n",
     "2) Verify that all subject directories have an `eeg` sub-directory.  \n",
-    "3) Verify that all data in a subject directories matches with the subject number.  \n",
-    "4) Assert that EEG data for all subjects was taken using 20 channels and sampling frequency 500.  \n",
-    "5) (Optional) Write a file (`discarded_subjects.txt`) with the subject numbers that do not match that criterion.  "
+    "3) Assert that the EEG data for all subjects was recorded using 20 channels and a sampling frequency of 500 Hz.  "
    ]
   },
   {
@@ -92,9 +90,7 @@
    "source": [
     "## Exploratory data analysis \n",
     "Now we want to look at the data.\n",
-    "We find that the data is in a particular format `.edf` that we cannot directly read in python.  \n",
-    "**Hint:**\n",
-    "We need to install a third-party library `mne` to read `.edf` files.  \n",
+    "We find that the data is in a format called [European Data Format](https://en.wikipedia.org/wiki/European_Data_Format) (`.edf`) and we need to install a third-party library, `mne`, to read it.\n",
     "You can check out the [library documentation here](https://mne.tools/dev/)"
    ]
   },
@@ -114,34 +110,83 @@
    "id": "bb817e5c-3096-4d05-9120-ba7b05844403",
    "metadata": {},
    "source": [
-    "1) Plot one time series.  \n",
-    "2) Plot all time series with labels according to channel name.  \n",
-    "3) Plot the channels that start with \"T\" and \"O\".  \n",
-    "4) Plot a correlation plot of the \"T\" and \"O\" channels as a heatmap.\n",
-    "5) Plot a histogram of `RecordingDuration` across all subjects.  "
+    "**Hints:**\n",
+    "\n",
+    "- Look at the function `mne.io.read_raw_edf` to load the data.\n",
+    "- Look at the method `.to_data_frame` of the loaded data.\n",
+    "\n",
+    "1) Plot one time series.\n",
+    "2) Clean the column names by removing the \"EEG\" prefix, e.g. \"EEG C4-A1A2\" -> \"C4-A1A2\".\n",
+    "3) Plot all time series with labels according to channel name. **Hint:** Look at the `melt` method of dataframes.\n",
+    "4) Plot the channels that start with \"P\", \"T\" or \"O\".  \n",
+    "5) Plot the all-vs-all correlations of the \"P\", \"T\" and \"O\" channels as a heatmap. **Hint:** Look up seaborn's documentation on heatmaps.\n",
+    "6) Save the correlation plot in SVG format."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "196f319e-51a8-48bd-91f0-931978a5ec04",
    "metadata": {},
    "source": [
-    "## Process data\n",
+    "## Single-subject data\n",
     "After having taken this quick look at the data, we want to start processing the data.\n",
+    "So far we are working with data coming from one subject.\n",
+    "\n",
+    "1) Subtract the mean from each channel.\n",
+    "2) Plot the mean-subtracted time series for all channels.\n",
+    "3) Standardize and plot all time series again.\n",
+    "Standardization means, for a channel $x$ with mean $\\mu$ and standard deviation $\\sigma$: \n",
+    "$$\n",
+    "y = \\frac{x - \\mu}{\\sigma}\n",
+    "$$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d788061b-93ce-4105-8cf7-9de25efa4f2f",
+   "metadata": {},
+   "source": [
+    "## Multi-subject data\n",
+    "Here we are going to work with data from more than one subject at a time.\n",
     "\n",
-    "1) Clean the column names removing \"EEG\", eg \"EEG C4-A1A2\" -> \"C4-A1A2\"\n",
-    "2) Substract the mean from each channel  \n",
-    "3) Plot correlation matrix of all-vs-all channels. **Hint:** Look at seaborn documentation on heatmaps.\n",
-    "4) Save the correlation plot as vector graphics."
+    "1) Plot a histogram of `RecordingDuration` across all subjects. **Hint:** Assume we want the data in \"oc_eeg\".\n",
+    "2) Pick 3 EEG channels and plot the time series (aggregated across all subjects) in one plot. Differentiate the lines by channel. **Hint:** Use seaborn and look up the `hue` parameter.\n",
+    "3) Plot a grid of subplots, with each subplot representing one channel (aggregated across subjects). **Hint:** Adapt [this example](https://seaborn.pydata.org/examples/timeseries_facets.html).\n",
+    "4) Pick 5 channels and only 3 time points. Simulate that the subjects belong to 3 groups.\n",
+    "5) Adapt [this example](https://seaborn.pydata.org/examples/pointplot_anova.html) to plot a comparison between channels/subjects/time."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "9b7edbe2-0c6f-4dd1-bf92-74f080b2e00f",
+   "id": "14452053-710d-4c46-9d9a-ee075692ec7b",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce196df5-f404-448a-88cd-1b164a2aafdf",
    "metadata": {},
    "outputs": [],
    "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "503eef31-86a5-41c9-8cd0-246ab18b4a2d",
+   "metadata": {},
+   "source": [
+    "## Consolidate pipeline\n",
+    "\n",
+    "Let's consolidate our workflow into a pipeline.\n",
+    "\n",
+    "- read and assert subfolders\n",
+    "- clean column names\n",
+    "- standardize values\n",
+    "- plot correlations\n",
+    "- run tests"
+   ]
   }
  ],
  "metadata": {