# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.17.0
#   kernelspec:
#     display_name: default
#     language: python
#     name: python3
# ---

# %% [markdown]

# # Example 3: Exploring Cover Algorithms

# In this notebook we compare all the cover algorithms offered by this library, with the goal of
# providing some guidance on how to choose the best cover for your specific dataset and analysis
# goals. Each algorithm captures different aspects of the data, and the choice of cover can
# significantly influence the resulting Mapper graph: it can reveal different patterns and
# structures, and understanding these differences can help you gain deeper insights into the
# underlying data distribution and relationships. It is therefore worth experimenting with
# different cover algorithms and parameters. Note that the cover algorithm is applied to the lens
# data, so whenever we use the word "space" in this notebook we are referring to the lens space.

# We will use the **Digits dataset** as a case study, applying different cover algorithms to see
# how they affect the Mapper graph structure. The goal is to understand how different covers can
# reveal various aspects of the data and how they can be used to highlight different features in
# the Mapper analysis. In the following examples we skip the clustering step, since we are
# interested in the cover algorithms only.

# %%
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

from tdamapper.learn import MapperAlgorithm
from tdamapper.plot import MapperPlot

# Load the dataset and build the lens: a 2-D PCA projection of the digits.
X, labels = load_digits(return_X_y=True)
y = PCA(2, random_state=42).fit_transform(X)


def mode(arr):
    # Aggregate the labels of a node by majority vote: find the most
    # frequent values and, in case of a tie, return their mean.
    values, counts = np.unique(arr, return_counts=True)
    max_count = np.max(counts)
    mode_values = values[counts == max_count]
    return np.nanmean(mode_values)

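# %% [markdown]

# The `mode` helper above aggregates the labels of the points inside each node by majority vote.
# As a quick standalone check (the helper is repeated below so the snippet runs on its own), note
# how a tie between equally frequent values is averaged:

```python
import numpy as np

def mode(arr):
    # Most frequent value(s) in arr; ties are averaged.
    values, counts = np.unique(arr, return_counts=True)
    max_count = np.max(counts)
    mode_values = values[counts == max_count]
    return np.nanmean(mode_values)

print(mode(np.array([3, 3, 3, 7])))     # single mode -> 3.0
print(mode(np.array([1, 1, 2, 2, 9])))  # tie between 1 and 2 -> 1.5
```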
# %% [markdown]

# ## 1. CubicalCover

# The `CubicalCover` divides the space into a grid-like structure, where each cell has a fixed
# size and overlaps with adjacent cells.

# ### Parameters

# The `n_intervals` parameter controls the number of intervals in each dimension, and
# `overlap_frac` controls the overlap between adjacent intervals. A larger number of intervals
# with a smaller overlap fraction creates a finer cover, potentially revealing more detail in the
# data but also possibly introducing noise. Conversely, a smaller number of intervals with a
# larger overlap fraction creates a coarser cover, which may smooth out some of the finer details
# but can also help reduce noise and highlight broader patterns. Since these parameters can
# significantly influence the structure of the Mapper graph, it's important to experiment with
# different values to find the best fit for your data.

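# %% [markdown]

# To build intuition for these two parameters, here is a minimal standalone sketch of how a 1-D
# cover could be derived from `n_intervals` and `overlap_frac`. This is a hypothetical
# construction for illustration only: the exact interval placement used by `CubicalCover` may
# differ.

```python
import numpy as np

def cubical_intervals(lo, hi, n_intervals, overlap_frac):
    # Hypothetical sketch: split [lo, hi] into n_intervals equal parts,
    # then pad each part so adjacent intervals share a region whose
    # length is overlap_frac times the base interval length.
    length = (hi - lo) / n_intervals
    pad = overlap_frac * length / 2.0
    starts = lo + length * np.arange(n_intervals) - pad
    ends = starts + length + 2.0 * pad
    return list(zip(starts, ends))

# 5 intervals on [0, 10] with 25% overlap: each base interval of length 2
# is padded by 0.25 on both sides, so neighbors share a region of length 0.5.
for a, b in cubical_intervals(0.0, 10.0, n_intervals=5, overlap_frac=0.25):
    print(round(a, 3), round(b, 3))
```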
# Additionally, the `CubicalCover` has an `algorithm` parameter that lets you choose between two
# ways of building the cover. The default, `proximity`, creates only as much of the grid as is
# needed to cover the dataset, while `standard` builds the full grid of cells spanning the
# dataset. The `proximity` algorithm is the default because it is more efficient and scales well
# in high dimensions; the `standard` algorithm is more straightforward and is consistent with the
# Mapper algorithm described in the original paper. The choice can affect the structure of the
# Mapper graph, especially in high-dimensional spaces: `proximity` reduces the computational
# complexity and tends to produce more compact graphs with lower noise, whereas `standard` can
# produce more detailed graphs, since it captures all the cells covering the dataset, but may
# also introduce more noise and complexity. Because the number of cells grows exponentially with
# the number of dimensions, `standard` is mostly usable in low-dimensional spaces.

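# %% [markdown]

# The difference matters because a full grid explodes combinatorially. The standalone sketch
# below (plain NumPy, independent of tdamapper) bins points lying on a low-dimensional curve
# embedded in 5-D into a grid with 10 intervals per dimension, and counts how many cells are
# actually occupied, compared with the full grid that a standard construction would enumerate.

```python
import numpy as np

rng = np.random.default_rng(42)
# 1000 points lying on a 1-D curve embedded in 5-D space.
t = rng.uniform(0.0, 1.0, size=(1000, 1))
points = np.hstack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t), t, t**2, t**3])

n_intervals = 10
mins, maxs = points.min(axis=0), points.max(axis=0)
width = (maxs - mins) / n_intervals
# Grid cell index of each point along every dimension.
cells = np.floor((points - mins) / width).clip(0, n_intervals - 1).astype(int)

occupied = len(np.unique(cells, axis=0))
total = n_intervals ** points.shape[1]
print(occupied, total)  # only a small fraction of the 100000 cells is occupied
```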
# ### Advantages
# `CubicalCover` is the most widely used cover algorithm in Mapper analysis, and it is often the
# default choice in Mapper implementations, research papers, and applications, making it a
# familiar and well-understood method for many researchers and practitioners. It is also
# computationally efficient, as it does not require calculating distances between all pairs of
# points, and it can work well in high-dimensional spaces.

# ### Disadvantages
# `CubicalCover` can be sensitive to the choice of parameters, particularly the number of
# intervals and the overlap fraction. If these are not chosen carefully, the resulting Mapper
# graph may not accurately reflect the underlying data distribution. For example, if the number
# of intervals is too small or the overlap fraction is too large, the cover may merge distinct
# clusters or fail to capture important local structures in the data, leading to a loss of
# information and making the graph difficult to interpret. It is therefore important to choose
# the parameters carefully, considering the specific characteristics of the dataset.


# %%
from tdamapper.cover import CubicalCover

mapper = MapperAlgorithm(
    cover=CubicalCover(
        n_intervals=10,
        overlap_frac=0.25,
    ),
    verbose=False,
)

graph = mapper.fit_transform(X, y)

plot = MapperPlot(graph, dim=3, iterations=400, seed=42)

fig = plot.plot_plotly(
    colors=labels,
    cmap=["jet", "viridis", "cividis"],
    agg=mode,
    node_size=[0.25 * x for x in range(9)],
    title="mode of digits",
)

fig.show(config={"scrollZoom": True}, renderer="notebook_connected")

# %%
from tdamapper.cover import CubicalCover

mapper = MapperAlgorithm(
    cover=CubicalCover(
        n_intervals=10,
        overlap_frac=0.25,
        algorithm="standard",
    ),
    verbose=False,
)

graph = mapper.fit_transform(X, y)

plot = MapperPlot(graph, dim=3, iterations=400, seed=42)

fig = plot.plot_plotly(
    colors=labels,
    cmap=["jet", "viridis", "cividis"],
    agg=mode,
    node_size=[0.25 * x for x in range(9)],
    title="mode of digits",
)

fig.show(config={"scrollZoom": True}, renderer="notebook_connected")

# %% [markdown]

# ## 2. BallCover

# The `BallCover` algorithm creates a cover based on balls of a specified radius around points.

# ### Parameters
# The key parameter in the `BallCover` is the `radius`, which determines the size of the balls
# used to cover the space. A larger radius creates a coarser cover, potentially merging nearby
# points into the same ball, while a smaller radius creates a finer cover, allowing more detailed
# local structures to be captured. The choice of radius can significantly affect the resulting
# Mapper graph, as it determines how points are grouped together and how the overall structure of
# the data is represented. Another important parameter is the `metric`, which defines the
# distance function used to measure the distance between points. The default is the Euclidean
# distance, but you can also use other metrics such as the cosine or Manhattan distance,
# depending on the characteristics of your dataset and your analysis goals. Different metrics may
# capture different aspects of the data distribution and of the relationships between points, so
# the choice of metric can also influence the structure of the Mapper graph.

# ### Advantages
# One advantage of `BallCover` is that it's computationally efficient: it does not require
# calculating distances between all pairs of points, relying instead on an efficient spatial
# index. Moreover, it can work on any metric space, as it only requires a distance function to
# define the balls. This makes it a versatile choice for many different types of datasets.

# ### Disadvantages
# One disadvantage of `BallCover` is that it can be sensitive to the density of points in the
# dataset. In regions with high point density, the balls may overlap significantly, leading to a
# more complex graph structure; in regions with low point density, the balls may not overlap
# much, resulting in isolated nodes or small clusters. Using the same radius for the entire
# dataset may fail to capture the local structure effectively, leading to a loss of information
# and making the resulting Mapper graph difficult to interpret. It is therefore important to
# choose the radius carefully, taking the density of points into account: a good radius balances
# the trade-off between capturing local structures and avoiding noise. In practice, it may be
# beneficial to experiment with different radius values and compare the resulting Mapper graphs.
# Choosing a good radius can be tricky, especially in high-dimensional spaces, where distances
# between points can become less meaningful due to the curse of dimensionality. In such cases, it
# may be beneficial to use a cover that adapts to the local density of points, such as the
# `KNNCover`, which uses k-nearest neighbors to define the cover.

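# %% [markdown]

# The density issue can be seen with a small standalone NumPy sketch (independent of tdamapper):
# with one fixed radius, balls in a dense cluster contain most of the cluster, while balls in a
# sparse cluster are nearly empty.

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(200, 2))   # tightly packed cluster
sparse = rng.normal(loc=5.0, scale=2.0, size=(200, 2))  # spread-out cluster
points = np.vstack([dense, sparse])

radius = 0.5
# For every point, count how many other points fall inside its ball.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
counts = (dists < radius).sum(axis=1) - 1  # exclude the point itself

print(counts[:200].mean())  # dense cluster: balls hold most of the cluster
print(counts[200:].mean())  # sparse cluster: balls are nearly empty
```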
# %%
from tdamapper.cover import BallCover

mapper = MapperAlgorithm(
    cover=BallCover(radius=5.0),
    verbose=False,
)

graph = mapper.fit_transform(X, y)

plot = MapperPlot(graph, dim=3, iterations=400, seed=42)

fig = plot.plot_plotly(
    colors=labels,
    cmap=["jet", "viridis", "cividis"],
    agg=mode,
    node_size=[0.25 * x for x in range(9)],
    title="mode of digits",
)

fig.show(config={"scrollZoom": True}, renderer="notebook_connected")

# %% [markdown]

# ## 3. KNNCover

# The `KNNCover` algorithm uses k-nearest neighbors to define the cover: the cover is created by
# choosing a set of points in the dataset and taking, for each chosen point, its k-nearest
# neighbors as a cover set.

# ### Parameters
# The key parameter in the `KNNCover` is `neighbors`, which determines how many nearest neighbors
# to consider when creating the cover. A larger number of neighbors creates a more connected
# cover, potentially capturing more global structure, while a smaller number creates a more
# localized cover, focusing on the immediate neighborhood of each point. The choice of the number
# of neighbors can significantly affect the resulting Mapper graph, as it determines how points
# are grouped together and how the overall structure of the data is represented.

# ### Advantages
# One advantage of `KNNCover` is that it can adapt to the local density of points, allowing for a
# more nuanced representation of the data: in regions with high point density the cover creates
# more sets, while in regions with low point density it creates fewer. This can help reveal local
# structures and patterns that may not be visible with other cover methods. Like `BallCover`, it
# is computationally efficient, as it does not require calculating distances between all pairs of
# points, relying instead on an efficient spatial index. Moreover, it can work on any metric
# space, as it only requires a distance function to define the nearest neighbors. This makes it a
# versatile choice for many different types of datasets.

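# %% [markdown]

# This adaptivity can be illustrated with a small standalone NumPy sketch (independent of
# tdamapper): the distance to the k-th nearest neighbor, which corresponds to the effective size
# of each neighborhood, shrinks in dense regions and grows in sparse ones.

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(200, 2))   # tightly packed cluster
sparse = rng.normal(loc=5.0, scale=2.0, size=(200, 2))  # spread-out cluster
points = np.vstack([dense, sparse])

k = 15
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
dists.sort(axis=1)
kth_dist = dists[:, k]  # column 0 is each point's distance to itself

print(kth_dist[:200].mean())  # small neighborhoods in the dense cluster
print(kth_dist[200:].mean())  # much larger neighborhoods in the sparse cluster
```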
# ### Disadvantages
# One possible disadvantage of `KNNCover` is that choosing the number of neighbors can
# significantly affect the resulting Mapper graph. If the number of neighbors is too small, the
# cover may not capture the global structure of the data, leading to a fragmented graph with many
# isolated nodes or small clusters. On the other hand, if it is too large, the cover may merge
# distinct clusters or fail to capture important local structures. Additionally, small isolated
# clusters may not be captured effectively: a cluster smaller than the number of neighbors may be
# merged with other, larger clusters, losing information about the smaller one.

# %%
from tdamapper.cover import KNNCover

mapper = MapperAlgorithm(
    cover=KNNCover(neighbors=15),
    verbose=False,
)

graph = mapper.fit_transform(X, y)

plot = MapperPlot(graph, dim=3, iterations=400, seed=42)

fig = plot.plot_plotly(
    colors=labels,
    cmap=["jet", "viridis", "cividis"],
    agg=mode,
    node_size=[0.125 * x for x in range(9)],
    title="mode of digits",
)

fig.show(config={"scrollZoom": True}, renderer="notebook_connected")

# %% [markdown]

# ## Conclusions

# As a final remark: in the example dataset we used, despite significant differences in the
# structure of the Mapper graphs, the relationships between the different parts of the data are
# still preserved. Even though the cover algorithms create different structures, they still
# capture the same underlying relationships between the data points. This is an important aspect
# of Mapper analysis, as it allows flexibility in choosing the cover algorithm while maintaining
# the integrity of the data relationships. In practice, it is often beneficial to try multiple
# cover algorithms and compare the resulting Mapper graphs to gain a comprehensive understanding
# of the data.