Commit 1acffed

committed
Added example about cover algorithms
1 parent 6033562 commit 1acffed

2 files changed

Lines changed: 295 additions & 0 deletions

docs/source/examples.rst

Lines changed: 1 addition & 0 deletions
@@ -6,4 +6,5 @@ Examples
 
 notebooks/circles
 notebooks/digits
+notebooks/cover
 

docs/source/notebooks/cover.py

Lines changed: 294 additions & 0 deletions
@@ -0,0 +1,294 @@
+# ---
+# jupyter:
+#   jupytext:
+#     text_representation:
+#       extension: .py
+#       format_name: percent
+#       format_version: '1.3'
+#       jupytext_version: 1.17.0
+#   kernelspec:
+#     display_name: default
+#     language: python
+#     name: python3
+# ---
+
+# %% [markdown]
+
+# # Example 3: Exploring Cover Algorithms
+
+# In this notebook, we compare all the cover algorithms offered by this library, with the goal of
+# offering some guidance on how to choose a cover for your dataset and analysis goals. Each
+# algorithm captures different aspects of the data, and the choice of cover can significantly
+# influence the resulting Mapper graph, so it is worth experimenting with different cover
+# algorithms and parameters. Understanding how the covers differ can help you gain deeper insight
+# into the underlying data distribution and relationships. Keep in mind that the cover algorithm
+# is applied to the lens data, so whenever we use the word "space" in this notebook, we are
+# referring to the lens data.
+
+# We will use the **Digits dataset** as a case study, applying different cover algorithms to see
+# how they affect the structure of the Mapper graph and which features of the data each one
+# highlights. In the following examples we skip the clustering step, as we are interested in the
+# cover algorithms only.
+
+# %%
+import numpy as np
+from sklearn.datasets import load_digits
+from sklearn.decomposition import PCA
+
+from tdamapper.learn import MapperAlgorithm
+from tdamapper.plot import MapperPlot
+
+# Load the digits and use their 2D PCA projection as the lens.
+X, labels = load_digits(return_X_y=True)
+y = PCA(2, random_state=42).fit_transform(X)
+
+
+def mode(arr):
+    """Return the most frequent value in arr, averaging tied values."""
+    values, counts = np.unique(arr, return_counts=True)
+    max_count = np.max(counts)
+    mode_values = values[counts == max_count]
+    return np.nanmean(mode_values)
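The `mode` helper above is later used as the node aggregation function, and its handling of ties is easy to miss. The following minimal sketch repeats the helper so that it runs on its own and applies it to two made-up label lists (the inputs are illustrative, not taken from the Digits dataset).

```python
import numpy as np


def mode(arr):
    # Same helper as in the notebook: most frequent value, ties averaged.
    values, counts = np.unique(arr, return_counts=True)
    max_count = np.max(counts)
    mode_values = values[counts == max_count]
    return np.nanmean(mode_values)


single = mode([3, 3, 7, 8])   # 3 is the only most frequent label
tied = mode([1, 1, 2, 2, 3])  # 1 and 2 tie, so their mean is returned
print(single, tied)  # 3.0 1.5
```

When two digits are equally frequent in a node, the node color is therefore the average of the tied labels rather than either label.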
+
+
+# %% [markdown]
+
+# ## 1. CubicalCover
+
+# The `CubicalCover` covers the space with a grid-like structure, where each cell has a fixed size
+# and overlaps with adjacent cells.
+
+# ### Parameters
+
+# The `n_intervals` parameter controls the number of intervals in each dimension, and
+# `overlap_frac` controls the overlap between adjacent intervals. A larger number of intervals
+# and a smaller overlap fraction create a finer cover, potentially revealing more detail in the
+# data but also possibly introducing noise. Conversely, a smaller number of intervals with a
+# larger overlap fraction creates a coarser cover, which may smooth out finer details but can
+# also reduce noise and highlight broader patterns. These parameters can significantly influence
+# the structure of the Mapper graph, so it's important to experiment with different values to
+# find the best fit for your data.
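To make the effect of the two parameters concrete, here is a one-dimensional sketch written with NumPy only. The way `overlapping_intervals` turns the interval count and the overlap fraction into endpoints is an assumption made for illustration; the library may parametrize its intervals differently, and both function names are hypothetical.

```python
import numpy as np


def overlapping_intervals(lo, hi, n_intervals, overlap_frac):
    # Hypothetical parametrization: n equal-width base intervals over
    # [lo, hi], each widened by overlap_frac * width (half on each side).
    width = (hi - lo) / n_intervals
    starts = lo + width * np.arange(n_intervals)
    pad = 0.5 * overlap_frac * width
    return np.column_stack([starts - pad, starts + width + pad])


def membership(values, intervals):
    # Boolean matrix: entry [i, j] is True when values[i] lies in interval j.
    v = np.asarray(values)[:, None]
    return (intervals[:, 0] <= v) & (v <= intervals[:, 1])


values = np.linspace(0.0, 1.0, 101)
fine = membership(values, overlapping_intervals(0.0, 1.0, 10, 0.25))
coarse = membership(values, overlapping_intervals(0.0, 1.0, 4, 0.5))

# Average number of points per interval: smaller for the finer cover.
print(fine.sum(axis=0).mean(), coarse.sum(axis=0).mean())
# Every point belongs to at least one interval in both covers.
print(fine.any(axis=1).all(), coarse.any(axis=1).all())
```

With more intervals and less overlap each set holds fewer points, which is what makes the resulting graph finer.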
+
+# Additionally, the `CubicalCover` has an `algorithm` parameter that selects how the cover is
+# built. The default is `proximity`, which creates a grid with just enough cells to cover the
+# dataset. Alternatively, `standard` creates a grid containing all the cells that cover the
+# dataset. The `proximity` algorithm is the default because it is more efficient and scales well
+# in high dimensions, while the `standard` algorithm is more straightforward and consistent with
+# the Mapper algorithm as described in the original paper. The choice of algorithm can affect
+# the structure of the Mapper graph, especially in high-dimensional spaces. There, the
+# `proximity` algorithm reduces the computational cost of the Mapper analysis and produces more
+# compact graphs with lower noise. The `standard` algorithm can produce more detailed graphs, as
+# it captures all the cells that cover the dataset, but it may also introduce more noise and
+# complexity: the number of cells can grow exponentially with the number of dimensions, which
+# makes it better suited to low-dimensional spaces.
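The exponential growth mentioned above can be checked with a small count. This sketch uses random points as a stand-in lens and a plain non-overlapping grid, so it only illustrates the size of a full grid compared with the cells that actually contain data; it does not reproduce either of the library's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_intervals = 500, 10

results = {}
for dim in (1, 2, 3, 6):
    lens = rng.random((n_points, dim))  # stand-in lens in the unit cube
    # Index of the (non-overlapping) grid cell containing each point.
    cells = np.minimum((lens * n_intervals).astype(int), n_intervals - 1)
    occupied = len(np.unique(cells, axis=0))
    total = n_intervals**dim
    results[dim] = (total, occupied)
    print(f"dim={dim}: {total} cells in the full grid, {occupied} occupied")
```

The full grid has `n_intervals ** dim` cells, while the number of occupied cells can never exceed the number of points.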
+
+# ### Advantages
+# One advantage of using `CubicalCover` is that it is the most widely used cover algorithm in
+# Mapper analysis and the default choice in many Mapper implementations, research papers, and
+# applications, making it a familiar and well-understood method for researchers and
+# practitioners. It is also computationally efficient, as it does not require calculating
+# distances between all pairs of points, and it can work well in high-dimensional spaces.
+
+# ### Disadvantages
+# One disadvantage of using `CubicalCover` is that it can be sensitive to the choice of parameters,
+# particularly the number of intervals and the overlap fraction. If these parameters are not
+# chosen carefully, the resulting Mapper graph may not accurately reflect the underlying data
+# distribution. For example, if the number of intervals is too small or the overlap fraction is
+# too large, the cover may merge distinct clusters or fail to capture important local structures
+# in the data, losing information and making the Mapper graph difficult to interpret. It is
+# therefore important to consider the specific characteristics of the dataset when choosing the
+# parameters of `CubicalCover`.
+
+
+# %%
+from tdamapper.cover import CubicalCover
+
+# CubicalCover with the default `proximity` algorithm.
+mapper = MapperAlgorithm(
+    cover=CubicalCover(
+        n_intervals=10,
+        overlap_frac=0.25,
+    ),
+    verbose=False,
+)
+
+# Build the Mapper graph from the data X, using y as the lens.
+graph = mapper.fit_transform(X, y)
+
+plot = MapperPlot(graph, dim=3, iterations=400, seed=42)
+
+# Color each node by the most frequent digit among its points.
+fig = plot.plot_plotly(
+    colors=labels,
+    cmap=["jet", "viridis", "cividis"],
+    agg=mode,
+    node_size=[0.25 * x for x in range(9)],
+    title="mode of digits",
+)
+
+fig.show(config={"scrollZoom": True}, renderer="notebook_connected")
+
+# %%
+from tdamapper.cover import CubicalCover
+
+# Same parameters as before, but with the `standard` algorithm.
+mapper = MapperAlgorithm(
+    cover=CubicalCover(
+        n_intervals=10,
+        overlap_frac=0.25,
+        algorithm="standard",
+    ),
+    verbose=False,
+)
+
+graph = mapper.fit_transform(X, y)
+
+plot = MapperPlot(graph, dim=3, iterations=400, seed=42)
+
+fig = plot.plot_plotly(
+    colors=labels,
+    cmap=["jet", "viridis", "cividis"],
+    agg=mode,
+    node_size=[0.25 * x for x in range(9)],
+    title="mode of digits",
+)
+
+fig.show(config={"scrollZoom": True}, renderer="notebook_connected")
+
+# %% [markdown]
+
+# ## 2. BallCover
+
+# The `BallCover` algorithm creates a cover based on balls of a specified radius around points.
+
+# ### Parameters
+# The key parameter of `BallCover` is the `radius`, which determines the size of the balls used
+# to cover the space. A larger radius creates a coarser cover, potentially merging nearby points
+# into the same ball, while a smaller radius creates a finer cover that captures more detailed
+# local structure. The choice of radius can significantly affect the resulting Mapper graph, as
+# it determines how points are grouped together and how the overall structure of the data is
+# represented. Another important parameter is the `metric`, which defines the distance function
+# used to measure the distance between points. The default metric is the Euclidean distance, but
+# you can also use other metrics such as the cosine distance or the Manhattan distance, depending
+# on the characteristics of your dataset and your analysis goals. The choice of metric can also
+# influence the structure of the Mapper graph, as different metrics may capture different aspects
+# of the data distribution and of the relationships between points.
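As a rough illustration of how the radius controls the coarseness of a ball cover, the sketch below builds one greedily with NumPy. The greedy choice of centers, the brute-force distance computation, and the name `greedy_ball_cover` are assumptions made for this example; the library's own construction relies on an index of the space and may select centers differently.

```python
import numpy as np


def greedy_ball_cover(points, radius):
    # Pick an uncovered point as a center, gather everything within
    # `radius` of it into one ball, and repeat until all points are covered.
    points = np.asarray(points, dtype=float)
    covered = np.zeros(len(points), dtype=bool)
    balls = []
    while not covered.all():
        center = points[np.argmin(covered)]  # first uncovered point
        dist = np.linalg.norm(points - center, axis=1)  # Euclidean metric
        members = np.flatnonzero(dist <= radius)
        balls.append(members)
        covered[members] = True
    return balls


rng = np.random.default_rng(0)
lens = rng.normal(size=(300, 2))  # stand-in for a 2D lens

small = greedy_ball_cover(lens, radius=0.5)
large = greedy_ball_cover(lens, radius=2.0)
print(len(small), len(large))
```

The small radius needs many more balls to cover the same points, which corresponds to a finer cover. Replacing the Euclidean norm with another distance function is all it takes to change the metric in this sketch.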
+
+# ### Advantages
+# One advantage of using `BallCover` is that it's computationally efficient: it relies on an
+# efficient indexing of the space, so it does not require calculating distances between all pairs
+# of points. Moreover, it can work on any metric space, as it only requires a distance function
+# to define the balls. This makes it a versatile choice for many different types of datasets.
+
+# ### Disadvantages
+# One disadvantage of using `BallCover` is that it can be sensitive to the density of points in the
+# dataset. In regions with high point density, the balls may overlap significantly, leading to a
+# more complex graph structure. In regions with low point density, by contrast, the balls may
+# barely overlap, resulting in isolated nodes or small clusters. A single radius for the entire
+# dataset may therefore fail to capture the local structure, losing information and making the
+# resulting Mapper graph difficult to interpret. A good choice of radius balances the trade-off
+# between capturing local structures and avoiding noise in the Mapper graph, so in practice it
+# helps to experiment with different radius values and compare the resulting graphs. Choosing a
+# good radius can be especially tricky in high-dimensional spaces, where distances between points
+# become less meaningful due to the curse of dimensionality. In such cases, it may be beneficial
+# to use a cover that adapts to the local density of points, such as the `KNNCover`, which uses
+# k-nearest neighbors to define the cover.
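The density sensitivity described above can be seen on synthetic data. In this sketch the two Gaussian blobs and the radius are made up for illustration, and all pairwise distances are computed directly, which is acceptable only because the dataset is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.2, size=(200, 2))   # tight blob
sparse = rng.normal(loc=5.0, scale=2.0, size=(200, 2))  # spread-out blob
lens = np.vstack([dense, sparse])

# Pairwise Euclidean distances (fine for a small illustrative dataset).
diff = lens[:, None, :] - lens[None, :, :]
dist = np.linalg.norm(diff, axis=2)

radius = 0.5
ball_sizes = (dist <= radius).sum(axis=1)  # points inside each point's ball

dense_median = np.median(ball_sizes[:200])
sparse_median = np.median(ball_sizes[200:])
print(dense_median, sparse_median)
```

With the same radius, a ball in the dense blob holds far more points than a ball in the sparse blob, so no single value suits both regions.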
+
+# %%
+from tdamapper.cover import BallCover
+
+# BallCover with a fixed radius and the default metric.
+mapper = MapperAlgorithm(
+    cover=BallCover(radius=5.0),
+    verbose=False,
+)
+
+graph = mapper.fit_transform(X, y)
+
+plot = MapperPlot(graph, dim=3, iterations=400, seed=42)
+
+fig = plot.plot_plotly(
+    colors=labels,
+    cmap=["jet", "viridis", "cividis"],
+    agg=mode,
+    node_size=[0.25 * x for x in range(9)],
+    title="mode of digits",
+)
+
+fig.show(config={"scrollZoom": True}, renderer="notebook_connected")
+
+# %% [markdown]
+
+# ## 3. KNNCover
+
+# The `KNNCover` algorithm uses k-nearest neighbors to define the cover. The cover is created by
+# choosing a set of points in the dataset and then grouping each chosen point with its k nearest
+# neighbors.
+
+# ### Parameters
+# The key parameter of `KNNCover` is `neighbors`, which determines how many nearest neighbors to
+# consider when creating the cover. A larger number of neighbors creates a more connected cover,
+# potentially capturing more global structure, while a smaller number of neighbors creates a
+# more localized cover, focusing on the immediate neighborhood of each point. The choice of the
+# number of neighbors can significantly affect the resulting Mapper graph, as it determines how
+# points are grouped together and how the overall structure of the data is represented.
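The construction described above can be sketched in a few lines of NumPy. The greedy choice of points, the brute-force neighbor search, and the name `greedy_knn_cover` are assumptions made for this example rather than the library's implementation.

```python
import numpy as np


def greedy_knn_cover(points, neighbors):
    # Pick an uncovered point, group it with its nearest neighbors,
    # and repeat until every point belongs to at least one set.
    points = np.asarray(points, dtype=float)
    covered = np.zeros(len(points), dtype=bool)
    cover = []
    while not covered.all():
        idx = np.argmin(covered)  # first point not yet covered
        dist = np.linalg.norm(points - points[idx], axis=1)
        dist[idx] = -1.0  # make sure the chosen point sorts first
        members = np.argsort(dist)[:neighbors]
        cover.append(members)
        covered[members] = True
    return cover


rng = np.random.default_rng(0)
lens = rng.normal(size=(300, 2))  # stand-in for a 2D lens

local = greedy_knn_cover(lens, neighbors=5)
broad = greedy_knn_cover(lens, neighbors=30)
print(len(local), len(broad))

# Largest distance from each chosen point to a member of its set.
# It varies across sets because each set adapts to the local density.
reach = [np.linalg.norm(lens[m] - lens[m[0]], axis=1).max() for m in broad]
print(min(reach), max(reach))
```

Every set has exactly `neighbors` points, so fewer neighbors means more, smaller sets, while the spatial extent of each set varies with the local density.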
+
+# ### Advantages
+# One advantage of using `KNNCover` is that it adapts to the local density of points, allowing
+# for a more nuanced representation of the data. In regions with high point density, the cover
+# creates more sets of points, while in regions with low point density it creates fewer sets.
+# This can help to reveal local structures and patterns that may not be visible with other cover
+# methods. Like `BallCover`, it is computationally efficient: it relies on an efficient indexing
+# of the space, so it does not require calculating distances between all pairs of points.
+# Moreover, it can work on any metric space, as it only requires a distance function to define
+# the nearest neighbors. This makes it a versatile choice for many different types of datasets.
+
+# ### Disadvantages
+# One possible disadvantage of using `KNNCover` is that the resulting Mapper graph depends
+# strongly on the number of neighbors. If the number of neighbors is too small, the cover may
+# not capture the global structure of the data, leading to a fragmented graph with many isolated
+# nodes or small clusters. On the other hand, if the number of neighbors is too large, the cover
+# may merge distinct clusters or fail to capture important local structures in the data.
+# Additionally, small isolated clusters may not be captured effectively: if a cluster is smaller
+# than the number of neighbors specified, it may be merged with other larger clusters, leading to
+# a loss of information about the smaller cluster.
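The last point is easy to verify on a toy example. The data below is synthetic (a blob of 50 points plus an isolated group of 3 points), and the neighbor search is done by brute force purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
big = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
small = np.array([[20.0, 20.0], [20.1, 20.0], [20.0, 20.1]])  # isolated group
lens = np.vstack([big, small])
is_small = np.arange(len(lens)) >= len(big)

center = lens[-1]  # a point of the small group
order = np.argsort(np.linalg.norm(lens - center, axis=1))

outside_counts = {}
for k in (3, 5):
    members = order[:k]
    outside_counts[k] = int((~is_small[members]).sum())
    print(f"neighbors={k}: {outside_counts[k]} members come from the big cluster")
```

With 3 neighbors the set contains only the isolated group, while with 5 neighbors it has to include 2 points from the distant blob, so the two clusters end up sharing a cover set.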
+
+# %%
+from tdamapper.cover import KNNCover
+
+# KNNCover grouping each chosen point with its 15 nearest neighbors.
+mapper = MapperAlgorithm(
+    cover=KNNCover(neighbors=15),
+    verbose=False,
+)
+
+graph = mapper.fit_transform(X, y)
+
+plot = MapperPlot(graph, dim=3, iterations=400, seed=42)
+
+fig = plot.plot_plotly(
+    colors=labels,
+    cmap=["jet", "viridis", "cividis"],
+    agg=mode,
+    node_size=[0.125 * x for x in range(9)],
+    title="mode of digits",
+)
+
+fig.show(config={"scrollZoom": True}, renderer="notebook_connected")
+
+# %% [markdown]
+
+# ## Conclusions
+
+# As a final remark, in the example dataset that we used, despite significant differences in the
+# structure of the Mapper graphs, the relationships between the different parts of the data are
+# still preserved. Even though the cover algorithms create different structures, they still
+# capture the same underlying relationships between the data points. This is an important aspect
+# of Mapper analysis, as it allows for flexibility in choosing the cover algorithm while still
+# maintaining the integrity of the data relationships. In practice, it is often beneficial to
+# try multiple cover algorithms and compare the resulting Mapper graphs to gain a comprehensive
+# understanding of the data.
