WIP Add module on dimensionality reduction #876

Open
ArturoAmorQ wants to merge 1 commit into INRIA:main from ArturoAmorQ:dimred_module

@ArturoAmorQ
Collaborator

Adds module on PCA. Adds the wiki_news dataset, notebooks and exercises.
Probably a good idea to merge this after the clustering module in #836.

Note: This is still WIP. We are missing at least one quiz and the wrap-up quiz.

ax.bar_label(bars)
ax.set_xlim([0, 14])
ax.set_yticks([1, 2], labels=["PC1", "PC2"])
ax.set_xlabel("eigenvalues")
Contributor

My suggestion would be to use "explained variance" in the plot, as the term "eigenvalues" has not been introduced (yet):

Suggested change
- ax.set_xlabel("eigenvalues")
+ ax.set_xlabel("Explained variance")
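
For context on why the two labels describe the same quantity: in scikit-learn, `PCA.explained_variance_` matches the eigenvalues of the sample covariance matrix. The sketch below checks this on synthetic data; the covariance values, sample size and variable names are illustrative assumptions and are not taken from the notebook under review.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic 2D data with correlated features (illustrative values only).
X = rng.multivariate_normal(mean=[0, 0], cov=[[9, 4], [4, 3]], size=500)

pca = PCA(n_components=2).fit(X)

# Eigenvalues of the sample covariance matrix. np.cov uses ddof=1 by default,
# and eigvalsh returns eigenvalues in ascending order, so reverse them to
# match PCA's descending order.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

same = np.allclose(pca.explained_variance_, eigenvalues)
print("explained_variance_:", pca.explained_variance_)
print("covariance eigenvalues:", eigenvalues)
print("identical up to rounding:", same)
```

Since the values coincide, relabeling the axis changes only the vocabulary shown to learners, not the plotted numbers.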

# with each other. Two strongly correlated features will jointly define a
# direction with much higher variance than either one alone, and the explained
# variance ratios across components will still be very unequal. Scaling removes
# the unit bias; it does not make all directions equally important.
Contributor

I would say that omitting scaling introduces a magnitude bias rather than a unit bias:

Suggested change
- # the unit bias; it does not make all directions equally important.
+ # the magnitude bias; it does not make all directions equally important.
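
The point made in the commented passage can be illustrated with a minimal sketch, assuming synthetic data: two strongly correlated features on very different scales plus one independent feature. The feature construction, scales and variable names are hypothetical and are not part of the PR.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)

# Hypothetical features: the first two share the same underlying signal but
# differ in magnitude by a factor of 1000; the third is independent noise.
X = np.column_stack(
    [
        1000 * base + rng.normal(scale=100, size=n),
        base + rng.normal(scale=0.1, size=n),
        rng.normal(size=n),
    ]
)

# Without scaling, the large-magnitude feature dominates the first component.
ratios_raw = PCA().fit(X).explained_variance_ratio_

# With scaling, every feature has unit variance, yet the correlated pair
# still defines a direction with far more variance than the others.
X_scaled = StandardScaler().fit_transform(X)
ratios_scaled = PCA().fit(X_scaled).explained_variance_ratio_

print("raw:   ", np.round(ratios_raw, 4))
print("scaled:", np.round(ratios_scaled, 4))
```

In this setup the explained variance ratios remain clearly unequal after scaling, which is consistent with the wording "magnitude bias": scaling equalizes per-feature variance, not the importance of directions.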

# ---

# %% [markdown]
# # Solution for Exercise M8.01
Contributor

Suggested change
- # # Solution for Exercise M8.01
+ # # Exercise M8.01

# variance, keeps even more: all 900 components we computed pass it, meaning the
# true cutoff lies beyond what we measured.
#
# For text data, a common practice is to fix the number of components to be 100
Contributor

Out of curiosity, where do these values come from? Why not between 100 and 300?

2 participants