feat: upgrade recommender.py scoring to ML-based cosine similarity us…#136

Open
Yogesh23-03 wants to merge 1 commit into komalharshita:main from Yogesh23-03:feature/ml-cosine-similarity

Conversation

@Yogesh23-03

Summary [required]

This PR upgrades the recommendation engine in utils/recommender.py
from a fixed point-based scoring system to ML-based cosine similarity
using scikit-learn's TfidfVectorizer and cosine_similarity.
This makes project recommendations smarter and more accurate by
computing actual vector similarity between user skills and project
skills instead of simple point counting.
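
To make the idea concrete, here is a dependency-free sketch of scoring by TF-IDF cosine similarity. It is not the code in this PR, which uses scikit-learn's TfidfVectorizer and cosine_similarity; the helper names (`idf_weights`, `tfidf_vector`, `cosine`, `similarity_scores`) and the two sample projects are assumptions made up for illustration, and the smoothed IDF formula is one common variant.

```python
import math
from collections import Counter


def normalise(skills):
    """Lower-case and strip skill names, dropping empty entries."""
    return [s.strip().lower() for s in skills if s.strip()]


def idf_weights(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(terms, idf):
    """Sparse {term: weight} vector; terms unknown to the corpus are dropped."""
    counts = Counter(terms)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors, 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_scores(user_skills, projects):
    """Score every project's skill list against the user's skill list."""
    docs = {name: normalise(skills) for name, skills in projects.items()}
    idf = idf_weights(list(docs.values()))
    user_vec = tfidf_vector(normalise(user_skills), idf)
    return {name: cosine(user_vec, tfidf_vector(terms, idf)) for name, terms in docs.items()}


# Hypothetical data, not taken from data/projects.json.
projects = {
    "api_project": ["Python", "Flask", "SQL"],
    "frontend_project": ["HTML", "CSS", "JavaScript"],
}
scores = similarity_scores(["python", "flask", "sql"], projects)
```

In this toy corpus the user's skills match `api_project` exactly and share nothing with `frontend_project`, so the two scores sit at the extremes of the 0.0–1.0 range.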

Related Issue [required]

Closes #135

Type of Change [required]

  • Bug fix — resolves a broken behaviour
  • Feature — adds new functionality
  • Data — adds new projects to data/projects.json
  • Documentation — updates docs, README, or code comments only
  • Style — CSS or visual changes only, no logic change
  • Refactor — restructures code without changing behaviour
  • Test — adds or updates tests

What Was Changed [required]

| File | Change made |
| --- | --- |
| utils/recommender.py | Replaced point-based scoring with TF-IDF cosine similarity using scikit-learn |
| tests/test_basic.py | Updated expected score from 8 to 15 to reflect new ML scoring output |
| requirements.txt | Added scikit-learn dependency |

How to Test This PR [required]

  1. Clone this branch: git checkout feature/ml-cosine-similarity
  2. Install dependencies: pip install -r requirements.txt
  3. Run the app: python app.py
  4. Open http://127.0.0.1:5000 and enter skills to verify recommendations work
  5. Run the tests: python tests/test_basic.py

Expected test output:
27 passed, 0 failed out of 27 tests

Test Results [required]

PASS test_projects_json_loads
PASS test_each_project_has_required_fields
PASS test_find_project_by_id_found
PASS test_find_project_by_id_missing
PASS test_parse_skills_basic
PASS test_parse_skills_empty_string
PASS test_parse_skills_single_entry
PASS test_score_single_project_full_match
PASS test_score_single_project_no_match
PASS test_get_recommendations_returns_results
PASS test_get_recommendations_max_three
PASS test_get_recommendations_no_match_returns_empty
PASS test_get_recommendations_result_format
PASS test_validate_all_valid
PASS test_validate_missing_skills
PASS test_validate_missing_level
PASS test_validate_missing_interest
PASS test_validate_missing_time
PASS test_validate_all_missing
PASS test_home_route
PASS test_recommend_api_valid
PASS test_recommend_api_missing_field
PASS test_recommend_api_empty_body
PASS test_project_detail_found
PASS test_project_detail_not_found
PASS test_view_code_found
PASS test_download_code_found
27 passed, 0 failed out of 27 tests

Screenshots (if UI change)

No UI changes in this PR.

Self-Review Checklist [required]

  • I have read CONTRIBUTING.md and followed all guidelines
  • My branch name follows the convention: feat/, fix/, docs/, data/, style/, test/
  • I have run python tests/test_basic.py and all 27 tests pass
  • I have run flake8 . locally and there are no errors
  • I have not introduced any print() or console.log() debug statements
  • Every new function I wrote has a docstring
  • I have not modified files outside the scope of the linked issue
  • If I changed the UI, I tested it at 375px (mobile) and 1280px (desktop)
  • If I added a project to the dataset, it has all required JSON fields

Notes for Reviewer

The test expected value was updated from 8 to 15 because the scoring
engine was upgraded from fixed points to ML-based cosine similarity.
The old value (8) reflected simple point counting. The new value (15)
reflects cosine similarity score scaled to 10 plus bonus points for
level, interest and time match. All 27 tests pass successfully.
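
As a rough illustration of how a full match could reach 15, the sketch below combines a similarity value scaled to 10 with bonus points. The split of the bonus across level, interest and time (2 + 2 + 1) and the names `BONUS_POINTS`, `bonus_score` and `final_score` are assumptions; the actual weights live in utils/recommender.py and are not shown in this thread.

```python
SIMILARITY_SCALE = 10  # maps cosine similarity (0.0-1.0) onto a 0-10 range

# Assumed split of the bonus points; only the combined score is described above.
BONUS_POINTS = {"level": 2, "interest": 2, "time": 1}


def bonus_score(user, project):
    """Add the points of every profile field on which user and project agree."""
    return sum(
        points
        for field, points in BONUS_POINTS.items()
        if user.get(field) is not None and user.get(field) == project.get(field)
    )


def final_score(skill_similarity, user, project):
    """Scaled skill similarity plus bonus points."""
    if not 0.0 <= skill_similarity <= 1.0:
        raise ValueError("skill_similarity must lie in [0.0, 1.0]")
    return skill_similarity * SIMILARITY_SCALE + bonus_score(user, project)


# Hypothetical profiles for illustration only.
profile = {"level": "Intermediate", "interest": "Web Development", "time": "Medium"}
other = {"level": "Beginner", "interest": "Data and Analytics", "time": "High"}

full_match = final_score(1.0, profile, profile)
no_match = final_score(0.0, profile, other)
```

Under these assumed weights a perfect skill match with all three profile fields matching scores 10 + 5 = 15, which is consistent with the updated test expectation.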

@vercel

vercel Bot commented May 16, 2026

@Yogesh23-03 is attempting to deploy a commit to the komalsony234-1530's projects Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot left a comment

Thank you for submitting your first pull request to DevPath.

Before review:

  • Complete the PR template fully
  • Ensure all tests pass
  • Link your PR to an issue
  • Keep changes scoped to the issue

A maintainer will review your contribution soon.

@Yogesh23-03
Author

Hi @komalharshita! I noticed there is a merge conflict in
utils/recommender.py. Could you please guide me on how to
resolve it, or let me know if you'd like to handle it from
your side? I'm happy to make any changes needed. 🙏

Owner

@komalharshita komalharshita left a comment

Thank you for working on a more advanced improvement to the recommendation engine. This is one of the more technically ambitious PRs submitted so far and the effort is appreciated.

The TF-IDF + cosine similarity implementation is logically correct, the code is readable, and CI passes successfully. However, there are several concerns that need to be addressed before this can be merged.

Main concerns:

  1. The repository is currently very lightweight, and adding scikit-learn introduces a large dependency for a relatively small recommendation dataset. The complexity increase may not be justified for the current scale of the project.

  2. TF-IDF cosine similarity is being described as “ML-based”, but this implementation is closer to vector similarity / information retrieval than to machine learning. The terminology should be adjusted for accuracy.

  3. The new scoring system becomes difficult to interpret and maintain:

final_score = (skill_score * 10) + bonus_score

The scaling factor appears arbitrary and there is no calibration or explanation for why 10 was selected.

  4. The PR claims recommendation quality improvements, but no comparison examples or benchmarking against the existing algorithm were provided. Please include:
      • before vs after recommendation examples,
      • edge-case comparisons,
      • and reasoning showing why the new approach improves recommendation relevance.

  5. Since the dataset is still relatively small, consider whether a lighter-weight approach (improved weighted matching, fuzzy matching, synonym expansion, etc.) may achieve similar benefits without introducing heavy ML dependencies.

This is a strong attempt at a meaningful backend improvement, but additional justification and refinement are needed before merge.
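
For reference, the lighter-weight direction mentioned in the last concern could look roughly like the sketch below: synonym expansion followed by a set-overlap (Jaccard) score, using only the standard library. The alias table and the names `SYNONYMS`, `canonical` and `jaccard` are hypothetical and not part of the repository.

```python
# Hypothetical alias table; a real one would be curated for the dataset.
SYNONYMS = {
    "js": "javascript",
    "py": "python",
    "reactjs": "react",
}


def canonical(skills):
    """Normalise skill names and map known aliases to one canonical form."""
    cleaned = (s.strip().lower() for s in skills)
    return {SYNONYMS.get(s, s) for s in cleaned if s}


def jaccard(user_skills, project_skills):
    """Share of skills in common, relative to all distinct skills involved."""
    a = canonical(user_skills)
    b = canonical(project_skills)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


overlap = jaccard(["JS", "HTML"], ["JavaScript", "HTML", "CSS"])
```

Here "JS" is mapped to "javascript", so two of the three distinct skills overlap. Whether this matches the relevance of a TF-IDF approach would still need the before-vs-after comparison requested above.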

@Yogesh23-03
Author

Yogesh23-03 commented May 17, 2026

Thank you for the detailed review! I've addressed all four concerns below.

  1. Dependency justification (scikit-learn)
    I understand the concern about adding a heavy dependency for a small dataset. However, scikit-learn is a standard, well-maintained library with a minimal runtime footprint for this use case. TF-IDF vectorization is just 3 lines of code without it — but reimplementing it manually introduces risk of bugs and makes the codebase harder to maintain. Additionally, as DevPath grows and more projects are added to data/projects.json, the cosine similarity approach scales naturally without any code changes, whereas the old point-counting system would need manual retuning. I believe this is a worthwhile tradeoff, but I'm happy to discuss a lighter alternative if the team prefers.
  2. Terminology fix
    You're right — "ML-based" was inaccurate. I've updated all references in the code to "vector similarity-based scoring using TF-IDF and cosine similarity". The docstring in score_single_project() has been corrected.
  3. Magic number explanation (* 10)
    The * 10 scaling factor converts the cosine similarity score (which returns a float between 0.0 and 1.0) into a 0–10 range, making it numerically comparable to the bonus points (max 5 points from level + interest + time). Without scaling, a perfect skill match would only contribute 1.0 to the final score, which would be dominated by the 5 bonus points and make skill matching nearly irrelevant. I've now added a named constant and a clear comment in the code:
        SIMILARITY_SCALE = 10  # scales cosine similarity (0.0–1.0) to 0–10 range
        # so skill match weight is comparable to bonus_score (max 5 points)
        final_score = (skill_score * SIMILARITY_SCALE) + bonus_score
  4. Before vs After comparison
    I tested 3 input scenarios across both versions:

    Test 1 - Skills: Python, Flask, SQL, React | Level: Intermediate | Interest: Web Development | Time: Medium

    | Rank | Old (point-based) | New (cosine similarity) |
    | --- | --- | --- |
    | 1 | Task Manager REST API | Task Manager REST API |
    | 2 | URL Shortener | Data Analysis Report Generator ✅ |
    | 3 | Data Analysis Report Generator | URL Shortener |

    The new system correctly promotes Data Analysis Report Generator to rank 2 because it has stronger Python overlap with the user's skill set. The old system ranked it 3rd due to simple point counting.
    [Screenshot: WhatsApp Image 2026-05-17 at 07 10 44]
    [Screenshot: WhatsApp Image 2026-05-17 at 07 11 39]

    Test 2 - Skills: HTML, CSS, JavaScript | Level: Beginner | Interest: Web Development | Time: Low

    | Rank | Old (point-based) | New (cosine similarity) |
    | --- | --- | --- |
    | 1 | Weather Dashboard | Weather Dashboard |
    | 2 | Portfolio Website | Portfolio Website |
    | 3 | URL Shortener | URL Shortener |

    Results are identical: both systems agree on clearly matching projects. This shows the new system doesn't break existing correct recommendations.

    Test 3 - Skills: Python only | Level: Intermediate | Interest: Data and Analytics | Time: High

    | Rank | Old (point-based) | New (cosine similarity) |
    | --- | --- | --- |
    | 1 | Data Analysis Report Generator | Data Analysis Report Generator |
    | 2 | Personal Expense Tracker | URL Shortener ✅ |
    | 3 | Task Manager REST API | Personal Expense Tracker |

    The new system promotes URL Shortener to rank 2 because it has Python + JavaScript overlap, giving it a higher cosine similarity score than Personal Expense Tracker (which is Python only but Beginner level, a mismatch). The old system didn't detect this nuance.
    [Screenshot: WhatsApp Image 2026-05-17 at 07 15 36]
    [Screenshot: WhatsApp Image 2026-05-17 at 07 16 56]

All 27 tests still pass. Happy to make any further changes if needed!
cc @komalharshita: please let me know if any further changes are needed!
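
The scaling argument in point 3 can be illustrated with a small ranking sketch. The (similarity, bonus) pairs are made up for the example, and `rank` is a hypothetical helper, not a function from utils/recommender.py.

```python
SIMILARITY_SCALE = 10  # value discussed in point 3


def rank(candidates, scale):
    """Order candidate names by scale * similarity + bonus, best first."""
    scored = {name: sim * scale + bonus for name, (sim, bonus) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Made-up (cosine similarity, bonus points) pairs for two candidate projects.
candidates = {
    "strong_skill_match": (1.0, 2),
    "weak_skill_match": (0.1, 5),
}

unscaled = rank(candidates, scale=1)
scaled = rank(candidates, scale=SIMILARITY_SCALE)
```

Without scaling, the weak skill match wins on bonus points alone (5.1 vs 3.0); with a scale of 10, the strong skill match ranks first (12.0 vs 6.0). Whether 10 is the best value is a separate calibration question that this sketch does not answer.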

@Yogesh23-03 Yogesh23-03 requested a review from komalharshita May 17, 2026 03:02
Owner

@komalharshita komalharshita left a comment

Thank you for the detailed follow-up and for addressing the earlier review concerns thoroughly.

The terminology corrections, scaling explanation, and before-vs-after recommendation comparisons significantly improved the clarity and justification of this implementation.

The new cosine similarity helper is modular and the implementation is readable overall. The additional reasoning around why the scaling factor exists also makes the scoring logic much easier to understand and maintain.

At this point, the main remaining blocker is that the branch still has unresolved merge conflicts in utils/recommender.py.

Please rebase/merge the latest main branch and resolve the conflicts cleanly. Once conflicts are resolved and CI passes again, this PR should be in a good state for merge.

@komalharshita komalharshita added the need review Further information is requested label May 18, 2026
@Yogesh23-03 Yogesh23-03 force-pushed the feature/ml-cosine-similarity branch from 5f2058c to 59f3e2a Compare May 18, 2026 11:50
@vercel

vercel Bot commented May 18, 2026

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@Yogesh23-03
Author

Hi @komalharshita ! Conflict resolved and all issues addressed.

  • 30/30 tests passing ✅
  • SCORING_WEIGHTS kept for backward compatibility ✅
  • Terminology updated to vector similarity-based ✅
  • test_health_check bug fixed (missing client fixture) ✅
  • SIMILARITY_SCALE constant replaces magic number ✅

Ready for final review! 🙏

@Yogesh23-03 Yogesh23-03 requested a review from komalharshita May 18, 2026 11:54

Development

Successfully merging this pull request may close these issues.

[Feature]: Upgrade recommender.py scoring to ML-based cosine similarity using scikit-learn

2 participants