find common tags by misrasaurabh1 · Pull Request #891 · codeflash-ai/codeflash

misrasaurabh1 · 2025-11-10T02:49:11Z

PR Type

Enhancement, Tests

Description

Add common tags utility function
Implement tests for tag intersection

Diagram Walkthrough

flowchart LR
  A["Add common_tags module"] -- "provides find_common_tags" --> B["Set of common tags"]
  C["Add tests"] -- "validate intersections" --> B

File Walkthrough

Relevant files

Enhancement

common_tags.py `Add tag intersection utility function` codeflash/result/common_tags.py Introduce `find_common_tags` function. Handles empty input returning empty set. Iteratively intersects tag lists across articles. Returns result as a set of strings.	+11/-0
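For orientation, the behaviour described in this row can be sketched as below. This is a reconstruction from the row's description, the signature quoted in the code suggestions, and the diff hunk quoted later in the thread; it is not a verbatim copy of the 11-line file. The empty-input guard in particular is an assumption based on "Handles empty input returning empty set".

```python
from __future__ import annotations


def find_common_tags(articles: list[dict[str, list[str]]]) -> set[str]:
    # Assumed guard: the description says empty input returns an empty set.
    if not articles:
        return set()

    # Seed with the first article's tags, then keep only the tags
    # that also appear in every later article.
    common_tags = articles[0].get("tags", [])
    for article in articles[1:]:
        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
    return set(common_tags)
```

An article without a `tags` key is treated as having no tags, so it empties the result.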

Tests

test_common_tags.py `Add unit tests for common tags` tests/test_common_tags.py Add tests for `find_common_tags`. Validate common tags across 3 and 4 articles. Assert expected set {"Python", "AI"}.	+22/-0
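A sketch of what such tests might look like. Only the function name and the expected set `{"Python", "AI"}` come from the description; the article titles and the other tags are hypothetical, and the stand-in `find_common_tags` below is an assumption that only exists to keep the example self-contained (the real tests would import it from `codeflash.result.common_tags`).

```python
def find_common_tags(articles):
    # Stand-in for the function under test (assumed behaviour).
    if not articles:
        return set()
    common_tags = set(articles[0].get("tags", []))
    for article in articles[1:]:
        common_tags &= set(article.get("tags", []))
    return common_tags


def test_common_tags_three_articles():
    # Hypothetical data: "Python" and "AI" appear in all three articles.
    articles = [
        {"title": "Article 1", "tags": ["Python", "AI", "ML"]},
        {"title": "Article 2", "tags": ["Python", "Data Science", "AI"]},
        {"title": "Article 3", "tags": ["Python", "AI", "Big Data"]},
    ]
    assert find_common_tags(articles) == {"Python", "AI"}


def test_common_tags_four_articles():
    # Adding a fourth article that still carries both tags keeps the result unchanged.
    articles = [
        {"title": "Article 1", "tags": ["Python", "AI", "ML"]},
        {"title": "Article 2", "tags": ["Python", "Data Science", "AI"]},
        {"title": "Article 3", "tags": ["Python", "AI", "Big Data"]},
        {"title": "Article 4", "tags": ["Python", "AI", "ML"]},
    ]
    assert find_common_tags(articles) == {"Python", "AI"}
```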

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>

github-actions · 2025-11-10T02:50:08Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Performance
The intersection is computed via repeated list comprehensions, resulting in O(nm) membership checks per step. Converting to sets and using set intersection would be clearer and more efficient for larger tag lists.

    common_tags = articles[0].get("tags", [])
    for article in articles[1:]:
        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]

Deduplication
Using a list to accumulate common tags preserves duplicates until the final cast to set. If duplicate tags exist in the first article, interim operations may do extra work; starting with a set avoids this and simplifies the logic.

    common_tags = articles[0].get("tags", [])
    for article in articles[1:]:
        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
    return set(common_tags)

Type Robustness
The code assumes each article has a list of strings under `tags`. Consider defensive handling (e.g., treat missing or non-list `tags` as empty) to prevent runtime issues with malformed inputs.

    common_tags = articles[0].get("tags", [])
    for article in articles[1:]:
        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
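One way to act on the type-robustness point, as a sketch only. The helper `_tags_of` and the name `find_common_tags_defensive` are hypothetical and not part of this PR. Treating a bare string or any other non-collection `tags` value as empty, and dropping non-string tags, are policy assumptions. The second one changes behaviour compared with the PR's function, which would keep a shared non-string tag such as `123`.

```python
from __future__ import annotations

from typing import Any


def _tags_of(article: Any) -> set[str]:
    """Hypothetical helper: an article's string tags as a set, or empty if malformed."""
    if not isinstance(article, dict):
        return set()
    tags = article.get("tags")
    # A bare string is iterable, but it is almost certainly not a tag collection.
    if not isinstance(tags, (list, tuple, set, frozenset)):
        return set()
    return {tag for tag in tags if isinstance(tag, str)}


def find_common_tags_defensive(articles: list[Any]) -> set[str]:
    if not articles:
        return set()
    common = _tags_of(articles[0])
    for article in articles[1:]:
        common &= _tags_of(article)
        if not common:
            break  # nothing left to intersect
    return common
```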

github-actions · 2025-11-10T02:50:23Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Suggestion: Use set intersections
Impact: Medium

Converting to and operating on sets will improve performance and clarity for tag intersection. It also avoids repeated linear membership checks and preserves set semantics throughout.

codeflash/result/common_tags.py [8-11]

-common_tags = articles[0].get("tags", [])
+common_tags = set(articles[0].get("tags", []))
 for article in articles[1:]:
-    common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
-return set(common_tags)
+    common_tags &= set(article.get("tags", []))
+return common_tags

Suggestion importance[1-10]: 7
Why: Switching to set intersection is accurate for this logic, improves performance and clarity, and the improved_code correctly transforms the existing list-based intersection. Impact is moderate since the current code is correct but less efficient.

Category: General
Suggestion: Tighten input type hints
Impact: Low

The type hints are too permissive and may hide malformed inputs (e.g., non-list tags). Narrow the input type for each article to ensure `tags` is a sequence of strings.

codeflash/result/common_tags.py [4]

-def find_common_tags(articles: list[dict[str, list[str]]]) -> set[str]:
+from typing import TypedDict, Sequence
+class Article(TypedDict, total=False):
+    title: str
+    tags: Sequence[str]
+
+def find_common_tags(articles: list[Article]) -> set[str]:
+

Suggestion importance[1-10]: 5
Why: Stricter typing via TypedDict/Sequence can improve maintainability, but it's optional and not critical to functionality; it also introduces new types beyond the current diff. The suggestion is valid but has moderate impact.

Category: Possible issue
Suggestion: Robust initialization of seed set
Impact: Low

If the first article lacks `tags`, starting with an empty list forces the result to empty even when later articles share tags. Initialize with the first present tags, or with an empty set only if none exist.

codeflash/result/common_tags.py [8]

-common_tags = articles[0].get("tags", [])
+# initialize with the first article that has tags; otherwise empty set
+common_tags: set[str] = set()
+for first in articles:
+    first_tags = first.get("tags", [])
+    if first_tags:
+        common_tags = set(first_tags)
+        break
+for article in articles:
+    tags = set(article.get("tags", []))
+    if common_tags:
+        common_tags &= tags
+    else:
+        common_tags = tags
+return common_tags

Suggestion importance[1-10]: 3
Why: The concern about empty initial tags is not an issue for computing an intersection: starting empty would lead to an empty result, but the current code starts with the first list and intersects, which is standard. The proposed change adds complexity with marginal benefit.
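The low score for the third suggestion follows from how intersection behaves: an article with no tags shares nothing with the others, so an empty result is the expected answer. The sketch below contrasts the set-based form from the first suggestion with the third suggestion's diff, transcribed with its indentation restored. The function names and the empty-input guard are assumptions for illustration. Resetting the seed whenever the running set is empty means the transcribed variant can return tags that are not common to all articles.

```python
def find_common_tags(articles):
    # Set-based form from the first suggestion (empty-input guard assumed).
    if not articles:
        return set()
    common_tags = set(articles[0].get("tags", []))
    for article in articles[1:]:
        common_tags &= set(article.get("tags", []))
    return common_tags


def suggested_seed_variant(articles):
    # Transcription of the third suggestion's diff; the name is hypothetical.
    common_tags = set()
    for first in articles:
        first_tags = first.get("tags", [])
        if first_tags:
            common_tags = set(first_tags)
            break
    for article in articles:
        tags = set(article.get("tags", []))
        if common_tags:
            common_tags &= tags
        else:
            common_tags = tags  # re-seeds after the intersection became empty
    return common_tags


disjoint = [{"tags": ["a"]}, {"tags": ["b"]}, {"tags": ["c"]}]
print(find_common_tags(disjoint))        # set()
print(suggested_seed_variant(disjoint))  # {'c'}
```

The three articles share no tag, so the empty set is the correct intersection; the re-seeding branch is what lets `"c"` through.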

codeflash-ai · 2025-11-10T02:53:40Z

+    common_tags = articles[0].get("tags", [])
+    for article in articles[1:]:
+        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
+    return set(common_tags)


⚡️Codeflash found 8,026% (80.26x) speedup for find_common_tags in codeflash/result/common_tags.py

⏱️ Runtime : 583 milliseconds → 7.18 milliseconds (best of 96 runs)

📝 Explanation and details

The optimization achieves an 8,026% speedup by replacing list-based membership checks with set operations when finding the tags common to all articles.

Key Changes:

Initial conversion to set: common_tags = set(articles[0].get("tags", [])) instead of keeping tags as a list

Set intersection instead of list comprehension: common_tags.intersection_update(article.get("tags", [])) replaces [tag for tag in common_tags if tag in article.get("tags", [])]

Why This Is Much Faster:

O(1) vs O(n) lookups: The original code uses tag in article.get("tags", []) which is O(n) for lists. Set membership testing is O(1) on average.

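Eliminates quadratic complexity: Each pass of the original list comprehension costs O(n×m) comparisons, where n is the number of tags in common_tags and m is the number of tags in the current article. That cost is paid again for every article, so total work grows in proportion to the number of articles.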

In-place operations: intersection_update updates the existing set in place, so the loop no longer rebuilds a Python list for each article.
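A rough way to observe the difference locally. The two functions mirror the original and optimized bodies quoted in this thread, with an assumed empty-input guard; the function names and data sizes are arbitrary choices for illustration. Absolute timings vary by machine, so no expected output is shown.

```python
import timeit


def common_tags_list(articles):
    # Original approach: a list membership test for every remaining tag.
    if not articles:
        return set()
    common_tags = articles[0].get("tags", [])
    for article in articles[1:]:
        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
    return set(common_tags)


def common_tags_set(articles):
    # Optimized approach: hash-based intersection, updated in place.
    if not articles:
        return set()
    common_tags = set(articles[0].get("tags", []))
    for article in articles[1:]:
        common_tags.intersection_update(article.get("tags", []))
    return common_tags


# 20 articles that all share the same 1000 tags (arbitrary sizes).
articles = [{"tags": [f"tag{i}" for i in range(1000)]} for _ in range(20)]

list_time = timeit.timeit(lambda: common_tags_list(articles), number=3)
set_time = timeit.timeit(lambda: common_tags_set(articles), number=3)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```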

Performance Impact by Test Case:

Massive gains on large datasets: Tests with 1000+ tags show 5257-11201% speedups, demonstrating how the optimization scales

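Smaller gains on small datasets: Most simple cases improve by roughly 10-50%. A few are near parity (2.62% faster in one no-common-tags case), and one empty-tag-list case measured 0.718% slower.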

Most effective when: Articles have many tags or there are many articles to process

The line profiler confirms this: the bottleneck line went from 99.6% of execution time (635ms) to just 79.3% (12.5ms), representing a ~50x improvement on the critical path. This optimization is particularly valuable for content management systems or tag analysis workflows processing large article datasets.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	✅ 2 Passed
🌀 Generated Regression Tests	✅ 29 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	✅ 2 Passed
📊 Tests Coverage	100.0%

⚙️ Existing Unit Tests and Runtime

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`test_common_tags.py::test_common_tags_1`	5.05μs	3.93μs	28.3%✅

🌀 Generated Regression Tests and Runtime

# imports
# function to test
from __future__ import annotations

import pytest  # used for our unit tests
from codeflash.result.common_tags import find_common_tags

# unit tests

def test_single_article():
    # Single article should return its tags
    articles = [{"tags": ["python", "coding", "tutorial"]}]
    codeflash_output = find_common_tags(articles) # 1.55μs -> 1.24μs (25.5% faster)
    # Outputs were verified to be equal to the original implementation

def test_multiple_articles_with_common_tags():
    # Multiple articles with common tags should return the common tags
    articles = [
        {"tags": ["python", "coding"]},
        {"tags": ["python", "data"]},
        {"tags": ["python", "machine learning"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.90μs -> 2.32μs (25.0% faster)
    # Outputs were verified to be equal to the original implementation

def test_empty_list_of_articles():
    # Empty list of articles should return an empty set
    articles = []
    codeflash_output = find_common_tags(articles) # 732ns -> 483ns (51.6% faster)
    # Outputs were verified to be equal to the original implementation

def test_articles_with_no_common_tags():
    # Articles with no common tags should return an empty set
    articles = [
        {"tags": ["python"]},
        {"tags": ["java"]},
        {"tags": ["c++"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.38μs -> 2.15μs (10.5% faster)
    # Outputs were verified to be equal to the original implementation

def test_articles_with_empty_tag_lists():
    # Articles with some empty tag lists should return an empty set
    articles = [
        {"tags": []},
        {"tags": ["python"]},
        {"tags": ["python", "java"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.94μs -> 1.95μs (0.718% slower)
    # Outputs were verified to be equal to the original implementation

def test_all_articles_with_empty_tag_lists():
    # All articles with empty tag lists should return an empty set
    articles = [
        {"tags": []},
        {"tags": []},
        {"tags": []}
    ]
    codeflash_output = find_common_tags(articles) # 1.90μs -> 1.78μs (6.69% faster)
    # Outputs were verified to be equal to the original implementation

def test_tags_with_special_characters():
    # Tags with special characters should be handled correctly
    articles = [
        {"tags": ["python!", "coding"]},
        {"tags": ["python!", "data"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.16μs -> 1.78μs (21.3% faster)
    # Outputs were verified to be equal to the original implementation

def test_case_sensitivity():
    # Tags with different cases should not be considered the same
    articles = [
        {"tags": ["Python", "coding"]},
        {"tags": ["python", "data"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.91μs -> 1.74μs (9.55% faster)
    # Outputs were verified to be equal to the original implementation

def test_large_number_of_articles():
    # Large number of articles with a common tag should return that tag
    articles = [{"tags": ["common_tag", f"tag{i}"]} for i in range(1000)]
    codeflash_output = find_common_tags(articles) # 206μs -> 138μs (49.8% faster)
    # Outputs were verified to be equal to the original implementation

def test_large_number_of_tags():
    # Large number of tags with some common tags should return the common tags
    articles = [
        {"tags": [f"tag{i}" for i in range(1000)]},
        {"tags": [f"tag{i}" for i in range(500, 1500)]}
    ]
    expected = {f"tag{i}" for i in range(500, 1000)}
    codeflash_output = find_common_tags(articles) # 4.43ms -> 82.7μs (5257% faster)
    # Outputs were verified to be equal to the original implementation

def test_mixed_length_of_tag_lists():
    # Articles with mixed length of tag lists should return the common tags
    articles = [
        {"tags": ["python", "coding"]},
        {"tags": ["python"]},
        {"tags": ["python", "coding", "tutorial"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.42μs -> 2.12μs (13.9% faster)
    # Outputs were verified to be equal to the original implementation

def test_tags_with_different_data_types():
    # Tags with different data types should only consider strings
    articles = [
        {"tags": ["python", 123]},
        {"tags": ["python", "123"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.98μs -> 1.79μs (10.5% faster)
    # Outputs were verified to be equal to the original implementation

def test_performance_with_large_data():
    # Performance with large data should return the common tag
    articles = [{"tags": ["common_tag", f"tag{i}"]} for i in range(10000)]
    codeflash_output = find_common_tags(articles) # 2.14ms -> 1.42ms (50.2% faster)
    # Outputs were verified to be equal to the original implementation

def test_scalability_with_increasing_tags():
    # Scalability with increasing tags should return the common tag
    articles = [{"tags": ["common_tag"] + [f"tag{i}" for i in range(j)]} for j in range(1, 1001)]
    codeflash_output = find_common_tags(articles) # 458μs -> 341μs (34.2% faster)
    # Outputs were verified to be equal to the original implementation

# imports
# function to test
from __future__ import annotations

import pytest  # used for our unit tests
from codeflash.result.common_tags import find_common_tags

# unit tests

def test_empty_input_list():
    # Test with an empty list
    codeflash_output = find_common_tags([]) # 575ns -> 484ns (18.8% faster)
    # Outputs were verified to be equal to the original implementation

def test_single_article():
    # Test with a single article with tags
    codeflash_output = find_common_tags([{"tags": ["python", "coding", "development"]}]) # 1.33μs -> 1.17μs (13.8% faster)
    # Test with a single article with no tags
    codeflash_output = find_common_tags([{"tags": []}]) # 478ns -> 404ns (18.3% faster)
    # Outputs were verified to be equal to the original implementation

def test_multiple_articles_some_common_tags():
    # Test with multiple articles having some common tags
    articles = [
        {"tags": ["python", "coding", "development"]},
        {"tags": ["python", "development", "tutorial"]},
        {"tags": ["python", "development", "guide"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.62μs -> 2.25μs (16.5% faster)
    articles = [
        {"tags": ["tech", "news"]},
        {"tags": ["tech", "gadgets"]},
        {"tags": ["tech", "reviews"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.42μs -> 1.03μs (37.3% faster)
    # Outputs were verified to be equal to the original implementation

def test_multiple_articles_no_common_tags():
    # Test with multiple articles having no common tags
    articles = [
        {"tags": ["python", "coding"]},
        {"tags": ["development", "tutorial"]},
        {"tags": ["guide", "learning"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.08μs -> 2.02μs (2.62% faster)
    articles = [
        {"tags": ["apple", "banana"]},
        {"tags": ["orange", "grape"]},
        {"tags": ["melon", "kiwi"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.23μs -> 1.01μs (21.8% faster)
    # Outputs were verified to be equal to the original implementation

def test_articles_with_duplicate_tags():
    # Test with articles having duplicate tags
    articles = [
        {"tags": ["python", "python", "coding"]},
        {"tags": ["python", "development", "python"]},
        {"tags": ["python", "guide", "python"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.46μs -> 2.03μs (21.4% faster)
    articles = [
        {"tags": ["tech", "tech", "news"]},
        {"tags": ["tech", "tech", "gadgets"]},
        {"tags": ["tech", "tech", "reviews"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.39μs -> 985ns (40.6% faster)
    # Outputs were verified to be equal to the original implementation

def test_articles_with_mixed_case_tags():
    # Test with articles having mixed case tags
    articles = [
        {"tags": ["Python", "Coding"]},
        {"tags": ["python", "Development"]},
        {"tags": ["PYTHON", "Guide"]}
    ]
    codeflash_output = find_common_tags(articles) # 2.16μs -> 1.97μs (9.49% faster)
    articles = [
        {"tags": ["Tech", "News"]},
        {"tags": ["tech", "Gadgets"]},
        {"tags": ["TECH", "Reviews"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.10μs -> 1.00μs (9.60% faster)
    # Outputs were verified to be equal to the original implementation

def test_articles_with_non_string_tags():
    # Test with articles having non-string tags
    articles = [
        {"tags": ["python", 123, "coding"]},
        {"tags": ["python", "development", 123]},
        {"tags": ["python", "guide", 123]}
    ]
    codeflash_output = find_common_tags(articles) # 2.68μs -> 2.05μs (30.5% faster)
    articles = [
        {"tags": [None, "news"]},
        {"tags": ["tech", None]},
        {"tags": [None, "reviews"]}
    ]
    codeflash_output = find_common_tags(articles) # 1.48μs -> 1.07μs (38.4% faster)
    # Outputs were verified to be equal to the original implementation

def test_large_scale_test_cases():
    # Test with large scale input where all tags should be common
    articles = [
        {"tags": ["tag" + str(i) for i in range(1000)]} for _ in range(100)
    ]
    expected_output = {"tag" + str(i) for i in range(1000)}
    codeflash_output = find_common_tags(articles) # 385ms -> 3.41ms (11201% faster)
    # Test with large scale input where no tags should be common
    articles = [
        {"tags": ["tag" + str(i) for i in range(1000)]} for _ in range(50)
    ] + [{"tags": ["unique_tag"]}]
    codeflash_output = find_common_tags(articles) # 190ms -> 1.74ms (10851% faster)
    # Outputs were verified to be equal to the original implementation

from codeflash.result.common_tags import find_common_tags

def test_find_common_tags():
    find_common_tags([{}, {}])

def test_find_common_tags_2():
    find_common_tags([])

🔎 Concolic Coverage Tests and Runtime

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`codeflash_concolic_2d1ideoq/tmpgwc7y34w/test_concolic_coverage.py::test_find_common_tags`	2.08μs	1.80μs	15.4%✅
`codeflash_concolic_2d1ideoq/tmpgwc7y34w/test_concolic_coverage.py::test_find_common_tags_2`	659ns	507ns	30.0%✅

To test or edit this optimization locally: `git merge codeflash/optimize-pr891-2025-11-10T02.53.34`

Suggested change

-    common_tags = articles[0].get("tags", [])
-    for article in articles[1:]:
-        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
-    return set(common_tags)
+    common_tags = set(articles[0].get("tags", []))
+    for article in articles[1:]:
+        common_tags.intersection_update(article.get("tags", []))
+    return common_tags
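The report says outputs were verified to be equal to the original implementation. A comparable check can be sketched with random inputs. The generator, its parameters, and the function names below are arbitrary choices for illustration and are not Codeflash's verification harness; the empty-input guard is assumed.

```python
import random


def original(articles):
    # List-based body from the "before" side of the suggested change.
    if not articles:
        return set()
    common_tags = articles[0].get("tags", [])
    for article in articles[1:]:
        common_tags = [tag for tag in common_tags if tag in article.get("tags", [])]
    return set(common_tags)


def optimized(articles):
    # Set-based body from the "after" side of the suggested change.
    if not articles:
        return set()
    common_tags = set(articles[0].get("tags", []))
    for article in articles[1:]:
        common_tags.intersection_update(article.get("tags", []))
    return common_tags


def random_articles(rng, max_articles=6, max_tags=8):
    # Small vocabulary so that overlaps and duplicate tags are common.
    vocab = [f"tag{i}" for i in range(10)]
    return [
        {"tags": rng.choices(vocab, k=rng.randint(0, max_tags))}
        for _ in range(rng.randint(0, max_articles))
    ]


rng = random.Random(0)  # fixed seed so any failure is reproducible
for _ in range(1000):
    articles = random_articles(rng)
    assert original(articles) == optimized(articles), articles
```

The check only covers hashable tags. A tag such as a nested list would raise `TypeError` in the set-based version but not in the list-based one.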

find common tags

530c9dd

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>

github-actions Bot added the Review effort 2/5 label Nov 10, 2025

codeflash-ai Bot reviewed Nov 10, 2025


misrasaurabh1 closed this Nov 10, 2025

find common tags #891: misrasaurabh1 wants to merge 1 commit into main from codeflash-demo-009
