Script that feeds the gurubase.io RAG AI tool with youtube videos #414

Closed
amilcarlucas wants to merge 1 commit into master from youtube

Conversation

amilcarlucas (Collaborator) commented Apr 21, 2025

This pull request introduces a script to scrape video URLs from the ArduPilot YouTube channel using Selenium and updates the project dependencies to include the required library. Below are the key changes:

New Script Addition:

  • Added a new script, crawl_ardupilot_youtube_channel.py, to scrape video URLs from the ArduPilot YouTube channel. The script uses Selenium for browser automation, handles proxy settings, and includes robust logging. It is designed to work in headless mode and implements several fallback strategies for finding video links.
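
The scraping approach described above could be sketched roughly as follows. This is a minimal illustration, not the contents of crawl_ardupilot_youtube_channel.py: the function names, the `a[href*='/watch']` selector, the scroll count and the pause length are all assumptions, and the channel URL is left as a parameter.

```python
from urllib.parse import parse_qs, urlparse


def canonical_watch_url(href):
    """Return a canonical watch URL for a YouTube video link, or None."""
    if not href:
        return None
    parsed = urlparse(href)
    if parsed.path != "/watch":
        return None
    video_ids = parse_qs(parsed.query).get("v")
    if not video_ids:
        return None
    return f"https://www.youtube.com/watch?v={video_ids[0]}"


def collect_video_urls(hrefs):
    """De-duplicate video links while keeping the order they appeared in."""
    seen = {}
    for href in hrefs:
        url = canonical_watch_url(href)
        if url:
            seen.setdefault(url, None)
    return list(seen)


def scrape_channel(channel_url, max_scrolls=10, pause_s=2.0):
    """Scroll a channel page in headless Firefox and return its video URLs."""
    # Imported lazily so the pure helpers above work without Selenium installed.
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(channel_url)
        for _ in range(max_scrolls):
            # Scroll to the bottom so the page lazily loads more videos.
            driver.execute_script(
                "window.scrollTo(0, document.documentElement.scrollHeight);"
            )
            time.sleep(pause_s)
        # Assumed selector: any anchor whose href contains "/watch".
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href*='/watch']")
        return collect_video_urls(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()
```

Keeping the URL normalisation in pure functions means that part can be unit-tested without a browser; only `scrape_channel` needs Selenium and a Firefox driver.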

Dependency Update:

  • Updated pyproject.toml to include selenium in the scripts section, ensuring the required library is available for the new script.

Copilot AI review requested due to automatic review settings April 21, 2025 15:58
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds two new crawling scripts to feed the gurubase.io RAG AI tool with URLs from ArduPilot's YouTube channel and documentation pages, alongside updates to dependency configuration in pyproject.toml and the linting workflow.

  • Added script for extracting YouTube video URLs.
  • Added script for crawling ArduPilot documentation URLs with duplicate removal logic.
  • Updated project configuration files to include required dependencies for the new scripts.
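
For illustration only, duplicate removal for crawled documentation URLs might look like the sketch below. The normalisation rules (dropping the `#fragment` and a trailing slash) and the function name `dedupe_urls` are assumptions; the actual logic in scripts/crawl_ardupilot_wiki.py is not shown in this conversation.

```python
from urllib.parse import urldefrag


def dedupe_urls(urls):
    """Return URLs with duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        # Assumed normalisation: ignore the fragment and any trailing slash.
        normalized = urldefrag(url.strip()).url.rstrip("/")
        if normalized and normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique
```

With this normalisation, `page.html#intro` and `page.html` count as the same page, which is usually what a RAG ingestion step wants.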

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| scripts/crawl_ardupilot_youtube_channel.py | New script to crawl the ArduPilot YouTube channel for video URLs. |
| scripts/crawl_ardupilot_wiki.py | New script to crawl the ArduPilot documentation pages with URL deduplication. |
| pyproject.toml | Added extra dependencies under "scripts" to support crawling functionality. |
| .github/workflows/pylint.yml | Updated dependency installation command to include the new "scripts" extras. |

Files not reviewed (1)
  • scripts/crawl_github_ardupilot.sh: Language not supported

Comment on lines +54 to +59
if proxies:
    HTTPProxyAuth(USERNAME, PASSWORD)

# Setup Firefox in headless mode
options = Options()
options.add_argument("--headless")
Copilot AI Apr 21, 2025

The call to HTTPProxyAuth here is not assigned to any variable or integrated with the webdriver or session, which may make proxy authentication ineffective. Consider reviewing and properly integrating proxy authentication if proxies are in use.

Suggested change
- if proxies:
-     HTTPProxyAuth(USERNAME, PASSWORD)
- # Setup Firefox in headless mode
- options = Options()
- options.add_argument("--headless")
+ auth = None
+ if proxies:
+     auth = HTTPProxyAuth(USERNAME, PASSWORD)
+     from selenium.webdriver.common.proxy import Proxy, ProxyType
+     proxy = Proxy()
+     proxy.http_proxy = proxies.get("http")
+     proxy.ssl_proxy = proxies.get("https")
+     proxy.proxy_type = ProxyType.MANUAL
+ # Setup Firefox in headless mode
+ options = Options()
+ options.add_argument("--headless")
+ if proxies:
+     options.proxy = proxy

Copilot uses AI. Check for mistakes.
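
As background to the comment above: `HTTPProxyAuth` is an authentication helper from the `requests` library, so creating one has no effect on a browser driven by Selenium. One possible way to pass proxy hosts to headless Firefox is through its `network.proxy.*` preferences, sketched below. The helper names and the fallback port are assumptions, and the sketch does not handle proxy credentials, which need separate treatment.

```python
from urllib.parse import urlparse

DEFAULT_PROXY_PORT = 8080  # assumed fallback when the proxy URL has no port


def firefox_proxy_preferences(proxies):
    """Translate a requests-style proxies dict into Firefox preferences."""
    prefs = {}
    for scheme, pref_name in (("http", "http"), ("https", "ssl")):
        proxy_url = proxies.get(scheme)
        if not proxy_url:
            continue
        parsed = urlparse(proxy_url)
        if not parsed.hostname:
            continue
        prefs[f"network.proxy.{pref_name}"] = parsed.hostname
        prefs[f"network.proxy.{pref_name}_port"] = parsed.port or DEFAULT_PROXY_PORT
    if prefs:
        prefs["network.proxy.type"] = 1  # 1 = manual proxy configuration
    return prefs


def build_firefox_options(proxies):
    """Create headless Firefox options with the proxy preferences applied."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")
    for name, value in firefox_proxy_preferences(proxies).items():
        options.set_preference(name, value)
    return options
```

Any user name and password embedded in the proxy URL are deliberately dropped here, since these preferences only carry the host and port.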
github-actions Bot (Contributor) commented Apr 21, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| --- | --- | --- | --- | --- |
| 6134 | 4027 | 66% | 60% | 🟢 |

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 98ec2d7 by action🐍

github-actions (Contributor)

Test Results

    2 files      2 suites   1m 55s ⏱️
  835 tests   834 ✅ 1 💤 0 ❌
1 670 runs  1 668 ✅ 2 💤 0 ❌

Results for commit 98ec2d7.

@amilcarlucas amilcarlucas requested a review from Copilot April 22, 2025 12:09
Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request introduces a new script to scrape video URLs from the ArduPilot YouTube channel using Selenium and updates the project dependencies to include selenium.

  • Added the Python script (crawl_ardupilot_youtube_channel.py) that automates browser operations, handles proxies, and implements scrolling and logging for collecting video URLs.
  • Updated pyproject.toml to include selenium in the scripts section for dependency management.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/crawl_ardupilot_youtube_channel.py | Implements YouTube video URL scraping with Selenium using headless mode, proxy handling, and scrolling logic. |
| pyproject.toml | Adds selenium as a dependency under the scripts section. |


proxies = get_env_proxies()
if proxies:
    HTTPProxyAuth(USERNAME, PASSWORD)
Copilot AI Apr 22, 2025

The instantiation of HTTPProxyAuth is not assigned to any variable nor passed to the WebDriver. Ensure that proxy authentication is properly configured for the driver if required.

Comment on lines +103 to +106
if len(video_links) >= 80: # gurubase has a limit of 100 videos
    break

if len(video_links) >= 80: # gurubase has a limit of 100 videos
Copilot AI Apr 22, 2025

The condition restricts video_links to 80, but the comment indicates a limit of 100. Please update the condition or comment to align the intended logic.

Suggested change
- if len(video_links) >= 80: # gurubase has a limit of 100 videos
-     break
- if len(video_links) >= 80: # gurubase has a limit of 100 videos
+ if len(video_links) >= 100: # gurubase has a limit of 100 videos
+     break
+ if len(video_links) >= 100: # gurubase has a limit of 100 videos
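
One way to avoid this kind of mismatch between a threshold and its comment is to name both the limit and any deliberate headroom. The sketch below assumes the 80 was an intentional safety margin below the 100-video limit; that motive is a guess, and `SAFETY_MARGIN` and `take_until_limit` are hypothetical names, not part of the script under review.

```python
GURUBASE_VIDEO_LIMIT = 100  # limit quoted in the code comment under review
SAFETY_MARGIN = 20  # hypothetical headroom that would explain the value 80
MAX_VIDEOS = GURUBASE_VIDEO_LIMIT - SAFETY_MARGIN


def take_until_limit(batches, max_videos=MAX_VIDEOS):
    """Collect unique links batch by batch, stopping once the cap is reached."""
    video_links = []
    for batch in batches:
        for link in batch:
            if link not in video_links:
                video_links.append(link)
            if len(video_links) >= max_videos:
                # Returning here leaves both loops, so one check is enough.
                return video_links
    return video_links
```

Returning from a helper also removes the need to repeat the same condition in the inner and outer loops.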

amilcarlucas (Collaborator, Author)

Gurubase already does this, so this code is not needed.
