Script that feeds the gurubase.io RAG AI tool with youtube videos by amilcarlucas · Pull Request #414 · ArduPilot/MethodicConfigurator

amilcarlucas · 2025-04-21T15:58:57Z

This pull request introduces a script to scrape video URLs from the ArduPilot YouTube channel using Selenium and updates the project dependencies to include the required library. Below are the key changes:

New Script Addition:

Added a new script, crawl_ardupilot_youtube_channel.py, to scrape video URLs from the ArduPilot YouTube channel. The script uses Selenium for browser automation, handles proxy settings, and includes robust logging. It is designed to work in headless mode and implements several fallback strategies for finding video links.
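
The fallback idea described above could be sketched roughly as follows. This is not the script from the PR: the selector strings, the `find_hrefs` callable and both function names are assumptions made for illustration. In a real run, `find_hrefs` would presumably wrap a Selenium lookup such as `driver.find_elements(By.CSS_SELECTOR, selector)` and read each element's `href` attribute.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import parse_qs, urlsplit

# Hypothetical selectors, ordered from most to least specific. YouTube's
# markup changes often, so these are illustrative rather than authoritative.
FALLBACK_SELECTORS = (
    "a#video-title-link",
    "a#video-title",
    "a[href*='/watch?v=']",
)


def canonical_watch_url(href: Optional[str]) -> Optional[str]:
    """Reduce a link to https://www.youtube.com/watch?v=<id>, or None if it is not a video link."""
    if not href:
        return None
    parts = urlsplit(href)
    if parts.path != "/watch":
        return None
    video_ids = parse_qs(parts.query).get("v")
    if not video_ids:
        return None
    return f"https://www.youtube.com/watch?v={video_ids[0]}"


def find_video_urls(find_hrefs: Callable[[str], Iterable[Optional[str]]]) -> list[str]:
    """Try each selector in turn and return the first non-empty, de-duplicated result."""
    for selector in FALLBACK_SELECTORS:
        urls: list[str] = []
        for href in find_hrefs(selector):
            url = canonical_watch_url(href)
            if url and url not in urls:
                urls.append(url)
        if urls:
            return urls
    return []
```

Keeping the element lookup behind a callable means the selection and normalization logic can be exercised without starting a browser.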

Dependency Update:

Updated pyproject.toml to include selenium in the scripts section, ensuring the required library is available for the new script.
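
Because selenium lives in an optional extra rather than in the core dependencies, a script that needs it may want to fail with a clear hint when the extra is not installed. A minimal sketch, assuming the extra is named `scripts` as stated above; the helper names are hypothetical and not taken from the PR:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names, extra="scripts"):
    """Exit with an install hint when an optional dependency is absent."""
    missing = missing_modules(names)
    if missing:
        raise SystemExit(
            f"Missing optional dependencies: {', '.join(missing)}. "
            f"Install the '{extra}' extra, e.g. pip install '.[{extra}]' from a source checkout."
        )
```

A script could call `require_modules(["selenium"])` before importing the WebDriver classes.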

Copilot

Pull Request Overview

This PR adds two new crawling scripts to feed the gurubase.io RAG AI tool with URLs from ArduPilot's YouTube channel and documentation pages, alongside updates to dependency configuration in pyproject.toml and the linting workflow.

Added script for extracting YouTube video URLs.
Added script for crawling ArduPilot documentation URLs with duplicate removal logic.
Updated project configuration files to include required dependencies for the new scripts.

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

File	Description
scripts/crawl_ardupilot_youtube_channel.py	New script to crawl the ArduPilot YouTube channel for video URLs.
scripts/crawl_ardupilot_wiki.py	New script to crawl the ArduPilot documentation pages with URL deduplication.
pyproject.toml	Added extra dependencies under "scripts" to support crawling functionality.
.github/workflows/pylint.yml	Updated dependency installation command to include the new "scripts" extras.

Files not reviewed (1)

scripts/crawl_github_ardupilot.sh: Language not supported

Copilot · 2025-04-21T15:59:29Z

+    if proxies:
+        HTTPProxyAuth(USERNAME, PASSWORD)
+
+    # Setup Firefox in headless mode
+    options = Options()
+    options.add_argument("--headless")


The call to HTTPProxyAuth here is not assigned to any variable or integrated with the webdriver or session, which may make proxy authentication ineffective. Consider reviewing and properly integrating proxy authentication if proxies are in use.

Suggested change

-    if proxies:
-        HTTPProxyAuth(USERNAME, PASSWORD)
-
-    # Setup Firefox in headless mode
-    options = Options()
-    options.add_argument("--headless")
+    auth = None
+    if proxies:
+        auth = HTTPProxyAuth(USERNAME, PASSWORD)
+        from selenium.webdriver.common.proxy import Proxy, ProxyType
+        proxy = Proxy()
+        proxy.http_proxy = proxies.get("http")
+        proxy.ssl_proxy = proxies.get("https")
+        proxy.proxy_type = ProxyType.MANUAL
+
+    # Setup Firefox in headless mode
+    options = Options()
+    options.add_argument("--headless")
+    if proxies:
+        options.proxy = proxy
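
For context on the comment above: `HTTPProxyAuth` is an auth handler from `requests.auth`, so it only takes effect when passed as `auth=` to a `requests` call or assigned to a session; Selenium does not consume it. One possible way to keep the two concerns apart is to split a proxy URL into its address and its credentials first. The sketch below uses only the standard library; `ProxyParts` and `split_proxy_url` are hypothetical names, and whether the script's proxy URLs embed credentials at all is an assumption.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit


class ProxyParts(NamedTuple):
    host_port: str  # e.g. "proxy.example.com:8080"
    username: Optional[str]
    password: Optional[str]


def split_proxy_url(proxy_url: str) -> ProxyParts:
    """Separate credentials from a proxy URL such as http://user:pass@host:8080."""
    # Accept bare "host:port" values by assuming an http scheme.
    parts = urlsplit(proxy_url if "://" in proxy_url else f"http://{proxy_url}")
    if parts.hostname is None:
        raise ValueError(f"Not a usable proxy URL: {proxy_url!r}")
    host_port = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    # Credentials in URLs are percent-encoded, so decode them before use.
    username = unquote(parts.username) if parts.username else None
    password = unquote(parts.password) if parts.password else None
    return ProxyParts(host_port, username, password)
```

The `host_port` value could then feed `proxy.http_proxy` and `proxy.ssl_proxy` as in the suggestion, while the credentials would go to whichever HTTP client actually performs authenticated requests.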

github-actions · 2025-04-21T17:33:21Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
6134	4027	66%	60%	🟢

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 98ec2d7 by action🐍

github-actions · 2025-04-21T20:00:00Z

Test Results

2 files 2 suites 1m 55s ⏱️
835 tests 834 ✅ 1 💤 0 ❌
1 670 runs 1 668 ✅ 2 💤 0 ❌

Results for commit 98ec2d7.

Copilot

Pull Request Overview

This pull request introduces a new script to scrape video URLs from the ArduPilot YouTube channel using Selenium and updates the project dependencies to include selenium.

Added the Python script (crawl_ardupilot_youtube_channel.py) that automates browser operations, handles proxies, and implements scrolling and logging for collecting video URLs.
Updated pyproject.toml to include selenium in the scripts section for dependency management.
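
The scrolling logic mentioned above typically follows a "scroll until the page stops growing" pattern. A minimal sketch under that assumption; the function name, parameters and defaults are illustrative and not taken from the PR. With Selenium, `get_height` and `scroll_to_bottom` would likely wrap `driver.execute_script(...)` calls.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    max_scrolls: int = 50,
    pause_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for scroll_count in range(1, max_scrolls + 1):
        scroll_to_bottom()
        sleep(pause_seconds)  # give lazily loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            return scroll_count  # nothing new was loaded, so stop
        last_height = new_height
    return max_scrolls
```

The `max_scrolls` bound keeps the loop from running forever on a page that keeps loading content.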

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
scripts/crawl_ardupilot_youtube_channel.py	Implements YouTube video URL scraping with Selenium using headless mode, proxy handling, and scrolling logic.
pyproject.toml	Adds selenium as a dependency under the scripts section.

Copilot · 2025-04-22T12:10:28Z

+
+    proxies = get_env_proxies()
+    if proxies:
+        HTTPProxyAuth(USERNAME, PASSWORD)


The instantiation of HTTPProxyAuth is not assigned to any variable nor passed to the WebDriver. Ensure that proxy authentication is properly configured for the driver if required.

Copilot · 2025-04-22T12:10:29Z

+                if len(video_links) >= 80:  # gurubase has a limit of 100 videos
+                    break
+
+            if len(video_links) >= 80:  # gurubase has a limit of 100 videos


The condition restricts video_links to 80, but the comment indicates a limit of 100. Please update the condition or comment to align the intended logic.

Suggested change

-                if len(video_links) >= 80:  # gurubase has a limit of 100 videos
-                    break
-
-            if len(video_links) >= 80:  # gurubase has a limit of 100 videos
+                if len(video_links) >= 100:  # gurubase has a limit of 100 videos
+                    break
+
+            if len(video_links) >= 100:  # gurubase has a limit of 100 videos
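
One way to avoid this kind of mismatch between a threshold and its comment is to name the cap once and leave the collection loop through a single exit. A minimal sketch, not the PR's code: `MAX_VIDEOS` and `collect_links` are hypothetical names, and the value 100 is taken from the code comment quoted above.

```python
from typing import Iterable

MAX_VIDEOS = 100  # assumed cap, taken from the code comment quoted above


def collect_links(batches: Iterable[Iterable[str]], limit: int = MAX_VIDEOS) -> list[str]:
    """Flatten batches of links, dropping duplicates, and stop once `limit` is reached."""
    collected: list[str] = []
    if limit <= 0:
        return collected
    seen: set[str] = set()
    for batch in batches:
        for link in batch:
            if link in seen:
                continue
            seen.add(link)
            collected.append(link)
            if len(collected) >= limit:
                return collected  # single exit replaces the duplicated break checks
    return collected
```

Each batch here stands for the links found after one scroll of the page.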

amilcarlucas · 2025-05-24T18:35:21Z

gurubase does this already, no need for this code

Copilot AI review requested due to automatic review settings April 21, 2025 15:58

Copilot AI reviewed Apr 21, 2025

feat(youtube): add youtube channel crawl script

98ec2d7

amilcarlucas force-pushed the youtube branch from cf6c7dd to 98ec2d7 on April 21, 2025 18:05

amilcarlucas requested a review from Copilot April 22, 2025 12:09

Copilot AI reviewed Apr 22, 2025

amilcarlucas closed this May 24, 2025

amilcarlucas wants to merge 1 commit into master from youtube

2 participants
