From 850723e776f549a9544a0ec80500ec6446085f32 Mon Sep 17 00:00:00 2001
From: Nick Budak <budak@stanford.edu>
Date: Wed, 17 Jun 2026 14:35:48 -0700
Subject: [PATCH] Filter repositories by declared schema using repository
 properties

This updates the clone and pull tasks to only apply to repositories
that declare their conformance with a particular OGM schema version
(v1.0 or Aardvark).

This is implemented via a repository custom property, which is set
at the organization level and can be enabled on a per-repository
basis. Unlike repository topics, this value is a controlled vocabulary
and is unique to the OGM organization.
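
As a rough illustration of the filtering, here is a minimal,
self-contained sketch. The sample JSON and the helper name
`harvestable_repositories` are hypothetical; only the
`custom_properties` / `supported_schemas` keys and the filter order
come from the diff below.

```ruby
require 'json'

# Hypothetical sample of the organization repository listing; a real
# response has many more fields per repository.
SAMPLE_RESPONSE = <<~JSON
  [
    {"name": "my-institution", "size": 100,
     "custom_properties": {"supported_schemas": ["Aardvark"]}},
    {"name": "another-institution", "size": 100,
     "custom_properties": {"supported_schemas": ["Aardvark", "1.0"]}},
    {"name": "v1-institution", "size": 300,
     "custom_properties": {"supported_schemas": ["1.0"]}},
    {"name": "outdated-institution", "size": 100, "archived": true,
     "custom_properties": {"supported_schemas": ["Aardvark"]}},
    {"name": "empty", "size": 0,
     "custom_properties": {"supported_schemas": ["Aardvark"]}},
    {"name": "tool", "size": 50}
  ]
JSON

# Hypothetical helper: names of repositories that declare the wanted
# schema, contain data, and are not archived.
def harvestable_repositories(api_json, schema_version)
  JSON.parse(api_json)
      # Array(nil) is [], so repositories without the property drop out
      .select { |repo| Array(repo.dig('custom_properties', 'supported_schemas')).include?(schema_version) }
      .select { |repo| repo['size'].positive? }
      .reject { |repo| repo['archived'] }
      .map { |repo| repo['name'] }
end
```

With this sample, asking for 'Aardvark' keeps my-institution and
another-institution, while asking for '1.0' keeps
another-institution and v1-institution. The "tool" entry is dropped
in both cases because it declares no schema at all.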

This functionality replaces the "denylist", a hardcoded list of
repositories that shouldn't be harvested (e.g. because they were
tools or didn't contain metadata). In this version, providers can
opt in to harvesting by adding the relevant schema property.

After this change, you no longer need to clone repositories that
don't match the schema version you are using, which saves disk
space and processing time.
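
Indexing applies a matching check per record: a record is skipped
when it declares no schema version, or a different one than
requested. A minimal sketch of that check follows; the helper name
`skip_reason` is hypothetical, while the field names and messages
mirror the diff below.

```ruby
# Hypothetical helper: returns nil when the record should be indexed,
# otherwise a short explanation of why it is skipped.
def skip_reason(record, schema_version)
  # Aardvark records use gbl_mdVersion_s; v1.0 records use
  # geoblacklight_version
  record_schema = record['gbl_mdVersion_s'] || record['geoblacklight_version']

  return 'no schema version declared in record' unless record_schema
  return "schema version #{record_schema} doesn't match #{schema_version}" if record_schema != schema_version

  nil
end
```

For example, a record carrying only `geoblacklight_version: "1.0"`
is skipped when the requested version is 'Aardvark', and a record
with neither field is skipped whatever version is requested.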
---
 .rubocop_todo.yml                      |  2 +-
 README.md                              | 28 ++++++++++------
 lib/geo_combine/harvester.rb           | 41 ++++++++++-------------
 spec/lib/geo_combine/harvester_spec.rb | 46 +++++++++++++++++---------
 4 files changed, 68 insertions(+), 49 deletions(-)

diff --git a/.rubocop_todo.yml b/.rubocop_todo.yml
index 963a40c..ffb030c 100644
--- a/.rubocop_todo.yml
+++ b/.rubocop_todo.yml
@@ -41,7 +41,7 @@ Metrics/CyclomaticComplexity:
 # Offense count: 13
 # Configuration parameters: CountComments, CountAsOne, AllowedMethods, AllowedPatterns.
 Metrics/MethodLength:
-  Max: 21
+  Max: 25
 
 # Offense count: 2
 # Configuration parameters: AllowedMethods, AllowedPatterns.
diff --git a/README.md b/README.md
index 9d4222c..443adc4 100644
--- a/README.md
+++ b/README.md
@@ -32,8 +32,11 @@ $ gem install geo_combine
 ## Usage
 
 ### Converting metadata
+
 #### Converting metadata into GeoBlacklight JSON
+
 GeoCombine provides several classes representing different metadata standards that implement the `#to_geoblacklight` method for generating records in the [GeoBlacklight JSON format](https://opengeometadata.org/reference/):
+
 ```ruby
 GeoCombine::Iso19139 # ISO 19139 XML
 GeoCombine::OGP # OpenGeoPortal JSON
@@ -41,7 +44,9 @@ GeoCombine::Fgdc # FGDC XML
 GeoCombine::EsriOpenData # Esri Open Data Portal JSON
 GeoCombine::CkanMetadata # CKAN JSON
 ```
+
 An example for converting an ISO 19139 XML record:
+
 ```ruby
 # Create a new ISO19139 object
 > iso_metadata =  GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
@@ -52,7 +57,9 @@ An example for converting an ISO 19139 XML record:
 # Output it as JSON instead of a Ruby hash
 > iso_metadata.to_geoblacklight.to_json
 ```
+
 Some formats also support conversion into HTML for display in a web browser:
+
 ```ruby
 # Create a new ISO19139 object
 > iso_metadata =  GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
@@ -94,7 +101,16 @@ GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record, collection_id_map
 Some of the tools and scripts in this gem use Ruby's `Logger` class to print information to `$stderr`. By default, the log level is set to `Logger::INFO`. For more verbose information, you can set the `LOG_LEVEL` environment variable to `DEBUG`:
 
 ```sh
-$ LOG_LEVEL=DEBUG bundle exec rake geocombine:clone
+$ export LOG_LEVEL=DEBUG
+```
+
+#### Schema version
+
+By default, GeoCombine will only fetch and index records using the current schema version (Aardvark). If you instead want to fetch and index records using an older schema (e.g. because your GeoBlacklight instance is an older version), you can set the `SCHEMA_VERSION` environment variable:
+
+```sh
+# Only fetch and index schema version 1.0 records
+$ export SCHEMA_VERSION=1.0
 ```
 
 #### Clone OpenGeoMetadata repositories locally
@@ -103,7 +119,7 @@ $ LOG_LEVEL=DEBUG bundle exec rake geocombine:clone
 $ bundle exec rake geocombine:clone
 ```
 
-Will clone all `edu.*`,` org.*`, and `uk.*` OpenGeoMetadata repositories into `./tmp/opengeometadata`. Location of the OpenGeoMetadata repositories can be configured using the `OGM_PATH` environment variable.
+Will clone all OpenGeoMetadata repositories containing metadata matching the `SCHEMA_VERSION` into `./tmp/opengeometadata`. Location of the OpenGeoMetadata repositories can be configured using the `OGM_PATH` environment variable.
 
 ```sh
 $ OGM_PATH='my/custom/location' bundle exec rake geocombine:clone
@@ -158,12 +174,6 @@ You can also set the Solr instance URL using `SOLR_URL`:
 $ SOLR_URL=http://www.example.com:1234/solr/collection bundle exec rake geocombine:index
 ```
 
-By default, GeoCombine will index only records using the Aardvark metadata format. If you instead want to index records using an older format (e.g. because your GeoBlacklight instance is version 3 or older), you can set the `SCHEMA_VERSION` environment variable:
-
-```sh
-# Only index schema version 1.0 records
-$ SCHEMA_VERSION=1.0 bundle exec rake geocombine:index
-```
 ### Indexing local documents
 
 To index an arbitrary collection of records in a custom directory, run one of the following:
@@ -182,8 +192,6 @@ rake geocombine:index\[/path/to/your/file.json\]
 OGM_PATH=/path/to/your/file.json rake geocombine:index
 ```
 
-
-
 ### Harvesting and indexing documents from GeoBlacklight sites
 
 GeoCombine provides a Harvester class and rake task to harvest and index content from GeoBlacklight sites (or any site that follows the Blacklight API format). Given that the configurations can change from consumer to consumer and site to site, the class provides a relatively simple configuration API. This can be configured in an initializer, a wrapping rake task, or any other ruby context where the rake task our class would be invoked.
diff --git a/lib/geo_combine/harvester.rb b/lib/geo_combine/harvester.rb
index 3fffb3a..d025102 100644
--- a/lib/geo_combine/harvester.rb
+++ b/lib/geo_combine/harvester.rb
@@ -11,19 +11,6 @@ module GeoCombine
   class Harvester
     attr_reader :ogm_path, :schema_version
 
-    # Non-metadata repositories that shouldn't be harvested
-    def self.denylist
-      [
-        'GeoCombine',
-        'aardvark',
-        'metadata-issues',
-        'ogm_utils-python',
-        'opengeometadata.github.io',
-        'opengeometadata-rails',
-        'gbl-1_to_aardvark'
-      ]
-    end
-
     # GitHub API endpoint for OpenGeoMetadata repositories
     def self.ogm_api_uri
       URI('https://api.github.com/orgs/opengeometadata/repos?per_page=1000')
@@ -53,9 +40,16 @@ def docs_to_index
 
         doc = JSON.parse(File.read(path))
         [doc].flatten.each do |record|
-          # skip indexing if this record has a different schema version than what we want
           record_schema = record['gbl_mdVersion_s'] || record['geoblacklight_version']
           record_id = record['layer_slug_s'] || record['dc_identifier_s']
+
+          # skip indexing if no identifiable schema version
+          unless record_schema
+            @logger.debug "skipping #{record_id || path}; no schema version declared in record"
+            next
+          end
+
+          # skip indexing if this record has a different schema version than what we want
           if record_schema != @schema_version
             @logger.debug "skipping #{record_id}; schema version #{record_schema} doesn't match #{@schema_version}"
             next
@@ -87,19 +81,20 @@ def pull_all
     end
 
     # Clone a repository via git
-    # If the repository already exists, skip it.
+    # Return the name of the repository cloned, or nil if skipped
     def clone(repo)
       repo_path = File.join(@ogm_path, repo)
       repo_info = repository_info(repo)
       repo_url = "https://github.com/OpenGeoMetadata/#{repo}.git"
-
-      # Skip if exists; warn if archived or empty
-      if File.directory? repo_path
-        @logger.warn "skipping clone to #{repo_path}; directory exists"
-        return nil
+      repo_schemas = Array(repo_info.dig('custom_properties', 'supported_schemas'))
+
+      # Skip if exists, archived, empty, or different schema
+      return @logger.warn "skipping clone to #{repo_path}; directory exists" if File.directory? repo_path
+      return @logger.warn "repository is archived: #{repo_url}" if repo_info['archived']
+      return @logger.warn "repository is empty: #{repo_url}" if repo_info['size'].zero?
+      unless repo_schemas.include? @schema_version
+        return @logger.warn "skipping clone of #{repo_url} to #{repo_path}; repository properties don't include schema version #{@schema_version} (found #{repo_schemas.join(', ')})"
       end
-      @logger.warn "repository is archived: #{repo_url}" if repo_info['archived']
-      @logger.warn "repository is empty: #{repo_url}" if repo_info['size'].zero?
 
       Git.clone(repo_url, nil, path: ogm_path, depth: 1)
       @logger.info "cloned #{repo_url} to #{repo_path}"
@@ -119,10 +114,10 @@ def clone_all
     # List of repository names to harvest
     def repositories
       @repositories ||= JSON.parse(Net::HTTP.get(self.class.ogm_api_uri))
+                            .filter { |repo| Array(repo.dig('custom_properties', 'supported_schemas')).include? @schema_version }
                             .filter { |repo| repo['size'].positive? }
                             .reject { |repo| repo['archived'] }
                             .map { |repo| repo['name'] }
-                            .reject { |name| self.class.denylist.include? name }
     end
 
     def repository_info(repo_name)
diff --git a/spec/lib/geo_combine/harvester_spec.rb b/spec/lib/geo_combine/harvester_spec.rb
index 80781f9..d75b91b 100644
--- a/spec/lib/geo_combine/harvester_spec.rb
+++ b/spec/lib/geo_combine/harvester_spec.rb
@@ -5,7 +5,7 @@
 require 'spec_helper'
 
 RSpec.describe GeoCombine::Harvester do
-  subject(:harvester) { described_class.new(ogm_path: 'spec/fixtures/indexing', schema_version: '1.0') }
+  subject(:harvester) { described_class.new(ogm_path: 'spec/fixtures/indexing', logger: logger) }
 
   let(:logger) { instance_double(Logger, warn: nil, info: nil, error: nil, debug: nil) }
   let(:repo_name) { 'my-institution' }
@@ -14,11 +14,12 @@
   let(:stub_repo) { instance_double(Git::Base) }
   let(:stub_gh_api) do
     [
-      { name: repo_name, size: 100 },
-      { name: 'another-institution', size: 100 },
-      { name: 'outdated-institution', size: 100, archived: true }, # archived
-      { name: 'aardvark', size: 300 },                             # on denylist
-      { name: 'empty', size: 0 }                                   # no data
+      { name: repo_name, size: 100, custom_properties: { supported_schemas: ['Aardvark'] } },
+      { name: 'another-institution', size: 100, custom_properties: { supported_schemas: ['Aardvark', '1.0'] } }, # multiple schemas
+      { name: 'v1-institution', size: 300, custom_properties: { supported_schemas: ['1.0'] } }, # schema mismatch
+      { name: 'outdated-institution', size: 100, custom_properties: { supported_schemas: ['Aardvark'] }, archived: true }, # archived
+      { name: 'empty', size: 0, custom_properties: { supported_schemas: ['Aardvark'] } }, # no data
+      { name: 'tool', size: 50 } # not a metadata repository
     ]
   end
 
@@ -42,15 +43,15 @@
   describe '#docs_to_index' do
     it 'yields each JSON record with its path, skipping layers.JSON' do
       expect { |b| harvester.docs_to_index(&b) }.to yield_successive_args(
-        [JSON.parse(File.read('spec/fixtures/indexing/basic_geoblacklight.json')), 'spec/fixtures/indexing/basic_geoblacklight.json'],
-        [JSON.parse(File.read('spec/fixtures/indexing/geoblacklight.json')), 'spec/fixtures/indexing/geoblacklight.json']
+        [JSON.parse(File.read('spec/fixtures/indexing/aardvark.json')), 'spec/fixtures/indexing/aardvark.json']
       )
     end
 
-    it 'skips records with a different schema version' do
-      harvester = described_class.new(ogm_path: 'spec/fixtures/indexing/', schema_version: 'Aardvark', logger:)
+    it 'can yield JSON records for a different schema version' do
+      harvester = described_class.new(ogm_path: 'spec/fixtures/indexing/', schema_version: '1.0', logger:)
       expect { |b| harvester.docs_to_index(&b) }.to yield_successive_args(
-        [JSON.parse(File.read('spec/fixtures/indexing/aardvark.json')), 'spec/fixtures/indexing/aardvark.json']
+        [JSON.parse(File.read('spec/fixtures/indexing/basic_geoblacklight.json')), 'spec/fixtures/indexing/basic_geoblacklight.json'],
+        [JSON.parse(File.read('spec/fixtures/indexing/geoblacklight.json')), 'spec/fixtures/indexing/geoblacklight.json']
       )
     end
   end
@@ -79,15 +80,20 @@
       expect(harvester.pull_all).to eq(%w[my-institution another-institution])
     end
 
-    it 'skips repositories in the denylist' do
+    it 'skips repositories with no schema declared' do
       harvester.pull_all
-      expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/aardvark.git')
+      expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/tool.git')
     end
 
     it 'skips archived repositories' do
       harvester.pull_all
       expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/outdated-institution.git')
     end
+
+    it 'skips repositories with no data' do
+      harvester.pull_all
+      expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/empty.git')
+    end
   end
 
   describe '#clone' do
@@ -115,9 +121,19 @@
       expect(Git).to have_received(:clone).exactly(2).times
     end
 
-    it 'skips repositories in the denylist' do
+    it 'skips repositories with no schema declared' do
+      harvester.clone_all
+      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/tool.git')
+    end
+
+    it 'skips archived repositories' do
+      harvester.clone_all
+      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/outdated-institution.git')
+    end
+
+    it 'skips repositories with no data' do
       harvester.clone_all
-      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/aardvark.git')
+      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/empty.git')
     end
 
     it 'returns the names of repositories cloned' do