From 850723e776f549a9544a0ec80500ec6446085f32 Mon Sep 17 00:00:00 2001
From: Nick Budak
Date: Wed, 17 Jun 2026 14:35:48 -0700
Subject: [PATCH] Filter repositories by declared schema using repository properties

This updates the clone and pull tasks to only apply to repositories
that declare their conformance with a particular OGM schema version
(v1.0 or Aardvark). This is implemented via a repository custom
property, which is set at the organization level and can be enabled
on a per-repository basis. Unlike repository topics, this value is a
controlled vocabulary and is unique to the OGM organization.

This functionality replaces the "denylist" that was a hardcoded list
of repositories that shouldn't be harvested (e.g. because they were
tools or didn't contain metadata). In this version, providers can
opt-in to harvesting by adding the relevant schema property.

After this change, you no longer need to clone repositories that
don't match the schema version you are using, which saves disk space
and processing time.
---
 .rubocop_todo.yml                      |  2 +-
 README.md                              | 28 ++++++++++------
 lib/geo_combine/harvester.rb           | 41 ++++++++++-------------
 spec/lib/geo_combine/harvester_spec.rb | 46 +++++++++++++++++---------
 4 files changed, 68 insertions(+), 49 deletions(-)

diff --git a/.rubocop_todo.yml b/.rubocop_todo.yml
index 963a40c..ffb030c 100644
--- a/.rubocop_todo.yml
+++ b/.rubocop_todo.yml
@@ -41,7 +41,7 @@ Metrics/CyclomaticComplexity:
 # Offense count: 13
 # Configuration parameters: CountComments, CountAsOne, AllowedMethods, AllowedPatterns.
 Metrics/MethodLength:
-  Max: 21
+  Max: 25
 
 # Offense count: 2
 # Configuration parameters: AllowedMethods, AllowedPatterns.
diff --git a/README.md b/README.md
index 9d4222c..443adc4 100644
--- a/README.md
+++ b/README.md
@@ -32,8 +32,11 @@ $ gem install geo_combine
 ## Usage
 
 ### Converting metadata
+
 #### Converting metadata into GeoBlacklight JSON
+
 GeoCombine provides several classes representing different metadata standards that implement the `#to_geoblacklight` method for generating records in the [GeoBlacklight JSON format](https://opengeometadata.org/reference/):
+
 ```ruby
 GeoCombine::Iso19139 # ISO 19139 XML
 GeoCombine::OGP # OpenGeoPortal JSON
@@ -41,7 +44,9 @@ GeoCombine::Fgdc # FGDC XML
 GeoCombine::EsriOpenData # Esri Open Data Portal JSON
 GeoCombine::CkanMetadata # CKAN JSON
 ```
+
 An example for converting an ISO 19139 XML record:
+
 ```ruby
 # Create a new ISO19139 object
 > iso_metadata = GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
@@ -52,7 +57,9 @@ An example for converting an ISO 19139 XML record:
 # Output it as JSON instead of a Ruby hash
 > iso_metadata.to_geoblacklight.to_json
 ```
+
 Some formats also support conversion into HTML for display in a web browser:
+
 ```ruby
 # Create a new ISO19139 object
 > iso_metadata = GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
@@ -94,7 +101,16 @@ GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record, collection_id_map
 Some of the tools and scripts in this gem use Ruby's `Logger` class to print information to `$stderr`. By default, the log level is set to `Logger::INFO`. For more verbose information, you can set the `LOG_LEVEL` environment variable to `DEBUG`:
 
 ```sh
-$ LOG_LEVEL=DEBUG bundle exec rake geocombine:clone
+$ export LOG_LEVEL=DEBUG
+```
+
+#### Schema version
+
+By default, GeoCombine will only fetch and index records using the current schema version. If you instead want to index records using an older schema (e.g. because your GeoBlacklight instance is an older version), you can set the `SCHEMA_VERSION` environment variable:
+
+```sh
+# Only fetch and index schema version 1.0 records
+$ export SCHEMA_VERSION=1.0
 ```
 
 #### Clone OpenGeoMetadata repositories locally
@@ -103,7 +119,7 @@ $ LOG_LEVEL=DEBUG bundle exec rake geocombine:clone
 $ bundle exec rake geocombine:clone
 ```
 
-Will clone all `edu.*`,` org.*`, and `uk.*` OpenGeoMetadata repositories into `./tmp/opengeometadata`. Location of the OpenGeoMetadata repositories can be configured using the `OGM_PATH` environment variable.
+Will clone all OpenGeoMetadata repositories containing metadata matching the `SCHEMA_VERSION` into `./tmp/opengeometadata`. Location of the OpenGeoMetadata repositories can be configured using the `OGM_PATH` environment variable.
 
 ```sh
 $ OGM_PATH='my/custom/location' bundle exec rake geocombine:clone
@@ -158,12 +174,6 @@ You can also set the Solr instance URL using `SOLR_URL`:
 $ SOLR_URL=http://www.example.com:1234/solr/collection bundle exec rake geocombine:index
 ```
 
-By default, GeoCombine will index only records using the Aardvark metadata format. If you instead want to index records using an older format (e.g. because your GeoBlacklight instance is version 3 or older), you can set the `SCHEMA_VERSION` environment variable:
-
-```sh
-# Only index schema version 1.0 records
-$ SCHEMA_VERSION=1.0 bundle exec rake geocombine:index
-```
 ### Indexing local documents
 
 To index an arbitrary collection of records in a custom directory, run one of the following:
@@ -182,8 +192,6 @@ rake geocombine:index\[/path/to/your/file.json\]
 OGM_PATH=/path/to/your/file.json rake geocombine:index
 ```
 
-
-
 ### Harvesting and indexing documents from GeoBlacklight sites
 
 GeoCombine provides a Harvester class and rake task to harvest and index content from GeoBlacklight sites (or any site that follows the Blacklight API format). Given that the configurations can change from consumer to consumer and site to site, the class provides a relatively simple configuration API. This can be configured in an initializer, a wrapping rake task, or any other ruby context where the rake task our class would be invoked.
diff --git a/lib/geo_combine/harvester.rb b/lib/geo_combine/harvester.rb
index 3fffb3a..d025102 100644
--- a/lib/geo_combine/harvester.rb
+++ b/lib/geo_combine/harvester.rb
@@ -11,19 +11,6 @@ module GeoCombine
   class Harvester
     attr_reader :ogm_path, :schema_version
 
-    # Non-metadata repositories that shouldn't be harvested
-    def self.denylist
-      [
-        'GeoCombine',
-        'aardvark',
-        'metadata-issues',
-        'ogm_utils-python',
-        'opengeometadata.github.io',
-        'opengeometadata-rails',
-        'gbl-1_to_aardvark'
-      ]
-    end
-
     # GitHub API endpoint for OpenGeoMetadata repositories
     def self.ogm_api_uri
       URI('https://api.github.com/orgs/opengeometadata/repos?per_page=1000')
@@ -53,9 +40,16 @@ def docs_to_index
 
         doc = JSON.parse(File.read(path))
         [doc].flatten.each do |record|
-          # skip indexing if this record has a different schema version than what we want
           record_schema = record['gbl_mdVersion_s'] || record['geoblacklight_version']
           record_id = record['layer_slug_s'] || record['dc_identifier_s']
+
+          # skip indexing if no identifiable schema version
+          unless record_schema
+            @logger.debug "skipping #{record_id || path}; no schema version declared in record"
+            next
+          end
+
+          # skip indexing if this record has a different schema version than what we want
           if record_schema != @schema_version
             @logger.debug "skipping #{record_id}; schema version #{record_schema} doesn't match #{@schema_version}"
             next
@@ -87,19 +81,20 @@ def pull_all
     end
 
     # Clone a repository via git
-    # If the repository already exists, skip it.
+    # Return the name of the repository cloned, or nil if skipped
     def clone(repo)
       repo_path = File.join(@ogm_path, repo)
       repo_info = repository_info(repo)
       repo_url = "https://github.com/OpenGeoMetadata/#{repo}.git"
-
-      # Skip if exists; warn if archived or empty
-      if File.directory? repo_path
-        @logger.warn "skipping clone to #{repo_path}; directory exists"
-        return nil
+      repo_schemas = Array(repo_info.dig('custom_properties', 'supported_schemas'))
+
+      # Skip if exists, archived, empty, or different schema
+      return @logger.warn "skipping clone to #{repo_path}; directory exists" if File.directory? repo_path
+      return @logger.warn "repository is archived: #{repo_url}" if repo_info['archived']
+      return @logger.warn "repository is empty: #{repo_url}" if repo_info['size'].zero?
+      unless repo_schemas.include? @schema_version
+        return @logger.warn "repository #{repo_url} clone to #{repo_path}; repository properties don't include schema version #{@schema_version} (found #{repo_schemas.join(', ')})"
       end
-      @logger.warn "repository is archived: #{repo_url}" if repo_info['archived']
-      @logger.warn "repository is empty: #{repo_url}" if repo_info['size'].zero?
 
       Git.clone(repo_url, nil, path: ogm_path, depth: 1)
       @logger.info "cloned #{repo_url} to #{repo_path}"
@@ -119,10 +114,10 @@ def clone_all
     # List of repository names to harvest
     def repositories
       @repositories ||= JSON.parse(Net::HTTP.get(self.class.ogm_api_uri))
+                            .filter { |repo| Array(repo.dig('custom_properties', 'supported_schemas')).include? @schema_version }
                             .filter { |repo| repo['size'].positive? }
                             .reject { |repo| repo['archived'] }
                             .map { |repo| repo['name'] }
-                            .reject { |name| self.class.denylist.include? name }
     end
 
     def repository_info(repo_name)
diff --git a/spec/lib/geo_combine/harvester_spec.rb b/spec/lib/geo_combine/harvester_spec.rb
index 80781f9..d75b91b 100644
--- a/spec/lib/geo_combine/harvester_spec.rb
+++ b/spec/lib/geo_combine/harvester_spec.rb
@@ -5,7 +5,7 @@
 require 'spec_helper'
 
 RSpec.describe GeoCombine::Harvester do
-  subject(:harvester) { described_class.new(ogm_path: 'spec/fixtures/indexing', schema_version: '1.0') }
+  subject(:harvester) { described_class.new(ogm_path: 'spec/fixtures/indexing', logger: logger) }
 
   let(:logger) { instance_double(Logger, warn: nil, info: nil, error: nil, debug: nil) }
   let(:repo_name) { 'my-institution' }
@@ -14,11 +14,12 @@
   let(:stub_repo) { instance_double(Git::Base) }
   let(:stub_gh_api) do
     [
-      { name: repo_name, size: 100 },
-      { name: 'another-institution', size: 100 },
-      { name: 'outdated-institution', size: 100, archived: true }, # archived
-      { name: 'aardvark', size: 300 }, # on denylist
-      { name: 'empty', size: 0 } # no data
+      { name: repo_name, size: 100, custom_properties: { supported_schemas: ['Aardvark'] } },
+      { name: 'another-institution', size: 100, custom_properties: { supported_schemas: ['Aardvark', '1.0'] } }, # multiple schemas
+      { name: 'v1-institution', size: 300, custom_properties: { supported_schemas: ['1.0'] } }, # schema mismatch
+      { name: 'outdated-institution', size: 100, custom_properties: { supported_schemas: ['Aardvark'] }, archived: true }, # archived
+      { name: 'empty', size: 0, custom_properties: { supported_schemas: ['Aardvark'] } }, # no data
+      { name: 'tool', size: 50 } # not a metadata repository
     ]
   end
 
@@ -42,15 +43,15 @@
   describe '#docs_to_index' do
     it 'yields each JSON record with its path, skipping layers.JSON' do
       expect { |b| harvester.docs_to_index(&b) }.to yield_successive_args(
-        [JSON.parse(File.read('spec/fixtures/indexing/basic_geoblacklight.json')), 'spec/fixtures/indexing/basic_geoblacklight.json'],
-        [JSON.parse(File.read('spec/fixtures/indexing/geoblacklight.json')), 'spec/fixtures/indexing/geoblacklight.json']
+        [JSON.parse(File.read('spec/fixtures/indexing/aardvark.json')), 'spec/fixtures/indexing/aardvark.json']
       )
     end
 
-    it 'skips records with a different schema version' do
-      harvester = described_class.new(ogm_path: 'spec/fixtures/indexing/', schema_version: 'Aardvark', logger:)
+    it 'can yield JSON records for a different schema version' do
+      harvester = described_class.new(ogm_path: 'spec/fixtures/indexing/', schema_version: '1.0', logger:)
       expect { |b| harvester.docs_to_index(&b) }.to yield_successive_args(
-        [JSON.parse(File.read('spec/fixtures/indexing/aardvark.json')), 'spec/fixtures/indexing/aardvark.json']
+        [JSON.parse(File.read('spec/fixtures/indexing/basic_geoblacklight.json')), 'spec/fixtures/indexing/basic_geoblacklight.json'],
+        [JSON.parse(File.read('spec/fixtures/indexing/geoblacklight.json')), 'spec/fixtures/indexing/geoblacklight.json']
       )
     end
   end
@@ -79,15 +80,20 @@
       expect(harvester.pull_all).to eq(%w[my-institution another-institution])
     end
 
-    it 'skips repositories in the denylist' do
+    it 'skips repositories with no schema declared' do
       harvester.pull_all
-      expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/aardvark.git')
+      expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/tool.git')
     end
 
     it 'skips archived repositories' do
       harvester.pull_all
       expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/outdated-institution.git')
     end
+
+    it 'skips repositories with no data' do
+      harvester.pull_all
+      expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/empty.git')
+    end
   end
 
   describe '#clone' do
@@ -115,9 +121,19 @@
       expect(Git).to have_received(:clone).exactly(2).times
     end
 
-    it 'skips repositories in the denylist' do
+    it 'skips repositories with no schema declared' do
+      harvester.clone_all
+      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/tool.git')
+    end
+
+    it 'skips archived repositories' do
+      harvester.clone_all
+      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/outdated-institution.git')
+    end
+
+    it 'skips repositories with no data' do
       harvester.clone_all
-      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/aardvark.git')
+      expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/empty.git')
     end
 
     it 'returns the names of repositories cloned' do