2 changes: 1 addition & 1 deletion .rubocop_todo.yml
@@ -41,7 +41,7 @@ Metrics/CyclomaticComplexity:
# Offense count: 13
# Configuration parameters: CountComments, CountAsOne, AllowedMethods, AllowedPatterns.
Metrics/MethodLength:
Max: 21
Max: 25

# Offense count: 2
# Configuration parameters: AllowedMethods, AllowedPatterns.
28 changes: 18 additions & 10 deletions README.md
@@ -32,16 +32,21 @@ $ gem install geo_combine
## Usage

### Converting metadata

#### Converting metadata into GeoBlacklight JSON

GeoCombine provides several classes representing different metadata standards that implement the `#to_geoblacklight` method for generating records in the [GeoBlacklight JSON format](https://opengeometadata.org/reference/):

```ruby
GeoCombine::Iso19139 # ISO 19139 XML
GeoCombine::OGP # OpenGeoPortal JSON
GeoCombine::Fgdc # FGDC XML
GeoCombine::EsriOpenData # Esri Open Data Portal JSON
GeoCombine::CkanMetadata # CKAN JSON
```

An example for converting an ISO 19139 XML record:

```ruby
# Create a new ISO19139 object
> iso_metadata = GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
@@ -52,7 +57,9 @@ An example for converting an ISO 19139 XML record:
# Output it as JSON instead of a Ruby hash
> iso_metadata.to_geoblacklight.to_json
```

Some formats also support conversion into HTML for display in a web browser:

```ruby
# Create a new ISO19139 object
> iso_metadata = GeoCombine::Iso19139.new('./tmp/opengeometadata/edu.stanford.purl/bb/338/jh/0716/iso19139.xml')
@@ -94,7 +101,16 @@ GeoCombine::Migrators::V1AardvarkMigrator.new(v1_hash: record, collection_id_map
Some of the tools and scripts in this gem use Ruby's `Logger` class to print information to `$stderr`. By default, the log level is set to `Logger::INFO`. For more verbose information, you can set the `LOG_LEVEL` environment variable to `DEBUG`:

```sh
$ LOG_LEVEL=DEBUG bundle exec rake geocombine:clone
$ export LOG_LEVEL=DEBUG
```
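
As a rough illustration of how a `LOG_LEVEL` variable can drive a `Logger`, here is a minimal sketch. The `build_logger` helper and the `LEVELS` table are hypothetical names used only for this example; they are not taken from the gem, which may wire up its logger differently.

```ruby
require 'logger'

# Hypothetical lookup table from LOG_LEVEL names to Logger severities.
LEVELS = {
  'DEBUG' => Logger::DEBUG,
  'INFO' => Logger::INFO,
  'WARN' => Logger::WARN,
  'ERROR' => Logger::ERROR,
  'FATAL' => Logger::FATAL
}.freeze

# Hypothetical helper: build a logger that writes to $stderr, defaulting to
# INFO when LOG_LEVEL is unset or not a recognized name.
def build_logger(env = ENV, io = $stderr)
  name = env.fetch('LOG_LEVEL', 'INFO').upcase
  Logger.new(io, level: LEVELS.fetch(name, Logger::INFO))
end

logger = build_logger
logger.debug 'only printed when LOG_LEVEL=DEBUG'
logger.info 'printed at the default level'
```

Falling back to `Logger::INFO` for unknown names keeps a typo in the environment from raising at startup; a stricter tool might prefer to fail loudly instead.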

#### Schema version

By default, GeoCombine will only fetch and index records using the current schema version. If you instead want to index records using an older schema (e.g. because your GeoBlacklight instance is an older version), you can set the `SCHEMA_VERSION` environment variable:

```sh
# Only fetch and index schema version 1.0 records
$ export SCHEMA_VERSION=1.0
```
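
To make "matching a schema version" concrete at the record level, the sketch below filters parsed records by the version they declare in `gbl_mdVersion_s` or `geoblacklight_version`. The helper names `record_schema` and `matching_records` and the sample records are assumptions made for illustration, not part of the gem's API.

```ruby
require 'json'

# Hypothetical helper: the schema version a record declares, or nil if none.
def record_schema(record)
  record['gbl_mdVersion_s'] || record['geoblacklight_version']
end

# Hypothetical helper: keep only records declaring the wanted version.
# Records that declare no version never match and are dropped.
def matching_records(records, schema_version)
  records.select { |record| record_schema(record) == schema_version }
end

# Made-up sample data: one Aardvark record, one 1.0 record, one undeclared.
records = JSON.parse(<<~JSON)
  [
    { "id": "a", "gbl_mdVersion_s": "Aardvark" },
    { "layer_slug_s": "b", "geoblacklight_version": "1.0" },
    { "dc_identifier_s": "c" }
  ]
JSON

aardvark_records = matching_records(records, 'Aardvark')
v1_records = matching_records(records, '1.0')
puts "Aardvark: #{aardvark_records.length}, 1.0: #{v1_records.length}"
```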

#### Clone OpenGeoMetadata repositories locally
@@ -103,7 +119,7 @@ $ LOG_LEVEL=DEBUG bundle exec rake geocombine:clone
$ bundle exec rake geocombine:clone
```

Will clone all `edu.*`,` org.*`, and `uk.*` OpenGeoMetadata repositories into `./tmp/opengeometadata`. Location of the OpenGeoMetadata repositories can be configured using the `OGM_PATH` environment variable.
This clones all OpenGeoMetadata repositories containing metadata that matches the `SCHEMA_VERSION` into `./tmp/opengeometadata`. The location of the cloned repositories can be configured using the `OGM_PATH` environment variable.
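
The repository selection can be sketched as a filter over the repository list returned by the GitHub API, keeping repositories whose `supported_schemas` custom property includes the wanted version and dropping empty or archived ones. The `harvestable_names` helper and the sample payload below are hypothetical and stand in for a real API response.

```ruby
require 'json'

# Hypothetical helper: names of repositories worth cloning for a schema version.
def harvestable_names(repos, schema_version)
  repos
    .select { |repo| Array(repo.dig('custom_properties', 'supported_schemas')).include?(schema_version) }
    .select { |repo| repo['size'].to_i.positive? } # skip repositories with no data
    .reject { |repo| repo['archived'] }            # skip archived repositories
    .map { |repo| repo['name'] }
end

# Made-up payload shaped like a list of GitHub repository objects.
repos = JSON.parse(<<~JSON)
  [
    { "name": "my-institution", "size": 100, "custom_properties": { "supported_schemas": ["Aardvark"] } },
    { "name": "another-institution", "size": 100, "custom_properties": { "supported_schemas": ["Aardvark", "1.0"] } },
    { "name": "v1-institution", "size": 300, "custom_properties": { "supported_schemas": ["1.0"] } },
    { "name": "outdated-institution", "size": 100, "archived": true, "custom_properties": { "supported_schemas": ["Aardvark"] } },
    { "name": "empty", "size": 0, "custom_properties": { "supported_schemas": ["Aardvark"] } },
    { "name": "tool", "size": 50 }
  ]
JSON

aardvark_names = harvestable_names(repos, 'Aardvark')
v1_names = harvestable_names(repos, '1.0')
puts aardvark_names.inspect
puts v1_names.inspect
```

Wrapping the property in `Array(...)` means a repository with no `custom_properties` at all (such as a tooling repository) is treated as supporting no schemas and is skipped without a separate denylist.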

```sh
$ OGM_PATH='my/custom/location' bundle exec rake geocombine:clone
@@ -158,12 +174,6 @@ You can also set the Solr instance URL using `SOLR_URL`:
$ SOLR_URL=http://www.example.com:1234/solr/collection bundle exec rake geocombine:index
```

By default, GeoCombine will index only records using the Aardvark metadata format. If you instead want to index records using an older format (e.g. because your GeoBlacklight instance is version 3 or older), you can set the `SCHEMA_VERSION` environment variable:

```sh
# Only index schema version 1.0 records
$ SCHEMA_VERSION=1.0 bundle exec rake geocombine:index
```
### Indexing local documents

To index an arbitrary collection of records in a custom directory, run one of the following:
@@ -182,8 +192,6 @@ rake geocombine:index\[/path/to/your/file.json\]
OGM_PATH=/path/to/your/file.json rake geocombine:index
```



### Harvesting and indexing documents from GeoBlacklight sites

GeoCombine provides a Harvester class and rake task to harvest and index content from GeoBlacklight sites (or any site that follows the Blacklight API format). Given that the configurations can change from consumer to consumer and site to site, the class provides a relatively simple configuration API. This can be configured in an initializer, a wrapping rake task, or any other Ruby context where the rake task or class would be invoked.
41 changes: 18 additions & 23 deletions lib/geo_combine/harvester.rb
@@ -11,19 +11,6 @@ module GeoCombine
class Harvester
attr_reader :ogm_path, :schema_version

# Non-metadata repositories that shouldn't be harvested
def self.denylist
[
'GeoCombine',
'aardvark',
'metadata-issues',
'ogm_utils-python',
'opengeometadata.github.io',
'opengeometadata-rails',
'gbl-1_to_aardvark'
]
end

# GitHub API endpoint for OpenGeoMetadata repositories
def self.ogm_api_uri
URI('https://api.github.com/orgs/opengeometadata/repos?per_page=1000')
@@ -53,9 +40,16 @@ def docs_to_index

doc = JSON.parse(File.read(path))
[doc].flatten.each do |record|
# skip indexing if this record has a different schema version than what we want
record_schema = record['gbl_mdVersion_s'] || record['geoblacklight_version']
record_id = record['layer_slug_s'] || record['dc_identifier_s']

# skip indexing if no identifiable schema version
unless record_schema
@logger.debug "skipping #{record_id || path}; no schema version declared in record"
next
end

# skip indexing if this record has a different schema version than what we want
if record_schema != @schema_version
@logger.debug "skipping #{record_id}; schema version #{record_schema} doesn't match #{@schema_version}"
next
@@ -87,19 +81,20 @@ def pull_all
end

# Clone a repository via git
# If the repository already exists, skip it.
# Return the name of the repository cloned, or nil if skipped
def clone(repo)
repo_path = File.join(@ogm_path, repo)
repo_info = repository_info(repo)
repo_url = "https://github.com/OpenGeoMetadata/#{repo}.git"

# Skip if exists; warn if archived or empty
if File.directory? repo_path
@logger.warn "skipping clone to #{repo_path}; directory exists"
return nil
repo_schemas = Array(repo_info.dig('custom_properties', 'supported_schemas'))

# Skip if exists, archived, empty, or different schema
return @logger.warn "skipping clone to #{repo_path}; directory exists" if File.directory? repo_path
return @logger.warn "repository is archived: #{repo_url}" if repo_info['archived']
return @logger.warn "repository is empty: #{repo_url}" if repo_info['size'].zero?
unless repo_schemas.include? @schema_version
return @logger.warn "skipping clone of #{repo_url} to #{repo_path}; repository properties don't include schema version #{@schema_version} (found #{repo_schemas.join(', ')})"
end
@logger.warn "repository is archived: #{repo_url}" if repo_info['archived']
@logger.warn "repository is empty: #{repo_url}" if repo_info['size'].zero?

Git.clone(repo_url, nil, path: ogm_path, depth: 1)
@logger.info "cloned #{repo_url} to #{repo_path}"
@@ -119,10 +114,10 @@ def clone_all
# List of repository names to harvest
def repositories
@repositories ||= JSON.parse(Net::HTTP.get(self.class.ogm_api_uri))
.filter { |repo| Array(repo.dig('custom_properties', 'supported_schemas')).include? @schema_version }
.filter { |repo| repo['size'].positive? }
.reject { |repo| repo['archived'] }
.map { |repo| repo['name'] }
.reject { |name| self.class.denylist.include? name }
end

def repository_info(repo_name)
46 changes: 31 additions & 15 deletions spec/lib/geo_combine/harvester_spec.rb
@@ -5,7 +5,7 @@
require 'spec_helper'

RSpec.describe GeoCombine::Harvester do
subject(:harvester) { described_class.new(ogm_path: 'spec/fixtures/indexing', schema_version: '1.0') }
subject(:harvester) { described_class.new(ogm_path: 'spec/fixtures/indexing', logger: logger) }

let(:logger) { instance_double(Logger, warn: nil, info: nil, error: nil, debug: nil) }
let(:repo_name) { 'my-institution' }
@@ -14,11 +14,12 @@
let(:stub_repo) { instance_double(Git::Base) }
let(:stub_gh_api) do
[
{ name: repo_name, size: 100 },
{ name: 'another-institution', size: 100 },
{ name: 'outdated-institution', size: 100, archived: true }, # archived
{ name: 'aardvark', size: 300 }, # on denylist
{ name: 'empty', size: 0 } # no data
{ name: repo_name, size: 100, custom_properties: { supported_schemas: ['Aardvark'] } },
{ name: 'another-institution', size: 100, custom_properties: { supported_schemas: ['Aardvark', '1.0'] } }, # multiple schemas
{ name: 'v1-institution', size: 300, custom_properties: { supported_schemas: ['1.0'] } }, # schema mismatch
{ name: 'outdated-institution', size: 100, custom_properties: { supported_schemas: ['Aardvark'] }, archived: true }, # archived
{ name: 'empty', size: 0, custom_properties: { supported_schemas: ['Aardvark'] } }, # no data
{ name: 'tool', size: 50 } # not a metadata repository
]
end

@@ -42,15 +43,15 @@
describe '#docs_to_index' do
it 'yields each JSON record with its path, skipping layers.JSON' do
expect { |b| harvester.docs_to_index(&b) }.to yield_successive_args(
[JSON.parse(File.read('spec/fixtures/indexing/basic_geoblacklight.json')), 'spec/fixtures/indexing/basic_geoblacklight.json'],
[JSON.parse(File.read('spec/fixtures/indexing/geoblacklight.json')), 'spec/fixtures/indexing/geoblacklight.json']
[JSON.parse(File.read('spec/fixtures/indexing/aardvark.json')), 'spec/fixtures/indexing/aardvark.json']
)
end

it 'skips records with a different schema version' do
harvester = described_class.new(ogm_path: 'spec/fixtures/indexing/', schema_version: 'Aardvark', logger:)
it 'can yield JSON records for a different schema version' do
harvester = described_class.new(ogm_path: 'spec/fixtures/indexing/', schema_version: '1.0', logger:)
expect { |b| harvester.docs_to_index(&b) }.to yield_successive_args(
[JSON.parse(File.read('spec/fixtures/indexing/aardvark.json')), 'spec/fixtures/indexing/aardvark.json']
[JSON.parse(File.read('spec/fixtures/indexing/basic_geoblacklight.json')), 'spec/fixtures/indexing/basic_geoblacklight.json'],
[JSON.parse(File.read('spec/fixtures/indexing/geoblacklight.json')), 'spec/fixtures/indexing/geoblacklight.json']
)
end
end
@@ -79,15 +80,20 @@
expect(harvester.pull_all).to eq(%w[my-institution another-institution])
end

it 'skips repositories in the denylist' do
it 'skips repositories with no schema declared' do
harvester.pull_all
expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/aardvark.git')
expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/tool.git')
end

it 'skips archived repositories' do
harvester.pull_all
expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/outdated-institution.git')
end

it 'skips repositories with no data' do
harvester.pull_all
expect(Git).not_to have_received(:open).with('https://github.com/OpenGeoMetadata/empty.git')
end
end

describe '#clone' do
@@ -115,9 +121,19 @@
expect(Git).to have_received(:clone).exactly(2).times
end

it 'skips repositories in the denylist' do
it 'skips repositories with no schema declared' do
harvester.clone_all
expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/tool.git')
end

it 'skips archived repositories' do
harvester.clone_all
expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/outdated-institution.git')
end

it 'skips repositories with no data' do
harvester.clone_all
expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/aardvark.git')
expect(Git).not_to have_received(:clone).with('https://github.com/OpenGeoMetadata/empty.git')
end

it 'returns the names of repositories cloned' do