VertNet Modularization process (WIP)

Javier Otegui edited this page Apr 21, 2016 · 3 revisions

VertNet Modularization How-To

  1. Rationale
  2. Proposed layout overview
  3. Shared components
    1. Datastore
    2. yaml files
  4. Module api (default)
    1. Method search
    2. Method download
    3. Version prod (default)
    4. Version dev
  5. Module indexer
  6. Module portal-web
    1. Version vertnet (default)
    2. Version tuco / dev / testing...
    3. Version amazonia
    4. Version amazonia-dev
  7. Module tools-*

Rationale

Proposed layout overview

I propose a star-schema layout, with a central module that serves data to accessory modules and packages.

The advantage of modularizing the project in this way is that each module will be isolated from the others, to a certain extent, both in software and in (virtual) hardware. This means heavy-duty modules (such as the api or the indexer) can be configured to run on higher-class instances, while more modest instances are enough for lighter modules (such as the emailer). Also, not all modules need the same libraries and components.

There are several shared components, though, and that is a good thing. Each one is listed and described below.

Shared components

Datastore

All versions of all modules share the datastore. That means that potentially every open endpoint could be used to access the underlying data, but it also means that there is no need to duplicate datastores to serve different sub-projects such as Dimensions of Amazonia or DIPnet.

yaml files

For a specific App Engine application (in our case, vertnet-portal), there must be only one of each of the following .yaml configuration files:

  • app.yaml: shared application configuration
  • queue.yaml: definition of task queues (such as download, usage stats, ...)
  • index.yaml: definition of non-standard indexes for datastore queries (such as descending orders, multi-property filters, ...)
  • cron.yaml: definition of cron tasks (such as monthly usage stats)
  • dispatch.yaml: definition of URL routing "aliases"

Module api (default)

A single api module will be the core element of the project in the broader sense (encompassing amazonia, dipnet and any other network). This module will be the central endpoint for all data-related operations, such as providing basic CRUD capabilities, launching download-building processes and any other data-intensive task. All other modules will refer to this one for such operations, meaning no other module will access the datastore directly. The main benefit is more maintainable code, since all functions that extract data will be found in this module.

Every project will access the underlying datastore via this module, and data fragmentation (making each project get only its own data) will be performed at this level, by filtering on the value of the networks field.

This also allows external packages (such as rvertnet) to access the data directly by querying the api module, ensuring they always use the latest available version of the search and download methods.
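As a minimal sketch of the filtering described above, assuming each record carries a networks field holding a list of network names (the record layout, the ids and the function name are hypothetical, not taken from the VertNet code):

```python
def filter_by_network(records, network):
    """Return only the records whose 'networks' field includes `network`."""
    # Assumption: 'networks' is a list of network names; records without
    # the field are treated as belonging to no network.
    return [r for r in records if network in r.get("networks", [])]


# Hypothetical sample records, for illustration only.
records = [
    {"id": "rec-1", "networks": ["vertnet"]},
    {"id": "rec-2", "networks": ["vertnet", "amazonia"]},
    {"id": "rec-3", "networks": ["dipnet"]},
]

amazonia_records = filter_by_network(records, "amazonia")
```

With this approach a record can belong to several networks at once, so the same stored data can serve more than one project.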

Method search

Basic method to retrieve data from the datastore based on the parameters sent with the request. The URL could be something like http://api.vertnet-portal.appspot.com/search?<parameters>

Method download

Basic method to build download files with data from the datastore based on the parameters sent with the request. The URL could be something like http://api.vertnet-portal.appspot.com/download?<parameters>
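A client could assemble the request URLs for both methods along these lines. This is only a sketch in Python 3: the parameter names (q, limit) are placeholders, since the actual parameters are not defined here, and build_api_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Base URL taken from the examples above.
API_BASE = "http://api.vertnet-portal.appspot.com"


def build_api_url(method, params):
    """Build the URL for one of the two proposed api methods."""
    if method not in ("search", "download"):
        raise ValueError("unknown api method: {}".format(method))
    # Sort the parameters so the same request always yields the same URL.
    query = urlencode(sorted(params.items()))
    return "{}/{}?{}".format(API_BASE, method, query)


# Hypothetical parameters, for illustration only.
search_url = build_api_url("search", {"q": "genus:Puma", "limit": 10})
download_url = build_api_url("download", {"q": "genus:Puma"})
```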

Version prod (default)

The main version of the module, in charge of supplying data to requests coming from other modules.

Version dev

Development version, useful for testing new methods without hampering the production workflow. Only usable for within-module development. Other modules (even in development versions) will not access this version, but rather the prod version.
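The rule above could be captured in a small helper that decides which api host a caller should use. This is a sketch that assumes App Engine's version.module.app-id.appspot.com addressing for non-default versions; api_host and its arguments are hypothetical names.

```python
APP_ID = "vertnet-portal"


def api_host(caller_module, caller_version):
    """Return the api host that a given caller should talk to."""
    # Only code running inside the api module's own dev version targets dev.
    if caller_module == "api" and caller_version == "dev":
        # Assumption: non-default versions are addressed by prefixing
        # the version name to the module host.
        return "dev.api.{}.appspot.com".format(APP_ID)
    # Every other caller, including development versions of other
    # modules, uses the default (prod) version.
    return "api.{}.appspot.com".format(APP_ID)
```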

Module indexer

A core part of the data workflow, this module will be in charge of indexing the DarwinCore text files into App Engine's Search API documents.

Even though it could be included in one of the tools-* modules, the preeminent importance of the indexer and the fact that it might need some special configuration make this tool worthy of having its own module.

Module portal-web

The current webapp module, once freed from the data management parts, will become a more UI-oriented module. From the perspective of the users, its current functionality will remain untouched. But under the hood, it will retrieve the data via calls to the api module rather than extracting the records directly from the datastore.
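To illustrate the change, a portal-side helper might look roughly like the sketch below: it never touches the datastore and only calls the api module over HTTP. The fetch argument, the q parameter and the JSON response shape are all assumptions made for the example.

```python
import json
from urllib.parse import urlencode

API_SEARCH_URL = "http://api.vertnet-portal.appspot.com/search"


def search_records(params, fetch):
    """Ask the api module for records instead of querying the datastore.

    `fetch` is any callable that takes a URL and returns the response
    body as a string, which keeps the helper easy to test.
    """
    url = API_SEARCH_URL + "?" + urlencode(params)
    return json.loads(fetch(url))


def fake_fetch(url):
    # Stand-in for a real HTTP call; the response shape is hypothetical.
    return json.dumps({"url": url, "records": []})


result = search_records({"q": "puma"}, fake_fetch)
```

In a deployed module, fake_fetch would be replaced by a real HTTP client call.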

Version vertnet (default)

The default version of the web portal, pointing to data from the vertnet network.

Version tuco / dev / testing...

Development versions to try new features and/or build new project-based portals.

Version amazonia

A different version of the vertnet portal for the Dimensions of Amazonia project, pointing to data from the amazonia network. The UI layout can be different, but core functionality will remain the same.

Version amazonia-dev

Nothing prevents each project-specific version from having its own development sub-version.

This same schema (amazonia + amazonia-dev) can be applied to any number of other projects.
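One way to picture the relationship between portal versions and networks is a simple lookup table. The sketch below is hypothetical; in particular, falling back to the vertnet network for unlisted versions is an assumption.

```python
# Portal version -> network whose data the version should show.
PORTAL_VERSION_NETWORK = {
    "vertnet": "vertnet",
    "amazonia": "amazonia",
    "amazonia-dev": "amazonia",  # dev sub-version shares its project's data
}


def network_for_version(version, default="vertnet"):
    """Return the network a portal version points to."""
    # Assumption: versions not listed (tuco, dev, testing...) use the default.
    return PORTAL_VERSION_NETWORK.get(version, default)
```

Adding a new project would then amount to adding a pair of entries (project and project-dev) that both map to the project's network.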

Module tools-*

Every other tool will have its associated module. Some tools will be open to the public, some will be restricted to internal usage.

Currently developed modules include:

  • API usage tracker (apitracker), private
  • Usage stat report generator (usagestats), private (the generator) and public (the viewer)
  • Batch emailer (emailer), private
  • Geospatial Quality API (api-geospatial), public

But there is potential to build many more, such as:

  • Migrators
  • Traits service
  • Gazetteer/locality service
  • Deduplication service
  • ...

There is a limitation here imposed by Google App Engine: the maximum number of modules for the whole application is 20. This layout implies "using" 7 module "slots" (3 for the core modules and one for each currently deployed tool). Very few modules are implemented at the moment, but this limit might become an issue in the future.

As a potential solution, we could merge all tools into a single tools module. The drawback is a total lack of isolation: each module has to be deployed as a whole, and shares instance resources among all its components. This hampers parallel development and can cause many code consistency issues unless committers exercise strong responsibility and foresight.
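The slot arithmetic can be written out as follows. The tool module names are the short names given in the list above; their final module names are an assumption.

```python
MAX_MODULES = 20  # App Engine limit cited above

core_modules = ["api", "indexer", "portal-web"]
tool_modules = ["apitracker", "usagestats", "emailer", "api-geospatial"]

used_slots = len(core_modules) + len(tool_modules)  # 3 + 4 = 7
remaining_slots = MAX_MODULES - used_slots  # 20 - 7 = 13
```

That leaves 13 slots for the potential tools listed earlier before the limit is reached.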
