Skip to content

Species suggestion Backend#467

Merged
BryonLewis merged 19 commits intomainfrom
species-suggestion-backend
Apr 8, 2026
Merged

Species suggestion Backend#467
BryonLewis merged 19 commits intomainfrom
species-suggestion-backend

Conversation

@BryonLewis
Copy link
Copy Markdown
Collaborator

@BryonLewis BryonLewis commented Apr 2, 2026

  • Large code change number is because of the 4MB geojson species boundaries

  • In prep for a future change I've modified grts_cell to be sample_frame_id for the GRTS cell Ids. This is because the sample_frame_id is an additional number that is used for distinguishing between CONUS (14), Canada/Alaska, Mexico, Offshore

  • Updates most references from grts_cell to sample_frame_id and also creations points to allow it to be populated based on either Guano for filename metadata

  • Adds SpeciesRanges species.geojson that contains GeoJSON multiple polygons and a species name

  • Created a SpeciesRange Model that stores the geometry data with a reference to the Species in the DB

  • Migrations are made for adding this new table as well as changing the name of grts_cell to sample_frame_id

  • Add a load_species_geojson Django Management command to load the geojson into the database

  • Species Return value has a field called in_range that determines if the current RecordingId, or current GRTS_Cell_Id and Sample_frame_id are inside the species range.

  • The Species endpoint now has query parameters:

    • Recording Id: Use the grts_cell_id and sample_frame_id (defaulting to 14 for Continental US if None) for the recording Id and returning a list of Species back
    • GRTS_Cell_Id/Sample_Frame_Id: Instead of providing a recording, you can provide the GRTS Cell ID and the Sample Frame ID with the sample frame defaulting to 14.

This will be followed up with the UI section in #468

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages bot commented Apr 3, 2026

Deploying bats-ai with  Cloudflare Pages  Cloudflare Pages

Latest commit: 37e1802
Status: ✅  Deploy successful!
Preview URL: https://d012e543.bats-ai.pages.dev
Branch Preview URL: https://species-suggestion-backend.bats-ai.pages.dev

View logs

@BryonLewis BryonLewis marked this pull request as ready for review April 6, 2026 12:00
@BryonLewis BryonLewis requested a review from naglepuff April 6, 2026 12:00
def get_species(
request: HttpRequest,
grts_cell_id: int | None = Query(None),
sample_frame_id: int = Query(14),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a constant for 14 to encode some sense of meaning for future developers? Maybe CONUS_SAMPLE_FRAME_ID = 14 somewhere?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally, why isn't this just set as a model default, so it's always defined in the database?

Is there a special meaning to when it's NULL (and defaults in code to 14) vs when it's explicitly set to 14 in the database?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was something that wasn't fully understood until recently. There is a 10x10km grid laid across the areas on the map that USGS has created for the NABat project that provides these grts_cell_ids. What wasn't known is that there are different GRTS_cell_ids for different countries/regions, Continental US is 14, Canada and Alaska are a different number and Mexico is another number.
The GRTS Cell Ids when imported should have this smaple_frame_id in the shape file and should be set.

Now when we import Bat file recordings they have metadata attached to them in multiple ways. One is an internal metadata format called guano metadata which are metada fields attached to the WAV file itself and can be read using the python guano package. The other way is the file naming scheme itself can contain things like the date/time/ grts_cell_id and possible sample_frame_id. The sample_frame_id isn't guaranteed to be populalated or found using guano or the file naming scheme so it can be unknown. In those instances I'm assuming it's the continental U.S. which is something that wasn't done before.
So we either can:

  • update all recordings to swap to a sample_frame_id of 14 if it is none
  • When looking for a grts_cell_id and a sample_frame_id isn't provided whe can interpret it as being 14

There may be times when I want to use this endpoint without a recording but with the GRTS cell Id and optional sample_frame_id and return species suggestions.

It's kind of a complicated decision based on what data USGS is providing us, trying to make reasonable defaults and determining what may be given in the future.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swapped to a label constant in 14c969e

This may change in the future if we get some more information about the GRTS_Cell_Id and the sample_frame_id and better ways to determine from the wav file what the sample_frame_id is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation!

Ok, it makes more sense now that we want to preserve the recording metadata (or lack thereof) as-is, especially since it might be possible to better fill gaps in the future.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you have used django-click at all but if you prefer writing CLIs with click, we already include that as a dependency. I wouldn't say you have to rewrite this command, but keep in mind for the future.

DEFAULT_GEOJSON = Path(__file__).resolve().parents[2] / "data" / "species.geojson"


class Command(BaseCommand):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make this a data migration, so it's loaded automatically when the database is bootstrapped?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on including it because I didn't know if I wanted to make it a requirement for the application. I think for ease of deployment it may make sense to have it and keep the management command as a way to update if I happent o get new species.geojson with updated ranges/species.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the path locations and added a python command to migration to load it in 7dfd479

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the data migration is a great ergonomic improvement.

Is it reasonable to declare that if species.geojson is updated in the future, we'll just be required to also add a new data migration to keep the database updated accordingly? That way, the state of these rows in the database should always match the state of the species.geojson file in the repo.

If you want to do that approach, we should be able to delete this management command entirely now. If / when a new species.geojson data migration needs to be written, it might be nice to share some of the code, so it's not copy-pasted every time, but I think it's reasonable to defer that until we actually get a second data migration.

If you want to keep this manage.py command, I think it'd be much more maintainable to find a way to deduplicate the species.geojson deserialization / parsing logic. Maybe it'd be good to have it in a common utility, then have both the manage.py command and the data migration import ant use it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what the future holds but we could have different deployments that have different species suggestions. For now it made sense for it to be in the migration fo the USGS vetting process that is ocurring, but we may have instances in which we decide to utilize a species suggestion list that is trained (from annotations) instead of curated.

I did consider the de-duplication or abstracting out code for specifically species ingestion. I internally decided against that (and this may be thinking too much ahead) but in the future if the species suggestion structures change, it would be easier for me to modify and add additional optional fields to the management command instead of relying on an abstracted out bit of code that would either have to be copied or maintained to be historically accurate with the migration. I can also understand the arguments of YAGNI for it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. One the one hand, with the removal of the error guards, these two code blocks are looking quite similar. On the other hand, I totally see your point that future ingest updates wouldn't necessarily want to modify the historic migration code.

When that future comes, we'll need to figure out how to handle the species-ranges.geojson file changes. Do we edit the file in-place and add a new migration that attempts to reconcile any differences from the old one? Do we add a second file and a second migration to just re-do all the ingest work? Regardless, it's something we can worry about later.

BryonLewis and others added 2 commits April 8, 2026 08:55
Co-authored-by: Brian Helba <brian.helba@kitware.com>
Co-authored-by: Brian Helba <brian.helba@kitware.com>
Copy link
Copy Markdown
Member

@brianhelba brianhelba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline feedback.

@BryonLewis BryonLewis force-pushed the species-suggestion-backend branch from 202aa00 to 1c6cae4 Compare April 8, 2026 14:24
geom__intersects=cell_geom,
)
qs = Species.objects.annotate(
pk=F("id"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this particular line necessary?

Copy link
Copy Markdown
Collaborator Author

@BryonLewis BryonLewis Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to remove the unnecessary pk: a95324b

Also noticed that I should set in_range default to False if there exists a cell_geom so the species list is clear that
in_range: null - means that it couldn't figure it out because it didn't have a valid grts_cell_id geom
in_range: true - intersecting cell geom from the grts_cell_id with the species range
in_range: false - has a valid cell geom but doesn't intersect with the species range

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since in_range is something we're making up just for this output, it's worth adding at least a code comment to the SpeciesSchema.in_range definition to explain these semantics. That's very helpful when first trying to interpret the code.


fid = props.get("id")
source_feature_id = str(fid) if fid is not None else ""
SpeciesRange.objects.update_or_create(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it normal for a single file to contain duplicate SpeciesRange? If that's not expected, I think this should be a .create and appropriate error if a duplicate is detected.

I'd expect this operation to always run on an empty SpeciesRange table, since it was only created as part of this migration.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know for sure if this could in the future have duplicate feature IDs and species types. This was provided from USGS and didn't have a specification for what exactly the data would be. I realize this would have the latest in the geojson feature order so maybe I should use create and fail if there happens to be duplicates.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you. If you decide to tolerate duplicates and use some policy like "always prefer the first / last entry", then it'd be worth a code comment to note the intent.

)
)
for msg in errors:
self.stdout.write(self.style.ERROR(msg))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just write the errors as they occur?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly habit from other ingestion/management scripts where I may have thousands of things being processed and logged and I want a concise list of the failure items (if they are non-raising/stopping) at the end to see what is missed. If they were stopping errors I would log them immediately.

geom = GEOSGeometry(json.dumps(geom_json))
except Exception as e:
errors.append(f"Feature {i} ({code}): invalid geometry: {e}")
continue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that in the migration, this operation doesn't fail through to continue. I think that's totally reasonable, given that invalid geometry is a huge red flag and perhaps should be fatal (rather than proceeding with a suspect file).

Do we want to be more lenient in this script for some reason?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below I'm removing it so it will properly fail. I think initiall during testing I was being cautious and just wanted to visualized data incase there were problems with the geojson I got so I was over cautious in getting data into the DB to see the structure of it.


try:
geom = GEOSGeometry(json.dumps(geom_json))
except Exception as e:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this more specific than Exception? We wouldn't want to accidentally catch something like a file read permission error.

Maybe you just want json.JSONDecodeError? I'm not sure if GEOSGeometry raises any exceptions that you'd want to silently let pass?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it and let the ingestion fail if there happens to be any problems.

@brianhelba
Copy link
Copy Markdown
Member

@BryonLewis Ok, this is looking really good to me.

I think there's just 2 places that warrant code comments, and we're set?

@BryonLewis
Copy link
Copy Markdown
Collaborator Author

I've added in comments about the create_or_update, and the in_range and also added a section to the README.md to describe a bit what is happening with species-ranges.geojson and how it can be updated in the future.

Thanks for taking a look at this. I'd like to get it merged sometime today so I can update the front-end PR and possibly give some USGS members a link to the suggested species interface preview for feedback.

@BryonLewis BryonLewis requested a review from brianhelba April 8, 2026 17:59
Copy link
Copy Markdown
Member

@brianhelba brianhelba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@BryonLewis BryonLewis merged commit 19f969b into main Apr 8, 2026
4 checks passed
@BryonLewis BryonLewis deleted the species-suggestion-backend branch April 8, 2026 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants