Deploying bats-ai with

- Latest commit: 37e1802
- Status: ✅ Deploy successful!
- Preview URL: https://d012e543.bats-ai.pages.dev
- Branch Preview URL: https://species-suggestion-backend.bats-ai.pages.dev
bats_ai/core/views/species.py (Outdated)

    def get_species(
        request: HttpRequest,
        grts_cell_id: int | None = Query(None),
        sample_frame_id: int = Query(14),
Could we use a constant for 14 to encode some sense of meaning for future developers? Maybe CONUS_SAMPLE_FRAME_ID = 14 somewhere?
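For illustration, a minimal sketch of that suggestion (the constant's module location is an assumption, and plain Python defaults stand in for django-ninja's Query(...) so the sketch is self-contained):

```python
# Hypothetical constants module, e.g. bats_ai/core/constants.py.
# 14 is the NABat sample frame id covering the continental US (CONUS).
CONUS_SAMPLE_FRAME_ID = 14


# Simplified view signature using the named constant instead of a bare 14.
def get_species(grts_cell_id=None, sample_frame_id=CONUS_SAMPLE_FRAME_ID):
    return {"grts_cell_id": grts_cell_id, "sample_frame_id": sample_frame_id}
```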
More generally, why isn't this just set as a model default, so it's always defined in the database?
Is there a special meaning to when it's NULL (and defaults in code to 14) vs when it's explicitly set to 14 in the database?
This was something that wasn't fully understood until recently. For the NABat project, USGS laid a 10x10 km grid across the areas on the map, and that grid provides these grts_cell_ids. What wasn't known is that there are different sample_frame_ids for different countries/regions: the continental US is 14, Canada and Alaska are a different number, and Mexico is another number.
The GRTS cell IDs, when imported, should have this sample_frame_id in the shapefile, and it should be set.
Now, when we import bat file recordings, they have metadata attached to them in multiple ways. One is an embedded format called Guano metadata: metadata fields attached to the WAV file itself that can be read using the Python guano package. The other is the file naming scheme itself, which can contain things like the date/time, grts_cell_id, and possibly the sample_frame_id. The sample_frame_id isn't guaranteed to be populated or found via Guano or the file naming scheme, so it can be unknown. In those instances I'm assuming it's the continental U.S., which is something that wasn't done before.
So we either can:
- update all recordings to swap to a sample_frame_id of 14 if it is none
- When looking for a grts_cell_id and a sample_frame_id isn't provided, we can interpret it as being 14
There may be times when I want to use this endpoint without a recording but with the GRTS cell Id and optional sample_frame_id and return species suggestions.
It's kind of a complicated decision based on what data USGS is providing us, trying to make reasonable defaults and determining what may be given in the future.
Swapped to a label constant in 14c969e
This may change in the future if we get some more information about the GRTS_Cell_Id and the sample_frame_id and better ways to determine from the wav file what the sample_frame_id is.
Thanks for the detailed explanation!
Ok, it makes more sense now that we want to preserve the recording metadata (or lack thereof) as-is, especially since it might be possible to better fill gaps in the future.
Not sure if you have used django-click at all, but if you prefer writing CLIs with click, we already include it as a dependency. I wouldn't say you have to rewrite this command, but keep it in mind for the future.
    DEFAULT_GEOJSON = Path(__file__).resolve().parents[2] / "data" / "species.geojson"


    class Command(BaseCommand):
Would it make sense to make this a data migration, so it's loaded automatically when the database is bootstrapped?
I went back and forth on including it because I didn't know if I wanted to make it a requirement for the application. I think for ease of deployment it may make sense to have it, and to keep the management command as a way to update if I happen to get a new species.geojson with updated ranges/species.
Updated the path locations and added a Python step to the migration to load it in 7dfd479
Yeah, I think the data migration is a great ergonomic improvement.
Is it reasonable to declare that if species.geojson is updated in the future, we'll just be required to also add a new data migration to keep the database updated accordingly? That way, the state of these rows in the database should always match the state of the species.geojson file in the repo.
If you want to do that approach, we should be able to delete this management command entirely now. If / when a new species.geojson data migration needs to be written, it might be nice to share some of the code, so it's not copy-pasted every time, but I think it's reasonable to defer that until we actually get a second data migration.
If you want to keep this manage.py command, I think it'd be much more maintainable to find a way to deduplicate the species.geojson deserialization / parsing logic. Maybe it'd be good to have it in a common utility, then have both the manage.py command and the data migration import and use it.
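A rough sketch of the shared-utility shape being suggested (the module path, function name, and the species_code property are assumptions about the geojson structure):

```python
# Hypothetical shared parser, e.g. bats_ai/core/species_ingest.py, that both
# the manage.py command and the data migration could import and use.
import json


def iter_species_features(geojson_text):
    """Yield (species_code, properties, geometry_json) for each feature."""
    data = json.loads(geojson_text)
    for feature in data.get("features", []):
        props = feature.get("properties", {})
        yield props.get("species_code"), props, feature.get("geometry")
```

The data migration's RunPython function and the management command's handle() would each loop over this generator and do their own database writes.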
I don't know what the future holds, but we could have different deployments that have different species suggestions. For now it made sense for it to be in the migration for the USGS vetting process that is occurring, but we may have instances in which we decide to utilize a species suggestion list that is trained (from annotations) instead of curated.
I did consider de-duplicating or abstracting out the species ingestion code. I internally decided against that (and this may be thinking too far ahead), but if the species suggestion structures change in the future, it would be easier for me to modify and add additional optional fields to the management command than to rely on an abstracted-out bit of code that would either have to be copied or maintained to stay historically accurate with the migration. I can also understand the YAGNI arguments for it.
Makes sense. On the one hand, with the removal of the error guards, these two code blocks are looking quite similar. On the other hand, I totally see your point that future ingest updates wouldn't necessarily want to modify the historic migration code.
When that future comes, we'll need to figure out how to handle the species-ranges.geojson file changes. Do we edit the file in-place and add a new migration that attempts to reconcile any differences from the old one? Do we add a second file and a second migration to just re-do all the ingest work? Regardless, it's something we can worry about later.
Co-authored-by: Brian Helba <brian.helba@kitware.com>
Force-pushed from 202aa00 to 1c6cae4
bats_ai/core/views/species.py (Outdated)

        geom__intersects=cell_geom,
    )
    qs = Species.objects.annotate(
        pk=F("id"),
Is this particular line necessary?
Updated to remove the unnecessary pk: a95324b
Also noticed that I should set the in_range default to False if a cell_geom exists, so the species list is clear that:

- in_range: null - couldn't be determined because there was no valid grts_cell_id geom
- in_range: true - the cell geom from the grts_cell_id intersects the species range
- in_range: false - has a valid cell geom but doesn't intersect the species range
Since in_range is something we're making up just for this output, it's worth adding at least a code comment to the SpeciesSchema.in_range definition to explain these semantics. That's very helpful when first trying to interpret the code.
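Something like the following comment is what's being suggested (the class here is a stand-in; the real SpeciesSchema is a django-ninja Schema):

```python
from typing import Optional


class SpeciesSchema:  # stand-in for the real django-ninja Schema
    # in_range is computed only for this endpoint's output; tri-state:
    #   None  - unknown: no valid grts_cell_id geometry was available
    #   True  - the GRTS cell geometry intersects this species' range
    #   False - a valid cell geometry exists but does not intersect the range
    in_range: Optional[bool] = None
```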
bats_ai/core/migrations/0036_rename_grts_cell_recording_sample_frame_id_and_more.py
Outdated
    fid = props.get("id")
    source_feature_id = str(fid) if fid is not None else ""
    SpeciesRange.objects.update_or_create(
Is it normal for a single file to contain duplicate SpeciesRange? If that's not expected, I think this should be a .create and appropriate error if a duplicate is detected.
I'd expect this operation to always run on an empty SpeciesRange table, since it was only created as part of this migration.
I don't know for sure whether this could have duplicate feature IDs and species types in the future. This was provided by USGS and didn't come with a specification for what exactly the data would be. I realize update_or_create would keep the latest entry in the geojson feature order, so maybe I should use create and fail if there happen to be duplicates.
Up to you. If you decide to tolerate duplicates and use some policy like "always prefer the first / last entry", then it'd be worth a code comment to note the intent.
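A sketch of the fail-on-duplicate policy (the key fields and the ValueError are assumptions; the real loop would call SpeciesRange.objects.create instead of the commented line):

```python
def ingest_ranges(features):
    """Create one range per (source_feature_id, species); fail on repeats."""
    seen = set()
    for feat in features:
        key = (feat["source_feature_id"], feat["species"])
        if key in seen:
            # With update_or_create this would silently keep the last entry;
            # failing loudly surfaces unexpected duplicates in the USGS file.
            raise ValueError(f"duplicate SpeciesRange feature: {key}")
        seen.add(key)
        # SpeciesRange.objects.create(**feat)  # real write goes here
    return len(seen)
```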
        )
    )
    for msg in errors:
        self.stdout.write(self.style.ERROR(msg))
Why not just write the errors as they occur?
Mostly habit from other ingestion/management scripts where I may have thousands of things being processed and logged, and I want a concise list of the failed items (if they are non-raising/non-stopping) at the end to see what was missed. If they were stopping errors I would log them immediately.
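The pattern described here — accumulate non-fatal failures during the loop and summarize them at the end — in miniature (all names are illustrative):

```python
def process_batch(items, process):
    """Run process() over all items; return a summary of non-fatal errors."""
    errors = []
    for i, item in enumerate(items):
        try:
            process(item)
        except ValueError as e:  # non-stopping errors only; others propagate
            errors.append(f"Item {i}: {e}")
    return errors  # printed as one concise failure list at the end
```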
        geom = GEOSGeometry(json.dumps(geom_json))
    except Exception as e:
        errors.append(f"Feature {i} ({code}): invalid geometry: {e}")
        continue
I notice that in the migration, this operation doesn't fail through to continue. I think that's totally reasonable, given that invalid geometry is a huge red flag and perhaps should be fatal (rather than proceeding with a suspect file).
Do we want to be more lenient in this script for some reason?
Same as below, I'm removing it so it will properly fail. I think initially during testing I was being cautious: I just wanted to visualize the data in case there were problems with the geojson I got, so I was overly cautious about getting data into the DB to see its structure.
    try:
        geom = GEOSGeometry(json.dumps(geom_json))
    except Exception as e:
Can you make this more specific than Exception? We wouldn't want to accidentally catch something like a file read permission error.
Maybe you just want json.JSONDecodeError? I'm not sure if GEOSGeometry raises any exceptions that you'd want to silently let pass?
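For illustration, a narrower catch might look like this; json round-tripping stands in for GEOSGeometry so the sketch runs without GeoDjango, and in the real code you'd catch GEOSException from django.contrib.gis.geos instead:

```python
import json

# from django.contrib.gis.geos import GEOSGeometry, GEOSException


def parse_geometry(geom_json):
    """Serialize a geometry dict; raise ValueError on malformed input."""
    try:
        text = json.dumps(geom_json)
        json.loads(text)  # stand-in for GEOSGeometry(text)
    except (TypeError, ValueError) as e:
        # TypeError: non-serializable input; ValueError covers JSONDecodeError.
        # Anything else (e.g. OSError from file reads) propagates unchanged.
        raise ValueError(f"invalid geometry: {e}") from e
    return text
```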
I'll remove it and let the ingestion fail if there happens to be any problems.
Co-authored-by: Brian Helba <brian.helba@kitware.com>
@BryonLewis Ok, this is looking really good to me. I think there's just 2 places that warrant code comments, and we're set?

I've added comments about the create_or_update and the in_range, and also added a section to the README.md describing what is happening with species-ranges.geojson and how it can be updated in the future. Thanks for taking a look at this. I'd like to get it merged sometime today so I can update the front-end PR and possibly give some USGS members a link to the suggested species interface preview for feedback.
Large code change number is because of the 4 MB geojson species boundaries.

In prep for a future change I've modified grts_cell to sample_frame_id for the GRTS cell IDs. This is because the sample_frame_id is an additional number used to distinguish between CONUS (14), Canada/Alaska, Mexico, and Offshore.

- Updates most references from grts_cell to sample_frame_id, and also the creation points, to allow it to be populated from either Guano or filename metadata.
- Adds a species.geojson for SpeciesRanges that contains GeoJSON MultiPolygons and a species name.
- Created a SpeciesRange model that stores the geometry data with a reference to the Species in the DB.
- Migrations are made for adding this new table as well as renaming grts_cell to sample_frame_id.
- Adds a load_species_geojson Django management command to load the geojson into the database.
- The Species return value has a field called in_range that determines whether the current RecordingId, or the current GRTS_Cell_Id and sample_frame_id, are inside the species range.
- The Species endpoint now has query parameters:

This will be followed up with the UI section in #468.