feat: Improve NASA Earthdata credential providers#472
Conversation
|
@kylebarron, I'm making this a draft PR for now, as I have not added any tests. I have confirmed that it works via manual testing, but do you have any thoughts on how you might like to add automated tests since this requires NASA Earthdata credentials to test? Since this requires running from within AWS us-west-2, I'm not sure we can add any "positive" tests. Perhaps we can at least add some "expected error" tests, to make sure failures (due to not being in us-west-2) raise the expected errors? |
|
Oops! I see a couple of failed checks. I'll look at them tomorrow. |
|
|
||
| async def close(self) -> None: | ||
| """Close the underlying session. | ||
| """Close the underlying session, if it was not supplied by the user. |
There was a problem hiding this comment.
Are there any external references we can consider here? I'm inclined to make close always close the underlying session, and the user can decide not to call close if they want to keep using the underlying session. Seems like that would be more predictable.
There was a problem hiding this comment.
I don't think that's a good practice. If the user created the session, we cannot assume responsibility for closing it for them. They may have created it themselves precisely so they can control when to close it.
For example, they may be controlling resources by using a session across various operations in order to take better advantage of pooling. If we close out their session from under them, they may be very unhappy when they attempt to use it elsewhere as well, only to find that it has been closed.
What I think might be a bit better is at least changing the docstring to avoid specifically mentioning the session object, and instead say something along the lines of "release resources created by this credential provider."
To go a step further, I would be inclined to make this a context manager to allow (perhaps even require) use of a with block, but that might be more than you're willing to push onto the user.
There was a problem hiding this comment.
While in standard usage we would make this a context manager, it might be hard to use as a context manager when the stores will live for so long.
I don't know, and I'm open to suggestions. You can see the other credential providers don't have any tests, and I agree it's hard to test. In theory you could mock the auth calls and responses but that seems like a lot of work for not that much gain.
Looks like it's just the PR title and the constants missing from the docs. |
My initial thoughts on mocking also led me to believe it might be too much effort, but now that I'm thinking about it a bit more, I realize that we only need to mock obtaining S3 credentials, not actually fetching any data, because we would assume that once we have valid credentials in hand, an S3Store would be able to properly fetch data. So perhaps we can mock something up without too much effort, but given that it looks like @maxrjones is attempting to use this (currently unsuccessfully without the changes in this PR), perhaps we can leave unit tests for a separate PR, particularly if bugs are discovered. |
|
FYI one embarrassing mistake I made yesterday was tab completing the credential provider without paying much attention and then using the async version in a Jupyter cell which led to it hanging. Just something to consider with the API design since the async version will be preferentially chosen over the sync version and hanging makes it harder to recognize a user error. |
This is one reason I really dislike "colored" functions. However, I'm not sure how we can help users avoid this issue. What specifically do you have in mind with regard to, "something to consider with the API design"? |
I was thinking about having separate namespaces, but it's just an off-the-cuff thought rather than a suggestion
|
Preferentially chosen... by the IDE autocompletion?
As it is the namespaces are designed around "concept" and required imports. So if you import from |
My assumption is that most autocompletion implementations use alphabetical order rather than random selection but I could be wrong. |
|
@kylebarron, I think I've addressed all of your comments. Regarding some tests, I believe I can add a few without much effort by making use of vcrpy and pytest-recording. I've used them elsewhere with good success, including in the Let me know if you'd like me to add them to this PR or submit a separate PR. |
| """ | ||
|
|
||
| EARTHDATA_HOST_UAT = "uat.urs.earthdata.nasa.gov" | ||
| """Hostname of the NASA Earthdata Login User-Acceptance Testing (UAT) system.""" |
There was a problem hiding this comment.
Can you say when a user might want to use this instead?
Is this only for testing? If it's not intended for end-user use, we should remove it from here.
There was a problem hiding this comment.
This is not often used, but it is something that may still want to expose to users. Users of the earthaccess library have expressed the desire to hit "test" data rather than "production" data, and both hosts are coded into that library.
However, I'm not opposed to removing this. Users who want to hit the UAT system will know the host name and can simply provide it explicitly themselves, so I'd be okay with having only the "ops" host, in which case, I might be inclined to simply name it DEFAULT_EARTHDATA_HOST.
What do you prefer?
There was a problem hiding this comment.
Yeah I'm inclined to remove this so that there's less confusion for new users, while advanced users will know what to pass in themselves. And happy to rename the existing value to DEFAULT_EARTHDATA_HOST as you suggest.
|
|
||
| # Obtain an S3 credentials URL and an S3 data/download URL, typically | ||
| # via metadata returned from a NASA CMR collection or granule query. | ||
| credentials_url = "https://data.ornldaac.earthdata.nasa.gov/s3credentials" |
There was a problem hiding this comment.
It might be good to also have an initial example that doesn't pass a custom credentials url? It's not clear to me when a user would need to override this url, and so I could see a user getting confused by an example that passes this, even though the argument has a default.
There was a problem hiding this comment.
The user must always pass a credentials URL. There is no default. Perhaps you're confusing this with the default host value. The 2 are completely distinct.
The credentials URL is specific to the data provider (DAAC). It is part of the collection and/or granule metadata that is in response to a CMR query. That metadata contains both the credentials URL (HTTP, not S3) that the user must use for obtaining S3 credentials, as well as a separate S3 URL that the user can use to read the data by using the obtained credentials.
When using the credentials URL to obtain s3 credentials, the user must supply a username/password pair associated with a NASA Earthdata account at the Earthdata host (default is the "operational" host). Otherwise, they can supply a token, which they can generate on the very same Earthdata host, to use instead of their username/password pair.
Sequence:
- Query the CMR to obtain metadata for a granule
- Extract 2 URLs from the metadata:
- S3 credentials URL (this is an HTTPS URL used for obtaining S3 credentials, it is not itself an S3 URL)
- S3 URL of the granule data file (requiring S3 credentials for reading it)
- Use NASA Earthdata user token or user creds to obtain (temporary, typically 1-hr) S3 credentials via the credentials URL (the user token or user creds must be associated with the Earthdata account hosted at the Earthdata host, which defaults to the "operational" [production] host)
- Use the obtained (temporary) S3 credentials to read the granule data from the S3 URL pulled from the metadata
There was a problem hiding this comment.
Can we set a default? Is there a DAAC that's most commonly used? Why does earthaccess.login not require the user to pass a URL? https://earthaccess.readthedocs.io/en/latest/tutorials/file-access/
There was a problem hiding this comment.
What we're doing here is at a lower level that what earthaccess is doing. earthaccess does supply an explicit URL under the covers (essentially as I've outlined above). Supplying a default does not make sense.
If earthaccess were to start using obstore, the earthaccess user would still not supply a credentials URL, but earthaccess would supply it to the obstore credential provider.
See the following, where the credentials URL (endpoint) is extracted from the metadata returned by a CMR query:
- https://github.com/nsidc/earthaccess/blob/main/earthaccess/api.py#L329
- https://github.com/nsidc/earthaccess/blob/main/earthaccess/api.py#L447
We would pass that "endpoint" value along as the credentials URL to the obstore credential provider.
There was a problem hiding this comment.
The logic in earthaccess that is handing the "endpoint" (credentials URL) is equivalent to the logic that's in this obstore credential provider, but the code in the PR adds further functionality by supporting the use of a token as well as recognizing env vars.
There was a problem hiding this comment.
I see previously the code I wrote hardcoded
CREDENTIALS_API = "https://archive.podaac.earthdata.nasa.gov/s3credentials"Is there a known set of DAACs? Could we have these on an enum so the user doesn't need to know a full URL?
class EarthdataProviders(Enum):
PODAAC = "https://archive.podaac.earthdata.nasa.gov/s3credentials"
...?
There was a problem hiding this comment.
Again, I strongly recommend against this, but perhaps my explanation is still not clear.
The user doesn't need to know these URLs. The URLs come as part of the metadata returned by a CMR search.
If earthaccess were to make use of this, the earthaccess user would not have to change a thing. It would all be handled within the earthaccess library itself, unbeknownst to the earthaccess user.
At this point, if that still doesn't convince you, I'd ask that you trust that the sensible decision is to not provide any sort of hard-coded list because this information is obtained dynamically, and is subject to change.
We can always add this at a later point if it is deemed sensible to do so.
There was a problem hiding this comment.
I want this to be useful not only to users who are already using the earthaccess library (in that case, we could have this entire file live in the earthaccess project).
The user doesn't need to know these URLs. The URLs come as part of the metadata returned by a CMR search.
I suppose that means a user will need separate S3Store objects for different datasets? And they won't be able to construct an S3Store until doing a CMR search because they won't be able to construct the credential provider until seeing the CMR result...
There was a problem hiding this comment.
I want this to be useful not only to users who are already using the
earthaccesslibrary (in that case, we could have this entire file live in the earthaccess project).
Yes, I absolutely agree with you here!
The user doesn't need to know these URLs. The URLs come as part of the metadata returned by a CMR search.
I suppose that means a user will need separate
S3Storeobjects for different datasets? And they won't be able to construct anS3Storeuntil doing a CMR search because they won't be able to construct the credential provider until seeing the CMR result...
We need an S3Store per provider/DAAC (which typically equates to a bucket -- I'm not aware of a DAAC using multiple buckets for data).
For the most part, each DAAC will host multiple datasets (collections, not to be confused with the STAC term), and will use the same credentials URL across all of their hosted datasets. However, I don't think it's safe to assume this. A DAAC could very well partition their datasets such that a user who may have access to some datasets, may not necessarily have access to other datasets, thus their may be the possibility of multiple credentials URLs per DAAC. Offhand, I don't now if any DAAC currently does this, so this could be an extreme edge case.
Also, most commonly, a user makes a CMR query for data from a single collection, so all results would use the same credentials URL (most likely). However, it would not be strange to issue a CMR query that spans multiple datasets, or even multiple DAACs, thus requiring multiple credentials URLs (multiple credential providers). I can even imagine doing this across GEDI data, where I might want to perform a join of 2 GEDI datasets, but 1 is hosted by LP DAAC and the other is hosted by ORNL DAAC.
Generally speaking, yes, they won't be able construct a store until they do a CMR search and obtain the metadata, but that is the general workflow with NASA Earthdata datasets, regardless of the library being used (unless they're using a library specialized for a particular DAAC's holdings, such as ASF, which hides all of these details).
You're likely going to be hard-pressed to find anybody holding onto a list of bare S3 URLs for data files, which would then also require them to maintain a list of associated credentials URLs (or at least 1 credentials URL, if all S3 URLs are for the same DAAC/bucket).
The common use case for accessing NASA Earthdata data is to perform a CMR query and extract the S3 data URL along with the related credentials URL from the response, for each item in the result. I've never seen an example usage that doesn't do this. Of course, that doesn't mean they don't exist, but they would be rather odd (unless somebody has simply hardcoded an S3 URL and its related credentials URL simply to use as an example that wouldn't require a CMR query, simply for expediency for the example, exactly what I've done in the examples in the docstrings).
However, I have some thoughts about things to explore for potentially making things easier for the user, whether or not they're using earthaccess, so I'm going to open a new discussion. There's definitely much room for improvement.
|
I made some small edits; take a look.
Maybe put those in a new PR? Might be easier to follow just the test and CI changes then. |
|
cc @betolink, @mfisher87 in case you have any comments. |
|
@kylebarron, thanks for the tweaks you made. They look good. I added a couple more very minor tweaks on top, just for a bit more naming and calling consistency in a couple of spots. Just let me know whether or not we should pull out the UAT host. I'm actually inclined to do so, but could go either way. |
|
Post merge comment, will there be a notebook showing how to use obstore with Earthdata credentials? @chuckwondo great work all! |
|
The tl;dr is just from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store
credential_provider = NasaEarthdataCredentialProvider(...) # refer to docstring
store = S3Store("bucket", credential_provider=credential_provider)I'd accept a PR with a notebook example. I assume we don't know the bucket in advance from the authentication process, right? @chuckwondo |
@kylebarron, sorry I missed your last question earlier, but correct, the bucket is not known in advance. It is pulled from CMR metadata. |
What I am changing
How I did it
How you can test it
Related Issues