feat: Improve NASA Earthdata credential providers by chuckwondo · Pull Request #472 · developmentseed/obstore

chuckwondo · 2025-05-27T23:32:37Z

What I am changing

Fixing and completing NASA Earthdata credential providers

How I did it

Fixed/enhanced NasaEarthdataCredentialProvider
Fixed/enhanced NasaEarthdataAsyncCredentialProvider

How you can test it

TBD

Related Issues

Fixes Earthdata credential provider improvements #392

chuckwondo · 2025-05-27T23:36:38Z

@kylebarron, I'm making this a draft PR for now, as I have not added any tests. I have confirmed that it works via manual testing, but do you have any thoughts on how you might like to add automated tests since this requires NASA Earthdata credentials to test? Since this requires running from within AWS us-west-2, I'm not sure we can add any "positive" tests. Perhaps we can at least add some "expected error" tests, to make sure failures (due to not being in us-west-2) raise the expected errors?

chuckwondo · 2025-05-27T23:37:10Z

Oops! I see a couple of failed checks. I'll look at them tomorrow.

kylebarron · 2025-05-28T02:40:18Z


    async def close(self) -> None:
-        """Close the underlying session.
+        """Close the underlying session, if it was not supplied by the user.


Are there any external references we can consider here? I'm inclined to make close always close the underlying session, and the user can decide not to call close if they want to keep using the underlying session. Seems like that would be more predictable.

I don't think that's a good practice. If the user created the session, we cannot assume responsibility for closing it for them. They may have created it themselves precisely so they can control when to close it.

For example, they may be controlling resources by using a session across various operations in order to take better advantage of pooling. If we close out their session from under them, they may be very unhappy when they attempt to use it elsewhere as well, only to find that it has been closed.

What I think might be a bit better is at least changing the docstring to avoid specifically mentioning the session object, and instead say something along the lines of "release resources created by this credential provider."

To go a step further, I would be inclined to make this a context manager to allow (perhaps even require) use of a with block, but that might be more than you're willing to push onto the user.

While in standard usage we would make this a context manager, it might be hard to use as a context manager when the stores will live for so long.

kylebarron · 2025-05-28T02:55:14Z

I have confirmed that it works via manual testing, but do you have any thoughts on how you might like to add automated tests since this requires NASA Earthdata credentials to test?

I don't know, and I'm open to suggestions. You can see the other credential providers don't have any tests, and I agree it's hard to test. In theory you could mock the auth calls and responses but that seems like a lot of work for not that much gain.

Oops! I see a couple of failed checks. I'll look at them tomorrow.

Looks like it's just the PR title and the constants missing from the docs.

chuckwondo · 2025-05-28T10:36:08Z

I have confirmed that it works via manual testing, but do you have any thoughts on how you might like to add automated tests since this requires NASA Earthdata credentials to test?

I don't know, and I'm open to suggestions. You can see the other credential providers don't have any tests, and I agree it's hard to test. In theory you could mock the auth calls and responses but that seems like a lot of work for not that much gain.

My initial thoughts on mocking also led me to believe it might be too much effort, but now that I'm thinking about it a bit more, I realize that we only need to mock obtaining S3 credentials, not actually fetching any data, because we would assume that once we have valid credentials in hand, an S3Store would be able to properly fetch data.

So perhaps we can mock something up without too much effort, but given that it looks like @maxrjones is attempting to use this (currently unsuccessfully without the changes in this PR), perhaps we can leave unit tests for a separate PR, particularly if bugs are discovered.

maxrjones · 2025-05-28T12:55:02Z

FYI one embarrassing mistake I made yesterday was tab completing the credential provider without paying much attention and then using the async version in a Jupyter cell which led to it hanging. Just something to consider with the API design since the async version will be preferentially chosen over the sync version and hanging makes it harder to recognize a user error.

chuckwondo · 2025-05-28T13:30:43Z

FYI one embarrassing mistake I made yesterday was tab completing the credential provider without paying much attention and then using the async version in a Jupyter cell which led to it hanging. Just something to consider with the API design since the async version will be preferentially chosen over the sync version and hanging makes it harder to recognize a user error.

This is one reason I really dislike "colored" functions. However, I'm not sure how we can help users avoid this issue. What specifically do you have in mind with regard to, "something to consider with the API design"?

maxrjones · 2025-05-28T13:36:06Z

What specifically do you have in mind with regard to, "something to consider with the API design"?

I was thinking about having separate namespaces, but it's just an off-the-cuff thought rather than a suggestion

from obstore.auth.async import NasaEarthdataAsyncCredentialProvider

kylebarron · 2025-05-28T13:55:50Z

preferentially chosen over the sync version

Preferentially chosen... by the IDE autocompletion?

I was thinking about having separate namespaces, but it's just an off-the-cuff thought rather than a suggestion

from obstore.auth.async import NasaEarthdataAsyncCredentialProvider

As it is the namespaces are designed around "concept" and required imports. So if you import from obstore.auth.boto3 you need boto3 installed.

maxrjones · 2025-05-28T14:17:33Z

preferentially chosen over the sync version

Preferentially chosen... by the IDE autocompletion?

My assumption is that most autocompletion implementations use alphabetical order rather than random selection but I could be wrong.

chuckwondo · 2025-05-28T15:58:49Z

@kylebarron, I think I've addressed all of your comments.

Regarding some tests, I believe I can add a few without much effort by making use of vcrpy and pytest-recording. I've used them elsewhere with good success, including in the earthaccess library.

Let me know if you'd like me to add them to this PR or submit a separate PR.

… response

kylebarron · 2025-05-28T17:23:20Z

+"""
+
+EARTHDATA_HOST_UAT = "uat.urs.earthdata.nasa.gov"
+"""Hostname of the NASA Earthdata Login User-Acceptance Testing (UAT) system."""


Can you say when a user might want to use this instead?

Is this only for testing? If it's not intended for end-user use, we should remove it from here.

This is not often used, but it is something that may still want to expose to users. Users of the earthaccess library have expressed the desire to hit "test" data rather than "production" data, and both hosts are coded into that library.

However, I'm not opposed to removing this. Users who want to hit the UAT system will know the host name and can simply provide it explicitly themselves, so I'd be okay with having only the "ops" host, in which case, I might be inclined to simply name it DEFAULT_EARTHDATA_HOST.

What do you prefer?

Yeah I'm inclined to remove this so that there's less confusion for new users, while advanced users will know what to pass in themselves. And happy to rename the existing value to DEFAULT_EARTHDATA_HOST as you suggest.

kylebarron · 2025-05-28T17:46:21Z

+
+        # Obtain an S3 credentials URL and an S3 data/download URL, typically
+        # via metadata returned from a NASA CMR collection or granule query.
+        credentials_url = "https://data.ornldaac.earthdata.nasa.gov/s3credentials"


It might be good to also have an initial example that doesn't pass a custom credentials url? It's not clear to me when a user would need to override this url, and so I could see a user getting confused by an example that passes this, even though the argument has a default.

The user must always pass a credentials URL. There is no default. Perhaps you're confusing this with the default host value. The 2 are completely distinct.

The credentials URL is specific to the data provider (DAAC). It is part of the collection and/or granule metadata that is in response to a CMR query. That metadata contains both the credentials URL (HTTP, not S3) that the user must use for obtaining S3 credentials, as well as a separate S3 URL that the user can use to read the data by using the obtained credentials.

When using the credentials URL to obtain s3 credentials, the user must supply a username/password pair associated with a NASA Earthdata account at the Earthdata host (default is the "operational" host). Otherwise, they can supply a token, which they can generate on the very same Earthdata host, to use instead of their username/password pair.

Sequence:

Query the CMR to obtain metadata for a granule

Extract 2 URLs from the metadata:

S3 credentials URL (this is an HTTPS URL used for obtaining S3 credentials, it is not itself an S3 URL)

S3 URL of the granule data file (requiring S3 credentials for reading it)

Use NASA Earthdata user token or user creds to obtain (temporary, typically 1-hr) S3 credentials via the credentials URL (the user token or user creds must be associated with the Earthdata account hosted at the Earthdata host, which defaults to the "operational" [production] host)

Use the obtained (temporary) S3 credentials to read the granule data from the S3 URL pulled from the metadata

Can we set a default? Is there a DAAC that's most commonly used? Why does earthaccess.login not require the user to pass a URL? https://earthaccess.readthedocs.io/en/latest/tutorials/file-access/

What we're doing here is at a lower level that what earthaccess is doing. earthaccess does supply an explicit URL under the covers (essentially as I've outlined above). Supplying a default does not make sense.

If earthaccess were to start using obstore, the earthaccess user would still not supply a credentials URL, but earthaccess would supply it to the obstore credential provider.

See the following, where the credentials URL (endpoint) is extracted from the metadata returned by a CMR query:

https://github.com/nsidc/earthaccess/blob/main/earthaccess/api.py#L329

https://github.com/nsidc/earthaccess/blob/main/earthaccess/api.py#L447

We would pass that "endpoint" value along as the credentials URL to the obstore credential provider.

The logic in earthaccess that is handing the "endpoint" (credentials URL) is equivalent to the logic that's in this obstore credential provider, but the code in the PR adds further functionality by supporting the use of a token as well as recognizing env vars.

I see previously the code I wrote hardcoded

CREDENTIALS_API = "https://archive.podaac.earthdata.nasa.gov/s3credentials"

Is there a known set of DAACs? Could we have these on an enum so the user doesn't need to know a full URL?

class EarthdataProviders(Enum): PODAAC = "https://archive.podaac.earthdata.nasa.gov/s3credentials" ...

?

Again, I strongly recommend against this, but perhaps my explanation is still not clear.

The user doesn't need to know these URLs. The URLs come as part of the metadata returned by a CMR search.

If earthaccess were to make use of this, the earthaccess user would not have to change a thing. It would all be handled within the earthaccess library itself, unbeknownst to the earthaccess user.

At this point, if that still doesn't convince you, I'd ask that you trust that the sensible decision is to not provide any sort of hard-coded list because this information is obtained dynamically, and is subject to change.

We can always add this at a later point if it is deemed sensible to do so.

I want this to be useful not only to users who are already using the earthaccess library (in that case, we could have this entire file live in the earthaccess project).

The user doesn't need to know these URLs. The URLs come as part of the metadata returned by a CMR search.

I suppose that means a user will need separate S3Store objects for different datasets? And they won't be able to construct an S3Store until doing a CMR search because they won't be able to construct the credential provider until seeing the CMR result...

I want this to be useful not only to users who are already using the earthaccess library (in that case, we could have this entire file live in the earthaccess project).

Yes, I absolutely agree with you here!

The user doesn't need to know these URLs. The URLs come as part of the metadata returned by a CMR search.

I suppose that means a user will need separate S3Store objects for different datasets? And they won't be able to construct an S3Store until doing a CMR search because they won't be able to construct the credential provider until seeing the CMR result...

We need an S3Store per provider/DAAC (which typically equates to a bucket -- I'm not aware of a DAAC using multiple buckets for data).

For the most part, each DAAC will host multiple datasets (collections, not to be confused with the STAC term), and will use the same credentials URL across all of their hosted datasets. However, I don't think it's safe to assume this. A DAAC could very well partition their datasets such that a user who may have access to some datasets, may not necessarily have access to other datasets, thus their may be the possibility of multiple credentials URLs per DAAC. Offhand, I don't now if any DAAC currently does this, so this could be an extreme edge case.

Also, most commonly, a user makes a CMR query for data from a single collection, so all results would use the same credentials URL (most likely). However, it would not be strange to issue a CMR query that spans multiple datasets, or even multiple DAACs, thus requiring multiple credentials URLs (multiple credential providers). I can even imagine doing this across GEDI data, where I might want to perform a join of 2 GEDI datasets, but 1 is hosted by LP DAAC and the other is hosted by ORNL DAAC.

Generally speaking, yes, they won't be able construct a store until they do a CMR search and obtain the metadata, but that is the general workflow with NASA Earthdata datasets, regardless of the library being used (unless they're using a library specialized for a particular DAAC's holdings, such as ASF, which hides all of these details).

You're likely going to be hard-pressed to find anybody holding onto a list of bare S3 URLs for data files, which would then also require them to maintain a list of associated credentials URLs (or at least 1 credentials URL, if all S3 URLs are for the same DAAC/bucket).

The common use case for accessing NASA Earthdata data is to perform a CMR query and extract the S3 data URL along with the related credentials URL from the response, for each item in the result. I've never seen an example usage that doesn't do this. Of course, that doesn't mean they don't exist, but they would be rather odd (unless somebody has simply hardcoded an S3 URL and its related credentials URL simply to use as an example that wouldn't require a CMR query, simply for expediency for the example, exactly what I've done in the examples in the docstrings).

However, I have some thoughts about things to explore for potentially making things easier for the user, whether or not they're using earthaccess, so I'm going to open a new discussion. There's definitely much room for improvement.

kylebarron · 2025-05-28T17:49:21Z

I made some small edits; take a look.

Let me know if you'd like me to add them to this PR or submit a separate PR.

Maybe put those in a new PR? Might be easier to follow just the test and CI changes then.

kylebarron · 2025-05-28T18:29:37Z

cc @betolink, @mfisher87 in case you have any comments.

chuckwondo · 2025-05-28T19:00:53Z

@kylebarron, thanks for the tweaks you made. They look good. I added a couple more very minor tweaks on top, just for a bit more naming and calling consistency in a couple of spots.

Just let me know whether or not we should pull out the UAT host. I'm actually inclined to do so, but could go either way.

kylebarron

Thank you!

betolink · 2025-06-02T15:24:30Z

Post merge comment, will there be a notebook showing how to use obstore with Earthdata credentials? @chuckwondo great work all!

kylebarron · 2025-06-02T15:49:12Z

The tl;dr is just

from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store

credential_provider = NasaEarthdataCredentialProvider(...) # refer to docstring
store = S3Store("bucket", credential_provider=credential_provider)

I'd accept a PR with a notebook example.

I assume we don't know the bucket in advance from the authentication process, right? @chuckwondo

chuckwondo · 2025-10-10T16:04:40Z

The tl;dr is just
from obstore.auth.earthdata import NasaEarthdataCredentialProvider
from obstore.store import S3Store

credential_provider = NasaEarthdataCredentialProvider(...) # refer to docstring
store = S3Store("bucket", credential_provider=credential_provider)
I'd accept a PR with a notebook example.

I assume we don't know the bucket in advance from the authentication process, right? @chuckwondo

@kylebarron, sorry I missed your last question earlier, but correct, the bucket is not known in advance. It is pulled from CMR metadata.

chuckwondo and others added 3 commits May 27, 2025 14:19

Enhance NASA Earthdata credential provider

45a679c

Translate unexpected redirect as unauthorized

ce934e5

Properly handle non-JSON content as unauthorized

fe5993f

kylebarron reviewed May 28, 2025

View reviewed changes

chuckwondo changed the title ~~Chuckwondo/earthdata creds~~ Fix/complete NASA Earthdata credential providers May 28, 2025

chuckwondo mentioned this pull request May 28, 2025

Refactor codebase to support a new simplified Parser->ManifestStore model. zarr-developers/VirtualiZarr#601

Merged

7 tasks

chuckwondo changed the title ~~Fix/complete NASA Earthdata credential providers~~ fix: NASA Earthdata credential providers May 28, 2025

github-actions Bot added the fix label May 28, 2025

Refactor to eliminate closures in earthdata.py

c2f5af0

kylebarron added 4 commits May 28, 2025 13:40

Split raise unauthorized

6617052

Suggest user to use S3Store.from_url for better type hinting in the…

2164540

… response

Don't assign callable to class and better typing for token

76ec5cf

Cleanup

1b6cee6

kylebarron reviewed May 28, 2025

View reviewed changes

kylebarron mentioned this pull request May 28, 2025

Earthdata credential provider improvements #392

Closed

5 tasks

Add minor naming and calling consistency

36c62b7

chuckwondo marked this pull request as ready for review May 28, 2025 19:01

chuckwondo and others added 2 commits May 28, 2025 16:40

Remove NASA Earthdata UAT host

019982c

Merge branch 'main' into chuckwondo/earthdata-creds

3f4c765

kylebarron changed the title ~~fix: NASA Earthdata credential providers~~ feat: Improve NASA Earthdata credential providers May 28, 2025

github-actions Bot added feat and removed fix labels May 28, 2025

kylebarron approved these changes May 28, 2025

View reviewed changes

kylebarron merged commit 50782ed into main May 28, 2025
8 checks passed

kylebarron deleted the chuckwondo/earthdata-creds branch May 28, 2025 22:34

Uh oh!

Conversation

chuckwondo commented May 27, 2025

What I am changing

How I did it

How you can test it

Related Issues

Uh oh!

chuckwondo commented May 27, 2025

Uh oh!

chuckwondo commented May 27, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylebarron commented May 28, 2025

Uh oh!

chuckwondo commented May 28, 2025

Uh oh!

maxrjones commented May 28, 2025

Uh oh!

chuckwondo commented May 28, 2025

Uh oh!

maxrjones commented May 28, 2025

Uh oh!

kylebarron commented May 28, 2025

Uh oh!

maxrjones commented May 28, 2025

Uh oh!

chuckwondo commented May 28, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chuckwondo May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kylebarron commented May 28, 2025

Uh oh!

kylebarron commented May 28, 2025

Uh oh!

chuckwondo commented May 28, 2025

Uh oh!

kylebarron left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

betolink commented Jun 2, 2025

Uh oh!

kylebarron commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

chuckwondo May 28, 2025 •

edited

Loading

kylebarron commented Jun 2, 2025 •

edited

Loading