In 1626 digitized theses workflow create#240
Conversation
Why these changes are being introduced: * The digitized theses workflow requires the addition of environment variables to fetch items from DSpace and metadata from Alma. The workflow also relies on an S3 bucket used by the Imaging Lab as a workspace to create batches of theses. How this addresses that need: * Add DSPACE_CREDENTIALS env var * Add METADATA_API_URL env var * Add S3_BUCKET_DIGITIZED_THESES env var Side effects of this change: * The addition of the digitized theses workflow reveals a need to revisit how configurations (e.g., env vars) are managed for different workflows. While the current structure of the `Config` class has worked in terms of holding env vars for different deployment environments (dev, stage, prod), there is an additional layer of complexity for the DSC app in that the "required" configs may also be *workflow-specific*. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1626 * https://mitlibraries.atlassian.net/browse/IN-1709 * https://mitlibraries.atlassian.net/browse/IN-1745
Coverage Report for CI Build 26519017964Warning No base build found for commit Coverage: 69.174%Details
Uncovered Changes
Coverage RegressionsRequires a base build to compare against. How to fix this → Coverage Stats
💛 - Coveralls |
bc86c77 to
4ecfd35
Compare
| try: | ||
| credentials = CONFIG.dspace_credentials[self.submission_system] | ||
| except KeyError as exception: | ||
| raise exceptions.DSpaceClientCredentialsNotFoundError( | ||
| f"No credentials for {self.submission_system}" | ||
| ) from exception | ||
|
|
||
| client = DSpaceClient( | ||
| api_endpoint=credentials["url"], | ||
| username=credentials["user"], | ||
| password=credentials["password"], | ||
| fake_user_agent=True, | ||
| ) | ||
| authenticated = client.authenticate() | ||
| if not authenticated: | ||
| raise exceptions.DSpaceClientAuthenticationError( | ||
| credentials["url"], credentials["user"] | ||
| ) | ||
| self._dspace_client = client | ||
| logger.info( | ||
| f"Successfully authenticated to {credentials['url']} " | ||
| f"as {credentials['url']}" | ||
| ) |
There was a problem hiding this comment.
Pulled from DSS!
|
Before I jump into a review @jonavellecuerdo, had a question about your second screenshot. For PDF
Is that right? I suppose my naive expectation would be that bad/error/skipped records would get moved into a folder like "skipped". This might be a decent name, is it captures true skips, unhandled errors, etc. It's a bit minor, but it feels kind of messy to have the "bad" PDF + XMLs at the root of the batch folder. Open to pushback or clarification here, just a thought as I prepare to review tomorrow! |
ehanson8
left a comment
There was a problem hiding this comment.
Looking good! Some comments and questions
| def metadata_api_url(self) -> str: | ||
| value = os.getenv("METADATA_API_URL") | ||
| if not value: | ||
| raise OSError("Env var 'METADATA_API_URL' must be defined") | ||
| return value |
There was a problem hiding this comment.
So this would be used by both DT and Wiley for example?
There was a problem hiding this comment.
Yes. 🤔 I think this is also worth asking @cabutlermit about when we talk about how to update the DSPACE_CREDENTIALS environment variable (i.e., multiple credentials/info intended for a single environment variable).
There was a problem hiding this comment.
Your comment is making me realize we could do a similar JSON blob structure for other env vars that change for various workflows, but yes, consulting with him would be great. This addresses my concerns a few comments down
| @property | ||
| def s3_bucket_digitized_theses(self) -> str: | ||
| value = os.getenv("S3_BUCKET_DIGITIZED_THESES") | ||
| if not value: | ||
| raise OSError("Env var 'S3_BUCKET_DIGITIZED_THESES' must be defined") | ||
| return value |
There was a problem hiding this comment.
Similarly to above (if I'm understanding it right), could this also be more generic, e.g. working_directory?
There was a problem hiding this comment.
I think it could be! Requesting to leave it for now until we get into the WIley implementation. What this is revealing is a need to reevaluate how to dynamically set the appropriate configs based on the workflow METADATA_API_URL and S3_BUCKET_WORKING_DIRECTORY. I think the discussion for DSPACE_CREDENTIALS will help us figure out how we should be storing these configs in AWS in the first place and from there we can figure out how the app should access them.
Does that sound like a reasonable next step -- proceed as-is, include in discussion with @cabutlermit ? 🤔
|
|
||
| def check_required_env_vars(self) -> None: | ||
| """Method to raise exception if required env vars not set.""" | ||
| missing_vars = [var for var in self.REQUIRED_ENV_VARS if not os.getenv(var)] |
There was a problem hiding this comment.
I'm still a bit unclear how DSC would manage workflow-specific env vars, I get the need to hit different buckets for different AWS envs for working_directory but it is odd to have a workflow-specific env vars at the Config level. That said, if the plan is to create a ticket and address later, that's OK with me. Just want to avoid a proliferation of workflow-specific env var in the future
For metadata_api_url however, a workflow class attribute feels like a better fit in most, if not all, cases. Even with Alma, I don't think we would read from Alma sandbox even on Dev1, we would read from Alma prod.
There was a problem hiding this comment.
Created this ticket for now! This has been on my mind as well but have been focused on getting something up and running first, figuring we can tackle infra-related changes after discussions with @cabutlermit !
| CREATE_SUCCESS = "create_success" | ||
| CREATE_FAILED = "create_failed" | ||
| CREATE_SKIPPED = "create_skipped" | ||
| BATCH_CREATED = "batch_created" |
There was a problem hiding this comment.
BATCH_CREATED feels duplicative now, do we still need it? And if we are changing enum values for DynamoDB, we may need to run an update pass to ensure consistency
There was a problem hiding this comment.
I left ItemSubmissionStatus.BATCH_CREATED to prevent breaking workflows that rely on it, but I did create a ticket to align all workflows to use the new statuses: IN-1753. I don't believe adding a new status changes anything in terms of writing to DynamoDB. As long as it meets the type, it should write to DynamoDB just fine.
There was a problem hiding this comment.
Correct, DynamoDB will write whatever but we can talk about whether consistency in useful or not in the ticket
| ) | ||
|
|
||
|
|
||
| def test_workflow_dspace_client_raise_authenticationerror( |
There was a problem hiding this comment.
test_workflow_dspace_client_raise_authentication_error?
There was a problem hiding this comment.
I always wonder whether it'd be good practice to use the same name/formatting as the error. What do you think of:
test_workflow_dspace_client_raise_AuthenticationError? 🤔
There was a problem hiding this comment.
Ah, I see, not a big deal either way, do what you feel!
ghukill
left a comment
There was a problem hiding this comment.
I thought it better to submit my review now, with some requests for comments + docstrings, and some possible suggestions for the "minted" batch id language and logic.
If it were possible to drop the word "mint" entirely, except maybe once when some code "mints the batch id going forward for this batch", that feels like it could be a big win. This might remove some confusion when scanning code if we're:
- copying files from digi theses S3 workspace bucket to batch folder in DSC bucket
- moving files within the batch folder
Which all brings me to a question.... is there a universe where this workflow should be preparing the batch locally -- e.g. in a python temp directory -- and then uploading the final, perfectly formed batch with it's new "minted" batch id to the DSC bucket?
If you went that path, if something went horribly wrong along the way, there isn't cruft in the DSC bucket. Nothing got uploaded, because nothing was really ready enough. And instead of moving files around in S3, which is kind of tricky, you're literally just moving around files in a python temp directory. For this app, that'd be inside the ECS task Docker container, which is just fine.
Not a hard requirement / request to do this file downloading, renaming, and moving around locally then uploading to DSC, but maybe worth a consideration whilst considering the comments about "minted" batch ids and all fairly repetitive S3 path constructions throughout.
All that said, this is some complicated stuff, really good foundation here conceptually I think.
| errors: list[tuple] = [] # set but not used | ||
|
|
||
| # sync batch folder from DT S3 bucket to minted batch folder in DSC S3 bucket | ||
| run_aws_cli_sync( |
There was a problem hiding this comment.
Also touched on in an off-PR discussion, I think there is some conflict with the word "sync" here, that may or may not be worth addressing.
In the method signature synced: bool = False is referring to whether or not the batch folder was synced from stage --> prod, correct? and that we're not heavily using this right now?
Either way, it seems as though this has some tension with the use of "sync" in the utility function run_aws_cli_sync(...) which while technically "syncing" data in S3, it's more of a naive copy from point A to point B. My understanding is that it's not an "opinionated DSC sync" of a fully formed batch folder from stage --> prod.
I agree that "sync" is still accurate in that function name, given its literally running aws cli sync.... But we may just need some comments + docstrings to help readers know super clearly that this particular workflow kind of has a third bucket in play, the digitized theses bucket that Jenn (or others) upload PDFs to.
I'm going to avoid needling the whole "do things in stage, then do again in prod" logic which I can appreciate but don't love here. Assuming that remains and we like it, I still think there is room here to disambiguate these two things:
- This
DigitizedThesesworkflow copying data from a workspace bucket specific to only this workflow, to the stage bucket - Syncing a fully formed batch folder from stage to prod
| ) | ||
|
|
||
| s3_client = S3Client() | ||
| for file in s3_client.files_iter( |
There was a problem hiding this comment.
A lot of my comments are kind of around this file copying + syncing + batch folder creation, so this comment is actually kind of touching three things:
- comment about possible "setting" of
self.batch_id, which meansself.batch_idcould be used more - the use of
run_aws_cli_sync(...) - this call to
s3_client.files_iter(...)
Could there be sub-methods for some of this work? So this prepare_batch() might look more like?
self.batch_id = self.update_batch_id(batch_id)
self._copy_input_files_to_dsc_batch_folder(...)
input_files = self._get_input_files_iter(...)
item_submissions = []
errors: list[tuple] = [] # set but not used
for file in input_files:
...Just throwing it out there, there might be some opportunity for small, targeted methods here. So when encountering this prepare_batch() method, you can follow along at a high level.
The screenshot is meant to communicate thow ideally there wouldn't be as much explicit S3 bucket + path construction at this high level.
|
@ehanson8 @ghukill I just pushed up two commits--the former comprising of smaller, minor updates and the latter being more involved. In this description, I will be summarizing the changes in the second commit, which emerged from three key things/questions @ghukill pointed out:
Screenshots showing new structure of prepared batch folder
Please let me know if you have any questions! |
ghukill
left a comment
There was a problem hiding this comment.
Looking really good, nice work wrangling some complex mechanics.
I've opted for a "Request changes" 99% for the categorization of items based on their status_details. For reasons noted in the comment, I'd lobby pretty strong to avoid the "new" or "replacment" fuzzy string matching. Open to pushback and more discussion, but this feels like something to really kind of iron out.
Otherwise, basically an approval else-wise!
| This method will inspect the `status_details` for each item submission to | ||
| determine which theses subfolder (prefix) its content should be moved to. | ||
| At this point, `status_details` will contain "Replacement thesis", | ||
| "New thesis", or an exception message. If `status_details` contains an | ||
| exception message, contents are moved to a skipped-theses/ subfolder. |
There was a problem hiding this comment.
I'm not crazy about this kind of fuzzy string matching/checking for classifying the item as new vs. replacement vs. skip/error, mostly for the possibility than an exception message from this statement which shows up a couple of times,
item_submission.status_details = str(exception)may somehow include words like "new" or "replacment" and get incorectly categorized.
Would it work to look for more exact phrases like "Replacement thesis" or "New thesis" in the status details? maybe removing the .lower() as well? Maybe that could add some additional safety without really changing the approach?
Another thought, note that ItemSubmission class has these already on it:
"""
...
2. Processing attributes: Temporary data used during submission processing
but not persisted
...
"""
...
# processing attributes
source_metadata: dict[str, Any] | None = None
dspace_metadata: dict[str, Any] | None = None
bitstream_s3_uris: list[str] | None = None
metadata_s3_uri: str = ""You could update that class to have something like .processing_meta which is a freeform dictionary to note things on the ItemSubmission instance during processing. Then, use that during sorting and categorizing.
Another option...
Instead of appending ItemSubmission instances into item_submissions, you could store them in a dictionary like:
item_submissions = {
"new":[...],
"replacement":[...],
"skip_and_error":[...],
}If you pass that dictionary to self._move_batch_files_to_theses_subfolders() it of course could quite easily just slot them into the right folders.
Then, when ready to return the list you could do
return (
item_submissions["new"] +
item_submissions["replacement"] +
item_submissions["skip_and_error"]
)Or this, which AI suggested and I wouldn't have thought of!
from itertools import chain
return list(chain(
item_submissions["new"],
item_submissions["replacement"],
item_submissions["skip_and_error"],
))TL/DR: It really might be worth avoiding the fuzzy matching of "new" and "replacement" in the ItemSubmission.status_details as a sorting mechanism, given that we're storing raw exceptions; they could theoretically contain those words. A bit of effort to avoid now, might avoid a really sneaky bug in production later.
There was a problem hiding this comment.
I will update the conditions to check for exact matches of "New thesis" and "Replacement thesis". I will hold off on additional changes to the ItemSubmission class but will continue to keep them in mind.
Why these changes are being introduced: * The batch preparation methods for the digitized theses workflow must be capable of performing the following functions: 1. Sync the batch folder from the digitized theses workflow S3 bucket to a minted batch folder in the DSC S3 bucket 2. Download metadata from Alma via SRU 3. Get an item from DSpace 4. Determine if an item is a valid 'replacement thesis', given DSpace item metadata 5. Organize the contents of the minted batch folder into 'new-theses/' and 'replacement-theses/' subfolders How this addresses that need: * Implement methods on DigitizedThesesWorkflow for batch preparation functions * Expand ItemSubmissionStatus to `CREATE_SUCCESS`, `CREATE_FAILED`, `CREATE_SKIPPED` * Add exceptions to handle item retrieval from DSpace Side effects of this change: * Though we've expanded ItemSubmissionStatus to include `CREATE_*` statuses, `BATCH_CREATED` remains to support backwards compatibility with other workflows. Updating all workflows to use `CREATE_*` will be in the DSC backlog for future work. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1626
7200d91 to
aaa826e
Compare



Purpose and background context
This PR implement batch preparation methods for digitized theses workflow. I went back-and-forth on whether I should send a PR with the entire workflow, but as I saw the changes that were piling up, I decided it would be best to request review in reasonably sized chunks. This also allows us to focus / reflect on the key differences for this workflow for each step of the DSC workflow.
Here are some key points to highlight about the
create/prepare_batchmethods for the digitized theses workflow:minted_batch_idattribute, which is comprised of the original batch ID and the current date timestamp.DigitizedTheses.prepare_batchdoes not collect exceptions in theerrorslist like other workflows.ItemSubmission.status_detailsfield.errors.csv.BatchCreationFailedErroris raised. TBD when this error should be raised (if ever) for the workflow.createstep, the minted batch ID is what gets passed in the latter calls tosubmitandfinalize.DigitizedTheses.batch_pathas is and explicitly usedDigitizedTheses.minted_batch_idwhen writing files to S3.submit, we can avoid repeated calls to Alma SRU and read the metadata from S3 instead. The saved XML files are also simplified from the original Alma SRU response (the saved file sets the<record>as the root of the XML file -- will no longer require these steps).How can a reviewer manually see the effects of these changes?
While the test suite includes several new unit tests for the simpler functions in
dsc.workflows.DigitizedTheses, I'm requesting that we move forward with the current set of unit tests given the additional local testing with MinIO shared below.Note: In general, DSC has been a complex app to test. Recent changes have relied on actual runs with the DSO Step Function, but this is not possible until we have a test instance of DSpace 8 available. Revisiting the test suite for clean up / improve maintainability is definitely on track for future work! Thanking reviewers in advance for understanding.
Local testing of
dsc.workflows.DigitizedTheses.prepare_batchAbout the test data:
These five files were provided by Sadie (i.e., items that were imported into DSpace via v1 workflow):
An additional file for a student-submitted electronic theses (i.e., should be skipped by workflow) is used for testing:
batch-0) in the digitized theses workspace S3 bucket:dsc.workflows.DigitizedTheses.prepare_batchin IPython:replacement-theses/subfolder and one item submission (a student-submitted electronic thesis) remains at unmoved.Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Code review