In 1626 digitized theses workflow create by jonavellecuerdo · Pull Request #240 · MITLibraries/dspace-submission-composer

jonavellecuerdo · 2026-05-20T16:43:57Z

Purpose and background context

This PR implement batch preparation methods for digitized theses workflow. I went back-and-forth on whether I should send a PR with the entire workflow, but as I saw the changes that were piling up, I decided it would be best to request review in reasonably sized chunks. This also allows us to focus / reflect on the key differences for this workflow for each step of the DSC workflow.

Here are some key points to highlight about the create / prepare_batch methods for the digitized theses workflow:

The workflow has a minted_batch_id attribute, which is comprised of the original batch ID and the current date timestamp.
DigitizedTheses.prepare_batch does not collect exceptions in the errors list like other workflows.
1. Why: This allows DSC to proceed with writing all item submissions to the DynamoDB table, with any exceptions captured via the ItemSubmission.status_details field.
  1. When DSC sends the report, the resulting CSV file from DynamoDB will include all intended item submissions in the batch rather than creating a separate errors.csv.
  2. No BatchCreationFailedError is raised. TBD when this error should be raised (if ever) for the workflow.
After the create step, the minted batch ID is what gets passed in the latter calls to submit and finalize.
1. For this reason, I left DigitizedTheses.batch_path as is and explicitly used DigitizedTheses.minted_batch_id when writing files to S3.
Why do we save the Alma metadata in XML files to S3?
1. When we call submit, we can avoid repeated calls to Alma SRU and read the metadata from S3 instead. The saved XML files are also simplified from the original Alma SRU response (the saved file sets the <record> as the root of the XML file -- will no longer require these steps).

How can a reviewer manually see the effects of these changes?

While the test suite includes several new unit tests for the simpler functions in dsc.workflows.DigitizedTheses, I'm requesting that we move forward with the current set of unit tests given the additional local testing with MinIO shared below.

Note: In general, DSC has been a complex app to test. Recent changes have relied on actual runs with the DSO Step Function, but this is not possible until we have a test instance of DSpace 8 available. Revisiting the test suite for clean up / improve maintainability is definitely on track for future work! Thanking reviewers in advance for understanding.

Local testing of `dsc.workflows.DigitizedTheses.prepare_batch`

About the test data:
These five files were provided by Sadie (i.e., items that were imported into DSpace via v1 workflow):

12854433: https://hdl.handle.net/1721.1/157046
25395837: https://hdl.handle.net/1721.1/148716
34619573: https://hdl.handle.net/1721.1/154113
22215428: https://hdl.handle.net/1721.1/158471
05588126: https://hdl.handle.net/1721.1/157651

An additional file for a student-submitted electronic theses (i.e., should be skipped by workflow) is used for testing:

932127295: http://hdl.handle.net/1721.1/100593

Created a batch (batch-0) in the digitized theses workspace S3 bucket:

After running dsc.workflows.DigitizedTheses.prepare_batch in IPython:

Contents are synced to a minted batch folder five item submissions moved to a replacement-theses/ subfolder and one item submission (a student-submitted electronic thesis) remains at unmoved.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/IN-1626

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced: * The digitized theses workflow requires the addition of environment variables to fetch items from DSpace and metadata from Alma. The workflow also relies on an S3 bucket used by the Imaging Lab as a workspace to create batches of theses. How this addresses that need: * Add DSPACE_CREDENTIALS env var * Add METADATA_API_URL env var * Add S3_BUCKET_DIGITIZED_THESES env var Side effects of this change: * The addition of the digitized theses workflow reveals a need to revisit how configurations (e.g., env vars) are managed for different workflows. While the current structure of the `Config` class has worked in terms of holding env vars for different deployment environments (dev, stage, prod), there is an additional layer of complexity for the DSC app in that the "required" configs may also be *workflow-specific*. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1626 * https://mitlibraries.atlassian.net/browse/IN-1709 * https://mitlibraries.atlassian.net/browse/IN-1745

coveralls · 2026-05-20T16:46:19Z

Coverage Report for CI Build 26519017964

Warning

No base build found for commit d69a61e on main.
Coverage changes can't be calculated without a base build.
If a base build is processing, this comment will update automatically when it completes.

Coverage: 69.174%

Details

Patch coverage: 81 uncovered changes across 2 files (132 of 213 lines covered, 61.97%).

Uncovered Changes

File	Changed	Covered	%
dsc/workflows/digitized_theses/workflow.py	181	106	58.56%
dsc/config.py	19	13	68.42%
Total (6 files)	213	132	61.97%

Coverage Regressions

Requires a base build to compare against. How to fix this →

Coverage Stats


Relevant Lines:	1901
Covered Lines:	1315
Line Coverage:	69.17%
Coverage Strength:	0.69 hits per line

💛 - Coveralls

jonavellecuerdo · 2026-05-20T17:36:46Z

+            try:
+                credentials = CONFIG.dspace_credentials[self.submission_system]
+            except KeyError as exception:
+                raise exceptions.DSpaceClientCredentialsNotFoundError(
+                    f"No credentials for {self.submission_system}"
+                ) from exception
+
+            client = DSpaceClient(
+                api_endpoint=credentials["url"],
+                username=credentials["user"],
+                password=credentials["password"],
+                fake_user_agent=True,
+            )
+            authenticated = client.authenticate()
+            if not authenticated:
+                raise exceptions.DSpaceClientAuthenticationError(
+                    credentials["url"], credentials["user"]
+                )
+            self._dspace_client = client
+            logger.info(
+                f"Successfully authenticated to {credentials['url']} "
+                f"as {credentials['url']}"
+            )


Pulled from DSS!

ghukill · 2026-05-20T20:52:32Z

Before I jump into a review @jonavellecuerdo, had a question about your second screenshot.

For PDF 932127295, am I understand correctly that:

the batch folder is created in DSC bucket
the original PDF is moved over to root of batch folder
the Alma MARC is retrieved and saved as an XML document
but because not replacement or new, it remains in the root of the batch folder

Is that right? I suppose my naive expectation would be that bad/error/skipped records would get moved into a folder like "skipped". This might be a decent name, is it captures true skips, unhandled errors, etc.

It's a bit minor, but it feels kind of messy to have the "bad" PDF + XMLs at the root of the batch folder. Open to pushback or clarification here, just a thought as I prepare to review tomorrow!

ehanson8

Looking good! Some comments and questions

ehanson8 · 2026-05-21T13:37:47Z

+    def metadata_api_url(self) -> str:
+        value = os.getenv("METADATA_API_URL")
+        if not value:
+            raise OSError("Env var 'METADATA_API_URL' must be defined")
+        return value


So this would be used by both DT and Wiley for example?

Yes. 🤔 I think this is also worth asking @cabutlermit about when we talk about how to update the DSPACE_CREDENTIALS environment variable (i.e., multiple credentials/info intended for a single environment variable).

Your comment is making me realize we could do a similar JSON blob structure for other env vars that change for various workflows, but yes, consulting with him would be great. This addresses my concerns a few comments down

ehanson8 · 2026-05-21T13:39:20Z

+    @property
+    def s3_bucket_digitized_theses(self) -> str:
+        value = os.getenv("S3_BUCKET_DIGITIZED_THESES")
+        if not value:
+            raise OSError("Env var 'S3_BUCKET_DIGITIZED_THESES' must be defined")
+        return value


Similarly to above (if I'm understanding it right), could this also be more generic, e.g. working_directory?

I think it could be! Requesting to leave it for now until we get into the WIley implementation. What this is revealing is a need to reevaluate how to dynamically set the appropriate configs based on the workflow METADATA_API_URL and S3_BUCKET_WORKING_DIRECTORY. I think the discussion for DSPACE_CREDENTIALS will help us figure out how we should be storing these configs in AWS in the first place and from there we can figure out how the app should access them.

Does that sound like a reasonable next step -- proceed as-is, include in discussion with @cabutlermit ? 🤔

Yes, I'm fine with that!

ehanson8 · 2026-05-21T13:51:47Z

+
    def check_required_env_vars(self) -> None:
        """Method to raise exception if required env vars not set."""
        missing_vars = [var for var in self.REQUIRED_ENV_VARS if not os.getenv(var)]


I'm still a bit unclear how DSC would manage workflow-specific env vars, I get the need to hit different buckets for different AWS envs for working_directory but it is odd to have a workflow-specific env vars at the Config level. That said, if the plan is to create a ticket and address later, that's OK with me. Just want to avoid a proliferation of workflow-specific env var in the future

For metadata_api_url however, a workflow class attribute feels like a better fit in most, if not all, cases. Even with Alma, I don't think we would read from Alma sandbox even on Dev1, we would read from Alma prod.

Created this ticket for now! This has been on my mind as well but have been focused on getting something up and running first, figuring we can tackle infra-related changes after discussions with @cabutlermit !

ehanson8 · 2026-05-21T13:55:28Z

+    CREATE_SUCCESS = "create_success"
+    CREATE_FAILED = "create_failed"
+    CREATE_SKIPPED = "create_skipped"
    BATCH_CREATED = "batch_created"


BATCH_CREATED feels duplicative now, do we still need it? And if we are changing enum values for DynamoDB, we may need to run an update pass to ensure consistency

I left ItemSubmissionStatus.BATCH_CREATED to prevent breaking workflows that rely on it, but I did create a ticket to align all workflows to use the new statuses: IN-1753. I don't believe adding a new status changes anything in terms of writing to DynamoDB. As long as it meets the type, it should write to DynamoDB just fine.

Correct, DynamoDB will write whatever but we can talk about whether consistency in useful or not in the ticket

ehanson8 · 2026-05-21T14:15:57Z

+    )
+
+
+def test_workflow_dspace_client_raise_authenticationerror(


test_workflow_dspace_client_raise_authentication_error?

I always wonder whether it'd be good practice to use the same name/formatting as the error. What do you think of:

test_workflow_dspace_client_raise_AuthenticationError? 🤔

Ah, I see, not a big deal either way, do what you feel!

ghukill

I thought it better to submit my review now, with some requests for comments + docstrings, and some possible suggestions for the "minted" batch id language and logic.

If it were possible to drop the word "mint" entirely, except maybe once when some code "mints the batch id going forward for this batch", that feels like it could be a big win. This might remove some confusion when scanning code if we're:

copying files from digi theses S3 workspace bucket to batch folder in DSC bucket
moving files within the batch folder

Which all brings me to a question.... is there a universe where this workflow should be preparing the batch locally -- e.g. in a python temp directory -- and then uploading the final, perfectly formed batch with it's new "minted" batch id to the DSC bucket?

If you went that path, if something went horribly wrong along the way, there isn't cruft in the DSC bucket. Nothing got uploaded, because nothing was really ready enough. And instead of moving files around in S3, which is kind of tricky, you're literally just moving around files in a python temp directory. For this app, that'd be inside the ECS task Docker container, which is just fine.

Not a hard requirement / request to do this file downloading, renaming, and moving around locally then uploading to DSC, but maybe worth a consideration whilst considering the comments about "minted" batch ids and all fairly repetitive S3 path constructions throughout.

All that said, this is some complicated stuff, really good foundation here conceptually I think.

ghukill · 2026-05-21T14:42:25Z

+        errors: list[tuple] = []  # set but not used
+
+        # sync batch folder from DT S3 bucket to minted batch folder in DSC S3 bucket
+        run_aws_cli_sync(


Also touched on in an off-PR discussion, I think there is some conflict with the word "sync" here, that may or may not be worth addressing.

In the method signature synced: bool = False is referring to whether or not the batch folder was synced from stage --> prod, correct? and that we're not heavily using this right now?

Either way, it seems as though this has some tension with the use of "sync" in the utility function run_aws_cli_sync(...) which while technically "syncing" data in S3, it's more of a naive copy from point A to point B. My understanding is that it's not an "opinionated DSC sync" of a fully formed batch folder from stage --> prod.

I agree that "sync" is still accurate in that function name, given its literally running aws cli sync.... But we may just need some comments + docstrings to help readers know super clearly that this particular workflow kind of has a third bucket in play, the digitized theses bucket that Jenn (or others) upload PDFs to.

I'm going to avoid needling the whole "do things in stage, then do again in prod" logic which I can appreciate but don't love here. Assuming that remains and we like it, I still think there is room here to disambiguate these two things:

This DigitizedTheses workflow copying data from a workspace bucket specific to only this workflow, to the stage bucket

Syncing a fully formed batch folder from stage to prod

ghukill · 2026-05-21T14:58:25Z

+        )
+
+        s3_client = S3Client()
+        for file in s3_client.files_iter(


A lot of my comments are kind of around this file copying + syncing + batch folder creation, so this comment is actually kind of touching three things:

comment about possible "setting" of self.batch_id, which means self.batch_id could be used more

the use of run_aws_cli_sync(...)

this call to s3_client.files_iter(...)

Could there be sub-methods for some of this work? So this prepare_batch() might look more like?

self.batch_id = self.update_batch_id(batch_id) self._copy_input_files_to_dsc_batch_folder(...) input_files = self._get_input_files_iter(...) item_submissions = [] errors: list[tuple] = [] # set but not used for file in input_files: ...

Just throwing it out there, there might be some opportunity for small, targeted methods here. So when encountering this prepare_batch() method, you can follow along at a high level.

The screenshot is meant to communicate thow ideally there wouldn't be as much explicit S3 bucket + path construction at this high level.

jonavellecuerdo · 2026-05-22T18:44:10Z

@ehanson8 @ghukill I just pushed up two commits--the former comprising of smaller, minor updates and the latter being more involved. In this description, I will be summarizing the changes in the second commit, which emerged from three key things/questions @ghukill pointed out:

The workflow module could use more docstrings.
- Please view the workflow module.
- I plan to update the class docstring as I implement methods for each of the DSO steps (i.e., the next PR will include text for 'Queue a batch for ingest'
The workflow now prepares the batch in a temporary directory on local disk and once the batch is ready, it creates the batch folder in the DSC S3 bucket using run_aws_cli_sync
The workflow now moves item submission assets into nested prefixes named with the item identifier under the [replacement|new|skipped]-theses prefixes.
This workflow is the first time we want to use synced=True to avoid unnecessary API calls when creating a batch in Prod via syncing data from Stage.
- The prepare_batch() method now has branching logic for when synced=True.
- When synced=True, this means:
  - A ready batched folder already exists (in Prod).
- All prepare_batch() needs to do is derive ItemSubmission's given the contents of the ready batch folder to create/materialize the item submissions in DynamoDB.
- ⚠️ : Requesting reviewers to wait on tests for DigitizedTheses._get_item_submissions_from_synced_batch().
  - The way the unit tests mock DynamoDB calls is a bit messy and would like some time to ponder this, but wanting to prioritize implementation of DSO steps.

Screenshots showing new structure of prepared batch folder

Please let me know if you have any questions!
Note: I resolved Graham's comments in an attempt to summarize the key points that were shared in this comment.

ghukill

Looking really good, nice work wrangling some complex mechanics.

I've opted for a "Request changes" 99% for the categorization of items based on their status_details. For reasons noted in the comment, I'd lobby pretty strong to avoid the "new" or "replacment" fuzzy string matching. Open to pushback and more discussion, but this feels like something to really kind of iron out.

Otherwise, basically an approval else-wise!

ghukill · 2026-05-22T20:04:44Z

+        This method will inspect the `status_details` for each item submission to
+        determine which theses subfolder (prefix) its content should be moved to.
+        At this point, `status_details` will contain "Replacement thesis",
+        "New thesis", or an exception message. If `status_details` contains an
+        exception message, contents are moved to a skipped-theses/ subfolder.


I'm not crazy about this kind of fuzzy string matching/checking for classifying the item as new vs. replacement vs. skip/error, mostly for the possibility than an exception message from this statement which shows up a couple of times,

item_submission.status_details = str(exception)

may somehow include words like "new" or "replacment" and get incorectly categorized.

Would it work to look for more exact phrases like "Replacement thesis" or "New thesis" in the status details? maybe removing the .lower() as well? Maybe that could add some additional safety without really changing the approach?

Another thought, note that ItemSubmission class has these already on it:

""" ... 2. Processing attributes: Temporary data used during submission processing but not persisted ... """ ... # processing attributes source_metadata: dict[str, Any] | None = None dspace_metadata: dict[str, Any] | None = None bitstream_s3_uris: list[str] | None = None metadata_s3_uri: str = ""

You could update that class to have something like .processing_meta which is a freeform dictionary to note things on the ItemSubmission instance during processing. Then, use that during sorting and categorizing.

Another option...

Instead of appending ItemSubmission instances into item_submissions, you could store them in a dictionary like:

item_submissions = { "new":[...], "replacement":[...], "skip_and_error":[...], }

If you pass that dictionary to self._move_batch_files_to_theses_subfolders() it of course could quite easily just slot them into the right folders.

Then, when ready to return the list you could do

return ( item_submissions["new"] + item_submissions["replacement"] + item_submissions["skip_and_error"] )

Or this, which AI suggested and I wouldn't have thought of!

from itertools import chain return list(chain( item_submissions["new"], item_submissions["replacement"], item_submissions["skip_and_error"], ))

TL/DR: It really might be worth avoiding the fuzzy matching of "new" and "replacement" in the ItemSubmission.status_details as a sorting mechanism, given that we're storing raw exceptions; they could theoretically contain those words. A bit of effort to avoid now, might avoid a really sneaky bug in production later.

I will update the conditions to check for exact matches of "New thesis" and "Replacement thesis". I will hold off on additional changes to the ItemSubmission class but will continue to keep them in mind.

ghukill

Good stuff, solid movement forward.

Why these changes are being introduced: * The batch preparation methods for the digitized theses workflow must be capable of performing the following functions: 1. Sync the batch folder from the digitized theses workflow S3 bucket to a minted batch folder in the DSC S3 bucket 2. Download metadata from Alma via SRU 3. Get an item from DSpace 4. Determine if an item is a valid 'replacement thesis', given DSpace item metadata 5. Organize the contents of the minted batch folder into 'new-theses/' and 'replacement-theses/' subfolders How this addresses that need: * Implement methods on DigitizedThesesWorkflow for batch preparation functions * Expand ItemSubmissionStatus to `CREATE_SUCCESS`, `CREATE_FAILED`, `CREATE_SKIPPED` * Add exceptions to handle item retrieval from DSpace Side effects of this change: * Though we've expanded ItemSubmissionStatus to include `CREATE_*` statuses, `BATCH_CREATED` remains to support backwards compatibility with other workflows. Updating all workflows to use `CREATE_*` will be in the DSC backlog for future work. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1626

jonavellecuerdo added 3 commits May 19, 2026 11:37

Add XML namespace for the SRU protocol

bbb2e4b

Update dependencies and include dspace-rest-python

328f378

jonavellecuerdo force-pushed the IN-1626-digitized-theses-workflow-create branch from bc86c77 to 4ecfd35 Compare May 20, 2026 17:07

jonavellecuerdo marked this pull request as ready for review May 20, 2026 17:28

jonavellecuerdo requested a review from a team as a code owner May 20, 2026 17:28

jonavellecuerdo requested review from ehanson8 and ghukill May 20, 2026 17:28

jonavellecuerdo commented May 20, 2026

View reviewed changes

ehanson8 reviewed May 21, 2026

View reviewed changes

ghukill reviewed May 21, 2026

View reviewed changes

jonavellecuerdo requested review from ehanson8 and ghukill May 22, 2026 18:45

ghukill requested changes May 22, 2026

View reviewed changes

ghukill self-requested a review May 26, 2026 14:30

ghukill approved these changes May 26, 2026

View reviewed changes

ehanson8 approved these changes May 27, 2026

View reviewed changes

jonavellecuerdo force-pushed the IN-1626-digitized-theses-workflow-create branch from 7200d91 to aaa826e Compare May 27, 2026 14:53

jonavellecuerdo merged commit 35439e0 into main May 27, 2026
6 checks passed

jonavellecuerdo deleted the IN-1626-digitized-theses-workflow-create branch May 27, 2026 14:56

jonavellecuerdo mentioned this pull request Jun 1, 2026

Digitized theses workflow retrieves item handles for synced replacement theses #244

Merged

Conversation

jonavellecuerdo commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Local testing of dsc.workflows.DigitizedTheses.prepare_batch

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Code review

Uh oh!

coveralls commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report for CI Build 26519017964

Coverage: 69.174%

Details

Uncovered Changes

Coverage Regressions

Coverage Stats

💛 - Coveralls

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghukill commented May 20, 2026

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghukill left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jonavellecuerdo commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghukill left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jonavellecuerdo commented May 20, 2026 •

edited

Loading

Local testing of `dsc.workflows.DigitizedTheses.prepare_batch`

coveralls commented May 20, 2026 •

edited

Loading

jonavellecuerdo commented May 22, 2026 •

edited

Loading