Skip to content

In 1626 digitized theses workflow create#240

Merged
jonavellecuerdo merged 4 commits into
mainfrom
IN-1626-digitized-theses-workflow-create
May 27, 2026
Merged

In 1626 digitized theses workflow create#240
jonavellecuerdo merged 4 commits into
mainfrom
IN-1626-digitized-theses-workflow-create

Conversation

@jonavellecuerdo

@jonavellecuerdo jonavellecuerdo commented May 20, 2026

Copy link
Copy Markdown
Contributor

Purpose and background context

This PR implement batch preparation methods for digitized theses workflow. I went back-and-forth on whether I should send a PR with the entire workflow, but as I saw the changes that were piling up, I decided it would be best to request review in reasonably sized chunks. This also allows us to focus / reflect on the key differences for this workflow for each step of the DSC workflow.

Here are some key points to highlight about the create / prepare_batch methods for the digitized theses workflow:

  1. The workflow has a minted_batch_id attribute, which is comprised of the original batch ID and the current date timestamp.
  2. DigitizedTheses.prepare_batch does not collect exceptions in the errors list like other workflows.
    1. Why: This allows DSC to proceed with writing all item submissions to the DynamoDB table, with any exceptions captured via the ItemSubmission.status_details field.
      1. When DSC sends the report, the resulting CSV file from DynamoDB will include all intended item submissions in the batch rather than creating a separate errors.csv.
      2. No BatchCreationFailedError is raised. TBD when this error should be raised (if ever) for the workflow.
  3. After the create step, the minted batch ID is what gets passed in the latter calls to submit and finalize.
    1. For this reason, I left DigitizedTheses.batch_path as is and explicitly used DigitizedTheses.minted_batch_id when writing files to S3.
  4. Why do we save the Alma metadata in XML files to S3?
    1. When we call submit, we can avoid repeated calls to Alma SRU and read the metadata from S3 instead. The saved XML files are also simplified from the original Alma SRU response (the saved file sets the <record> as the root of the XML file -- will no longer require these steps).

How can a reviewer manually see the effects of these changes?

While the test suite includes several new unit tests for the simpler functions in dsc.workflows.DigitizedTheses, I'm requesting that we move forward with the current set of unit tests given the additional local testing with MinIO shared below.

Note: In general, DSC has been a complex app to test. Recent changes have relied on actual runs with the DSO Step Function, but this is not possible until we have a test instance of DSpace 8 available. Revisiting the test suite for clean up / improve maintainability is definitely on track for future work! Thanking reviewers in advance for understanding.

Local testing of dsc.workflows.DigitizedTheses.prepare_batch

About the test data:
These five files were provided by Sadie (i.e., items that were imported into DSpace via v1 workflow):

An additional file for a student-submitted electronic theses (i.e., should be skipped by workflow) is used for testing:


  1. Created a batch (batch-0) in the digitized theses workspace S3 bucket:
image
  1. After running dsc.workflows.DigitizedTheses.prepare_batch in IPython:
  • Contents are synced to a minted batch folder five item submissions moved to a replacement-theses/ subfolder and one item submission (a student-submitted electronic thesis) remains at unmoved.
image

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:
* The digitized theses workflow requires the addition of
environment variables to fetch items from DSpace and metadata
from Alma. The workflow also relies on an S3 bucket used
by the Imaging Lab as a workspace to create batches
of theses.

How this addresses that need:
* Add DSPACE_CREDENTIALS env var
* Add METADATA_API_URL env var
* Add S3_BUCKET_DIGITIZED_THESES env var

Side effects of this change:
* The addition of the digitized theses workflow reveals a
need to revisit how configurations (e.g., env vars) are
managed for different workflows. While the current structure
of the `Config` class has worked in terms of holding
env vars for different deployment environments (dev, stage, prod),
there is an additional layer of complexity for the DSC app in
that the "required" configs may also be *workflow-specific*.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1626
* https://mitlibraries.atlassian.net/browse/IN-1709
* https://mitlibraries.atlassian.net/browse/IN-1745
@coveralls

coveralls commented May 20, 2026

Copy link
Copy Markdown

Coverage Report for CI Build 26519017964

Warning

No base build found for commit d69a61e on main.
Coverage changes can't be calculated without a base build.
If a base build is processing, this comment will update automatically when it completes.

Coverage: 69.174%

Details

  • Patch coverage: 81 uncovered changes across 2 files (132 of 213 lines covered, 61.97%).

Uncovered Changes

File Changed Covered %
dsc/workflows/digitized_theses/workflow.py 181 106 58.56%
dsc/config.py 19 13 68.42%
Total (6 files) 213 132 61.97%

Coverage Regressions

Requires a base build to compare against. How to fix this →


Coverage Stats

Coverage Status
Relevant Lines: 1901
Covered Lines: 1315
Line Coverage: 69.17%
Coverage Strength: 0.69 hits per line

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo force-pushed the IN-1626-digitized-theses-workflow-create branch from bc86c77 to 4ecfd35 Compare May 20, 2026 17:07
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review May 20, 2026 17:28
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner May 20, 2026 17:28
Comment on lines +63 to +85
try:
credentials = CONFIG.dspace_credentials[self.submission_system]
except KeyError as exception:
raise exceptions.DSpaceClientCredentialsNotFoundError(
f"No credentials for {self.submission_system}"
) from exception

client = DSpaceClient(
api_endpoint=credentials["url"],
username=credentials["user"],
password=credentials["password"],
fake_user_agent=True,
)
authenticated = client.authenticate()
if not authenticated:
raise exceptions.DSpaceClientAuthenticationError(
credentials["url"], credentials["user"]
)
self._dspace_client = client
logger.info(
f"Successfully authenticated to {credentials['url']} "
f"as {credentials['url']}"
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled from DSS!

@ghukill

ghukill commented May 20, 2026

Copy link
Copy Markdown

Before I jump into a review @jonavellecuerdo, had a question about your second screenshot.

For PDF 932127295, am I understand correctly that:

  • the batch folder is created in DSC bucket
  • the original PDF is moved over to root of batch folder
  • the Alma MARC is retrieved and saved as an XML document
  • but because not replacement or new, it remains in the root of the batch folder

Is that right? I suppose my naive expectation would be that bad/error/skipped records would get moved into a folder like "skipped". This might be a decent name, is it captures true skips, unhandled errors, etc.

It's a bit minor, but it feels kind of messy to have the "bad" PDF + XMLs at the root of the batch folder. Open to pushback or clarification here, just a thought as I prepare to review tomorrow!

@ehanson8 ehanson8 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Some comments and questions

Comment thread dsc/config.py
Comment on lines +93 to +97
def metadata_api_url(self) -> str:
value = os.getenv("METADATA_API_URL")
if not value:
raise OSError("Env var 'METADATA_API_URL' must be defined")
return value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this would be used by both DT and Wiley for example?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. 🤔 I think this is also worth asking @cabutlermit about when we talk about how to update the DSPACE_CREDENTIALS environment variable (i.e., multiple credentials/info intended for a single environment variable).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comment is making me realize we could do a similar JSON blob structure for other env vars that change for various workflows, but yes, consulting with him would be great. This addresses my concerns a few comments down

Comment thread dsc/config.py
Comment on lines +99 to +104
@property
def s3_bucket_digitized_theses(self) -> str:
value = os.getenv("S3_BUCKET_DIGITIZED_THESES")
if not value:
raise OSError("Env var 'S3_BUCKET_DIGITIZED_THESES' must be defined")
return value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to above (if I'm understanding it right), could this also be more generic, e.g. working_directory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be! Requesting to leave it for now until we get into the WIley implementation. What this is revealing is a need to reevaluate how to dynamically set the appropriate configs based on the workflow METADATA_API_URL and S3_BUCKET_WORKING_DIRECTORY. I think the discussion for DSPACE_CREDENTIALS will help us figure out how we should be storing these configs in AWS in the first place and from there we can figure out how the app should access them.

Does that sound like a reasonable next step -- proceed as-is, include in discussion with @cabutlermit ? 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm fine with that!

Comment thread dsc/config.py

def check_required_env_vars(self) -> None:
"""Method to raise exception if required env vars not set."""
missing_vars = [var for var in self.REQUIRED_ENV_VARS if not os.getenv(var)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit unclear how DSC would manage workflow-specific env vars, I get the need to hit different buckets for different AWS envs for working_directory but it is odd to have a workflow-specific env vars at the Config level. That said, if the plan is to create a ticket and address later, that's OK with me. Just want to avoid a proliferation of workflow-specific env var in the future

For metadata_api_url however, a workflow class attribute feels like a better fit in most, if not all, cases. Even with Alma, I don't think we would read from Alma sandbox even on Dev1, we would read from Alma prod.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created this ticket for now! This has been on my mind as well but have been focused on getting something up and running first, figuring we can tackle infra-related changes after discussions with @cabutlermit !

Comment thread dsc/db/models.py
Comment on lines +24 to 27
CREATE_SUCCESS = "create_success"
CREATE_FAILED = "create_failed"
CREATE_SKIPPED = "create_skipped"
BATCH_CREATED = "batch_created"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BATCH_CREATED feels duplicative now, do we still need it? And if we are changing enum values for DynamoDB, we may need to run an update pass to ensure consistency

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left ItemSubmissionStatus.BATCH_CREATED to prevent breaking workflows that rely on it, but I did create a ticket to align all workflows to use the new statuses: IN-1753. I don't believe adding a new status changes anything in terms of writing to DynamoDB. As long as it meets the type, it should write to DynamoDB just fine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, DynamoDB will write whatever but we can talk about whether consistency in useful or not in the ticket

Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
Comment thread dsc/workflows/digitized_theses/workflow.py
Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
Comment thread dsc/workflows/digitized_theses/workflow.py
)


def test_workflow_dspace_client_raise_authenticationerror(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_workflow_dspace_client_raise_authentication_error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always wonder whether it'd be good practice to use the same name/formatting as the error. What do you think of:

test_workflow_dspace_client_raise_AuthenticationError? 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, not a big deal either way, do what you feel!

@ghukill ghukill left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it better to submit my review now, with some requests for comments + docstrings, and some possible suggestions for the "minted" batch id language and logic.

If it were possible to drop the word "mint" entirely, except maybe once when some code "mints the batch id going forward for this batch", that feels like it could be a big win. This might remove some confusion when scanning code if we're:

  1. copying files from digi theses S3 workspace bucket to batch folder in DSC bucket
  2. moving files within the batch folder

Which all brings me to a question.... is there a universe where this workflow should be preparing the batch locally -- e.g. in a python temp directory -- and then uploading the final, perfectly formed batch with it's new "minted" batch id to the DSC bucket?

If you went that path, if something went horribly wrong along the way, there isn't cruft in the DSC bucket. Nothing got uploaded, because nothing was really ready enough. And instead of moving files around in S3, which is kind of tricky, you're literally just moving around files in a python temp directory. For this app, that'd be inside the ECS task Docker container, which is just fine.

Not a hard requirement / request to do this file downloading, renaming, and moving around locally then uploading to DSC, but maybe worth a consideration whilst considering the comments about "minted" batch ids and all fairly repetitive S3 path constructions throughout.

All that said, this is some complicated stuff, really good foundation here conceptually I think.

Comment thread dsc/workflows/digitized_theses/workflow.py
errors: list[tuple] = [] # set but not used

# sync batch folder from DT S3 bucket to minted batch folder in DSC S3 bucket
run_aws_cli_sync(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also touched on in an off-PR discussion, I think there is some conflict with the word "sync" here, that may or may not be worth addressing.

In the method signature synced: bool = False is referring to whether or not the batch folder was synced from stage --> prod, correct? and that we're not heavily using this right now?

Either way, it seems as though this has some tension with the use of "sync" in the utility function run_aws_cli_sync(...) which while technically "syncing" data in S3, it's more of a naive copy from point A to point B. My understanding is that it's not an "opinionated DSC sync" of a fully formed batch folder from stage --> prod.

I agree that "sync" is still accurate in that function name, given its literally running aws cli sync.... But we may just need some comments + docstrings to help readers know super clearly that this particular workflow kind of has a third bucket in play, the digitized theses bucket that Jenn (or others) upload PDFs to.

I'm going to avoid needling the whole "do things in stage, then do again in prod" logic which I can appreciate but don't love here. Assuming that remains and we like it, I still think there is room here to disambiguate these two things:

  1. This DigitizedTheses workflow copying data from a workspace bucket specific to only this workflow, to the stage bucket
  2. Syncing a fully formed batch folder from stage to prod

Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
)

s3_client = S3Client()
for file in s3_client.files_iter(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

A lot of my comments are kind of around this file copying + syncing + batch folder creation, so this comment is actually kind of touching three things:

  1. comment about possible "setting" of self.batch_id, which means self.batch_id could be used more
  2. the use of run_aws_cli_sync(...)
  3. this call to s3_client.files_iter(...)

Could there be sub-methods for some of this work? So this prepare_batch() might look more like?

self.batch_id = self.update_batch_id(batch_id)

self._copy_input_files_to_dsc_batch_folder(...)

input_files = self._get_input_files_iter(...)

item_submissions = []
errors: list[tuple] = []  # set but not used
for file in input_files:
    ...

Just throwing it out there, there might be some opportunity for small, targeted methods here. So when encountering this prepare_batch() method, you can follow along at a high level.

The screenshot is meant to communicate thow ideally there wouldn't be as much explicit S3 bucket + path construction at this high level.

Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
Comment thread dsc/workflows/digitized_theses/workflow.py
Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
@jonavellecuerdo

jonavellecuerdo commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

@ehanson8 @ghukill I just pushed up two commits--the former comprising of smaller, minor updates and the latter being more involved. In this description, I will be summarizing the changes in the second commit, which emerged from three key things/questions @ghukill pointed out:

  1. The workflow module could use more docstrings.
    • Please view the workflow module.
    • I plan to update the class docstring as I implement methods for each of the DSO steps (i.e., the next PR will include text for 'Queue a batch for ingest'
  2. The workflow now prepares the batch in a temporary directory on local disk and once the batch is ready, it creates the batch folder in the DSC S3 bucket using run_aws_cli_sync
  3. The workflow now moves item submission assets into nested prefixes named with the item identifier under the [replacement|new|skipped]-theses prefixes.
  4. This workflow is the first time we want to use synced=True to avoid unnecessary API calls when creating a batch in Prod via syncing data from Stage.

Screenshots showing new structure of prepared batch folder

image image image

Please let me know if you have any questions!
Note: I resolved Graham's comments in an attempt to summarize the key points that were shared in this comment.

@ghukill ghukill left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good, nice work wrangling some complex mechanics.

I've opted for a "Request changes" 99% for the categorization of items based on their status_details. For reasons noted in the comment, I'd lobby pretty strong to avoid the "new" or "replacment" fuzzy string matching. Open to pushback and more discussion, but this feels like something to really kind of iron out.

Otherwise, basically an approval else-wise!

Comment thread dsc/workflows/digitized_theses/workflow.py Outdated
Comment thread dsc/workflows/digitized_theses/workflow.py
Comment thread dsc/workflows/digitized_theses/workflow.py
Comment on lines +378 to +382
This method will inspect the `status_details` for each item submission to
determine which theses subfolder (prefix) its content should be moved to.
At this point, `status_details` will contain "Replacement thesis",
"New thesis", or an exception message. If `status_details` contains an
exception message, contents are moved to a skipped-theses/ subfolder.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about this kind of fuzzy string matching/checking for classifying the item as new vs. replacement vs. skip/error, mostly for the possibility than an exception message from this statement which shows up a couple of times,

item_submission.status_details = str(exception)

may somehow include words like "new" or "replacment" and get incorectly categorized.

Would it work to look for more exact phrases like "Replacement thesis" or "New thesis" in the status details? maybe removing the .lower() as well? Maybe that could add some additional safety without really changing the approach?

Another thought, note that ItemSubmission class has these already on it:

"""
...
2. Processing attributes: Temporary data used during submission processing
   but not persisted
...
"""

...
# processing attributes
source_metadata: dict[str, Any] | None = None
dspace_metadata: dict[str, Any] | None = None
bitstream_s3_uris: list[str] | None = None
metadata_s3_uri: str = ""

You could update that class to have something like .processing_meta which is a freeform dictionary to note things on the ItemSubmission instance during processing. Then, use that during sorting and categorizing.

Another option...

Instead of appending ItemSubmission instances into item_submissions, you could store them in a dictionary like:

item_submissions = {
  "new":[...],
  "replacement":[...],
  "skip_and_error":[...],
}

If you pass that dictionary to self._move_batch_files_to_theses_subfolders() it of course could quite easily just slot them into the right folders.

Then, when ready to return the list you could do

return (
  item_submissions["new"] +
  item_submissions["replacement"] +
  item_submissions["skip_and_error"]
)

Or this, which AI suggested and I wouldn't have thought of!

from itertools import chain

return list(chain(
    item_submissions["new"],
    item_submissions["replacement"],
    item_submissions["skip_and_error"],
))

TL/DR: It really might be worth avoiding the fuzzy matching of "new" and "replacement" in the ItemSubmission.status_details as a sorting mechanism, given that we're storing raw exceptions; they could theoretically contain those words. A bit of effort to avoid now, might avoid a really sneaky bug in production later.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the conditions to check for exact matches of "New thesis" and "Replacement thesis". I will hold off on additional changes to the ItemSubmission class but will continue to keep them in mind.

@ghukill ghukill self-requested a review May 26, 2026 14:30

@ghukill ghukill left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, solid movement forward.

Why these changes are being introduced:
* The batch preparation methods for the digitized theses workflow
must be capable of performing the following functions:

1. Sync the batch folder from the digitized theses workflow S3 bucket
to a minted batch folder in the DSC S3 bucket
2. Download metadata from Alma via SRU
3. Get an item from DSpace
4. Determine if an item is a valid 'replacement thesis', given DSpace item metadata
5. Organize the contents of the minted batch folder into 'new-theses/' and
'replacement-theses/' subfolders

How this addresses that need:
* Implement methods on DigitizedThesesWorkflow for batch preparation functions
* Expand ItemSubmissionStatus to `CREATE_SUCCESS`, `CREATE_FAILED`, `CREATE_SKIPPED`
* Add exceptions to handle item retrieval from DSpace

Side effects of this change:
* Though we've expanded ItemSubmissionStatus to include `CREATE_*` statuses,
`BATCH_CREATED` remains to support backwards compatibility with other workflows.
Updating all workflows to use `CREATE_*` will be in the DSC backlog for future
work.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1626
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1626-digitized-theses-workflow-create branch from 7200d91 to aaa826e Compare May 27, 2026 14:53
@jonavellecuerdo jonavellecuerdo merged commit 35439e0 into main May 27, 2026
6 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1626-digitized-theses-workflow-create branch May 27, 2026 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants