Something went wrong with ingestion of some collections in v22 #124

@fedorov

I noticed today that one of the dashboards reports the total size of data in IDC as 184 TB. Based on my earlier queries, I expected ~90 TB.

Running the following query, I confirmed that 184 TB is what we see in BQ:

select round(sum(instance_size)/pow(1000,4),3) from `bigquery-public-data.idc_current.dicom_all`

It appears that the portal is reporting those numbers too!

Image

I then did a query per-collection:

SELECT collection_id, sum(instance_size)/pow(10,12) as size_TB
FROM `bigquery-public-data.idc_v21.dicom_all` 
group by collection_id
order by size_TB desc

Inexplicably, some of the collections that were not supposed to change since the previous release increased in size dramatically! The top 20 collections by size are shown below for each version (see spreadsheet here).

v21:

Image

v22:

Image
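
To make the comparison easier to reproduce without the screenshots, the two per-collection results could be joined in one query. This is only a sketch: it assumes `idc_current` currently points at v22 (as the 184 TB total above suggests), and the `growth_ratio` column and the CTE names are made up for illustration.

```sql
-- Sketch: per-collection size in v21 vs. the current release, largest growth first.
WITH v21 AS (
  SELECT collection_id, SUM(instance_size) / POW(10, 12) AS size_TB
  FROM `bigquery-public-data.idc_v21.dicom_all`
  GROUP BY collection_id
),
cur AS (
  -- Assumption: idc_current resolves to v22 at the time of running.
  SELECT collection_id, SUM(instance_size) / POW(10, 12) AS size_TB
  FROM `bigquery-public-data.idc_current.dicom_all`
  GROUP BY collection_id
)
SELECT
  COALESCE(cur.collection_id, v21.collection_id) AS collection_id,
  v21.size_TB AS v21_size_TB,
  cur.size_TB AS current_size_TB,
  -- SAFE_DIVIDE returns NULL instead of failing when the v21 size is 0 or missing.
  SAFE_DIVIDE(cur.size_TB, v21.size_TB) AS growth_ratio
FROM cur
FULL OUTER JOIN v21
  ON cur.collection_id = v21.collection_id
ORDER BY growth_ratio DESC
```

Collections that were not supposed to change should show a ratio close to 1. Collections present in only one of the two releases get a NULL ratio and sort last.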

This looks like a very serious regression. Could it be that earlier versions of those collections were ingested?
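
One way to start testing that hypothesis might be to check whether any instance appears more than once within a collection. The sketch below assumes `dicom_all` exposes a `SOPInstanceUID` column, and the output column names are illustrative. It would only catch re-ingested versions that share instance UIDs with the current ones, so an empty result would not rule the hypothesis out.

```sql
-- Sketch: rows vs. distinct instances per collection in the current release.
-- Assumption: dicom_all has a SOPInstanceUID column identifying each instance.
SELECT
  collection_id,
  COUNT(*) AS n_rows,
  COUNT(DISTINCT SOPInstanceUID) AS n_distinct_instances,
  COUNT(*) - COUNT(DISTINCT SOPInstanceUID) AS n_extra_rows
FROM `bigquery-public-data.idc_current.dicom_all`
GROUP BY collection_id
-- Keep only collections where at least one instance is listed more than once.
HAVING n_extra_rows > 0
ORDER BY n_extra_rows DESC
```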

Labels

bug
