This dataset provides 831 3D multiphase CT exams of renal masses from UCSF. Each exam includes an annotation of the renal mass in the form of bounding boxes or polygon masks, along with the post-surgical pathology results for that mass, which serve as the ground-truth outcome. The purpose of this dataset is to support the development of new algorithms that better distinguish aggressive from indolent disease based on non-invasive imaging.
The CT volumes were acquired at UCSF between 2002 and 2018, and only renal masses less than or equal to 7 cm (T1 stage) were included. Each exam has an unenhanced CT volume and up to three contrast-enhanced CT phases (arterial/corticomedullary, portal venous/nephrogenic, delayed/excretory). For each exam, the contrast-enhanced CT volumes are registered to the unenhanced volume. Registration was unsuccessful for a minority of the exams, but these exams are still included for further investigation.
The dataset is hosted on AWS S3. It can be found at the following URIs:
The dataset can be downloaded directly by clicking on the following URLs:
Alternatively, the dataset can be downloaded via the AWS CLI:
- Install AWS CLI.
- Copy using the S3 URI, giving a local destination as the second argument:
aws s3 cp <URI> <local-destination>
All CT imaging data and associated metadata are organized in HDF5 container files named by patient ID (a random 10-character alphanumeric code). A CSV file, phase_reg_key.csv, is included as a key describing which phases are available for each subject and the registration status of each CT volume.
Within phase_reg_key.csv:
- 0 = no volume
- 1 = volume exists but is not registered to the unenhanced (noncon) volume
- 2 = volume exists and is registered to the unenhanced (noncon) volume
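As a rough illustration of how these codes might be used, the sketch below parses the key with Python's standard csv module and lists, for each patient, the phases that are registered to the noncon volume. The column names (`PID`, `arterial`, `portven`, `delay`) and the helper `registered_phases` are assumptions made for this example and may not match the actual header of phase_reg_key.csv.

```python
import csv
import io

REGISTERED = 2  # code meaning "volume exists and is registered to noncon"


def registered_phases(csv_file, id_column="PID"):
    """Map each patient ID to the phases whose code equals REGISTERED.

    `id_column` and the phase column names are assumptions; adjust them
    to the real header of phase_reg_key.csv.
    """
    result = {}
    for row in csv.DictReader(csv_file):
        pid = row.pop(id_column)
        result[pid] = [
            phase for phase, code in row.items() if int(code) == REGISTERED
        ]
    return result


# Tiny in-memory stand-in for the real key file (IDs and values are made up).
example = io.StringIO(
    "PID,arterial,portven,delay\n"
    "AAAAAAAAAA,2,1,0\n"
    "BBBBBBBBBB,2,2,2\n"
)
print(registered_phases(example))
```

For the real file, an open handle such as `open("phase_reg_key.csv", newline="")` can be passed in place of the in-memory example.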
The file structure:
.
├── 08FBroxzI6.hdf5
├── 0A87Rq5Hkl.hdf5
├── 0ByGP3oWJi.hdf5
├── 0cb2z7Hao2.hdf5
...
├── phase_reg_key.csv
...
├── Zu1bNdA2od.hdf5
├── ZYUz7t5hOn.hdf5
└── Zz99Ji2swU.hdf5
Within an HDF5 container file, the CT volumes are organized as follows:
└── Zz99Ji2swU.hdf5
    ├── attrs
    ├── arterial
    ├── delay
    ├── mask
    ├── noncon
    └── portven
The attributes include selected metadata and image labels.
The HDF5 files can be read in Python using the h5py package. For example, to print the datasets and attributes and extract the unenhanced (noncon) CT volume from an HDF5 file:
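import h5py

# Open the container read-only; the file is closed when the block exits.
with h5py.File("Zz99Ji2swU.hdf5", "r") as hdf:
    print(f"HDF5 file datasets: {list(hdf.keys())}")
    print(f"HDF5 file attributes: {list(hdf.attrs.keys())}")
    # Read the full unenhanced (noncon) volume into memory as a NumPy array.
    noncon = hdf["noncon"][:]
    print(f"Shape of noncon volume: {noncon.shape}")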
Output:
HDF5 file datasets: ['arterial', 'delay', 'mask', 'noncon', 'portven']
HDF5 file attributes: ['Manufacturer', 'PID', 'Patient Age', 'Patient Sex', 'arterial_pixdim', 'delay_pixdim', 'mask_pixdim', 'noncon_pixdim', 'pathology', 'pathology_grade', 'portven_pixdim', 'tumor_type']
Shape of noncon volume: (512, 512, 49)
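Building on the example above, the following sketch reports which phases an exam contains together with the voxel spacing stored in the matching `<phase>_pixdim` attribute. It assumes that a missing phase is simply absent from the file's keys, which the sample output does not confirm. `summarize_exam` and `summarize_file` are hypothetical helpers, not part of the dataset's code.

```python
PHASES = ("noncon", "arterial", "portven", "delay")


def summarize_exam(hdf):
    """Return {phase: pixdim} for every phase present in an open HDF5 file.

    `hdf` only needs to support `in` and expose an `attrs` mapping, as
    h5py.File does. Phases without a pixdim attribute map to None.
    """
    summary = {}
    for phase in PHASES:
        if phase in hdf:
            summary[phase] = hdf.attrs.get(f"{phase}_pixdim")
    return summary


def summarize_file(path):
    """Open `path` read-only with h5py and summarize it."""
    import h5py  # imported here so summarize_exam can be used without h5py

    with h5py.File(path, "r") as hdf:
        return summarize_exam(hdf)
```

A call such as `summarize_file("Zz99Ji2swU.hdf5")` would then return one entry per phase found in that file.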
- Explore the pathology labels across all the datasets and plot distributions
- Visualize slices of the CT volumes with the tumor mask overlaid
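A minimal sketch of the label exploration described above, assuming each file's `pathology` attribute holds a single label. The helper names `tally` and `pathology_labels` and the directory argument are illustrative and are not taken from the notebooks.

```python
from collections import Counter
from pathlib import Path


def tally(labels):
    """Count how often each label occurs, decoding byte strings first."""
    counts = Counter()
    for label in labels:
        if isinstance(label, bytes):
            label = label.decode("utf-8")
        counts[str(label)] += 1
    return counts


def pathology_labels(data_dir):
    """Yield the `pathology` attribute of every .hdf5 file in data_dir."""
    import h5py  # third-party; only needed when reading real files

    for path in sorted(Path(data_dir).glob("*.hdf5")):
        with h5py.File(path, "r") as hdf:
            # Files lacking the attribute are counted under "missing".
            yield hdf.attrs.get("pathology", "missing")
```

Calling `tally(pathology_labels("."))` in the dataset directory would give per-label counts that can be passed to any plotting library.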
Curation Jupyter notebooks are collected in /curation and are numbered 01-07 to indicate each step of the curation process.
A sample conda environment can be found in environment.yml.
curation/utils.py contains utility functions for the curation steps.
Project initiation and leadership - Peder Larson, PhD, and Zhen Jane Wang, MD
Dataset Extraction - Sage Kramer, MD
Curation - Sage Kramer, MD, Sule Sahin, PhD, Samantha Jones, Ernesto Diaz
Data Management - Sule Sahin, PhD, Abhejit Rajagopal, PhD, Ernesto Diaz, Qing Dai