How can I bulk download files from the Kids First DRC?
The recommended way to download is to push the files to Cavatica and then use the Cavatica API to download them from there (http://docs.cavatica.org/).
Instructions for Downloading:
- Ensure your Kids First Portal account is linked with your eRA Commons account
- Ensure your Cavatica account is linked with eRA Commons and you have access to Kids First Data
- To double check this, in Cavatica, in the upper right hand corner, click on your username -\> Account Settings -\> Dataset Access. Make sure there is a green check mark next to "KIDS-FIRST controlled data"
- Ensure your Kids First Portal account is linked with your Cavatica account
- Perform a search in the Kids First File Repository. To move all files in your search result set over to Cavatica, just click the purple "Analyze in Cavatica" button without checking any boxes next to any files.
- Alternatively, if you want to only move a small subset of your search results over to Cavatica, then you can use the check box to select files.
- You will be presented with a popup modal that asks you to select a project to which you want to move the data. If this is your first time doing this, you will have to create a new project – which you can do in that pop up. Click "Copy Files" to move the data to Cavatica.
- From Cavatica, you can either generate a set of download links from the UI and use aria2c to get all the files (https://docs.sevenbridges.com/docs/download-results), or, if you want to do something more programmatic, use SBG's command line client: https://docs.sevenbridges.com/docs/command-line-interface.
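For the download-links route, a minimal sketch might look like the following. It assumes you have saved the generated links, one URL per line, to a text file. The function name `fetch_links`, the file name `download-links.txt`, and the concurrency value of 4 are illustrative choices, not something prescribed by Cavatica.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_links and download-links.txt are hypothetical names.

# Download every URL listed (one per line) in a links file using aria2c.
fetch_links() {
  local links="$1" dest="$2"
  if [ ! -s "$links" ]; then
    echo "links file '$links' is missing or empty" >&2
    return 1
  fi
  mkdir -p "$dest"
  # --continue resumes partial files instead of restarting them,
  # which helps avoid downloading the same data repeatedly.
  aria2c --input-file="$links" --dir="$dest" \
         --continue=true --max-concurrent-downloads=4
}

# Example: fetch_links download-links.txt /tmp/kf-data
```

Resuming interrupted transfers rather than restarting them is one way to stay within the request not to download a file more than twice.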
To configure the CLI, use the Cavatica API endpoint instead of the default SBG:
$ sb configure
Seven Bridges API endpoint [https://cavatica-api.sbgenomics.com/v2]:
Your process may then look like this:
- List all of the files in your project and save the list to a file, e.g.:
$ sb files list --project kids-first-drc/my-great-project \> filelist.txt
- Download the files using the file IDs (the first column returned in the file list) e.g.:
$ sb download --file \<file id\> --destination /tmp
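The two steps above can be combined into a small loop. This is a sketch under stated assumptions, not an official script: it assumes file IDs are 24-character hexadecimal strings in the first whitespace-separated column of the list (adjust the filter if your output looks different). The names `extract_file_ids`, `download_all`, `DRY_RUN`, and the `.downloaded_ids` log are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, DRY_RUN, and the ID format are assumptions.

# Print the first column of every line that looks like a file ID
# (assumed here to be 24 lowercase hex characters).
extract_file_ids() {
  awk '$1 ~ /^[0-9a-f]+$/ && length($1) == 24 { print $1 }' "$1"
}

# Download each listed file once, skipping IDs already recorded as done.
download_all() {
  local list="$1" dest="$2" done_log="$2/.downloaded_ids"
  mkdir -p "$dest"
  touch "$done_log"
  extract_file_ids "$list" | while read -r id; do
    if grep -qx "$id" "$done_log"; then
      continue  # already fetched in an earlier run
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print the command that would be run.
      echo "sb download --file $id --destination $dest"
    elif sb download --file "$id" --destination "$dest" < /dev/null; then
      echo "$id" >> "$done_log"
    else
      echo "download failed for $id" >&2
    fi
  done
}

# Example (prints the commands without downloading anything):
#   DRY_RUN=1 download_all filelist.txt /tmp/kf-data
```

Recording completed IDs means a re-run after a failure only fetches what is still missing, which keeps repeat downloads of the same file to a minimum.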
Does it cost money to store data from the Kids First Portal on Cavatica?
No. The Kids First DRC covers AWS storage for all Kids First files; files covered by other resources, such as the NCI CRDC, are indexed on the Kids First Data Resource Portal while their storage is covered by those resources. When you copy a file from the Kids First Portal to Cavatica, the file is not actually duplicated or transferred. Cavatica simply references the same S3 file location that the Kids First Data Resource Portal uses.
Does it cost money to download Kids First Data?
No, it does not cost users any money to download Kids First data from the portal or from Cavatica. Download costs are covered by the DRC. However, we ask that you do not download your dataset more than twice to help us keep our egress costs manageable.
How many times can I download Kids First files?
We ask that you do not download any given file more than twice.
Does it cost money to use Cavatica or run workflows?
Yes. Cavatica runs on Amazon Web Services, and users are charged for compute resources when running workflows. While storage is free for data coming from the Kids First Data Resource Center, storage costs are incurred if users upload their own data to Cavatica, and for output files produced by workflows.
As the DRC is looking to encourage the use of cloud-based resources, we are able to help cover the cost of analyses. Please see the question "What are cloud credits?" to learn how the DRC can help fund your analysis.
My PI was granted access to the study through dbGaP, but I need to be the one downloading the data. Can I do that?
Your PI can grant you downloader access to the data through the dbGaP portal. This will add your name as an authorized user via dbGaP & eRA Commons. Once you are added as an approved user on the dataset in dbGaP, we will receive your eRA Commons user name in the same fashion we received your PI's, and access will automatically be made available on the Kids First side. To learn more about how to be added as a downloader under your PI, please see the documentation here: https://gdc.cancer.gov/access-data/obtaining-access-controlled-data
What are cloud credits?
As the DRC is looking to encourage the use of cloud-based resources, we are able to help cover the cost of analyses. We are currently running a cloud credits pilot in which we provide $5,000 in Cavatica compute credits to users who are conducting analyses with Kids First data. Eligible projects include analyses solely on Kids First data and analyses that bring your own data to Cavatica to analyze with Kids First data. To receive cloud credits, we ask that you submit a one-paragraph abstract of your analysis to email@example.com. If you plan on bringing your own data to Cavatica, please include some sample counts so we can roughly estimate storage costs as well. You can also consult the documentation at https://kidsfirstdrc.org/support/analyze-data/ for a better idea of how to set up your analysis.