St. Jude Children’s Research Hospital launches public repository of pediatric cancer genomics data
St. Jude Children’s Research Hospital launched an online data-sharing platform that includes more than 4,000 whole-genome sequencing data sets from three pediatric cohorts.
St. Jude Cloud, created in partnership with Microsoft and DNAnexus, gives researchers access to next-generation sequencing data and unique analysis tools to advance research on pediatric diseases.
“St. Jude Cloud is a powerful resource to drive global research and discovery forward,” Jinghui Zhang, PhD, co-leader of the St. Jude Cloud project and chair of the department of computational biology at St. Jude Children’s Research Hospital, said in a press release. “Providing genomic sequencing data to the global research community and making complex computational analysis pipelines easily accessible will lead to progress in eradicating childhood cancer. St. Jude Children’s Research Hospital has been committed to sequencing and understanding pediatric cancer genomes for nearly a decade, and we will continue to generate and share data with the research community in the future.”
HemOnc Today spoke with Zhang about how the genomics data repository can be accessed and utilized by researchers around the world.
Question: How did this genomics data repository come about?
Answer: We have done a significant amount of work to upload our data sets to public repositories. Users often had difficulty downloading data, and in some of the archives transfer errors occurred or data became corrupted. Because of this, we have had to spend a lot of time helping users get the correct data onto their local infrastructure. While working on my own research, I experienced how difficult it can be to deal with data transfer failures. Although data corruption does not happen all of the time, it can be quite frustrating when it does. It also takes a long time to download a large data set: 6 months, easily. We want to provide our users a better experience when they use our data. This is one reason why we created this repository. A second reason is that, with current data-sharing models, only institutions with substantial infrastructure resources can use data sets of this size. Smaller laboratories do not always have access to these resources, so they cannot make use of the data. This is why cloud data sharing has become a model for democratizing data. We want to enable more people to participate in cancer research, and we want them to put their efforts into analyzing the data and developing tools. St. Jude Cloud is the perfect solution to this.
Q: What does it include?
A: The repository includes three major data sets. The first is paired tumor whole-genome sequencing data from our St. Jude/Washington University Pediatric Cancer Genome Project. This data set was generated in 2010, when St. Jude invested in the biggest pediatric cancer genome sequencing effort in the world. It contributes 1,400 whole-genome data sets to the repository. The second is what we call St. Jude Lifetime Cohort Genome Sequencing. This includes 3,006 whole-genome sequencing data sets from long-term survivors of pediatric cancer. The final data set is from our Genome for Kids Clinical Genomic Sequencing Protocol. This cohort includes about 300 cases with paired exome and RNA sequencing data.
Q: How can researchers access and utilize the repository?
A: Researchers who have not applied for access can still explore aggregated data sets through our visualization portal. We have three options: data, tools and visualization. A researcher who wants to look at overall data profiles can simply go to the visualization mode and review the genes and other data of interest without applying for access. At this level, we do not offer patient-level data. It is simply aggregate data for the cohort, so there are no data security issues. Researchers who would like to look at data at the individual patient level have to apply for data access. We want to streamline access for people who only want to work with data in the cloud, because we want to minimize the need for downloads where we can. The form to apply for access takes about 15 minutes to complete. Once we receive the form, we hope to complete the approval process within 48 hours so researchers can access these data sets as soon as possible.
Q: What needs does the repository fill and what benefits does it offer?
A: Although I have talked in depth about the data sets portion of the repository, there also is the tools component. We make these data sets available for two types of researchers, one of which is computational scientists who are interested in vaccine algorithms. These researchers will now be able to upload their tools to the cloud and analyze data directly in the cloud. This will make their work much more efficient. If researchers want to download and analyze the data locally, this can normally take 6 to 9 months, whereas uploading tools to the cloud takes only several days. We also have geared this repository toward biologists with no formal computational training. These researchers may already see some germline mutations in their own cohort, but at a frequency too low to draw conclusions, so they may want to see whether their findings are replicated in an independent cohort. They can go to our repository and search by gene and, if they have been approved for data access, they can then look at disease subtypes to compare their data with what we have in the repository.
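The cohort comparison Zhang describes can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the sample IDs, the gene chosen, the cohort sizes and the `carrier_frequency` helper are hypothetical and do not reflect the St. Jude Cloud interface or its data.

```python
# Hypothetical variant records: (sample_id, gene, disease_subtype).
# These values are invented for illustration; they are not St. Jude Cloud data.
local_cohort = [
    ("L01", "TP53", "ALL"),
    ("L02", "JAK2", "ALL"),
    ("L02", "JAK2", "ALL"),  # a second variant in the same sample
]
reference_cohort = [
    ("R01", "TP53", "AML"),
    ("R02", "TP53", "ALL"),
    ("R03", "JAK2", "ALL"),
    ("R04", "NF1", "ALL"),
]


def carrier_frequency(records, gene, cohort_size):
    """Return the fraction of a cohort with at least one variant in `gene`."""
    # A set counts each sample once, even if it carries several variants.
    carriers = {sample_id for sample_id, g, _subtype in records if g == gene}
    return len(carriers) / cohort_size


# Cohort sizes are assumed; most samples carry no variant in the gene of interest.
local_freq = carrier_frequency(local_cohort, "TP53", cohort_size=20)
reference_freq = carrier_frequency(reference_cohort, "TP53", cohort_size=40)

print(f"local: {local_freq:.3f}, reference: {reference_freq:.3f}")
```

A similar carrier frequency in an independent cohort would be one informal sign that a low-frequency finding replicates. A real analysis would also need an appropriate statistical test and a breakdown by disease subtype.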
Q: Is there anything else that you would like to mention?
A: The tools and visualization components are both key features of this repository. We want to ensure that data visualization is intuitive and something that everyone can use. In this repository, we are using visualization to explore data sets further. For example, if a researcher is interested in looking at all samples with activating JAK2 mutations or fusions, this would traditionally be difficult to do. However, researchers can now use our visualization tool to select samples, place the sample IDs in a ‘shopping cart’ and then go to our cloud portal to access these samples. Many of the tools we develop are complicated and, to generate high-quality results, we have to put in a lot of filters to sort out the results. These tools are hard to migrate. If another institution wants to implement the same pipelines we have developed, there is a high overhead for us to assist people in doing this. We developed a tool, not yet published, called Rapid RNA-Seq. With this tool added to the cloud, any researcher can upload RNA-seq data and use the tool to analyze the data themselves. Making both the data and tools accessible to the broad research community is truly one of the beauties of the St. Jude Cloud repository.
– by Jennifer Southall
For more information:
Jinghui Zhang, PhD, can be reached at St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105.
Disclosure: Zhang reports no relevant financial disclosures.