WHAT
Pathogen sequence data management: scaling and accelerating an infrastructure (~20 min)
Rapid technological advances have dramatically increased the availability of whole genome sequencing technology for the investigation of pathogens, including their biology, their ecology in host and environmental settings, and their monitoring in the context of public health programmes. The public data resources responsible for persisting these important data and making them available face radically increased rates of data deposition and novel patterns of access to the data. EMBL-EBI's European Nucleotide Archive, the European node of the International Nucleotide Sequence Database Collaboration, is undergoing a major engineering programme to support rapid sharing of pathogen read and assembly data, based upon simplified data structures, fully automated processing pipelines, and a range of user APIs and web interfaces to support those working in this area. In the talk, I will present the context for such a system and the roadmap for delivering new services built upon it.
Biological data fluidity (~40 min)
Rapidly advancing high-throughput platform technologies, and their application to ever broader areas of study, have brought the life sciences into their data-intensive era. No longer solely in the hands of large specialist facilities, access to these technologies, and hence the capacity to generate data at volume, is widely dispersed around the world. At the same time, making use of these data requires analyses that are compute-intensive, operate across multiple dispersed datasets, and connect with distant reference and comparator data.
The concept of 'data fluidity', in which data flow more appropriately and easily, is an important part of the informatics response to the challenges of this era. Data fluidity addresses the streamlining of flows of data around local and global networks through such methods as data compression, streaming, partitioning into manageable units appropriate and sufficient for analysis and efficient transfer.
In the talk, I will describe these challenges and present two methods applicable to sequence data that are being developed in this area: CRAM, a framework for sequence data compression and direct access to read data, and the Webin data streamer, a UDT-based network transfer protocol.
WHEN
Monday 14 October @ 11:00-12:30 (1½ hours)
WHERE
Lecture Theatre 3, Building 105, Faculty of BusEco
111 Barry St (corner of Pelham St), CARLTON.