The computing facilities for the UCSC Genomics Institute include two compute clusters available to our researchers. The clusters are built using Supermicro quad-node units. Each node has dual 16-core AMD processors and 256 GB of memory. The public network cluster, Ku, provides a total of 1024 processing cores, available through the Parasol batch processing system. The cluster runs the Linux operating system and is managed using the Rocks toolset. The private network cluster, Podk, provides 3968 processing cores that can be configured in a variety of ways using the OpenStack software tools. Both systems were designed to provide an exceptional amount of inexpensive computing power in minimal space. For memory-intensive jobs, we also have 4 large-memory development machines, each with 64 compute cores and 1 TB of memory. These computational clusters are supported by 12 data servers and 4 metadata servers running IBM GPFS; together they provide over 1 petabyte of usable, replicated network storage. The clusters connect to the network infrastructure using 10-gigabit Ethernet. We also provide a cluster running Ceph, which holds up to 450 terabytes of replicated object storage.
The Genomics Institute also runs a high-availability set of connected machines for virtual machine hosting. These redundant machines have 64 computing cores, 500 gigabytes of memory, and access to 75 terabytes of local disk space.
UC Santa Cruz Genomics Institute's System Administration Team: Haifang Telc, Jorge Garcia and Erich Weiler
The web servers for the UCSC Genome Browser are housed in a data center designed to function 24/7, 365 days a year. They consist of 6 machines with dual 12-core AMD Opteron processors; each machine offers 64 gigabytes of internal solid state storage and 128 gigabytes of memory. These machines have access to a central file server that provides 84 extra terabytes of shared disk area and a central MySQL database server that holds up to 28 terabytes of genomic data. Four additional servers, each with 256 gigabytes of memory, provide web access to BLAT (BLAST-like alignment tool) software and its memory-intensive calculations. Servers available for public use include a genome preview server for access to raw data before it has gone through QA, a server that hosts all the browser MySQL data, a server to store user-generated custom tracks, and a wiki server that holds public information and can keep track of named sessions. A local download server that provides access to our raw data serves nearly 2 terabytes of data every day. For redundancy and load balancing, we house an identical download server and one additional file server at the UC San Diego Supercomputer Center. We also provide a web server available to European users (genome-euro, hosted in Germany) and another available to Asian users (genome-asia, hosted in Japan).
Why Parallel Processors?
Computer clusters such as these are a cost-effective way to process large amounts of data. Since many bioinformatics problems are “embarrassingly parallel,” they do not require high-speed inter-process communication to perform calculations. This eliminates the need for high-priced networking equipment. By taking advantage of this fact and running parallel but separate computations on many processors, we have pioneered the development of “super-computing on-the-cheap” for the specific needs of genome presentation, annotation, and analysis.
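To illustrate what "embarrassingly parallel" means in practice, the following is a minimal Python sketch, not code used on the Ku or Podk clusters and not a description of how Parasol schedules work. It assumes a hypothetical per-sequence task (computing GC content); the function names `gc_fraction` and `run_jobs` are invented for this example. Each job reads only its own input, so workers never need to exchange data with one another.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G/C bases in one sequence.

    Hypothetical example task: it depends only on its own input,
    which is what makes the overall workload embarrassingly parallel.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def run_jobs(sequences, max_workers=None, parallel=True):
    """Apply gc_fraction to every sequence independently, preserving order.

    With parallel=False the same jobs run serially, which gives the same
    results because no job depends on any other.
    """
    if not parallel:
        return [gc_fraction(s) for s in sequences]
    # Each worker process handles its own sequences; the only communication
    # is sending inputs out and collecting results back.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(gc_fraction, sequences))
```

Because the jobs are independent, the serial and parallel paths return identical results, and adding workers needs no faster interconnect between them. When calling `run_jobs` with `parallel=True` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can start cleanly on every platform.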
Our current clusters are the fifth generation of clusters available at UCSC. The first generation was a cluster of 100 Pentium III processors that was built to assemble the first working draft of the human genome in June of 2000, using a 10,000-line program written by Jim Kent called GigAssembler. We have come a long way since then.
These computing systems are funded through the Howard Hughes Medical Institute, the National Human Genome Research Institute (NHGRI), the California Institute for Quantitative Biosciences (QB3), and the National Cancer Institute.