- Sridhar Vemula, Salesforce.com, Inc., Indianapolis, Indiana
For years, the only decent ways to interact with operational data at Salesforce have been Splunk (for log analysis) and Graphite (for time-series metrics). Both are restrictive and slow, and their retention is short (30 days for Splunk). A better way to handle this data with maximum retention is an open-ended repository where data accumulates, so that people can bring good tools (YARN, Hive, Spark, etc.) to bear on a year's worth of data for problem solving, capacity planning, product intelligence, and more. This concept is called a "data lake," and it led to "Project Deep Sea." Project Deep Sea is a cluster of 1,160 machines running Hadoop, YARN, HBase, Hue, and Oozie (more tools, such as MPI and Lustre, will follow soon), with 27,000+ computing cores and 150 TB of RAM. The usable space on this cluster is 5 petabytes (the raw space is 15 petabytes, with 3x replication), all of which is configured, monitored, and operated using solutions such as Chef, Katello, Nagios, and Splunk. The presentation covers the technical management of such a complex infrastructure. Most design choices have been motivated by several years of experience in addressing research and enterprise needs, mainly in the Hadoop and HPC areas. The presentation also discusses problems faced, improvements applied, and new projects that emerged after the cluster went live.
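
The capacity and per-node figures above follow from simple arithmetic once HDFS-style 3x replication is taken into account; a minimal sketch in Python, using only the numbers stated in the abstract (the per-node averages are derived, not quoted):

```python
# Sanity-check of the cluster figures quoted above (a sketch; all inputs
# come from the abstract, the per-node averages are derived from them).

RAW_PB = 15          # raw HDFS capacity, petabytes
REPLICATION = 3      # HDFS replication factor
NODES = 1_160        # machines in the cluster
CORES = 27_000       # "27,000+" -- treated as a lower bound
RAM_TB = 150         # total cluster RAM, terabytes

usable_pb = RAW_PB / REPLICATION          # 5 PB usable, as stated
cores_per_node = CORES / NODES            # roughly 23 cores per machine
ram_gb_per_node = RAM_TB * 1024 / NODES   # roughly 132 GB RAM per machine

print(f"usable capacity : {usable_pb:.1f} PB")
print(f"cores per node  : {cores_per_node:.0f}")
print(f"RAM per node    : {ram_gb_per_node:.0f} GB")
```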