Project Deep Sea

Presenter

  • Sridhar Vemula, Salesforce.com, Inc., Indianapolis, Indiana
September 29, 2017 - 12:45pm to 1:45pm

Abstract 

For years, the only decent ways to interact with operational data at Salesforce have been Splunk (for log analysis) and Graphite (for time-series metrics). Both are restrictive and slow, and their retention is short (30 days for Splunk). A better way to handle this data with maximum retention is an open-ended repository where data piles up, so that people can bring good tools (YARN, Hive, Spark, etc.) to bear on a year's worth of data for problem solving, capacity planning, product intelligence, and similar work. This concept is called a “data lake”, which led to “Project Deep Sea”. Project Deep Sea is a cluster of 1,160 machines running Hadoop, YARN, HBase, Hue, and Oozie (more tools, such as MPI and Lustre, will follow soon), with 27,000+ computing cores and 150 TB of RAM. The usable space on this cluster is 5 petabytes (the actual space is 15 petabytes, with 3x replication). The cluster is configured, monitored, and operated using tools such as Chef, Katello, Nagios, and Splunk.

The presentation covers the technical management of such a complex infrastructure. Most design choices were motivated by several years of experience in addressing research and enterprise needs, mainly in the Hadoop and HPC areas. The presentation also discusses problems faced, improvements applied, and new projects that came to light after the cluster went live.

Sponsoring Organization 

National Center for Computational Sciences

Location

  • Computational Sciences Building
  • Building: 5600
  • Room: E-202

Contact Information
