High-Performance Machine Learning on Oracle Cloud with H2O

Running on a multi-CPU VM

Introduction

If you read an introductory book on Machine Learning (ML), developing a model can seem like an easy, interesting, and pleasant task. In a real business case, the reality is quite different: developing an effective model can be a difficult and rather long task. To achieve good predictive power you need a lot of data (the more the better), and often many attempts with different algorithms and with different values for the hyper-parameters of the model.

That is why you often need high computational power, to be able to iterate over many training runs in a reasonable time. The availability of ML algorithms able to use all the available CPUs of the underlying infrastructure, and to scale beyond a single machine, is therefore crucial to reach your business goals in the desired timeframe. It is in this context that frameworks like H2O can become an important piece of the entire solution.

In this first story dedicated to the subject, I want to quickly describe what H2O is and provide some details of tests I have run on Oracle Cloud Infrastructure (OCI) to verify the scalability and performance of the H2O ML framework.

Sparkling H2O.

H2O.ai is a Silicon Valley company whose declared mission is "Democratizing Artificial Intelligence for Everyone": in other words, to make it easier to adopt ML techniques for solving business problems. The core of their offering is H2O (now at version 3, H2O-3). The best description can be read on their Web site:

"H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.
H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models."

So, the key ingredients are:

- support for the most important "classic" algorithms, for regression and classification, supervised and unsupervised: GLM, Naive Bayes, SVM, K-Means, PCA
- support for more modern, state-of-the-art algorithms: Distributed Random Forest, Gradient Boosting Machine (GBM), XGBoost, Stacked Ensembles, Isolation Forest, Deep Neural Networks (DNN)
- in-memory computation
- an implementation capable of using all the available cores on a single VM and of scaling linearly over a cluster of connected nodes, still using all available cores
- client libraries for Python, R, and JavaScript
- AutoML
- simple deployment of trained models, based on POJO (and MOJO) artifacts

Besides H2O, you can use:

- H2O4GPU, which provides an implementation of the above-mentioned algorithms running on NVIDIA GPUs
- Sparkling Water, an implementation running on top of Hadoop and Apache Spark clusters, for Big Data environments

The implementation has a client-server architecture: the core is a Java application (h2o.jar) providing a REST interface; libraries supporting Python, R, and JavaScript programming are provided on top of it.

H2O architecture (from h2o.ai official documentation)

There is also a graphical, notebook-based interactive interface that doesn't require any coding: it is called Flow. Using Flow, if you know what you're doing, you can travel the entire journey needed to develop and test your model without writing a single line of code. Nice and productive.

Last, but not least, H2O also has important AutoML functionality: to search for the best model, you simply define the search boundaries (which algorithms to try) and let the engine explore the whole space of algorithms and hyper-parameters.

For completeness, H2O.ai also sells a licensed product called Driverless AI.
Driverless AI makes the development, testing, and interpretation of models even easier, with a nice dark User Interface.

Driverless AI (from the H2O.ai site)

The tests.

After some time spent playing with H2O, I decided to set up a test environment on Oracle Cloud Infrastructure (OCI) to verify:

- the performance and scalability of H2O
- how easy it is to develop a reasonably good model with lots of data

The dataset I decided to use is the "Higgs Dataset". The problem this dataset addresses is fascinating: how to quickly recognize high-energy particles (the famous Higgs boson) using Machine Learning techniques. You can find many articles online on this challenge.

The dataset contains 11 million samples, provided as a CSV file of about 8 GB on disk. The first column (values: 0, 1) contains the label (1 means it is a true Higgs boson signal, 0 means "noise"); the other 28 columns contain numeric values representing kinematic data measured by the detectors. You can find a complete description of the dataset in the UCI Machine Learning Repository.

The main reason I adopted this dataset is that it is large and, with some variations, it has been used in a famous Kaggle challenge. I also already knew that this classification task is not an easy one.

The test infrastructure is a 2-VM cluster set up in OCI. Each VM has 24 OCPUs (cores). The two VMs are connected through the high-performance, non-oversubscribed network provided by OCI. Shared storage for the data is implemented using File Storage (NFS-mounted). Communication between the two nodes of the H2O cluster is configured to use unicast.

Test Environment on OCI

I decided to build a Gradient Boosting Machine (GBM) model on the Higgs Dataset. Gradient Boosting is part of the so-called "ensemble models", based on Decision Trees (DT).
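To make the boosting idea concrete before going further, here is a toy, pure-Python sketch: small "stump" trees are fitted one after another, each one to the residual errors of the ensemble built so far, and each is added scaled by a small learning rate. The data and the parameter values are made up for illustration; this shows the principle only, not H2O's actual implementation.

```python
# Toy illustration of gradient boosting with squared loss: each round
# fits a depth-1 regression "stump" to the residuals of the current
# ensemble, then adds it scaled by a small learning rate.
# Data and parameters are made up; this is NOT H2O's implementation.

def fit_stump(xs, residuals):
    """Best single-split model: two constants, chosen to minimize SSE."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, n_trees, learn_rate):
    """Additive ensemble: predictions start at 0 and are refined per round."""
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up smooth 1-D target: more boosting rounds give lower training error.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
print(mse(ys, boost(xs, ys, n_trees=1, learn_rate=0.1)))
print(mse(ys, boost(xs, ys, n_trees=200, learn_rate=0.1)))
```

In the real model the trees are much deeper and the loss is a classification loss, but the mechanics are the same: many small corrections, with the learning rate controlling how small each one is. This is why a large tree count combined with a small learning rate is computationally heavy.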
The basic idea of GBM is to iteratively improve the performance of Decision Trees by adding trees that concentrate on the samples where the ensemble built so far was not working well.

These are the (non-default) hyper-parameters I adopted for the model, defined after some trials:

- ntrees: 1000
- max_depth: 10
- min_rows: 10
- learn_rate: 0.05

The large number of trees and the small learning rate increase the computational power needed to train the model. But, with a large number of OCPUs, the training time can be kept small.

The development of the model.

As said, the entire development of the model has been done using Flow, the Web UI of H2O-3. I have not written a single line of code. In the default setup, H2O Flow is accessible on port 54321.

H2O Flow UI starting page

This is the list of steps needed to develop the model:

1. Import the file
2. Set up parsing of the file; mark column C1 as ENUM (the label)
3. Parse the file
4. Verify there are no missing values
5. Split the (Data)Frame into training and validation sets (90%/10%)
6. Build the model: from the menu choose GBM and set the non-default values (see the list above)

Results.

Finally, here are the results:

Build time: 9 min. 13 sec.

Scoring history:

From the scoring history, you can see that after 500 trees performance is not greatly improving and that (to be confirmed) there is no big overfitting, i.e., no large difference in performance between the training and validation sets.

GBM also gives you a measure of feature importances, from which we could decide to develop a simpler model using, for example, only the first 10 features:

Finally, the Confusion Matrix, computed on the validation set:

The Confusion Matrix

The Confusion Matrix is quite interesting, and it confirms that the problem is not an easy one:

- on Class 1 samples (true boson) the accuracy achieved is 0.87
- on Class 0 samples (noise) the accuracy is only 0.58

Globally, the model gives you an accuracy of about 73%. Not bad. But the model doesn't perform well on Class 0 samples. There is certainly room for improvement, but the entire exercise took me only half a day.
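For reference, this is how the per-class and global accuracy figures are derived from a confusion matrix like the one Flow displays. The counts below are invented (the real matrix is in the Flow screenshot) and are chosen only so that the calculation lands in the same ballpark as the figures above.

```python
# Per-class and overall accuracy from a 2x2 confusion matrix, as
# reported by H2O Flow. The counts are INVENTED for illustration,
# chosen only to reproduce accuracies close to those in the article.

confusion = {
    # actual class -> {predicted class: count}
    0: {0: 2900, 1: 2100},  # "noise" row (hypothetical counts)
    1: {0: 650, 1: 4350},   # "true boson" row (hypothetical counts)
}

def class_accuracy(matrix, cls):
    """Fraction of samples whose actual class is `cls` predicted correctly."""
    row = matrix[cls]
    return row[cls] / sum(row.values())

def overall_accuracy(matrix):
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

print(class_accuracy(confusion, 0))   # 0.58
print(class_accuracy(confusion, 1))   # 0.87
print(overall_accuracy(confusion))    # 0.725
```

With roughly balanced classes, the global accuracy sits between the two per-class figures, close to the ~73% reported above; the low Class 0 accuracy is what drags it down.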
And the fact that training, in this environment, took only about 10 minutes is reassuring. Comparing the Confusion Matrix computed on the training and validation sets, we can confirm that there is no overfitting.

Ideas for improvement:

- The first thing I would try is AutoML. With AutoML we can define the boundaries of the exploration (which algorithms to use) and let the engine figure out the best model, with the best hyper-parameters. It will take more time (hours).
- The second is to try Stacked Ensembles.

One last question: was the engine able to make full use of the 24 + 24 OCPUs? The answer is yes. The "htop" utility, run on both nodes, confirmed that H2O runs at full speed, with 100% CPU utilization (the first picture, at the beginning of the article, is the htop report from one node).

Conclusion.

From the tests I have done, H2O and Oracle OCI seem a good combination to provide a high-performance, easy-to-use platform for developing Machine Learning models.

High-Performance Machine Learning on Oracle Cloud with H2O was originally published in Towards Data Science on Medium.

