I was quite fortunate to be awarded a spot on this beta course with Learning Tree. Learning Tree essentially tests out its courses before making them live.
The course instructor, Max Van Daalen, was very knowledgeable and had made good use of Hadoop and Spark in his work at CERN.
The key infrastructure for the learning environment was built around CentOS: we had our own Hadoop cluster and local environments with Scala and Spark configured.
The course certainly helped my understanding of big data. There are obviously a lot of technologies out there; however, to keep things simple from a technology perspective, most of them work on the paradigm of parallel computing. To process a very large dataset, e.g. a petabyte-scale file, it may not even be possible to load the file on a single machine. To deal with such a file, you can spread it over multiple machines and have each machine work on its own portion.
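The split-and-combine idea above can be sketched in a few lines of plain Scala. This is a hypothetical, single-machine illustration: the chunks stand in for the portions that would live on separate machines, and the partial sums stand in for each machine's local result.

```scala
// Hypothetical sketch of the parallel-computing paradigm: split the data,
// let each "machine" (here, a chunk) process its portion, combine the results.
object SplitAndCombine {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toArray           // stand-in for a huge file
    val chunks = data.grouped(250000).toSeq     // four "machines", one chunk each
    val partialSums = chunks.map(_.map(_.toLong).sum) // each works on its own portion
    val total = partialSums.sum                 // combine the partial results
    println(total)                              // same answer as summing directly
  }
}
```

The combine step only works this neatly because addition is associative, which is exactly why map/reduce-style frameworks favour such operations.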
What has really opened up in the last few years is the number of solutions, quite a few of them open source, that let you exploit this paradigm. Commodity hardware and cloud computing make purchasing and deploying a cluster of computers very cost-effective.
One big change from a programming perspective is how you take advantage of the parallel architecture and express the processing succinctly. Functional programming, which has been around since the 1970s, seems to be the way to go. Scala, a functional programming language with some nice hybrid object-oriented options built on Java, provides an elegant way for programmers to express their intent.
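To show what "expressing intent" looks like, here is a small, self-contained example (the data is made up): counting word frequencies with no explicit loops or mutable state, just a pipeline of transformations.

```scala
// A tiny functional pipeline in Scala: no loops, no mutation, just intent.
object WordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("spark makes big data simple", "scala makes spark simple")
    val counts = lines
      .flatMap(_.split("\\s+"))               // split every line into words
      .groupBy(identity)                      // group identical words together
      .map { case (w, ws) => (w, ws.size) }   // word -> frequency
    println(counts("simple"))                 // 2
    println(counts("spark"))                  // 2
  }
}
```

The same pipeline shape carries over almost verbatim to Spark, where the collection is distributed over a cluster instead of held in local memory.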
The course covers Scala in good depth whilst also exposing us to the various libraries provided by Apache Spark. Spark is a very powerful open-source platform with advanced data analytics capabilities.
My favourite exercise on the course was using Spark Streaming in conjunction with the Twitter API to perform real-time monitoring of social media. This could be further enhanced with technologies like Kafka, which allow a cluster of machines to ingest multiple streams.
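The shape of that exercise was roughly as follows. This is a hedged sketch, not a runnable standalone program: it assumes Spark and the `spark-streaming-twitter` connector (later maintained under Apache Bahir) are on the classpath, with Twitter credentials supplied via system properties, and the exact API varies by connector version.

```scala
// Sketch only: requires spark-streaming and the spark-streaming-twitter
// connector on the classpath, plus Twitter OAuth credentials configured
// as system properties. Window sizes are illustrative.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterTrends {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwitterTrends").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

    val tweets = TwitterUtils.createStream(ssc, None)      // credentials from system properties
    val hashTags = tweets.flatMap(_.getText.split(" ").filter(_.startsWith("#")))

    // Count hashtags over a sliding 60-second window and print each batch.
    hashTags.map((_, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With Kafka in front, the `TwitterUtils` source would simply be swapped for a Kafka direct stream while the downstream transformations stay the same.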
Whilst quite a bit was covered in the course, I think the following are my key takeaways:
- The concept and usage of RDDs (Resilient Distributed Datasets), DataFrames and transformations, which are core to how Spark works
- HDFS, Hadoop and the supporting infrastructure, particularly the Java Virtual Machine (JVM)
- Scala, particularly case classes, flexible handling of collections, and versatile functions like map and flatMap that can be used with various data structures; some clever examples were provided, particularly using tuples
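On the first takeaway, the distinction between lazy transformations and eager actions is the part worth internalising. The sketch below assumes a local Spark installation (`spark-sql` on the classpath); names and data are illustrative.

```scala
// Hedged sketch assuming Spark is available locally: an RDD with a lazy
// transformation and a triggering action, then the same idea as a DataFrame.
import org.apache.spark.sql.SparkSession

object RddAndDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
    val doubled = rdd.map(_ * 2)       // transformation: lazy, nothing runs yet
    println(doubled.sum())             // action: triggers the distributed computation

    val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    df.filter($"age" > 30).show()      // DataFrame transformation plus an action

    spark.stop()
  }
}
```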
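And on the Scala takeaway, the features listed above fit in a few self-contained lines (names here are my own illustrations, not the course's exercises):

```scala
// Case classes, map/flatMap over collections, and tuples in plain Scala.
object ScalaFeatures {
  case class Person(name: String, age: Int) // case class: equality, copy, pattern matching for free

  def main(args: Array[String]): Unit = {
    val people = List(Person("alice", 34), Person("bob", 29))
    val names = people.map(_.name)                    // List(alice, bob)
    val pairs = people.map(p => (p.name, p.age))      // a list of tuples
    val letters = List("ab", "cd").flatMap(_.toList)  // List(a, b, c, d)

    println(names.mkString(","))   // alice,bob
    println(pairs.head._2)         // 34
    println(letters.size)          // 4
  }
}
```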
Overall it's a course I would recommend, and it has certainly helped me build a better foundation in Big Data.