Finding the most common value in parallel across nodes, and exposing it as an aggregate function.
Posts by Category
Understanding how Spark runs on JVMs and how the memory is managed in each JVM.
This post covers de-duplicating data while loading it into tables, using HashSet-based indexes in Apache Spark.
General Exception Handling
A Scala module for Apache Spark v2.0.0 that batch-processes the mapping of geolocations in two skewed data sets. Link to code: https://github.com/anish74...
A follow-up post about specifying window frames for SQL analytical functions. This assumes you have already read my previous post where I described the use of...
For a long time I faced many problems while working with databases and SQL where, in order to get a better understanding of the available data, simpl...
A comparison of use cases for Spray IO (on Akka Actors) and Akka HTTP (on Akka Streams) for creating REST APIs
Using AWS S3 as a Big Data Lake and its alternatives