Posts by Category

spark

Implementing Statistical Mode in Apache Spark

8 minute read

Finding the most common value in parallel across nodes and exposing it as an aggregate function.

Analyzing Java Garbage Collection Logs for debugging and optimizing Apache Spark jobs

10 minute read

Understanding how Spark runs on JVMs and how memory is managed in each JVM.

Using HashSet based indexes in Apache Spark

14 minute read

This post covers de-duplicating data while loading it into tables, using HashSet-based indexes in Apache Spark.

Exception Handling in Spark Data Frames

7 minute read

General Exception Handling

Geo Location Batch Search in Spark

3 minute read

A module written in Scala for Apache Spark v2.0.0 to batch process mapping of Geo Locations in two skewed data sets. Link to code: https://github.com/anish74...

sql

SQL Analytical Functions - II - Window Frames, ROWS and RANGE

2 minute read

A follow-up post about specifying window frames for SQL analytical functions. This assumes you have already read my previous post where I described the use of...

SQL Analytical Functions - I - Overview, PARTITION BY and ORDER BY

6 minute read

For a long time I faced a lot of problems while working with databases and SQL where, in order to get a better understanding of the available data, simpl...

scala

Akka Actors vs Streams for Rest APIs

11 minute read

A comparison of use cases for Spray IO (on Akka Actors) and Akka HTTP (on Akka Streams) for creating REST APIs.

big-data

Big Data Lake in the AWS Cloud

9 minute read

Using AWS S3 as a Big Data Lake, and the alternatives to it.

Anish C
