A first-hand encounter with why brands matter!

Photo by Nikolay Tarashchenko on Unsplash

Before I tell this story, you should know that my birthday falls on October 6 (feel free to send me birthday gifts).

So yeah, my birth date is the sixth, and I don’t much believe in supernatural phenomena. I have seen horror movies like anyone else, but nothing beyond that.

Until this happened.


Get started with Future and Callbacks in Scala

Happy weekend folks!

No, this article is not about what Scala will look like in the year 2030, but about getting started with Futures in Scala.

When I was starting out, it took me some time and a lot of resources to get familiar with Future in Scala and to understand when, how, and why to use Future/Callbacks. With that experience, here’s a guide for anyone like me who needs to get started with Future in Scala.

Photo by Bradyn Trollip on Unsplash

I believe Future is to Scala what promises are to JavaScript: a way to handle asynchronous work. …
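To make the Future-plus-callback idea concrete, here is a minimal sketch using Python's `concurrent.futures` as an analogue. It is not Scala's `Future` API; the names `slow_square` and `run_with_callback` are hypothetical and exist only for this illustration. The pattern is the same, though: start the work asynchronously, then register a callback that runs when the result (or the failure) is available.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def slow_square(n):
    """Hypothetical stand-in for slow work, e.g. a network call."""
    return n * n


def run_with_callback(n):
    """Run slow_square on a worker thread and record what the callback sees."""
    seen = []

    def on_done(fut: Future) -> None:
        # Called once the future has completed, with a value or an exception.
        error = fut.exception()
        if error is None:
            seen.append(("success", fut.result()))
        else:
            seen.append(("failure", type(error).__name__))

    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(slow_square, n)
        fut.add_done_callback(on_done)
    # Leaving the with-block waits for the worker, so the callback has run.
    return seen
```

Calling `run_with_callback(4)` records one success entry, while passing a value that cannot be squared (such as a string) records a failure entry instead of raising in the caller.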


Some not-so-common Git disasters and their fixes

I’ve had my fair share of encounters (read: disasters) while working with Git. Well, that’s how you learn, right? I remember the days when, before running any Git command, I used to copy the whole project as a backup, and the process continued with a backup of the backup, and so on. Kind of redundant when using version control, no?

And never mind the old days: even last year, I wiped out the project completely on my machine after blindly copying a command from Stack Overflow.

So, two learnings for me from that episode:

  • Never run any git command blindly (well, this…


Working with task callbacks in Airflow

Photo by Conor Brown on Unsplash

We are living in the Airflow era. Almost all of us started our scheduling journey with cron jobs, and the transition to a workflow scheduler like Airflow has given us better handling of complex inter-dependent pipelines, UI-based scheduling, a retry mechanism, alerts, and what not! AWS also recently announced Amazon Managed Workflows for Apache Airflow. These are truly exciting times, and Airflow has really changed the scheduling landscape, with scheduling configuration as code.

Let’s dig deeper.

Now, coming to a use case where I really dug into Airflow’s capabilities. …


Dive into internal workings and memory management in Scala

Photo by Nathan Dumlao on Unsplash

I came across Scala while working with Spark, which is itself written in Scala. Coming from a Python background with little to no Java knowledge, I found Scala a bit confusing in the beginning, but over time it grew on me, and now it is my preferred language for most use cases.

With experience, I have picked up a few bits and pieces about Scala and its workings. Please read on to find out a bit more about Scala, mainly the non-coding part: how exactly code turns into execution. …


Beginner’s Guide to using Delta Lake in Apache Spark

Photo by Lukas Blazek on Unsplash

This is a follow-up to my introduction to Delta Lake with Apache Spark. Please read on to find out how to use Delta Lake with Apache Spark to perform operations like updating existing data, checking out previous versions of data, converting data to a Delta table, etc.

Before diving into code, let us try to understand when to use Delta Lake with Spark, because it’s not like I just woke up one day and included Delta Lake in the architecture :P

Delta Lake can be used:

  • When dealing with “overwrite” of the same dataset, this is the biggest…


Step-by-step guide for local installation of Apache Flink

Photo by ev on Unsplash

For real-time stream processing (real-time, not micro-batching), Apache Flink is the next big thing. The documentation defines Apache Flink as:

Apache Flink is a framework for stateful computations over unbounded and bounded data streams.

Follow along to run Apache Flink locally.

Step 1: Download Apache Flink

  • From the official website of Apache Flink, download the requisite binary. If you want the latest version, download either Apache Flink x.x.x for Scala 2.11 or Apache Flink x.x.x for Scala 2.12, depending on your Scala version requirements. As of August 30, 2020, Apache Flink 1.11.1 is the latest version.


Get to know the storage layer which enabled ACID and updates with Spark

Photo by Franki Chamaki on Unsplash

Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:

  1. Data “overwrite” on the same path causing data loss in case of job failure.
  2. Updates in the data.

Sometimes I solved these with design changes, sometimes by introducing another layer like Aerospike, and sometimes by maintaining historical incremental data.

Maintaining historical data is mostly an immediate solution, but I don’t really like dealing with historical incremental data if it’s not really required, as (at least for me) it introduces the pain of backfill in case of failures which may…


Load testing the API of a real-time AWS Kinesis-based pipeline with the help of JMeter

Photo by Icons8 Team on Unsplash

So, I was working on a real-time pipeline, and after setting it up, the next step for me was load testing it, as I don’t live dangerously enough to productionize it right away!

A minute of silence for the bugs we have all faced in production.

Okay, the minute is over. Back to the use case: I expected around 100 records/second on average on my pipeline. I also wanted to find the threshold of my pipeline, that is, how much load it can take before breaking.

The tool I used for load testing was JMeter. I learned how to use JMeter during my…


Internals of Spark Join & Spark’s choice of Join Strategy

While dealing with data, we have all dealt with different kinds of joins, be it inner, outer, left, or (maybe) left-semi. This article covers the different join strategies employed by Spark to perform the join operation. Knowing Spark’s join internals comes in handy for optimizing tricky join operations, finding the root cause of some out-of-memory errors, and improving the performance of Spark jobs (we all want that, don’t we?). Please read on to find out.

Photo by Russ Ward on Unsplash

Spark Join Strategies:

Broadcast Hash Join

Before getting into Spark’s Broadcast Hash Join, let’s first understand Hash Join in general:
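As a rough illustration of the general idea (not Spark's implementation), a hash join can be sketched in plain Python in two phases: build a hash table on the smaller input keyed by the join column, then probe it with each row of the larger input. The function name `hash_join` and the list-of-dicts row format are assumptions made for this sketch.

```python
from collections import defaultdict


def hash_join(small, large, key):
    """Inner-join two lists of dict rows on `key` using a hash table."""
    # Build phase: index every row of the smaller input by its join key.
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)

    # Probe phase: look up each row of the larger input in the hash table.
    joined = []
    for row in large:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "amount": 10}, {"id": 1, "amount": 20}, {"id": 3, "amount": 5}]
result = hash_join(users, orders, "id")
```

Building on the smaller side keeps the hash table small, which is why this approach suits joins where one input fits comfortably in memory.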

Jyoti Dhiman

Big Data Engineer at Linked[in]
