Working with task callbacks in Airflow

Photo by Conor Brown on Unsplash

We are living in the Airflow era. Almost all of us started our scheduling journey with cronjobs and the transition to a workflow scheduler like Airflow has given us better handling with complex inter-dependent pipelines, UI based scheduling, retry mechanism, alerts & what not! AWS also recently announced managed airflow workflows. These are truly exciting times and today, Airflow has really changed the scheduling landscape, with scheduling configuration as a code.

Let’s dig deeper.

Now, coming to a use case where I really dug down in airflow capabilities. …

Dive into internal workings and memory management in Scala

Photo by Nathan Dumlao on Unsplash

I came across Scala while working with Spark, which in itself, is written in Scala. Coming from a Python background and with little to none Java knowledge, I found Scala a bit confusing in the beginning, but over time, it grew on me and now, it is my preferred language for most use cases.

With experience, I have picked up a few bits and pieces of scala and its workings. Please read on to find out a bit more about scala, mainly the non-coding part, how exactly code turns to execution. …

Beginner’s Guide to using Delta lake in Apache spark

Photo by Lukas Blazek on Unsplash

This is a follow up of my introduction to the Delta Lake with Apache Spark article, please read on to find out how to use Delta lake with Apache Spark to perform operations like Update existing data, check out previous versions of data, convert data to delta table, etc.

Before diving into code, let us try to understand when to use Delta Lake with Spark because it’s not like I just woke up one day and included Delta Lake in the architecture :P

Delta Lake can be used:

  • When dealing with “overwrite” of the same dataset, this is the biggest…

Step by Step guide for local installation of Apache Flink

Photo by ev on Unsplash

To work with real-time stream processing(not micro-batching, real-time), Apache Flink is the next big thing. The documentation defines Apache Flink as:

Apache Flink is a framework for stateful computations over unbounded and bounded data streams.

Follow along to run Apache Flink locally.

Step 1: Download Apache Flink

  • From the official website of Apache Flink, download the requisite binary. If you want the latest version, then according to your scala version requirements you can download either Apache Flink x.x.x for Scala 2.11 or Apache Flink x.x.x for Scala 2.12. As of August 30, 2020, Apache Flink 1.11.1 is the latest version.

Get to know the storage layer which enabled ACID and updates with Spark

Photo by Franki Chamaki on Unsplash

Let me start by introducing two problems that I have dealt time and again with my experience with Apache Spark:

  1. Data “overwrite” on the same path causing data loss in case of Job Failure.
  2. Updates in the data.

Sometimes I solved above with Design changes, sometimes with the introduction of another layer like Aerospike, or sometimes by maintaining historical incremental data.

Maintaining historical data is mostly an immediate solution but I don’t really like dealing with historical incremental data if it’s not really required as(at least for me) it introduces the pain of backfill in case of failures which may…

Load testing of an API of real-time AWS Kinesis based pipeline with the help of JMeter

Photo by Icons8 Team on Unsplash

So, I was working on a real-time pipeline, and after set up, the next step for me was its load testing as I don’t live dangerously enough to productionize it right away!

A minute of silence for the bugs we have all faced in production.

Okay, the minute is over. Back to the use case, I expected around 100 records/second on average on my pipeline. Also, I wanted to find the threshold of my pipeline: how much load can it take before breaking.

The tool I used for load testing was, JMeter. I learned how to use JMeter during my…

Internals of Spark Join & Spark’s choice of Join Strategy

While dealing with data, we have all dealt with different kinds of joins, be it inner, outer, left or (maybe)left-semi. This article covers the different join strategies employed by Spark to perform the join operation. Knowing spark join internals comes in handy to optimize tricky join operations, in finding root cause of some out of memory errors, and for improved performance of spark jobs(we all want that, don’t we?). Please read on to find out.

Photo by Russ Ward on Unsplash

Spark Join Strategies:

Broadcast Hash Join

Before beginning the Broadcast Hash join spark, let’s first understand Hash Join, in general:

Problem Statement:
You are given coins of different denominations and a total amount of money. Write a function to compute the number of combinations that make up that amount. You may assume that you have infinite number of each kind of coin.

Example 1:

Input: amount = 5, coins = [1, 2, 5]
Output: 4
Explanation: there are four ways to make up the amount:
Example 2:

Input: amount = 3, coins = [2]
Output: 0
Explanation: the amount of 3 cannot be made up just with coins of 2.
Example 3:

Input: amount = 10…

Amazon Kinesis is an excellent tool to be a part of the real data world. It is a managed service and saves a lot of time in building a stable real-time pipeline of let’s say, Kafka. Within Kinesis, you can start ingesting your data in a few minutes. Let’s see how.

I won’t talk about how to create a Kinesis Data Stream, that’s pretty simple and involves filling necessary information & clicking lot’s of Next and you are done! One thing I do would like to talk about is shards.

Shards in Data Stream

To decide the number of shards…

So, I was working with one use case to send a spark DataFrame results over e-mail as an attachment. I looked through multiple sources for the same in Scala and couldn’t find an appropriate way to send the results in DataFrame over e-mail. Finally, I decided to work my way through it. I’ll discuss my approach here so it might help others along the way :)

The steps mentioned are of Scala but can be used in Java too with minimal changes.

Step 1: Create an email session, if not already created:

import javax.mail.PasswordAuthentication
import javax.mail.Session
import java.util.Properties
props: Properties…

Jyoti Dhiman

Big Data Engineer@LinkedIn | Data > Opinions

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store