We are living in the Airflow era. Almost all of us started our scheduling journey with cron jobs, and the transition to a workflow scheduler like Airflow has given us better handling of complex, inter-dependent pipelines, UI-based scheduling, retry mechanisms, alerts & what not! AWS also recently announced Managed Workflows for Apache Airflow. These are truly exciting times, and today, Airflow has really changed the scheduling landscape, with scheduling configuration as code.
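To make the "inter-dependent pipelines with retries" idea concrete, here is a toy sketch of what a workflow scheduler does under the hood: tasks form a DAG and run in dependency order, with retries on failure. This is purely illustrative Python, not Airflow's actual API.

```python
# Toy workflow runner: tasks form a DAG, each task runs only after
# its upstream tasks, and failed tasks are retried. Illustrative only.

def run_dag(tasks, deps, retries=2):
    """tasks: {name: callable}; deps: {name: [upstream task names]}."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):  # run dependencies first
            run(upstream)
        for attempt in range(retries + 1):  # simple retry loop
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
deps = {"transform": ["extract"], "load": ["transform"]}
print(run_dag(tasks, deps))  # → ['extract', 'transform', 'load']
```

A real scheduler adds cycle detection, parallelism, backoff between retries, and persistence — but "configuration as code" boils down to the `tasks`/`deps` structure above being plain code.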
Let’s dig deeper.
Now, coming to a use case where I really dug into Airflow's capabilities. …
I came across Scala while working with Spark, which itself is written in Scala. Coming from a Python background and with little to no Java knowledge, I found Scala a bit confusing in the beginning, but over time it grew on me, and now it is my preferred language for most use cases.
With experience, I have picked up a few bits and pieces of Scala and its workings. Please read on to find out a bit more about Scala, mainly the non-coding part: how exactly code turns into execution. …
This is a follow-up to my Introduction to Delta Lake with Apache Spark article. Please read on to find out how to use Delta Lake with Apache Spark to perform operations like updating existing data, checking out previous versions of data, converting data to a Delta table, etc.
Before diving into code, let us try to understand when to use Delta Lake with Spark because it’s not like I just woke up one day and included Delta Lake in the architecture :P
Delta Lake can be used:
Apache Flink is a framework for stateful computations over unbounded and bounded data streams.
Follow along to run Apache Flink locally.
Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:
Sometimes I solved the above with design changes, sometimes with the introduction of another layer like Aerospike, and sometimes by maintaining historical incremental data.
Maintaining historical data is usually the most immediate solution, but I don't really like dealing with historical incremental data if it's not really required, as (at least for me) it introduces the pain of backfills in case of failures, which may…
So, I was working on a real-time pipeline, and after setting it up, the next step for me was load testing, as I don't live dangerously enough to productionize it right away!
A minute of silence for the bugs we have all faced in production.
Okay, the minute is over. Back to the use case: I expected around 100 records/second on average in my pipeline. I also wanted to find the threshold of my pipeline: how much load it can take before breaking.
The tool I used for load testing was JMeter. I learned how to use JMeter during my…
While dealing with data, we have all dealt with different kinds of joins, be it left or (maybe) left-semi. This article covers the different join strategies employed by Spark to perform the join operation. Knowing Spark join internals comes in handy for optimizing tricky join operations, finding the root cause of some out-of-memory errors, and improving the performance of Spark jobs (we all want that, don't we?). Please read on to find out.
Before getting into Spark's Broadcast Hash Join, let's first understand the hash join in general:
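As a warm-up, here is a minimal sketch of a classic in-memory hash join: build a hash table on the smaller relation, then stream ("probe") the larger relation against it. This is illustrative Python, not Spark code — Spark's broadcast hash join applies the same idea per executor after broadcasting the small side.

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    """Inner hash join of two lists of dict rows."""
    # Build phase: index the smaller relation by its join key.
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)
    # Probe phase: for each row of the larger relation, emit matches.
    for row in large:
        for match in table.get(row[large_key], []):
            yield {**match, **row}

users = [{"id": 1, "name": "asha"}, {"id": 2, "name": "ravi"}]
events = [{"uid": 1, "event": "click"}, {"uid": 1, "event": "view"},
          {"uid": 3, "event": "click"}]
joined = list(hash_join(users, events, "id", "uid"))
print(len(joined))  # → 2 (only uid 1 has a matching user row)
```

The build side should fit in memory — which is exactly why Spark only broadcasts the smaller table, and why oversized broadcast tables are a common source of out-of-memory errors.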
You are given coins of different denominations and a total amount of money. Write a function to compute the number of combinations that make up that amount. You may assume that you have an infinite number of each kind of coin.
Input: amount = 5, coins = [1, 2, 5]
Explanation: there are four ways to make up the amount:
Input: amount = 3, coins = [2]
Explanation: the amount of 3 cannot be made up just with coins of 2.
Input: amount = 10…
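A standard bottom-up dynamic-programming sketch for this problem — coins go in the outer loop so each combination is counted once, regardless of coin order:

```python
def change(amount, coins):
    # dp[a] = number of coin combinations summing to a
    dp = [0] * (amount + 1)
    dp[0] = 1  # one way to make 0: use no coins
    # Outer loop over coins (not amounts) counts combinations,
    # not permutations: each coin type is "committed" in order.
    for coin in coins:
        for a in range(coin, amount + 1):
            dp[a] += dp[a - coin]
    return dp[amount]

print(change(5, [1, 2, 5]))  # → 4
print(change(3, [2]))        # → 0
```

Swapping the loops (amounts outer, coins inner) would count ordered sequences instead, e.g. treating 1+2 and 2+1 as distinct.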
Amazon Kinesis is an excellent tool for being a part of the real-time data world. It is a managed service and saves a lot of time compared to building a stable real-time pipeline yourself with, let's say, Kafka. With Kinesis, you can start ingesting your data in a few minutes. Let's see how.
I won't talk about how to create a Kinesis Data Stream; that's pretty simple and involves filling in the necessary information & clicking lots of Next, and you are done! One thing I would like to talk about is shards.
Shards in Data Stream
To decide the number of shards…
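As a rough illustration of the sizing math: AWS documents per-shard limits of 1 MiB/s and 1,000 records/s for writes, and 2 MiB/s for reads. A back-of-the-envelope sketch (the traffic numbers below are made up):

```python
import math

# Per-shard limits for a Kinesis Data Stream (per AWS docs):
#   writes: 1 MiB/s and 1,000 records/s per shard
#   reads:  2 MiB/s per shard
def shards_needed(write_kib_per_s, records_per_s, read_kib_per_s):
    return max(
        math.ceil(write_kib_per_s / 1024),  # write bandwidth limit
        math.ceil(records_per_s / 1000),    # write record-rate limit
        math.ceil(read_kib_per_s / 2048),   # read bandwidth limit
        1,                                  # at least one shard
    )

# Hypothetical workload: 3 MiB/s in, 2,500 records/s, 4 MiB/s out.
print(shards_needed(3 * 1024, 2500, 4 * 1024))  # → 3
```

Whichever limit is the tightest drives the shard count — here the record rate (2,500/1,000 → 3) dominates.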
So, I was working on a use case to send Spark DataFrame results over e-mail as an attachment. I looked through multiple sources for the same in Scala and couldn't find an appropriate way to send the results of a DataFrame over e-mail. Finally, I decided to work my way through it. I'll discuss my approach here so it might help others along the way :)
The steps mentioned are in Scala but can be used in Java too with minimal changes.
Step 1: Create an email session, if not already created:
import java.util.Properties

props: Properties…
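For comparison, the same overall idea sketched in Python with only the standard library: serialize the collected results to CSV bytes, build a MIME message with the CSV attached, then send it over SMTP. The addresses and hostname below are placeholders, and this is an alternative illustration, not the article's Scala code.

```python
import smtplib  # used in the (commented) send step below
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

def build_report_email(sender, recipient, subject, csv_bytes, filename):
    """Build a MIME message carrying a CSV attachment
    (e.g. collected DataFrame results serialized to CSV)."""
    msg = MIMEMultipart()
    msg["From"], msg["To"], msg["Subject"] = sender, recipient, subject
    msg.attach(MIMEText("Please find the results attached.", "plain"))
    part = MIMEApplication(csv_bytes, Name=filename)
    part["Content-Disposition"] = f'attachment; filename="{filename}"'
    msg.attach(part)
    return msg

csv_bytes = b"id,score\n1,0.9\n2,0.7\n"  # stand-in for DataFrame output
msg = build_report_email("me@example.com", "you@example.com",
                         "Job results", csv_bytes, "results.csv")

# To actually send (hostname/port are placeholders):
# with smtplib.SMTP("smtp.example.com", 587) as s:
#     s.starttls()
#     s.send_message(msg)
```

The Scala/JavaMail steps in the article follow the same shape: create a session (`Properties`), build a multipart message, attach the serialized results, and hand it to the transport.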
Big Data Engineer@LinkedIn | Data > Opinions