Before I tell this story, you should know that my birthday falls on October 6 (feel free to send me birthday gifts).
So yeah, my birth date is the sixth, and I don’t much believe in supernatural phenomena. I have seen horror movies like any other person, but nothing beyond that.
Until this happened.
Happy weekend, folks!
No, this article is not about what Scala will look like in the year 2030, but about getting started with Futures in Scala.
Starting out, it took me some time to get familiar with Futures in Scala, and a lot of resources to understand when, how, and why to use Futures and callbacks. With that experience, here’s a guide for anyone like me who needs to get started with Futures in Scala.
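As a taste of what the article covers, here is a minimal sketch of creating a Future and attaching a callback, assuming the global ExecutionContext; the object name and the sleep are made up for illustration:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object FutureDemo extends App {
  // A Future runs its body asynchronously on a thread
  // supplied by the implicit ExecutionContext.
  val answer: Future[Int] = Future {
    Thread.sleep(500) // stand-in for a slow I/O call
    42
  }

  // Register a callback instead of blocking the calling thread.
  answer.onComplete {
    case Success(value) => println(s"Got $value")
    case Failure(err)   => println(s"Failed: ${err.getMessage}")
  }

  // Blocking with Await is mostly for tests and demos, not production code.
  Await.result(answer, 2.seconds)
}
```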
I’ve had my fair share of encounters (read: disasters) while working with git. Well, that’s how you learn, right? I remember the days when, before running any git command, I used to copy the whole project as a backup, and the process continued with a backup of the backup, and so on. Kind of redundant when using version control, no?
And it’s not just the old days: even last year, I wiped out a project completely on my machine after blindly copying a command from StackOverflow.
So, two learnings for me from that episode:
We are living in the Airflow era. Almost all of us started our scheduling journey with cron jobs, and the transition to a workflow scheduler like Airflow has given us better handling of complex interdependent pipelines, UI-based scheduling, retry mechanisms, alerts, and what not! AWS also recently announced managed Airflow workflows. These are truly exciting times: Airflow has really changed the scheduling landscape by treating scheduling configuration as code.
Let’s dig deeper.
Now, coming to a use case where I really dug into Airflow’s capabilities. …
I came across Scala while working with Spark, which is itself written in Scala. Coming from a Python background and with little to no Java knowledge, I found Scala a bit confusing in the beginning, but over time it grew on me, and now it is my preferred language for most use cases.
With experience, I have picked up a few bits and pieces of Scala and its workings. Please read on to find out a bit more about Scala, mainly the non-coding part: how exactly code turns into execution. …
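To make that journey from source to execution concrete, here is a tiny sketch; the file and object names are made up for illustration, and the commands in the comments assume a standard Scala installation:

```scala
// Hello.scala — a minimal program to trace from source to execution.
// `scalac Hello.scala` compiles this file into Hello.class (JVM bytecode);
// `scala Hello` then executes that bytecode on the JVM, much as Java's
// `java` launcher would; `javap -c Hello` lets you inspect the bytecode.
object Hello {
  def main(args: Array[String]): Unit =
    println("compiled to JVM bytecode, executed by the JVM")
}
```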
This is a follow-up to my Introduction to Delta Lake with Apache Spark article. Please read on to find out how to use Delta Lake with Apache Spark to perform operations like updating existing data, checking out previous versions of the data, converting data to a Delta table, etc.
Before diving into the code, let us try to understand when to use Delta Lake with Spark, because it’s not like I just woke up one day and included Delta Lake in the architecture :P
Delta Lake can be used:
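As a quick taste of the operations mentioned above, here is a minimal sketch, assuming Spark 3.x with the io.delta:delta-core dependency on the classpath; the paths and values are made up for illustration:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object DeltaDemo extends App {
  val spark = SparkSession.builder()
    .appName("delta-demo")
    .master("local[*]")
    // Wire in Delta Lake's SQL extensions and catalog.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()

  // Write some data as a Delta table (hypothetical path).
  spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save("/tmp/delta/demo")

  // Update existing data in place.
  val table = DeltaTable.forPath(spark, "/tmp/delta/demo")
  table.update(expr("id = 3"), Map("id" -> expr("id + 100")))

  // Time travel: read a previous version of the table.
  spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo").show()

  // Convert an existing Parquet directory into a Delta table (hypothetical path).
  // DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/existing`")
}
```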
Apache Flink is a framework for stateful computations over unbounded and bounded data streams.
Follow along to run Apache Flink locally.
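Here is a minimal sketch of what a local streaming job looks like, assuming the flink-streaming-scala dependency (Flink's Scala API in older releases); the socket source and port are just for illustration:

```scala
import org.apache.flink.streaming.api.scala._

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Local execution environment: runs an embedded mini-cluster.
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Unbounded stream from a local socket (e.g. start `nc -lk 9999` first).
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(_._1) // Flink keeps running state per word
      .sum(1)

    counts.print()
    env.execute("Socket word count")
  }
}
```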
Let me start by introducing two problems that I have dealt with time and again in my experience with Apache Spark:
Sometimes I solved the above with design changes, sometimes with the introduction of another layer like Aerospike, and sometimes by maintaining historical incremental data.
Maintaining historical data is often the most immediate solution, but I don’t really like dealing with historical incremental data if it’s not truly required, as (at least for me) it introduces the pain of backfills in case of failures, which may…
So, I was working on a real-time pipeline, and after setting it up, the next step for me was load testing, as I don’t live dangerously enough to productionize it right away!
A minute of silence for the bugs we have all faced in production.
Okay, the minute is over. Back to the use case: I expected around 100 records/second on average on my pipeline. I also wanted to find my pipeline’s threshold: how much load it could take before breaking.
The tool I used for load testing was JMeter. I learned how to use JMeter during my…
While dealing with data, we have all dealt with different kinds of joins, be it left or (maybe) left-semi. This article covers the different join strategies employed by Spark to perform the join operation. Knowing Spark’s join internals comes in handy for optimizing tricky join operations, finding the root cause of some out-of-memory errors, and improving the performance of Spark jobs (we all want that, don’t we?). Please read on to find out.
Before getting into Spark’s Broadcast Hash Join, let’s first understand the hash join in general:
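To see the broadcast variant in action, here is a minimal sketch using Spark’s DataFrame API; the table names and data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinDemo extends App {
  val spark = SparkSession.builder()
    .appName("broadcast-join-demo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // A large fact table and a small dimension table (hypothetical data).
  val orders = Seq((1, "a", 100), (2, "b", 250), (3, "a", 40))
    .toDF("order_id", "cust_id", "amount")
  val customers = Seq(("a", "Alice"), ("b", "Bob"))
    .toDF("cust_id", "name")

  // The broadcast hint asks Spark to ship the small side to every executor,
  // so each task builds an in-memory hash table and the shuffle is avoided.
  val joined = orders.join(broadcast(customers), "cust_id")

  joined.explain() // the physical plan should show BroadcastHashJoin
  joined.show()
}
```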
Big Data Engineer at Linked[in]