Why LinkedIn changed their tech stack

How to learn a new codebase/language quickly. Future trends in software engineering. Technical choices in Java's implementation of Hash Tables and how they compare to other languages. And more!

Dec 10, 2021

Hey Everyone,

Today we’ll be talking about

Why LinkedIn changed their data analytics tech stack
- LinkedIn previously used third party proprietary platforms for their data analytics tech stack.
- This approach led to scaling problems and made it hard to evolve the systems.
- LinkedIn switched to using open source software and the Hadoop ecosystem.
Trends in Software Engineering - An interesting article on trends that are currently changing the way we develop software.
- Infrastructure as Code
- Progressive Web Applications
- Remote Work
- Python
Plus, a couple awesome tech snippets on
- How to learn a new codebase/language fast
- An awesome, free textbook on cryptography
- How Java implements HashMap
- How to learn mathematics with the asterisk method

We also have a solution to our last coding interview question, plus a new question from Microsoft.

Quastor Daily is a free Software Engineering newsletter that sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

Evolving LinkedIn’s Analytics Tech Stack

Steven Chuang, Qinyu Yue, Aaravind Rao and Srihari Duddukuru are engineers at LinkedIn. They published an interesting blog post on transitioning LinkedIn’s analytics stack from proprietary platforms to open source big data technologies.

Here’s a summary

During LinkedIn’s early stages (early 2010s), they were growing extremely quickly. To keep up with this growth, they leveraged several third party proprietary platforms (3PP) in their analytics stack.

Using these proprietary platforms was far quicker than piecing together off-the-shelf products.

LinkedIn relied on Informatica and Appworx for ETL to a Data Warehouse built with Teradata.

ETL stands for Extract, Transform, Load. It’s the process of copying data from various sources (the different data producers), transforming it into a consistent format, and loading it into a single destination system (usually a data warehouse) where it can more easily be consumed.
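
To make the three stages concrete, here’s a minimal Python sketch. Everything in it is a hypothetical stand-in: the event records, the field names and the in-memory SQLite database playing the role of the warehouse. It is not LinkedIn’s pipeline, just the general shape of an ETL job.

```python
import sqlite3

# Hypothetical raw events from two different data producers.
WEB_EVENTS = [{"member": "alice", "action": "VIEW"}, {"member": "bob", "action": "click"}]
MOBILE_EVENTS = [{"member": "alice", "action": "Click"}]


def extract():
    """Extract: pull raw records from every source, tagging where they came from."""
    for source, events in (("web", WEB_EVENTS), ("mobile", MOBILE_EVENTS)):
        for event in events:
            yield source, event


def transform(records):
    """Transform: normalize the records into one consistent schema."""
    for source, event in records:
        yield (event["member"], event["action"].lower(), source)


def load(rows, conn):
    """Load: write the cleaned rows into a single destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (member TEXT, action TEXT, source TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
    load(transform(extract()), conn)
    return conn


conn = run_etl()
clicks = conn.execute("SELECT COUNT(*) FROM events WHERE action = 'click'").fetchone()[0]
print(clicks)  # click events across both sources
```

Once the data sits in one table with one schema, a consumer can answer questions with a single query instead of reaching into every producer separately.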

[Illustration: LinkedIn’s legacy analytics tech stack]

This stack served LinkedIn well for 6 years, but it had some disadvantages:

Lack of freedom to evolve - Because of the closed nature of this system, they were limited in options for innovation. Also, integration with internal and open source systems was a challenge.
Difficulty in scaling - Data pipeline development was limited to a small central team due to the limits of Informatica/Appworx licenses. This increasingly became a bottleneck for LinkedIn’s rapid growth.

These disadvantages motivated LinkedIn engineers to develop a new data lake on Hadoop in parallel (a data lake lets you store raw data without having to structure it first).

You can read about how LinkedIn scaled Hadoop Distributed File System to 1 exabyte of data here.

However, they did not have a clear transition process, and that led to them maintaining both the new system and the legacy system simultaneously.

Data was copied between the tech stacks, which resulted in double the maintenance cost and complexity.

[Illustration: Maintaining redundant systems led to unnecessary complexity]

Data Migration

To solve this issue, engineers decided to migrate all datasets to the new analytics stack with Hadoop.

In order to do this, the first step was to derive LinkedIn’s data lineage.

Data lineage is the process of tracking data as it flows from data sources to consumption, including all the transformations the data underwent along the way.

Knowing this would enable engineers to plan the order of dataset migration, identify zero usage datasets (and delete them for workload reduction) and track the usage of the new vs. old system.
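
Here’s a rough sketch of how lineage information can drive those decisions. The dataset names and read counts are made up, and this is only one way to do it (a topological sort from Python 3.9’s `graphlib`), not a description of LinkedIn’s tooling.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each dataset maps to the upstream datasets it is derived from.
LINEAGE = {
    "raw_page_views": set(),
    "raw_members": set(),
    "daily_views": {"raw_page_views"},
    "member_engagement": {"daily_views", "raw_members"},
    "old_experiment": {"raw_page_views"},
}

# Hypothetical read counts over some observation window.
READS = {
    "raw_page_views": 40,
    "raw_members": 25,
    "daily_views": 12,
    "member_engagement": 90,
    "old_experiment": 0,
}


def migration_order(lineage):
    """Order datasets so that each one comes after all of its upstream sources."""
    return list(TopologicalSorter(lineage).static_order())


def zero_usage(lineage, reads):
    """Datasets that nobody reads and that nothing else is derived from."""
    has_downstream = {up for ups in lineage.values() for up in ups}
    return sorted(d for d in lineage if reads.get(d, 0) == 0 and d not in has_downstream)


order = migration_order(LINEAGE)
unused = zero_usage(LINEAGE, READS)
print(order)
print(unused)
```

Migrating in this order means a dataset’s inputs already exist on the new stack by the time the dataset itself is moved, and anything in the zero usage list can be dropped from the workload entirely.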

You can read exactly how LinkedIn handled the data lineage process in the full article.

After data lineage, engineers used this information to plan major data model revisions.

They planned to consolidate 1424 datasets down to 450, effectively cutting ~70% of the datasets from their migration workload.

They also transformed data sets that were generated from OLTP workloads into a different model that was more suited for business analytics workloads.

The migration was done using various data pipelines and illustrated bottlenecks in LinkedIn’s systems.

One bottleneck was poor read performance of the Avro file format. Engineers migrated to ORC and consequently saw a read speed increase of ~10-1000x, along with a 25-50% improvement in compression ratio.

After the data transfer, deprecating the 1400+ datasets on the legacy system would be tedious and error prone if done manually, so engineers also built an automated system to handle this process.

They built a service to coordinate the deprecation. The service would identify dataset candidates for deletion (datasets with no dependencies and low usage) and then email the users of those datasets about the upcoming deprecation.

The service would also notify SREs to lock, archive and delete the dataset from the legacy system after a grace period.
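
The flow described above could be sketched roughly like this. The `Dataset` fields, the usage threshold and the 30 day grace period are all assumptions for illustration. The source doesn’t give those details, and a real service would send emails and open SRE tickets rather than return a plan.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)  # assumed; the source does not state a length
LOW_USAGE_THRESHOLD = 5            # assumed reads per month


@dataclass
class Dataset:
    name: str
    monthly_reads: int
    dependents: list = field(default_factory=list)  # downstream datasets
    users: list = field(default_factory=list)       # email addresses of readers


def deprecation_candidates(datasets):
    """Datasets with no downstream dependencies and low usage."""
    return [d for d in datasets if not d.dependents and d.monthly_reads <= LOW_USAGE_THRESHOLD]


def plan_deprecation(datasets, today):
    """Build the user notices and the SRE task for each candidate dataset."""
    plan = []
    for d in deprecation_candidates(datasets):
        deadline = today + GRACE_PERIOD
        notices = [(user, f"{d.name} will be deprecated on {deadline.isoformat()}") for user in d.users]
        sre_task = {"dataset": d.name, "not_before": deadline, "steps": ["lock", "archive", "delete"]}
        plan.append({"notices": notices, "sre_task": sre_task})
    return plan


datasets = [
    Dataset("legacy_clicks", 0, users=["analyst@example.com"]),
    Dataset("core_members", 100),
    Dataset("legacy_sessions", 1, dependents=["legacy_clicks"]),
]
plan = plan_deprecation(datasets, date(2021, 12, 10))
print(plan)
```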

The New System

The design of the new ecosystem was heavily influenced by the old ecosystem, and addressed the major pain points from the legacy tech stack.

Democratization of data - The Hadoop ecosystem enabled data development and adoption by other teams at LinkedIn. Previously, only a central team could build data pipelines on the old system due to license limits with the proprietary platforms.
Democratization of tech development with open source projects - All aspects of the new tech stack can be freely enhanced with open source or custom-built projects.
Unification of tech stack - Simultaneously running 2 tech stacks showed the complexity and cost of maintaining redundant systems. Unifying the technology allowed for a big boost in efficiency.

LinkedIn’s new business analytics tech stack

The new tech stack has the following components

Unified Metrics Pipeline - A unified platform where developers provide ETL scripts to create data pipelines.
Azkaban - A distributed workflow scheduler that manages jobs on Hadoop.
Dataset Readers - Datasets are stored on Hadoop Distributed File System and can be read in a variety of ways.
- They can be read by DALI, an API developed to allow LinkedIn engineers to read data without worrying about its storage medium, path or format.
- They can be read by various Dashboards and ad-hoc queries for business analytics.

For more details on LinkedIn’s learnings and their process for the data (and user) migration, read the full article.

Tech Snippets

How to learn a new codebase (or programming language) FAST - Rahul Pandey is a senior software engineer at Facebook who’s written code in multiple languages at Facebook (Kotlin, Java, JavaScript, Python, Hack and C).
Here are his 3 tips to ramp up on a new language/codebase as soon as possible.
1. Start running code quickly and add log statements to observe the state of the program at various points. Don’t get stuck reading documentation about the language / codebase.
2. Intentionally break the program and explain how it broke. An example might be to increment a loop counter by 2 instead of 1 and then explain how this broke the program. Understanding the current state of the code and how it’s changing is a prerequisite to making your own modifications.
3. Make a low effort code change quickly. An easy way to do this is to add some unit tests. The more quickly you can get through the mechanics of adding a code change, the easier it will be to get to the more interesting code changes that you want to make.
A tale of Java Hash Tables - This is a great blog post that breaks down exactly how Java implements HashMap. For example, Java uses the classical Separate Chaining technique to deal with collisions while other languages (like Python, Ruby and Rust) use Open Addressing.
The article breaks down other design decisions made in Java regarding the HashMap data structure.
How to learn mathematics - The Asterisk Method. - This is a method of learning mathematics that’s been used for decades. It can be applied to study academic concepts in CS too.
The key is to handwrite your notes because handwriting has been shown to be better for memory than typing.
1. Open the book / lecture notes you wish to study.
2. Copy the relevant parts of the book or lecture notes to a notepad by hand
  - If you’re reading a book, you should be writing a summary or paraphrasing what you are reading
3. Whenever you copy something, ask yourself if you really understand it completely. As long as you are completely comfortable with what you are copying, keep going.
4. If you read something which is difficult to understand, stop and try to think about it until you understand it clearly. Draw diagrams, google it and look for other explanations, etc.
5. If you find something you really can’t understand after a long time, copy it to your notebook but put an asterisk in the margin.
6. While you continue copying, keep going back to the asterisks to see if you can understand them.
7. If you find an explanation later, erase the asterisk.
8. When you’ve copied enough material for one sitting, look over all the asterisks and see which ones you can now understand.
9. After, find other people and see if they can explain the asterisk concepts to you.
The Joy of Cryptography - This is a free undergraduate textbook that is a great introduction to the fundamentals of cryptography. It talks about concepts like hash functions, digital signatures, public key encryption and more.
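
To make the Separate Chaining idea from the Java hash table snippet above a bit more concrete, here’s a toy hash map in Python. It’s a sketch of the general technique only: Java’s real HashMap also resizes and does more work for crowded buckets, and the class and method names here are made up.

```python
class ChainedHashMap:
    """Toy hash map that resolves collisions with separate chaining."""

    def __init__(self, capacity=8):
        # Each bucket holds a list (the "chain") of (key, value) pairs.
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # colliding keys share the same chain
        self._size += 1

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def __len__(self):
        return self._size


# With only 2 buckets, ten keys are guaranteed to collide.
squares = ChainedHashMap(capacity=2)
for n in range(10):
    squares.put(n, n * n)
print(squares.get(7))
```

With Open Addressing, a colliding key would instead be placed in another slot of the same underlying array, so there would be no per-bucket lists at all.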

22 Trends in Software Development

Catarina Gralha wrote an interesting article for DZone with 22 trends in software development for 2022.

Here are a couple of the trends.

Infrastructure as Code (IaC) - IaC is becoming more common as it allows the management of your infrastructure through configuration files and allows you to adopt software development practices in infra.
The big cloud providers are investing heavily in IaC with products like AWS CloudFormation but other providers like HashiCorp (Terraform) have seen immense growth.
Progressive Web Applications (PWA) - PWAs are web applications that combine the reach of a website with the capabilities of a native mobile application. Twitter launched a PWA and saw a 75% increase in total tweets sent and a 70% reduction in data used.
Remote & Hybrid Work - This isn’t necessarily a software engineering trend, but it’s too massive to not include. In 2020, there was still a question of whether the world would revert to in-office work.
2021 showed that there’s no turning back. Remote work is here for the long term and this means huge changes in the way companies onboard new developers, develop software (code reviews, standup, meetings, etc.) and promote engineers.
Python - Regardless of whether you love it or hate it, there’s no denying the massive growth in Python’s popularity. The TIOBE index is a measure of programming language popularity and Python recently overtook Java and C to become the most popular programming language.

You can see the rest of the trends here.

Interview Question

You are given a binary tree in which each node contains an integer value. The integer value can be positive or negative.

Write a function that counts the number of paths in the binary tree that sum to a given value (the value will be provided as a function parameter).

The path does not need to start or end at the root or a leaf, but it must only go downwards.

Here’s the question in LeetCode

We’ll send the solution in our next email, so make sure you move our emails to primary, so you don’t miss them!

Gmail users—move us to your primary inbox

On your phone? Hit the 3 dots at the top right corner, click "Move to" then "Primary"
On desktop? Back out of this email then drag and drop this email into the "Primary" tab near the top left of your screen
A pop-up will ask you “Do you want to do this for future messages from quastor@substack.com” - please select yes

Apple mail users—tap on our email address at the top of this email (next to "From:" on mobile) and click “Add to VIPs”

Previous Solution

As a reminder, here’s our last question

You are given a character array containing a set of words separated by whitespace.

Your task is to modify that character array so that the words all appear in reverse order.

Do this without using any extra space.

Example

input - ['A', 'l', 'i', 'c', 'e', ' ', 'l', 'i', 'k', 'e', 's', ' ', 'B', 'o', 'b']

output - ['B', 'o', 'b', ' ', 'l', 'i', 'k', 'e', 's', ' ', 'A', 'l', 'i', 'c', 'e']

Here’s the question in LeetCode.

Solution

We can solve this question by following these two steps

1. Reverse the input array
2. Go through each word in the input array and reverse each individual word

So, for the input example of

input - ['A', 'l', 'i', 'c', 'e', ' ', 'l', 'i', 'k', 'e', 's', ' ', 'B', 'o', 'b']

We first reverse the input array, so we’ll have

input - ['b', 'o', 'B', ' ', 's', 'e', 'k', 'i', 'l', ' ', 'e', 'c', 'i', 'l', 'A']

Then, we’ll go through each individual word in the array and reverse each individual word.

So ‘b’, ‘o’, ‘B’ turns to ‘B’, ‘o’, ‘b’

After we do this for every word in our array, we’ll end up with

input - ['B', 'o', 'b', ' ', 'l', 'i', 'k', 'e', 's', ' ', 'A', 'l', 'i', 'c', 'e']

Which is our output array.

There are some constraints and questions you should ask your interviewer.

An important one is how you should deal with whitespace.

Based on the LeetCode example, they want you to remove all leading and trailing whitespace, as well as multiple spaces between words.

The output should only have a single space separating the words.

So, we’ll just follow that constraint for our solution.

This question is pretty straightforward, but it’s extremely easy to make “off by 1” errors and other small bugs.

Therefore, it’s important to write clean code with well defined functions and logic.

We’ll be splitting our code into 4 functions.

First, we’ll have a function that strips out all leading and trailing whitespace.

Then, we’ll have another function that removes multiple white spaces between the words, so that there’s just 1 space in between all the words.

Now that we’ve handled that, the problem becomes much simpler.

We’ll have a third function that takes in a start and end index, and then reverses all the characters in the array between those indexes.

We’ll use that function to reverse our array.

Last, we’ll have a function that reverses each individual word in our array.

The function will iterate through all the words in our array, and then call the third function (the reverse characters function) on each individual word.

That will give us our final output that we can return.
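
Here’s one way the four functions described above might look in Python. This is a sketch rather than an official solution: the function names are our own, it modifies the list in place, and it treats only the space character as whitespace.

```python
def strip_ends(chars):
    """Remove leading and trailing spaces in place."""
    while chars and chars[-1] == " ":
        chars.pop()
    lead = 0
    while lead < len(chars) and chars[lead] == " ":
        lead += 1
    del chars[:lead]


def collapse_spaces(chars):
    """Collapse each run of spaces between words down to a single space, in place."""
    write = 0
    for read in range(len(chars)):
        if chars[read] == " " and read > 0 and chars[read - 1] == " ":
            continue  # skip the extra spaces in a run
        chars[write] = chars[read]
        write += 1
    del chars[write:]


def reverse_range(chars, start, end):
    """Reverse chars[start..end] (both indexes inclusive) in place."""
    while start < end:
        chars[start], chars[end] = chars[end], chars[start]
        start += 1
        end -= 1


def reverse_each_word(chars):
    """Reverse every space-separated word in place."""
    start = 0
    for i in range(len(chars) + 1):
        # A word ends at a space or at the end of the array.
        if i == len(chars) or chars[i] == " ":
            reverse_range(chars, start, i - 1)
            start = i + 1


def reverse_words(chars):
    strip_ends(chars)
    collapse_spaces(chars)
    reverse_range(chars, 0, len(chars) - 1)  # reverse the whole array
    reverse_each_word(chars)                 # then fix up each word
    return chars


print("".join(reverse_words(list("Alice likes Bob"))))
```

Keeping the index arithmetic inside `reverse_range` and `reverse_each_word` means there are only two places where an off by 1 error could hide.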
