How do you explain distributed computing and Apache Spark at different levels of complexity

Vijaya Phanindra
9 min read · Jun 2, 2020

How do you explain Spark distributed computing to a 7-year-old kid, a 9th-grade student, a software engineer (Java), an ETL engineer, a machine learning engineer, and an executive?

7 Year Old

Me: Do you have domino blocks?

7 Year Old: Yes many

Me: Do you have different colors?

7 Year Old: Yes, Red, Blue, Green, Orange, Yellow

Me: Do you know what distributed computing is?

7 Year Old: What is that? I don’t know.

Me: How much time would you take to count all of your dominoes by color?

7 Year Old: 10 mins

Me: Imagine you have dominoes in all these colors filling your entire room.

7 Year Old: Oh that many

Me: How much time would you take to count all of them?

7 Year Old: I don’t know. Ummm, a long time, 10 days!!! (puzzled face) Can I have all of these for myself?!

Me: How can you count all of the dominoes in 10 mins?

7 Year Old: That’s not possible, you can keep only some of the dominoes maybe.

Me: No, I want all of the dominoes filling the entire room to be counted.

7 Year Old: Full of this room!!!!!!

Me: What else could we do if we want to count all of them without removing any?

7 Year Old: Can I ask my friends to come and help?

Me: That’s a great idea! Yes, you can. How many can help you?

7 Year Old: 3 of my friends

Me: Are 3 of your friends enough to count all of them in 10 mins?

7 Year Old: No, we need more.

Me: Let’s say your friends, my friends, and their friends are all going to help us, 100 of them.

7 Year Old: Yes, we can count them in 10 mins!

Me: That is called distributed computing, and Spark is a tool that can help with it.

7 Year Old: So when are we getting the dominoes to fill the room?

9th Grader

Me: Do you know anything about Spark distributed computing?

9th Grader: No. Is it a computer? What can it do?

Me: Do you take photos with your phone?

9th Grader: Yes, many, actually lots of them, I keep clicking them.

Me: Do you know that 1 TB of space can hold 2 million photos?

9th Grader: Wow, really? Then I can take more if I have more space.

Me: Do you share the photographs with friends on social networking sites?

9th Grader: yeah..

Me: How about tagging your and your friends’ faces in the photos?

9th Grader: That’s automatically done by the site where I upload them.

Me: Assume you have 2 million photos on your phone and you have to tag all of them.

9th Grader: What!! That’s impossible, that’s silly. I’d have to spend the rest of my life doing that.

Me: Let’s say you take a minute to tag one photo.

9th Grader: That’s 2 million minutes… (gasps)

Me: That’s equivalent to ~45 months, or nearly four years, of non-stop tagging.

9th Grader: Oh my god

Me: Do you know that 95 million photos are uploaded to Instagram every day?

9th Grader: Whattt!!!!!!!!

Me: This is where Spark distributed computing can help you.

9th Grader: How?

Me: You know about computers and programs in general, right?

9th Grader: Yes

Me: So what you do is write an algorithm and send it, along with the photographs, to many computers, and they all do the task in parallel. If there is more work to be done, you add more worker computers.

9th Grader: You are saying we can use as many computers as we want to run the algorithm, and they all run in parallel?

Me: Yes. In your example of 2 million photos, manually tagging a single photo takes a minute, but a computer takes much less time. Let’s assume a computer takes 100 milliseconds per photo: a single computer would take ~2.3 days, and 100 computers would take ~33 minutes.
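(For the curious reader, a quick back-of-the-envelope check of those numbers in plain Python, using the values from the dialogue:)

```python
photos = 2_000_000
seconds_per_photo = 0.1                 # assumed 100 ms per photo

total = photos * seconds_per_photo      # 200,000 seconds of work
print(total / 86_400)                   # ~2.3 days on a single computer
print(total / 100 / 60)                 # ~33 minutes spread across 100 computers
```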

9th Grader: That’s awesome… so all the computers do the same task of tagging photos in parallel, so together they take far less time.

Me: Yes, you summed it up great.

9th Grader: That’s good, it’s not as bad as I thought, so I will share more photographs.

Engineer (Java)

Me: Do you know anything about distributed computing?

Engineer (Java): Yes, I know a little bit, something like dividing a problem into many tasks that run concurrently across a network of computers.

Me: What else can you relate to distributed computing?

Engineer (Java): It’s similar to multithreading. In multithreaded processing, a single JVM process internally has several execution paths running concurrently; all threads share the same process state, and access to that shared state is controlled using synchronization constructs.

Me: Yes, it is similar to that. Instead of a single JVM with multiple threads, in Spark distributed programming there are several worker processes (JVMs) running on nodes across the network, coordinated by a central driver process, which is again a JVM. The workers run concurrently and execute the tasks assigned to them by the driver. Each worker internally uses multithreading to run its tasks; the number of tasks a worker runs concurrently is determined by the number of cores allocated to it. Spark has a default cluster manager in standalone mode and can also connect to external cluster managers such as YARN, Mesos, and Kubernetes.
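As a rough sketch of what that looks like from the driver’s side (a PySpark example; the app name and resource values are made up, while the config keys are standard Spark settings):

```python
from pyspark.sql import SparkSession

# The driver process starts here. The master URL points at the cluster
# manager that coordinates the workers; "local[4]" simulates a cluster
# with 4 cores inside a single JVM, handy for experimenting.
spark = (
    SparkSession.builder
    .appName("photo-tagger")                   # hypothetical app name
    .master("local[4]")                        # or yarn, k8s://..., spark://...
    .config("spark.executor.cores", "4")       # concurrent tasks per worker JVM
    .config("spark.executor.instances", "10")  # worker JVMs (on YARN/Kubernetes)
    .getOrCreate()
)
```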

Engineer (Java): Fine, you have explained the computing part of it. What about the data? Is it pushed to the workers by the driver? That would be terrible.

Me: No, the driver process doesn’t push data to the workers. Only the instructions on what data to read, and where to read it from, are sent to the workers; each worker reads its own share, or partition, of a bigger dataset across the cluster. How the data is divided, and into how many partitions, depends on the underlying file system. How much concurrency or parallelism you get depends on the computing resources available to the Spark cluster.
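A minimal sketch of that behavior (PySpark; the paths are hypothetical):

```python
# Workers read only the blocks assigned to them; the driver never
# touches the data itself, it only ships the reading instructions.
rdd = spark.sparkContext.textFile("hdfs:///data/events.log")   # hypothetical path
print(rdd.getNumPartitions())    # partition count follows the file's block layout

# A DataFrame read behaves the same way.
df = spark.read.parquet("hdfs:///data/events.parquet")         # hypothetical path
print(df.rdd.getNumPartitions())
```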

Engineer (Java): How is the distributed data processed across the cluster?

Me: The Spark distributed programming model follows the MapReduce design pattern. In the map phase, all operations that can be performed in parallel on a part of the larger dataset are executed, such as filtering and transforming, while in the reduce phase the summary operations, such as reduce and group-by, happen.
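The classic word count shows the two phases (a PySpark sketch; the input path is made up):

```python
lines = spark.sparkContext.textFile("hdfs:///data/books.txt")  # hypothetical path

counts = (
    lines.flatMap(lambda line: line.split())  # map phase: parallel per partition
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)     # reduce phase: shuffle + aggregate by key
)
print(counts.take(5))                         # action that triggers the computation
```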

Engineer (Java): What’s the difference between Spark and Hadoop MapReduce in terms of data sharing?

Me: Spark’s main data abstraction is the Resilient Distributed Dataset (RDD), a collection of elements partitioned across different nodes. Unlike Hadoop MapReduce, which writes intermediate results to disk between stages, Spark can keep RDDs in memory across operations, which makes iterative workloads much faster.

Engineer (Java): So can I call an RDD an array that is distributed across multiple nodes/workers?

Me: More or less, yes; the workers hold the RDD’s partitions as part of their allocated memory. An RDD is a fault-tolerant collection, which means it can be reconstructed from its lineage on node failures. The important points to consider about RDDs (see the sketch after this list):

  • Constructed from parallelized collections or external datasets
  • Are parallel and fault-tolerant; they can be reconstructed from their lineage on node failures
  • Caching/persistence speeds up subsequent operations by keeping the data local and enabling reuse
  • Support two kinds of operations: transformations and actions
  • Transformations are lazily evaluated, processed only when an action requires them
  • Store their number of partitions
  • Certain operations result in a shuffle of data across nodes
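A minimal sketch of those behaviors in PySpark (the numbers are arbitrary):

```python
# Constructed from a parallelized collection, split into 8 partitions.
nums = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

squares = nums.map(lambda n: n * n)  # transformation: lazy, nothing runs yet
squares.cache()                      # persistence: materialized on the first action

print(squares.count())               # action: triggers computation, fills the cache
print(squares.sum())                 # second action: reuses the cached partitions
print(nums.getNumPartitions())       # the RDD knows its partition count
```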

Engineer (Java): Oh wow, that was a crash course on Spark. Where do I start?

Me: https://spark.apache.org/

ETL Engineer

Me: What do you know about Spark distributed computing?

ETL Engineer: Some kind of processing tool for big data

Me: Sure. Spark is a general-purpose computing engine for large-scale data processing. It can process different workloads, batch and streaming data, structured and unstructured data types, from simple ETL to complex machine learning algorithms, in the same application.
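For instance, a simple batch ETL job in Spark’s DataFrame API might look like this (a sketch; the paths and column names are invented):

```python
from pyspark.sql import functions as F

orders = spark.read.parquet("s3://raw/orders/")       # extract (hypothetical path)

daily = (
    orders.filter(F.col("status") == "COMPLETE")      # transform
          .groupBy("order_date")
          .agg(F.sum("amount").alias("daily_revenue"))
)

daily.write.mode("overwrite").parquet("s3://curated/daily_revenue/")  # load
```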

ETL Engineer: Fantastic. Can Spark ever replace traditional ETL tools?

Me: Maybe, maybe not. Traditional ETL tools are well suited to structured data processing, such as updating a data warehouse with star and snowflake schemas. While Spark can process these kinds of workloads, it’s not as straightforward as an enterprise tool. An ETL tool helps develop ETL mappings and provides operational capabilities to schedule and monitor workflows; with Spark, we need several other tools to achieve similar operational capabilities, though many cloud vendors are filling this gap fast.

ETL Engineer: Do we have a pipeline or workflow designer with Spark?

Me: Spark is a programmable distributed computing engine: you write code against the Spark API in Scala, Java, Python, or R, with Scala and Java as first-class citizens. The job, typically packaged as a jar, is then deployed on a cluster.

ETL Engineer: You mean there is no UI to design ETL mappings?

Me: Spark doesn’t come with a UI like an ETL tool, but there are a couple of UI tools from third-party vendors. It is still very early for Spark UIs, and they may not be a priority for the open-source community or other vendors. More importantly, moving to big data and Spark requires a change of mindset: from data integration to data engineering, from workflows to data pipelines, from tool-based development to coding, and from enterprise-licensed to open-source systems.

ETL Engineer: Does Spark support SQL and transactions?

Me: Well, yes. Spark has a SQL API; though it is not fully ANSI SQL compliant, the majority of SQL commands and analytical functions are supported. While some ETL tools provide out-of-the-box support for transactions, rolling back data in the target in case of failures, that has to be coded in Spark, and it is best to push transactions down to the database rather than handle them in the Spark processing layer.
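A small sketch of the SQL API, reusing the hypothetical orders data from the earlier example:

```python
orders.createOrReplaceTempView("orders")   # expose the DataFrame to SQL

result = spark.sql("""
    SELECT order_date,
           SUM(amount)                             AS revenue,
           RANK() OVER (ORDER BY SUM(amount) DESC) AS revenue_rank
    FROM orders
    GROUP BY order_date
""")
result.show(5)
```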

ETL Engineer: Where do I start?

Me: Start by learning a programming language, Java, Scala, or Python, and then dive into coding with the Spark API.

Machine Learning Engineer

Me: What do you know about Spark distributed computing?

Machine Learning Engineer: It’s a tool used for distributed processing of big data; it can also be used for running machine learning algorithms.

Me: What are the specific computing problems with ML algorithms?

ML Engineer: ML algorithms run in Python and R environments are iterative, the large datasets required for model generation don’t fit into a single computer’s memory, these tools don’t cache the datasets smartly, and inherent support for distributed computing is not available.

Me: Sure. Spark’s main data abstraction is the Resilient Distributed Dataset (RDD), a collection of elements partitioned, or distributed, across different nodes. The Spark programming model distributes your code to workers, which execute tasks in parallel on the partitions of data available on their nodes, and it scales with your computing needs.

ML Engineer: Does that mean all ML algorithms are automatically parallelized?

Me: Not necessarily. Running an algorithm for model generation on Spark doesn’t make the algorithm parallel; you have to write, or use, a parallelized version of it. Spark provides MLlib, with parallelized algorithms for logistic regression, classification, and many more, and there are third-party libraries as well.
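A minimal MLlib sketch, fitting a parallelized logistic regression on a tiny made-up dataset:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Made-up training data: (label, feature vector).
train = spark.createDataFrame(
    [
        (0.0, Vectors.dense(0.0, 1.1)),
        (1.0, Vectors.dense(2.0, 1.0)),
        (1.0, Vectors.dense(2.2, 0.9)),
        (0.0, Vectors.dense(0.1, 1.2)),
    ],
    ["label", "features"],
)

# The fit runs as distributed Spark jobs over the DataFrame's partitions.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
```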

ML Engineer: Does Spark support SQL and data frames?

Me: Yes, Spark has a SQL API that includes the DataFrame abstraction. The SQL is not fully ANSI compliant, but the majority of SQL commands and analytical functions are available.

ML Engineer: Great, does Spark support deep learning?

Me: Yes. Deep learning is another pattern of computation-heavy, iterative processing, and Spark is one of the computing engines best suited for it. In addition to the Deep Learning Pipelines library from Databricks, many other vendors provide libraries integrating with TensorFlow or Keras, and the next major version, Spark 3.0, adds GPU acceleration support, which is important for deep learning.

ML Engineer: Great, where do I start?

Me: If you are not comfortable with Java/Scala, you can start right away with Python/R.

Executive

Executive: There is a lot of news and hype about data platforms and data & analytics. Our competitor was all over the news yesterday about their data platform initiative and how Spark helped them.

Me: Sure, Apache Spark is an open-source unified analytics computing engine for large-scale SQL, batch processing, stream processing, and machine learning.

Executive: How does it help with our data platform initiative?

Me: Different teams — business analysts, data engineers, and data scientists can build analytical data products using a unified analytics engine such as Apache Spark. Apache Spark makes distributed and big data processing scalable and much easier.

Executive: Are there any vendors providing enterprise support and what about security?

Me: Sure. Spark-as-a-service offerings are provided by many vendors, from the founders of Spark to the major public cloud vendors. They add value on top of the core software with operational tools, pay-as-you-go pricing, security, role-based access controls, etc.

Executive: So when can we start our data platform initiatives, and when can we start hiring Spark talent?

Me: So, what was the business case or problem statement again?


Disclaimer: All the opinions expressed are my personal independent thoughts and not to be attributed to my current or previous employers.
