We have all made statements like the ones below about our own decisions, or about someone else’s, after the outcomes were known:
“I told you so”
“I knew it”
“I should have known this”
“How dumb could I have been not to know this?”
“I didn’t see this coming”
An important aspect of high-quality decision making is learning from experience, but there are a few traps to watch out for when doing so.
First, I long believed that good decisions will always result in good outcomes. I was wrong.
Second, I believed it is simple to decide provided we have the necessary facts at a decent confidence level. I didn’t realize this is not true for everyone. Some of us get confused by the availability of more information and will do anything to avoid deciding, not realizing that making no decision is a decision in itself.
Third, we believe we make logically sound decisions. No, that is not always true, and we…
Every once in a while, database systems go through a phase of bundling and unbundling, enabling a new set of use cases and adding value.
The DBMS brought efficiency as a packaged system that managed every aspect of data, enforced ACID properties and transactions, and provided consistency guarantees that the file-based data systems before it lacked. It developed a niche around OLTP applications and became the backbone of ERP systems.
Then Hadoop happened.
Hadoop ushered in the era of unbundling of database management systems to address massively parallel processing of unstructured data.
DBMSs are composed of
c) Data files…
1) Separation of storage and compute: there is no going back; the tight coupling between storage and compute in analytics data warehouses is gone for good. Bring your own storage (BYOS) and bring your own compute (BYOC) are the norm.
2) Distributed programming in the programmer’s hands via the simple map->shuffle->reduce pattern.
3) Columnar databases: columnar databases are not new to the analytics world and their advantages are well known; a proprietary or open-source columnar storage format is the fundamental building block of cloud data warehouses.
4) Unavailability of integrity constraints: cloud data warehouses either support constraints as decorators to aid in migration…
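The map->shuffle->reduce pattern from item 2 can be sketched in plain Python. This is a minimal, single-process illustration of the pattern, not code from any framework; the function names are mine:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    # shuffle: group values by key, so each reducer sees all values for one key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's values into a single count
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["lorem ipsum lorem", "ipsum dolor"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'lorem': 2, 'ipsum': 2, 'dolor': 1}
```

In a real distributed engine the map and reduce phases run in parallel across machines and the shuffle moves data over the network, but the dataflow is exactly this.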
Top 15 learnings from 10 years of interviewing hundreds of big data professionals for service and consulting organizations, and from attending a few interviews myself.
There are more than 100 big data open-source projects, and unfortunately you can’t avoid adding as many of them as possible to your CV so that it gets picked (SEO keyword optimization). In service organizations, panels take interviews every day, so they hardly spend time reviewing your CV thoroughly; they will probably read the first page for 10 seconds and then move on.
So make two things very…
Digitization, and the collection of more data points at every interaction your firm has with external entities, is the crucial first step in providing amazing digital experiences and achieving a successful digital transformation.
Lemonade is a socially impactful, new-age digital insurance company that went public on July 2nd. It’s the company everyone is talking about because of its differentiated business model, with behavioral science at the center of it. …
There is no doubt that poor-quality data has an impact on business outcomes. “Getting in front on data quality presents a terrific opportunity to improve business performance,” writes Thomas C. Redman in the article “Seizing Opportunity in Data Quality,” published in MIT Sloan Management Review.
The cost of bad data is an astonishing 15% to 25% of revenue for most companies. Two-thirds of these costs can be eliminated by getting in front on data quality. — Thomas C. Redman, author of Getting in Front on Data
There is another article by the same author, Improve Data Quality for…
The word count example below illustrates the importance of caching an RDD when its lineage breaks or branches out.
Case 1: Reads the input file twice
The loading of the file for the loremCountCase1 and ipsumCountCase1 operations can be verified in the log. Depending on the partitions and parallelism, you will see the two lines below twice in the log, indicating the file was read twice.
INFO HadoopRDD: Input split: file:/data/lorem_ipsum.txt:1816+1817
INFO HadoopRDD: Input split: file:/data/lorem_ipsum.txt:0+1816
As you can see from the DAG above, all operations are executed twice, once for each collect() operation.
Case 2: Reads the input file only once with…
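The effect can be sketched without Spark at all. The tiny class below is hypothetical, for illustration only: it mimics a lineage root that re-scans its source on every action unless the data has been cached, and it counts the scans so you can see why Case 1 reads the file twice while Case 2 reads it once:

```python
class LazySource:
    """Simulates an RDD lineage root: every action re-reads the source
    unless the data has been cached first."""

    def __init__(self, lines):
        self.lines = lines
        self.reads = 0        # how many times the "file" was scanned
        self._cached = None

    def scan(self):
        if self._cached is not None:   # Case 2: served from cache, no re-read
            return self._cached
        self.reads += 1                # Case 1: every action scans again
        return list(self.lines)

    def cache(self):
        self._cached = list(self.lines)
        self.reads += 1                # one read to materialize the cache
        return self

# Case 1: two "actions" on an uncached source -> the source is read twice
src = LazySource(["lorem ipsum", "ipsum"])
lorem_count = sum(line.count("lorem") for line in src.scan())
ipsum_count = sum(line.count("ipsum") for line in src.scan())
print(src.reads)  # 2

# Case 2: cache first -> one read serves both actions
src2 = LazySource(["lorem ipsum", "ipsum"]).cache()
lorem_count2 = sum(line.count("lorem") for line in src2.scan())
ipsum_count2 = sum(line.count("ipsum") for line in src2.scan())
print(src2.reads)  # 1
```

In Spark the same switch is a single cache() (or persist()) call on the RDD before the lineage branches into the two count paths.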
How do you explain Spark distributed computing to a 7-year-old kid, a 9th-grade student, a software engineer (Java), an ETL engineer, a machine learning engineer, and an executive?
Me: Do you have domino blocks?
7 Year Old: Yes many
Me: Do you have different colors?
7 Year Old: Yes, Red, Blue, Green, Orange, Yellow
Me: Do you know what distributed computing is?
7 Year Old: What is that? I don’t know
Me: How much time do you take to count all of your dominos by color?
7 Year Old: 10 mins
Me: Imagine you have many dominos in all these colors, full…
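The domino analogy maps directly onto a partition-and-merge count: split the pile among friends, let each count their own chunk, then merge the tallies. A minimal plain-Python sketch (the names and data are mine, for illustration):

```python
from collections import Counter

dominos = ["red", "blue", "green", "red", "yellow", "blue", "red", "orange"]

def split(pile, workers):
    # partition: deal the pile round-robin among the "friends" (workers)
    return [pile[i::workers] for i in range(workers)]

# map: each worker counts only its own chunk, in parallel in a real cluster
partial = [Counter(chunk) for chunk in split(dominos, 3)]

# reduce: merge the per-worker tallies into one final count
total = Counter()
for c in partial:
    total += c

print(total["red"])  # 3
```

Three friends counting their own piles and adding up the totals finish faster than one kid counting everything alone; that is the whole idea of distributed computing.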