How Hadoop forever changed our thinking about relational databases and influenced cloud data warehouses design, some hits and misses
1) Separation of storage and compute, there is no going back the tight coupling between storage and compute for analytics data warehouses is gone for good. Bring your own storage BYOS and Bring your own compute BYOC is the norm.
2) Distributed programming at programmers hand using simple map->shuffle->reduce pattern.
3) Columnar Databases — Columnar databases are not new for the analytics world and those advantages are repeated, proprietary or open-source columnar storage format is the fundamental block to cloud data warehouses.
4) Unavailability of Integrity Constraints — cloud data warehouses either support constraints as decorators to aid in migration or don’t support them at all.
5) Batch loads to real-time streaming analytics — there is a real shift towards real-time streaming analytics, updating data once in a day is no more a viable option.
6) Semistructured data — there is a huge value in storing JSON/XML hierarchical, nested semistructured data for analysis and improving query response times.
1) Can’t get around without SQL — Relation or non-relational, nested data types, JSON/XML or any data type, SQL is the preferred choice, even majority of SPARK workloads (excluding AI/ML) are written using Spark SQL API
2) Immutable datasets — Hadoop is designed for append-only immutable datasets, but change data is important
3) Transactions — CDC means some sort of ACID transactions required with some level of concurrency control if not full concurrency and isolation levels.
4) Master/Server architecture for many tools in the Hadoop ecosystem creating operational and maintenance challenges, forcing shift towards serverless or zero ops.
5) Dynamic auto-scaling is good but hard to manage, thus shifting this responsibility to vendors to auto manage.
6) Multiple security tools and protocols in the Hadoop ecosystem where consistent unified IAM/network/encryption controls are preferred.
7) BI Reporting/Dashboards, Hadoop never meant to give a response in seconds, many attempts to do direct reporting on top Hive created failures and inconsistent experience to business users thus new cloud warehouse created query caching mechanism to support this use case.