Schema management with Scalameta by Lars Albertsson

October 11, 2023

In the presentation Schema management with Scalameta, Lars Albertsson, founder of Scling, discusses his background in data engineering and the evolution of data management from an artisanal approach to an industrial one. He shares his experience working at Google and building Spotify’s data platform, and explains how tools like Hadoop enabled the shift from manual data processing to industrial-scale data lakes. He also notes the challenges and failures companies encountered in implementing these industrial-scale solutions and the subsequent rise of user-friendly tools.

Lars emphasizes that reducing the cost of working with data is key to accessing its value, and that collaboration and modern software engineering practices are central to managing data factories. He discusses the concept of data factories and their relation to value production, the importance of efficient software engineering processes, and the shift from mutable data in databases to immutable data in functional programming.

He then explores two different approaches to schema management, the use of Scalameta for offline code generation, and the challenges of handling complex data types and maintaining privacy in a data lake. On balancing data access with security, he covers the benefits of encrypting sensitive data with techniques such as the lost-key pattern. Overall, he argues that data engineering is behind software engineering in terms of maturity and the use of factories, and that data engineering, operations, and security are completely dependent on software engineering to build those factories.

Schema management with Scalameta: A comprehensive overview

In the ever-evolving landscape of data engineering, efficient schema management is pivotal for handling large-scale data processing and ensuring data integrity. Lars Albertsson, founder of the data engineering company Scling, provides a comprehensive overview of schema management using Scalameta in his insightful YouTube presentation, “Schema Management with Scalameta.” Here, we delve into the key takeaways and techniques discussed by Lars, offering a roadmap for data engineers aiming to optimize their schema management practices.

 

The Evolution of data management

Lars begins by reflecting on his extensive background in data engineering, which includes significant stints at Google and Spotify. His experiences underscore the shift from manual data processing to industrial-scale data lakes, driven by technologies like Hadoop and other open-source tools. Despite the initial enthusiasm, many companies faced challenges with these large-scale solutions, prompting a return to more traditional data warehouses or the adoption of newer, user-friendly tools. This cyclical trend mirrors the rise and fall of fourth-generation programming languages in the 1990s, highlighting the ongoing quest for efficient data management solutions.

 

The concept of data factories

Lars introduces the notion of “data factories,” where the goal is to automate data processing to generate business value. He explains that the volume of data produced is a marker of a company’s maturity. Large enterprises, such as Google, produce billions of data sets per day, leveraging data to enhance their products. This underscores the importance of reducing the cost of working with data to unlock its potential value, particularly in the “fat tail” of data, which contains significant but often underutilized insights.

 

From mutable to immutable data

A significant shift in data engineering has been the move from mutable data in databases to immutable data sets in functional programming. This transition was driven by Hadoop’s limitations, which do not support indexing or transactions, leading to a preference for functional transformations. Lars emphasizes the need for defining schemas in data storage, differing from traditional relational databases. Avro is highlighted as a preferred serialization format, storing the schema with the data for convenience.
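The move to immutable data sets can be illustrated with a small sketch in plain Scala (the `PlayEvent` record and field names are hypothetical, not from the talk). Where a database would update rows in place, a functional pipeline derives a new dataset from the old one, leaving the input untouched and the transformation repeatable:

```scala
// A hypothetical event record; in an Avro-based lake the schema
// travels with the data file rather than living in a database catalog.
case class PlayEvent(userId: String, trackId: String, durationMs: Long)

object Transform {
  // A pure transformation: it returns a new dataset instead of
  // mutating rows in place, so reruns always produce the same result.
  def cleaned(events: Seq[PlayEvent]): Seq[PlayEvent] =
    events.filter(_.durationMs >= 1000)
}
```

Because the input is never modified, a failed or buggy job can simply be rerun from the original data, which is one of the practical payoffs of immutability in a data lake.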

 

Schema management approaches: Schema on write vs. schema on read

Lars contrasts two approaches to schema management:

  1. Schema on write: This approach enforces strict data checks before storage, ensuring conformity to a clear schema definition. It is common in relational databases, where invalid data is prevented from being written.
  2. Schema on read: This more flexible approach stores data without strict checks, expecting certain data formats during read operations. While useful for analytics, it poses risks if schema typos are not detected early, making it less desirable for real-time applications.

Managing compatibility issues across old schema versions is crucial, requiring strategies to handle incompatible changes over time.
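The two approaches, and the compatibility concern, can be sketched in plain Scala (the `User` record and default value are illustrative assumptions, not from the talk). Schema on write rejects a malformed record up front; schema on read accepts anything and interprets it later, and a default value for a newly added field is the usual way to keep old records readable:

```scala
case class User(id: String, name: String, country: String)

object SchemaDemo {
  // Schema on write: validate before storing, so invalid records
  // never enter the store.
  def writeChecked(fields: Map[String, String]): Either[String, User] =
    for {
      id      <- fields.get("id").toRight("missing field: id")
      name    <- fields.get("name").toRight("missing field: name")
      country <- fields.get("country").toRight("missing field: country")
    } yield User(id, name, country)

  // Schema on read: store anything, interpret at read time. An older
  // record written before 'country' existed is still readable because
  // the new field falls back to a default.
  def readLenient(fields: Map[String, String]): User =
    User(
      fields.getOrElse("id", ""),
      fields.getOrElse("name", ""),
      fields.getOrElse("country", "unknown") // default keeps old data readable
    )
}
```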

 

Leveraging Scalameta for schema management

Scalameta, a powerful Scala parsing tool, plays a central role in Lars’ schema management strategy. Scalameta parses Scala source code, generating a syntax tree used for formatting, static analysis, and code transformation. This capability is particularly valuable in the context of Big Data tools, bridging the gap between JVM’s robust typing and Python’s data science flexibility.
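As a minimal illustration of what parsing to a syntax tree buys you, the sketch below (assuming the `org.scalameta %% scalameta` library on the classpath) extracts the field names of a case class from its source text, the kind of structural information a schema generator needs:

```scala
import scala.meta._

object ParseDemo {
  // Parse Scala source into a Scalameta syntax tree and collect the
  // primary-constructor parameter names of any case class definitions.
  def fieldNames(source: String): List[String] = {
    val tree = source.parse[Source].get
    tree.collect {
      case Defn.Class(_, _, _, ctor, _) =>
        ctor.paramss.flatten.map(_.name.value)
    }.flatten
  }
}
```

Working on the tree rather than on raw text is what makes static analysis and code transformation reliable: the structure is explicit, not recovered by fragile string matching.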

Lars elaborates on the use of quasiquotes and Scala case classes for schema management. Quasiquotes, a macro feature, allow for type-safe code generation within strings, providing compile-time error checking. Scala case classes simplify data record handling, enabling the generation of tools like Avro definitions, Java-to-Scala converters, and CSV parsers.
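A small sketch of quasiquote-based generation (again assuming the `scalameta` library; the `Point` example is hypothetical): a case class definition is built as a tree, so a malformed fragment fails when the generator is compiled rather than when the generated code is used:

```scala
import scala.meta._

object CodeGen {
  // Generate a case class definition from (name, type) pairs using
  // quasiquotes; the q"..." interpolator builds a checked syntax tree.
  def caseClass(name: String, fields: Seq[(String, String)]): String = {
    val tname = Type.Name(name)
    val params = fields.toList.map { case (n, t) =>
      param"${Term.Name(n)}: ${Type.Name(t)}"
    }
    q"case class $tname(..$params)".syntax
  }
}
```

The same pattern extends to the tools Lars mentions: once the fields are available as a tree, emitting an Avro definition, a Java-to-Scala converter, or a CSV parser is a matter of walking the same structure.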

 

Addressing challenges in schema management

Lars highlights several challenges and their solutions in schema management:

  • Floating point comparisons: A type class dubbed “record equality” compares floating point fields with a tolerance rather than exact equality.
  • Bridging Scala and Python: Schemas for relational databases are defined in Python, given Python’s role in workflow orchestration.
  • Logical types: Custom logical types ensure semantic correctness, such as representing sets as lists while enforcing uniqueness.
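The record-equality idea can be sketched as an ordinary Scala type class (the name `RecordEquality` and the tolerance are illustrative assumptions): exact comparison for most fields, tolerance-based comparison for floating point values, selected per type by the compiler.

```scala
// A sketch of a "record equality" type class: exact equality for most
// field types, epsilon-based comparison for floating point values.
trait RecordEquality[A] {
  def equivalent(x: A, y: A): Boolean
}

object RecordEquality {
  private val epsilon = 1e-9

  // Floating point fields match within a tolerance, so values that
  // differ only by rounding error still compare as equal.
  implicit val doubleEq: RecordEquality[Double] =
    (x, y) => math.abs(x - y) <= epsilon

  implicit val stringEq: RecordEquality[String] =
    (x, y) => x == y

  def same[A](x: A, y: A)(implicit eq: RecordEquality[A]): Boolean =
    eq.equivalent(x, y)
}
```

A generated comparator for a whole record would then combine the per-field instances, which is exactly the kind of boilerplate offline code generation is good at producing.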

Privacy concerns are addressed through encryption techniques, such as the “lost key pattern,” separating personal data with encryption keys to maintain privacy while allowing data usability.
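The lost-key idea can be sketched with standard JVM crypto (a toy in-memory key store; a real deployment would use a key management service and an authenticated cipher mode, not bare AES): personal data is encrypted with a per-user key kept outside the lake, so deleting that one key renders every stored copy unreadable.

```scala
import javax.crypto.{Cipher, KeyGenerator, SecretKey}

object LostKeyPattern {
  // Per-user keys, stored separately from the data lake itself.
  private var keys = Map.empty[String, SecretKey]

  private def cipher(mode: Int, key: SecretKey): Cipher = {
    val c = Cipher.getInstance("AES") // demo only; use AES-GCM in practice
    c.init(mode, key)
    c
  }

  def encrypt(userId: String, personal: Array[Byte]): Array[Byte] = {
    val key = keys.getOrElse(userId, {
      val k = KeyGenerator.getInstance("AES").generateKey()
      keys += userId -> k
      k
    })
    cipher(Cipher.ENCRYPT_MODE, key).doFinal(personal)
  }

  def decrypt(userId: String, data: Array[Byte]): Option[Array[Byte]] =
    keys.get(userId).map(k => cipher(Cipher.DECRYPT_MODE, k).doFinal(data))

  // "Forgetting" a user: drop the key, and every encrypted copy of
  // their personal data in the lake becomes permanently unreadable.
  def forget(userId: String): Unit = keys -= userId
}
```

This is why the pattern suits immutable data lakes: nothing in the lake needs to be rewritten or deleted to honor an erasure request, only the key.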

 

Data access and security

Balancing data access with security is paramount. Lars introduces a “lake model,” where different teams control their own data schemas while access to non-personal data is democratized. Teams generate schemas and run automated checks, relying on rigorous software engineering practices to keep everything synchronized and maintain data integrity.

 

Learning and adopting best practices

Lars concludes by reflecting on the maturity of data engineering compared to software engineering. He emphasizes the need for data engineering to evolve into a factory model, akin to software engineering. For aspiring data engineers, he recommends gaining experience in environments like Spotify or Scling, where industrial-scale data processing and schema management practices are implemented effectively.

 

Conclusion

Lars Albertsson’s presentation on schema management with Scalameta offers a wealth of knowledge for data engineers striving to optimize their data processing workflows. By leveraging tools like Scalameta and adopting best practices in schema management, data engineers can significantly enhance the efficiency, accuracy, and security of their data operations, unlocking the true potential of their data assets.

 

Additional resources

Check out more from the MeetUp Func Prog Sweden. Func Prog Sweden is the community for anyone interested in functional programming. At the MeetUps the community explores different functional languages like Erlang, Elixir, Haskell, Scala, Clojure, OCaml, F# and more.