How to Reinvent Your Data Flow Architecture: A Summary

This is part of Solutions Review’s Premium Content Series, a collection of reviews written by industry experts in maturing software categories.

Data volumes are exploding, driven primarily by tech-focused companies (not just large corporations, but emerging players in fintech, adtech, edtech, and so on). However, old-school companies have also taken the leap, adding IoT sensors to automobiles, factory lines, and oil pipelines, and collecting and analyzing customer interactions from their websites, digital products, and customer support. The data is used for everything from automating road toll payments to measuring earthquakes to monitoring assembly lines, as The Wall Street Journal reports. But most data never leaves the nest. IDC estimates that nearly two-thirds of the data created in 2020 existed only briefly. Much of the remaining third sits in storage for years, unused.

This is because data is born as a new event in a source system, and the value of these events degrades very quickly. You need to use new data promptly so you can take action while it still matters, and to understand and act on real-time events, you need to implement streaming data technologies.

Make no mistake about it, leveraging streaming data requires a paradigm shift. We’ve spent decades taking data events and batch processing them on an hourly, daily, or weekly basis. Delivering real-time action requires a stream processing mindset where new data is continually compared to historical data to identify changes of interest that the business can act on.

This means that your organization can no longer rely on the data infrastructure it has deployed over the past three decades. Unless it’s updated to support a streaming process, your analytics will fare poorly compared to what’s possible.

In short, you need to change your mindset and reinvent your architecture. Here’s how you can do it.

Architect for events

Although data is born in the form of events, these events are usually grouped into batches, because that is the process traditional data infrastructure supports. A traditional data pipeline typically involves batch processing with defined start and end times (for example, a batch job that runs at the end of every hour).

Obviously, you cannot enable real-time analytics using batches. Batch processing creates delay, and that delay equates to missed opportunities. To access more up-to-date analytics, you need to adopt a streaming data pipeline approach, where logic runs on each new event as it arrives and lets you detect changes as they happen, as in the sketch below.
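To make the contrast with batch processing concrete, here is a minimal Python sketch of per-event logic. The event fields, threshold, and `stream` iterable are illustrative assumptions, not part of any specific product; in a real pipeline the stream would come from a message queue.

```python
import json

THRESHOLD = 100.0   # hypothetical alerting threshold
last_seen = {}      # in-memory state: last value per sensor

def handle_event(raw_event: bytes) -> None:
    """Run logic on a single event the moment it arrives."""
    event = json.loads(raw_event)
    sensor_id, value = event["sensor_id"], event["value"]
    previous = last_seen.get(sensor_id)
    # Detect the change immediately instead of waiting for an hourly batch.
    if previous is not None and abs(value - previous) > THRESHOLD:
        print(f"change detected on {sensor_id}: {previous} -> {value}")
    last_seen[sensor_id] = value

def run(stream) -> None:
    # `stream` stands in for a Kafka/Kinesis consumer: any iterable of raw payloads.
    for raw_event in stream:
        handle_event(raw_event)
```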

In addition to providing fresher data, stream processing has the added benefit of traceability. A well-designed streaming architecture uses an “event sourcing” approach that keeps a log of every change made to the dataset since its creation. This allows you to change your logic and rerun the new logic on old data, so if you discover a bug or an unexpected change in your source data, you can simply replay your pipeline. This makes your data operations more flexible and resilient.
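A minimal sketch of that event-sourcing idea, assuming a local append-only log file and illustrative event fields: every change is appended to the log, so revised logic can be replayed over old events at any time.

```python
import json
from pathlib import Path

LOG_PATH = Path("events.log")  # append-only log, one JSON event per line

def append_event(event: dict) -> None:
    """Record every change since the dataset's creation."""
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(event) + "\n")

def replay(transform) -> list:
    """Re-run (possibly new) logic over every event recorded so far."""
    results = []
    with LOG_PATH.open() as f:
        for line in f:
            results.append(transform(json.loads(line)))
    return results

# If a bug is found in the original transform, fix the logic and replay:
append_event({"user": "a", "amount": 10})
append_event({"user": "a", "amount": -3})
fixed_totals = replay(lambda e: {"user": e["user"], "amount": max(e["amount"], 0)})
```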

Re-engineer for freshness at scale

Imagine you want to implement a “next best offer” system, which combines real-time behavioral data from an app user with all sorts of contextual data about them (e.g. browsing history, location, demographics) to determine which offers make the most sense for that individual. This kind of instant action relies on data freshness enforced by strict service-level agreements (SLAs).
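Here is a hedged sketch of that “next best offer” pattern: one behavioral event is joined with a contextual profile and candidate offers are ranked. The profile store, offer catalog, and scoring rule are all illustrative assumptions, not a prescribed design.

```python
from datetime import datetime, timezone

PROFILES = {  # stand-in for a low-latency profile/feature store
    "user-42": {"segment": "frequent_traveler", "city": "Berlin"},
}
OFFERS = [
    {"id": "lounge-pass", "segments": {"frequent_traveler"}, "score": 0.9},
    {"id": "welcome-coupon", "segments": {"new_user"}, "score": 0.5},
]

def next_best_offer(event: dict):
    """Pick the highest-scoring offer matching the user's segment."""
    profile = PROFILES.get(event["user_id"], {})
    candidates = [o for o in OFFERS if profile.get("segment") in o["segments"]]
    return max(candidates, key=lambda o: o["score"], default=None)

event = {"user_id": "user-42", "action": "opened_app",
         "ts": datetime.now(timezone.utc).isoformat()}
print(next_best_offer(event))  # -> the "lounge-pass" offer
```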

As these SLAs tighten, as the volume of data you ingest and process grows, and as the number of pipelines you run multiplies (because the data is used by more consumers), you will need to scale your infrastructure flexibly to keep your data fresh.

As your needs change and demand grows, ensure you can maintain real-time performance by using cloud data processing services with elastic scaling. Make sure you don’t have long-running operations that slow down your processing, or excessive memory usage that might cause you to miss your service levels.

Implement a real-time data lake

Data streaming requires a number of new technologies. You need a way to ingest events into a stream, store them affordably, process them efficiently, and distribute the transformed data to various analytics systems. The good news is that proven technologies are available in the market. Together, they constitute a real-time data lake.

Stream ingestion: You need a message queue such as Apache Kafka, Amazon Kinesis, or Azure Event Hubs that can collect events as streams.
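As an illustration, here is a minimal ingestion sketch using the kafka-python client. The broker address and the "events" topic are placeholder assumptions, not values prescribed by the article.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each source-system change becomes one event on the stream.
producer.send("events", {"sensor_id": "pump-7", "value": 98.6})
producer.flush()
```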

Data lake storage: Streaming data can accumulate to enormous size, so a cloud data lake based on object storage such as Amazon S3 or Azure Data Lake Storage (ADLS) is the most economical way to manage it. Many tools will allow you to connect message queues to a data lake.
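A hedged sketch of landing events in object storage with boto3, assuming an existing bucket named "my-data-lake" and a date-partitioned key layout; in practice a connector usually handles this step.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_micro_batch(events: list) -> None:
    """Write a small batch of events as one newline-delimited JSON object."""
    now = datetime.now(timezone.utc)
    key = f"raw/events/dt={now:%Y-%m-%d}/{now:%H%M%S%f}.json"
    body = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    s3.put_object(Bucket="my-data-lake", Key=key, Body=body)

write_micro_batch([{"sensor_id": "pump-7", "value": 98.6}])
```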

Processing platform: Once the data is stored in the data lake, you can either use Apache Spark (with jobs written in Python, Java, or Scala) or a SQL-based tool, such as Upsolver, to combine recent real-time data with historical data and feed downstream systems. This blending happens continuously as new data arrives.
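A minimal PySpark Structured Streaming sketch of that pattern: read new events from Kafka, join them with historical data already in the lake, and write the result downstream. The topic, paths, and schema are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("stream-plus-history").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
])

# Continuous stream of new events from Kafka.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Historical reference data already sitting in the data lake.
history = spark.read.parquet("s3a://my-data-lake/history/sensors/")

# The blend happens continuously as new events arrive.
enriched = events.join(history, on="sensor_id", how="left")

query = (
    enriched.writeStream.format("parquet")
    .option("path", "s3a://my-data-lake/curated/enriched_events/")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/enriched/")
    .start()
)
query.awaitTermination()
```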

Data lake optimization tools: Data lakes are very affordable, but they require performance optimization, what we call PipelineOps. Here are some examples. First, a compaction process is needed to turn millions of single-event files into larger files that can be processed efficiently. Second, blending streaming data with batch data requires orchestration of the jobs that perform the processing. Third, a state store is required for stateful processing that joins streams to batches.
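To illustrate the compaction step, here is a hedged sketch that folds many small newline-delimited JSON event files into one larger Parquet file. The directory layout and the choice of pyarrow are assumptions, and the small files are assumed to share a schema.

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.json as pj
import pyarrow.parquet as pq

SMALL_FILES_DIR = Path("raw/events/dt=2024-01-01")
COMPACTED_FILE = Path("compacted/events/dt=2024-01-01/part-0000.parquet")

def compact() -> None:
    # Read every small newline-delimited JSON file into one Arrow table.
    tables = [pj.read_json(str(p)) for p in sorted(SMALL_FILES_DIR.glob("*.json"))]
    combined = pa.concat_tables(tables)
    COMPACTED_FILE.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(combined, COMPACTED_FILE)

if __name__ == "__main__":
    compact()
```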

These processes can be handled with purpose-built tools, such as Airflow for orchestration and RocksDB, Redis, or Cassandra as a state store, glued to the processing engine with code. Alternatively, you can adopt a declarative data pipeline platform such as Upsolver, which automates these PipelineOps functions.
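For the orchestration piece, here is a minimal Airflow sketch, assuming Airflow 2.4+ and assuming the compaction and blending steps are packaged as Python callables; the DAG id, schedule, and task names are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def compact_small_files(**_):
    print("compacting small event files into larger Parquet files")

def refresh_live_tables(**_):
    print("blending new events with historical data")

with DAG(
    dag_id="pipeline_ops",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    compact = PythonOperator(task_id="compact", python_callable=compact_small_files)
    refresh = PythonOperator(task_id="refresh", python_callable=refresh_live_tables)
    compact >> refresh  # run compaction before refreshing downstream tables
```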

Analytics systems: Once you’ve optimized your data lake, you’ll be able to expose your streaming data as “live tables” (which update automatically as new events arrive) that can be consumed by any analytics system.

Streaming is the way to go – and it’s time to make the move

As you can see, streaming requires a change of mindset, thoughtful planning, and new infrastructure. However, if you invest the time and effort to implement modern streaming processes and systems, you can take advantage of timely new data and analytical insights.
