BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Presentations Stream All the Things — Patterns of Effective Data Stream Processing

Stream All the Things — Patterns of Effective Data Stream Processing

41:25

Summary

Adi Polak discusses patterns for effective data stream processing, highlighting common pitfalls and the complexities of balancing data infrastructure. Learn about exactly-once semantics, the challenges of join operations in streaming (including the "Puppies shelter" concept), and crucial error handling strategies.

Bio

Adi Polak is Director, Advocacy and Developer Experience Engineering @Confluent.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Polak: I'm very happy to share a little bit from what I've learned and what we've learned from our customers' journeys as well. Streaming all the things, patterns of effective data stream processing, and what does it even mean?

In this session, I'm going to cover some of the patterns. I'm not going to go deep dive into exactly how you configure each one of the things, because this is oftentimes a very specific use case that requires a specific understanding of what it is that you're physically building, what's the business that you're serving. I am going to give you an overview of all the streaming patterns that takes place and where usually people make mistakes, just because it requires so much details and so much attention. Sometimes things can get confusing, especially when we're balancing multiple data infrastructure for multiple parts in the organization.

As you know, data is everywhere. Yes, data streaming exists. There's also a system that does batch. There's a huge part of analytics in different organizations, either if you're using a data warehouse or if you're using things to process your own data, like Apache Spark or things like that, but batch exists. On the other side of data streaming, there are the side of the applications. We're building microservices applications that also process data. Sometimes we leverage things like Kafka. We have applications that do produce events, applications that consume events. Sometimes we'll use Redis, sometimes we'll use other databases in order to maintain our data. When we think about it, data is everywhere. It's not only the data streaming. What happens to most of us is when we enter the data streaming world, we are already experienced engineers.

Most of the people I met that came into data streaming, come from a background either in batch or in the application space. This is where things get complicated because we are bringing the patterns and the things that we knew before, and we're trying to mimic a similar behavior in the data streaming. It's a great practice as an engineer, as the industry evolves, as our data architecture evolves. Sometimes it's also important to remember to forget things. That can be a little bit confusing, because we're doing a lot of work to gain that knowledge, but sometimes we do need to rethink what is the fundamentals, what are the basics that we're targeting, and how we can really build stream processing that is as efficient as possible.

Professional Background

I wrote two books for O'Reilly, and helped review more than 10 books for O'Reilly related to the data infrastructure. I'm a people manager. I'm very privileged to be able to manage a great team, brilliant team. In my background, I was a software engineer full-time. I'm coming from the big data world. I did a lot of batch processing. I did a lot of machine learning at scale. Recently, I joined Confluent, almost a year ago where I entered the streaming. This is why I'm sharing all my knowledge with you in order for you to also take the insights.

Challenges in Streaming

What are the challenges? There's a huge list of challenges that exist in streaming. Yes, we saw it before. There's the throughput. There's the scalability. There's the latency: it can be millisecond, it could be maybe 5 seconds. We are going to speak about only specific challenges that I saw people oftentimes get confused the most. The first one is exactly-once semantics. Exactly-once semantics is part of our different semantics that we're building. We were very good at building exactly-once semantics when our infrastructure is constructed of one platform that we need to manage. We do know that there is an end-to-end platform and there is a connecting tissue between the tools that we're using. This is where exactly-once semantics become even more challenging. We're going to talk about join operations.

Join operations in the analytics space if you're coming from a batch world, if you're coming from a DBA world, you know it's prevalent, it's everywhere. We're doing a bunch of joins operations. Yes, it takes longer. Yes, it requires dedicated strategies to optimize. We are bringing them into the stream space. That means we need to understand how join works when we have endless stream of events that comes in. It's unbounded. We're not looking at specific anymore. We need to be able to do that at a huge scale and the speed that the data comes in. We're going to talk about error handling and recovery. There are some new capabilities coming from the open source which I think is really exciting. If you're managing a cluster, or if you're a developer using a managed cluster, this is for you. Lastly, we're going to talk about guarding the gates, how we're building quality and security right at the beginning. These are the patterns that we're going to cover here.

Exactly-Once Semantics

Let's start with exactly-once semantics. Exactly-once semantics is one of our three different event guarantees. We usually discussed exactly-once, which means in processing and transaction exactly-once. At most once, I know I'm processing the transaction at most one time or at least once. I can process it multiple times as long as I'm processing it once. What happened in the past? In the past, we had something called Lambda architectures. Any people here that still or in the past managed Lambda architectures? In the past, we used to manage Lambda architectures. Why did we do that? We needed the data. We needed fast information and right in time information. We had the speed layer, but we couldn't trust the data that came out of the speed layer to be 100% accurate. While we had the speed layer, we constantly ran batch processing either in hourly or nightly in order to make sure and compare the two results between the stream processing and the batch processing.

While we were using it, we're still going back and making sure that this thing's still relevant for our data. Now, our technology evolved and we now know how to do better things than that. We moved into Kappa architecture. Kappa architecture usually is build, where you get Kafka cluster, you ingest data, ingest events. You have some stream processing engine, and you have that exactly-once or you want to have this exactly-once between these two systems. Each one of them knows how to create exactly-once for themselves. How does notion of time work in these systems? If we take a step back in order to create exactly-once, we need to sit on the building blocks of what is the fundamentals of stream processing. The fundamentals of stream processing are events coming into the system, state of the application, now managing a stateful application, is it a stateless application? Where do I live? What is time?

What is time is a really interesting question. I don't know if you ever got a chance to look at logs, to look at events, to see a timestamp and ask yourself, what does this timestamp represent? What do I know about this timestamp? It exists, but what is the meaning of that specific timestamp?

Usually, in the stream world, we'll look into four different timestamps that we can leverage. One is the event, when does the event occur? One is the storage, when did we actually write that event into an existing storage? Third one is ingestion, when did the event get ingested? Lastly, is the processing, when did we process that specific event? How we're going to pick between these four ways of looking into timestamp is going to determine how our system is going to work and our architecture. I'll give you an example. This is time and calculations. If I'm looking at event time, and if I'm looking at storage time, because my system is distributed, I'm looking at time that is immutable, it never changes. Event time occurred at one specific time. This event will never change. That is what it is.

It was written to disk, that is what it is, it was written at a specific time. Versus, a mutable system that is not a deterministic system, it's often harder to debug, harder to create any promises around that. This is when we're using time that comes from ingest. Why is that? It's because anyway, our ingest is a distributed system behind the scenes. Ingest could be depending on which partition ingested that specific data. Similar thing with processing, which one of my nodes actually processed the data, is going to get a different timestamp. One takeaway is sometimes the system will give you that flexibility to decide which timestamp is going to be the timestamp that you're going to use for your event guarantees. Strong recommendation, use event time or use storage time, probably event time, because you don't want to do the travel to a specific storage if you're building that, and work with this.

What is event? How do we look at event? When are events considered late? When are events considered on time? One of the challenges of exactly-once consistency is you want to be able to process the event at time, but sometimes events are late. If we compare event timestamp to the most current watermark, if we see that a timestamp specifically is smaller than the watermark, this is when we're saying that the event is late, because it's a timeline of the events. If the timestamp is bigger than the watermark, this is where we're saying the event is on time, it's within the window that we're expecting it to come into the system. As you're creating this new language for stream processing, please use these references of event is late or event is on time.

This is another example of how to see the different event timestamps. Sometimes bad things happen to good timestamps, because what happens, we get late events, as we can see there in the bottom, and we know we have to still exactly-once process them. How do we do it when we have the notions of windows? When we have the notions of windows, usually these would be time-driven or event-driven. We'll talk about it. They will allow some lateness for late arriving events, especially if we have exactly-once. It's going to be dependent by the type of window that we're processing. Here, we're going to dive a little bit deeper. Specifically, I'm going to use reference from the world of Flink, Apache Flink, just because it's going to make it a little bit easier for us to understand.

One notion of events is a tumbling window. A tumbling window is, I have the exact timestamp of where I'm reading. I'm reading 1 second. All the events are going to come inside this 1 second, so are going to be the ones that I'm processing. Then it moves really nicely along the way. The second one is sliding window. Sliding window, it's a little bit more complicated, because although it has a fixed amount of time where it lives, it still moves, but it moves either 1 second or 2 seconds, depending on how we're defining it in the beginning. This is great if you're doing events from GPS, geolocation, or anything like that, if you want to calculate the specific amount of devices, for example, that exists in a specific place, and process it.

The last one is considered the more advanced one, but often being used in systems where we see a peak of information coming in and out of the system, that's a session window. A session window will take all the events that it sees within a time range, which means there's a TTL. If it didn't get any events here, for example, up until half a second, it will close the windows. Once it starts getting more events past that idle time, it will open another window, which means it's willing to wait that idle time before it's going to start processing and opening another window again. This is great when we have peaks in our event where we're serving users. We know it's a specific time. We want to have the cluster always up. We also don't want to keep a window open forever or we don't want to process windows that gives us, at the end, empty results, because usually downstream will turn into null values that's going to harm our processing as well. If you are dealing with something like that, session windows can really help.

How do we put all these systems together? Each one of our systems today, if I'm looking at Apache Kafka or if we're looking at Apache Flink, each one of them has exactly-once guarantees by themselves. We know that the connecting tissue, usually in between, is where things can get messy. Here, for example, I have a data source that reads from Kafka. It can be any data connector that I'm building. Flink takes that information. There are some window aggregations taking place.

Then, data syncs back into Kafka and later on some applications read through it. In reality, because both of the systems are distributed systems, there's going to be multiple things happening at one time. This is where things become even more challenging because we want to be able to manage that and control it. While the scope of each one of them is going to work well, and Flink specifically used checkpoint mechanism under the hood, Kafka has different guarantees with the protocol, we need to be able to close the loop of that exactly-once across the system. This is one challenge where I see some of the people get confused a little bit building their own system.

There is a protocol that already exists that I would love for you all to know about it, and if it's something that could be relevant to you, please use it because it's already built in. That's the exactly-once two-phase commit. How exactly-once two-phase commit works. If I have my Kafka cluster, Kafka on its own has exactly-once. I'm configuring it to do exactly-once, and it's all good. Then, the Flink Job Manager, what it's going to do, it's going to inject checkpoints barrier. What it means is it's going to write a file with dedicated magic bytes in the beginning that tells it that this is going to be the beginning of the two-phase commit transaction. It's going to take that information after the data source got in, it could start processing, and it's going to save a snapshot offset into the state backend. The reason that it does it is because Flink under the hood for exactly-once leverage the checkpoint mechanism.

If there's any failure happening in the stream processing, it's going to go back and pull that information so we're guaranteeing that exactly-once. Here's where we're connecting Kafka to Flink at the beginning of the ingestion. What we do from the data source is we're passing that file, we're passing the responsibilities of the checkpoint barriers to the actual window processing. This is where we're connecting back to the windows that we just talked about before. After the windows, the windows again snapshot against the state backend, and it passes the checkpoints barrier to the data sink later on, for example, here to our Kafka. After the data sink, all the information is sending pre-commit transaction back to my Kafka cluster. In order to close the loop, my Job Manager is going to do a check-in on each one of the parts that are part of this transaction to make sure it's asking them if a checkpoint is complete or not. If Kafka has finished, it's exactly-once as well, and the other side is going to commit the external transaction back.

This is how, essentially, we're closing the loop where we have Flink that manages the protocol for us, and we're implementing it across. This implementation exists for Kafka. There's an interface in Flink. There are other implementations out there. If you're using something else and you do want to implement, there's a list of four or five functions that you will need to implement for each one of the steps that I just discussed. Essentially, what it means is checking that you have all the information and writing the special files that Flink needs in order to give us these guarantees.

If we sum it all up, we know we have the pre-commit, that issues a commit in the beginning. If any one of the pre-commit fails, rolling back, doing abort, leveraging the checkpoints mechanism. After everything succeeds, then we know for sure, we're guaranteed that it completely succeeds with exactly-once. This is how everything connects back to our system. This is how we're doing exactly-once across different systems. This is one of the patterns.

Join Operations

The second pattern is going to be join operations. This is again another pattern depending on which space we're coming from. We're coming from a more of a DBA world. We're coming from a more batch, distributed batch processing world. This is definitely going to impact the way we're mentally thinking about stream processing. There are two notions of join operations in a stream.

One is considered slightly simpler or easier to build for. The other one is considered a little bit more complicated. The first one is Stream: Batch. Let's assume I have a table of information and I have a stream, and I need to do a join operation between that stream and the table. It's very easy for me if I save the table in some state that it's fast to access. I can do the join and there's no problem at all when I'm processing that data because my table is static. This is a happy scenario. All you need to do when you build that is thinking, how big is the table? Do you need to do a lookup to the table or can you save it somewhere? Can you spread it across the nodes? Can you save it somewhere that has fast access? That's about it. If you're coming from the analytics, the batch processing space, it's very similar to that.

The second one is the more challenging one because here we have two streams. Each one comes into the system. We want to do some join operations on top of them based on some mutual information that they share, either a timestamp or some content of that information. Here sometimes we're saying, I want to do exactly-once. I want to make sure I'm joining all the events exactly-once when I'm processing this Stream: Stream. This is where things become really complicated because here's when we're trying to build different solutions in order to make it happen.

One of the things, it's a pattern, I really like to call it Puppies shelter. This is a shelter for events that never found their friends. For example, you can see here in the views, there's event D, there's event E, but there's no join result at the end. Both D and E are going to go to the Puppies shelter. We want to make sure these events are being adopted because we're doing exactly-once. What happens is we go back and we're trying to build more data pipelines, more data flow, try and ingest them again with different timestamps, doing all these back operations.

While all of these works, if we have to, maybe we can find a different solution and put it in some queue that we can process them separately. We can take care of these puppies, but also don't overload the systems with things that are usually unnecessary just to solve a very specific end case, because this is where Stream: Stream becomes even more complicated. If you're dealing with this thing, think about this cute little guy here and how you want to treat it, but make sure that you are not over-complicating a system that sometimes can be already complicated. You can see here how all the other events found their friends, even B, it's like at the end, it's going to get here, because at the end it did arrive.

Guarding the Gates - Data Integrity

Let's talk about guarding the gates, discussing data integrity, data quality, and some of the challenges that we have there. What is data integrity and how does that relate to data streaming? It's a common topic that we talk about in the data world, so how things connect. Data integrity for data streaming means it could be a physical integrity.

Oftentimes, we can check things. If it's really the data that's supposed to be there, this is usually the work of a data science later on. There's a logical integrity that we can do. We can check the types. We can check the range, things like that. There's also something called referential integrity that checks relationships between different tables. You won't get to these lost puppies, but at the end, something that we need to consider. This is where data cleaning and data validations is the real words that we need to use. If anyone tells you data integrity, go look in these things. These are the things that we need to take care of. We need to take care if there's no null data, we need to take care if there's no bad types in our data, and we need to make sure it's part of our architectures when we're doing streaming.

Any applications or any downstream processing is already leveraging what exists there. If I'll go for a second back to Kafka? Producer, Kafka brokers at the middle, consumers. At the bottom I have some Flink applications. What we can do is on the connecting tissue between the producer and the consumer, right over here, is where we can make sure that the data that flows over the system actually gets there at the right quality. What are the things that we can do today if we already have a Kafka cluster? We can do schema validation. We can do schema evolution. We can do input validation, really critical: type, format, range. There's also generic things that we can do, but this gives you a high-level understanding of the things.

There is one tool that exists for it, but you can build your own. Schema Registry is the tool where you can manage the schema throughout the data serialization process. It also helps you manage versioning. Sometimes for some people, versioning is critical for the schema, especially because we know the data that comes into our system from upstream sometimes change the versions all the time, and we need to be able to capture these changes and address them. We need to be able to validate that this is the right schema or block the data. This is another pattern that data streaming really needs to mature, in my opinion.

Then we need to take care of the interoperability. How does it look like in the architecture itself? If I have a Kafka producer and a Kafka consumer on different sides, and I'm connecting what I call the governor, someone who is going to govern around the rules and going to help me make decisions, I can give it the schema, and I can give it the rules that I want to enforce on top of this architecture. When my producer produces, it gets the schema. When my consumer consumes, it gets the schema. It also gets the rules that it has to comply. My data now has to comply. What happens here in the stream is instead of me building more applications to validate the data or only validating it downstream after I have multiple data streaming pipelines pushing data, I can push all these operations to this.

Already here, I'm stopping what needs to be stopped. That's really useful because what happens to us in organization is, usually we build thousands of pipelines, and thousands of consumers are going to consume our data, and we want to make sure it's in healthy shape.

How does it look like? This is a JSON file. I have my schema. I have my ruleset. I'm giving it some name, check device temperature, some IoT devices. The kind here is conditional, which means I'm going to give it a condition that it needs to follow. CEL, it's common expression language, something that Google created some years back. Then I'm giving it the expression, for example, message name shouldn't be null. What happens on failure, send it to a DLQ. Here I'm already saving, creating another application that I need to manage and maintain by creating this JSON file with the dedicated ruleset. Now, which DLQ topic, send it to some bad_device13, that later on I can go back and check these events and fix them as necessary, or roll them back, or just dump them if I wish to.

Another thing that happens to us is because the upstream data that comes in, especially in the stream processing, always changes the schema. That sometimes breaks our system downstream because we're very much dependent on the schema that comes in. We need to be able to make some conscious decision here. You are either blocking it and saying, we're not accepting this evolution, or we're accepting it and we're making the necessary changes, triggering the necessary alerts and so on. How do we do it? Similar thing, we have a JSON file. Here, this kind changes. It's not conditional anymore, it's transform. Transform, what type, CEL field.

For example, my expression here, if my name is full name and value is null, then construct this value from here, which means I'm allowing this transformation to happen to my schema, and I'm already filling it out with some value. This is very simple to a short if in Java. Otherwise, take the value. That helps me again not create more applications to manage that downstream. I'm already pushing it and loading it on top of my existing infrastructure. If I'm putting it together, I have my Kafka ingesting data, making sure I have this exactly-once guarantees. Then I have this quality rules and governance rules that help me manage the interactions between the different systems.

Error Handling and Recovery

We talked about guarding the gates. I have one more thing that I want to say around the challenges that we have, before I say a couple of words about AI. This is about errors that happen in our system. When we are building streaming systems, one of the biggest challenges that we have is failures and knowing where the failure originated. At the end of the day, we're going to get multiple logs. We're going to get some Jobserver that's going to give us some information.

Essentially, what happens, because Flink is part of the Apache Foundation, FLIP is essentially the Jira ticket that we're implementing in the open source. There's a new Jira ticket that currently we're working on, it's going to be released. This is Pluggable Failure Enrichers. Why that's critical for data streaming is because we want to be able to automate some processing around this failure, to either mitigate the failure in real-time or know who is the right person to wake up at night, or build some process that are healthier, around this stream processing. Like you see here is additional of labels. This is an interface when we are creating exception in our applications, we can be conscious and say, this is a user problem or this is an admin problem, which gives me a lot of power to decide what to do next with this exception.

Things like user pattern can be any arithmetic expression, something in my app specifically. Things that are more admin problem is heartbeat, I don't get an access to the data that I need, or something happened, disaster, I don't get any access, nothing works in the system. This is something that an admin can probably solve versus a user when we have some arithmetic failure, for example. Best practices when we're building stream processing is to make sure, as we're building these applications, we want to make our lives easier, hopefully, by being deliberate in giving the labels of what a failure here could be and what would that mean.

AI

Lastly, because sometimes our data infrastructure will need to support something like AI, or data processing for AI, or data modalities where we need to now not only support structured data but we need to also support unstructured data and things like that. I always have this, what I think is like a funny meme. Sometimes we bring AI into places where we don't need it, but if we're building data architecture, we should be ready. AI jobs are on the rise. There's a very interesting connection between AI and data streaming applications that I've seen across our customers.

Things like fraud detection with all the banks and security organizations. Recommendation system, we want to be able to recommend product or recommend a next show that you should watch in real-time. Dynamic personalization usually goes together with recommendation. Predictive maintenance, if anything is going to happen, we want to be able to predict it. IoT-driven insights. I met a mining company, they don't do crypto mining, they actually do physical mining. IoT-driven insights is critical for them because everything that happens in the mines, there's no people monitoring everything, there is robots and everything, there's IoT. Sensors that send signals and the insights from there is critical for them to get it in real-time, because if anything happens with the carts in the mining, every minute could cost them millions of dollars. It's really critical for them. Supply chain or any real-time optimization that happens.

Where real-time comes into place and how our architecture changes as we go. We have data ingestion. Oftentimes, this already exists. Sometimes, our schema is going to change. What we do need for machine learning is we need to make sure that data validation is in the right format, the right size, and something that our system can process. Going back, if there's something I can overload onto my Kafka cluster, it's going to save me a lot of time and a lot of effort in that space. Also, if the schema changes, if I can control it and decide that this event is getting blocked early on, it's going to be better for me. On the processing layer, it's probably a similar approach of the state management.

Sometimes, we will need access to dedicated hardware to process some of the requests. We will need sometimes here dedicated storage. Sometimes, especially with GenAI and LLMs, there is a notion of short memory and long memory. Short memory is what I got right now from the user. It should be very fast to access memory because I want to maintain the context. Long memory means that there's a longer conversation, things that I'm learning about my user from my interaction, and every once in a while, I want to load it back. I don't necessarily need it accessible all the time. This is where the processing layer together with the storage becomes really interesting where you have different type of storage. You can imagine hot storage and later on cold storage or mid-tier storage. All the retry, recovery logic, checkpointing becomes really critical. Here, for example, with Flink. Flink is stateful, has a state management, what we call a built-in storage that it can tap into, which could become my short-circuit knowledge if I need that.

Lastly, the AI integration, the model itself. Sometimes what we'll do is we'll serve the model through REST API, and we want to be able to connect it to the rest of the system. We want to build an application where we get our input from the user, we're sending it to some model, maybe we're enriching it with existing data. This is sometimes what we call RAG pattern, Retrieval Augmented Generation. We're enriching the data so later on when we're sending it back to the model, the model has the context and the information it needs to give us an answer that hallucinates less. That's usually going to be part of my stream processing pattern, just because this is where I'm combining my batch and stream, oftentimes, of the information. Here I can connect to a REST API and have my processing layer, for example, Flink.

Flink has Flink AI capabilities where I can register my model together with Flink through a REST API. I'm going to give it the URI. I'm going to give it the credentials that it needs in order to connect, and make it happen for me and call those existing applications that enables that. Under the hood, similar thing that always we have in the data space. We need to be able to monitor. We need to be able to observe. One thing that happens in machine learning is feedback loop. Feedback loop helps me fine-tune my model. This is something that I want to be able to do as close to real-time as possible if I can.

Usually here, there will be a separate data pipeline that tracks the actual response from the user, if the user is happy or not. Sometimes you'll see things like thumbs up, thumbs down. Or if we're doing predictive, like we're predicting what's going to be the actual time of one of my rides, for example, it took 15 minutes, but we predict it's going to be 10 minutes. There's going to be a feedback loop that continuously feeds my algorithm back so we can fine-tune and improve that. It's going to be a separate pipeline. It's going to be part of the monitoring and observability of my machine learning system.

Takeaways

One of the things that we talked about was an overview of all the different problems that you can imagine in the data streaming, and touching on a little bit of what's coming down the road for machine learning as well. We talked about data quality and DLQ, and putting that governance where we need in order to eliminate challenges downstream as much as possible. We talked about exactly-once end-to-end. We talked about why join-join is hard, and why the Puppy shelter, think always about the Puppy shelter, what the event is going to be, either out of order or sometimes not find a friend somewhere. What do you do when you are coding that join-join in your Flink application? Make a conscious decision about that early on. Speak with your product manager and make sure they're on board. How to deal with healthy errors, so we can all work together as a team, the infrastructure and the application developers. We already talked about join operations that requires strict planning.

Questions and Answers

Participant: In the streaming system, if you want to do AI integration, you said we can do like a REST API call. That slows down the processing, because in some of our use cases, we have integrated with AI models, and we see a lot of latency because the model can take time, and our stream can have a lot of backpressure. Is there any capability in this system which can be leveraged to speed the processing?

Polak: Yes, because we're connecting a stream processing with either probably a REST API or some microservice that lives somewhere, there's a question about rate limiting, how many requests can I send the server, and also what's the size of the request, what's the size of the context? Sometimes, depending on the machine learning that we're building, we have to go through an embedding process. That embedding process can take longer than we anticipate. It's a little bit about, our stream processing works really well, but then we're adding this new capability that needs to be able to operate at the same speed as we wish.

Then the question is, where do I get my models from? Are my models internal or something that I deploy locally? I took one of the open-source models that exist today in the industry. I was able to fine-tune it and get it to the level that I need. Then I'm hosting it very closely to my stream processing, and then it's a question of time travel. What's the request and what's the size? Or, am I using existing models like OpenAI, or Claude, or Gemini, or things like that? This is where I'm essentially dependent on somewhere else. Some of the solutions that I've seen that people are building, and that's again a use case decision to make, is building caching in your application.

If you have some information about this customer, some response that you can give them back to help them wait a little bit longer, but they still feel like they're engaged, this is one solution that you can do. You kind of prompt them, if that's a chat, for example, we're reaching out, we're checking this, kind of a list of events that can keep them engaged. Or if it's something that is more predictive, you're going to return a number. For example, maybe you have some caching information about previous or similar calculations that you did in the last couple of minutes, and then you can bring it back from your hot memory, instead of sending it to the model. You have a caching system right in there that validates these reads. I think with the generative AI part, it becomes even more challenging, especially around time and latency, because we have to think smarter about how we give something to the user, knowing that there is a person there anticipating a response, while our package is going and traveling to the model and back.

 

See more presentations with transcripts

 

Recorded at:

May 08, 2025

BT
OSZAR »