Scalable Customer Lifecycle Intelligence: A Hybrid MLOps Framework Integrating Automated Machine Learning with Enterprise Data Warehousing

by FormulatedBy | Technology


Optimizing SaaS Customer Lifetime Value (CLTV) requires continuous collection and monitoring of  customer behavior data. Traditional data science pipelines lack scalability due to manual feature  engineering and misalignment between transactional and analytical systems. This article  introduces a Hybrid AI architecture to operationalize predictive modeling enterprise-wide. The  proposed end-to-end architecture connects an enterprise data warehouse (Snowflake) and an  advanced machine learning sandbox back to the source of truth using a recurring transfer tool and  Delta Sharing.

By leveraging AutoML for model creation and a centralized Feature Store for  accuracy, the system mitigates data leakage, concept drift, and scalability issues. The pipeline  fully automates bi-weekly retraining and daily inference with guaranteed freshness. This framework  has led to a 15% increase in upsell identification, a 92% reduction in manual modeling overhead,  and processed over 500,000 daily records with sub-hour latency. 

Introduction 

The customer journey maps how customers interact with us over time. It spans purchase, onboarding, first value realization, expansion or renewal, and churn risk or actual churn. Each of these stages leaves a different data trail of hidden signals that, if decoded early, can forecast the health and future direction of the customer relationship. For example, a user who has bought a product but has not started a first project within 30 days of activation is likely a churn risk. Conversely, a customer in the Value Realization stage who is logging into advanced modules to manage usage may be an ideal candidate for an upsell.

Figure 1: Customer Lifecycle Stages. 

1.1 The Signal Extraction Challenge 

Extracting these signals at scale is hard. Today, teams typically address the problem ad hoc, with analysis done by data scientists working in silos. This creates several issues:

Resource Scarcity: Each lifecycle stage needs its own specialized models, and they must be built at scale. Doing this by hand would require far more data scientists than is practical.

Data Latency: Behavioral data takes a long time to reach the modeling environment because the transactional and analytical systems do not work well together.

Training-Serving Skew: Features built during training (for example, in a Jupyter notebook) are often computed differently from the features built when the model is actually serving predictions. This mismatch degrades model performance in production and, like data latency, stems from systems that are not properly integrated.

1.2 The Hybrid Framework Solution 

To address these problems, this article presents a scalable MLOps framework, the Hybrid Framework, that combines Automated Machine Learning with enterprise data engineering. It uses Snowflake as the secure system of record and Databricks for compute-intensive processing. The rest of the article describes the framework in detail.

2 System Architecture 

Our design principle is to separate the Storage Layer, which we call the System of Record, from the Processing Layer, which we call the Intelligence Engine. This keeps our data safe while letting complex computation run quickly, without slowing the database down with heavy analytical work.

1. The System of Record is the Storage Layer. It is the final word on what our data is, and it is designed to keep data safe as changes are made. It is meant for safe storage, not for experimentation.

2. The Intelligence Engine is the muscle of the system: the Processing Layer that handles machine learning, big data analytics, and real-time transformations. It scales up under heavy load and scales down when demand is low, which keeps large analytical requests from overloading the database.

3. The Bridge, also called the High-Frequency Transfer Utility, is the connection that moves data quickly from the System of Record to the Intelligence Engine so it can be used for analytics.

Figure 2: High-Level Hybrid Architecture. 

2.1 The Data Transfer Utility 

A transfer utility moves data between the two systems on a recurring schedule (hourly or daily), which shortens the time it takes to understand what is happening. The data is written to Databricks Delta tables.

2.2 Data Schema Design 

To keep the machine learning pipeline reliable, we follow strict rules for naming and organizing data, from the source (Snowflake) through to processing and consumption (Databricks). Being explicit about how data is organized lets us automate many tasks and makes it easier for data engineers and machine learning teams to work together.

2.2.1 Snowflake Schema (Source) 

Snowflake is the data source for the ML lifecycle, storing raw, pre-processed, and final output data.  Tables are prefixed with customer_playbook_{project}.

Naming Convention | Role in ML Lifecycle | Description
customer_playbook_{project}_train | Training Data | Comprehensive historical dataset (features, labels, metadata) for stable model training and validation.
customer_playbook_{project}_inference | Incremental Scoring Data | Recent, incremental data batches for real-time/near-real-time model scoring.
customer_playbook_{project}_output | Model Predictions | Final destination for model results (scores, identifiers, confidence) for downstream use and reporting.
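The naming convention above lends itself to a small helper so that jobs never hard-code table names. The following is a minimal sketch; the function name `playbook_tables` and the example project `churn` are hypothetical and only illustrate the pattern.

```python
def playbook_tables(project):
    """Return the Snowflake table name for each role in the ML lifecycle."""
    prefix = f"customer_playbook_{project}"
    # Roles follow the convention in the table: train, inference, output.
    return {role: f"{prefix}_{role}" for role in ("train", "inference", "output")}


# "churn" is a made-up project name used only for illustration.
tables = playbook_tables("churn")
print(tables["train"])
```

Deriving every name from one function keeps the training, inference, and output jobs aligned if the prefix ever changes.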

2.2.2 Databricks Bronze/Silver Schema (Target) 

Snowflake data syncs to Databricks, where Spark performs feature engineering, transformation,  and inference. Processed, standardized data is stored in the prod_silver_dts_data schema (our  Lakehouse’s key layer), with tables optimized for Spark. 

Naming Convention | Role | Description
{project_name}_feature_store | Centralized Feature Repository | Core repository for all cleaned, transformed, and engineered features, ensuring consistent use for training and inference, preventing drift.
{project_name}_inferred_output | Inference Job Results (Pre-Sync) | Temporary Databricks storage for inference output before validation and synchronization to the customer_playbook_output table in Snowflake.

3 Feature Engineering Strategy: A Foundation for  Reliable Predictive Modeling 

A sound feature engineering strategy is essential for reliable predictions, and it requires changing how features are built in Machine Learning Operations. Today, features are created in many different, uncoordinated ways. We want a single place that holds all our features: a Feature Store. One major problem is that features are not always correct as of the time they represent, which is called the “Point-in-Time” problem. A disciplined feature engineering strategy and a Feature Store are key to models that perform well in production.

3.1 Point-in-Time Correctness: Preventing Future Peeking 

Point-in-Time correctness is essential for models we can trust. When we build training data to predict a future outcome, every piece of information we use must be something we would have known at that time. To enforce this, every feature is tied to a reference date, the as_of_date timestamp. For example, days_since_last_login is computed relative to the as_of_date, not the current system date. The resulting features are a snapshot of what we would have known at prediction time.
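As an illustration of the as_of_date idea, the sketch below computes days_since_last_login from a reference date rather than the system clock. The function signature and the sample login dates are assumptions made for this example, not the article's implementation.

```python
from datetime import date


def days_since_last_login(login_dates, as_of_date):
    """Days between as_of_date and the latest login on or before it.

    Logins after as_of_date are ignored so the feature never peeks into
    the future. Returns None when no login is visible at as_of_date.
    """
    visible = [d for d in login_dates if d <= as_of_date]
    if not visible:
        return None
    return (as_of_date - max(visible)).days


# Made-up login history for one customer.
logins = [date(2024, 1, 5), date(2024, 1, 20), date(2024, 2, 10)]

# Snapshot as of Feb 1: the Feb 10 login is not yet known.
feature = days_since_last_login(logins, date(2024, 2, 1))
print(feature)
```

Because the function never reads the current date, rerunning it later for the same as_of_date yields the same training row.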

3.2 Leakage Detection Protocols 

Every candidate feature must pass a data-leakage check, which acts as quality control, before it enters the Feature Store. The check includes:

1. Timestamp Validation (PIT Check): We verify that the timestamps of the events a feature is derived from are never later than the target label’s as_of_date. For example, we do not want to use a feature that comes from an event that happens after the target event.

2. Correlation Analysis (Suspicious Feature Check): We look for features with very high correlation or high mutual information, because such features can act as a proxy for the label we are trying to predict. For instance, a feature that only exists after the target event is suspicious and needs to be investigated. Such features usually have to be excluded.

3. Availability Checks (Historical Consistency Check): We verify that a feature is consistently available for all the entities we care about across the training window, so that values are not missing for some groups because of process changes.
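Two of these checks are easy to express in code. The sketch below shows one possible form of the timestamp validation and the availability check; the row layout (`source_event_date`, `as_of_date`), the feature name `renewal_flag`, and the function names are hypothetical.

```python
from datetime import date


def pit_violations(feature_rows):
    """Return rows whose source event is later than the label's as_of_date."""
    return [r for r in feature_rows if r["source_event_date"] > r["as_of_date"]]


def availability(values):
    """Fraction of non-missing values for a feature over the training window."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


# Made-up rows: the second one is derived from an event after as_of_date.
feature_rows = [
    {"feature": "days_since_last_login",
     "source_event_date": date(2024, 1, 20), "as_of_date": date(2024, 2, 1)},
    {"feature": "renewal_flag",
     "source_event_date": date(2024, 2, 15), "as_of_date": date(2024, 2, 1)},
]

leaky = pit_violations(feature_rows)
coverage = availability([3, None, 7, 2])
print([r["feature"] for r in leaky], coverage)
```

A feature with any violation, or with coverage below whatever floor the team sets, would be held back from the Feature Store for review.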

3.3 Normalization and Drift Management 

To guarantee consistent and reliable behavior between the model’s training phase and its subsequent inference phase, a critical aspect of production ML, all numeric features undergo standardization. Specifically, we employ Z-score normalization:

z = (x − µ) / σ

The statistics (µ and σ) must be computed only on the training data and stored as immutable metadata in the Feature Store. During inference, these stored training-time µ and σ are applied to live data. This is crucial to prevent “training-serving skew” and to ensure the model receives data scaled against the exact distribution it was trained on, maintaining predictable production performance.
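The fit-once, apply-later pattern can be sketched as follows. This is a minimal illustration, assuming the population standard deviation and a plain dictionary standing in for the Feature Store metadata; the function names are hypothetical.

```python
import statistics


def fit_scaler(train_values):
    """Compute mu and sigma on the training data only."""
    return {
        "mu": statistics.fmean(train_values),
        # Assumption: population standard deviation; a sample estimate also works
        # as long as the same stored value is reused at inference.
        "sigma": statistics.pstdev(train_values),
    }


def transform(x, stats):
    """Apply the stored training-time statistics: z = (x - mu) / sigma."""
    if stats["sigma"] == 0:
        return 0.0  # constant feature: no spread to scale against
    return (x - stats["mu"]) / stats["sigma"]


# Fit once at training time, then persist `stats` alongside the features.
stats = fit_scaler([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# At inference, live values are scaled with the stored stats, never refit.
z_live = transform(11.0, stats)
print(stats, z_live)
```

Refitting µ and σ on live data would silently shift the scale the model sees, which is exactly the skew the stored statistics are meant to prevent.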

4 The Hybrid Workflow for Model Development and  Deployment 

Standardized features help make sure the model is working with the right kind of data. Our methodology, the Hybrid Workflow for Model Development and Deployment, keeps a human in the loop: it combines the speed of AI for model generation and hyperparameter optimization with the expertise of humans. AI handles the heavy computation so data scientists can focus on defining problems, refining features, interpreting outputs, and planning deployment strategies. This lets us work fast while maintaining quality and fair governance.

4.1 Automated Model Generation (AutoML) 

Automated Model Generation (AutoML) is part of this Hybrid Workflow. Our framework uses Databricks AutoML to create models, so we do not have to select and tune algorithms by hand. It works fast and tries out many different models at the same time to find the best ones. The main models we use are:

• XGBoost: High-performance for structured data, used as a benchmark.

• LightGBM: Optimized, distributed gradient boosting for speed with large datasets.

• Logistic Regression: An interpretable baseline ensuring simplicity and stability.

Each model is evaluated against a single, pre-defined priority metric (e.g., metric_priority:  “AUC_PR”). This strategic automation enables the rapid prototyping and deployment of highly  optimized models, drastically reducing the weeks of dedicated data science labor typically  required. 
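Selecting on a single priority metric reduces to a one-line comparison. The sketch below assumes each candidate is a dictionary with a `metrics` mapping; the function name and the scores are illustrative only and are not results from this article.

```python
def pick_best_candidate(candidates, metric_priority="AUC_PR"):
    """Return the candidate with the highest value of the priority metric."""
    return max(candidates, key=lambda c: c["metrics"][metric_priority])


# Illustrative scores only, not measured results.
candidates = [
    {"name": "xgboost", "metrics": {"AUC_PR": 0.61}},
    {"name": "lightgbm", "metrics": {"AUC_PR": 0.64}},
    {"name": "logistic_regression", "metrics": {"AUC_PR": 0.55}},
]

best = pick_best_candidate(candidates)
print(best["name"])
```

Fixing the metric up front keeps the comparison between algorithms unambiguous and repeatable across runs.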

4.2 The Model Registry & Promotion Logic 

The MLflow Model Registry serves as the single, authoritative source for all models, ensuring  reliability and governance through a strict four-stage lifecycle: 

1. Experimentation: AutoML generates “Candidate” models, automatically tracking all runs,  parameters, and metrics. 

2. Staging: The single best-performing Candidate model moves to “Staging” for final, pre-production assessment.

3. Validation: Staging models must pass an automated validation job against a held-out test  or shadow dataset, ensuring robust generalization. 

4. Production: Promotion is conditional: the Staging model must outperform the current  Production model on validation metrics, or be the first model for the task. 
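The promotion rule in the final stage can be written as a small gate. This is a sketch under stated assumptions: a single validation score where higher is better, and a hypothetical function name; it does not call the MLflow API.

```python
def should_promote(staging_score, production_score=None, passed_validation=True):
    """Decide whether a Staging model may replace the Production model.

    production_score is None when no Production model exists for the task.
    """
    if not passed_validation:
        return False  # the automated validation job must pass first
    if production_score is None:
        return True   # first model for the task
    return staging_score > production_score


# Illustrative scores only.
print(should_promote(0.66, 0.64))                 # challenger beats champion
print(should_promote(0.60, 0.64))                 # champion stays
print(should_promote(0.60, None))                 # no champion yet
print(should_promote(0.90, 0.64, passed_validation=False))
```

Using a strict comparison means a tie keeps the current Production model, which avoids churn between equally good versions.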

Scheduled Databricks workflows keep the system running reliably and make sure the predictive model performs well over time. These automated pipelines are what make the model work in production.

Figure 3: Automated Training and Evaluation Pipeline.

5 Implementation & Orchestration 

The system’s operational heartbeat is maintained through a highly reliable and scheduled series of  Databricks workflows, designed to ensure both timely inference and continuous model  performance improvement. These automated pipelines are the backbone of how the predictive  model is deployed and managed in production. 

5.1 Daily Inference Pipeline 

The wf_{project}_inference_daily job generates daily predictions by processing only incremental  data for efficiency. The process involves five steps: 

1. Incremental Data Ingestion: The Transfer Utility fetches new/updated rows from  Snowflake using timestamp columns and stages the data in Databricks. 

2. Production Model Retrieval: The approved production model is loaded from the MLflow Model Registry (models:/marketing_{project}_model/Production) to ensure consistency.

3. Prediction Generation: The scoring engine uses the fresh data and production model to generate predictions (labels), probabilities (confidence scores), and model version metadata for traceability.

4. Results Publication: The scored results and metadata are synced back to designated  tables in Snowflake. 

5. Consumption: The structured data is immediately available to downstream business  systems like Salesforce/CRMs to inform marketing and sales strategies. 
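The incremental ingestion in the first step is commonly implemented with a timestamp watermark. The sketch below shows that idea in plain Python; the column name `updated_at`, the function name, and the sample rows are assumptions, and the real Transfer Utility may work differently.

```python
from datetime import datetime


def incremental_rows(rows, last_watermark, ts_column="updated_at"):
    """Return rows changed after the previous run, plus the new watermark."""
    fresh = [r for r in rows if r[ts_column] > last_watermark]
    # If nothing is new, keep the old watermark so no window is skipped.
    new_watermark = max((r[ts_column] for r in fresh), default=last_watermark)
    return fresh, new_watermark


# Made-up source rows.
rows = [
    {"customer_id": 1, "updated_at": datetime(2024, 3, 1, 8, 0)},
    {"customer_id": 2, "updated_at": datetime(2024, 3, 2, 9, 30)},
    {"customer_id": 3, "updated_at": datetime(2024, 3, 3, 7, 15)},
]

fresh, watermark = incremental_rows(rows, last_watermark=datetime(2024, 3, 1, 12, 0))
print([r["customer_id"] for r in fresh], watermark)
```

Persisting the returned watermark after each successful run is what lets the next run fetch only rows it has not yet scored.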

5.2 Bi-Weekly Training Pipeline 

To combat model drift and maintain high predictive accuracy, a dedicated biweekly training and  promotion job, wf_{project}_training_biweekly, is executed every 14 days. 

The four phases are:

1. Refresh: All historical data, including features and target labels, is updated so that the training set is complete and current.

2. Retrain: The AutoML process trains many different candidate models on the refreshed data, such as decision trees and neural networks, and selects the best performers based on the standards we have set.

3. Evaluate & Promote: New candidate models are tested rigorously against the model currently in use, on metrics such as AUC, precision, recall, and stability. If a new model clearly outperforms the current one and meets all the rules, it is automatically moved to Staging in the MLflow Model Registry, one step away from Production.

4. Notify: The system generates a report covering how the best model performed, the champion-versus-challenger outcome, and whether a new model was promoted. The report is posted to the project Slack channel so everyone can see it right away.

5.3 Scheduling and SLAs 

The prediction system is dependable because it operates under explicit Service Level Agreements (SLAs) that keep our data up to date. These SLAs matter because business teams need information on time, which affects how well sales and marketing do their jobs. The daily inference job runs before the business day starts (for example, 6:00 AM CST), so decision-makers and automated systems have up-to-date information. Monitoring tools track when pipelines start and end; if something fails or runs long, they send alerts right away, which helps us fix problems quickly and limit the impact on the business.

Table 1: Operational SLAs and Schedule 

Job Type | Frequency | Time (UTC) | SLA
Inference | Daily | 3:00 | < 60 min
Training | Bi-Weekly | 2:00 | < 120 min
Stats/Metrics | Weekly | 4:00 | < 30 min
Data Sync | Hourly | – | < 15 min
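The SLA column in Table 1 can be checked mechanically from job start and end times. The following sketch encodes the table's limits in a dictionary; the job keys, function name, and sample timestamps are hypothetical.

```python
from datetime import datetime

# SLA limits in minutes, taken from Table 1 (a job must finish in less than this).
SLA_MINUTES = {"inference": 60, "training": 120, "stats_metrics": 30, "data_sync": 15}


def sla_breached(job_type, started_at, finished_at):
    """True when the job's runtime reaches or exceeds its SLA limit."""
    elapsed_minutes = (finished_at - started_at).total_seconds() / 60
    return elapsed_minutes >= SLA_MINUTES[job_type]


# Made-up run: inference started at 3:00 UTC and took 70 minutes.
breach = sla_breached("inference", datetime(2024, 3, 1, 3, 0), datetime(2024, 3, 1, 4, 10))
print(breach)
```

A monitoring job could run this check after each pipeline finishes and raise an alert whenever it returns True.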

Table 2: Operational Performance Comparison

Metric | Manual Process | Hybrid Framework
Model Dev Time | 3-4 Weeks | 2-3 Days
Data Latency | 24+ Hours | < 1 Hour
Feature Reuse | 0% (Siloed) | 100% (Shared)
Deployment Freq. | Quarterly | Bi-Weekly

6 Operational Observability 

In order to maintain stakeholder trust, we developed a robust operational observability layer. This  system offers a real-time, 360-degree view of model health and performance, enabling proactive  anomaly resolution. Immediate, pervasive notifications are integrated via Slack, our central  communication hub. 

6.1 Drift Detection 

The model depends on stable input data to make good predictions, so we check for drift continuously. We compare statistics for the most important features at training time and at serving time, including the mean, the standard deviation, and the skew. The system then computes a score that measures how different the live data is from the data the model was trained on. If the score for any monitored feature exceeds a set threshold, the system sends out a Drift Alert right away. The MLOps team is also notified about system problems, such as pipeline failures, anomalous behavior, or instrumentation errors. This stops predictions from degrading without anyone noticing, which protects the quality of decisions.
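The article does not name the drift score it uses, so the sketch below substitutes one simple option: the shift in the live mean measured in training standard deviations. The threshold of 0.5, the feature `projects_created`, and the function names are assumptions for illustration.

```python
import statistics


def drift_score(train_mean, train_std, live_values):
    """Absolute shift of the live mean, in units of training standard deviation."""
    live_mean = statistics.fmean(live_values)
    if train_std == 0:
        return 0.0 if live_mean == train_mean else float("inf")
    return abs(live_mean - train_mean) / train_std


def drifted_features(train_stats, live_data, threshold=0.5):
    """Names of monitored features whose drift score exceeds the threshold."""
    alerts = []
    for name, stats in train_stats.items():
        score = drift_score(stats["mean"], stats["std"], live_data[name])
        if score > threshold:
            alerts.append(name)
    return alerts


# Made-up training statistics and live samples.
train_stats = {
    "days_since_last_login": {"mean": 5.0, "std": 2.0},
    "projects_created": {"mean": 3.0, "std": 1.0},
}
live_data = {
    "days_since_last_login": [8.0, 9.0, 10.0],
    "projects_created": [3.0, 2.5, 3.5],
}

alerts = drifted_features(train_stats, live_data)
print(alerts)
```

A mean-shift score only catches changes in location; comparing the standard deviation and skew as well, as the text describes, would need additional checks of the same shape.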

6.2 Alerting Mechanisms 

The platform monitors itself through a tiered alert system that sends real-time updates to dedicated Slack channels, so people can quickly figure out what to do.

#ml-model-stats: Posts detailed, quantitative reports (AUC, Precision, Recall, F1-Score,  confusion matrices) after every training/validation run for data scientists to track  performance and stability. 

#ml-platform-alerts: Critical operations channel for system health incidents and failures  (e.g., missed SLA latency, feature pipeline crashes, data sync errors, deployment failures),  requiring immediate action from MLOps and engineering. 

Successful training generates a comprehensive notification payload including:

• Leaderboard Summary: Compares the top model’s performance to competing AutoML algorithms.

• Model Version/Artifact Hash: Unique ID of the new, validated, containerized production model (e.g., v3.1.5-prod).

• Quantitative Improvements: Clear metrics showing the new model’s gain over the production version (e.g., “+0.02 AUC vs previous version,” “25% reduction in False Positives”).

• Resource Utilization: Compute cost/optimization metadata from the training run.
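The four payload components listed above could be assembled into a structure like the one below before being posted. This is only a sketch: the field names, the function name, and the sample values are hypothetical, and the code builds the payload without sending it to Slack.

```python
import json


def build_training_notification(leaderboard, model_version, improvements, resource_utilization):
    """Assemble the post-training report as a JSON-serializable dictionary."""
    return {
        "channel": "#ml-model-stats",
        # Best-scoring algorithm first, so the top model leads the summary.
        "leaderboard_summary": sorted(leaderboard, key=lambda r: r["score"], reverse=True),
        "model_version": model_version,
        "quantitative_improvements": improvements,
        "resource_utilization": resource_utilization,
    }


# Illustrative values only, not measured results.
payload = build_training_notification(
    leaderboard=[
        {"algorithm": "logistic_regression", "score": 0.70},
        {"algorithm": "xgboost", "score": 0.80},
    ],
    model_version="v3.1.5-prod",
    improvements=["+0.02 AUC vs previous version"],
    resource_utilization={"training_minutes": 45},
)

message = json.dumps(payload, indent=2)
print(message)
```

Keeping the payload as structured data, rather than a preformatted string, lets the same report feed both the Slack message and any downstream dashboard.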

7 Results and Business Impact 

The strategic deployment of this hybrid, automated machine learning framework has delivered a transformative impact on how the organization leverages its customer data across the entire customer lifecycle, shifting it from reactive analysis to proactive, predictive engagement.

7.1 Operational Metrics 

The shift to an automated, framework-driven MLOps paradigm has vastly improved efficiency and  reduced operational load. 

Faster Deployment: Standardized MLOps pipelines cut model deployment time by over  80%, from weeks to a single day. 

Feature Reuse: The centralized Feature Store enables instant reuse of validated features,  halving feature preparation time and eliminating redundant code. 

Fresher Models: Automated bi-weekly retraining significantly reduces the average age of models in production.

7.2 Business Outcomes 

The framework is making a measurable difference in important metrics. We piloted it in a test project focused on opportunities in the “Value Realization” phase, and the results were strong. On prediction accuracy, the model scored 0.89, which indicates a strong ability to tell classes apart. It also reduced false positives by 22 percent compared to the old system, which helped the sales team focus on the right leads. As a result, lead-to-sale conversion went up by 15 percent for some accounts, which meant additional revenue for the company. The model is also stable and scalable: it processes over 500,000 customer records every day without problems.

8 Conclusion 

Scaling SaaS companies face a rapidly growing volume of customer data, more than traditional analysis can handle. The Hybrid AI Framework addresses this problem by helping us get the most value out of that data. We used Snowflake to store our data and Databricks AutoML to build models quickly. The combination worked well and produced a system that is efficient and can handle large volumes of data.

Author: Shishir Tewari, Senior Manager, Data Engineering, Procore Technologies  
