Machine Learning and Batch Processing on the Cloud — Data Engineering, Prediction Serving and Everything in Between

Sreedhar Radhakrishnan
Dec 15, 2020

During the fall 2020 semester at Carnegie Mellon University, we had the opportunity to take the 17–400/700 Data Science and Machine Learning at Scale class offered by Professor Heather Miller and Professor Michael Hilton. The class included four programming assignments where we worked on high-volume ETL and distributed machine learning, primarily using PySpark, Spark MLlib, TensorFlow and Databricks notebooks.

The most exciting part of the course, however, was the project, where we analyzed over a billion NYC taxi rides on the cloud by building an end-to-end machine learning pipeline on AWS and deploying the model using Cortex as our prediction server. Here’s our journey!

Problem Motivation and Data

We decided to tackle the task of analyzing NYC taxi data and predicting trip duration for both green and yellow cabs. One of the main reasons we chose this problem is the growing need for efficient, optimized scheduling in the mobility and logistics industry. In addition, the original dataset was around 150 GB in size, a scale appropriate for the project. The dataset was spread across multiple CSV files, each containing data for a single month, and we performed our analysis on data from 2013 to 2020. We used the following GitHub repository to understand how to access and process the data: https://github.com/toddwschneider/nyc-taxi-data

Prior to diving into our data engineering process, we performed exploratory data analysis on a small sample (one month of data) to understand the dataset and get a sense of potentially problematic columns.

Dataset Schema

We can observe that columns such as tip amount, payment type and trip type are completely filled with NULL values. This helped us prepare better for our Data Engineering and ETL process.
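For reference, this kind of null-count check takes only a few lines of PySpark. The sketch below is illustrative: the file name and the way the sample is loaded are assumptions, not our exact notebook code.

```python
# A minimal sketch of the per-column null count we ran during EDA.
# The CSV file name is illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyc-taxi-eda").getOrCreate()

sample_df = spark.read.csv("green_tripdata_2019-01.csv",
                           header=True, inferSchema=True)

# Count NULLs per column to spot fields that are completely empty.
null_counts = sample_df.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in sample_df.columns
])
null_counts.show(vertical=True)
```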

System Design

AWS Cloud Native System Design

We decided to use Amazon Web Services (AWS) as our cloud service provider. One of the main reasons for this decision was that the NYC cab dataset was already hosted in an S3 bucket on AWS, and the potential cost savings from avoiding data transfer across cloud providers was a huge plus. In addition, AWS EMR with Jupyter notebooks provided a scalable platform for data engineering and distributed machine learning code in PySpark. Finally, since we were also deploying our model, AWS EKS was always on our mind; this turned out to be helpful when we were setting up Cortex, our prediction server.

ETL and Feature Engineering

We first copied the data from the public S3 bucket onto the Hadoop Distributed File System (HDFS) of our EMR cluster. We then wrote a PySpark ETL program that reads the data from HDFS, transforms it and loads it into our own S3 bucket. We kept our compute in the us-east-1 region on AWS, since that is where the dataset is stored.

The transformations primarily included removing all rows with null values, extracting time information (such as pickup hour and dropoff hour) and ensuring consistency with respect to the schema.
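A minimal sketch of this ETL step is shown below. The paths, bucket name and column names are illustrative rather than our exact configuration.

```python
# A sketch of the PySpark ETL: read from HDFS, clean, derive time columns,
# and load into our own S3 bucket. Paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyc-taxi-etl").getOrCreate()

raw = spark.read.csv("hdfs:///data/yellow/*.csv", header=True, inferSchema=True)

clean = (
    raw.dropna()                                                  # remove rows with null values
       .withColumn("pickup_hour", F.hour("pickup_datetime"))      # extract time information
       .withColumn("dropoff_hour", F.hour("dropoff_datetime"))
       .withColumn("trip_duration_min",                           # the label we predict
                   (F.unix_timestamp("dropoff_datetime") -
                    F.unix_timestamp("pickup_datetime")) / 60.0)
       .select("pickup_datetime", "pickup_hour", "dropoff_hour",
               "trip_distance", "trip_duration_min")
)

# Load the transformed data into our own S3 bucket.
clean.write.mode("overwrite").parquet("s3://our-bucket/nyc-taxi/clean/")
```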

Upon completing the ETL, we performed feature engineering to find the most relevant features. This was done manually, and the final engineered features included pickup hour, trip distance, a rush-hour flag and a boolean weekend flag.
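The sketch below shows how these features can be derived, reusing the `clean` DataFrame and the `F` import from the ETL sketch above. The specific rush-hour ranges are assumptions for illustration.

```python
# Derive the engineered features from the cleaned DataFrame.
# The rush-hour time ranges are illustrative assumptions.
features = (
    clean
    .withColumn("is_rush_hour",
                (F.col("pickup_hour").between(7, 9) |
                 F.col("pickup_hour").between(16, 19)).cast("int"))
    .withColumn("is_weekend",
                F.dayofweek("pickup_datetime").isin(1, 7).cast("int"))  # 1=Sun, 7=Sat
    .select("pickup_hour", "trip_distance", "is_rush_hour", "is_weekend",
            "trip_duration_min")
)
```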

We also tracked the following platform I/O metrics in the process:

  • The percentage of HDFS storage currently used
  • The total number of concurrent data transfers
  • The available memory in the cluster

We observed that during data loading, HDFS utilization spiked from 0.5% to 1.5%. During feature engineering, the total load scaled from 10 concurrent reads to 28 concurrent reads. In addition, during feature engineering and model training (more on this soon!), the available memory kept fluctuating between 122,000 MB and 10,000 MB, which showed that close to 90% of the cluster's memory was being allocated to running jobs.

Machine Learning

After investigating our data, we concluded that this is a classic regression problem and decided to use linear regression as our machine learning model. Our training and storage configuration consisted of 10 AWS EC2 m5.xlarge instances (4 vCores and 16 GiB of memory each) with 64 GiB of EBS-only storage. We trained for 10 iterations with a learning rate of 0.1 on the engineered features.
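A minimal training sketch using Spark ML's DataFrame-based API is shown below, assuming the `features` DataFrame from the previous section. Note that `maxIter=10` mirrors our 10 iterations, while the 0.1 learning rate applies to SGD-style training and has no direct equivalent in this estimator, so treat the hyperparameters here as approximate.

```python
# A sketch of linear regression training with Spark ML on the engineered features.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(
    inputCols=["pickup_hour", "trip_distance", "is_rush_hour", "is_weekend"],
    outputCol="features_vec",
)
train_df, test_df = assembler.transform(features).randomSplit([0.8, 0.2], seed=42)

lr = LinearRegression(featuresCol="features_vec",
                      labelCol="trip_duration_min",
                      maxIter=10)
model = lr.fit(train_df)
predictions = model.transform(test_df)
```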

The metric we decided to track was the accuracy of predictions within a time window, where the window represents the acceptable absolute difference between our prediction and the value in the test data. We experimented with windows of 2, 5, 10 and 15 minutes.
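The metric itself is simple to compute: the fraction of test predictions whose absolute error falls inside the window. The sketch below assumes the `predictions` DataFrame and the `F` import from the sketches above.

```python
# Window accuracy: share of predictions within `window_min` minutes of the truth.
def window_accuracy(preds, window_min):
    within = preds.filter(
        F.abs(F.col("prediction") - F.col("trip_duration_min")) <= window_min
    ).count()
    return within / preds.count()

for w in [2, 5, 10, 15]:
    print(f"accuracy within {w} min: {window_accuracy(predictions, w):.1%}")
```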

Green Taxis: Accuracy v/s Time Window
Yellow Taxis: Accuracy v/s Time Window

We can observe that the trend is as expected: within a 5-minute window the accuracy is as low as 23.9%, but it shoots up to around 80% for the 15-minute window.

Possible extensions and future work include evaluating other modelling approaches along with regularization techniques to improve accuracy.

Prediction Serving

Through prediction serving, we wanted to answer a simple question: “How effectively can we deploy our model to serve predictions under a simulated real-world production load?”

We used Cortex as our prediction server and performed the following steps:

  • The trained model was saved in .onnx format.
  • The model in ONNX format was ported onto the prediction server.
  • The actual “serving” of the model was a simple Python class that instantiates an ONNX Runtime session and loads the pre-trained model (see the sketch below).
  • The predict method in the class was invoked by Cortex.
  • Once Cortex was set up on AWS, we had the server running by using the following command: cortex cluster up --config cluster.yaml

The Predictor Class
Cortex Deployment File
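The sketch below illustrates the shape of such a predictor class, assuming Cortex's Python predictor interface at the time (an __init__ that receives the API config and a predict method that receives the request payload). The model path, feature names and payload format are illustrative assumptions.

```python
# A minimal sketch of an ONNX-backed predictor class for Cortex.
# Model path, feature names and payload fields are illustrative.
import numpy as np
import onnxruntime as ort


class PythonPredictor:
    def __init__(self, config):
        # Instantiate an ONNX Runtime session and load the pre-trained model.
        self.session = ort.InferenceSession("trip_duration.onnx")
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, payload):
        # Cortex invokes this method for every incoming request.
        features = np.array(
            [[payload["pickup_hour"], payload["trip_distance"],
              payload["is_rush_hour"], payload["is_weekend"]]],
            dtype=np.float32,
        )
        prediction = self.session.run(None, {self.input_name: features})[0]
        return {"trip_duration_min": float(prediction.ravel()[0])}
```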

We performed unit testing of our model using Postman.

Postman Unit Testing
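For readers without Postman handy, the same kind of unit test can be reproduced with a short Python script. The endpoint URL and payload fields below are assumptions based on our feature set, not the exact request we used.

```python
# Illustrative version of the Postman check as a plain HTTP request.
import requests

resp = requests.post(
    "http://<cortex-api-endpoint>/predict",  # placeholder for the deployed API URL
    json={"pickup_hour": 18, "trip_distance": 3.2,
          "is_rush_hour": 1, "is_weekend": 0},
)
print(resp.status_code, resp.json())
```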

Load Testing using Locust

We decided to use Locust for load testing our service, given its ease of setup and use. We simulated a load of 5,000 users and monitored metrics such as service response times, requests per second and user growth.
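A minimal locustfile for this kind of test looks like the sketch below; the endpoint path and payload mirror the unit test above and are illustrative. The Locust web UI (or its command-line flags) then lets you ramp up to the simulated user count shown in the charts that follow.

```python
# locustfile.py -- a sketch of the load test against the prediction endpoint.
from locust import HttpUser, task, between


class PredictionUser(HttpUser):
    wait_time = between(1, 3)  # seconds each simulated user waits between requests

    @task
    def predict(self):
        self.client.post(
            "/predict",
            json={"pickup_hour": 18, "trip_distance": 3.2,
                  "is_rush_hour": 1, "is_weekend": 0},
        )
```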

Number of Users across Timestamps
Total Requests per Second across Timestamps
Response Times (ms) across Timestamps

In addition, we analyzed the serving behaviour through EKS CloudWatch dashboard monitoring.

Active Replicas
Cortex Monitoring Dashboard
Locust Summary Statistics

Our prediction server handled 5,000 concurrent users across ~279k requests and performed steadily with just a ~1% failure rate.

Conclusion

This was a great learning experience for all of us. Working on this project gave us insight into building and deploying an end-to-end machine learning application at scale. We hope to dive deeper into this space and solve exciting challenges in the future!

