Choosing between Google Cloud Run and AI Platform for an ML pipeline
What we learned scaling a machine learning pipeline, and why we chose Cloud Run over AI Platform
We are a customer support automation startup, IrisAgent, that processes large quantities of text data coming from support tickets as well as time-series data coming from engineering and product sources. Our business objective is to enable smarter customer support using real-time insights about operational, engineering, and user issues.
We evaluated Google AI Platform and Google Cloud Run for setting up a robust, production-ready ML pipeline. We hope our findings save you some time.
We had the following objectives for the pipeline:
Easy to manage
We’d rather focus on our core business problems than spend a lot of time managing the ML pipeline. We wanted an out-of-the-box solution that just worked.
Modular and Extensible
We are a young startup and are iterating quickly on different ML approaches. Our ML pipeline has several distinct steps, and we wanted a setup that lets us swap components in and out easily.
Compatible with our current setup
We currently use containers on Google Cloud Run for deploying all our services and use MongoDB and Google Cloud Storage for storage.
ML pipeline requirements
The first thing we did was to define the ideal setup and our requirements. We wanted modular components for data preparation, training, evaluation, and serving.
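To make the modularity requirement concrete, here is a minimal sketch of a pipeline whose steps can be swapped by name. It is an illustration under assumptions, not our production code: the `Pipeline` class, the step functions, and the context keys (`raw`, `clean`, `model`, `vocab_size`) are all hypothetical, and the "model" is a placeholder. A serving step is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# A step receives the shared context dict and returns an updated copy.
Context = Dict[str, Any]
Step = Callable[[Context], Context]


@dataclass
class Pipeline:
    """Hypothetical pipeline: named steps that run in registration order."""
    steps: Dict[str, Step] = field(default_factory=dict)

    def register(self, name: str, step: Step) -> None:
        # Registering an existing name replaces that step in place;
        # dicts keep the original insertion position of the key.
        self.steps[name] = step

    def run(self, context: Context) -> Context:
        for step in self.steps.values():
            context = step(context)
        return context


def prepare(ctx: Context) -> Context:
    # Data preparation: normalize whitespace and case.
    return {**ctx, "clean": [t.strip().lower() for t in ctx["raw"]]}


def train(ctx: Context) -> Context:
    # Placeholder "model": the set of words seen during training.
    vocab = {word for text in ctx["clean"] for word in text.split()}
    return {**ctx, "model": vocab}


def evaluate(ctx: Context) -> Context:
    # Placeholder metric: vocabulary size.
    return {**ctx, "vocab_size": len(ctx["model"])}


def build_pipeline() -> Pipeline:
    pipeline = Pipeline()
    pipeline.register("prepare", prepare)
    pipeline.register("train", train)
    pipeline.register("evaluate", evaluate)
    return pipeline
```

Swapping a component is then a single call, for example `pipeline.register("prepare", my_new_prepare)`, and the remaining steps stay untouched as long as the new step writes the same context keys.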
Results and Findings
Google AI Platform
Google AI Platform was compatible with our current cloud setup, which is also on GCP. As a managed service, it also met our easy-to-manage requirement. However, we ran into a blocker when experimenting with it.
We had to choose between a standard container and a custom container, and unfortunately neither worked for us.
We could not use GCP’s standard out-of-the-box container as we were using ML frameworks other than TensorFlow, scikit-learn, or XGBoost. As a customer-support AI company, we have several NLP models that often don’t use the standard frameworks. We also needed to experiment and deploy models quickly without getting blocked by framework limitations.
Standard frameworks run smoothly on AI Platform. A non-standard framework, however, required us to configure a custom prediction routine, which slowed us down. The custom prediction routine also had a major limitation: it could only run on a legacy (MLS1) machine type with just 2 GB of available RAM. We quickly ran into an out-of-memory error:
ISSUE: Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and redeploy
Thus, standard containers became a no-go.
Next, we tried a custom container, but it didn’t meet our easy-to-manage requirement. It also required a different deployment strategy.
Google Cloud Run
We decided to stay with Cloud Run for our ML requirements. We set up a microservices-oriented architecture and used Cloud Scheduler to run the different ML tasks on a periodic basis.
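As a rough illustration of this setup, the sketch below shows one small HTTP service that exposes each ML task as an endpoint, so that a Cloud Scheduler job can call it on a schedule. It uses only the Python standard library. The task names, the `/tasks/...` paths, and the returned payloads are assumptions made for the example; the only platform detail relied on is that a Cloud Run container listens on the port given in the `PORT` environment variable.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Callable, Dict, Tuple


def retrain_model() -> Dict[str, Any]:
    # Placeholder for a real training job.
    return {"task": "retrain", "status": "ok"}


def refresh_features() -> Dict[str, Any]:
    # Placeholder for a real data preparation job.
    return {"task": "refresh-features", "status": "ok"}


# Hypothetical routing table: one path per periodic ML task.
TASKS: Dict[str, Callable[[], Dict[str, Any]]] = {
    "/tasks/retrain": retrain_model,
    "/tasks/refresh-features": refresh_features,
}


def dispatch(path: str) -> Tuple[int, Dict[str, Any]]:
    """Run the task registered for `path` and return (HTTP status, body)."""
    task = TASKS.get(path)
    if task is None:
        return 404, {"error": f"unknown task: {path}"}
    return 200, task()


class TaskHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Drain any request body so the connection is left in a clean state.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        status, body = dispatch(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def main() -> None:
    # Cloud Run passes the listening port through the PORT variable.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), TaskHandler).serve_forever()
```

Calling `main()` starts the server. Keeping the routing in `dispatch` separate from the HTTP handler makes each task easy to test without a network, and adding a new periodic task means adding one entry to `TASKS` and one scheduler job that posts to its path.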
The biggest advantage of Cloud Run is that it handles autoscaling and container crashes gracefully, with no operational overhead for us. It is also much cheaper and has a generous free tier. Its biggest limitation is a maximum of 8 GB of RAM and 4 CPUs, which we are likely to hit as we adopt larger ML models. At that point, we will likely migrate to AI Platform or Google Kubernetes Engine.
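A back-of-the-envelope check can show when a model is likely to outgrow these memory limits. The sketch below is only a rough estimate under stated assumptions: 4 bytes per parameter (32-bit floats), a 1.5x overhead factor for runtime and activations, and limits interpreted as GiB. The 2 GB and 8 GB figures come from the limits described above; the dictionary keys and function names are hypothetical.

```python
GIB = 1024 ** 3

# Memory limits mentioned in this post, interpreted as GiB (an assumption).
LIMITS_GIB = {
    "ai_platform_mls1": 2,  # legacy machine type for custom prediction routines
    "cloud_run": 8,         # maximum RAM we had available on Cloud Run
}


def estimated_bytes(num_params: int,
                    bytes_per_param: int = 4,
                    overhead: float = 1.5) -> int:
    """Rough serving footprint: weights times an assumed overhead factor."""
    return int(num_params * bytes_per_param * overhead)


def fits(num_params: int, limit_gib: float, **kwargs) -> bool:
    """Return True if the estimated footprint is within the given limit."""
    return estimated_bytes(num_params, **kwargs) <= limit_gib * GIB


def report(num_params: int) -> dict:
    """Check one model size against every limit in LIMITS_GIB."""
    return {name: fits(num_params, limit) for name, limit in LIMITS_GIB.items()}
```

For example, `report(600_000_000)` checks a 600-million-parameter model against both limits. Real memory use depends heavily on the framework and the serving code, so measuring the running container is the only reliable check.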
Interested in joining us and working on exciting and challenging problems in AI and machine learning? Send us a quick note with your LinkedIn profile link.