We hear about “the cloud” all the time: “music in the cloud” with Amazon, Apple’s iCloud for syncing your devices, Dropbox for sharing files. Really, every major service we see seems to exist in this mystical cloud.
As tech lead and operations lead here at nonprofitCMS, I’m in charge of making sure our product line stays up and running, even when faced with incredible spikes in traffic.
To see what I mean, take a look at the chart showing our last 2 months of traffic. In mid-April and at the end of May we were bombarded with traffic. Normally we do ~7,000-10,000 transactions per day.
On these spike days we had to process 70,000-110,000 transactions, roughly ten times our normal load. I’ll spare you the drama: on April 1st we switched to a suite of cloud technologies offered by Microsoft and Amazon, and that saved us from what could have been massive outages.
So that leaves the question: what is the cloud?
Types of Clouds
SaaS – Software as a Service cloud environments. For our end customers, our awards software, conference management system, membership manager, and job board system are all SaaS clouds. Customers get a service: whether they have 1,000 conference registrations or 100,000, they trust that our system will handle the data and scale to their needs. Other examples of SaaS clouds are online email like Google’s Gmail, Dropbox, iCloud, etc. These are typically specific to a function. Behind the scenes, though, these SaaS solutions need a robust cloud system underneath them. There are two major types: IaaS and PaaS.
IaaS – Infrastructure as a Service cloud systems provide compute power and storage, and the application developer (SaaS provider) can consume resources as needed. For example, I can purchase the equivalent of 5 desktop computers and then choose to do whatever I want with them. I can host a website and, as my demands grow, order a 6th computer to help meet that demand. IaaS saves the application developer from having to set up his or her own computer systems; with the click of a button a new machine can be spun up or spun down.
Because IaaS is so generic, much of the legwork to actually scale these machines falls on the developer. In the case of the hosted website, the developer will have to set up something called a load balancer, which spreads incoming requests across the machines, and then configure each node of the website to work with it. If changes are made to the website code, the developer will need to copy it to all 6 machines. There are lots of areas where trouble can arise, but this solution tends to be popular since it has little learning curve to get started, just a big learning curve to get the most use out of it.
PaaS – Platform as a Service cloud systems provide a specific technical function. For example, we use Amazon S3 (Simple Storage Service). As long as we follow Amazon’s directions on how to use the storage, we can store 1 KB worth of files or 1,000s of GB. Just as you may not worry about getting thousands of emails, we don’t worry about what happens if we get overloaded with files.
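To make the load balancer idea concrete, here is a minimal sketch of the simplest strategy, round robin, where each incoming request goes to the next machine in rotation. The class name and the node names (web-1 through web-6) are made up for illustration; a real load balancer also handles health checks, sessions, and much more.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out nodes in strict rotation (illustrative only)."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = list(nodes)
        self._rotation = cycle(self._nodes)

    def next_node(self):
        # Each call returns the next node, wrapping around at the end.
        return next(self._rotation)


if __name__ == "__main__":
    # Six hypothetical web nodes, matching the 6-computer example above.
    balancer = RoundRobinBalancer([f"web-{i}" for i in range(1, 7)])
    for request_id in range(8):
        print(request_id, "->", balancer.next_node())
```

After the sixth request the rotation wraps back to the first node, so no single machine takes all the traffic.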
Another PaaS solution we use is Amazon Elastic Transcoder. This lets us take in videos in any format and then convert them to a web-friendly format. You might imagine videos take up a lot of space; converting them from one format to another also takes a ton of compute power. We can consume this PaaS simply by asking Amazon to convert a video whenever we have one.
You can imagine one major benefit of PaaS: we use Amazon’s computers to convert videos only when we have videos that need converting. Only a fraction of our customers even use streaming video, and even they only receive videos in a small window of their program’s cycle. Without this PaaS technology we would have to pay for transcoding computers that would spend most of their time sitting on the sidelines.
How PaaS Lets Us Scale ‘Infinitely’
Now that you have a taste of cloud services, let’s talk about this notion of infinite scaling. To create a service that scales infinitely, what SaaS developers really do is create the illusion of infinite capacity. We estimate how many transactions per second we must accommodate, and then ensure our setup can handle this. But the real magic comes from the everyday concept of queuing.
Let’s say we decide a transaction should not take more than 20 seconds, and we know that a given machine can handle 1,000 simultaneous transactions. The first thing we should do is list the transactions that take more than 20 seconds:
– Converting a video
– Running a report
– Sending an email blast
– Saving a large document
From there, we need to decide: is it OK for this action to take more than 20 seconds? A user of our system might expect a delay when converting a video or running a report, but when saving a document they need that to be instant. So our next step is to figure out how to get important and urgent actions below this 20-second mark (in this case we break up a large save into multiple smaller saves that run in parallel).
For anything else that can wait more than 20 seconds, we push it off onto a “worker thread”, in other words into some sort of background processing system. Here each worker processes just 10-15 tasks at a time; if tasks cannot be processed right away, they wait their turn in line. This means that we can handle an infinite amount of work. It just may take an infinite amount of time to get it done!
To solve this problem, we use a series of cloud monitors that check how long a task has been waiting to get started. We like to start tasks within 1 minute of being requested. If a task has waited 90 seconds, we request a new “worker” from the cloud, and we keep doing this every 90 seconds until wait times are back to 1 minute or less. As more workers are spun up, we can take items off of the queue faster. When the queue size drops, we can start returning workers to the cloud.
Finally, on the front end we have this 20-second limit on purpose. If a transaction takes more than 20 seconds on the front end, we kick it out and send an alert to our developers to “do better.” We do this so that no one transaction can accidentally consume more resources than it needs. Transactions should be quick, and if they are not they should be optimized and/or put into a background worker.
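As a rough illustration of breaking a large save into parallel smaller saves, the sketch below splits a document into chunks and writes them concurrently. Everything here is an assumption for the sake of the example: the chunk size is tiny, and save_chunk writes into an in-memory dict that stands in for whatever storage service is really used.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes per chunk; deliberately tiny for illustration


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Cut data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def save_chunk(store: dict, index: int, chunk: bytes) -> int:
    # Hypothetical stand-in for a real storage call.
    store[index] = chunk
    return index


def parallel_save(data: bytes, store: dict, max_workers: int = 4) -> int:
    """Save all chunks concurrently and return how many were written."""
    chunks = split_into_chunks(data)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(save_chunk, store, i, chunk)
            for i, chunk in enumerate(chunks)
        ]
        for future in futures:
            future.result()  # re-raises any error from a failed chunk
    return len(chunks)


if __name__ == "__main__":
    store = {}
    count = parallel_save(b"a large document body", store)
    rebuilt = b"".join(store[i] for i in range(count))
    print(count, rebuilt)
```

Because each chunk carries its index, the pieces can be written in any order and still be reassembled correctly.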
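The background queue described above can be sketched with Python's standard queue and threading modules. This is a simplified illustration, not our production setup: each worker here handles one task at a time rather than 10-15, and run_workers and the handler are hypothetical names.

```python
import queue
import threading


def run_workers(tasks, handler, worker_count=2):
    """Process tasks on background threads; waiting tasks sit in the queue."""
    pending = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = pending.get()
            if task is None:  # sentinel: no more work for this worker
                pending.task_done()
                break
            try:
                outcome = handler(task)
            except Exception as exc:  # keep the worker alive on failure
                outcome = exc
            with lock:
                results.append(outcome)
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:
        pending.put(task)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    pending.join()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    jobs = ["convert video", "run report", "send email blast"]
    print(sorted(run_workers(jobs, str.upper)))
```

With more tasks than workers, the extra tasks simply wait in the queue until a worker is free, which is exactly the "wait their turn in line" behavior.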
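The scaling rule can be written as a small decision function that a monitor might call on each check. The 90-second threshold comes from the description above; the function name, the min_workers floor, and the choice to scale down only when the queue is empty are assumptions made for this sketch.

```python
SCALE_UP_WAIT_SECONDS = 90   # from the text: request a worker at 90 seconds


def desired_worker_change(oldest_wait_seconds, queue_length,
                          worker_count, min_workers=1):
    """Return +1 to add a worker, -1 to release one, or 0 to hold steady."""
    if oldest_wait_seconds >= SCALE_UP_WAIT_SECONDS:
        return 1
    # Assumption: only give workers back once nothing is waiting,
    # and never drop below a minimum pool size.
    if queue_length == 0 and worker_count > min_workers:
        return -1
    return 0


if __name__ == "__main__":
    # (oldest wait in seconds, queue length, current workers)
    for reading in [(30, 4, 2), (95, 40, 2), (120, 60, 3), (10, 0, 4)]:
        print(reading, "->", desired_worker_change(*reading))
```

Running the check every 90 seconds, as described, means one worker is added per interval until wait times recover.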
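One way to picture the front-end limit is a wrapper that gives a transaction a fixed time budget and raises an alert when the budget is exceeded. This is only a sketch: run_with_limit and the alert callback are hypothetical, and a Python thread cannot be forcibly stopped, so the slow call here is abandoned rather than truly killed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

FRONT_END_LIMIT_SECONDS = 20  # the limit discussed above


def run_with_limit(func, limit_seconds=FRONT_END_LIMIT_SECONDS, alert=print):
    """Run func with a time budget; return (finished_in_time, result)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return True, future.result(timeout=limit_seconds)
    except FutureTimeout:
        # Hypothetical alert hook standing in for a message to developers.
        alert(f"transaction exceeded {limit_seconds}s; "
              "optimize it or move it to a background worker")
        return False, None
    finally:
        pool.shutdown(wait=False)  # do not block the caller on slow work


if __name__ == "__main__":
    print(run_with_limit(lambda: "saved", limit_seconds=1))
    print(run_with_limit(lambda: time.sleep(0.3), limit_seconds=0.05))
```

A real web stack would usually enforce this at the server or gateway level, where the request can actually be terminated.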
Is it Magic?
I wish it were; my job would be a lot easier. Really though, cloud technologies provide a means to get work done in a very parallel way. As long as the software is written to work in small chunks at a time, the cloud can keep expanding to accommodate growth.
Making a simple change, like turning “generate report” into “request report and send an email when it is ready,” can be the difference between knocking a system down and frustrating 100s of users, and asking 1 user to be patient for a few minutes while everything goes smoothly for everyone.
Embracing the cloud helped us survive those spikes when our system was needed most.