Organisations like the Australian Bureau of Statistics (ABS) collect and manage a large amount of data to estimate key population statistics. However, processing such vast datasets in complex computational models requires significant computing power. Cloud computing, an interconnected network of ‘virtual machines’, is an efficient alternative to investing in in-house IT infrastructure. Choosing cost-effective virtual machines to suit a particular model is difficult because they vary in performance and pricing. Instead of testing every option, a related but cheaper-to-compute model can be run to estimate costs and execution times. My report explores optimising cloud computing for statistical production, with a view to improving how organisations like the ABS assess cost-effective cloud service options for data analysis.
Organisations like the Australian Bureau of Statistics (ABS) collect and manage large amounts of data, which can be used to estimate key population statistics such as population size, employment rate, welfare payments, and more. To do this, they collect and compare survey data from across Australia as well as historical survey data, applying these large datasets to complex models that estimate the desired statistics. This can be used to help inform government policies and decision-making.
However, processing such large datasets in these complex models requires a huge amount of computing power. The larger the dataset, the more computation the model requires, making it difficult for traditional computers to handle the workload efficiently.
Instead of regularly investing in their own IT infrastructure, which soon needs updating anyway, the ABS can use cloud computing services offered by companies such as Amazon, Google, and Microsoft. These services are delivered via ‘virtual machines’: online, server-based versions of physical computers. Unlike regular computers, virtual machines can be easily resized and linked together to deliver the computing power required.
These services are offered through a pay-as-you-go pricing model, with no upfront costs: clients pay only for the time they spend running their computations on the virtual machines they have recruited.
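As a rough sketch of what pay-as-you-go means in practice, the cost of a job is simply the machine's rate multiplied by the time used. The rate and runtime below are illustrative numbers, not any provider's actual pricing:

```python
def run_cost(hourly_rate: float, runtime_hours: float) -> float:
    """Pay-as-you-go cost of one job on one virtual machine: rate x time."""
    return hourly_rate * runtime_hours

# Example: a machine priced at $0.40/hour running a 2.5-hour job
print(run_cost(0.40, 2.5))  # 1.0 (i.e. $1.00)
```

In reality providers bill at finer granularity (per second or per minute), but the principle is the same: no usage, no charge.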
However, not all virtual machines are the same. The different specifications of each virtual machine mean that the same model can have very different costs and execution times depending on where it runs. So how can organisations like the ABS choose the most cost-effective option?
Testing the target model on all available virtual machines would cost more time and money than it would save. Instead, a smarter approach is to first run a smaller, related model that is cheaper to compute. The cost and runtime from this smaller test can then be extrapolated to estimate how much the target model will cost and how long it will take to run.
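The extrapolation step can be sketched as follows. This is a hypothetical illustration: it assumes runtime grows linearly with dataset size, which is a simplifying assumption rather than a property of any particular statistical model, and all sizes, times, and rates are made up:

```python
def estimate_runtime(small_runtime: float, small_size: int, target_size: int) -> float:
    """Scale the measured runtime of the small model up to the target size.
    Assumes linear scaling in dataset size (an assumption, for illustration)."""
    return small_runtime * (target_size / small_size)

def estimate_cost(runtime_hours: float, hourly_rate: float) -> float:
    """Estimated pay-as-you-go cost for the extrapolated runtime."""
    return runtime_hours * hourly_rate

# Small test: 100,000 records took 0.5 hours on a given virtual machine.
# Target model: 2,000,000 records on the same machine at $0.40/hour.
t = estimate_runtime(0.5, 100_000, 2_000_000)
print(t)                      # 10.0 hours
print(estimate_cost(t, 0.40)) # 4.0 (i.e. $4.00)
```

Repeating this cheap test on each candidate machine type gives a table of estimated costs and runtimes without ever running the full target model.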
Once these estimates have been determined, an optimisation algorithm can be used to decide which virtual machines to recruit. The goal is to minimise costs while ensuring that the model is completed within a specified time deadline.
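A minimal version of this selection step, for the simple case of choosing a single machine, can be sketched like this. The machine names, runtimes, and rates are hypothetical; real optimisation methods can also consider combinations of machines:

```python
def cheapest_within_deadline(vms, deadline_hours):
    """Return the cheapest virtual machine whose estimated runtime
    meets the deadline, or None if no single machine is fast enough."""
    feasible = [vm for vm in vms if vm["runtime"] <= deadline_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda vm: vm["runtime"] * vm["rate"])

# Estimated runtimes (hours) and rates ($/hour) for three hypothetical machines
vms = [
    {"name": "small",  "runtime": 12.0, "rate": 0.20},  # $2.40, but too slow
    {"name": "medium", "runtime": 6.0,  "rate": 0.40},  # $2.40
    {"name": "large",  "runtime": 3.0,  "rate": 1.00},  # $3.00
]
print(cheapest_within_deadline(vms, deadline_hours=8.0)["name"])  # medium
```

With an 8-hour deadline, the small machine is excluded even though it is cheap, and the medium machine wins because it matches the deadline at a lower cost than the large one.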
Little research has been done on optimising cloud computing in the field of statistical production. My report takes the first step in exploring how these methods can be applied to improve approaches to cloud computing in organisations like the ABS by evaluating the performance of a computational program on different virtual machines.
I hope you found this explanation helpful and interesting!
Toby Mew
Monash University
