Abstract:
Over the last semester, I took on a subject called Time-Series Econometrics. It was probably the hardest thing I’ve ever done; many late nights were spent coding and trying to wrap my head around models that refused to behave. By the end of the semester, I vowed never to code or even think about time series analysis again. So naturally, I undertook a research project over the summer purely in time-series analysis. The project aimed to create an algorithm that could analyse big datasets faster, as the methods generally used take a very long time. A particularly relevant project for me, given that my laptop is incredibly slow.
As an economics student, I am always trying to predict things through a model, often one that considers observations ordered in time, say, by days or years; this is what we call a time-series dataset. Over a few years of learning about different models, I’ve learnt that their applications are very broad: almost every field relies on some form of time-series modelling, whether it’s biology, economics, physics, chemistry, or even engineering. The issue is that modelling large time-series datasets is computationally expensive, or in non-crazy words, takes a really long time. Thus, I set upon my journey to construct an algorithm that would make the process faster.
I started by constructing the Metropolis-Hastings algorithm in the regular time domain. This is a Bayesian variable selection algorithm which jointly decides which parameters enter the model and what effect those parameters have on the dependent variable. This first development was surprisingly quick, and I began to move into the frequency domain.
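To give a flavour of the core idea, here is a minimal sketch of a random-walk Metropolis-Hastings sampler in Python. This is not the project's actual variable-selection algorithm; it just illustrates the accept/reject mechanism the project builds on, using a standard normal target density as a stand-in:

```python
import math
import random

def metropolis_hastings(log_target, n_samples, x0=0.0, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' ~ N(x, step^2) and
    accept with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept or reject using the log of the density ratio
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

# Target: standard normal, via its log density up to a constant
samples = metropolis_hastings(lambda z: -0.5 * z * z, 20000, seed=42)
mean = sum(samples) / len(samples)
```

In the full variable-selection version, the chain would also propose adding or dropping regressors, so each step jointly explores which parameters are in the model and their values.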
We used the Fourier transform on the dataset, moving it from the time domain into the frequency domain. Essentially, instead of looking at individual days or years, we look at frequencies, or cycles. Say we had a year’s worth of data sequenced by days: a frequency index of 52 would correspond to a weekly pattern, and a frequency index of 12 to a monthly pattern. Because the Fourier transform of a real-valued series is conjugate-symmetric, we could halve the dataset without any information loss, and thanks to the literature contributions over the years, evaluation methods in the frequency domain are much more efficient.
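The halving trick can be demonstrated directly. The sketch below, which uses a naive pure-Python DFT rather than any library used in the project, transforms four weeks of daily data containing a weekly cycle; the weekly pattern shows up at frequency index 4 (four cycles in 28 days), and the upper half of the spectrum is just the complex conjugate of the lower half, so it carries no extra information:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Four weeks of daily data with a weekly (7-day) cycle
series = [math.sin(2 * math.pi * t / 7) for t in range(28)]
spectrum = dft(series)

# Conjugate symmetry for real input: X[n-k] == conj(X[k]),
# so frequencies above n/2 can be discarded without information loss.
n = len(series)
for k in range(1, n // 2):
    assert abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9

# The weekly cycle appears at frequency index 28 / 7 = 4
peak = max(range(1, n // 2), key=lambda k: abs(spectrum[k]))
```

In practice a fast Fourier transform routine would be used instead of this quadratic-time loop, but the symmetry property being exploited is the same.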
After trying to adapt the algorithm directly and failing, I consulted one of my supervisors, who let me in on the secret that the proposal for my algorithm was wrong, meaning I had to completely change it. After a long process, I had finally adapted the algorithm to the frequency domain.
After this, I was finally done, and my original frustration with time-series econometrics had become a passion!
Jamie Hill
University of Technology Sydney