At the core of portfolio construction is diversification, and at the core of diversification is correlation. One simple mantra has ruled finance for years: invest in a bunch of uncorrelated assets, and your portfolio will be less volatile. Consider an equities portfolio: how easy is it to spot direct or indirect relationships between companies and therefore between stock prices? If the prices of two equities were correlated in the past, will that correlation continue into the future? I want to explore this question, but to do that I’m going to need a metric ton of data and a few tools to help me sift through garbage and find the gems.

So, in the process, I’ll explore a couple of different questions, like: why is it more painful to run things in the cloud than to get kicked in the kneecaps? How can I simplify my complicated data ingest process so different people can easily pick up and run with the large dataset I created? Just bear with me for a second, swallow a few of these vegetables I’m about to feed you, and I promise they’ll make you strong enough to tackle the deep, interesting questions of the universe. Here’s where I plan to take you:

  • Show you a hassle-free way to ingest a big, messy dataset to answer the question at hand
  • Show you how to secure enough computing power to optimally handle such a dataset
  • Explain stock price correlation and examine how stable correlations are over time
  • Analyze the market overall from a correlation perspective
  • Show you a sweet movie illustrating the evolution of market relationships over the last 30 years

 

Getting our hands on that sweet, sweet data

One thing that kicked me off on this project is that I happened to have a dataset lying around which I scraped for another purpose. It’s only a few GB, and it’s living on the cloud in Amazon S3, but the format is a little messy. It’s split up across about 6,000 files, one for each company. Also, I don’t remember the exact folder structure and I really don’t feel like poking around in my bucket before figuring out how I want to slurp it up. I want to be able to download it all from S3 and bind it up into one nice DataFrame. Luckily, B23 uses a dataset import framework called ‘Lastmile’, and with one magic line of Python the whole dataset loads for me.

 

All I had to do was specify the config file below, and Lastmile automagically looped through my S3 bucket, downloaded each file to a local temp file, loaded each one into memory, then concatenated everything into one big pandas DataFrame. Whoa, that was a breeze!
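Lastmile is B23’s own framework, so I won’t guess at its API here. As a rough stand-in, the same loop-load-concatenate idea can be sketched in plain pandas, assuming the per-company files are CSVs that have already been copied to a local folder. The folder layout, the `<TICKER>.csv` naming, the `ticker` column, and the `load_price_files` helper are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_price_files(folder):
    """Read every CSV under `folder` (at any depth) into one DataFrame.

    Assumption: one CSV per company, named <TICKER>.csv. The real bucket
    layout and the Lastmile config are not reproduced here.
    """
    frames = []
    for path in sorted(Path(folder).rglob("*.csv")):
        frame = pd.read_csv(path)
        frame["ticker"] = path.stem  # remember which file each row came from
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no CSV files found under {folder}")
    return pd.concat(frames, ignore_index=True)
```

Walking the folder recursively means the exact directory structure doesn’t need to be known up front, which is the part I was too lazy to look up.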

 

 

Securing the necessary cloud resources

Now I’ve got a dataset which is a couple GB running in memory. I could probably fit this on my local machine, but it would make my life difficult, especially while I’m using some of my computer’s memory to watch the first round of March Madness! So instead I’m going to spin up an R stack on the B23 Data Platform, or BDP. This gave me an m4.4XL with 16 cores and 64GB of memory: basically, a bigger, badder machine which has enough memory to store multiple copies of my data and enough computing power to run large jobs in parallel, priced at $0.80/hour. Normally the process of setting this up is fairly annoying. I’ve followed several long tutorials such as this one, and sunk a lot of time into figuring out the details for additional steps like connecting to my S3 bucket. With BDP, I have about one page of instructions. Essentially, I just choose what stack to run, pick the size of the machine I want, and I’m good to go, ready to start crunching some data!

 

Back to the dataset

In addition to price data, I created a dataset with company summary information for every company currently listed on AMEX, NYSE, or NASDAQ, around 6,000 companies in total. Below is a screenshot of the dynamic dataset.

Then I went to work on data cleaning:

  • Eliminated duplicates and stocks which are in the dataset for less than a year
  • Limited daily prices to adjusted close because this accounts for dividends
  • Retained price data only for listed companies which weren’t in the Finance sector and didn’t have a blank sector. This left just under 4,000 stocks overall.
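The cleaning steps above might look something like this in pandas. This is a sketch, not the exact code I ran: the column names (`ticker`, `date`, `adj_close`, `sector`), the `clean_prices` helper, and the use of 252 rows as a stand-in for "one year of trading days" are all assumptions.

```python
import pandas as pd


def clean_prices(prices, companies, min_days=252):
    """prices: long table with ticker, date, adj_close (plus anything else).
    companies: one row per listed company with ticker and sector."""
    # Keep one row per (ticker, date) and only the adjusted close.
    prices = prices.drop_duplicates(subset=["ticker", "date"])
    prices = prices[["ticker", "date", "adj_close"]]

    # Drop stocks with less than roughly one year of daily prices.
    days = prices.groupby("ticker")["date"].transform("count")
    prices = prices[days >= min_days]

    # Keep listed companies with a non-blank, non-Finance sector.
    sector = companies["sector"].fillna("").str.strip()
    keep = companies.loc[(sector != "") & (sector != "Finance"), "ticker"]
    return prices[prices["ticker"].isin(keep)].reset_index(drop=True)
```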

Now we have a pretty good, clean dataset to work with. However, there are still a couple of issues to address.

1. We can’t compare prices directly since they are non-stationary. Running correlations between two non-stationary time series is a bad idea. If both series are simply moving in the same direction, their correlations will be wildly inflated. As in, Tom Brady would refuse to play with them inflated. The key issue is that the price of a single stock is dependent on its history, and this dependence must be removed before analyzing its relationship with another stock’s price. An easy way to do this is to take a first difference. In this case, stock price differences are heavily dependent on the price level of the stock as well, so a better measure would be log differences. Let’s define the log daily return for day $i$: $\log(\text{Price}_i) - \log(\text{Price}_{i-1})$.

Now we have a stationary measure, but it is on a very short time scale. Generally, when building a portfolio, we won’t care about daily correlations as we look at our returns on a longer time horizon. So instead we will use log bi-weekly returns, defined at day $i$: $\log(\text{Price}_i) - \log(\text{Price}_{i-14})$. This is equivalent to a running two-week sum of daily log returns. Because overlapping two-week windows share most of their daily returns and would be strongly autocorrelated, we also need to make sure to take non-overlapping samples, one every two weeks.
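Here is a minimal sketch of those two definitions, assuming prices sit in a pandas Series or DataFrame with one row per day. The helper names are made up for illustration, and `step=14` simply mirrors the formula above; whether a step means calendar days or trading days depends on how the price table is indexed.

```python
import numpy as np
import pandas as pd


def log_returns(prices, step=1):
    """log(Price_i) - log(Price_{i-step}) for every row i."""
    return np.log(prices).diff(step)


def nonoverlapping_returns(prices, step=14):
    """Log returns over consecutive, non-overlapping blocks of `step` rows."""
    sampled = prices.iloc[::step]          # keep every step-th row
    return np.log(sampled).diff().dropna()


# Toy series growing 1% per day, just to exercise the helpers.
toy = pd.Series([100.0 * 1.01 ** day for day in range(29)])
daily = log_returns(toy)
two_week = log_returns(toy, step=14)       # overlapping, one value per day
sampled = nonoverlapping_returns(toy)      # one value per two-week block
```

The overlapping version equals the rolling 14-row sum of daily log returns, which is why only every 14th value carries fresh information.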

These log daily returns turn out to be much nicer to play with than raw stock prices, with far fewer outliers, and a relatively normal distribution:

2. All stocks tend to track with the market index overall. This will cause stocks to seem more correlated to each other than they really are. To see this clearly, check out a plot of a set of stocks over time.

Notice that large market events, such as the crash in 2008, are particularly prominent. In order to fix this issue, we will create a weighted index of the market, which is just the combined price of all stocks weighted by their number of shares outstanding. Now we can simply subtract off the log market return for each period. Call this the bi-weekly excess return: it is the return above what is seen in the market over that period.
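As a sketch of that adjustment, assume a wide price table (one row per sampling date, one column per ticker) and a Series of shares outstanding per ticker, treated as constant over time. Both the layout and the `excess_log_returns` helper are assumptions for illustration. Note that missing prices are simply skipped when summing the index, which is a simplification for stocks that enter or leave the dataset.

```python
import numpy as np
import pandas as pd


def excess_log_returns(prices, shares):
    """Log return of each stock minus the log return of the weighted index.

    prices: DataFrame, rows = sampling dates, columns = tickers.
    shares: Series of shares outstanding, indexed by ticker.
    """
    # Index level = sum over stocks of price * shares outstanding.
    index_level = prices.mul(shares, axis=1).sum(axis=1)
    stock_ret = np.log(prices).diff()
    market_ret = np.log(index_level).diff()
    # Subtract the market return from every column, row by row.
    return stock_ret.sub(market_ret, axis=0).dropna(how="all")
```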

 

Now let’s look at correlations

To measure correlation, we will use the Pearson correlation coefficient, or ρ, between the log bi-weekly excess returns for two stocks over a set time period. In particular, I’m interested in the consistency of correlation over time. So, let’s say I observe Coca-Cola (KO) between 1995 and 2005 and I want to know which stocks are correlated with it. There will be several companies which seem to be related; among them are Consolidated Edison (ED) and Newell Brands (NWL). ED is a power company operating primarily out of the NY area. NWL is a marketer of consumer goods, something we may expect to be more related to Coca-Cola. Both are correlated with Coca-Cola at a high level of significance ($p < 10^{-6}$) during this period. So let’s look at their overall excess returns through 2005 and beyond:

Interestingly, while the correlation for ED continues as strongly as before, NWL rapidly diverges. In fact, NWL had no correlation whatsoever with Coca-Cola between 2005 and 2015. Look at the comparison of bi-weekly returns for every period between the two companies. Here, each point is one two-week period, and we can see the relative performance of the two companies against one another.

From 1995-2005, the periods where Coca-Cola did well seem to line up with the periods where NWL did well and vice versa, hence the upward-sloping line. In 2005-2015, the line is flat, so there isn’t any relationship. So, clearly we’ve got two regimes: one in which there is correlation and one in which there isn’t. How did we evolve between the two of them? Let’s look at the correlation coefficient (ρ) over a running two-year window throughout the entire time period 1995-2015:

We can see that correlations can bounce around quite a bit just due to statistical fluctuations. But there is a very precipitous change around 2007. One theory is that there are many factors pressuring the price of a stock both up and down, and one force overwhelmed NWL, pushing it down significantly and drowning out the relationship with Coca-Cola. Another is that the underlying relationship between the companies disappeared. Perhaps Newell Brands used to market certain Coke products, but stopped doing so in 2007.
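For the curious, a running-window correlation like the one above takes very little code in pandas. This is a sketch under the assumption that each stock’s bi-weekly excess returns are held in a pandas Series on a shared index; with 26 two-week periods per year, a two-year window is 52 rows. The `rolling_correlation` name is just for illustration.

```python
import numpy as np
import pandas as pd


def rolling_correlation(returns_a, returns_b, window=52):
    """Pearson rho over a trailing window of `window` return periods."""
    return returns_a.rolling(window).corr(returns_b)


# Synthetic example: two series that share a common driver for the first
# half of the sample and are unrelated afterwards.
rng = np.random.default_rng(0)
common = rng.normal(size=260)
stock_a = pd.Series(common + 0.5 * rng.normal(size=260))
stock_b = pd.Series(np.where(np.arange(260) < 130, common, 0.0)
                    + 0.5 * rng.normal(size=260))
rho = rolling_correlation(stock_a, stock_b)   # first 51 values are NaN
```

Plotting `rho` for a pair like this gives the same kind of picture: a noisy but clearly positive correlation that falls away once the windows slide past the break.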

Whatever the explanation, I’m interested in seeing how stable correlations are on a broader scale. How predictive are historical correlations in general? That is, if I observe a correlation between any two companies between 1995 and 2005, how likely is it that the companies will continue to be correlated between 2005 and 2015? To do that, I’m going to need to run correlations for two time periods across millions of pairs of companies (just under 4,000 stocks means up to roughly 8 million pairs). Thank science I’m running this in the cloud!

 

Correlations of Correlations

Ok, now I’m going to ask you to put on your Leonardo DiCaprio hat and go one more layer deep into the rabbit hole with me. Because now we are going to look at correlations of correlations. In other words, if I take the correlation coefficients between pairs of stocks from 1995-2005 (call this the training period), how much of the result can I expect to be repeated from 2005-2015 (call this the test period)? How correlated are the correlation coefficients between the two periods? The answer is about 26%: not nothing, but not great either. Considering most portfolio optimization methods assume constant correlations in some way (most use conditional correlation, and many employ techniques like multi-index or shrinkage to improve estimates), we would expect a higher degree of consistency.
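In code, the idea boils down to flattening each period’s correlation matrix into one ρ per pair and then correlating the two flat lists. The sketch below assumes wide return tables (rows are periods, columns are tickers); the helper names are hypothetical, and only stocks present in both periods are compared.

```python
import numpy as np
import pandas as pd


def pairwise_correlations(returns):
    """One Pearson rho per unordered pair of columns, as a flat Series."""
    corr = returns.corr()
    rows, cols = np.triu_indices(len(corr), k=1)   # upper triangle, no diagonal
    pairs = pd.MultiIndex.from_arrays(
        [corr.index[rows], corr.columns[cols]], names=["stock_a", "stock_b"]
    )
    return pd.Series(corr.to_numpy()[rows, cols], index=pairs)


def correlation_of_correlations(train_returns, test_returns):
    """Correlate the training-period rhos with the test-period rhos."""
    # Use the same tickers in the same order so the pair labels line up.
    common = train_returns.columns.intersection(test_returns.columns)
    train = pairwise_correlations(train_returns[common])
    test = pairwise_correlations(test_returns[common])
    return train.corr(test)   # pairs with a missing rho are skipped
```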

This measure may be clouded by heavy fluctuations around 0 for tons of company pairs which aren’t related at all, so let’s ask a slightly more practical question: if I identify a super high correlation (i.e., I think there is a real relationship), how often is this repeated? The correlation of correlations within company pairs which have a ρ>.5 in the training set jumps to about 40%. If a company pair had ρ>.5 in training, there was about a 50% chance it would have ρ>.5 again in test. For more detail, I split out the distributions of both training and test correlations into buckets based on the training correlation…

Within the bottom group, which we had identified as having a high correlation in 1995-2005, there is a much higher rate of continued correlation than in the general population. Also, the ‘misses’, which have lower than .5 correlation, still tend to be higher than average. This is some slightly better news.
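The "how often does a high correlation repeat" question can be sketched as a small filter-and-count, assuming the training and test ρ values are two pandas Series indexed by the same pair labels. The function name and the keys of the returned dict are made up for illustration.

```python
import pandas as pd


def persistence_of_high_pairs(train_rho, test_rho, threshold=0.5):
    """Among pairs with training rho above `threshold`, how many repeat?"""
    both = pd.concat({"train": train_rho, "test": test_rho}, axis=1).dropna()
    high = both[both["train"] > threshold]
    return {
        "n_high": len(high),
        # Share of high-training pairs that clear the threshold again in test.
        "repeat_rate": float((high["test"] > threshold).mean()),
        # Correlation of correlations restricted to the high-training pairs.
        "rho_within_high": float(high["train"].corr(high["test"])),
    }
```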

 

Seeing the Forest

So roughly half of correlations are stable from time period to time period. What does this mean for the market picture as a whole? This simple statistic can’t show us how the whole market picture changes. For one thing, stocks are coming and going. For another, we can’t see which relationships continue to be stable for a long period of time vs the ones that are very fleeting. Is there any predictability to the nature of connectedness within the market? This is a question that is best explored visually, at least for a first pass. First, let’s examine what the entire network of correlations looks like in these two time periods. To do this I put together a network graph, which connects each pair of companies whose correlation is above 0.5. I colored the nodes by the company’s sector, and sized them according to market capitalization.
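The graph itself is just an edge list filtered out of the correlation matrix. Here is a minimal sketch, again assuming a wide table of returns with one column per ticker; the `correlation_edges` helper and its column names are hypothetical, and the resulting table could be handed to whatever graph library does the drawing.

```python
import numpy as np
import pandas as pd


def correlation_edges(returns, threshold=0.5):
    """Edge list of every pair of columns with Pearson rho above `threshold`."""
    corr = returns.corr()
    rows, cols = np.triu_indices(len(corr), k=1)   # each pair counted once
    rho = corr.to_numpy()[rows, cols]
    keep = rho > threshold                         # NaN pairs drop out here
    return pd.DataFrame({
        "source": corr.index[rows][keep],
        "target": corr.columns[cols][keep],
        "weight": rho[keep],
    })
```

Node color and size would come from a separate per-company table (sector and market capitalization) joined on the ticker.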

 

The Evolution of Everything

 

On the surface, these look like two very different regimes. However, realize that there are a lot of similarities: there is a dense core of related stocks in the public utilities and basic industries sectors which seems to persist. To examine how we shifted between these two regimes, I’ll try to paint a portrait of the full evolution of the market across time. To do this I’ll use a similar strategy, but for edges I’ll use a two-year running correlation between pairs of stocks, and I’ll do this for every week between 2002 and 2012 (FYI I plan to re-run this to better match the time period used above).

I can’t stress enough how critical it is that this piece of the analysis is running on the cloud. I’m now doing millions of comparisons each week for hundreds of weeks. The job would take hours if not days on my desktop, and it would be tough to parallelize since my machine doesn’t have the memory to hold copies of large datasets on every core. But on the m4 that I spun up to do this, the whole job takes just minutes. As if the correlations aren’t enough computation, there’s also a lot of difficulty in creating the graph. In order to create a visually simple layout, these plots use an algorithm called Fruchterman-Reingold. This process essentially runs physics simulations in which the nodes repel each other, but connected nodes are attracted according to the weight of their connection. I’m creating layouts on the fly in a way which allows the nodes to move around slowly from time period to time period, minimizing total movement, but also trying to maximize simplicity according to the Fruchterman-Reingold algorithm. Those simulations take a TON of computing power.
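To give a feel for what that simulation does, here is a bare-bones NumPy version of a Fruchterman-Reingold style layout. It is not the code behind the movie. It uses the textbook forces (repulsion of k²/d between every pair of nodes, attraction of w·d²/k along each edge) and approximates the "move slowly between frames" behavior in one simple, assumed way: start each frame from the previous frame’s positions with a small step limit. The function name, parameters, and defaults are all illustrative.

```python
import numpy as np


def fruchterman_reingold(weights, pos=None, iterations=50,
                         temperature=0.1, seed=0):
    """Minimal force-directed layout.

    weights: symmetric (n, n) array of edge weights, 0 where there is no edge.
    pos: optional (n, 2) starting positions, e.g. the previous frame's layout.
    temperature: largest distance a node may move in a single iteration.
    """
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    rng = np.random.default_rng(seed)
    pos = rng.random((n, 2)) if pos is None else np.array(pos, dtype=float)
    k = np.sqrt(1.0 / n)                    # ideal spacing in a unit square
    cooling = temperature / (iterations + 1)
    for _ in range(iterations):
        delta = pos[:, None, :] - pos[None, :, :]        # (n, n, 2) offsets
        dist = np.linalg.norm(delta, axis=-1)
        np.clip(dist, 0.01, None, out=dist)              # avoid dividing by 0
        # Net force per unit offset: repulsion minus weighted attraction.
        force = k * k / dist**2 - weights * dist / k
        disp = (delta * force[:, :, None]).sum(axis=1)
        length = np.linalg.norm(disp, axis=1, keepdims=True)
        np.clip(length, 0.01, None, out=length)
        # Move along the net force, capped by the current temperature.
        pos += disp / length * np.minimum(length, temperature)
        temperature -= cooling
    return pos


# Frame 1 from scratch, frame 2 warm-started so nodes only drift a little.
frame1_edges = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
                         [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
frame2_edges = np.array([[0, 1, 1, 0], [1, 0, 0, 0],
                         [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
frame1 = fruchterman_reingold(frame1_edges)
frame2 = fruchterman_reingold(frame2_edges, pos=frame1,
                              iterations=10, temperature=0.01)
```

Every iteration computes forces between all pairs of nodes, so the cost grows with the square of the number of stocks, and that is repeated for every frame of the movie.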

 

Without further ado- enjoy the film:

Well, it may not make it into Sundance, but there is a TON we can learn from this little movie. It’s interesting to see how the highly bunched technology sector seems to have broken up around 2004. Another large cluster of consumer goods and services popped up around the same time. Aside from the ebb and flow of particular sector clusters, the relationships among the sectors fluctuate as well. In a brief period around 2011, our main cluster of public utilities gravitates towards consumer services. There are lots of hypotheses we can generate from this exploration. Perhaps company relationships which are part of a larger cluster are more consistent than relationships between only two companies. Perhaps relationships within a sector are more consistent than those across sectors.