Event Study: The Effects of Cybersecurity Breaches on Stock Price

by Zach Zhao and Gabriel Naval

Introduction

Some Background

Most literature on forecasting a company's future stock price does not include recent cybersecurity breaches as a key factor. Many studies of the economic effects of cybersecurity breaches support this omission by showing that breaches have relatively small economic consequences. Because breaches appear to have little effect on future performance, stock prices are able to rebound to their pre-breach levels, with supposedly no long-term changes. However, there is research suggesting that there are other long-term consequences, such as declines in firm productivity, research and development spending, patents, and investment efficiency.

What are we doing and why?

There appears to be conflicting information on the long-term effects of cybersecurity breaches on companies. Through this project, we would like to determine what the long-term effects of cybersecurity breaches on company stock truly are. Hopefully, we can show that cybersecurity breaches do have lasting consequences, which would demonstrate that breach disclosures deserve a place in predictive financial models. Showing this would strengthen the case for better cybersecurity efforts and funding.

How will we do this?

We will confirm, through an event study, that cybersecurity breach disclosure has a short-term impact on company stock price. We then shift our focus to the long term by analyzing the differences between financial analysts' predictions and actual stock prices a year after breach disclosure. In this long-term analysis, we hope to find a difference between financial analysts' predictions and actual stock prices, which would suggest that predictive financial models are lacking and should treat cybersecurity breaches as important for long-term stock price predictions.

Set Up

This will all be done using Python, leveraging Jupyter Notebooks to visualize various plots, graphs, and tables. Here is some useful information for installing these tools: 1, 2, 3, 4.

Below are the packages we will use to collect and visualize our data.
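(A minimal, illustrative set of imports along these lines would cover the collection and visualization steps below; your environment may need additional packages.)

```python
# Core data-wrangling and plotting libraries used throughout this tutorial.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data sources: pandas_datareader for Yahoo Finance prices, wrds for IBES analyst forecasts.
import pandas_datareader.data as web
import wrds

# Statistics utilities for trimmed means and hypothesis tests later on.
from scipy import stats
```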

Data Collection

We will be using the WRDS (Wharton Research Data Services) database to find financial analyst predictions of stock prices through its IBES dataset. The WRDS database is provided to all UMD staff and students for free; you can sign up for an account here. When running the following section of code, you must supply your account's credentials. We recommend setting up a pgpass file to help automate the process.
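As a rough sketch, connecting with the wrds package might look like the following; the username is a placeholder for your own credentials:

```python
import wrds

# Opens a connection to WRDS; you will be prompted for a password unless a
# .pgpass entry has been configured for your account.
db = wrds.Connection(wrds_username="your_username")

# Optionally create the .pgpass entry so future connections skip the prompt.
db.create_pgpass_file()
```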

We will be using the Audit Analytics February 2021 issue of cybersecurity data breaches. To get this dataset, we contacted the University of Maryland's Smith School of Business, which provided this Excel file to us. Audit Analytics is an organization that tracks relevant business and financial data, such as information about a company's cybersecurity breaches.

We will now load in the data by accessing the correct Excel sheet.
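A minimal loading sketch, with placeholder file and sheet names that should be swapped for the ones in your copy of the dataset:

```python
# Load the Audit Analytics breach data; the file name and sheet name here are
# placeholders -- substitute the names used in your copy of the Excel file.
breaches = pd.read_excel("audit_analytics_cyber_breaches_feb2021.xlsx",
                         sheet_name="Cyber Breaches")
breaches.head()
```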

Each row in this dataset represents a company's cybersecurity breach. The dataset contains useful information such as the company breached, date of public disclosure of the breach, and other varying information about the company and the type of cybersecurity breach performed.

To clean the data, we need to drop all the rows that don't contain a company ticker. Tickers are short alphanumeric abbreviations that uniquely identify a publicly traded company (e.g., Amazon is uniquely identified by its ticker, AMZN). If a row doesn't have a ticker symbol, there may have been data corruption or human error logging the data, or the company may not be publicly traded. Either way, we need to remove those rows. We also extract the relevant columns for our analysis (as shown in the table_columns array). Once this is all done, we are left with a dataset of 737 cybersecurity breaches at publicly traded companies.
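A sketch of this cleaning step, with illustrative column names (the actual Audit Analytics headers may differ):

```python
# Columns we keep for the analysis; these names are illustrative and should be
# matched to the actual Audit Analytics column headers.
table_columns = ["Company Name", "Ticker", "Date of Disclosure", "Type of Attack"]

# Drop breaches without a ticker (not publicly traded, or bad data) and keep
# only the columns we care about.
breaches = breaches.dropna(subset=["Ticker"])[table_columns].reset_index(drop=True)
len(breaches)  # 737 publicly traded breach events in the original analysis
```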

Now, let's try to find the monthly stock price of each of these firms following the disclosure of the breach.

Before we do that, we define a short helper function that will help us find the closest date in a set that corresponds to X months after the disclosure of a breach. We will make use of this utility in our main function for finding monthly stock prices.
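One possible implementation of such a helper, assuming the dates are pandas Timestamps:

```python
def nearest(dates, target):
    """Return the date in `dates` closest to `target`.

    `dates` is any iterable of pandas Timestamps (e.g. a DatetimeIndex) and
    `target` is the date we would ideally like (X months after disclosure).
    """
    return min(dates, key=lambda d: abs(d - target))
```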

With that out of the way, let's construct a function to obtain the monthly stock prices after the disclosure of the data breach. Let's break it down!
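Here is a minimal sketch of what such a function might look like, assuming the nearest helper above, illustrative column names like "Ticker" and "Date of Disclosure", and the pandas_datareader Yahoo Finance reader:

```python
from datetime import timedelta

def monthly_stock_prices(row, months=12):
    """Return a list of monthly closing prices following a breach disclosure."""
    ticker = row["Ticker"]
    start = row["Date of Disclosure"] - timedelta(days=1)   # day before disclosure
    end = start + pd.DateOffset(months=months)

    try:
        # Daily prices between the start and end dates from Yahoo Finance.
        df = web.DataReader(ticker, "yahoo", start, end)
    except Exception:
        # Ticker not found (or another API error): return all NaNs.
        return [np.nan] * (months + 1)

    prices = []
    for m in range(months + 1):
        target = start + pd.DateOffset(months=m)
        if target > df.index.max():
            # Month lies in the future (or data is missing) -- record NaN.
            prices.append(np.nan)
        else:
            prices.append(df.loc[nearest(df.index, target), "Close"])
    return prices
```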

Our function has two parameters: a row from our original breach dataframe and the number of months to get stock prices for. First, our function determines the range of dates over which to obtain monthly stock prices. Note that our starting date is the day before the breach disclosure, to control for any fluctuations in stock price caused by that disclosure. Following this, we leverage pandas_datareader, a wrapper for various APIs. Specifically, we will be using its Yahoo Finance functionality, which provides us with a dataframe of stock prices (df) beginning at our start date and ending at our end date. We then traverse this dataframe, using our nearest helper function, to obtain the monthly stock prices and return them as an array.

If the Yahoo Finance API cannot find a company's stock prices for whatever reason, the function returns an array of np.nan's. Likewise, if no stock price is available for a given month, either because it's missing or because that month lies in the future and hasn't happened yet, the array will contain np.nan for those months.

Note: We record closing stock prices, meaning it's the stock price at the end of the trading day.
Note: The Yahoo Finance API has a limit of 2,000 requests per hour. As we only have 737 breaches, we won't hit that limit, but keep it in mind when using APIs.

Let's run our function on each row in our dataset. We'll be finding the monthly stock prices spanning a year after the disclosure of the breach.

Note: This section of code takes a while to run (20-30 minutes) because we will be making API requests, loading in data from a server, and performing operations on said data.

Following this, we can concatenate said data to our original dataframe.
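A sketch of both steps, using the illustrative column names from earlier:

```python
# Apply the function to every breach; this is the slow step (one API call per row).
month_cols = [f"month_{m}_price" for m in range(13)]   # month 0 through month 12
monthly_prices = breaches.apply(
    lambda row: pd.Series(monthly_stock_prices(row, months=12), index=month_cols),
    axis=1,
)

# Attach the monthly prices to the original breach dataframe.
breaches = pd.concat([breaches, monthly_prices], axis=1)
```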

We now have the actual stock prices. Let's move on to finding analyst predictions for these companies.

We define the function below to find the analyst stock price predictions. It makes use of the IBES database in WRDS. The function takes all the financial analyst predictions within a month of the disclosure of the breach that forecast the company's stock price a year into the future. Since multiple financial analysts may make predictions, this function returns the median and mean of these predictions. If no predictions are found, the function returns np.nan's.

Note: This function makes use of SQL, a programming language used to communicate with databases. Here are some helpful resources to get started learning about SQL: CodeAcademy, KhanAcademy
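As a rough sketch, the query might look like the following; the IBES price target detail table (ibes.ptgdet) and its oftic (ticker), value (price target), horizon (forecast horizon in months), and anndats (announcement date) columns are assumptions that should be verified against the WRDS documentation:

```python
def analyst_predictions(row):
    """Return (median, mean) of one-year analyst price targets made within a
    month of the breach disclosure, or (np.nan, np.nan) if none exist."""
    ticker = row["Ticker"]
    disclosure = row["Date of Disclosure"].strftime("%Y-%m-%d")

    query = f"""
        SELECT value
        FROM ibes.ptgdet
        WHERE oftic = '{ticker}'
          AND horizon = 12
          AND anndats BETWEEN '{disclosure}'
                          AND ('{disclosure}'::date + INTERVAL '1 month')
    """
    targets = db.raw_sql(query)

    if targets.empty:
        return np.nan, np.nan
    return targets["value"].median(), targets["value"].mean()
```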

We can now run the function on each company to get the financial analyst forecasts.

Nice! We have now collected all the data to compare actual stock prices with financial analyst predicted stock prices. But before we do some Exploratory Data Analysis (EDA), we need to do...

Data Transformation and Management

As it turns out, we might want to transform some of our data relating to stock prices because of innate variation between companies.

To better understand this problem, consider this hypothetical: suppose company A's and company B's stock prices both double after a year. However, company A's stock started much smaller, say at \$10 per share, and became \$20 per share, while company B's stock went from \$100 to \$200 per share. Their absolute growth is rather different, but their percent growth is the same. When comparing growth between companies, it makes more sense to compare percent growth, since it controls for the company's already established stock price (whether high or low). In a way, finding the percent growth is like standardizing each company's current stock price by its initial stock price.

For that reason, we need to look at percent stock price change for these companies, where the initial stock price is the stock price on the day of the breach disclosure. The code below transforms the data to percent stock price change for both the actual and predicted stock prices.
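A sketch of this transformation, assuming the illustrative column names introduced earlier and hypothetical analyst_median / analyst_mean columns holding the forecasts:

```python
# Convert raw prices to percent change relative to the price on the day of
# disclosure (month 0). The same transformation is applied to the analyst
# price targets so everything is on a comparable scale.
initial = breaches["month_0_price"]
for m in range(13):
    col = f"month_{m}_price"
    breaches[f"month_{m}_pct"] = (breaches[col] - initial) / initial * 100

for col in ["analyst_median", "analyst_mean"]:
    breaches[f"{col}_pct"] = (breaches[col] - initial) / initial * 100
```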

Exploratory Data Analysis

To begin, let's make some boxplots and violin plots to get a better understanding of how actual stock prices change over time. We will be making use of the seaborn Python library. We also make use of melting (more info here).
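A sketch of the melting and plotting steps, with illustrative column names:

```python
# Reshape from wide (one column per month) to long (month, pct) so seaborn can
# plot one distribution per month.
pct_cols = [f"month_{m}_pct" for m in range(13)]
long_df = breaches.melt(value_vars=pct_cols, var_name="month", value_name="pct_change")

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.boxplot(data=long_df, x="month", y="pct_change", ax=axes[0])
sns.violinplot(data=long_df, x="month", y="pct_change", ax=axes[1])
for ax in axes:
    ax.tick_params(axis="x", rotation=45)
plt.show()
```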

Well... these plots don't really help, but why? It seems that there are some major outliers making it hard to see how the percent change in actual stock price shifts over time. We have two options here:

  1. Remove the outliers and re-plot the data.
  2. Find a better metric to represent these distributions over time.

Let's opt for the second option. There are other metrics to represent these distributions, namely seeing how the "middle" of these distributions changes over time. We can define the "middle" of each of these distributions to be the mean or median stock price percent change over time.

Let's take the naive approach of plotting the mean over time.

It seems like the mean trends upwards over time. That is to say, after public disclosure of a breach, company stock prices still tend to trend upwards over time. It seems that public disclosure of a breach might not have long-term consequences for a company's stock price.

But wait! Let's not forget:

"There are three kinds of lies: lies, damned lies, and statistics." - Mark Twain

Means are only good representations of the "middle" of a distribution given that there are no influential outliers and no skewness. A better way of representing the "middle" would be to use the median, which is less affected by outliers and skewness. Let's plot the medians.

It seems like when we plot the median over time, the same trend occurs, but it's important to note that the stock price percent change values are not as large as in the plot of means. Even more important is that the trend no longer seems exactly linear. It seems like the stock price grows slowly at the beginning and then increases more quickly later on. This could be representative of a company recovering from the public disclosure of the breach at the start (which dampens stock gains), after which the company's growth returns to normal. We'll look more into the details of this when we perform the event study.

Another way of representing the "middle" is to take a trimmed mean to get rid of outliers. Below, we took the 5% trimmed means and plotted them. The plot shows similar trends and observations as the plot of medians.
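A sketch of the trimmed-mean computation using scipy (column names illustrative):

```python
from scipy import stats

# 5% trimmed mean of each month's percent-change distribution: the top and
# bottom 5% of values are discarded before averaging.
trimmed_means = [stats.trim_mean(breaches[f"month_{m}_pct"].dropna(), 0.05)
                 for m in range(13)]

plt.plot(range(13), trimmed_means, marker="o")
plt.xlabel("Months since disclosure")
plt.ylabel("5% trimmed mean of stock price % change")
plt.show()
```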

Let's now compare the actual versus financial analyst predictions of the stock price a year after the public disclosure of a cybersecurity breach. We will do this through a violin plot.

Once again, there appear to be some outliers in these distributions, specifically in the financial analyst predictions. As mentioned before, there are two options for handling these outliers. We don't really want to use a new "middle" metric because we want to compare the actual distributions, so we will instead opt for option one and remove the outliers.

We will be removing the outliers for the financial analyst median and mean predictions. To remove the outliers, we need some rule for labeling a point as an outlier. There are different methods for classifying outliers, but we opt to abide by the three-sigma rule, which states that nearly all values lie within three standard deviations of the mean. If a point lies beyond three standard deviations from the mean, we classify it as an outlier and remove it from the distribution. The following creates distributions without these outliers.
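A sketch of a three-sigma filter (column names illustrative):

```python
def remove_outliers(series):
    """Drop values more than three standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    return series[(series - mean).abs() <= 3 * std]

analyst_median_clean = remove_outliers(breaches["analyst_median_pct"].dropna())
analyst_mean_clean = remove_outliers(breaches["analyst_mean_pct"].dropna())
```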

Let's now create a new violin plot without the outliers.

This violin plot is much more legible than the previous one. It seems from this violin plot that the financial analyst predictions tend to vary more than the actual stock prices. It also hints that financial analysts tend to overestimate the actual stock price of these companies. A better way of visualizing these differences is to look at the residuals, where a residual is the actual minus the predicted stock price percent change.

The following code will compute the residuals for each prediction type and plot them. Note that for this part, the outlier financial analyst predictions have been kept.
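A sketch of the residual computation and plot, with illustrative column names:

```python
# Residual = actual one-year percent change minus the analyst-predicted
# percent change (one residual per prediction type).
breaches["residual_median"] = breaches["month_12_pct"] - breaches["analyst_median_pct"]
breaches["residual_mean"] = breaches["month_12_pct"] - breaches["analyst_mean_pct"]

sns.violinplot(data=breaches[["residual_median", "residual_mean"]])
plt.ylabel("Residual (actual - predicted % change)")
plt.show()
```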

The distributions of the different residuals appear very similar. They center around 0 but are skewed towards the negative end, which means that financial analyst predictions tend to be greater than the actual stock prices.

For a better look, here's a display of the summary statistics for each residual distribution. Note that our initial sample of cybersecurity breaches was 737, but it has now shrunk to 474. This could be because the Yahoo Finance API does not contain stock prices for certain companies, because IBES has no predictions for smaller companies, or because a year hasn't yet elapsed since the public disclosure of the cybersecurity breach.

It does seem that financial analysts overestimate the actual stock prices of firms that have recently issued public disclosures of cybersecurity breaches. We need to perform a more rigorous statistical study to conclude this, which takes us to our next section...

Hypothesis Testing

We want to perform a statistical test to confirm our finding that financial analysts overestimate the actual stock price. Specifically, we want a test that shows that the residuals we've found skew negative. In other words, we want to show that the mean ($\mu$) of the distribution of residuals is negative, and that these results are statistically significant, leaving little doubt that the true mean is actually below 0.

The best test for this situation is a paired-sample t-test, which reduces to a one-sample t-test on the residuals (the paired differences). To understand how this test works, we need to understand what the null and alternative hypotheses are. The null hypothesis states that a population parameter is equal to a hypothesized value. The alternative hypothesis states that the population parameter differs from the hypothesized value posited in the null hypothesis. These two hypotheses are mutually exclusive (if one is true, the other is false). In this case, our hypotheses are:
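$$H_0: \mu = 0 \qquad\qquad H_a: \mu < 0$$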

The null hypothesis roughly translates to: the mean of the distribution of residuals is equal to 0, meaning that we expect no difference between the actual and predicted stock prices. The alternative hypothesis roughly translates to: the mean of the distribution of residuals is less than 0, meaning that the predicted stock prices tend to be greater than the actual stock prices.

In this statistical test, we begin by assuming the null hypothesis is true. We then study our sample dataset (through statistical methods) to see whether, given that the null hypothesis is true, the sample could reasonably occur. If our data could not reasonably occur under the premise of the null hypothesis, then we reject the null hypothesis and assume it is false; in that case we accept the alternative hypothesis as true. If the data is not significantly opposed to the premise of the null hypothesis, we fail to reject the null hypothesis, meaning we do not have enough evidence to conclude the alternative. This is the essence of most statistical hypothesis testing. For a more thorough explanation, click here.

Before we can perform this test, we need to meet certain prerequisites: the three assumptions of independence of observations, approximately normal distributions, and no major outliers. We can reasonably assume independence of observations since one company's residual does not affect another's. From the violin plot, the distributions look approximately normal, and to meet the assumption of no major outliers, we will remove outliers from the dataset when performing the test. Depicted below is a violin plot of the residuals with the major outliers removed.

We also have to establish a significance level before running the test. We choose a significance level of 5%, meaning if a sample has a less than 5% chance of occurring given the null hypothesis, then we will reject the null hypothesis and accept the alternative.

The following code will output the p-values for the median and mean residual hypothesis tests respectively. Outliers have been removed for this test.

Note: We are performing a one-tailed test, so we will halve the p-values.
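A sketch of these tests using scipy, reusing the remove_outliers helper sketched earlier (column names illustrative):

```python
from scipy import stats

for col in ["residual_median", "residual_mean"]:
    residuals = remove_outliers(breaches[col].dropna())
    # scipy's one-sample t-test is two-tailed, so we halve the p-value for the
    # one-tailed alternative (mean < 0) and check the sign of the t statistic.
    t_stat, p_two_tailed = stats.ttest_1samp(residuals, 0)
    p_one_tailed = p_two_tailed / 2 if t_stat < 0 else 1 - p_two_tailed / 2
    print(col, "t =", round(t_stat, 3), "one-tailed p =", p_one_tailed)
```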

These p-values are far smaller than our significance level of 5%. Since the chance that these samples could have occurred under the null hypothesis is so small, we reject the null hypothesis and accept the alternative hypothesis that $\mu < 0$. We can now conclude that financial analysts tend to overestimate the stock price of a company after the public disclosure of a cybersecurity breach.

Aside: What if we hadn't removed the outliers for the residuals?

Suppose we hadn't removed the outliers and still ran the t-tests as is. We would have gotten these p-values:

We would have still rejected the null hypothesis and accepted the alternative, but with the major caveat that we don't meet all the assumptions required to run this test.

Event-Study

Overview

The most widely accepted procedure for studying the impact of events on firm value is an event study. In our research, we use an event study to examine the difference between a company's actual stock price after an event and a predicted stock price derived from stock price data before the event. This difference between the actual return and the expected return is called the abnormal return. We show that the deviation in abnormal returns is attributable to the disclosure of the data breach. We also show that financial analyst predictions tend to be too optimistic after the disclosure of a cybersecurity breach.

Introduction

Our event study seeks to establish the effect of breaches on the stock price of affected companies. To measure this effect, we analyze the abnormal returns ($AR_{i,t}$), the actual returns ($r_{i,t}$) minus the normal returns ($NR_{i,t}$), in the aftermath of a breach announcement. Actual returns, $r_{i,t}$, are the real stock price changes, measured as $(p_{i,t} - p_{i,t-1}) / p_{i,t-1}$, where $p_{i,t}$ is the adjusted stock price of firm $i$ on day $t$. The normal return is the stock return firm $i$ would have had on day $t$ had the breach event not occurred. Because it is hypothetical, the normal return must be estimated using a model.

There are a variety of different models for normal returns, and they can generally be classified into two types: statistical and economic models. We chose to use statistical models because they offer good performance for their simplicity.

Among the statistical models, there are two major types: the constant mean model and the market model. A constant mean model takes the average of a firm's returns over the estimation period and uses that mean for all normal returns. A market model builds upon this concept and fits a linear model relating the market return (S&P 500 returns) to the firm's return. The constant mean model can be viewed as a market model with the market coefficient $\beta=0$.

Market models make the following assumptions:

Data Loading

We use the stock_indicators.csv file prepared during the earlier data collection and transformation steps. This CSV file has the stocks with invalid tickers cleaned out and adds several columns of stock prices for the months after the breach.

Here, we define the estimation and event windows for our analysis. The max_normal_range denotes the length of the maximum estimation window: we will retrieve max_normal_range days' worth of stock prices before the event, though we won't necessarily use all of them. Similarly, max_event_range denotes the number of days of stock prices we retrieve for the event window.
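For illustration, the window definitions might look like this (the particular values are assumptions):

```python
# Maximum lengths (in business days) of the estimation window before the event
# and of the event window around it. These specific values are illustrative.
max_normal_range = 120   # up to ~120 trading days of pre-event history
max_event_range = 15     # covers roughly +/- 7 business days around the event
```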

We use the S&P 500 index as our market basket. Each individual stock's performance will be compared against this market index.
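A sketch of retrieving the index with yfinance; the date range is illustrative and just needs to cover every breach's estimation and event windows:

```python
import yfinance as yf

# Daily S&P 500 history; every firm's returns are benchmarked against this index.
sp500 = yf.download("^GSPC", start="2005-01-01", end="2021-05-01")
sp500_returns = sp500["Close"].squeeze().pct_change().dropna()
```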

Market Model

Here, we implement the market model to predict the normal returns of the affected companies' stock prices.

The normal returns are the returns of a stock that would have happened without the breach. This market model consists of a regression with the following form: $$r_{i,t} = \alpha_i + \beta_i * r_{m,t} + \epsilon_{i,t}$$

where $r_{i,t}$ is firm $i$'s return and $r_{m,t}$ is the market return. $\alpha_i$ and $\beta_i$ are the regression intercept and coefficient, respectively, and $\epsilon_{i,t}$ is the regression error term. Our time increments are in business days, so if the date of the breach, $t=0$, is 4/23/2021 (a Friday), then $t=1$ would be 4/26/2021 (the following Monday). We then use the fitted model to estimate $NR_{i,t}$ as $\alpha_i + \beta_i * r_{m,t}$.
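A minimal sketch of the market model fit and the resulting normal and abnormal returns, using scipy's linregress:

```python
from scipy import stats

def market_model(firm_returns, market_returns):
    """Fit r_i = alpha_i + beta_i * r_m over the estimation window and return
    (alpha_i, beta_i). Both inputs are aligned series of daily returns."""
    fit = stats.linregress(market_returns, firm_returns)
    return fit.intercept, fit.slope

def normal_returns(alpha, beta, market_returns):
    """Normal (counterfactual) returns NR_{i,t} = alpha_i + beta_i * r_{m,t}."""
    return alpha + beta * market_returns

def abnormal_returns(firm_returns, alpha, beta, market_returns):
    """Abnormal returns AR_{i,t} = r_{i,t} - NR_{i,t} over the event window."""
    return firm_returns - normal_returns(alpha, beta, market_returns)
```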

Our market model estimates daily percentage changes in stock price. To measure the full impact of a breach, we need to aggregate these changes over a period. We track the Cumulative Abnormal Return (CAR) within +/- 7 business days of the breach. $CAR_i(a,b)$ is defined as follows:

$$CAR_i(a,b) = \sum_{t=a}^{b}{AR_{i,t}}$$

For small intervals, the CAR serves as a good aggregate of abnormal returns. Over longer intervals, the abnormal returns would compound, causing the real stock price to diverge from the cumulative returns.
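A sketch of the CAR computation over a window of business-day offsets:

```python
import numpy as np

def car(abnormal, a, b, event_index=0):
    """Cumulative abnormal return CAR_i(a, b), where a and b are business-day
    offsets relative to the event day, which sits at position `event_index` of
    the abnormal-return series (event_index must be >= |a| when a is negative)."""
    ar = np.asarray(abnormal)
    return ar[event_index + a : event_index + b + 1].sum()
```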

Stock price retrieval

In this function, we use the specified estimation and event windows to retrieve the stock prices of interest for every company. We use the yfinance library to retrieve stock prices by ticker and date from Yahoo Finance, a stock price database. After retrieving the relevant stock prices for a company, we execute market_model() to find the parameters alpha and beta. We cache the stock prices, model parameters, normal returns, and abnormal returns in a dictionary for every breach event.
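A rough sketch of such a function, reusing the market model helpers, window lengths, and S&P 500 returns sketched above (column names and the exact structure of the cached results are assumptions):

```python
def normal_return_model(row):
    """Download prices around one breach, fit the market model, and return a
    dictionary of intermediate results for that event (or None on failure)."""
    ticker = row["Ticker"]
    event_date = row["Date of Disclosure"]

    # Pull enough daily prices to cover both the estimation and event windows.
    start = event_date - pd.tseries.offsets.BDay(max_normal_range)
    end = event_date + pd.tseries.offsets.BDay(max_event_range)
    prices = yf.download(ticker, start=start, end=end, progress=False)
    if prices.empty:
        return None  # malformed ticker, or Yahoo Finance does not carry it

    returns = prices["Close"].squeeze().pct_change().dropna()
    market = sp500_returns.reindex(returns.index).dropna()
    returns = returns.reindex(market.index)

    # Estimation window: days strictly before the event; event window: the rest.
    est_firm, est_mkt = returns[returns.index < event_date], market[market.index < event_date]
    evt_firm, evt_mkt = returns[returns.index >= event_date], market[market.index >= event_date]

    alpha, beta = market_model(est_firm, est_mkt)
    nr = normal_returns(alpha, beta, evt_mkt)

    return {"ticker": ticker, "prices": prices, "alpha": alpha, "beta": beta,
            "normal_returns": nr, "abnormal_returns": evt_firm - nr}
```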

We execute normal_return_model(), which downloads prices and runs the market model regression, over all companies in the breach database. For some stocks, our downloader may fail because the ticker is malformed or because our stock database (Yahoo Finance) does not carry the stock.

After running these regressions, we save the Cumulative Abnormal Returns (CAR) to a separate CSV file and pickle our auxiliary data into raw_results.pkl. This serves as a good checkpoint, as downloading and running the regressions takes a substantial amount of time.
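A sketch of this checkpointing step, with car_df, results, and the CSV file name as placeholder names for the collected outputs:

```python
import pickle

# Checkpoint: persist the per-event CARs and the raw per-event results so the
# slow download/regression step does not need to be rerun.
car_df.to_csv("car_results.csv", index=False)
with open("raw_results.pkl", "wb") as f:
    pickle.dump(results, f)
```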

Reloading the data

We join the results of the run back into our original dataframe (with company ticker and other related info).

Individual Company Analysis

In this section, we analyze the financial impact of a particular breach, the SolarWinds breach, in detail.

On December 14, 2020, SolarWinds publicly announced that it had experienced a major breach of its "Orion" system. SolarWinds is an information technology company that supplies software to major firms and government organizations. Through this breach, the attackers were able to gain access to many organizations' IT systems, allowing them to install further malware. Over 18,000 organizations were affected, including Fortune 500 companies like Microsoft and government organizations like the Pentagon.

We plot the market model and the Cumulative Abnormal Returns (CAR) for SolarWinds around this breach.

General Company Data

Market Model Return History (Estimation)

Here, we compare the returns of SolarWinds alongside the S&P 500 index. The SolarWinds stock has higher variance than the S&P 500, as the S&P 500 is an aggregate over many different companies (and therefore diversified).

In this regression plot, we pair each day's S&P 500 return with SolarWind's return. The data satisfies most of linear regression's assumptions:

The independence assumption, which requires that data points be independent, may be slightly violated, as stock returns are temporally correlated with one another. However, over a large window these violations do not affect the model much, and research into market models suggests that linear regression remains an effective tool for estimating normal returns.

Event Analysis

In this plot, we notice a substantial deviation in stock price following the breach event. The company's stock price dropped over 15% on two of the event window days. The S&P 500 index remained stable across the week.

Here, we plot the S&P 500 and SolarWinds stock prices across the estimation and event windows. Notice how the SolarWinds stock price deviated substantially from the S&P 500 during this period. This serves as supporting (though not conclusive) evidence that it was the breach, and not a market-wide downturn, that caused this drop in stock price.

In this plot, we overlay the event window returns on our market model. Once again, we observe substantial deviations of SolarWinds' stock price relative to our market model, which is based on the S&P 500.

CAR Plots

Here, we analyze the Cumulative Abnormal Returns (CAR) over the entire database of breaches. We would like to detect whether there was a statistically significant change in stock price after the breach.

The cumulative returns over the different date windows follow an approximately normal distribution. Based on these histograms, there is a slight shift of the Cumulative Abnormal Returns towards the negative.
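A sketch of the per-window tests, with illustrative column names for the CAR windows:

```python
from scipy import stats

# One-sample t-test of each window's CAR distribution against a mean of zero.
# The window column names here are illustrative.
for window in ["car_0_0", "car_0_1", "car_0_2", "car_0_3", "car_0_4"]:
    sample = car_df[window].dropna()
    t_stat, p_value = stats.ttest_1samp(sample, 0)
    print(f"{window}: t = {t_stat:.3f}, p = {p_value:.4f}")
```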

The t-tests of the CARs across the (0, 0) to (0, 4) windows suggest that the deviations are statistically significant. Four of the five windows show a p-value under our alpha of 5%. This supports our initial hypothesis that breaches negatively affect a company's returns.