This blog entry was originally written by Patrick Florer; I am just migrating the post to the new SIRA site.
(this is the third post of three)
Experiment #1 – how well does BetaPERT predict the actual data?
Using the overall minimum and maximum for the 94 observations, I ran another Monte Carlo simulation using these parameters:
Min = 0
Most Likely = 2,000,000 (derived as described below)
Max = 7,500,000
gamma/lambda = 4
Monte Carlo iterations = 100,000
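For readers without ModelRISK, the simulation with the parameters above can be roughly approximated in plain Python. This is only a sketch: it assumes VoseModPERT follows the standard modified PERT definition (a Beta distribution rescaled to the min/max range, with gamma/lambda as the shape weight), and the function name sample_modified_pert is made up for illustration.

```python
import random

def sample_modified_pert(minimum, mode, maximum, lam=4.0, n=100_000, seed=0):
    """Draw n samples from a modified PERT (BetaPERT) distribution."""
    span = maximum - minimum
    # Standard modified-PERT mapping onto the two Beta shape parameters.
    alpha = 1.0 + lam * (mode - minimum) / span
    beta = 1.0 + lam * (maximum - mode) / span
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Rescale each Beta(alpha, beta) draw from [0, 1] to [minimum, maximum].
    return [minimum + span * rng.betavariate(alpha, beta) for _ in range(n)]

# Parameters from the post: min = 0, most likely = 2,000,000, max = 7,500,000.
samples = sample_modified_pert(0, 2_000_000, 7_500_000)
mean = sum(samples) / len(samples)
print(f"simulated mean:   {mean:,.0f}")
# Theoretical mean of a modified PERT with lambda = 4 is (min + 4*mode + max) / 6.
print(f"theoretical mean: {(0 + 4 * 2_000_000 + 7_500_000) / 6:,.0f}")
```

The simulated mean should land close to the theoretical one, which is a quick sanity check that the parameters were entered correctly.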
Excel could not calculate a mode because all 94 values in the Ponemon study were unique. Instead, I built a histogram with 76 bin values (0 through 7,500,000 in steps of 100,000) and read the most populated bin, 2,000,000, from it. I used that value as the mode/most likely estimate.
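The same binning idea can be sketched outside Excel. The values below are hypothetical placeholders, not the Ponemon data, and the helper name fullest_bin is invented for this example. Note one assumption: this sketch counts a value in the bin whose lower edge it meets or exceeds, whereas Excel's histogram tool labels bins by their upper value, so which edge you report as the "mode" is a judgment call.

```python
def fullest_bin(values, start, stop, width):
    """Return (lower_edge, upper_edge, count) for the most populated bin."""
    n_bins = (stop - start) // width
    counts = [0] * n_bins
    for v in values:
        if start <= v <= stop:
            # A value equal to `stop` is folded into the last bin.
            idx = min(int((v - start) // width), n_bins - 1)
            counts[idx] += 1
    # Index of the first bin holding the highest count.
    best = max(range(n_bins), key=counts.__getitem__)
    lower = start + best * width
    return lower, lower + width, counts[best]

# Hypothetical cost figures, for illustration only.
example_costs = [420_000, 1_910_000, 1_950_000, 1_990_000, 2_400_000, 3_300_000]
result = fullest_bin(example_costs, 0, 7_500_000, 100_000)
print(result)
```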
Here is the histogram that was created by ModelRISK, using the VoseModPERT function:
As you can see, the shape of the histogram created by the BetaPERT function is similar to the histogram for the actual data. But is it similar enough to be believable? A comparison of values at various percentiles tells a better story:
Here "variance" means the percentage difference between the BetaPERT value and the actual value at a given percentile. With the exception of the minimum, where the variance is due to using 0 as the minimum estimate instead of the observed 378,000, and the estimates at the 5th and 10th percentiles, the variances are within +/-20%. In fact, all of the variances from the 1st percentile up to, but not including, the 99th are positive, which means that below the 99th percentile the BetaPERT function has over-estimated the values.
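A percentile comparison of this kind is easy to reproduce. The sketch below assumes the sign convention implied above (positive means the simulated value is higher than the actual one) and uses tiny made-up lists in place of the real data; percentile_table is a hypothetical helper name.

```python
import statistics

def percentile_table(actual, simulated, percentiles=(1, 5, 10, 25, 50, 75, 90, 95, 99)):
    """Return rows of (percentile, actual, simulated, percent difference)."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    act_q = statistics.quantiles(actual, n=100, method="inclusive")
    sim_q = statistics.quantiles(simulated, n=100, method="inclusive")
    rows = []
    for p in percentiles:
        a, s = act_q[p - 1], sim_q[p - 1]
        # Positive difference: the simulation over-estimates at this percentile.
        rows.append((p, a, s, 100.0 * (s - a) / a))
    return rows

# Placeholder data: the "simulation" is exactly 10% above the "actual" values.
actual_values = [100, 200, 300, 400, 500]
simulated_values = [110, 220, 330, 440, 550]
table = percentile_table(actual_values, simulated_values)
for p, a, s, diff in table:
    print(f"{p:>3}th  actual={a:>8.1f}  simulated={s:>8.1f}  diff={diff:+.1f}%")
```

With real data, the first argument would be the 94 observed costs and the second the Monte Carlo output.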
Is this close enough, especially the 8% underestimate at the 99th percentile? For me, probably so, because we already know that we are going to have trouble in the tails with any kind of estimate. But for you, maybe not - you be the judge.
Experiment #2 – is there another distribution that predicts the actual data more closely than BetaPERT?
The ModelRISK software has a “distribution fit” function that allows you to select a data set for input and then fit various distributions to the data. Using the 94 compliance cost values from the Ponemon study as input, I let ModelRISK attempt to fit a variety of distributions to the data.
The best overall fit was a Gamma distribution, using alpha = 3.668591 and beta = 588093.8. The software calculated these values – I didn’t do it and would not have known where to start.
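For anyone wondering where one would start, here is a rough sketch of fitting a Gamma distribution by hand. It is not ModelRISK's method: it uses a well-known closed-form approximation to the maximum-likelihood shape estimate, and it assumes that beta above is a scale parameter (the implied mean, alpha × beta, of about 2.16 million is consistent with that reading). Since the Ponemon values are not reproduced here, the example fits synthetic data generated from the reported parameters.

```python
import math
import random

def fit_gamma(values):
    """Approximate maximum-likelihood fit; returns (shape, scale)."""
    n = len(values)
    mean = sum(values) / n
    mean_log = sum(math.log(v) for v in values) / n
    # s is positive whenever the (strictly positive) values are not all equal.
    s = math.log(mean) - mean_log
    # Closed-form approximation to the maximum-likelihood shape parameter.
    shape = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    # Given the shape, the maximum-likelihood scale is mean / shape.
    scale = mean / shape
    return shape, scale

# Synthetic stand-in data drawn from the parameters reported in the post.
rng = random.Random(1)
synthetic = [rng.gammavariate(3.668591, 588093.8) for _ in range(20_000)]
shape, scale = fit_gamma(synthetic)
print(f"shape (alpha) = {shape:.3f}, scale (beta) = {scale:,.0f}")
```

The fitted values should come back close to the parameters used to generate the data. With only 94 observations the estimates would be noticeably noisier.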
Here is the histogram:
And here is a comparison based upon percentiles:
This Gamma distribution is an even better fit than the BetaPERT distribution. Except at the extreme tails, the variances fall within +/- 11%, closer than the +/- 20% of the BetaPERT.
From a theoretical point of view, this is interesting. But from a practical point of view, it is problematic. It’s one thing to fit a distribution to a data set and derive the parameters for a probability distribution. But it’s quite another matter to know, in advance, which distribution might best predict a data set and what the parameters should be for that distribution. In addition, the Gamma distribution is typically used to describe the waiting time until a given number of randomly occurring events has taken place. I am not sure how appropriate its use might be to describe a distribution of loss magnitudes or costs – I plan to find out!
Concluding remarks:
While tests on a single, small dataset do not provide conclusive proof that the PERT and other distributions can match actual data, they do provide encouragement and motivation for further testing.
It would be useful to perform tests like these on larger datasets. Perhaps one of you has access to such data? If so, how about doing some tests and writing a blog post?
It might also be possible to use the “Total Affected”/Records exposed data from datalossdb to test the ability of BetaPERT to model reality. I would invite anyone interested to give it a try.
As we build our experience fitting parametric distributions to different data sets, our knowledge of which distributions to try in which circumstances will surely grow and, hopefully, lead to more useful and believable risk analyses.