An excessively in-depth analysis of absence data

If you’re not interested in statistics, probably best give this post a miss – this is one of my technical posts (next laser projector post coming up in a few days). If, on the other hand, you enjoy compulsively analysing everything – this post is for you!

A few days ago I was asked by a friend to take a look at some absence data for an organisation they are involved with. My friend was convinced they have an absence problem with employees but couldn’t really prove it. I don’t really like the idea of convincing a bunch of managers that their workers are consistently bunking off work. Nevertheless I couldn’t help but have a play around with the data. Turns out it’s pretty interesting.

Obviously, I will not be naming the organisation here, and I have changed the actual data so as not to make it public. The results are still identical.

The data

I was given several spreadsheets with absence dates for all employees who took sick leave during the previous year. Each employee’s normal working hours were listed, along with whether or not they work weekends.

Since Excel is nearly useless for any meaningful statistical analysis, I loaded the data into R and started looking for anything of interest.

A quick overview

First I made a heatmap of the absences. Each row represents a calendar month. Annoyingly, the rows are in reverse order to a normal calendar (i.e. the bottom row is January and the top is December). I may go back and sort this out but I don’t want to get carried away. Each day is represented by a cell, with the shading indicating the number of employees off on that day.
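A heatmap along these lines can be sketched in base R. This is only an illustration, not the script used for the post: the `spells` data frame, its made-up dates and the `plot_absence_heatmap` helper are all assumptions. Unlike the figure described above, this version puts January on the top row.

```r
# Hypothetical absence spells (made-up dates): inclusive start and end of each spell.
spells <- data.frame(
  start = as.Date(c("2015-01-05", "2015-01-12", "2015-04-07", "2015-09-14")),
  end   = as.Date(c("2015-01-09", "2015-01-13", "2015-04-10", "2015-09-18"))
)

# Expand each spell into the individual calendar days it covers.
day_list <- lapply(seq_len(nrow(spells)), function(i) {
  seq(spells$start[i], spells$end[i], by = "day")
})
days <- do.call(c, day_list)

# Count absences per (month, day of month); cells with no absences stay at zero.
counts <- unclass(table(
  month = factor(as.integer(format(days, "%m")), levels = 1:12, labels = month.abb),
  day   = factor(as.integer(format(days, "%d")), levels = 1:31)
))

plot_absence_heatmap <- function(counts) {
  # image() plots y increasing upwards, so reverse the rows to put January on top.
  image(x = 1:31, y = 1:12, z = t(counts[12:1, ]),
        col = rev(gray.colors(8)), axes = FALSE,
        xlab = "Day of month", ylab = "")
  axis(1, at = seq(1, 31, by = 5))
  axis(2, at = 1:12, labels = rev(month.abb), las = 1)
}

# Only draw when run interactively; call plot_absence_heatmap(counts) directly otherwise.
if (interactive()) plot_absence_heatmap(counts)
```

Dates that do not exist (30 February, say) simply show up as empty cells, so short months need no special handling.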

In an ideal world, we’d expect to see a relatively even shading. The following would also be probable:

  • Two in every 7 days would see reduced absence, as these days are weekends and only a minority of the company’s staff work weekends.
  • We’d possibly expect higher absences during the winter months.

Here’s what I got:


Number of absences by day of the year (read like a reverse calendar – January is the bottom row)

Well that is interesting. Let’s have a look at what we’ve got:

  • Generally speaking, it looks like the number of sickness absences decreases as the year progresses. I have no idea why this is.
  • There are tons of absences in January. A lot of these are from a Monday to Friday – more on that in a bit.
  • There were bank holidays on the 3rd and 6th of April (a Friday and a Monday); look how many people were ‘sick’ for the rest of that week…
  • Same for the last week of August.
  • The above said, there were bank holidays on the 5th and 25th of May. The absence looks higher than normal here but not strikingly so.
  • Everyone seems to have gone crazy in September. It really stands out but I’d need more info to determine the cause (if any).

This is all well and good, but we have no idea whether the above patterns occurred due to chance. To determine that we’ll need to get a bit more technical.


The probability density function

We’d expect the distribution of the number of days in each absence spell to resemble something like a Poisson distribution.

A Poisson distribution is defined by a parameter Lambda which is equal to both the mean and the variance. So, are the mean and variance similar for our dataset?

> M
[1] 4.027108
> V
[1] 27.33714

No. No they are not.
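One way to put a number on the mismatch is the variance-to-mean ratio (the dispersion index), which sits near 1 for Poisson data. The sketch below uses the two figures printed above; the `dispersion_index` helper and the made-up vector passed to it are illustrative assumptions, not part of the original script.

```r
# Mean and variance of spell length, as printed in the R output above.
m <- 4.027108
v <- 27.33714

# Dispersion index: close to 1 for Poisson data, well above 1 for overdispersed data.
dispersion <- v / m
round(dispersion, 2)   # 6.79

# The same check on raw data: `spell_days` is a numeric vector of spell lengths.
# var() uses the sample variance (n - 1 denominator).
dispersion_index <- function(spell_days) var(spell_days) / mean(spell_days)

# Made-up spell lengths, purely for illustration.
dispersion_index(c(1, 1, 2, 3, 5, 10, 20))
```

A ratio of nearly 7 means the spread is far too wide for a single Poisson process.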

Well, let’s plot the kernel density estimate for our data anyway. We’ll try to fit a Poisson distribution with Lambda = 4.03:


Kernel density estimate for the dataset (black) with Poisson distribution for Lambda = mean (red)
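A plot of this kind can be put together with `density()` and `dpois()`. The following is a sketch on simulated data, not the real absence records: `spell_days` is a hypothetical vector built to have a main bump of short spells plus small clusters at 10 and 20 days.

```r
set.seed(1)

# Hypothetical spell lengths: mostly short spells, plus clusters at 10 and 20 days.
spell_days <- c(rpois(120, 2) + 1, rep(10, 8), rep(20, 4))

# Poisson model with Lambda set to the sample mean.
lambda <- mean(spell_days)
k <- 0:max(spell_days)
pois <- dpois(k, lambda)

# Kernel density estimate, with the evaluation grid starting at zero days.
kde <- density(spell_days, from = 0)

# Only draw when run interactively.
if (interactive()) {
  plot(kde, main = "Spell length: KDE vs Poisson", xlab = "Days per spell")
  lines(k, pois, type = "b", col = "red")
}
```

Note that the kernel density estimate smooths a discrete variable, so the heights of the two curves are only roughly comparable; the positions of the peaks are what matter here.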

Ok, the data isn’t anything like a Poisson distribution so let’s forget about that. Force-fitting an incorrect distribution to a bunch of data is a classic rookie error.

But look at the density function for our data. There is the main peak, which kind of follows the expected curve, with a long tail. But there are a couple of other, smaller peaks. This is a classic indicator of data being influenced by more than one mean, i.e. by more than one underlying process.

In this case the minor peaks are around 10 and 20 days. So, it looks like people tend to take sick leave for a whole number of weeks. This looks suspicious but does not prove anything unusual – this could just be due to doctor’s notes recommending taking ‘2 weeks off’.

Something which is a bit more likely to be folk bunking off is the number of 5 day absences taken from a Monday to a Friday. Using the given dates, I was able to infer which day of the week each day of absence fell on. There were 10 instances of a worker taking 5 days off. Four of these cases ended on a Friday – implying the individual took the whole week off. This seems a little high but not overwhelmingly so.
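Inferring weekdays from dates and picking out whole-week absences might look something like the sketch below. The `spells` data frame and its dates are made up, and counting only Monday to Friday as working days is an assumption that would not hold for the weekend staff.

```r
# Hypothetical absence spells (made-up dates), inclusive start and end.
spells <- data.frame(
  start = as.Date(c("2015-01-05", "2015-02-11", "2015-03-02", "2015-06-18")),
  end   = as.Date(c("2015-01-09", "2015-02-17", "2015-03-03", "2015-06-24"))
)

# Locale-independent day of week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday <- function(d) as.POSIXlt(d)$wday

# Number of Monday-to-Friday days covered by one spell (assumes a weekday-only worker).
working_days <- function(start, end) {
  d <- seq(start, end, by = "day")
  sum(wday(d) %in% 1:5)
}

spells$n_days <- vapply(
  seq_len(nrow(spells)),
  function(i) working_days(spells$start[i], spells$end[i]),
  integer(1)
)

# Five-day spells, and those that ran exactly Monday to Friday.
five_day  <- spells$n_days == 5
full_week <- five_day & wday(spells$start) == 1 & wday(spells$end) == 5

c(five_day = sum(five_day), full_week = sum(full_week))
```

Using `as.POSIXlt()` rather than `weekdays()` avoids depending on the day names of the current locale.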


Confidence intervals on certain days

To find anything particularly telling, I decided to look at specific days of the week and compare the expected number of absences with the actual total for that day throughout the year.

But, rather than looking at how many people were absent on those days, I decided to look at the number of sick leave instances that started or ended on those days.

Monday is a good start. I bet loads of people wake up on Monday morning and decide to skip the day. This is why I decided to look at when instances of absence started or ended – this would also catch people who decided to take the week off, or maybe just a couple of days.

Since a proportion of the workforce can work any of the 7 days of the week, the expected proportion of absence intervals beginning on a given day of the week is not 1/5 (or 1/7 for that matter). The expected proportion of absences beginning (or ending) on a given weekday is actually 18.7% – or about 30 absences for this dataset (full derivation for this figure in the R script). Likewise, we expect 8.6 instances of absence to start on each weekend day.

In other words, we expect 30 absences to begin on Monday, 30 on Tuesday and so on, with 9 on Saturday and 9 on Sunday.
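The actual derivation lives in the R script, which isn't reproduced here. As a rough idea of how such shares could be obtained, the sketch below assumes that an absence is equally likely to start on any scheduled person-day, so each day's share is proportional to the number of staff scheduled to work it. The headcounts in `scheduled` are invented, chosen only so that the output lands near the rounded figures quoted above; they are not the organisation's real numbers.

```r
# Hypothetical headcounts (NOT from the post): staff scheduled to work each day.
scheduled <- c(Mon = 100, Tue = 100, Wed = 100, Thu = 100, Fri = 100,
               Sat = 29, Sun = 29)

# Assumption: each scheduled person-day is equally likely to start an absence,
# so a day's share of absence starts is proportional to its headcount.
share <- scheduled / sum(scheduled)

# Expected number of absence spells starting on each day, out of 166 spells.
n_spells <- 166
expected <- n_spells * share

round(share, 3)
round(expected, 1)
```

Under this assumption the seven shares always sum to 1, which is a handy sanity check on whatever per-day probabilities come out of the real derivation.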

So let’s look at absences starting on Monday. There were 23. That’s fewer than we’d expect if they were distributed randomly. Maybe my friend is being needlessly suspicious and there is no sick leave problem at all.

Let’s try another day

How about absences ending on a Friday? This would capture anyone trying to get a 3 day weekend (‘ending’ is inclusive of the day itself in this case), along with anyone deciding to take the whole week off.

There were 37 absences ending on Fridays. This is a bit more interesting. Since we know the probability of an absence period ending on a given day of the week (calculated above as 18.7%), we can determine whether this is statistically significant. We calculate ‘p’ for a binomial distribution with size=166 (the total number of absence instances),  probability = 0.187 and x=37.

And we get 6.2%. In other words, if absences ended on random days, a Friday count this high would turn up by chance about 6.2% of the time. That just misses the conventional 5% significance threshold, but in my opinion it is still too unlikely to conclude fair play.
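For anyone wanting to try this themselves, an upper-tail binomial probability can be computed with `pbinom()`. The sketch below plugs in the figures quoted above; the post doesn't say which tail convention was used ("37 or more" versus "more than 37"), so both are shown and neither is guaranteed to reproduce the 6.2% exactly.

```r
n <- 166    # total number of absence instances
p <- 0.187  # probability of an absence ending on a given weekday
x <- 37     # observed absences ending on a Friday

# Expected count under the null hypothesis.
expected <- n * p

# P(X >= 37): chance of seeing 37 or more Friday endings.
p_at_least <- pbinom(x - 1, size = n, prob = p, lower.tail = FALSE)

# P(X > 37): chance of seeing strictly more than 37.
p_more_than <- pbinom(x, size = n, prob = p, lower.tail = FALSE)

# binom.test() with alternative = "greater" reports the same P(X >= 37).
p_test <- binom.test(x, n, p, alternative = "greater")$p.value

c(expected = expected, p_at_least = p_at_least,
  p_more_than = p_more_than, p_test = p_test)
```

The off-by-one in `pbinom(x - 1, ...)` matters: with `lower.tail = FALSE`, `pbinom(q, ...)` returns P(X > q), not P(X >= q).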

Let’s try one more

Absences ending on a Sunday

Since there are a reasonable number of employees at the company who work during weekends, I hypothesised that the most ‘vulnerable’ day was Sunday. For a random distribution of sick days, we would expect only a small number to be taken on Sunday since only a minority of workers normally work on Sundays. In fact, we’d only expect 8.6 instances of absence to end on a Sunday.

The observed figure was 22. Woah, that’s more like it. Let’s put that into the Binomial test:

So, size=166, prob=0.052, x=22.

We get a probability of 0.02% that this occurred by chance. That’s more like it!
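A quick way to sanity-check a tail probability this small is to simulate it. The sketch below draws many hypothetical years under the null hypothesis, using the size and probability quoted above, and counts how often 22 or more absences end on a Sunday. The number of simulations and the seed are arbitrary choices, and the result need not match the quoted figure to the last digit.

```r
set.seed(42)

n <- 166    # total number of absence instances
p <- 0.052  # probability of an absence ending on a Sunday
x <- 22     # observed absences ending on a Sunday

# Simulate 100,000 years in which each absence independently ends on a Sunday
# with probability p, and record the Sunday count for each year.
sims <- rbinom(1e5, size = n, prob = p)

# Fraction of simulated years with at least as many Sunday endings as observed.
p_sim <- mean(sims >= x)

# Exact upper-tail probability P(X >= 22), for comparison.
p_exact <- pbinom(x - 1, size = n, prob = p, lower.tail = FALSE)

c(mean_count = mean(sims), p_sim = p_sim, p_exact = p_exact)
```

With a probability this small only a handful of the simulated years reach 22, so the simulated estimate is noisy; the exact value from `pbinom()` is the one to trust.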



Well, I’d say there is something up. After just three hypothesis tests (and no, I didn’t run any others and cherry-pick the results), we found an unusually high absence figure for Sunday, a day that would seem particularly liable to workers bunking off.
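Since three tests were run, one simple guard against cherry-picking is a Bonferroni correction, which multiplies each p-value by the number of tests. This is an extra check sketched here, not something from the original analysis. It uses the two probabilities reported above; no p-value was given for the Monday test, so it only contributes to the count of tests.

```r
# p-values reported in the post for the Friday and Sunday tests.
p_reported <- c(friday = 0.062, sunday = 0.0002)

# Bonferroni adjustment for three tests in total (Monday, Friday, Sunday).
p_adjusted <- p.adjust(p_reported, method = "bonferroni", n = 3)

p_adjusted   # friday 0.186, sunday 0.0006
```

The Sunday result stays tiny even after the adjustment, while the Friday one moves further from significance.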

On top of that, there are apparent patterns in the calendar, with workers likely to take days off immediately around bank holidays. This is only a qualitative observation and would require further hypothesis testing to confirm.

There is also an unusually high number of absences which are exactly one, two or four weeks in length.

In a perfect world, we’d expect absence data to vaguely fit a Poisson distribution. It does not in this case, but I don’t think we can really draw any conclusions from that – this isn’t really a quantitative observation.

I may perform further analysis on this data at a later date but I think that’s enough for now.

Moral of the story: if you want to skip work, do it on a Monday, and don’t tell any statisticians about it.

R code available here:

