AFL Analytics

Author: Johann de Boer

Australian Football League (AFL)

Year-on-year comparison of club success, membership size, popularity, game attendance and Australian population wealth.

Identified data sources:

  • Memberships by club and year: afl.com.au, news.com.au, aflonline.com.au, footyindustry.com and wikipedia.org
  • Crowd attendance at AFL games: afltables.com
  • Number of Twitter followers by club: Twitter API
  • Australian yearly Gross National Income (GNI) per capita: The World Bank API
  • Historical AFL game results: Excel file downloaded from aussportsbetting.com

Introduction

The Australian Football League (AFL) is the professional competition for the sport of Australian rules football in Australia. There are currently 18 teams spread throughout the country. Each season, every team plays approximately 22 games. The top-performing eight teams then contest a finals series, with the last two remaining teams meeting in a grand final to determine the winner for the year. Some key points relevant to this project are:

  • An admission fee is charged for people who wish to attend an AFL game.
  • Each football club sells memberships to its supporters every year, which entitle members to free access to a number of games, depending on their subscription type.
  • Each club has a presence on Twitter to connect with its members and broader fans.

Questions for data analysis:

  • Does on field success have an influence on club memberships?
  • How does club membership size affect crowd attendance?
  • Do attendance at games and club membership size increase with population wealth?
  • Is a club's membership size reflected in the number of followers it has on Twitter?

Install the necessary libraries

In [1]:
!pip install tweepy
!pip install pandas_datareader
!pip install xlrd
!pip install tables

Load the necessary libraries and import required functions

In [2]:
from IPython.display import display
from scipy import stats
from itertools import groupby
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pylab import rcParams # controls for plot rendering
import sqlite3
from datetime import datetime
import tweepy
import time
import os.path
import json
from pandas_datareader import data, wb
import requests
from bs4 import BeautifulSoup
import xlrd

Set report presentation style options

In [3]:
mpl.style.use('ggplot')
pd.set_option('display.html.border', 1)
pd.set_option('max_columns', 12)
pd.set_option('max_rows', 100)
pd.set_option('display.width', 100)
pd.set_option('multi_sparse', False)
pd.set_option('expand_frame_repr', True)
pd.set_option('max_colwidth', 60)
pd.set_option('precision', 4)
%matplotlib inline

Source the data required

The World Bank API

We obtain the Gross National Income (GNI) of Australia per capita (Atlas method in terms of current US$) using The World Bank data API. For details about this data, please refer to this link: http://data.worldbank.org/indicator/NY.GNP.PCAP.CD

Our hypothesis is that crowd attendance and club membership are correlated with the shared wealth of the country: football supporters subscribe to clubs and purchase tickets when they have more disposable income. GNI per capita has been chosen as an indicator of the general wealth of the population. Later in this analysis we will compare this measure against crowd attendance and club membership figures.

In [4]:
# Executes World Bank API query, removes the country information,
# renames the metric column and removes entries with no data,
# i.e. outside of the reportable time-frame.
dnp_pcap_au = wb.download(
    indicator = 'NY.GNP.PCAP.CD',
    country = ['AU'],
    start = 1960,
    end = 2018
).reset_index("country").drop("country", axis = 1).rename(
    columns = {"NY.GNP.PCAP.CD" : "GNI_per_capita"}
).dropna()
In [5]:
dnp_pcap_au = dnp_pcap_au.append(pd.DataFrame(data = {'GNI_per_capita' : 46560}, index = ['2017']))
dnp_pcap_au = dnp_pcap_au.sort_index(ascending = False)
dnp_pcap_au
Out[5]:
GNI_per_capita
2017 46560.0
2016 54420.0
2015 60330.0
2014 64860.0
2013 65640.0
2012 59840.0
2011 50150.0
2010 46560.0
2009 44040.0
2008 42390.0
2007 37340.0
2006 34170.0
2005 30340.0
2004 25520.0
2003 21130.0
2002 19980.0
2001 20130.0
2000 21130.0
1999 21460.0
1998 21790.0
1997 21980.0
1996 20520.0
1995 19320.0
1994 18890.0
1993 18930.0
1992 18560.0
1991 18250.0
1990 17350.0
1989 15610.0
1988 14140.0
1987 12570.0
1986 12330.0
1985 12030.0
1984 12050.0
1983 11600.0
1982 12090.0
1981 11880.0
1980 10840.0
1979 9610.0
1978 8480.0
1977 8090.0
1976 7880.0
1975 7230.0
1974 5960.0
1973 4660.0
1972 3890.0
1971 3480.0
1970 3280.0
1969 2980.0
1968 2730.0
1967 2540.0
1966 2350.0
1965 2270.0
1964 2110.0
1963 1970.0
1962 1880.0

A full set of data was obtained from the World Bank API, going back as far as 1962 and up to 2016 (inclusive); a figure for 2017 was appended manually above. The plot below shows the trend in GNI year-on-year.

In [7]:
# Plot GNI per capita in order of year.
dnp_pcap_au.sort_index().plot()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa46d3de390>

We can see from the above plot that the average wealth of Australians, in terms of current US dollars, has generally increased on an ongoing basis over the entire time-frame, with only a few small downturns. The rate of growth appears roughly exponential year-on-year.
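As a rough check on that claim, growth multiples can be computed directly from the series; a minimal sketch using a handful of decade-apart values taken from the table above:

```python
import pandas as pd

# Decade-apart GNI per capita values (current US$), taken from the table above.
gni = pd.Series({
    1962: 1880.0, 1972: 3890.0, 1982: 12090.0,
    1992: 18560.0, 2002: 19980.0, 2012: 59840.0,
}).sort_index()

# Growth over each decade as a multiple of the starting value. Roughly
# constant multiples would indicate exponential growth.
decade_growth = (gni / gni.shift(1)).dropna()
print(decade_growth)
```

The multiples are not constant (growth stalled through the 1990s), so "exponential growth with interruptions" is the more defensible description.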

Twitter API

Each of the AFL clubs has a public presence on Twitter. It is hypothesised that the number of followers each club has on Twitter is an indicator of the club's current popularity. We will now obtain the list of AFL clubs and their follower counts on Twitter. Later in this analysis, the number of Twitter followers will be compared to total membership and crowd attendance for each club to determine whether these figures are correlated.

In [ ]:
# PLEASE DO NOT SHARE THESE CREDENTIALS AS THEY ARE PRIVATE AND CONFIDENTIAL
consumer_key="XXYY"
consumer_secret="XXYY"
access_token="XXYY"
access_token_secret="XXYY"
# ==========================================================================
In [8]:
# Twiter public API documentation: https://dev.twitter.com/rest/public
# Tweepy documentation: http://tweepy.readthedocs.io/en/v3.5.0/index.html

# Set the credentials for accessing the Twitter API.

# Perform the authentication handshake with the Twitter API using the above credentials.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)

# Create an object to use as a connection to the authenticated API
api = tweepy.API(auth)

# Rate limit handler for the Twitter API.
def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            print "Rate Limited. Waiting 15 minutes..."
            time.sleep(15 * 60)
            print "OK, wait is over."
            continue
        except tweepy.TweepError as e:
            print "Tweep Error: {0}".format(str(e))
            break

afl_clubs = api.list_members("AFLPlayers", "clubs")

In the above code, the members of the 'clubs' Twitter list curated by the AFL Players' Association's Twitter account are retrieved. Each member of that list is a club account, and the Twitter API response includes the details of each account, including its number of followers.

To ensure consistency across each of the data sources used in this analysis, club names will be adjusted to be consistent.

In [9]:
afl_clubs_df = pd.DataFrame()

# Define a function that will be used to add clubs to the above empty data frame of AFL club
# Twitter accounts along with a selection of associated details (including number of followers).
def add_club_to_df(df, club):
    df = df.append({
        "id" : club.id_str,
        "name" : club.name,
        "followers" : club.followers_count,
        "location" : club.location,
        "tz" : club.time_zone,
        "handle" : club.screen_name,
        "image" : club.profile_image_url
    }, ignore_index = True)
    return df

# The above function is then applied to each AFL account from the list returned from the
# Twitter API.
for afl_club in afl_clubs:
    afl_clubs_df = add_club_to_df(afl_clubs_df, afl_club)

# The resulting data frame is then tidied up
afl_clubs_df = afl_clubs_df.set_index("id")
afl_clubs_df.followers = afl_clubs_df.followers.astype("int")
afl_clubs_df = afl_clubs_df[["name", "location", "tz", "followers", "handle", "image"]]
afl_clubs_df = afl_clubs_df.rename(columns = {"name" : "club"})
afl_clubs_df.club = afl_clubs_df.club.str.replace("GWS GIANTS", "GWS")
afl_clubs_df.club = afl_clubs_df.club.str.replace("GC SUNS", "Gold Coast")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" FC", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Dockers", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Lions", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Cats", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Eagles", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Crows", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Swans", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(r'\W+', "")
afl_clubs_df = afl_clubs_df.set_index('club')
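The same name-cleaning rules recur for each data source below; a helper consolidating them could look like the following (a sketch, not part of the original notebook; note that, like the replace chain above, it strips all non-word characters, including spaces):

```python
import re

# Sketch of a reusable cleaner mirroring the replace chain above.
RENAMES = {"GWS GIANTS": "GWS", "GC SUNS": "Gold Coast"}
SUFFIXES = [" FC", " Dockers", " Lions", " Cats", " Eagles", " Crows", " Swans"]

def clean_club_name(name):
    # Apply the special-case renames first, then strip nickname suffixes.
    for old, new in RENAMES.items():
        name = name.replace(old, new)
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    # Remove any remaining non-word characters, as the original code does.
    return re.sub(r"\W+", "", name)
```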

The following graph shows the number of Twitter followers for each club:

In [10]:
# Plot number of followers for each club in order of lowest to highest.
afl_clubs_df['followers'].sort_values().plot.barh()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa46d282750>

Web scraping

Number of members per club from 2006 to 2017

Club membership sizes for each club from 2006 to 2017 are obtained from various websites using web scraping.

Memberships for 2006 through to 2009 are obtained from footyindustry.com:

In [11]:
# Fetch the page with the club membership details on it
memberships_2006_2009_url = "http://www.footyindustry.com/wiki/index.php?title=Club_Memberships"
memberships_2006_2009_response = requests.get(memberships_2006_2009_url)
In [12]:
# Parse the page and extract the tables from it. Each table represents a single year.
memberships_2006_2009 = BeautifulSoup(memberships_2006_2009_response.content, "lxml")
memberships_2006_2009_tables = memberships_2006_2009.findAll("table")

# Convert the tables into data frames, selecting only the tables we require.
memberships_2006_2009_dfs = pd.read_html(
    str(memberships_2006_2009_tables[3:7]),
    header = 1,
    index_col = 0
)

# Remove rows with missing values and the final totals row from each table, keeping only
# the number of members with 'game access' type memberships.
memberships_2006_2009_dfs = [df.dropna().iloc[0:-1, 3] for df in memberships_2006_2009_dfs]

# Combine all the tables into a single data frame with a column indicating the year
memberships_2006_2009_dfs = pd.concat(
    {year: df for year, df in zip(['2006', '2007', '2008', '2009'], memberships_2006_2009_dfs)},
    names = ['year', 'club']
)

# Convert the series into data frames with clubs as rows and years as columns
memberships_2006_2009_df = memberships_2006_2009_dfs.reset_index(name = "members").pivot_table(
    index = "club", columns = "year", values = "members", aggfunc = sum
)

Memberships for 2010 and 2011 are obtained from Wikipedia:

In [13]:
# Download the webpage with the membership figures on it
memberships_2011_url = "https://en.wikipedia.org/wiki/2011_AFL_season"
memberships_2011_response = requests.get(memberships_2011_url)
In [14]:
# Parse the webpage, extracting the table found in the club membership section
memberships_2011 = BeautifulSoup(memberships_2011_response.content, "lxml")
memberships_2011 = memberships_2011.find(attrs={"id" : "Club_membership"}).find_next("table")

# Convert the table into a data frame, naming the columns and discarding the last
# column that we do not require.
memberships_2011_df = pd.read_html(str(memberships_2011), header = 0)[0].iloc[:, 0:-1]
memberships_2011_df.columns = ["club", "2011", "change_from_2010"]
# Clean up the data in the table to calculate the 2010 figures
# If the 'change' column does not start with a digit, then it must be a negative change.
# Extract the change from 2010 to 2011 to calculate the original 2010 figures.
memberships_2011_df.change_from_2010 = memberships_2011_df.change_from_2010.str.replace(
    r"^[^0-9]+", "-"
).str.extract(
    r"(.+) !", expand = False
).astype(np.number)
memberships_2011_df['2010'] = memberships_2011_df['2011'] - memberships_2011_df.change_from_2010
# Arrange the dataframe with club as the rows and keeping just the 2010 and 2011 columns.
memberships_2011_df = memberships_2011_df.set_index('club')
memberships_2011_df = memberships_2011_df[["2010", "2011"]]
# Tidy up the club names for consistency and remove the final 'total' row from the data frame.
memberships_2011_df.index = memberships_2011_df.index.str.replace(" Lions", "")
memberships_2011_df.index = memberships_2011_df.index.str.replace(" Eagles", "")
memberships_2011_df = memberships_2011_df.drop("Total")

Memberships for 2012 are obtained from a sports news article on the news.com.au website:

In [15]:
# Download the webpage with the club membership stats on it
memberships_2012_url = "http://www.news.com.au/sport/afl/collingwood-and-hawthorn-record-membership-increases/story-fnelctok-1226466581212"
memberships_2012_response = requests.get(memberships_2012_url)
In [16]:
# Parse the webpage, defining a regular expression to extract the club name and membership
# size from the text contained in bulleted paragraphs.
memberships_2012_memberships = BeautifulSoup(memberships_2012_response.content, "lxml")
memberships_2012_regex = re.compile("([a-zA-Z ]+): ([0-9]{2},[0-9]{3})")
memberships_2012_tags = memberships_2012_memberships.find_all(string = memberships_2012_regex)
# Extract the club name and membership size separately (using the regular expression already defined)
# for each paragraph as a list of tuples, and coerce these into a data frame, naming the columns
# 'club' and 'members'.
memberships_2012_df = pd.DataFrame(
    [re.match(memberships_2012_regex, club).groups() for club in memberships_2012_tags],
    columns = ["club", "members"]
)
# Set the club to index the rows, format the club sizes into numbers, set the column to the year
# and tidy the club names for consistency.
memberships_2012_df = memberships_2012_df.set_index("club")
memberships_2012_df.members = memberships_2012_df.members.str.replace(",", "").astype("int")
memberships_2012_df = memberships_2012_df.rename(columns = {"members": "2012"})
memberships_2012_df.index = memberships_2012_df.index.str.replace(" Giants", "")

Memberships for the years 2013 and 2014 are obtained from an article from the official AFL website:

In [17]:
# Download the webpage
memberships_2013_2014_url = "http://www.afl.com.au/news/2014-08-02/club-memberships-rise"
memberships_2013_2014_response = requests.get(memberships_2013_2014_url)
In [18]:
# Parse the webpage by extracting the table and converting it into a data frame with appropriately
# named columns, setting the club name as the row index and ordering the years from left to right.
memberships_2013_2014 = BeautifulSoup(memberships_2013_2014_response.content, "lxml")
memberships_2013_2014_table = memberships_2013_2014.find("table")
memberships_2013_2014_df = pd.read_html(str(memberships_2013_2014_table), header = 0)[0]
memberships_2013_2014_df.columns = ["club", "2013", "2014"]
memberships_2013_2014_df = memberships_2013_2014_df.set_index("club")
memberships_2013_2014_df[["2013", "2014"]] = memberships_2013_2014_df[["2013", "2014"]].apply(
    # Format the club sizes into numbers.
    lambda s: s.astype(str).str.replace(r"[*,]", "").astype(int),
    axis = 1
)
# Tidy up the club names to be consistent
memberships_2013_2014_df.index = memberships_2013_2014_df.index.str.replace(" Lions", "")
memberships_2013_2014_df.index = memberships_2013_2014_df.index.str.replace(" Swans", "")

Finally, memberships for the years 2015, 2016 and 2017 are obtained from articles on aflonline.com.au:

In [19]:
def get_memberships_article(year = "2016"):
  # Download the webpage
  memberships_url = "http://www.aflonline.com.au/afl-news/current-afl-membership-numbers-for-" + str(year) + "/"
  memberships_response = requests.get(memberships_url)
  # Parse the webpage by extracting the table, converting into a data frame and naming the columns.
  memberships = BeautifulSoup(memberships_response.content, "lxml")
  memberships_table = memberships.find("table")
  memberships_df = pd.read_html(str(memberships_table), header = 0)[0]
  memberships_df = memberships_df.iloc[:, 0:3]
  memberships_df.columns = ["club", "year2", "year1"]
  # Set the club name as the row index.
  memberships_df = memberships_df.set_index("club")
  # Tidy the club names for consistency.
  memberships_df.index = memberships_df.index.str.replace(" Lions", "")
  memberships_df.index = memberships_df.index.str.replace(" Giants", "")
  memberships_df.index = memberships_df.index.str.replace(r"Melb$", "Melbourne")
  # Order the years from left to right.
  memberships_df = memberships_df[['year1', 'year2']]
  memberships_df = memberships_df.iloc[memberships_df.index != "TOTAL", :]
  return memberships_df
In [20]:
memberships_2015_2016_df = get_memberships_article("2016")
memberships_2016_2017_df = get_memberships_article("2017")
memberships_2015_2016_df.columns = ["2015", "2016"]
memberships_2016_2017_df.columns = ["2016", "2017"]
memberships_2015_onwards = memberships_2015_2016_df[["2015"]]
memberships_2015_onwards = memberships_2015_onwards.join(memberships_2016_2017_df, how = "outer")
memberships_2015_onwards
Out[20]:
2015 2016 2017
club
Adelaide 52920 54307 56865
Brisbane 25408 23286 21362
Carlton 47305 50130 50326
Collingwood 75037 74643 75879
Essendon 60818 57494 67768
Fremantle 51433 51889 51254
GWS 13480 15312 20944
Geelong 44312 50571 54854
Gold Coast 13643 12854 11665
Hawthorn 72924 75351 75663
Melbourne 35953 39146 42233
North Melbourne 41012 45014 40343
Port Adelaide 54057 53743 52129
Richmond 70809 72278 72669
St Kilda 32746 38009 42052
Sydney 48836 56523 58838
West Coast 60221 65188 65064
Western Bulldogs 35222 39459 47653

The memberships from each of the above sources are then concatenated together to form a single table.

In [21]:
# Concatenate the scraped membership stats into a single data frame with a multi-index 
# for clubs and year forming the rows.
club_memberships_df = memberships_2011_df.join([
        memberships_2006_2009_df,
        memberships_2012_df,
        memberships_2013_2014_df,
        memberships_2015_onwards
], how = "outer")
club_memberships_df.index.name = "club"
club_memberships_df = club_memberships_df.stack()
club_memberships_df.index.names = ['club', 'year']
club_memberships_df = club_memberships_df.reset_index(name = "members")
club_memberships_df = club_memberships_df.set_index(['year', 'club'])

The heatmap below compares the membership size of each club year-on-year. Note that Gold Coast and GWS started as AFL clubs in 2011 and 2012, respectively, and therefore had no prior membership figures.

In [22]:
# Transform the memberships data frame into wide format.
memberships_club_year_pivot = club_memberships_df.reset_index().pivot_table(
    values = "members", index = "club", columns = "year", aggfunc = sum, dropna = False
)
# Produce a heatmap, with the clubs sorted in order of total membership renewals
sns.heatmap(
    memberships_club_year_pivot.loc[
        memberships_club_year_pivot.sum(axis = 1).sort_values(ascending = False).index
    ],
    robust = True
)
plt.yticks(rotation = 0) 
plt.show()

We can see that Collingwood memberships increased significantly in 2011. As we will see later in this report, this corresponds with an increase in game wins. The same can be said for Hawthorn from 2014 onwards. However, there is little to no change for many teams regardless of the number of wins. For example, the Western Bulldogs have seen minimal change in membership even with large variations in game wins (e.g. 10 wins in 2010, 5 in 2012 and 14 in 2015).

Interestingly, club memberships were generally low across all of the clubs in 2009 but increased again in 2010. This coincides with the Global Financial Crisis (GFC), although population wealth did not fall at the time, as we saw earlier in the plot of GNI per capita year-on-year. We therefore cannot, at this stage, say whether the GFC caused the coinciding dip in memberships.
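One way to test this later in the analysis is to join the yearly GNI figures with total memberships across all clubs and compute a correlation coefficient; a minimal sketch, where the GNI values come from the World Bank table above but the membership totals are placeholders (not the scraped data):

```python
import pandas as pd
from scipy import stats

# GNI per capita values for 2009-2012 from the World Bank table above;
# the membership totals are placeholder figures for illustration only.
# The real analysis would sum club_memberships_df by year instead.
gni = pd.Series({"2009": 44040.0, "2010": 46560.0, "2011": 50150.0, "2012": 59840.0})
members = pd.Series({"2009": 580000, "2010": 620000, "2011": 650000, "2012": 660000})

# Align the two series on year and compute Pearson's r.
aligned = pd.concat({"gni": gni, "members": members}, axis=1).dropna()
r, p = stats.pearsonr(aligned["gni"], aligned["members"])
```

With only a handful of yearly observations the p-value carries little weight, so any such correlation should be treated as suggestive rather than conclusive.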

Now we obtain the attendance numbers from afltables.com:

In [23]:
# The following function can be used to scrape a series of pages from the same website, where the
# difference in extracting the crowd attendance data is the year given in the URL.
def get_attendance(year = "2016"):
  # Download the required page
  url = "http://afltables.com/afl/crowds/" + str(year) + ".html"
  response = requests.get(url)
  # Parse the page by extracting the table and converting into a data frame, keeping only the required
  # columns.
  attendances = BeautifulSoup(response.content, "lxml")
  attendance_by_team = attendances.find("table")
  df = pd.read_html(str(attendance_by_team), header = 0)[0].iloc[:-1, 0:7]
  # Name the columns
  df.columns = [
      "club",
      "home_attendance", "home_games", "home_avg",
      "away_attendance", "away_games", "away_avg"
  ]
  # Use club name as the index for the rows.
  df = df.set_index("club")
  # Remove the average columns, as we are not interested in them:
  df = df.drop(["home_avg", "away_avg"], axis = 1)
  df = df.drop(index="Team")
  # Coerce the numbers into integers.
  df.home_attendance = df.home_attendance.astype("int")
  df.home_games = df.home_games.astype("int")
  df.away_attendance = df.away_attendance.astype("int")
  df.away_games = df.away_games.astype("int")
  # Rename clubs to be consistent with other datasets.
  df.index = df.index.str.replace(" Lions", "")
  df.index = df.index.str.replace("Greater Western Sydney", "GWS")
  df.index = df.index.str.replace("Kangaroos", "North Melbourne")
  return df
In [24]:
# Apply the above function to scrape the website for attendance data for the years 2006 to 2017 (inclusive)
attendance_by_year = {str(year): get_attendance(year) for year in range(2006, 2018)}
attendance_by_year = pd.concat(attendance_by_year, names = ["year"])

# Calculate new fields for total attendees, total games and average attendance, keeping only
# the calculated columns in the data frame already indexed by year and club name.
attendance_by_year['attendees'] = attendance_by_year['home_attendance'] + attendance_by_year['away_attendance']
attendance_by_year['games'] = attendance_by_year['home_games'] + attendance_by_year['away_games']
attendance_by_year = attendance_by_year[['attendees', 'games']]
attendance_by_year['avg attendance'] = attendance_by_year['attendees'] / attendance_by_year['games']

The following heatmap compares the average attendance per game by playing team, year-on-year:

In [25]:
# Transform the attendance by club and year into wide format
attendance_club_year_pivot = attendance_by_year.reset_index().pivot_table(
    values = "avg attendance", index = "club", columns = "year", dropna = False
)
# Plot a heatmap of attendance by year and club. Sort the clubs by highest total spectators to least
sns.heatmap(
    attendance_club_year_pivot.loc[
        attendance_club_year_pivot.sum(axis = 1).sort_values(ascending = False).index
    ],
    robust = True
)
plt.yticks(rotation = 0) 
plt.show()

There is a tendency for Victorian-based clubs to record higher attendance figures than non-Victorian clubs. This may be due to cultural reasons, such as the sport having a stronger following in Victoria than in other states.
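That tendency could be checked by grouping average attendance by each club's home state; a sketch using a hypothetical partial club-to-state mapping and illustrative attendance figures (both are assumptions for illustration, not the scraped data):

```python
import pandas as pd

# Hypothetical club-to-state mapping (illustrative subset, not scraped).
club_state = {
    "Collingwood": "VIC", "Richmond": "VIC", "Essendon": "VIC",
    "West Coast": "WA", "Fremantle": "WA",
    "Adelaide": "SA", "Sydney": "NSW",
}
# Illustrative average attendance per game (placeholder figures).
avg_attendance = pd.Series({
    "Collingwood": 52000, "Richmond": 48000, "Essendon": 43000,
    "West Coast": 41000, "Fremantle": 36000,
    "Adelaide": 44000, "Sydney": 33000,
})
# Group clubs by state (aligning on club name) and average within each state.
by_state = avg_attendance.groupby(pd.Series(club_state)).mean()
by_state = by_state.sort_values(ascending=False)
```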

We now read the Excel file containing the match results for each game:

In [26]:
# Open the spreadsheet file, selecting only the columns we are interested in.
odds_df = pd.read_excel(
    "afl.xlsx",
    skiprows = 1,
    usecols = "A:AY"
)
# Tidy up the team names to be consistent
odds_df[['Home Team', 'Away Team']] = odds_df[['Home Team', 'Away Team']].apply(
    lambda club: club.str.replace(" Giants", ""),
    axis = 1
)
# Keep the columns we are interested in and rename the column indicating whether the game is a final.
odds_df = odds_df[['Date', 'Home Team', 'Away Team', 'Home Score', 'Away Score', 'Play Off Game?']]
odds_df = odds_df.rename(columns = {"Play Off Game?": "Final"})
# Encode the Final column into Y and N.
odds_df.Final = odds_df.Final.fillna(value = "N")
odds_df = odds_df[odds_df.Final == "N"]

# Identify games where the home team won, the away team won and where there were draws.
odds_df['Home Win'] = odds_df['Home Score'] > odds_df['Away Score']
odds_df['Away Win'] = odds_df['Home Score'] < odds_df['Away Score']
odds_df['Draw'] = odds_df['Home Score'] == odds_df['Away Score']
# Identify the names of the teams that won each game
odds_df['Home Win'] = odds_df['Home Team'].where(odds_df['Home Win'])
odds_df['Away Win'] = odds_df['Away Team'].where(odds_df['Away Win'])

# A function to return the name of the team that won a game (represented as a row)
def decide_winning_team(row):
    if row['Draw'] == True:
        return "Draw"
    elif pd.isnull(row['Home Win']):
        return row['Away Win']
    elif pd.isnull(row['Away Win']):
        return row['Home Win']

# Reduce the columns down to just one column using the above function that indicates which
# team was the winning team, otherwise a draw.
odds_df['Winning Team'] = odds_df[['Home Win', 'Away Win', 'Draw']].apply(
    decide_winning_team,
    axis = 1
)

# Remove the redundant 'Draw' column as we have this encoded in the above already.
odds_df = odds_df.drop("Draw", axis = 1)
# Determine the year (season) of the game.
odds_df['year'] = odds_df['Date'].dt.year.astype("str")
# Determine the number of games won, grouped by each team
winning_team_by_year = odds_df.groupby(['year', 'Winning Team']).size()
winning_team_by_year = winning_team_by_year.reset_index(name = "wins")
# Format the number of games won by team data frame into the same structure as our other
# datasets, i.e. rows indexed by year and club.
winning_team_by_year = winning_team_by_year.rename(columns = {"Winning Team": "club"})
winning_team_by_year = winning_team_by_year.set_index(['year', 'club'])
# Any (year, club) pair missing from the win counts corresponds to a club with zero wins
# that year, so fill those in with zeros.
winning_team_by_year = winning_team_by_year.reset_index().pivot_table(
     values = "wins", index = "club", columns = "year", fill_value = 0, dropna = False
).stack().reset_index(name = "wins").rename(
    columns = {'level_1' : 'year'}
).set_index(['year', 'club']).sort_index()
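As a sanity check on the fill-in step above, here is a minimal self-contained sketch (toy numbers, not the project's data) showing how `pivot_table` with `fill_value = 0` followed by `stack` makes zero-win seasons explicit:

```python
import pandas as pd

# Toy win counts: Hawthorn has no row for 2012, which should become an
# explicit zero after pivoting and stacking.
wins = pd.DataFrame({
    "year": ["2011", "2011", "2012"],
    "club": ["Geelong", "Hawthorn", "Geelong"],
    "wins": [19, 18, 15],
})
filled = wins.pivot_table(
    values="wins", index="club", columns="year", fill_value=0
).stack().reset_index(name="wins").set_index(["year", "club"]).sort_index()
print(filled)
```

After this step every club has a row for every season, with absent seasons recorded as zero wins.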

The following heatmap compares the number of games won by each team year-on-year:

In [27]:
# Transform from long to wide format.
winning_team_by_year_pivot = winning_team_by_year.reset_index().pivot_table(
    values = "wins", index = "club", columns = "year", dropna = False
)
# Plot heatmap, in order of club with most wins to least.
sns.heatmap(
    winning_team_by_year_pivot.loc[
        winning_team_by_year_pivot.sum(axis = 1).sort_values(ascending = False).index
    ],
    robust = True
)
plt.yticks(rotation = 0) 
plt.show()

The graph above ranks the clubs by total number of wins over the entire time frame and compares their wins year-on-year. It also shows when there were draws. The top two clubs, Geelong and Hawthorn, have enjoyed sustained success, each recording a high number of wins every season since 2011. GWS and Gold Coast have recorded the fewest wins, but this is because they are new clubs that only entered the AFL in 2012 and 2011 respectively. Note: figures for the most recent season shown may be incomplete if the data was collected before that season finished.

The table below shows the same information with the clubs ordered alphabetically. It provides the number of wins for each club, each year.

In [28]:
winning_team_by_year_pivot
Out[28]:
year 2009 2010 2011 2012 2013 2014 2015 2016 2017
club
Adelaide 8 9 7 17 10 11 13 16 15
Brisbane 7 7 4 10 10 7 4 3 5
Carlton 7 11 14 11 11 7 4 7 6
Collingwood 10 17 20 16 14 11 10 9 9
Draw 2 2 3 1 1 1 2 0 3
Essendon 5 7 11 11 14 12 6 3 12
Fremantle 3 13 9 14 16 16 17 4 8
GWS 0 0 0 2 1 6 11 16 14
Geelong 10 17 19 15 18 17 11 17 15
Gold Coast 0 0 3 3 8 10 4 6 6
Hawthorn 3 12 18 17 19 17 16 17 10
Melbourne 3 8 8 4 2 4 7 10 12
North Melbourne 3 11 10 14 10 14 13 12 6
Port Adelaide 3 10 3 5 12 14 12 10 14
Richmond 2 6 8 10 15 12 15 8 15
St Kilda 10 15 12 12 5 4 6 12 11
Sydney 3 13 12 16 15 17 16 17 14
West Coast 5 4 17 15 9 11 16 16 12
Western Bulldogs 8 14 9 5 8 7 14 15 11

We can now combine the club memberships, attendance and win counts into one data frame, as they are all consistently indexed by year and club. Note that we do not have game results prior to 2009.
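A miniature version of this join (toy frames with a made-up 2009 membership figure, not the project's data) illustrates why the wins column is NaN for seasons before 2009:

```python
import pandas as pd

# Memberships exist for 2008 and 2009, but wins only from 2009 onward,
# so a left join leaves NaN for the earlier season.
years_clubs = pd.MultiIndex.from_tuples(
    [("2008", "Geelong"), ("2009", "Geelong")], names=["year", "club"]
)
members = pd.DataFrame({"members": [30169, 31000]}, index=years_clubs)
wins = pd.DataFrame(
    {"wins": [10]},
    index=pd.MultiIndex.from_tuples([("2009", "Geelong")], names=["year", "club"]),
)
combined = members.join(wins, how="left")
print(combined)
```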

In [29]:
# Join attendance data with club memberships and wins
year_club_stats = club_memberships_df.join(
    attendance_by_year, how = "inner"
).join(
    winning_team_by_year, how = "left"
)
year_club_stats = year_club_stats.reset_index()
year_club_stats
Out[29]:
year club members attendees games avg attendance wins
0 2006 Adelaide 46472.0 796016 22 36182.5455 NaN
1 2006 Brisbane 24873.0 649461 22 29520.9545 NaN
2 2006 Carlton 42408.0 813535 22 36978.8636 NaN
3 2006 Collingwood 45972.0 1164982 22 52953.7273 NaN
4 2006 Essendon 40412.0 935522 22 42523.7273 NaN
5 2006 Fremantle 39206.0 692272 22 31466.9091 NaN
6 2006 Geelong 37160.0 712718 22 32396.2727 NaN
7 2006 Hawthorn 52496.0 691924 22 31451.0909 NaN
8 2006 Melbourne 31506.0 770733 22 35033.3182 NaN
9 2006 North Melbourne 28340.0 634686 22 28849.3636 NaN
10 2006 Port Adelaide 30605.0 586332 22 26651.4545 NaN
11 2006 Richmond 36981.0 855556 22 38888.9091 NaN
12 2006 St Kilda 31906.0 802678 22 36485.3636 NaN
13 2006 Sydney 26269.0 730377 22 33198.9545 NaN
14 2006 West Coast 43927.0 782853 22 35584.2273 NaN
15 2006 Western Bulldogs 28215.0 788467 22 35839.4091 NaN
16 2007 Adelaide 48720.0 788959 22 35861.7727 NaN
17 2007 Brisbane 22737.0 665330 22 30242.2727 NaN
18 2007 Carlton 39360.0 864753 22 39306.9545 NaN
19 2007 Collingwood 42498.0 1226604 22 55754.7273 NaN
20 2007 Essendon 41947.0 1088652 22 49484.1818 NaN
21 2007 Fremantle 43366.0 737569 22 33525.8636 NaN
22 2007 Geelong 36850.0 762050 22 34638.6364 NaN
23 2007 Hawthorn 41436.0 769629 22 34983.1364 NaN
24 2007 Melbourne 29619.0 707402 22 32154.6364 NaN
25 2007 North Melbourne 32600.0 638881 22 29040.0455 NaN
26 2007 Port Adelaide 34185.0 599859 22 27266.3182 NaN
27 2007 Richmond 30820.0 909203 22 41327.4091 NaN
28 2007 St Kilda 30063.0 862848 22 39220.3636 NaN
29 2007 Sydney 26721.0 772705 22 35122.9545 NaN
30 2007 West Coast 44863.0 834907 22 37950.3182 NaN
31 2007 Western Bulldogs 28306.0 721691 22 32804.1364 NaN
32 2008 Adelaide 50976.0 803035 22 36501.5909 NaN
33 2008 Brisbane 21976.0 624232 22 28374.1818 NaN
34 2008 Carlton 35431.0 1070500 22 48659.0909 NaN
35 2008 Collingwood 38587.0 1247391 22 56699.5909 NaN
36 2008 Essendon 32759.0 1044123 22 47460.1364 NaN
37 2008 Fremantle 43343.0 700023 22 31819.2273 NaN
38 2008 Geelong 30169.0 837196 22 38054.3636 NaN
39 2008 Hawthorn 31064.0 910679 22 41394.5000 NaN
40 2008 Melbourne 28077.0 663711 22 30168.6818 NaN
41 2008 North Melbourne 22366.0 680095 22 30913.4091 NaN
42 2008 Port Adelaide 34073.0 556905 22 25313.8636 NaN
43 2008 Richmond 30044.0 935002 22 42500.0909 NaN
44 2008 St Kilda 30394.0 791834 22 35992.4545 NaN
45 2008 Sydney 28764.0 701906 22 31904.8182 NaN
46 2008 West Coast 45949.0 757948 22 34452.1818 NaN
47 2008 Western Bulldogs 28725.0 697930 22 31724.0909 NaN
48 2009 Adelaide 50138.0 761080 22 34594.5455 8.0
49 2009 Brisbane 26459.0 642309 22 29195.8636 7.0
... ... ... ... ... ... ... ...
155 2015 Essendon 60818.0 894335 22 40651.5909 6.0
156 2015 Fremantle 51433.0 691707 22 31441.2273 17.0
157 2015 GWS 13480.0 402018 22 18273.5455 11.0
158 2015 Geelong 44312.0 720060 21 34288.5714 11.0
159 2015 Gold Coast 13643.0 394216 22 17918.9091 4.0
160 2015 Hawthorn 72924.0 899917 22 40905.3182 16.0
161 2015 Melbourne 35953.0 653832 22 29719.6364 7.0
162 2015 North Melbourne 41012.0 608974 22 27680.6364 13.0
163 2015 Port Adelaide 54057.0 737649 22 33529.5000 12.0
164 2015 Richmond 70809.0 1054354 22 47925.1818 15.0
165 2015 St Kilda 32746.0 590021 22 26819.1364 6.0
166 2015 Sydney 48836.0 701028 22 31864.9091 16.0
167 2015 West Coast 60221.0 705440 22 32065.4545 16.0
168 2015 Western Bulldogs 35222.0 595081 22 27049.1364 14.0
169 2016 Adelaide 54307.0 857443 22 38974.6818 16.0
170 2016 Brisbane 23286.0 501909 22 22814.0455 3.0
171 2016 Carlton 50130.0 784445 22 35656.5909 7.0
172 2016 Collingwood 74643.0 972219 22 44191.7727 9.0
173 2016 Essendon 57494.0 802059 22 36457.2273 3.0
174 2016 Fremantle 51889.0 574588 22 26117.6364 4.0
175 2016 GWS 15312.0 403207 22 18327.5909 16.0
176 2016 Geelong 50571.0 750982 22 34135.5455 17.0
177 2016 Gold Coast 12854.0 351648 22 15984.0000 6.0
178 2016 Hawthorn 75351.0 840496 22 38204.3636 17.0
179 2016 Melbourne 39146.0 686641 22 31210.9545 10.0
180 2016 North Melbourne 45014.0 671847 22 30538.5000 12.0
181 2016 Port Adelaide 53743.0 694948 22 31588.5455 10.0
182 2016 Richmond 72278.0 900237 22 40919.8636 8.0
183 2016 St Kilda 38009.0 686520 22 31205.4545 12.0
184 2016 Sydney 56523.0 696570 22 31662.2727 17.0
185 2016 West Coast 65188.0 744774 22 33853.3636 16.0
186 2016 Western Bulldogs 39459.0 702779 22 31944.5000 15.0
187 2017 Adelaide 56865.0 827935 22 37633.4091 15.0
188 2017 Brisbane 21362.0 481838 22 21901.7273 5.0
189 2017 Carlton 50326.0 873618 22 39709.9091 6.0
190 2017 Collingwood 75879.0 1003991 22 45635.9545 9.0
191 2017 Essendon 67768.0 1019757 22 46352.5909 12.0
192 2017 Fremantle 51254.0 690906 22 31404.8182 8.0
193 2017 GWS 20944.0 431975 22 19635.2273 14.0
194 2017 Geelong 54854.0 791228 22 35964.9091 15.0
195 2017 Gold Coast 11665.0 369482 22 16794.6364 6.0
196 2017 Hawthorn 75663.0 866536 22 39388.0000 10.0
197 2017 Melbourne 42233.0 833190 22 37872.2727 12.0
198 2017 North Melbourne 40343.0 554306 22 25195.7273 6.0
199 2017 Port Adelaide 52129.0 699377 22 31789.8636 14.0
200 2017 Richmond 72669.0 1024751 22 46579.5909 15.0
201 2017 St Kilda 42052.0 761222 22 34601.0000 11.0
202 2017 Sydney 58838.0 794909 22 36132.2273 14.0
203 2017 West Coast 65064.0 713672 22 32439.6364 12.0
204 2017 Western Bulldogs 47653.0 729431 22 33155.9545 11.0

205 rows × 7 columns

The above table shows the combined datasets.

In [30]:
# Calculate the totals for each year across the combined clubs
year_stats = year_club_stats.groupby("year").sum()
# As there are always two clubs in every game, each game is counted twice, so halve the total.
year_stats.games = year_stats.games / 2
# Similarly, each game's attendees are counted once per participating club.
year_stats.attendees = year_stats.attendees / 2
# Calculate average attendance at each game
year_stats['avg attendance'] = year_stats.attendees / year_stats.games
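The halving above is needed because each game appears twice in the long-format data, once per participating club. A self-contained sketch with one toy game (made-up figures) shows the correction:

```python
import pandas as pd

# One game between two clubs is recorded on two rows, so naive sums
# double-count both games and attendees.
rows = pd.DataFrame({
    "year": ["2015", "2015"],
    "club": ["Geelong", "Hawthorn"],
    "attendees": [40000, 40000],  # the same crowd, once per club row
    "games": [1, 1],
})
totals = rows.groupby("year")[["attendees", "games"]].sum()
totals["games"] = totals["games"] / 2
totals["attendees"] = totals["attendees"] / 2
totals["avg attendance"] = totals["attendees"] / totals["games"]
print(totals)
```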
In [31]:
# Optionally exclude the most recent, incomplete season from the plot
# (the filter is currently disabled).
p_data = year_stats  # [year_stats.index != "2016"]
p = p_data.plot.scatter(x = "members", y = "avg attendance")

# Annotate the plot with the year
# Following solution from: http://stackoverflow.com/questions/32154518/annotate-labels-in-pandas-scatter-plot
def annotate_df(row):  
    p.annotate(row.name, row[['members', 'avg attendance']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

We can see a steady increase in total members each year since 2010; however, it is interesting to note that this has not resulted in increased average attendances. Average attendance actually dropped significantly in 2012.

One possible reason for this is the introduction of two new clubs, Gold Coast and GWS, in 2011 and 2012 respectively. These two clubs have recorded few wins per year, which may have reduced the quality of the competition for spectators and pulled down the average crowd size.
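One way to probe this hypothesis is to recompute the yearly average attendance with the two expansion clubs excluded. The sketch below does this on a hand-copied subset of the 2015 figures from the combined table (the real analysis would filter `year_club_stats` instead):

```python
import pandas as pd

# A few 2015 rows copied from the combined table above.
rows = pd.DataFrame({
    "club": ["Geelong", "Hawthorn", "GWS", "Gold Coast"],
    "attendees": [720060, 899917, 402018, 394216],
    "games": [21, 22, 22, 22],
})
# Average attendance across all four clubs vs. the established clubs only.
established = rows[~rows["club"].isin(["GWS", "Gold Coast"])]
avg_all = rows["attendees"].sum() / float(rows["games"].sum())
avg_established = established["attendees"].sum() / float(established["games"].sum())
print(avg_all, avg_established)
```

On these rows, excluding the expansion clubs raises the average noticeably, consistent with the hypothesis.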

In [32]:
# Similar code to the above, same explanation.
p_data = year_stats#[year_stats.index != "2016"]
p = p_data.plot.scatter(x = "members", y = "attendees")

def annotate_df(row):  
    p.annotate(row.name, row[['members', 'attendees']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

Attendance per year has remained roughly constant, with very little fluctuation, despite significant increases in total membership numbers.

In [33]:
# Similar code to the above, same explanation.
p_data = year_stats.join(dnp_pcap_au, how = "inner")
p =  p_data.plot.scatter(x = "GNI_per_capita", y = "members")

def annotate_df(row):  
    p.annotate(row.name, row[['GNI_per_capita', 'members']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

From the above plot it appears that club memberships correlate with GNI per capita.

We can see that GNI per capita has generally increased year-on-year although 2014 showed a small decline from 2013.

It is interesting to see that membership numbers have also increased each year except for 2008 and 2009. One possible explanation is the Global Financial Crisis, which affected Australia's economy around that time.

In [34]:
# Similar code to the above, same explanation.
p_data = year_stats.join(dnp_pcap_au, how = "inner")
p = p_data.plot.scatter(x = "GNI_per_capita", y = "attendees")

def annotate_df(row):  
    p.annotate(row.name, row[['GNI_per_capita', 'attendees']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

GNI per capita does not appear to correlate with total attendance at games, in contrast to the previous plot, which showed some correlation with club memberships.

In [35]:
# As we only have current Twitter followers we must compare to the most up to date
# club stats. The last complete year for club stats is 2015 so we compare with that year.

# Join the Twitter data with the year_club_stats for 2015.
club_stats_2015 = year_club_stats[year_club_stats.year == "2015"].join(
    afl_clubs_df, how = "outer", on = "club"
)

# Keep only the columns we are interested in.
club_stats_2015 = club_stats_2015[
    ['club', 'members', 'attendees', 'games', 'wins', 'followers', 'avg attendance']
]

# Plot two metrics from the above
club_stats_2015.plot.scatter(x = "members", y = "followers")
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa465a54a50>

The above plot compares current Twitter followers with 2015 memberships for each club. It shows a correlation between membership numbers and Twitter followers.

Note: the Twitter API does not provide historical follower counts for past years, and 2016 membership numbers are yet to be finalised.

In [36]:
club_stats_2015.plot.scatter(x = "avg attendance", y = "followers")
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa465520250>

The above plot shows a correlation between Twitter followers and average game attendance.

In [37]:
year_club_stats.plot.scatter(x = "members", y = "avg attendance")
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa46582d290>

The above plot shows the average attendance compared to membership size for each club per year. This shows a strong correlation between average attendance and membership size.

In [38]:
year_club_stats.plot.scatter(x = "members", y = "wins")
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa465520e50>

The above plot shows the number of wins compared to membership size for each club per year. Membership size and number of wins do not appear to be significantly correlated.

In [39]:
year_club_stats.plot.scatter(x = "avg attendance", y = "wins")
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa4654440d0>

The above plot shows the number of wins compared to average attendance for each club per year. It shows some correlation between the two: games involving poorly performing teams tend to draw smaller crowds.

In [ ]:
# Save the data frames to CSV files.
dnp_pcap_au.to_csv("Australia_GNI_per_capita.csv") 
afl_clubs_df.to_csv("Twitter_followers_by_AFL_club.csv")
year_club_stats.to_csv("Stats_by_club_and_year.csv")

ConclusionsΒΆ

  • The increase in total combined memberships across clubs has not corresponded with an increase in combined yearly crowd attendance.
  • However, membership size for a club positively correlates with the average crowd attendance at their games.
  • GNI per capita, an indicator of population wealth, correlates with club membership size but does not correlate with crowd attendance.
  • Club memberships declined around the time of the Global Financial Crisis although overall attendance at games did not.
  • Number of followers on Twitter is positively related to club membership size and crowd attendance.
  • Games involving poorly performing teams tend to have lower crowd attendance.
  • There is no clear relationship between club membership and number of wins per year.
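The correlation claims above could also be quantified numerically with pandas' `corr` method; a minimal sketch on a few hypothetical club rows (the real analysis would run it on `year_club_stats`):

```python
import pandas as pd

# Hypothetical stand-in rows: members and average attendance move together,
# while wins vary less systematically.
stats = pd.DataFrame({
    "members": [46472, 24873, 74643, 12854],
    "avg attendance": [36182, 29520, 44191, 15984],
    "wins": [8, 12, 9, 6],
})
corr = stats.corr(method="pearson")
print(corr.round(2))
```

A Pearson coefficient close to 1 between members and average attendance would support the membership/attendance relationship noted above.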