AFL Analytics

Author: Johann de Boer

Australian Football League (AFL)

Year-on-year comparison of club success, membership size, popularity, game attendance and Australian population wealth.

Identified data sources:

  • Memberships by club and year: afl.com.au, news.com.au, aflonline.com.au, footyindustry.com and wikipedia.org
  • Crowd attendance at AFL games: afltables.com
  • Number of Twitter followers by club: Twitter API
  • Australian yearly Gross National Income (GNI) per capita: The World Bank API
  • Historical AFL game results: Excel file downloaded from aussportsbetting.com

Introduction

The Australian Football League (AFL) is the professional competition for the sport of Australian rules football in Australia. There are currently 18 teams spread throughout the country. Each season, every team plays approximately 22 games. The top-performing eight teams then contest a finals series, with the last two remaining teams meeting in a grand final to determine the winner for the year. Some key points relevant to this project are:

  • An admission fee is charged for people who wish to attend an AFL game.
  • Each football club sells memberships to its supporters every year, which entitle members to free access to a number of games, depending on their subscription type.
  • Each club has a presence on Twitter to connect with its members and broader fans.

Questions for data analysis:

  • Does on field success have an influence on club memberships?
  • How does club membership size affect crowd attendance?
  • Do attendance at games and club membership size increase with population wealth?
  • Is a club's membership size reflected in the number of followers it has on Twitter?

Install the necessary libraries

In [1]:
!pip install tweepy
!pip install pandas_datareader
!pip install xlrd
!pip install tables

Load the necessary libraries and import required functions

In [2]:
from IPython.display import display
from scipy import stats
from itertools import groupby
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pylab import rcParams # controls for plot rendering
import sqlite3
from datetime import datetime
import tweepy
import time
import os.path
import json
from pandas_datareader import data, wb
import requests
from bs4 import BeautifulSoup
import xlrd

Set report presentation style options

In [3]:
mpl.style.use('ggplot')
pd.set_option('display.html.border', 1)
pd.set_option('max_columns', 12)
pd.set_option('max_rows', 100)
pd.set_option('display.width', 100)
pd.set_option('multi_sparse', False)
pd.set_option('expand_frame_repr', True)
pd.set_option('max_colwidth', 60)
pd.set_option('precision', 4)
%matplotlib inline

Source the data required

The World Bank API

We obtain the Gross National Income (GNI) of Australia per capita (Atlas method in terms of current US$) using The World Bank data API. For details about this data, please refer to this link: http://data.worldbank.org/indicator/NY.GNP.PCAP.CD

Our hypothesis is that crowd attendance and club membership are correlated with the shared wealth of the country: football supporters subscribe to clubs and purchase tickets when they have more disposable income. GNI per capita has been chosen as an indicator of the general wealth of the population. Later in this analysis we will compare this measure against crowd attendance and club membership figures.

In [4]:
# Executes World Bank API query, removes the country information,
# renames the metric column and removes entries with no data,
# i.e. outside of the reportable time-frame.
dnp_pcap_au = wb.download(
    indicator = 'NY.GNP.PCAP.CD',
    country = ['AU'],
    start = 1960,
    end = 2018
).reset_index("country").drop("country", axis = 1).rename(
    columns = {"NY.GNP.PCAP.CD" : "GNI_per_capita"}
).dropna()
In [5]:
dnp_pcap_au = dnp_pcap_au.append(pd.DataFrame(data = {'GNI_per_capita' : 46560}, index = ['2017']))
dnp_pcap_au = dnp_pcap_au.sort_index(ascending = False)
dnp_pcap_au
Out[5]:
GNI_per_capita
2017 46560.0
2016 54420.0
2015 60330.0
2014 64860.0
2013 65640.0
2012 59840.0
2011 50150.0
2010 46560.0
2009 44040.0
2008 42390.0
2007 37340.0
2006 34170.0
2005 30340.0
2004 25520.0
2003 21130.0
2002 19980.0
2001 20130.0
2000 21130.0
1999 21460.0
1998 21790.0
1997 21980.0
1996 20520.0
1995 19320.0
1994 18890.0
1993 18930.0
1992 18560.0
1991 18250.0
1990 17350.0
1989 15610.0
1988 14140.0
1987 12570.0
1986 12330.0
1985 12030.0
1984 12050.0
1983 11600.0
1982 12090.0
1981 11880.0
1980 10840.0
1979 9610.0
1978 8480.0
1977 8090.0
1976 7880.0
1975 7230.0
1974 5960.0
1973 4660.0
1972 3890.0
1971 3480.0
1970 3280.0
1969 2980.0
1968 2730.0
1967 2540.0
1966 2350.0
1965 2270.0
1964 2110.0
1963 1970.0
1962 1880.0

A full set of data was obtained from the World Bank API, going back as far as 1962 and up to 2016 (inclusive); a figure for 2017 was appended manually above. The plot below shows the trend in GNI year-on-year.

In [7]:
# Plot GNI per capita in order of year.
dnp_pcap_au.sort_index().plot()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa46d3de390>

We can see from the above plot that the average wealth of Australians, in terms of current US dollars, has generally increased on an ongoing basis over the entire time-frame, with only a few small downturns. The rate of growth appears roughly exponential year-on-year.
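As a rough check on that claim, growth multiples can be computed directly from the series; a minimal sketch using a handful of decade-apart values taken from the table above:

```python
import pandas as pd

# Decade-apart GNI per capita values (current US$), taken from the table above.
gni = pd.Series({
    1962: 1880.0, 1972: 3890.0, 1982: 12090.0,
    1992: 18560.0, 2002: 19980.0, 2012: 59840.0,
}).sort_index()

# Growth over each decade as a multiple of the starting value. Roughly
# constant multiples would indicate exponential growth.
decade_growth = (gni / gni.shift(1)).dropna()
print(decade_growth)
```

The multiples are not constant (growth stalled through the 1990s), so "exponential growth with interruptions" is the more defensible description.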

Twitter API

Each of the AFL clubs has a public presence on Twitter. It is hypothesised that the number of followers each club has on Twitter is an indicator of the club's current popularity. We will now obtain the list of AFL clubs and their follower counts on Twitter. Later in this analysis, the number of Twitter followers will be compared to total membership and crowd attendance for each club to determine whether these figures are correlated.

In [ ]:
# PLEASE DO NOT SHARE THESE CREDENTIALS AS THEY ARE PRIVATE AND CONFIDENTIAL
consumer_key="XXYY"
consumer_secret="XXYY"
access_token="XXYY"
access_token_secret="XXYY"
# ==========================================================================
In [8]:
# Twiter public API documentation: https://dev.twitter.com/rest/public
# Tweepy documentation: http://tweepy.readthedocs.io/en/v3.5.0/index.html

# Set the credentials for accessing the Twitter API.

# Perform the authentication handshake with the Twitter API using the above credentials.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)

# Create an object to use as a connection to the authenticated API
api = tweepy.API(auth)

# Rate limit handler for the Twitter API.
def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            print "Rate Limited. Waiting 15 minutes..."
            time.sleep(15 * 60)
            print "OK, wait is over."
            continue
        except tweepy.TweepError as e:
            print "Tweep Error: {0}".format(str(e))
            break

afl_clubs = api.list_members("AFLPlayers", "clubs")

In the above code, the members of the 'clubs' Twitter list curated by the AFL Players' Association's Twitter account are retrieved. Each member of that list is a club account, and the Twitter API response includes the details of each account, including its number of followers.

To ensure consistency across each of the data sources used in this analysis, club names will be adjusted to be consistent.

In [9]:
afl_clubs_df = pd.DataFrame()

# Define a function that will be used to add clubs to the above empty data frame of AFL club
# Twitter accounts along with a selection of associated details (including number of followers).
def add_club_to_df(df, club):
    df = df.append({
        "id" : club.id_str,
        "name" : club.name,
        "followers" : club.followers_count,
        "location" : club.location,
        "tz" : club.time_zone,
        "handle" : club.screen_name,
        "image" : club.profile_image_url
    }, ignore_index = True)
    return df

# The above function is then applied to each AFL account from the list returned from the
# Twitter API.
for afl_club in afl_clubs:
    afl_clubs_df = add_club_to_df(afl_clubs_df, afl_club)

# The resulting data frame is then tidied up
afl_clubs_df = afl_clubs_df.set_index("id")
afl_clubs_df.followers = afl_clubs_df.followers.astype("int")
afl_clubs_df = afl_clubs_df[["name", "location", "tz", "followers", "handle", "image"]]
afl_clubs_df = afl_clubs_df.rename(columns = {"name" : "club"})
afl_clubs_df.club = afl_clubs_df.club.str.replace("GWS GIANTS", "GWS")
afl_clubs_df.club = afl_clubs_df.club.str.replace("GC SUNS", "Gold Coast")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" FC", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Dockers", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Lions", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Cats", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Eagles", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Crows", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Swans", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(r'\W+', "")
afl_clubs_df = afl_clubs_df.set_index('club')
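The same name-cleaning rules recur for each data source below; a helper consolidating them could look like the following (a sketch, not part of the original notebook; note that, like the replace chain above, it strips all non-word characters, including spaces):

```python
import re

# Sketch of a reusable cleaner mirroring the replace chain above.
RENAMES = {"GWS GIANTS": "GWS", "GC SUNS": "Gold Coast"}
SUFFIXES = [" FC", " Dockers", " Lions", " Cats", " Eagles", " Crows", " Swans"]

def clean_club_name(name):
    # Apply the special-case renames first, then strip nickname suffixes.
    for old, new in RENAMES.items():
        name = name.replace(old, new)
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    # Remove any remaining non-word characters, as the original code does.
    return re.sub(r"\W+", "", name)
```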

The following graph shows the number of Twitter followers for each club:

In [10]:
# Plot number of followers for each club in order of lowest to highest.
afl_clubs_df['followers'].sort_values().plot.barh()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa46d282750>

Web scraping

Number of members per club from 2006 to 2017

Club membership sizes for each club from 2006 to 2017 are obtained from various websites using web scraping.

Memberships for 2006 through to 2009 are obtained from footyindustry.com:

In [11]:
# Fetch the page with the club membership details on it
memberships_2006_2009_url = "http://www.footyindustry.com/wiki/index.php?title=Club_Memberships"
memberships_2006_2009_response = requests.get(memberships_2006_2009_url)
In [12]:
# Parse the page and extract the tables from it. Each table represents a single year.
memberships_2006_2009 = BeautifulSoup(memberships_2006_2009_response.content, "lxml")
memberships_2006_2009_tables = memberships_2006_2009.findAll("table")

# Convert the tables into data frames, selecting only the tables we require.
memberships_2006_2009_dfs = pd.read_html(
    str(memberships_2006_2009_tables[3:7]),
    header = 1,
    index_col = 0
)

# Remove rows with missing values and the final totals row from each table, keeping only
# the number of members with 'game access' type memberships.
memberships_2006_2009_dfs = [df.dropna().iloc[0:-1, 3] for df in memberships_2006_2009_dfs]

# Combine all the tables into a single data frame with a column indicating the year
memberships_2006_2009_dfs = pd.concat(
    {year: df for year, df in zip(['2006', '2007', '2008', '2009'], memberships_2006_2009_dfs)},
    names = ['year', 'club']
)

# Convert the series into data frames with clubs as rows and years as columns
memberships_2006_2009_df = memberships_2006_2009_dfs.reset_index(name = "members").pivot_table(
    index = "club", columns = "year", values = "members", aggfunc = sum
)

Memberships for 2010 and 2011 are obtained from Wikipedia:

In [13]:
# Download the webpage with the membership figures on it
memberships_2011_url = "https://en.wikipedia.org/wiki/2011_AFL_season"
memberships_2011_response = requests.get(memberships_2011_url)
In [14]:
# Parse the webpage, extracting the table found in the club membership section
memberships_2011 = BeautifulSoup(memberships_2011_response.content, "lxml")
memberships_2011 = memberships_2011.find(attrs={"id" : "Club_membership"}).find_next("table")

# Convert the table into a data frame, naming the columns and discarding the last
# column that we do not require.
memberships_2011_df = pd.read_html(str(memberships_2011), header = 0)[0].iloc[:, 0:-1]
memberships_2011_df.columns = ["club", "2011", "change_from_2010"]
# Clean up the data in the table to calculate the 2010 figures
# If the 'change' column does not start with a digit, then it must be a negative change.
# Extract the change from 2010 to 2011 to calculate the original 2010 figures.
memberships_2011_df.change_from_2010 = memberships_2011_df.change_from_2010.str.replace(
    r"^[^0-9]+", "-"
).str.extract(
    r"(.+) !", expand = False
).astype(np.number)
memberships_2011_df['2010'] = memberships_2011_df['2011'] - memberships_2011_df.change_from_2010
# Arrange the dataframe with club as the rows and keeping just the 2010 and 2011 columns.
memberships_2011_df = memberships_2011_df.set_index('club')
memberships_2011_df = memberships_2011_df[["2010", "2011"]]
# Tidy up the club names for consistency and remove the final 'total' row from the data frame.
memberships_2011_df.index = memberships_2011_df.index.str.replace(" Lions", "")
memberships_2011_df.index = memberships_2011_df.index.str.replace(" Eagles", "")
memberships_2011_df = memberships_2011_df.drop("Total")

Memberships for 2012 are obtained from a sports news article on the news.com.au website:

In [15]:
# Download the webpage with the club membership stats on it
memberships_2012_url = "http://www.news.com.au/sport/afl/collingwood-and-hawthorn-record-membership-increases/story-fnelctok-1226466581212"
memberships_2012_response = requests.get(memberships_2012_url)
In [16]:
# Parse the webpage, defining a regular expression to extract the club name and membership
# size from the text contained in bulleted paragraphs.
memberships_2012_memberships = BeautifulSoup(memberships_2012_response.content, "lxml")
memberships_2012_regex = re.compile("([a-zA-Z ]+): ([0-9]{2},[0-9]{3})")
memberships_2012_tags = memberships_2012_memberships.find_all(string = memberships_2012_regex)
# Extract the club name and membership size separately (using the regular expression already defined)
# for each paragraph as a list of tuples, and coerce these into a data frame, naming the columns
# 'club' and 'members'.
memberships_2012_df = pd.DataFrame(
    [re.match(memberships_2012_regex, club).groups() for club in memberships_2012_tags],
    columns = ["club", "members"]
)
# Set the club to index the rows, format the club sizes into numbers, set the column to the year
# and tidy the club names for consistency.
memberships_2012_df = memberships_2012_df.set_index("club")
memberships_2012_df.members = memberships_2012_df.members.str.replace(",", "").astype("int")
memberships_2012_df = memberships_2012_df.rename(columns = {"members": "2012"})
memberships_2012_df.index = memberships_2012_df.index.str.replace(" Giants", "")

Memberships for the years 2013 and 2014 are obtained from an article from the official AFL website:

In [17]:
# Download the webpage
memberships_2013_2014_url = "http://www.afl.com.au/news/2014-08-02/club-memberships-rise"
memberships_2013_2014_response = requests.get(memberships_2013_2014_url)
In [18]:
# Parse the webpage by extracting the table and converting it into a data frame with appropriately
# named columns, setting the club name as the row index and ordering the years from left to right.
memberships_2013_2014 = BeautifulSoup(memberships_2013_2014_response.content, "lxml")
memberships_2013_2014_table = memberships_2013_2014.find("table")
memberships_2013_2014_df = pd.read_html(str(memberships_2013_2014_table), header = 0)[0]
memberships_2013_2014_df.columns = ["club", "2013", "2014"]
memberships_2013_2014_df = memberships_2013_2014_df.set_index("club")
memberships_2013_2014_df[["2013", "2014"]] = memberships_2013_2014_df[["2013", "2014"]].apply(
    # Format the club sizes into numbers.
    lambda s: s.astype(str).str.replace(r"[*,]", "").astype(int),
    axis = 1
)
# Tidy up the club names to be consistent
memberships_2013_2014_df.index = memberships_2013_2014_df.index.str.replace(" Lions", "")
memberships_2013_2014_df.index = memberships_2013_2014_df.index.str.replace(" Swans", "")

Finally, memberships for the years 2015, 2016 and 2017 are obtained from articles on aflonline.com.au:

In [19]:
def get_memberships_article(year = "2016"):
  # Download the webpage
  memberships_url = "http://www.aflonline.com.au/afl-news/current-afl-membership-numbers-for-" + str(year) + "/"
  memberships_response = requests.get(memberships_url)
  # Parse the webpage by extracting the table, converting into a data frame and naming the columns.
  memberships = BeautifulSoup(memberships_response.content, "lxml")
  memberships_table = memberships.find("table")
  memberships_df = pd.read_html(str(memberships_table), header = 0)[0]
  memberships_df = memberships_df.iloc[:, 0:3]
  memberships_df.columns = ["club", "year2", "year1"]
  # Set the club name as the row index.
  memberships_df = memberships_df.set_index("club")
  # Tidy the club names for consistency.
  memberships_df.index = memberships_df.index.str.replace(" Lions", "")
  memberships_df.index = memberships_df.index.str.replace(" Giants", "")
  memberships_df.index = memberships_df.index.str.replace(r"Melb$", "Melbourne")
  # Order the years from left to right.
  memberships_df = memberships_df[['year1', 'year2']]
  memberships_df = memberships_df.iloc[memberships_df.index != "TOTAL", :]
  return memberships_df
In [20]:
memberships_2015_2016_df = get_memberships_article("2016")
memberships_2016_2017_df = get_memberships_article("2017")
memberships_2015_2016_df.columns = ["2015", "2016"]
memberships_2016_2017_df.columns = ["2016", "2017"]
memberships_2015_onwards = memberships_2015_2016_df[["2015"]]
memberships_2015_onwards = memberships_2015_onwards.join(memberships_2016_2017_df, how = "outer")
memberships_2015_onwards
Out[20]:
2015 2016 2017
club
Adelaide 52920 54307 56865
Brisbane 25408 23286 21362
Carlton 47305 50130 50326
Collingwood 75037 74643 75879
Essendon 60818 57494 67768
Fremantle 51433 51889 51254
GWS 13480 15312 20944
Geelong 44312 50571 54854
Gold Coast 13643 12854 11665
Hawthorn 72924 75351 75663
Melbourne 35953 39146 42233
North Melbourne 41012 45014 40343
Port Adelaide 54057 53743 52129
Richmond 70809 72278 72669
St Kilda 32746 38009 42052
Sydney 48836 56523 58838
West Coast 60221 65188 65064
Western Bulldogs 35222 39459 47653

The memberships from each of the above sources are then concatenated together to form a single table.

In [21]:
# Concatenate the scraped membership stats into a single data frame with a multi-index 
# for clubs and year forming the rows.
club_memberships_df = memberships_2011_df.join([
        memberships_2006_2009_df,
        memberships_2012_df,
        memberships_2013_2014_df,
        memberships_2015_onwards
], how = "outer")
club_memberships_df.index.name = "club"
club_memberships_df = club_memberships_df.stack()
club_memberships_df.index.names = ['club', 'year']
club_memberships_df = club_memberships_df.reset_index(name = "members")
club_memberships_df = club_memberships_df.set_index(['year', 'club'])

The heatmap below compares the membership size of each club year-on-year. Note that Gold Coast and GWS started as AFL clubs in 2011 and 2012, respectively, and therefore had no prior membership figures.

In [22]:
# Transform the memberships data frame into wide format.
memberships_club_year_pivot = club_memberships_df.reset_index().pivot_table(
    values = "members", index = "club", columns = "year", aggfunc = sum, dropna = False
)
# Produce a heatmap, with the clubs sorted in order of total membership renewals
sns.heatmap(
    memberships_club_year_pivot.loc[
        memberships_club_year_pivot.sum(axis = 1).sort_values(ascending = False).index
    ],
    robust = True
)
plt.yticks(rotation = 0) 
plt.show()

We can see that Collingwood memberships increased significantly in 2011. As we will see later in this report, this corresponds with an increase in game wins. The same can be said for Hawthorn from 2014 onwards. However, there is little to no change for many teams regardless of the number of wins. For example, the Western Bulldogs have seen minimal change in membership even with large variations in game wins (e.g. 10 wins in 2010, 5 in 2012 and 14 in 2015).

Interestingly, club memberships were generally low across all of the clubs in 2009 but increased again in 2010. This coincides with the Global Financial Crisis (GFC), although population wealth did not fall at the time, as we saw earlier in the plot of GNI per capita year-on-year. We therefore cannot, at this stage, say whether the GFC caused the coinciding dip in memberships.
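One way to test this later in the analysis is to join the yearly GNI figures with total memberships across all clubs and compute a correlation coefficient; a minimal sketch, where the GNI values come from the World Bank table above but the membership totals are placeholders (not the scraped data):

```python
import pandas as pd
from scipy import stats

# GNI per capita values for 2009-2012 from the World Bank table above;
# the membership totals are placeholder figures for illustration only.
# The real analysis would sum club_memberships_df by year instead.
gni = pd.Series({"2009": 44040.0, "2010": 46560.0, "2011": 50150.0, "2012": 59840.0})
members = pd.Series({"2009": 580000, "2010": 620000, "2011": 650000, "2012": 660000})

# Align the two series on year and compute Pearson's r.
aligned = pd.concat({"gni": gni, "members": members}, axis=1).dropna()
r, p = stats.pearsonr(aligned["gni"], aligned["members"])
```

With only a handful of yearly observations the p-value carries little weight, so any such correlation should be treated as suggestive rather than conclusive.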

Now we obtain the attendance numbers from afltables.com:

In [23]:
# The following function can be used to scrape a series of pages from the same website, where the
# difference in extracting the crowd attendance data is the year given in the URL.
def get_attendance(year = "2016"):
  # Download the required page
  url = "http://afltables.com/afl/crowds/" + str(year) + ".html"
  response = requests.get(url)
  # Parse the page by extracting the table and converting into a data frame, keeping only the required
  # columns.
  attendances = BeautifulSoup(response.content, "lxml")
  attendance_by_team = attendances.find("table")
  df = pd.read_html(str(attendance_by_team), header = 0)[0].iloc[:-1, 0:7]
  # Name the columns
  df.columns = [
      "club",
      "home_attendance", "home_games", "home_avg",
      "away_attendance", "away_games", "away_avg"
  ]
  # Use club name as the index for the rows.
  df = df.set_index("club")
  # Remove the average columns, as we are not interested in them:
  df = df.drop(["home_avg", "away_avg"], axis = 1)
  df = df.drop(index="Team")
  # Coerce the numbers into integers.
  df.home_attendance = df.home_attendance.astype("int")
  df.home_games = df.home_games.astype("int")
  df.away_attendance = df.away_attendance.astype("int")
  df.away_games = df.away_games.astype("int")
  # Rename clubs to be consistent with other datasets.
  df.index = df.index.str.replace(" Lions", "")
  df.index = df.index.str.replace("Greater Western Sydney", "GWS")
  df.index = df.index.str.replace("Kangaroos", "North Melbourne")
  return df
In [24]:
# Apply the above function to scrape the website for attendance data for the years 2006 to 2017 (inclusive)
attendance_by_year = {str(year): get_attendance(year) for year in range(2006, 2018)}
attendance_by_year = pd.concat(attendance_by_year, names = ["year"])

# Calculate new fields for total attendees, total games and average attendance, keeping only
# the calculated columns in the data frame already indexed by year and club name.
attendance_by_year['attendees'] = attendance_by_year['home_attendance'] + attendance_by_year['away_attendance']
attendance_by_year['games'] = attendance_by_year['home_games'] + attendance_by_year['away_games']
attendance_by_year = attendance_by_year[['attendees', 'games']]
attendance_by_year['avg attendance'] = attendance_by_year['attendees'] / attendance_by_year['games']

The following heatmap compares the average attendance per game by playing team, year-on-year:

In [25]:
# Transform the attendance by club and year into wide format
attendance_club_year_pivot = attendance_by_year.reset_index().pivot_table(
    values = "avg attendance", index = "club", columns = "year", dropna = False
)
# Plot a heatmap of attendance by year and club. Sort the clubs by highest total spectators to least
sns.heatmap(
    attendance_club_year_pivot.loc[
        attendance_club_year_pivot.sum(axis = 1).sort_values(ascending = False).index
    ],
    robust = True
)
plt.yticks(rotation = 0) 
plt.show()

There is a tendency for Victorian-based clubs to record higher attendance figures than non-Victorian clubs. This may be due to cultural reasons, such as the sport having a stronger following in Victoria than in other states.
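That tendency could be checked by grouping average attendance by each club's home state; a sketch using a hypothetical partial club-to-state mapping and illustrative attendance figures (both are assumptions for illustration, not the scraped data):

```python
import pandas as pd

# Hypothetical club-to-state mapping (illustrative subset, not scraped).
club_state = {
    "Collingwood": "VIC", "Richmond": "VIC", "Essendon": "VIC",
    "West Coast": "WA", "Fremantle": "WA",
    "Adelaide": "SA", "Sydney": "NSW",
}
# Illustrative average attendance per game (placeholder figures).
avg_attendance = pd.Series({
    "Collingwood": 52000, "Richmond": 48000, "Essendon": 43000,
    "West Coast": 41000, "Fremantle": 36000,
    "Adelaide": 44000, "Sydney": 33000,
})
# Group clubs by state (aligning on club name) and average within each state.
by_state = avg_attendance.groupby(pd.Series(club_state)).mean()
by_state = by_state.sort_values(ascending=False)
```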

We now read the Excel file containing the match results for each game:

In [26]:
# Open the spreadsheet file, selecting only the columns we are interested in.
odds_df = pd.read_excel(
    "afl.xlsx",
    skiprows = 1,
    usecols = "A:AY"
)
# Tidy up the team names to be consistent
odds_df[['Home Team', 'Away Team']] = odds_df[['Home Team', 'Away Team']].apply(
    lambda club: club.str.replace(" Giants", ""),
    axis = 1
)
# Keep the columns we are interested in and rename the column indicating whether the game is a final.
odds_df = odds_df[['Date', 'Home Team', 'Away Team', 'Home Score', 'Away Score', 'Play Off Game?']]
odds_df = odds_df.rename(columns = {"Play Off Game?": "Final"})
# Encode the Final column into Y and N.
odds_df.Final = odds_df.Final.fillna(value = "N")
odds_df = odds_df[odds_df.Final == "N"]

# Identify games where the home team won, the away team won and where there were draws.
odds_df['Home Win'] = odds_df['Home Score'] > odds_df['Away Score']
odds_df['Away Win'] = odds_df['Home Score'] < odds_df['Away Score']
odds_df['Draw'] = odds_df['Home Score'] == odds_df['Away Score']
# Identify the names of the teams that won each game
odds_df['Home Win'] = odds_df['Home Team'].where(odds_df['Home Win'])
odds_df['Away Win'] = odds_df['Away Team'].where(odds_df['Away Win'])

# A function to return the name of the team that won a game (represented as a row)
def decide_winning_team(row):
    if row['Draw'] == True:
        return "Draw"
    elif pd.isnull(row['Home Win']):
        return row['Away Win']
    elif pd.isnull(row['Away Win']):
        return row['Home Win']

# Reduce the columns down to just one column using the above function that indicates which
# team was the winning team, otherwise a draw.
odds_df['Winning Team'] = odds_df[['Home Win', 'Away Win', 'Draw']].apply(
    decide_winning_team,
    axis = 1
)

# Remove the redundant 'Draw' column as we have this encoded in the above already.
odds_df = odds_df.drop("Draw", axis = 1)
# Determine the year (season) of the game.
odds_df['year'] = odds_df['Date'].dt.year.astype("str")
# Determine the number of games won, grouped by each team
winning_team_by_year = odds_df.groupby(['year', 'Winning Team']).size()
winning_team_by_year = winning_team_by_year.reset_index(name = "wins")
# Format the number of games won by team data frame into the same structure as our other
# datasets, i.e. rows indexed by year and club.
winning_team_by_year = winning_team_by_year.rename(columns = {"Winning Team": "club"})
winning_team_by_year = winning_team_by_year.set_index(['year', 'club'])
# Any (year, club) pair missing from the win counts corresponds to a club with zero wins
# that year, so fill those in with zeros.
winning_team_by_year = winning_team_by_year.reset_index().pivot_table(
     values = "wins", index = "club", columns = "year", fill_value = 0, dropna = False
).stack().reset_index(name = "wins").rename(
    columns = {'level_1' : 'year'}
).set_index(['year', 'club']).sort_index()
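As a sanity check on the fill-in step above, here is a minimal self-contained sketch (toy numbers, not the project's data) showing how `pivot_table` with `fill_value = 0` followed by `stack` makes zero-win seasons explicit:

```python
import pandas as pd

# Toy win counts: Hawthorn has no row for 2012, which should become an
# explicit zero after pivoting and stacking.
wins = pd.DataFrame({
    "year": ["2011", "2011", "2012"],
    "club": ["Geelong", "Hawthorn", "Geelong"],
    "wins": [19, 18, 15],
})
filled = wins.pivot_table(
    values="wins", index="club", columns="year", fill_value=0
).stack().reset_index(name="wins").set_index(["year", "club"]).sort_index()
print(filled)
```

After this step every club has a row for every season, with absent seasons recorded as zero wins.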

The following heatmap compares the number of games won by each team year-on-year:

In [27]:
# Transform from long to wide format.
winning_team_by_year_pivot = winning_team_by_year.reset_index().pivot_table(
    values = "wins", index = "club", columns = "year", dropna = False
)
# Plot heatmap, in order of club with most wins to least.
sns.heatmap(
    winning_team_by_year_pivot.loc[
        winning_team_by_year_pivot.sum(axis = 1).sort_values(ascending = False).index
    ],
    robust = True
)
plt.yticks(rotation = 0) 
plt.show()

The graph above ranks the clubs by total number of wins over the entire time frame and compares their wins year-on-year. It also shows when there were draws. The top two clubs, Geelong and Hawthorn, have enjoyed sustained success, each recording a high number of wins every season since 2011. GWS and Gold Coast have recorded the fewest wins, but this is because they are new clubs that only entered the AFL in 2012 and 2011 respectively. Note: figures for the most recent season shown may be incomplete if the data was collected before that season finished.

The table below shows the same information with the clubs ordered alphabetically. It provides the number of wins for each club, each year.

In [28]:
winning_team_by_year_pivot
Out[28]:
year 2009 2010 2011 2012 2013 2014 2015 2016 2017
club
Adelaide 8 9 7 17 10 11 13 16 15
Brisbane 7 7 4 10 10 7 4 3 5
Carlton 7 11 14 11 11 7 4 7 6
Collingwood 10 17 20 16 14 11 10 9 9
Draw 2 2 3 1 1 1 2 0 3
Essendon 5 7 11 11 14 12 6 3 12
Fremantle 3 13 9 14 16 16 17 4 8
GWS 0 0 0 2 1 6 11 16 14
Geelong 10 17 19 15 18 17 11 17 15
Gold Coast 0 0 3 3 8 10 4 6 6
Hawthorn 3 12 18 17 19 17 16 17 10
Melbourne 3 8 8 4 2 4 7 10 12
North Melbourne 3 11 10 14 10 14 13 12 6
Port Adelaide 3 10 3 5 12 14 12 10 14
Richmond 2 6 8 10 15 12 15 8 15
St Kilda 10 15 12 12 5 4 6 12 11
Sydney 3 13 12 16 15 17 16 17 14
West Coast 5 4 17 15 9 11 16 16 12
Western Bulldogs 8 14 9 5 8 7 14 15 11

We can now combine the club memberships, attendance and win counts into one data frame, as they are all consistently indexed by year and club. Note that we do not have game results prior to 2009.
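A miniature version of this join (toy frames with a made-up 2009 membership figure, not the project's data) illustrates why the wins column is NaN for seasons before 2009:

```python
import pandas as pd

# Memberships exist for 2008 and 2009, but wins only from 2009 onward,
# so a left join leaves NaN for the earlier season.
years_clubs = pd.MultiIndex.from_tuples(
    [("2008", "Geelong"), ("2009", "Geelong")], names=["year", "club"]
)
members = pd.DataFrame({"members": [30169, 31000]}, index=years_clubs)
wins = pd.DataFrame(
    {"wins": [10]},
    index=pd.MultiIndex.from_tuples([("2009", "Geelong")], names=["year", "club"]),
)
combined = members.join(wins, how="left")
print(combined)
```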

In [29]:
# Join attendance data with club memberships and wins
year_club_stats = club_memberships_df.join(
    attendance_by_year, how = "inner"
).join(
    winning_team_by_year, how = "left"
)
year_club_stats = year_club_stats.reset_index()
year_club_stats
Out[29]:
year club members attendees games avg attendance wins
0 2006 Adelaide 46472.0 796016 22 36182.5455 NaN
1 2006 Brisbane 24873.0 649461 22 29520.9545 NaN
2 2006 Carlton 42408.0 813535 22 36978.8636 NaN
3 2006 Collingwood 45972.0 1164982 22 52953.7273 NaN
4 2006 Essendon 40412.0 935522 22 42523.7273 NaN
5 2006 Fremantle 39206.0 692272 22 31466.9091 NaN
6 2006 Geelong 37160.0 712718 22 32396.2727 NaN
7 2006 Hawthorn 52496.0 691924 22 31451.0909 NaN
8 2006 Melbourne 31506.0 770733 22 35033.3182 NaN
9 2006 North Melbourne 28340.0 634686 22 28849.3636 NaN
10 2006 Port Adelaide 30605.0 586332 22 26651.4545 NaN
11 2006 Richmond 36981.0 855556 22 38888.9091 NaN
12 2006 St Kilda 31906.0 802678 22 36485.3636 NaN
13 2006 Sydney 26269.0 730377 22 33198.9545 NaN
14 2006 West Coast 43927.0 782853 22 35584.2273 NaN
15 2006 Western Bulldogs 28215.0 788467 22 35839.4091 NaN
16 2007 Adelaide 48720.0 788959 22 35861.7727 NaN
17 2007 Brisbane 22737.0 665330 22 30242.2727 NaN
18 2007 Carlton 39360.0 864753 22 39306.9545 NaN
19 2007 Collingwood 42498.0 1226604 22 55754.7273 NaN
20 2007 Essendon 41947.0 1088652 22 49484.1818 NaN
21 2007 Fremantle 43366.0 737569 22 33525.8636 NaN
22 2007 Geelong 36850.0 762050 22 34638.6364 NaN
23 2007 Hawthorn 41436.0 769629 22 34983.1364 NaN
24 2007 Melbourne 29619.0 707402 22 32154.6364 NaN
25 2007 North Melbourne 32600.0 638881 22 29040.0455 NaN
26 2007 Port Adelaide 34185.0 599859 22 27266.3182 NaN
27 2007 Richmond 30820.0 909203 22 41327.4091 NaN
28 2007 St Kilda 30063.0 862848 22 39220.3636 NaN
29 2007 Sydney 26721.0 772705 22 35122.9545 NaN
30 2007 West Coast 44863.0 834907 22 37950.3182 NaN
31 2007 Western Bulldogs 28306.0 721691 22 32804.1364 NaN
32 2008 Adelaide 50976.0 803035 22 36501.5909 NaN
33 2008 Brisbane 21976.0 624232 22 28374.1818 NaN
34 2008 Carlton 35431.0 1070500 22 48659.0909 NaN
35 2008 Collingwood 38587.0 1247391 22 56699.5909 NaN
36 2008 Essendon 32759.0 1044123 22 47460.1364 NaN
37 2008 Fremantle 43343.0 700023 22 31819.2273 NaN
38 2008 Geelong 30169.0 837196 22 38054.3636 NaN
39 2008 Hawthorn 31064.0 910679 22 41394.5000 NaN
40 2008 Melbourne 28077.0 663711 22 30168.6818 NaN
41 2008 North Melbourne 22366.0 680095 22 30913.4091 NaN
42 2008 Port Adelaide 34073.0 556905 22 25313.8636 NaN
43 2008 Richmond 30044.0 935002 22 42500.0909 NaN
44 2008 St Kilda 30394.0 791834 22 35992.4545 NaN
45 2008 Sydney 28764.0 701906 22 31904.8182 NaN
46 2008 West Coast 45949.0 757948 22 34452.1818 NaN
47 2008 Western Bulldogs 28725.0 697930 22 31724.0909 NaN
48 2009 Adelaide 50138.0 761080 22 34594.5455 8.0
49 2009 Brisbane 26459.0 642309 22 29195.8636 7.0
... ... ... ... ... ... ... ...
155 2015 Essendon 60818.0 894335 22 40651.5909 6.0
156 2015 Fremantle 51433.0 691707 22 31441.2273 17.0
157 2015 GWS 13480.0 402018 22 18273.5455 11.0
158 2015 Geelong 44312.0 720060 21 34288.5714 11.0
159 2015 Gold Coast 13643.0 394216 22 17918.9091 4.0
160 2015 Hawthorn 72924.0 899917 22 40905.3182 16.0
161 2015 Melbourne 35953.0 653832 22 29719.6364 7.0
162 2015 North Melbourne 41012.0 608974 22 27680.6364 13.0
163 2015 Port Adelaide 54057.0 737649 22 33529.5000 12.0
164 2015 Richmond 70809.0 1054354 22 47925.1818 15.0
165 2015 St Kilda 32746.0 590021 22 26819.1364 6.0
166 2015 Sydney 48836.0 701028 22 31864.9091 16.0
167 2015 West Coast 60221.0 705440 22 32065.4545 16.0
168 2015 Western Bulldogs 35222.0 595081 22 27049.1364 14.0
169 2016 Adelaide 54307.0 857443 22 38974.6818 16.0
170 2016 Brisbane 23286.0 501909 22 22814.0455 3.0
171 2016 Carlton 50130.0 784445 22 35656.5909 7.0
172 2016 Collingwood 74643.0 972219 22 44191.7727 9.0
173 2016 Essendon 57494.0 802059 22 36457.2273 3.0
174 2016 Fremantle 51889.0 574588 22 26117.6364 4.0
175 2016 GWS 15312.0 403207 22 18327.5909 16.0
176 2016 Geelong 50571.0 750982 22 34135.5455 17.0
177 2016 Gold Coast 12854.0 351648 22 15984.0000 6.0
178 2016 Hawthorn 75351.0 840496 22 38204.3636 17.0
179 2016 Melbourne 39146.0 686641 22 31210.9545 10.0
180 2016 North Melbourne 45014.0 671847 22 30538.5000 12.0
181 2016 Port Adelaide 53743.0 694948 22 31588.5455 10.0
182 2016 Richmond 72278.0 900237 22 40919.8636 8.0
183 2016 St Kilda 38009.0 686520 22 31205.4545 12.0
184 2016 Sydney 56523.0 696570 22 31662.2727 17.0
185 2016 West Coast 65188.0 744774 22 33853.3636 16.0
186 2016 Western Bulldogs 39459.0 702779 22 31944.5000 15.0
187 2017 Adelaide 56865.0 827935 22 37633.4091 15.0
188 2017 Brisbane 21362.0 481838 22 21901.7273 5.0
189 2017 Carlton 50326.0 873618 22 39709.9091 6.0
190 2017 Collingwood 75879.0 1003991 22 45635.9545 9.0
191 2017 Essendon 67768.0 1019757 22 46352.5909 12.0
192 2017 Fremantle 51254.0 690906 22 31404.8182 8.0
193 2017 GWS 20944.0 431975 22 19635.2273 14.0
194 2017 Geelong 54854.0 791228 22 35964.9091 15.0
195 2017 Gold Coast 11665.0 369482 22 16794.6364 6.0
196 2017 Hawthorn 75663.0 866536 22 39388.0000 10.0
197 2017 Melbourne 42233.0 833190 22 37872.2727 12.0
198 2017 North Melbourne 40343.0 554306 22 25195.7273 6.0
199 2017 Port Adelaide 52129.0 699377 22 31789.8636 14.0
200 2017 Richmond 72669.0 1024751 22 46579.5909 15.0
201 2017 St Kilda 42052.0 761222 22 34601.0000 11.0
202 2017 Sydney 58838.0 794909 22 36132.2273 14.0
203 2017 West Coast 65064.0 713672 22 32439.6364 12.0
204 2017 Western Bulldogs 47653.0 729431 22 33155.9545 11.0

205 rows × 7 columns

The above table shows the combined datasets.

In [30]:
# Calculate the totals for each year across the combined clubs
year_stats = year_club_stats.groupby("year").sum()
# As there are always two clubs in every game, each game is counted twice, so halve the total.
year_stats.games = year_stats.games / 2
# Similarly, each game's attendees are counted once per participating club.
year_stats.attendees = year_stats.attendees / 2
# Calculate average attendance at each game
year_stats['avg attendance'] = year_stats.attendees / year_stats.games
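The halving above is needed because each game appears twice in the long-format data, once per participating club. A self-contained sketch with one toy game (made-up figures) shows the correction:

```python
import pandas as pd

# One game between two clubs is recorded on two rows, so naive sums
# double-count both games and attendees.
rows = pd.DataFrame({
    "year": ["2015", "2015"],
    "club": ["Geelong", "Hawthorn"],
    "attendees": [40000, 40000],  # the same crowd, once per club row
    "games": [1, 1],
})
totals = rows.groupby("year")[["attendees", "games"]].sum()
totals["games"] = totals["games"] / 2
totals["attendees"] = totals["attendees"] / 2
totals["avg attendance"] = totals["attendees"] / totals["games"]
print(totals)
```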
In [31]:
# Optionally exclude the most recent, incomplete season from the plot
# (the filter is currently disabled).
p_data = year_stats  # [year_stats.index != "2016"]
p = p_data.plot.scatter(x = "members", y = "avg attendance")

# Annotate the plot with the year
# Following solution from: http://stackoverflow.com/questions/32154518/annotate-labels-in-pandas-scatter-plot
def annotate_df(row):  
    p.annotate(row.name, row[['members', 'avg attendance']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

We can see a steady increase in total members each year since 2010; however, it is interesting to note that this has not resulted in increased average attendances. Average attendance actually dropped significantly in 2012.

One possible reason for this is the introduction of two new clubs, Gold Coast and GWS, in 2011 and 2012 respectively. These two clubs have recorded few wins per year, which may have reduced the quality of the competition for spectators and pulled down the average crowd size.
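One way to probe this hypothesis is to recompute the yearly average attendance with the two expansion clubs excluded. The sketch below does this on a hand-copied subset of the 2015 figures from the combined table (the real analysis would filter `year_club_stats` instead):

```python
import pandas as pd

# A few 2015 rows copied from the combined table above.
rows = pd.DataFrame({
    "club": ["Geelong", "Hawthorn", "GWS", "Gold Coast"],
    "attendees": [720060, 899917, 402018, 394216],
    "games": [21, 22, 22, 22],
})
# Average attendance across all four clubs vs. the established clubs only.
established = rows[~rows["club"].isin(["GWS", "Gold Coast"])]
avg_all = rows["attendees"].sum() / float(rows["games"].sum())
avg_established = established["attendees"].sum() / float(established["games"].sum())
print(avg_all, avg_established)
```

On these rows, excluding the expansion clubs raises the average noticeably, consistent with the hypothesis.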

In [32]:
# Similar code to the above, same explanation.
p_data = year_stats#[year_stats.index != "2016"]
p = p_data.plot.scatter(x = "members", y = "attendees")

def annotate_df(row):  
    p.annotate(row.name, row[['members', 'attendees']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

Attendance per year has remained roughly constant, with very little fluctuation, despite significant increases in total membership numbers.

In [33]:
# Similar code to the above, same explanation.
p_data = year_stats.join(dnp_pcap_au, how = "inner")
p =  p_data.plot.scatter(x = "GNI_per_capita", y = "members")

def annotate_df(row):  
    p.annotate(row.name, row[['GNI_per_capita', 'members']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

From the above plot it appears that club memberships correlate with GNI per capita.

We can see that GNI per capita has generally increased year-on-year although 2014 showed a small decline from 2013.

It is interesting to see that membership numbers have also increased each year except for 2008 and 2009. One possible explanation is the Global Financial Crisis, which affected Australia's economy around that time.

In [34]:
# Similar code to the above, same explanation.
p_data = year_stats.join(dnp_pcap_au, how = "inner")
p = p_data.plot.scatter(x = "GNI_per_capita", y = "attendees")

def annotate_df(row):  
    p.annotate(row.name, row[['GNI_per_capita', 'attendees']].values)

p_data.apply(annotate_df, axis = 1)

plt.show()

GNI per capita does not appear to correlate with total attendance at games, in contrast to the previous plot, which showed some correlation with club memberships.

In [35]:
# As we only have current Twitter followers we must compare to the most up to date
# club stats. The last complete year for club stats is 2015 so we compare with that year.

# Join the Twitter data with the year_club_stats for 2015.
club_stats_2015 = year_club_stats[year_club_stats.year == "2015"].join(
    afl_clubs_df, how = "outer", on = "club"
)

# Keep only the columns we are interested in.
club_stats_2015 = club_stats_2015[
    ['club', 'members', 'attendees', 'games', 'wins', 'followers', 'avg attendance']
]

# Plot two metrics from the above
club_stats_2015.plot.scatter(x = "members", y = "followers")
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa465a54a50>

The above plot compares current Twitter followers with 2015 memberships for each club. It shows a correlation between membership numbers and Twitter followers.

Note: the Twitter API does not provide historical follower counts for past years, and 2016 membership numbers are yet to be finalised.

In [36]:
club_stats_2015.plot.scatter(x = "avg attendance", y = "followers")
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa465520250>

The above plot shows a correlation between Twitter followers and average game attendance.

In [37]:
year_club_stats.plot.scatter(x = "members", y = "avg attendance")
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa46582d290>

The above plot shows the average attendance compared to membership size for each club per year. This shows a strong correlation between average attendance and membership size.

In [38]:
year_club_stats.plot.scatter(x = "members", y = "wins")
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa465520e50>

The above plot shows the number of wins compared to membership size for each club per year. Membership size and number of wins do not appear to be significantly correlated.

In [39]:
year_club_stats.plot.scatter(x = "avg attendance", y = "wins")
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa4654440d0>

The above plot shows the number of wins compared to average attendance for each club per year. It shows some correlation between the two: games involving poorly performing teams tend to draw smaller crowds.

In [ ]:
# Save the data frames to CSV files.
dnp_pcap_au.to_csv("Australia_GNI_per_capita.csv") 
afl_clubs_df.to_csv("Twitter_followers_by_AFL_club.csv")
year_club_stats.to_csv("Stats_by_club_and_year.csv")

ConclusionsΒΆ

  • The increase in total combined memberships across clubs has not corresponded with an increase in combined yearly crowd attendance.
  • However, membership size for a club positively correlates with the average crowd attendance at their games.
  • GNI per capita, an indicator of population wealth, correlates with club membership size but does not correlate with crowd attendance.
  • Club memberships declined around the time of the Global Financial Crisis although overall attendance at games did not.
  • Number of followers on Twitter is positively related to club membership size and crowd attendance.
  • Games involving poorly performing teams tend to have lower crowd attendance.
  • There is no clear relationship between club membership and number of wins per year.
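The correlation claims above could also be quantified numerically with pandas' `corr` method; a minimal sketch on a few hypothetical club rows (the real analysis would run it on `year_club_stats`):

```python
import pandas as pd

# Hypothetical stand-in rows: members and average attendance move together,
# while wins vary less systematically.
stats = pd.DataFrame({
    "members": [46472, 24873, 74643, 12854],
    "avg attendance": [36182, 29520, 44191, 15984],
    "wins": [8, 12, 9, 6],
})
corr = stats.corr(method="pearson")
print(corr.round(2))
```

A Pearson coefficient close to 1 between members and average attendance would support the membership/attendance relationship noted above.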