Author: Johann de Boer
Year-on-year comparison of club success, membership size, popularity, game attendance and Australian population wealth.
Identified data sources:
The Australian Football League (AFL) is a professional competition in Australia in the sport of Australian Rules Football. There are currently 18 teams spread across Australia, and each season consists of each team playing approximately 22 games. The top eight teams then compete in a finals series, with the last two remaining teams meeting in a grand final to determine the premier for the year. Some key points relevant to this project are:
Questions for data analysis:
!pip install tweepy
!pip install pandas_datareader
!pip install xlrd
!pip install tables
from IPython.display import display
from scipy import stats
from itertools import groupby
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pylab import rcParams # controls for plot rendering
import sqlite3
from datetime import datetime
import tweepy
import time
import os.path
import json
from pandas_datareader import data, wb
import requests
from bs4 import BeautifulSoup
import xlrd
mpl.style.use('ggplot')
pd.set_option('display.html.border', 1)
# Use fully-qualified option names; bare names such as 'precision' are ambiguous in newer pandas.
pd.set_option('display.max_columns', 12)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 100)
pd.set_option('display.multi_sparse', False)
pd.set_option('display.expand_frame_repr', True)
pd.set_option('display.max_colwidth', 60)
pd.set_option('display.precision', 4)
%matplotlib inline
We obtain the Gross National Income (GNI) of Australia per capita (Atlas method in terms of current US$) using The World Bank data API. For details about this data, please refer to this link: http://data.worldbank.org/indicator/NY.GNP.PCAP.CD
Our hypothesis is that crowd attendance and club membership are correlated with the shared wealth of the country: football supporters subscribe to clubs and purchase tickets when they have more disposable income. GNI per capita has been chosen as an indicator of the general wealth of the population. Later in this analysis we will compare this measure against crowd attendance and club membership figures.
# Executes World Bank API query, removes the country information,
# renames the metric column and removes entries with no data,
# i.e. outside of the reportable time-frame.
dnp_pcap_au = wb.download(
indicator = 'NY.GNP.PCAP.CD',
country = ['AU'],
start = 1960,
end = 2018
).reset_index("country").drop("country", axis = 1).rename(
columns = {"NY.GNP.PCAP.CD" : "GNI_per_capita"}
).dropna()
# DataFrame.append is deprecated in recent pandas; concatenate the provisional 2017 figure instead.
dnp_pcap_au = pd.concat([dnp_pcap_au, pd.DataFrame(data = {'GNI_per_capita' : 46560}, index = ['2017'])])
dnp_pcap_au = dnp_pcap_au.sort_index(ascending = False)
dnp_pcap_au
A full set of data was obtained from the World Bank API, covering the years 1964 through 2014 (inclusive). The plot below shows the year-on-year trend in GNI per capita.
# Plot GNI per capita in order of year.
dnp_pcap_au.sort_index().plot()
We can see from the plot above that the average wealth of Australians, in current US dollars, has increased almost continuously over the entire time-frame, with only a few small downturns. The level of wealth appears to have risen roughly exponentially year-on-year.
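The impression of exponential growth can be checked numerically with a compound annual growth rate (CAGR). A minimal sketch follows; the 1964 starting figure of US$2,300 is illustrative only (the 2017 figure of US$46,560 is the one used earlier):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two observations."""
    return (end_value / start_value) ** (1 / years) - 1

# Illustrative: US$2,300 in 1964 rising to US$46,560 in 2017 implies
# roughly 5.8% average annual growth in nominal terms.
growth = cagr(2300, 46560, 2017 - 1964)
print(round(growth * 100, 1))  # → 5.8
```

A series growing at a constant rate like this plots as a straight line on a log-scaled y-axis, which is an easy visual test of the "exponential" claim.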
Each of the AFL clubs have a public presence on Twitter. It is hypothesised that the number of followers each club has on Twitter will be an indicator of a club's current popularity. We will now obtain the list of AFL clubs and their followers on Twitter. Later in this analysis, the number of Twitter followers will be compared to total membership and crowd attendance for each club to determine if these parameters are correlated.
# PLEASE DO NOT SHARE THESE CREDENTIALS AS THEY ARE PRIVATE AND CONFIDENTIAL
consumer_key="XXYY"
consumer_secret="XXYY"
access_token="XXYY"
access_token_secret="XXYY"
# ==========================================================================
# Twiter public API documentation: https://dev.twitter.com/rest/public
# Tweepy documentation: http://tweepy.readthedocs.io/en/v3.5.0/index.html
# Set the credentials for accessing the Twitter API.
# Perform the authentication handshake with the Twitter API using the above credentials.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)
# Create an object to use as a connection to the authenticated API
api = tweepy.API(auth)
# Rate limit handler for the Twitter API.
def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            print("Rate limited. Waiting 15 minutes...")
            time.sleep(15 * 60)
            print("OK, wait is over.")
            continue
        except tweepy.TweepError as e:
            print("Tweep Error: {0}".format(str(e)))
            break
afl_clubs = api.list_members("AFLPlayers", "clubs")
In the above code, the Twitter accounts of the AFL clubs are obtained from the 'clubs' list on the AFL Players' Association's Twitter account. The Twitter API is used to retrieve the members of this list, including the number of followers of each club.
To ensure consistency across each of the data sources used in this analysis, club names will be adjusted to be consistent.
afl_clubs_df = pd.DataFrame()
# Define a function that will be used to add clubs to the above empty data frame of AFL club
# Twitter accounts along with a selection of associated details (including number of followers).
def add_club_to_df(df, club):
    # DataFrame.append is deprecated in recent pandas; build a one-row
    # frame and concatenate it instead.
    new_row = pd.DataFrame([{
        "id" : club.id_str,
        "name" : club.name,
        "followers" : club.followers_count,
        "location" : club.location,
        "tz" : club.time_zone,
        "handle" : club.screen_name,
        "image" : club.profile_image_url
    }])
    return pd.concat([df, new_row], ignore_index = True)
# The above function is then applied to each AFL account from the list returned from the
# Twitter API.
for afl_club in afl_clubs:
    afl_clubs_df = add_club_to_df(afl_clubs_df, afl_club)
# The resulting data frame is then tidied up
afl_clubs_df = afl_clubs_df.set_index("id")
afl_clubs_df.followers = afl_clubs_df.followers.astype("int")
afl_clubs_df = afl_clubs_df[["name", "location", "tz", "followers", "handle", "image"]]
afl_clubs_df = afl_clubs_df.rename(columns = {"name" : "club"})
afl_clubs_df.club = afl_clubs_df.club.str.replace("GWS GIANTS", "GWS")
afl_clubs_df.club = afl_clubs_df.club.str.replace("GC SUNS", "Gold Coast")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" FC", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Dockers", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Lions", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Cats", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Eagles", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Crows", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(" Swans", "")
afl_clubs_df.club = afl_clubs_df.club.str.replace(r'\W+', "")
afl_clubs_df = afl_clubs_df.set_index('club')
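The chain of `str.replace` calls above can also be expressed as a single helper, which is easier to reuse when the same names must be normalised for each data source. This is a sketch: `normalise_club` and `CLUB_ALIASES` are names introduced here, and the alias map is illustrative rather than exhaustive.

```python
import re

# Hypothetical helper consolidating the chained replacements above; the
# alias map is illustrative only.
CLUB_ALIASES = {
    "GWS GIANTS": "GWS",
    "GC SUNS": "GoldCoast",
}

def normalise_club(name):
    name = CLUB_ALIASES.get(name, name)
    # Strip common suffixes, then any remaining non-word characters.
    for suffix in (" FC", " Dockers", " Lions", " Cats", " Eagles", " Crows", " Swans"):
        name = name.replace(suffix, "")
    return re.sub(r"\W+", "", name)

print(normalise_club("West Coast Eagles"))  # WestCoast
```

Applying one function everywhere also means a future rename only needs to be fixed in one place.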
The following graph shows the number of Twitter followers for each club:
# Plot number of followers for each club in order of lowest to highest.
afl_clubs_df['followers'].sort_values().plot.barh()
Memberships for 2006 through to 2009 are obtained from footyindustry.com:
# Fetch the page with the club membership details on it
memberships_2006_2009_url = "http://www.footyindustry.com/wiki/index.php?title=Club_Memberships"
memberships_2006_2009_response = requests.get(memberships_2006_2009_url)
# Parse the page and extract the tables from it. Each table represents a single year.
memberships_2006_2009 = BeautifulSoup(memberships_2006_2009_response.content, "lxml")
memberships_2006_2009_tables = memberships_2006_2009.findAll("table")
# Convert the tables into data frames, selecting only the tables we require.
memberships_2006_2009_dfs = pd.read_html(
str(memberships_2006_2009_tables[3:7]),
header = 1,
index_col = 0
)
# Remove rows with missing values and the final totals row from each table, keeping only
# the number of members with 'game access' type memberships.
memberships_2006_2009_dfs = [df.dropna().iloc[0:-1, 3] for df in memberships_2006_2009_dfs]
# Combine all the tables into a single data frame with a column indicating the year
memberships_2006_2009_dfs = pd.concat(
{year: df for year, df in zip(['2006', '2007', '2008', '2009'], memberships_2006_2009_dfs)},
names = ['year', 'club']
)
# Convert the series into data frames with clubs as rows and years as columns
memberships_2006_2009_df = memberships_2006_2009_dfs.reset_index(name = "members").pivot_table(
index = "club", columns = "year", values = "members", aggfunc = sum
)
Memberships for 2010 and 2011 are obtained from Wikipedia:
# Download the webpage with the membership figures on it
memberships_2011_url = "https://en.wikipedia.org/wiki/2011_AFL_season"
memberships_2011_response = requests.get(memberships_2011_url)
# Parse the webpage, extracting the table found in the club membership section
memberships_2011 = BeautifulSoup(memberships_2011_response.content, "lxml")
memberships_2011 = memberships_2011.find(attrs={"id" : "Club_membership"}).find_next("table")
# Convert the table into a data frame, naming the columns and discarding the last
# column that we do not require.
memberships_2011_df = pd.read_html(str(memberships_2011), header = 0)[0].iloc[:, 0:-1]
memberships_2011_df.columns = ["club", "2011", "change_from_2010"]
# Clean up the data in the table to calculate the 2010 figures
# If the 'change' column does not start with a digit, then it must be a negative change.
# Extract the change from 2010 to 2011 to calculate the original 2010 figures.
memberships_2011_df.change_from_2010 = memberships_2011_df.change_from_2010.str.replace(
    r"^[^0-9]+", "-", regex = True
).str.extract(
    r"(.+) !", expand = False
).astype(float)
memberships_2011_df['2010'] = memberships_2011_df['2011'] - memberships_2011_df.change_from_2010
# Arrange the dataframe with club as the rows and keeping just the 2010 and 2011 columns.
memberships_2011_df = memberships_2011_df.set_index('club')
memberships_2011_df = memberships_2011_df[["2010", "2011"]]
# Tidy up the club names for consistency and remove the final 'total' row from the data frame.
memberships_2011_df.index = memberships_2011_df.index.str.replace(" Lions", "")
memberships_2011_df.index = memberships_2011_df.index.str.replace(" Eagles", "")
memberships_2011_df = memberships_2011_df.drop("Total")
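The reconstruction of the 2010 figures can be illustrated on a single made-up row (the club name and numbers here are hypothetical, not the scraped values):

```python
import pandas as pd

# Current-year members minus the signed change recovers the prior year.
df = pd.DataFrame({"club": ["Essendon"], "2011": [60000], "change_from_2010": [2500]})
df["2010"] = df["2011"] - df["change_from_2010"]
print(df.loc[0, "2010"])  # 57500
```

A negative `change_from_2010` (produced by the sign-prefix replacement above) would correspondingly yield a larger 2010 figure than 2011.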
Memberships for 2012 are obtained from a sports news article on the news.com.au website:
# Download the webpage with the club membership stats on it
memberships_2012_url = "http://www.news.com.au/sport/afl/collingwood-and-hawthorn-record-membership-increases/story-fnelctok-1226466581212"
memberships_2012_response = requests.get(memberships_2012_url)
# Parse the webpage, defining a regular expression to extract the club name and membership
# size from the text contained in bulleted paragraphs.
memberships_2012_memberships = BeautifulSoup(memberships_2012_response.content, "lxml")
memberships_2012_regex = re.compile("([a-zA-Z ]+): ([0-9]{2},[0-9]{3})")
memberships_2012_tags = memberships_2012_memberships.find_all(string = memberships_2012_regex)
# Extract the club name and membership size separately (using the regular expression defined
# above) for each paragraph as a list of tuples, then coerce these into a data frame with the
# columns named 'club' and 'members'.
memberships_2012_df = pd.DataFrame(
[re.match(memberships_2012_regex, club).groups() for club in memberships_2012_tags],
columns = ["club", "members"]
)
# Set the club to index the rows, format the club sizes into numbers, set the column to the year
# and tidy the club names for consistency.
memberships_2012_df = memberships_2012_df.set_index("club")
memberships_2012_df.members = memberships_2012_df.members.str.replace(",", "").astype("int")
memberships_2012_df = memberships_2012_df.rename(columns = {"members": "2012"})
memberships_2012_df.index = memberships_2012_df.index.str.replace(" Giants", "")
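The regex-driven extraction above can be exercised on its own with made-up bullet text (the club figures below are illustrative, not the article's values):

```python
import re
import pandas as pd

# Same pattern as used for the news article: a club name, a colon, and a
# five-digit figure with a thousands separator.
pattern = re.compile(r"([a-zA-Z ]+): ([0-9]{2},[0-9]{3})")
lines = ["Collingwood: 72,688", "Hawthorn: 60,841"]
df = pd.DataFrame(
    [pattern.match(line).groups() for line in lines],
    columns = ["club", "members"],
)
df["members"] = df["members"].str.replace(",", "").astype(int)
print(df["members"].sum())  # 133529
```

Note that the `{2},{3}` quantifiers assume five-digit membership counts; a club below 10,000 members would be silently skipped by this pattern.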
Memberships for the years 2013 and 2014 are obtained from an article from the official AFL website:
# Download the webpage
memberships_2013_2014_url = "http://www.afl.com.au/news/2014-08-02/club-memberships-rise"
memberships_2013_2014_response = requests.get(memberships_2013_2014_url)
# Parse the webpage by extracting the table and converting it into a data frame with
# appropriately named columns, setting the club name as the row index and ordering the years
# from left to right.
memberships_2013_2014 = BeautifulSoup(memberships_2013_2014_response.content, "lxml")
memberships_2013_2014_table = memberships_2013_2014.find("table")
memberships_2013_2014_df = pd.read_html(str(memberships_2013_2014_table), header = 0)[0]
memberships_2013_2014_df.columns = ["club", "2013", "2014"]
memberships_2013_2014_df = memberships_2013_2014_df.set_index("club")
memberships_2013_2014_df[["2013", "2014"]] = memberships_2013_2014_df[["2013", "2014"]].apply(
    # Strip asterisks and commas, then convert the membership counts to integers.
    lambda s: s.astype(str).str.replace(r"[*,]", "", regex = True).astype(int),
    axis = 1
)
# Tidy up the club names to be consistent
memberships_2013_2014_df.index = memberships_2013_2014_df.index.str.replace(" Lions", "")
memberships_2013_2014_df.index = memberships_2013_2014_df.index.str.replace(" Swans", "")
Finally, memberships for the years 2015, 2016 and 2017 are obtained from articles on aflonline.com.au:
def get_memberships_article(year = "2016"):
    # Download the webpage
    memberships_url = "http://www.aflonline.com.au/afl-news/current-afl-membership-numbers-for-" + str(year) + "/"
    memberships_response = requests.get(memberships_url)
    # Parse the webpage by extracting the table, converting it into a data frame and naming the columns.
    memberships = BeautifulSoup(memberships_response.content, "lxml")
    memberships_table = memberships.find("table")
    memberships_df = pd.read_html(str(memberships_table), header = 0)[0]
    memberships_df = memberships_df.iloc[:, 0:3]
    memberships_df.columns = ["club", "year2", "year1"]
    # Set the club name as the row index.
    memberships_df = memberships_df.set_index("club")
    # Tidy the club names for consistency.
    memberships_df.index = memberships_df.index.str.replace(" Lions", "")
    memberships_df.index = memberships_df.index.str.replace(" Giants", "")
    memberships_df.index = memberships_df.index.str.replace(r"Melb$", "Melbourne", regex = True)
    # Order the years from left to right and drop the final 'TOTAL' row.
    memberships_df = memberships_df[['year1', 'year2']]
    memberships_df = memberships_df.iloc[memberships_df.index != "TOTAL", :]
    return memberships_df
memberships_2015_2016_df = get_memberships_article("2016")
memberships_2016_2017_df = get_memberships_article("2017")
memberships_2015_2016_df.columns = ["2015", "2016"]
memberships_2016_2017_df.columns = ["2016", "2017"]
memberships_2015_onwards = memberships_2015_2016_df[["2015"]]
memberships_2015_onwards = memberships_2015_onwards.join(memberships_2016_2017_df, how = "outer")
memberships_2015_onwards
The memberships from each of the above sources are then concatenated together to form a single table.
# Concatenate the scraped membership stats into a single data frame with a multi-index
# for clubs and year forming the rows.
club_memberships_df = memberships_2011_df.join([
memberships_2006_2009_df,
memberships_2012_df,
memberships_2013_2014_df,
memberships_2015_onwards
], how = "outer")
club_memberships_df.index.name = "club"
club_memberships_df = club_memberships_df.stack()
club_memberships_df.index.names = ['club', 'year']
club_memberships_df = club_memberships_df.reset_index(name = "members")
club_memberships_df = club_memberships_df.set_index(['year', 'club'])
The heatmap below compares the membership size of each club year-on-year. Note that Gold Coast and GWS started as AFL clubs in 2011 and 2012, respectively, and therefore had no prior membership figures.
# Transform the memberships data frame into wide format.
memberships_club_year_pivot = club_memberships_df.reset_index().pivot_table(
values = "members", index = "club", columns = "year", aggfunc = sum, dropna = False
)
# Produce a heatmap, with the clubs sorted in order of total membership renewals
sns.heatmap(
memberships_club_year_pivot.loc[
memberships_club_year_pivot.sum(axis = 1).sort_values(ascending = False).index
],
robust = True
)
plt.yticks(rotation = 0)
plt.show()
We can see that Collingwood memberships increased significantly in 2011. As we will see later in this report, this corresponds with an increase in game wins. The same can be said for Hawthorn from 2014 onwards. However, many teams show little to no change in membership regardless of the number of wins. For example, the Western Bulldogs have seen minimal change in membership despite large variations in game wins (e.g. 10 wins in 2010, 5 in 2012 and 14 in 2015).
Interestingly, we can see that in 2009 club memberships were generally low across all of the clubs but then increased again in 2010. This is at the time of the Global Financial Crisis (GFC), although population wealth at the time did not decrease as we saw earlier from the plot of GNI per capita year-on-year. Therefore we cannot, at this stage, state whether the GFC is the reason for the coinciding dip in memberships.
Now we obtain the attendance numbers from afltables.com:
# The following function can be used to scrape a series of pages from the same website, where the
# difference in extracting the crowd attendance data is the year given in the URL.
def get_attendance(year = "2016"):
    # Download the required page
    url = "http://afltables.com/afl/crowds/" + str(year) + ".html"
    response = requests.get(url)
    # Parse the page by extracting the table and converting it into a data frame, keeping
    # only the required columns.
    attendances = BeautifulSoup(response.content, "lxml")
    attendance_by_team = attendances.find("table")
    df = pd.read_html(str(attendance_by_team), header = 0)[0].iloc[:-1, 0:7]
    # Name the columns
    df.columns = [
        "club",
        "home_attendance", "home_games", "home_avg",
        "away_attendance", "away_games", "away_avg"
    ]
    # Use club name as the index for the rows.
    df = df.set_index("club")
    # Remove the average columns, which we are not interested in, and the repeated header row.
    df = df.drop(["home_avg", "away_avg"], axis = 1)
    df = df.drop(index = "Team")
    # Coerce the numbers into integers.
    df.home_attendance = df.home_attendance.astype("int")
    df.home_games = df.home_games.astype("int")
    df.away_attendance = df.away_attendance.astype("int")
    df.away_games = df.away_games.astype("int")
    # Rename clubs to be consistent with other datasets.
    df.index = df.index.str.replace(" Lions", "")
    df.index = df.index.str.replace("Greater Western Sydney", "GWS")
    df.index = df.index.str.replace("Kangaroos", "North Melbourne")
    return df
# Apply the above function to scrape the website for attendance data for the years 2006 to 2017 (inclusive)
attendance_by_year = {str(year): get_attendance(year) for year in range(2006, 2018)}
attendance_by_year = pd.concat(attendance_by_year, names = ["year"])
# Calculate new fields for total attendees, total games and average attendance, keeping only
# the calculated columns in the data frame already indexed by year and club name.
attendance_by_year['attendees'] = attendance_by_year['home_attendance'] + attendance_by_year['away_attendance']
attendance_by_year['games'] = attendance_by_year['home_games'] + attendance_by_year['away_games']
attendance_by_year = attendance_by_year[['attendees', 'games']]
attendance_by_year['avg attendance'] = attendance_by_year['attendees'] / attendance_by_year['games']
The following heatmap compares the average attendance per game by playing team, year-on-year:
# Transform the attendance by club and year into wide format
attendance_club_year_pivot = attendance_by_year.reset_index().pivot_table(
values = "avg attendance", index = "club", columns = "year", dropna = False
)
# Plot a heatmap of attendance by year and club. Sort the clubs by highest total spectators to least
sns.heatmap(
attendance_club_year_pivot.loc[
attendance_club_year_pivot.sum(axis = 1).sort_values(ascending = False).index
],
robust = True
)
plt.yticks(rotation = 0)
plt.show()
There is a tendency for Victorian based clubs to record higher attendance figures when compared to non-Victorian clubs. This may be due to cultural reasons such as the sport having a stronger following in Victoria compared to other states.
We now read the Excel file containing the match results for each game:
# Open the spreadsheet file, selecting only the columns we are interested in.
odds_df = pd.read_excel(
"afl.xlsx",
skiprows = 1,
usecols = "A:AY"
)
# Tidy up the team names to be consistent
odds_df[['Home Team', 'Away Team']] = odds_df[['Home Team', 'Away Team']].apply(
lambda club: club.str.replace(" Giants", ""),
axis = 1
)
# Keep the columns we are interested in and rename the column indicating whether the game is a final.
odds_df = odds_df[['Date', 'Home Team', 'Away Team', 'Home Score', 'Away Score', 'Play Off Game?']]
odds_df = odds_df.rename(columns = {"Play Off Game?": "Final"})
# Games not flagged as finals have a missing value; fill these with 'N', then keep only
# home-and-away (non-final) games.
odds_df.Final = odds_df.Final.fillna(value = "N")
odds_df = odds_df[odds_df.Final == "N"]
# Identify games where the home team won, the away team won and where there were draws.
odds_df['Home Win'] = odds_df['Home Score'] > odds_df['Away Score']
odds_df['Away Win'] = odds_df['Home Score'] < odds_df['Away Score']
odds_df['Draw'] = odds_df['Home Score'] == odds_df['Away Score']
# Identify the names of the teams that won each game
odds_df['Home Win'] = odds_df['Home Team'].where(odds_df['Home Win'])
odds_df['Away Win'] = odds_df['Away Team'].where(odds_df['Away Win'])
# A function to return the name of the team that won a game (represented as a row)
def decide_winning_team(row):
    if row['Draw'] == True:
        return "Draw"
    elif pd.isnull(row['Home Win']):
        return row['Away Win']
    elif pd.isnull(row['Away Win']):
        return row['Home Win']
# Reduce the columns down to just one column using the above function that indicates which
# team was the winning team, otherwise a draw.
odds_df['Winning Team'] = odds_df[['Home Win', 'Away Win', 'Draw']].apply(
decide_winning_team,
axis = 1
)
# Remove the redundant 'Draw' column as we have this encoded in the above already.
odds_df = odds_df.drop("Draw", axis = 1)
# Determine the year (season) of the game.
odds_df['year'] = odds_df['Date'].dt.year.astype("str")
# Determine the number of games won, grouped by each team
winning_team_by_year = odds_df.groupby(['year', 'Winning Team']).size()
winning_team_by_year = winning_team_by_year.reset_index(name = "wins")
# Format the number of games won by team data frame into the same structure as our other
# datasets, i.e. rows indexed by year and club.
winning_team_by_year = winning_team_by_year.rename(columns = {"Winning Team": "club"})
winning_team_by_year = winning_team_by_year.set_index(['year', 'club'])
# Any years where a team is not mentioned in this must be a team that had zero wins that year.
winning_team_by_year = winning_team_by_year.reset_index().pivot_table(
values = "wins", index = "club", columns = "year", fill_value = 0, dropna = False
).stack().reset_index(name = "wins").rename(
columns = {'level_1' : 'year'}
).set_index(['year', 'club']).sort_index()
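The `fill_value = 0` step above matters because a club with no wins in a season is simply absent from the grouped winners. A toy example with made-up records shows the behaviour:

```python
import pandas as pd

# Sydney has no 2015 row, so without fill_value the pivot would hold NaN
# there instead of the correct count of zero wins.
long = pd.DataFrame({
    "year": ["2014", "2014", "2015"],
    "club": ["Hawthorn", "Sydney", "Hawthorn"],
    "wins": [17, 14, 16],
})
wide = long.pivot_table(values = "wins", index = "club", columns = "year", fill_value = 0)
print(wide.loc["Sydney", "2015"])  # 0
```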
The following heatmap compares the number of games won by each team year-on-year:
# Transform from long to wide format.
winning_team_by_year_pivot = winning_team_by_year.reset_index().pivot_table(
values = "wins", index = "club", columns = "year", dropna = False
)
# Plot heatmap, in order of club with most wins to least.
sns.heatmap(
winning_team_by_year_pivot.loc[
winning_team_by_year_pivot.sum(axis = 1).sort_values(ascending = False).index
],
robust = True
)
plt.yticks(rotation = 0)
plt.show()
The graph above ranks the clubs by total number of wins over the entire time frame and compares their wins year-on-year. It also shows when there were draws. We can see that the top two clubs (Geelong and Hawthorn) have had periods of sustained success, with both recording a high number of wins each season since 2011. GWS and Gold Coast have recorded the fewest wins, however this is because they are new clubs that only entered the AFL in 2012 and 2011 respectively. Note: figures shown for 2016 are incomplete because the season is yet to finish.
The table below shows the same information with the clubs ordered alphabetically. It provides the number of wins for each club, each year.
winning_team_by_year_pivot
We can now combine the club memberships, attendance and winner results into one data frame as they are all consistently indexed by club and year. Note that we do not have game results prior to 2009.
# Join attendance data with club memberships and wins
year_club_stats = club_memberships_df.join(
attendance_by_year, how = "inner"
).join(
winning_team_by_year, how = "left"
)
year_club_stats = year_club_stats.reset_index()
year_club_stats
The above table shows the combined datasets.
# Calculate the totals for each year across the combined clubs
year_stats = year_club_stats.groupby("year").sum()
# As there are always two clubs in every game, the total games should be halved.
year_stats.games = year_stats.games / 2
# Similarly attendees are spectators of both teams at the same game.
year_stats.attendees = year_stats.attendees / 2
# Calculate average attendance at each game
year_stats['avg attendance'] = year_stats.attendees / year_stats.games
# 2016 is an incomplete year; uncomment the filter below to exclude it from the plot.
p_data = year_stats  # [year_stats.index != "2016"]
p = p_data.plot.scatter(x = "members", y = "avg attendance")
# Annotate the plot with the year
# Following solution from: http://stackoverflow.com/questions/32154518/annotate-labels-in-pandas-scatter-plot
def annotate_df(row):
    p.annotate(row.name, row[['members', 'avg attendance']].values)
p_data.apply(annotate_df, axis = 1)
plt.show()
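The halving applied when totalling the per-club figures can be sanity-checked with a toy example (made-up numbers): a single game appears in both participating clubs' rows, so summing across clubs counts it twice.

```python
import pandas as pd

# One game between clubs A and B, counted once from each club's side.
per_club = pd.DataFrame({
    "club": ["A", "B"],
    "attendees": [30000, 30000],
    "games": [1, 1],
})
# Halving recovers the true totals for the combined competition.
total_games = per_club["games"].sum() / 2
total_attendees = per_club["attendees"].sum() / 2
print(total_games, total_attendees)  # 1.0 30000.0
```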
We can see a steady increase in total members each year since 2010; however, this has not resulted in increased average attendances, which in fact dropped significantly in 2012.
One possible reason is the introduction of two new clubs into the competition in 2011 and 2012 (Gold Coast and GWS). These clubs have recorded fewer wins per year, which may have reduced the quality of the competition for spectators.
# Similar code to the above, same explanation.
p_data = year_stats#[year_stats.index != "2016"]
p = p_data.plot.scatter(x = "members", y = "attendees")
def annotate_df(row):
    p.annotate(row.name, row[['members', 'attendees']].values)
p_data.apply(annotate_df, axis = 1)
plt.show()
Attendance per year has remained constant with very little fluctuation. This is despite significant increases in total membership numbers.
# Similar code to the above, same explanation.
p_data = year_stats.join(dnp_pcap_au, how = "inner")
p = p_data.plot.scatter(x = "GNI_per_capita", y = "members")
def annotate_df(row):
    p.annotate(row.name, row[['GNI_per_capita', 'members']].values)
p_data.apply(annotate_df, axis = 1)
plt.show()
From the above plot it appears that club memberships correlate with GNI per capita.
We can see that GNI per capita has generally increased year-on-year although 2014 showed a small decline from 2013.
It is interesting to see that membership numbers have also increased each year except for 2008 and 2009. One possible explanation for this could be the Global Financial Crisis which affected Australia's economy around this time.
# Similar code to the above, same explanation.
p_data = year_stats.join(dnp_pcap_au, how = "inner")
p = p_data.plot.scatter(x = "GNI_per_capita", y = "attendees")
def annotate_df(row):
    p.annotate(row.name, row[['GNI_per_capita', 'attendees']].values)
p_data.apply(annotate_df, axis = 1)
plt.show()
GNI per capita does not appear to correlate with total attendance at games, in contrast to the previous plot, which showed some correlation with club memberships.
# As we only have current Twitter followers we must compare to the most up to date
# club stats. The last complete year for club stats is 2015 so we compare with that year.
# Join the Twitter data with the year_club_stats for 2015.
club_stats_2015 = year_club_stats[year_club_stats.year == "2015"].join(
afl_clubs_df, how = "outer", on = "club"
)
# Keep only the columns we are interested in.
club_stats_2015 = club_stats_2015[
['club', 'members', 'attendees', 'games', 'wins', 'followers', 'avg attendance']
]
# Plot two metrics from the above
club_stats_2015.plot.scatter(x = "members", y = "followers")
The above plot compares current Twitter followers with 2015 memberships for each club. It shows a correlation between membership numbers and Twitter followers.
Note: historical follower counts cannot be obtained through the Twitter API, and 2016 membership numbers are yet to be finalised.
club_stats_2015.plot.scatter(x = "avg attendance", y = "followers")
The above plot shows a correlation between Twitter followers and average game attendance.
year_club_stats.plot.scatter(x = "members", y = "avg attendance")
The above plot shows average attendance compared to membership size for each club per year, revealing a strong correlation between the two.
year_club_stats.plot.scatter(x = "members", y = "wins")
The above plot shows the number of wins compared to membership size for each club per year. Membership size and number of wins do not appear to be significantly correlated.
year_club_stats.plot.scatter(x = "avg attendance", y = "wins")
The above plot shows the number of wins compared to average attendance for each club per year. It shows some correlation between number of wins and average attendance: games involving poorly performing teams tend to draw lower crowds.
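The correlation statements in this section are judged by eye from scatter plots. A correlation coefficient would make them quantitative; a sketch follows, using illustrative stand-in arrays rather than the real `members` and `avg attendance` columns.

```python
import numpy as np

# Illustrative stand-ins for two columns whose correlation we want.
members = np.array([30000, 45000, 52000, 60000, 75000])
avg_attendance = np.array([21000, 30000, 34000, 41000, 52000])

# Pearson correlation coefficient between the two arrays.
r = np.corrcoef(members, avg_attendance)[0, 1]
print(round(r, 3))
```

`scipy.stats.pearsonr`, already imported at the top of the notebook, computes the same coefficient and additionally returns a p-value for a significance test.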
# Save the data frames to CSV files.
dnp_pcap_au.to_csv("Australia_GNI_per_capita.csv")
afl_clubs_df.to_csv("Twitter_followers_by_AFL_club.csv")
year_club_stats.to_csv("Stats_by_club_and_year.csv")