Foreword¶

This case is split into two parts. The first section, Analysis, discusses the high level implications of my findings. The second section, Technical Process, goes through the step by step code that I used to generate these insights.

I will be operating under the assumption that you know the basics of how chess is played. So, if you are unfamiliar with chess you may want to look at another one of my cases instead.

Analysis¶

I've been playing online chess for almost four years now and I've played a lot of games. On the one hand, this is great for data analysis because it gives me a large sample size with which to draw insights. But on the other hand, I am embarassed to admit just how many games I've played. This analysis will be using data from ~1,950 chess games that I've played over the last 3-4 years. Over 100,000 moves have been played across those games.

I figured that since Lichess.org offers an API to fetch personal game data it would be interesting to see what insights can be gleaned from the years I've spent playing online chess. The goal of this analysis is to discover actionable insights that I can use to improve my chess rating.

Let's take a look at how I've been doing so far. The graph below shows a breakdown of my rating over time broken out by different chess game variants. To put the rating numbers into context, someone who has just started playing chess will typically clock in at around 1000 - 1200 rating points, whereas grandmasters can be found in the 2000+ range. The current world champion, Magnus Carlsen, hovers at around 2800 rating points.

It is also important to understand the different game variations in the graph below. Classical, rapid, blitz, and bullet chess are all standard chess games but played with different time constraints. Three Check, is a variation of chess where the goal is to check your opponent three times to win.

The time constraints of classical, rapid, blitz, and bullet chess are as follows:

Classical is any game where each side has >15 minutes to play the game.
Rapid is between 5 and 15 minutes per side.
Blitz is between 1 and 5 minutes per side.
Bullet is < 1 minute per side. This sounds completely ridiculous, I know, but top chess players can somehow make it work.

My chess ratings over time by game speed and game variant

We can see that I am much weaker at faster time controls. In the 3-4 years that I've been playing, my bullet and blitz ratings have barely changed whereas my classical and rapid rating have increased to 1700 and 1500 rating points respectively.

Let's see what insights we can draw from my games to improve my ratings. To source my game data, I connected to the Lichess API and cleaned my data with Python then compiled that data into a PostGreSQL database. Detailed code can be found in the Technical Process section below.

Ok enough technical jargon, let's get my rating up.

Before even looking for ways to improve my actual chess play, I want to see if I can game the matchmaking system to artifically inflate my rating without improving my play. On Lichess, we can specify the rating range we would like our opponent to be within, so maybe I can play more often against weaker opponents to improve my rating. Let's check to see our win rates against different opponent strengths.

Win rate against different opponent levels

Well, right off the bat it looks like my assumption that our sample would be good enough for conclusive analysis may be false. After filtering out non-ranked games and unusual chess variations we're left with just under 1500 games. Of those 1,500 games, 1,000 of them were played against similar skill level opponents, leaving us with very small sample sizes for opponents of stronger and weaker difficulties.

Even with the small sample sizes though, we can still see a clear trend, the stronger our opponent, the lower our winning percentage. By that logic we should selectively play weaker opponents to increase our rating, right? Well, not necessarily. The rating system will award less points for a win against a weaker player while revoking significantly more points for a loss against a weaker player. Because of this, we should check our average rating gain/loss for each opponent level before assuming that a higher win rate indicates more rating points gained.

Rating points gain/loss against different opponent levels

Again it's important to acknowledge that we still have problems with our sample size with this data. But we'll work with what we've got.

Counterintuitively, even though we only win 16% of games when our opponent is much stronger than us, this difficulty actually nets the largest average rating gain. Therefore we should be selectively targeting opponents with 200 or more rating points above our current level.

Conversely, we are losing almost 30% of the time when we're playing much weaker opponents. Our loss rate against very weak opponents is double our win rate against very strong opponents. Knowing this, it makes sense that we lose a ton of rating points (8 points on average) playing against very weak players. We should be avoiding playing weak opponents at all costs. Our win rate against weak opponents is not strong enough to justify the large rating loss on losing.

Ok now that we've established criteria for selecting our opponents let's take a look at our actual chess play.

At the end of every rated game on Lichess, a computer engine evaluates the strength of each move using a metric called centipawn loss (CPL). Centipawn loss is exactly what it sounds like. It is the fraction of a pawn that you are losing each move. To elaborate, if we make a move and the CPL is 100, then we have just worsened our position by the value of 1 full pawn. The goal of each move we make is to minimize centipawn loss. A CPL of 20 is considered very good.

Let's check our CPL within different game variants to see if our lower rated variants (the shorter time controls) have higher CPLs. One would expect shorter time controls to have higher average CPL since we have less time to think through our decisions.

Centipawn loss by variant

Again we've got low sample sizes for the shorter time controls, but let's be honest. We don't need data to know that 1-5 minutes per side is crazy, and will therefore yield a very high CPL. Also, note that overall sample size has decreased because CPL records are only available if requested at the time the match was played. This may inherently bias our sample, since I would only have requested CPL records for more interesting matches. We'll work with what we've got though.

As expected the longer the game, the better our decisions are. But there is nothing actionable to be taken from that knowledge, so let's see if we can segment our games by the phase of the game to see where we can improve in each variant.

Below I've broken out average CPL by the first third of the game as the opening, the second third as the middlegame, and the remaining third as the endgame.

Centipawn loss by variant by game phase

We can again see that CPL decreases as we play longer time controls, but now we can also see that we are consistently stronger during the opening and play progressively weaker moves as games progress. The endgame is our weakest phase across all time controls.

This could be a result of mental fatigue throughout the game, but it is more likely because the opening is just easier to play correctly than the sharper positions that develop later in the game. During most opening situations I will have set moves that I intend on playing because I am following a predetermined opening line. This minimizes the risk that I make any missteps during the first moves of the game. As the game progresses the position often becomes more unclear resulting in more mistakes.

It is helpful to know that I should be focusing more on training in endgame situations rather than memorizing opening sequences, as many players spend time doing. But let's see if we can look deeper into this data. I want to see which moves I am making that are causing my endgame and middlegame CPL to increase. Let's break this out further by looking at which pieces I am moving that contribute the most to my average CPL.

Centipawn loss by variant by game phase by piece

The above data visualizes why I am not good very good at bullet and blitz chess. Those quadrants of the CPL graph are a big red-orange blob of terrible moves. Granted, the sample sizes for bullet and blitz are too low to draw any meaningful conclusions.

Focusing on classical and rapid chess, however, we can see a couple of anomalies. Our endgame pawn moves in longer chess games are very weak, specifically in classical play. With an average CPL of 100, every time we move a pawn during a classical endgame, we will on average lose a pawn's worth of value. This is very actionable information - we know that we should be researching endgame pawn strategy to improve our classical rating.

We can also see that our king moves in the opening during rapid games are very weak. This makes sense, since we really shouldn't be moving our king at all during the opening. We can see this reflected on our frequency graph, where king moves in the opening are some of the least played moves in rapid and classical... But when they are played, they are usually a bad move.

We can also see on the frequency map that rarely are we castling in the middle or endgame. This makes sense and provides a nice sanity check that our data is correct. Players will almost always castle in the early game to get their king to safety. And since a player cannot castle twice, castling is usually not playable in the middle or endgame.

Outside of poor late game pawn moves in classical play, the most noticeable trend is again that my play worsens significantly from opening to middlegame and from middlegame to endgame. This is consistent across all four variants and across almost every piece type. My end game needs work.

Ultimately, I can draw a few conclusions from this analysis:

I can artificially inflate my rating by selectively playing much stronger players than my current rating.
My play deteriorates as the game progress. I ought to spend my time learning to end game strategies rather than focusing on opening or middlegame theory.
In particular, my endgame pawn moves need serious improvement in longer time controls. I could take some time to study endgame pawn strategy so that I don't lose a pawn's worth of value each time I move a pawn.

In closing, I am a little disappointed at the results of this analysis. I thought I'd be able to come away with stronger conclusions by breaking out CPL by variant/piece/phase - but whether it was due to small sample size or my own chess inconsistencies, the patterns didn't seem very pronounced. I think that with a game as complicated as chess, to really glean mind blowing insights would require much more in depth analysis. For instance, if one were to integrate in a chess engine to go beyond the basic CPL metric (there are deeper chess metrics than just CPL, which weren't available to me) there would be many more possibilities to draw stronger insights.

I'll also admit that maybe I haven't played as many chess games as I first thought... I needed more data.

Technical Process¶

Before we can analyze anything we need to fetch our data, clean it, and store it.

First let's create a PostGreSQL server to host our database and create a database within that server to store our data.

Next let's import all the modules we will need for this analysis. Namely we will be using Python's Pandas library for much of our data manipulation and Psycopg2 to connect to our PostGres database and to send SQL queries.

#Importing all the needed modules for this analysis
import pandas as pd
import requests, json, pickle, psycopg2, datetime
from itertools import zip_longest
from IPython.display import display_html

#This function allows for displaying dataframes side by side 
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

#Formatting table displays
pd.options.display.max_columns = None
pd.options.display.max_rows = 10

Next I've written a quick Psycopg2 class to simplify connecting to our database. We will use this throughout our analysis.

#The Psycopg2 database connection class I use to connect to the database
class Database_Connection:
    def __init__(self, _dbname, _user="", _host="localhost", _password="", _port=2222):
        try:
            self.connection = psycopg2.connect(dbname=_dbname, user=_user, host=_host, password=_password, port=_port)
            self.connection.autocommit = True
            self.cursor = self.connection.cursor()
        except:
            print("Cannot connect to database\n")

    def comm(self, command, command2=None):
        self.cursor.execute(command, command2)

    def close(self):
        self.cursor.close()
        self.connection.close()

Now that we have a database to store our data let's fetch our data, clean it, and structure our database.

The script below sends a GET request to the Lichess API to fetch an NDJSON (variation of JSON) file of all of the games I've ever played, all the moves I've made, and details on the players in each of those games.

The script then converts the NDJSON file to standard JSON and the parses through the JSON file to clean the data for use in a databse. Specifically, we structure and create three data tables: 'overview', 'player_info', and 'moves'.

fetch_new_data = False
create_database = False

#Retrieves all Lichess game details from Lichess API and saves them to a JSON file
if fetch_new_data:
    url = "https://www.lichess.org/api/games/user/mbellm"
    print("Fetching game data...")
    response = requests.get(
        url,
        params={"clocks":"true", "evals":"true", "opening":"true"}, 
        headers={"Accept":"application/x-ndjson"})

    response = response.content.decode("utf-8")
    stripped_response = [json.loads(x) for x in response.split("\n")[:-1]]
    with open("chess_raw_data.json", "w") as f:
        json.dump(stripped_response, f, indent=2)
    print("Data Fetched.")

with open("chess_raw_data.json") as f:
    chess_games_list = json.load(f)


#Connects to database using Psycopg2 and creates tables for data insertion
if create_database:
    print("Connecting to database and creating tables...")
    chess_db = Database_Connection("chess_db")
    chess_db.comm("DROP TABLE IF EXISTS overview, moves, player_info")
    chess_db.comm("""
        CREATE TABLE overview (
        Game_ID VARCHAR (50) PRIMARY KEY,
        Game_start TIMESTAMP,
        Game_end TIMESTAMP,
        Rated BOOL NOT NULL,
        Variant VARCHAR (50),
        Speed VARCHAR (50),
        Status VARCHAR(50),
        Winner VARCHAR(50),
        Opening VARCHAR(255),
        Time_Initial INTEGER,
        Time_Increment INTEGER
        )""")
    chess_db.comm("""
        CREATE TABLE player_info (
        Index SERIAL PRIMARY KEY,
        Game_ID VARCHAR (50) NOT NULL REFERENCES overview(Game_ID),
        Username VARCHAR (50),
        Colour VARCHAR (50),
        Rating INTEGER,
        Rating_dif INTEGER,
        Inaccuracies INTEGER,
        Mistakes INTEGER,
        Blunders INTEGER,
        Average_CPL INTEGER
        )""")
    chess_db.comm("""
        CREATE TABLE moves (
        Index SERIAL PRIMARY KEY,
        Move_number INTEGER,
        Colour VARCHAR (50),
        Game_ID VARCHAR (50) NOT NULL REFERENCES overview(Game_ID),
        Move VARCHAR(10),
        Eval INTEGER,
        Forced_mate_for VARCHAR (50),
        Error_value VARCHAR (50)
        )""")
    print("Tables created.")

    #Parses through JSON file, cleans data and inserts data into tables
    count = 0
    for game in chess_games_list:
        count += 1
        print("Inserting data from game:", count)
        game_id = game["id"]
        rated = bool(game["rated"])
        variant = game["variant"]
        game_start = game["createdAt"]
        game_start = datetime.datetime.utcfromtimestamp(game_start/1000).strftime('%Y-%m-%d %H:%M:%S')
        game_end = game["lastMoveAt"]
        game_end = datetime.datetime.utcfromtimestamp(game_end/1000).strftime('%Y-%m-%d %H:%M:%S')
        speed = game["speed"]
        status = game["status"]
        try:
            opening = game["opening"]["name"]
        except:
            opening = None
        try:
            winner = game["winner"]
        except:
            winner = None
        try:
            time_initial = game["clock"]["initial"]
            time_increment = game["clock"]["increment"]
        except:
            time_initial = None
            time_increment = None

        chess_db.comm("""
        INSERT INTO overview 
        (Game_ID, Game_start, Game_end, Rated, Variant, Speed, Status,
        Winner, Opening, Time_initial, Time_increment)
        VALUES (
        %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
        )""", (game_id, game_start, game_end, rated, variant,
            speed, status, winner, opening, time_initial, time_increment))


        for colour in game["players"]:
            try:
                user = game["players"][f"{colour}"]["user"]["name"]
            except:
                user = None

            try:
                rating = game["players"][f"{colour}"]["rating"]
                rating_dif = game["players"][f"{colour}"]["ratingDiff"]
            except:
                rating = None
                rating_dif = None

            try:
                inaccuracies = game["players"][f"{colour}"]["analysis"]["inaccuracy"]
                mistakes = game["players"][f"{colour}"]["analysis"]["mistake"]
                blunders = game["players"][f"{colour}"]["analysis"]["blunder"]
                average_cpl = game["players"][f"{colour}"]["analysis"]["acpl"]
            except:
                inaccuracies = None
                mistakes = None
                blunders = None
                average_cpl = None

            chess_db.comm("""
            INSERT INTO player_info 
            (Game_ID, Username, Colour, Rating, Rating_dif, Inaccuracies, 
            Mistakes, Blunders, Average_CPL)
            VALUES (
            %s, %s, %s, %s, %s, %s, %s, %s, %s
            )""", (game_id, user, colour, rating, rating_dif,
                inaccuracies, mistakes, blunders, average_cpl))

        moves = game["moves"].split(" ")
        try:
            evals = game["analysis"]
        except:
            evals = [None]

        move_number = 1
        for sequence in zip_longest(moves, evals):
            move = sequence[0]
            if move_number % 2 == 0:
                colour = "black"
            else:
                colour = "white"

            try:
                forced_mate_for = None
                evaluation = None
                evaluation = sequence[1]["eval"]
            except:
                try:
                    forced_value = sequence[1]["mate"]
                    if forced_value < 0:
                        forced_mate_for = "black"
                    else:
                        forced_mate_for = "white"
                except:
                    evaluation = None
            try:
                error_value = sequence[1]["judgment"]["name"]
            except:
                error_value = None

            chess_db.comm("""
            INSERT INTO moves 
            (Move_number, Colour, Game_ID, Move, Eval, 
            Forced_mate_for, Error_value)
            VALUES (
            %s, %s, %s, %s, %s, %s, %s
            )""", (move_number, colour, game_id, move,
                evaluation, forced_mate_for, error_value))

            move_number += 1
    chess_db.close()

Reconnecting to database for querying

chess_db = Database_Connection("chess_db")

Alright, we've got our database! Let's take a look at the three tables we are working with. Firstly, we have our overview table which gives us an overview of the information related to each of my 1962+ games. Yes, that's how many games I've played.

#Overview table
query = """
SELECT *
FROM overview;
"""

df = pd.read_sql(query, chess_db.connection)
df.head()

We also have our player info table which provides info on each player in each game such as their ratings and average centipawn loss for the game.

#Player_info table
query = """
SELECT *
FROM player_info
WHERE mistakes IS NOT NULL;
"""

df = pd.read_sql(query, chess_db.connection)
df.head()

Lastly we have our moves table. This one is a big one, storing over 100,000 total moves. It shows info such as the move number during each game, the move that was made, and the engine evaluation at the time that move was made.

#Moves table
query = """
SELECT *
FROM moves
WHERE error_value IS NOT NULL;
"""

df = pd.read_sql(query, chess_db.connection)
df.head()

Let's start by sending a query to check our ratings over time for each game variant where we've played at least 10 games. We need to clean up our variant and speed columns to view this since we want to break out the 'standard' variant by time control but we want to keep variants such as 'three check' labeled by their variant name.

query = """
SELECT game_start, rating,
    CASE
        WHEN variant != 'standard' THEN variant
        WHEN variant = 'standard' THEN speed
    END AS g_mode
FROM overview
JOIN player_info
ON overview.game_id = player_info.game_id
WHERE rated IS TRUE AND username = 'mbellm' 
    AND CASE
        WHEN variant != 'standard' THEN variant
        WHEN variant = 'standard' THEN speed
    END IN (
        SELECT game_mode FROM (
            SELECT count(*) AS frequency,
                CASE
                    WHEN variant != 'standard' THEN variant
                    WHEN variant = 'standard' THEN speed
                END AS game_mode
                FROM player_info
                JOIN overview
                ON player_info.game_id = overview.game_id
                    AND player_info.username = 'mbellm'
                    AND rated IS TRUE
            GROUP BY 2
            HAVING count(*) > 10) inner_q)
"""

df = pd.read_sql(query, chess_db.connection)
df

Great, now let's check our win rate against different opponent strengths. We'll buckets our games into five difficulty levels to see our win, losses, and draws at each level. This will make it easy for us to graph our our results as a percentage of the total games played at each difficulty for our analysis section

#Lastly we have our moves table which has all the moves that were made in every game I've every played.
query = """
SELECT count(*) AS frequency,
    CASE 
        WHEN rating_dif  < -200 THEN 'much stronger'
        WHEN rating_dif BETWEEN -199 AND -101  THEN 'stronger'
        WHEN rating_dif BETWEEN -100 AND 100 THEN 'similar'
        WHEN rating_dif BETWEEN 101 AND 199 THEN 'weaker'
        WHEN rating_dif > 200 THEN 'much weaker'
    END AS opponent_is,
    CASE
        WHEN colour = winner THEN 'win'
        WHEN colour != winner THEN 'lose'
        WHEN winner IS NULL THEN 'draw'
    END AS game_result
FROM (
    SELECT a.game_id, a.username, b.username AS opp_name, a.colour, a.rating AS my_rating, 
        b.rating AS opp_rating, a.rating-b.rating AS rating_dif
    FROM player_info AS a
    JOIN player_info AS b
    ON a.game_id = b.game_id
        AND a.username = 'mbellm'
        AND b.username != 'mbellm') p_info
JOIN overview
ON overview.game_id = p_info.game_id
    AND overview.rated IS TRUE
    AND overview.variant = 'standard'
GROUP BY 2, 3
HAVING 
    CASE 
        WHEN rating_dif  < -200 THEN 'much stronger'
        WHEN rating_dif BETWEEN -199 AND -101  THEN 'stronger'
        WHEN rating_dif BETWEEN -100 AND 100 THEN 'similar'
        WHEN rating_dif BETWEEN 101 AND 199 THEN 'weaker'
        WHEN rating_dif > 200 THEN 'much weaker'
    END IS NOT NULL
ORDER BY 2, 3
"""

pd.options.display.max_rows = 15
df = pd.read_sql(query, chess_db.connection)
df

Next let's query the actual rating gained or lost by those same buckets, since as discussed in the analysis section, win rate does not tell the full story.

query = """
SELECT count(*) AS frequency, round(avg(rating_gain), 2) AS avg_rating_gain,
    CASE 
        WHEN rating_dif  < -200 THEN 'much stronger'
        WHEN rating_dif BETWEEN -199 AND -101  THEN 'stronger'
        WHEN rating_dif BETWEEN -100 AND 100 THEN 'similar'
        WHEN rating_dif BETWEEN 101 AND 199 THEN 'weaker'
        WHEN rating_dif > 200 THEN 'much weaker'
    END AS opp_rating
FROM (
    SELECT a.game_id, a.username, b.username AS opp_name, a.colour, a.rating AS my_rating, 
        b.rating AS opp_rating, a.rating-b.rating AS rating_dif, a.rating_dif AS rating_gain
    FROM player_info AS a
    JOIN player_info AS b
    ON a.game_id = b.game_id
        AND a.username = 'mbellm'
        AND b.username != 'mbellm') p_info
JOIN overview
ON overview.game_id = p_info.game_id
    AND overview.rated IS TRUE
    AND overview.variant = 'standard'
GROUP BY 3
HAVING
    CASE 
        WHEN rating_dif  < -200 THEN 'much stronger'
        WHEN rating_dif BETWEEN -199 AND -101  THEN 'stronger'
        WHEN rating_dif BETWEEN -100 AND 100 THEN 'similar'
        WHEN rating_dif BETWEEN 101 AND 199 THEN 'weaker'
        WHEN rating_dif > 200 THEN 'much weaker'
    END IS NOT NULL
ORDER BY 3;
"""

df = pd.read_sql(query, chess_db.connection)
df

I thought it would be interesting to also graph out the win rate depending on which move I castled as well. I thought that perhaps games where I delayed castling would have lower win rates as my king would be less safe.

#When should I be castling
query = """
SELECT move_number,
    CASE
        WHEN colour = winner THEN 'win'
        WHEN winner IS NULL THEN 'draw'
        WHEN colour != winner THEN 'lose'
    END AS game_res, count(*) AS frequency
FROM (
    SELECT moves.game_id, move_number, moves.colour
    FROM moves
    JOIN player_info
    ON moves.game_id = player_info.game_id
        AND player_info.username = 'mbellm'
        AND moves.colour = player_info.colour
        AND (moves.move = 'O-O' OR moves.move='O-O-O')) inner_q
JOIN overview
ON inner_q.game_id = overview.game_id
    AND overview.variant = 'standard'
GROUP BY 1,2
ORDER BY 1,2
"""
pd.options.display.max_rows = 10
df = pd.read_sql(query, chess_db.connection)
df_grouped = df.groupby('move_number').agg('sum')
df.to_csv('castling_data.csv')
df_grouped.to_csv('castling_data_grouped.csv')
df

Win rate by castle move

As we can see from the graph above however, there was not much of a trend to this data so I omitted it from the analysis. We can see that the openings I play usually result on castling on move 9, 11, or 13. But there isn't much to be done with that information.

Next let's check to see how we do during differently time controls. I expect faster time controls to have higher average centipawn loss.

#Average CPL by speed
query ="""
SELECT count(*) AS frequency, round(avg(average_cpl),2) AS avg_cpl, speed
FROM player_info
JOIN overview
ON player_info.game_id = overview.game_id
WHERE username = 'mbellm' AND inaccuracies IS NOT NULL
GROUP BY variant, speed
HAVING variant = 'standard' AND speed NOT IN ('correspondence', 'ultraBullet')
ORDER BY avg_cpl
"""

df = pd.read_sql(query, chess_db.connection)
df

As expected bullet is our worst variant and classical is our best.

But what about if we break those time controls out by the phase of the game, do I need to work on a certain parts of my game more than others?

To split avg CPL by game phase we need to first calculate average centipawn loss per move by taking the difference between engine evaluations on each move. In order to do that we need to clean up a couple of engine evaluation errors for when white/black is in a checkmate position, let's update our table first then calculate CPL per move.

#Updating table to be able to calculate avg cpl
query = """
UPDATE moves
SET eval = 
        CASE 
            WHEN forced_mate_for = 'white' THEN 1000
            WHEN forced_mate_for = 'black' THEN -1000
            WHEN eval IS NULL THEN NULL
            WHEN eval > 1000 THEN 1000
            WHEN eval < -1000 THEN 1000
            ELSE eval
        END
"""

chess_db.comm(query)

#Calculating cpl per move
query = """
SELECT player_info.game_id, move_number, move, player_info.colour, cpl AS avg_centi
FROM (
    SELECT *,
        CASE
            WHEN colour = 'white' THEN (eval - lag(eval) OVER 
                (PARTITION BY game_id ORDER BY game_id, move_number)) *-1
            WHEN colour = 'black' THEN (eval - lag(eval) OVER (PARTITION BY game_id ORDER BY game_id, move_number))
        END AS cpl
FROM moves ) inner_q
JOIN player_info
ON inner_q.game_id = player_info.game_id AND inner_q.colour = player_info.colour
WHERE player_info.game_id != '02EW8kec'
ORDER BY player_info.game_id, move_number
"""

df = pd.read_sql(query, chess_db.connection)
df

Now that our table is updated with CPL per move, I've created a function that will split each game into thirds. The function takes the max number of moves in each game and sets break points for the opening, middlegame, and endgame. Then it adds a phase column for each move.

def define_move_phase(x):
    bins = (0, round(x['move_number'].max() * 1/3), round(x['move_number'].max() * 2/3), x['move_number'].max())    
    phases = ["opening", "middlegame", "endgame"]
    try:
        x.loc[:, 'phase'] = pd.cut(x['move_number'], bins, labels=phases)
    except ValueError:
        x.loc[:, 'phase'] = None
    return x

df = df.groupby('game_id').apply(define_move_phase)
df

Now we need to join our player_info, overiew, and moves tables together to view avg cpl by piece moved by variant by phase. Let's first join overview and player_info together then add those to our moves table.

#Filter data to where it's only my games
query = """
SELECT player_info.game_id, username, colour, variant, speed
FROM player_info
JOIN overview
ON player_info.game_id = overview.game_id
"""

df2 = pd.read_sql(query, chess_db.connection)
print(df2.shape)
df2.head()

(3924, 5)

Now let's join those tables onto our moves table and filter out moves made by our opponents now that we can who made each move. My username is 'mbellm' which is being filtered.

df = pd.merge(df,df2,on=['game_id', 'colour'])
df = df[(df['username'] == 'mbellm')]
df

Let's clean up our variant column again, this time using pandas. Let's replace the 'standard' variant type with the associated time controls for each move.

try:
    m = df['variant'] == 'standard'
    df.loc[m, 'variant'] = df.loc[m, 'speed']
    df.drop('speed', axis=1, inplace=True)
except KeyError:
    pass
df = df[(df['variant'] == 'rapid') | (df['variant'] == 'classical') | (df['variant'] == 'bullet') | 
        (df['variant'] == 'blitz')]

df

Great! Now let's check our avg CPL by game phase by variant

pd.options.display.max_rows = 20
df_loss_by_variant = df.groupby(['phase', 'variant']).agg('mean')['avg_centi']
df_loss_by_variant.loc[['opening', 'middlegame', 'endgame']]
df_loss_by_variant.to_excel("cpl_by_variant_by_phase.xlsx")
df_loss_by_variant

phase       variant  
endgame     blitz        111.447273
            bullet       160.092715
            classical     73.573311
            rapid         71.369587
middlegame  blitz        110.797153
            bullet        92.174194
            classical     47.522493
            rapid         57.295096
opening     blitz         37.127820
            bullet        58.568493
            classical     30.606940
            rapid         35.348463
Name: avg_centi, dtype: float64

Now let's drill down deeper and reformat our move column so we can view the piece that was moved on each row.

pd.options.mode.chained_assignment = None
pd.options.display.max_rows = 10
def which_piece_moved(x):
    move_dict = {
        'Q' : 'Queen',
        'B' : 'Bishop',
        'N' : 'Knight',
        'K' : 'King',
        'R' : 'Rook',
        'O' : 'Castle'
    }
    
    if x[:1] in move_dict.keys():
        return move_dict[x[:1]]
    else:
        return 'Pawn'
        
df.loc[:, 'move'] = df.loc[:, 'move'].apply(which_piece_moved)
df

Now let's check our avg CPL by variant by game phase by piece moved! That's a lot of information, it should be easier to understand if we visualize that data in our analysis section.

pd.options.display.max_rows = 100
df = df.groupby(['phase', 'variant', 'move']).agg(['mean', 'count'])['avg_centi']
df.to_excel("final.xlsx")
df

Lastly, let's close out our connection to the database

chess_db.close()

To repeat the conclusion in our analysis section...

Ultimately, I can draw a few conclusions from this analysis:

I can artificially inflate my rating by selectively playing much stronger players than my current rating.
My play deteriotes as the game progress. I ought to spend my time learning to end game strategies rather than focusing on the opening or middlegame theory.
In particular, my endgame pawn moves need serious improvement in longer time controls. I could take some time to study endgame pawn strategy so that I don't lose a pawn's worth of value each time I move a pawn.

In closing, I am a little disappointed at the results of this analysis. I thought I'd be able to come away with stronger conclusions by breaking out CPL by variant/piece/phase - but whether it was due to small sample size or my own chess inconsistencies, the patterns didn't seem very pronounced. I think that with a game as complicated as chess, to really glean mind blowing insights would require much more in depth analysis. For instance, if one were to integrate in a chess engine to go beyond the basic CPL metric (there are deeper chess metrics than just CPL, which weren't available to me) there would be many more possibilities to draw stronger insights.

I'll also admit that maybe I haven't played as many chess games as I first thought... I needed more data.

	game_start	rating	g_mode
0	2019-02-03 22:21:10	1520	rapid
1	2019-01-30 04:43:35	1530	rapid
2	2019-01-29 06:01:30	1539	rapid
3	2019-01-22 02:19:43	1089	bullet
4	2019-01-17 06:09:52	1080	bullet
...	...	...	...
1458	2016-02-25 06:42:31	1574	rapid
1459	2016-02-25 06:16:37	1500	blitz
1460	2016-02-25 06:02:44	1592	rapid
1461	2016-02-25 06:00:51	1616	rapid
1462	2016-02-25 05:32:40	1500	rapid

	game_id	move_number	move	colour	avg_centi
0	03gDhPWr	1	e4	white	NaN
1	03gDhPWr	2	d5	black	37.0
2	03gDhPWr	3	e5	white	61.0
3	03gDhPWr	4	c5	black	-5.0
4	03gDhPWr	5	Nf3	white	26.0
...	...	...	...	...	...
110093	zzaiRa7s	36	a5+	black	NaN
110094	zzaiRa7s	37	Kxb5	white	NaN
110095	zzaiRa7s	38	c6+	black	NaN
110096	zzaiRa7s	39	Ka4	white	NaN
110097	zzaiRa7s	40	Q@b4#	black	NaN

	game_id	game_start	game_end	rated	variant	speed	status	winner	opening	time_initial
0	NAy7WF0f	2019-02-03 22:21:10	2019-02-03 22:24:06	True	standard	rapid	timeout	black	Queen's Pawn Game: Zukertort Variation	600.0
1	xExJA7q9	2019-01-30 04:43:35	2019-01-30 04:56:02	True	standard	rapid	resign	black	Italian Game	600.0
2	Kn8Y6wYN	2019-01-29 06:01:30	2019-01-29 06:15:12	True	standard	rapid	resign	white	Slav Defense	600.0
3	AVezclrl	2019-01-22 02:19:43	2019-01-22 02:21:57	True	standard	bullet	outoftime	white	Scandinavian Defense: Mieses-Kotroc Variation	60.0
4	1VFInsMw	2019-01-17 06:09:52	2019-01-17 06:11:03	True	standard	bullet	mate	black	Queen's Pawn Game	60.0

	index	game_id	username	colour	rating	rating_dif	inaccuracies	mistakes	blunders	average_cpl
0	13	0ngCNyCN	mbellm	white	1715.0	-9.0	6	8	5	83
1	14	0ngCNyCN	mecfvm	black	1766.0	9.0	9	5	3	63
2	15	i7LN6liR	mbellm	white	1726.0	-11.0	2	1	1	46
3	16	i7LN6liR	endasan	black	1719.0	11.0	1	2	0	22
4	17	wruI96Kl	buktop456	white	1576.0	-6.0	3	2	0	58

	index	move_number	colour	game_id	move	eval	forced_mate_for	error_value
0	302	17	white	0ngCNyCN	Bh4	-53	None	Inaccuracy
1	304	19	white	0ngCNyCN	a3	-71	None	Inaccuracy
2	311	26	black	0ngCNyCN	Nd4	200	None	Mistake
3	315	30	black	0ngCNyCN	Nh5	270	None	Inaccuracy
4	316	31	white	0ngCNyCN	Bh2	166	None	Mistake

	frequency	opponent_is	game_result
0	2	much stronger	draw
1	91	much stronger	lose
2	18	much stronger	win
3	1	much weaker	draw
4	13	much weaker	lose
5	33	much weaker	win
6	49	similar	draw
7	461	similar	lose
8	490	similar	win
9	6	stronger	draw
10	68	stronger	lose
11	44	stronger	win
12	2	weaker	draw
13	46	weaker	lose
14	79	weaker	win

	frequency	avg_rating_gain	opp_rating
0	111	4.42	much stronger
1	47	-8.21	much weaker
2	1000	-0.30	similar
3	118	0.14	stronger
4	127	-2.42	weaker

	frequency	avg_cpl	speed
0	153	55.82	classical
1	632	65.47	rapid
2	26	92.12	blitz
3	18	106.50	bullet

	game_id	move_number	move	colour	avg_centi	phase
0	03gDhPWr	1	e4	white	NaN	opening
1	03gDhPWr	2	d5	black	37.0	opening
2	03gDhPWr	3	e5	white	61.0	opening
3	03gDhPWr	4	c5	black	-5.0	opening
4	03gDhPWr	5	Nf3	white	26.0	opening
...	...	...	...	...	...	...
110093	zzaiRa7s	36	a5+	black	NaN	endgame
110094	zzaiRa7s	37	Kxb5	white	NaN	endgame
110095	zzaiRa7s	38	c6+	black	NaN	endgame
110096	zzaiRa7s	39	Ka4	white	NaN	endgame
110097	zzaiRa7s	40	Q@b4#	black	NaN	endgame

	game_id	username	colour	variant	speed
0	NAy7WF0f	Chouchene_yousef	white	standard	rapid
1	NAy7WF0f	mbellm	black	standard	rapid
2	xExJA7q9	mbellm	white	standard	rapid
3	xExJA7q9	nishantkamble	black	standard	rapid
4	Kn8Y6wYN	Holum	white	standard	rapid

	game_id	move_number	move	colour	avg_centi	phase	username	variant
31	03gDhPWr	2	Pawn	black	37.0	opening	mbellm	rapid
32	03gDhPWr	4	Pawn	black	-5.0	opening	mbellm	rapid
33	03gDhPWr	6	Pawn	black	31.0	opening	mbellm	rapid
34	03gDhPWr	8	Knight	black	63.0	opening	mbellm	rapid
35	03gDhPWr	10	Pawn	black	535.0	opening	mbellm	rapid
...	...	...	...	...	...	...	...	...
110024	zykwEM21	49	Rook	white	NaN	endgame	mbellm	rapid
110025	zykwEM21	51	Rook	white	NaN	endgame	mbellm	rapid
110026	zykwEM21	53	King	white	NaN	endgame	mbellm	rapid
110027	zykwEM21	55	King	white	NaN	endgame	mbellm	rapid
110028	zykwEM21	57	Rook	white	NaN	endgame	mbellm	rapid

			mean	count
phase	variant	move
endgame	blitz	Bishop	270.692308	26
		Castle	NaN	0
		King	40.982143	56
		Knight	75.242424	33
		Pawn	100.722222	36
		Queen	83.950000	60
		Rook	158.890625	64
	bullet	Bishop	5.857143	14
		Castle	NaN	0
		King	130.700000	20
		Knight	221.333333	9
		Pawn	127.648649	37
		Queen	177.978261	46
		Rook	263.040000	25
	classical	Bishop	69.063584	173
		Castle	51.142857	7
		King	56.576408	373
		Knight	72.315152	165
		Pawn	100.053864	427
		Queen	67.425249	301
		Rook	68.314667	375
	rapid	Bishop	66.332834	667
		Castle	77.354839	31
		King	56.489796	1421
		Knight	79.423622	635
		Pawn	67.978001	1591
		Queen	77.342835	1263
		Rook	83.255052	1435
middlegame	blitz	Bishop	83.122449	49
		Castle	55.666667	6
		King	123.862069	29
		Knight	84.697674	43
		Pawn	78.296875	64
		Queen	252.488889	45
		Rook	69.333333	45
	bullet	Bishop	89.320000	25
		Castle	193.333333	6
		King	136.444444	9
		Knight	122.333333	33
		Pawn	96.800000	40
		Queen	15.800000	20
		Rook	65.500000	22
	classical	Bishop	33.200717	279
		Castle	25.424242	33
		King	16.311475	183
		Knight	65.226054	261
		Pawn	59.078199	422
		Queen	47.031469	286
		Rook	50.356955	381
	rapid	Bishop	49.291408	1129
		Castle	57.881250	160
		King	21.312112	644
		Knight	62.159679	1121
		Pawn	56.370625	1600
		Queen	70.532570	1136
		Rook	66.880761	1367
opening	blitz	Bishop	51.946429	56
		Castle	45.000000	12
		King	18.857143	7
		Knight	9.800000	70
		Pawn	43.870130	77
		Queen	49.243243	37
		Rook	58.428571	7
	bullet	Bishop	75.172414	29
		Castle	18.000000	7
		King	2.000000	2
		Knight	68.424242	33
		Pawn	26.452830	53
		Queen	119.809524	21
		Rook	65.000000	1
	classical	Bishop	38.297872	376
		Castle	25.786667	75
		King	34.208333	24
		Knight	22.917051	434
		Pawn	31.639313	524
		Queen	32.308333	240
		Rook	27.917647	85
	rapid	Bishop	38.172148	1429
		Castle	26.144876	283
		King	88.494382	89
		Knight	27.435129	1634
		Pawn	35.714907	2301
		Queen	40.497013	837
		Rook	41.642023	257