Getting Started

Need a very quick introduction to Glass Onion? Here's an example of identifier synchronization across two public datasets for the 2023-2024 1. Bundesliga season: Statsbomb and Impect. If you're already familiar with how Glass Onion works, feel free to move on to the integration guide, where you can learn how to make Glass Onion part of a production workflow.

Installing Glass Onion

The easiest way to install Glass Onion is via uv or pip.

uv add glass_onion
pip install glass_onion

The installation guide has more options if you need them.

Example: Player Synchronization

Loading data

First, let's use kloppy to retrieve sample data for a given match from both Impect and Statsbomb. For simplicity, we've picked out the August 19, 2023 fixture between Bayer Leverkusen and RB Leipzig from both datasets.

from kloppy import impect, statsbomb

impect_dataset = impect.load_open_data(match_id="122839")
statsbomb_dataset = statsbomb.load_open_data(match_id="3895052")

We can pull out the player information from both of these event datasets into Pandas dataframes. For each, we'll have to iterate through the teams and pull specific fields for each of the players.

import pandas as pd 

def get_players(dataset, provider):
    return pd.DataFrame([
        {
            f"{provider}_player_id": player.player_id,
            "jersey_number": str(player.jersey_no),
            "team_id": team.team_id,
            "team_name": team.name,
            "player_name": player.full_name,
            "player_nickname": player.name
        }
        for team in dataset.metadata.teams
        for player in team.players
    ])

impect_player_df = get_players(
    impect_dataset, 
    provider="impect"
)
statsbomb_player_df = get_players(
    statsbomb_dataset, 
    provider="statsbomb"
)
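To see the shape of the frame that get_players produces without downloading anything, here is a minimal sketch that runs the same function against a hand-built stand-in. The SimpleNamespace objects only mimic the attributes get_players reads; they are an assumption for illustration, not kloppy's real classes, and the team names and nickname values are placeholders.

```python
from types import SimpleNamespace

import pandas as pd


def get_players(dataset, provider):
    # Same logic as above: one row per player, with a provider-prefixed ID column.
    return pd.DataFrame([
        {
            f"{provider}_player_id": player.player_id,
            "jersey_number": str(player.jersey_no),
            "team_id": team.team_id,
            "team_name": team.name,
            "player_name": player.full_name,
            "player_nickname": player.name,
        }
        for team in dataset.metadata.teams
        for player in team.players
    ])


def make_player(player_id, jersey_no, full_name):
    # Hypothetical stand-in for a kloppy player; the nickname reuses the full name.
    return SimpleNamespace(
        player_id=player_id, jersey_no=jersey_no, full_name=full_name, name=full_name
    )


# Hypothetical stand-in for a kloppy dataset with two one-player teams.
fake_dataset = SimpleNamespace(
    metadata=SimpleNamespace(
        teams=[
            SimpleNamespace(
                team_id="41",
                name="Bayer Leverkusen",
                players=[make_player("64023", 10, "Florian Wirtz")],
            ),
            SimpleNamespace(
                team_id="37",
                name="RB Leipzig",
                players=[make_player("355", 10, "Emil Forsberg")],
            ),
        ]
    )
)

demo_df = get_players(fake_dataset, provider="impect")
print(demo_df)
```

Both players in this stand-in wear number 10, which illustrates why jersey_number alone cannot identify a player and why team_id has to be comparable across providers.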

Assigning unified team identifiers

We need to unify team identifiers across these two dataframes so Glass Onion can properly use team_id in its synchronization logic. With just two teams, we can do this manually (as below) by simply setting RB Leipzig's team_id to RBL and Bayer Leverkusen's to B04. If we wanted to do this across the entire competition, we could build a more complex workflow with Glass Onion.

import numpy as np
impect_player_df["team_id"] = np.select(
    [
        impect_player_df["team_id"] == '41',
        impect_player_df["team_id"] == '37',
    ],
    [
        "B04",
        "RBL"
    ],
    default=impect_player_df["team_id"]
)

statsbomb_player_df["team_id"] = np.select(
    [
        statsbomb_player_df["team_id"] == '904',
        statsbomb_player_df["team_id"] == '182',
    ],
    [
        "B04",
        "RBL"
    ],
    default=statsbomb_player_df["team_id"]
)
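If the list of teams grows, a dictionary lookup may be easier to maintain than a chain of np.select conditions. The sketch below is one possible alternative, not part of Glass Onion: TEAM_ID_MAP and unify_team_ids are hypothetical names, and the IDs are the ones used in the np.select calls above.

```python
import pandas as pd

# Provider-specific team IDs taken from the np.select calls above.
TEAM_ID_MAP = {
    "impect": {"41": "B04", "37": "RBL"},
    "statsbomb": {"904": "B04", "182": "RBL"},
}


def unify_team_ids(player_df, provider):
    """Return a copy of player_df with known team IDs replaced by shared codes."""
    mapping = TEAM_ID_MAP[provider]
    unified = player_df.copy()
    # IDs missing from the mapping are kept unchanged, like np.select's default.
    unified["team_id"] = unified["team_id"].map(mapping).fillna(unified["team_id"])
    return unified


demo = pd.DataFrame({"team_id": ["41", "37", "99"]})
print(unify_team_ids(demo, "impect")["team_id"].tolist())
# ['B04', 'RBL', '99']
```

Returning a copy leaves the original frame untouched, so the raw provider IDs remain available if they are needed later.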

Wrapping in PlayerSyncableContent

Now, we just have to wrap these two dataframes in PlayerSyncableContent instances so they can be used in PlayerSyncEngine.

from glass_onion import PlayerSyncableContent

impect_content = PlayerSyncableContent(
    provider="impect",
    data=impect_player_df
)

statsbomb_content = PlayerSyncableContent(
    provider="statsbomb",
    data=statsbomb_player_df
)

Using PlayerSyncEngine

Once you have two PlayerSyncableContent instances, you can synchronize them with PlayerSyncEngine.synchronize():

from glass_onion import PlayerSyncEngine

engine = PlayerSyncEngine(
    content=[impect_content, statsbomb_content],
    verbose=True
)
result = engine.synchronize()

result is a PlayerSyncableContent object, so you can view the dataframe containing synchronized identifiers by dumping out its data field:

result.data.head()

jersey_number  team_id  player_name     impect_player_id  statsbomb_player_id  provider
1              B04      Lukas Hradecky  1030              8667                 impect
10             B04      Florian Wirtz   64023             40724                impect
10             RBL      Emil Forsberg   355               5625                 impect
11             B04      Nadiem Amiri    1286              8403                 impect
11             RBL      Timo Werner     1017              5557                 impect

You can then join other dataframes using result.data to link the Impect and Statsbomb datasets together.
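For quick one-off translations between providers, a plain dictionary built from the two ID columns can be enough. This is a minimal sketch on a stand-in frame copied from the rows shown above; it assumes the identifiers are stored as strings, and the names synced and impect_to_statsbomb are illustrative only.

```python
import pandas as pd

# Stand-in for result.data, copied from the first rows of the table above.
synced = pd.DataFrame({
    "player_name": ["Lukas Hradecky", "Florian Wirtz", "Emil Forsberg"],
    "impect_player_id": ["1030", "64023", "355"],
    "statsbomb_player_id": ["8667", "40724", "5625"],
})

# Keep only rows where both identifiers are present before building the lookup.
matched = synced.dropna(subset=["impect_player_id", "statsbomb_player_id"])
impect_to_statsbomb = dict(
    zip(matched["impect_player_id"], matched["statsbomb_player_id"])
)

print(impect_to_statsbomb["64023"])
# 40724
```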

Future: Using the synced identifiers

Let's say you want to compare a player's Statsbomb Shot xG to their Impect Packing xG. We'll need to parse out both KPIs from their JSON files:

# Kloppy doesn't cover this case, so we have to parse both JSON files ourselves.
import json
import requests
import pandas as pd

impect_player_match = json.loads(requests.get("https://raw.githubusercontent.com/ImpectAPI/open-data/refs/heads/main/data/player_kpis/player_kpis_122839.json").content)
impect_player_match_list = []
for t in ["squadHome", "squadAway"]:
    # Append each squad's players; plain assignment here would keep only the last squad.
    impect_player_match_list.extend(
        {**p, "team_id": impect_player_match[t]["id"], "match_id": impect_player_match["matchId"]}
        for p in impect_player_match[t]["players"]
    )

impect_player_match_df = pd.DataFrame(impect_player_match_list).explode("kpis")
impect_player_match_df["kpi_id"] = impect_player_match_df["kpis"].apply(lambda x: x["kpiId"])
impect_player_match_df["kpi_value"] = impect_player_match_df["kpis"].apply(lambda x: x["value"])
impect_player_match_df.drop("kpis", axis=1, inplace=True)
impect_player_match = impect_player_match_df[impect_player_match_df["kpi_id"] == 83].groupby(["match_id", "id"], as_index=False).kpi_value.sum()
impect_player_match.rename(
    { 
        "match_id": "impect_match_id",
        "id": "impect_player_id", 
        "kpi_value": "impect_packing_xg"
    }, 
    axis=1, 
    inplace=True
)
impect_player_match["impect_player_id"] = impect_player_match["impect_player_id"].astype(str)
impect_player_match

impect_match_id  impect_player_id  impect_packing_xg
122839           355               0.01708
122839           1017              0.13579
122839           1132              0.48393
122839           1174              0.00439
122839           1235              0.00155

import pandas as pd
statsbomb_player_match_df = pd.read_json("https://raw.githubusercontent.com/statsbomb/open-data/refs/heads/master/data/events/3895052.json")
statsbomb_player_match_df["match_id"] = "3895052"
# Keep only shot events; .copy() avoids SettingWithCopyWarning when adding columns below.
statsbomb_player_match_df = statsbomb_player_match_df[statsbomb_player_match_df["shot"].notna()].copy()
statsbomb_player_match_df["player_id"] = statsbomb_player_match_df["player"].apply(lambda x: x["id"])
statsbomb_player_match_df["player_name"] = statsbomb_player_match_df["player"].apply(lambda x: x["name"])
statsbomb_player_match_df["team_id"] = statsbomb_player_match_df["team"].apply(lambda x: x["id"])
statsbomb_player_match_df["team_name"] = statsbomb_player_match_df["team"].apply(lambda x: x["name"])
statsbomb_player_match_df["shot_statsbomb_xg"] = statsbomb_player_match_df["shot"].apply(lambda x: x["statsbomb_xg"])
statsbomb_player_match_df = statsbomb_player_match_df[
    [
        "match_id",
        "team_id",
        "player_id",
        "shot_statsbomb_xg"
    ]
]
statsbomb_player_match = statsbomb_player_match_df.groupby(["match_id", "player_id"], as_index=False).shot_statsbomb_xg.sum()
statsbomb_player_match["player_id"] = statsbomb_player_match["player_id"].astype(str)
statsbomb_player_match.rename(
    {
        "match_id": "statsbomb_match_id",
        "player_id": "statsbomb_player_id",
        "shot_statsbomb_xg": "statsbomb_shot_xg"
    },
    axis=1,
    inplace=True
)
statsbomb_player_match

statsbomb_match_id  statsbomb_player_id  statsbomb_shot_xg
3895052             3500                 0.020734
3895052             5536                 0.314084
3895052             5557                 0.042124
3895052             8211                 0.167162
3895052             8221                 0.157727

Once we have both datasets, we can join them through the synchronized identifiers in result.data:

composite_df = pd.merge(
    left=impect_player_match,
    right=result.data,
    on="impect_player_id",
    how="outer"
)

composite_df = pd.merge(
    left=composite_df,
    right=statsbomb_player_match,
    on="statsbomb_player_id",
    how="outer"
)

composite_result = (
    composite_df[["player_name", "team_id", "jersey_number", "impect_player_id", "statsbomb_player_id", "impect_packing_xg", "statsbomb_shot_xg"]]
        .sort_values(by=["impect_packing_xg", "statsbomb_shot_xg"], ascending=False)
)

composite_result.head()

player_name        team_id  jersey_number  impect_player_id  statsbomb_player_id  impect_packing_xg  statsbomb_shot_xg
Loïs Openda        RBL      17             2735              16275                1.25466            1.084519
Yussuf Poulsen     RBL      9              1132              5536                 0.48393            0.314084
Timo Werner        RBL      11             1017              5557                 0.13579            0.042124
Dani Olmo          RBL      7              5658              16532                0.11369            0.092469
Benjamin Henrichs  RBL      39             1303              8211                 0.08574            0.167162
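With both metrics in one frame, the comparison itself is a single column operation. The sketch below works on a stand-in copied from the rows printed above, so it runs without the earlier steps; the xg_difference column name is an illustrative choice, not something Glass Onion produces.

```python
import pandas as pd

# Stand-in for composite_result, copied from the rows printed above.
composite_demo = pd.DataFrame({
    "player_name": ["Loïs Openda", "Yussuf Poulsen", "Timo Werner"],
    "impect_packing_xg": [1.25466, 0.48393, 0.13579],
    "statsbomb_shot_xg": [1.084519, 0.314084, 0.042124],
})

# Positive values mean the Impect figure is higher than the Statsbomb figure.
composite_demo["xg_difference"] = (
    composite_demo["impect_packing_xg"] - composite_demo["statsbomb_shot_xg"]
)

print(composite_demo.sort_values("xg_difference", ascending=False))
```

On the full composite_df, rows that are missing either metric after the outer joins would yield NaN in this column, which makes gaps in coverage easy to spot.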