Utilities
Utilities for performing object synchronization.
apply_cosine_similarity(input1, input2)
Generates a dataframe of cosine similarity results from two pandas.Series. The inputs have NULL/NA values removed before being vectorized for use in the similarity algorithm.
For more technical details on cosine similarity, please see sklearn.feature_extraction.text.TfidfVectorizer and sklearn.metrics.pairwise.cosine_similarity.
The methodology behind this implementation can be found at: https://unravelsports.com/post.html?id=2022-07-11-player-id-matching-system
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input1 | Series[str] | a pandas.Series of strings. | required |
| input2 | Series[str] | a pandas.Series of strings. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | pandas.DataFrame: a DataFrame with the schema |
Source code in src/glass_onion/utils.py
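As a rough illustration of the idea, the sketch below scores every pair of strings from two inputs using character trigrams and cosine similarity. It is not the library's implementation: it works on plain lists and raw n-gram counts rather than pandas.Series and TF-IDF weights, and the helper names (`trigram_counts`, `cosine`, `pairwise_similarity`) are hypothetical.

```python
import math
from collections import Counter


def trigram_counts(text, n=3):
    """Count the overlapping n-character substrings of `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(input1, input2):
    """Score every (left, right) pair, skipping missing values."""
    left = [s for s in input1 if s is not None]
    right = [s for s in input2 if s is not None]
    return [
        (l, r, cosine(trigram_counts(l), trigram_counts(r)))
        for l in left
        for r in right
    ]


# Hypothetical sample data: one missing value is dropped before scoring.
scores = pairwise_similarity(
    ["lionel messi", None],
    ["messi lionel", "harry kane"],
)
```

Because the comparison is made on character n-grams rather than whole tokens, reordered names such as "lionel messi" and "messi lionel" still score well above unrelated names.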
dataframe_clean_merged_fields(df, columns)
Cleans up dataframe columns after a pandas.DataFrame.merge or pandas.DataFrame.join operation by keeping the first instance of the column and dropping others.
Assumes that the merge or join operation used the suffixes _x and _y (the pandas.DataFrame.merge defaults). For example, if a is passed in columns,
the columns considered for cleanup will be a_x and a_y: a_x will be renamed to a, and a_y will be dropped.
If a column (or its suffixed versions) in columns does not exist in df, it will be ignored.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | the dataframe whose columns should be cleaned up. | required |
| columns | (Index[str], list[str], str) | the columns to look for during cleanup. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | A pandas.DataFrame object. |
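The behaviour described above can be sketched as follows. This is a minimal illustration that assumes the _x/_y suffixes; `clean_merged_fields_sketch` is a hypothetical stand-in, not the function in glass_onion.

```python
import pandas as pd


def clean_merged_fields_sketch(df, columns):
    """Keep the `_x` copy of each column under its original name, drop `_y`."""
    if isinstance(columns, str):
        columns = [columns]
    out = df.copy()
    for col in columns:
        left, right = f"{col}_x", f"{col}_y"
        # Suffixed columns that are missing are simply skipped.
        if left in out.columns:
            out = out.rename(columns={left: col})
        if right in out.columns:
            out = out.drop(columns=[right])
    return out


# Hypothetical sample data: both frames carry a `name` column.
left = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
right = pd.DataFrame({"id": [1, 2], "name": ["Ann B.", "Bob C."]})
merged = left.merge(right, on="id")  # columns: id, name_x, name_y
cleaned = clean_merged_fields_sketch(merged, "name")
```

In this sketch the values from the left-hand frame win unconditionally; the right-hand values are discarded even where the left-hand value is missing.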
dataframe_coalesce(df, columns)
Unifies dataframe columns after a pandas.DataFrame.merge or pandas.DataFrame.join operation using a SQL-style COALESCE.
Assumes that the merge or join operation used the suffixes _x and _y (the pandas.DataFrame.merge defaults). For example, if a is passed in columns,
the columns COALESCEd will be a_x and a_y: rows where a_x is NA/None take their value from a_y, then a_x is renamed to a and a_y is dropped.
If a column (or its suffixed versions) in columns does not exist in df, it will be ignored.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | the dataframe to run COALESCE operations on. | required |
| columns | (Index[str], list[str], str) | the columns to COALESCE. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | A pandas.DataFrame object with the specified columns COALESCEd. |
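A minimal sketch of the COALESCE step, assuming the _x/_y suffixes. `coalesce_sketch` is a hypothetical name, and skipping a column unless both suffixed versions exist is an assumption about how "ignored" is handled.

```python
import pandas as pd


def coalesce_sketch(df, columns):
    """Fill gaps in `<col>_x` from `<col>_y`, then keep a single `<col>`."""
    if isinstance(columns, str):
        columns = [columns]
    out = df.copy()
    for col in columns:
        left, right = f"{col}_x", f"{col}_y"
        if left not in out.columns or right not in out.columns:
            continue  # assumption: skip unless both suffixed columns exist
        out[left] = out[left].fillna(out[right])
        out = out.rename(columns={left: col}).drop(columns=[right])
    return out


# Hypothetical merged frame with gaps on the left-hand side.
merged = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "team_x": ["Arsenal", None, None],
        "team_y": ["Arsenal FC", "Chelsea", None],
    }
)
result = coalesce_sketch(merged, ["team"])
```

Rows where both sides are missing stay missing, matching SQL COALESCE semantics.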
series_clean_spaces(input)
Please see string_clean_spaces for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series of strings. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series of strings with only "true" spaces (U+0020). |
series_normalize(input)
Applies a full suite of normalizations to a pandas.Series of strings.
Please see the following methods for more details:
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series of strings. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series with normalized strings. |
series_normalize_team_names(input)
Applies a full suite of normalizations to a pandas.Series of team name strings.
Please see the following methods for more details:
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series with more standardized club names. |
series_remove_accents(input)
Please see string_remove_accents for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series with Unicode-compliant strings. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series with ASCII strings. |
series_remove_common_prefixes(input)
Replaces common team prefixes with empty strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series with club names. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series with more standardized club names. |
series_remove_common_suffixes(input)
Replaces common team suffixes with empty strings.
Please see string_replace_common_womens_suffixes and string_remove_youth_suffixes for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series with club names. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series with more standardized club names. |
series_remove_double_spaces(input)
Replaces consecutive whitespace characters with just one space character.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series of strings. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series of strings. |
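One way to get this effect with plain pandas is a regular-expression replace. The snippet below is a sketch under that assumption; the library's actual pattern may differ.

```python
import pandas as pd

# Hypothetical sample data with runs of spaces and a tab.
names = pd.Series(["Real   Madrid", "Bayern \t Munich", "Ajax"])

# `\s+` matches one or more whitespace characters; each run becomes one space.
collapsed = names.str.replace(r"\s+", " ", regex=True)
```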
series_remove_non_word_chars(input)
Replaces each run of consecutive non-word characters (punctuation, whitespace, etc.) with a single space character.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series of strings. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series of strings. |
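A comparable effect can be sketched with the regex class `\W`, which in Python matches anything that is not a letter, digit, or underscore. Whether the library uses exactly this pattern is an assumption.

```python
import pandas as pd

# Hypothetical sample data containing punctuation between words.
names = pd.Series(["St. Pauli", "Brighton & Hove Albion", "1. FC Köln"])

# Each run of non-word characters (including spaces) becomes one space.
cleaned = names.str.replace(r"\W+", " ", regex=True)
```

Accented letters such as "ö" count as word characters in Python 3, so they are left untouched by this step.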
series_remove_youth_prefixes(input)
Replaces common youth team suffixes with empty strings.
Please see string_remove_youth_suffixes for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | Series[str] | a pandas.Series with club names. | required |
Returns:
| Type | Description |
|---|---|
| Series[str] | A pandas.Series with more standardized club names. |
string_clean_spaces(input)
Replaces Unicode character U+00A0 (the no-break space) with a "true" space (Unicode character U+0020).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | str | any string. | required |
Returns:
| Type | Description |
|---|---|
| str | A string with only "true" spaces (U+0020). |
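The replacement itself is a one-liner in plain Python. The sketch below uses a hypothetical function name, `clean_spaces_sketch`, to show the effect.

```python
def clean_spaces_sketch(text):
    """Swap every no-break space (U+00A0) for a regular space (U+0020)."""
    return text.replace("\u00a0", " ")


# The two words are joined by a no-break space, which looks like a normal
# space on screen but does not compare equal to one.
raw = "Manchester\u00a0United"
fixed = clean_spaces_sketch(raw)
```

Without this step, two visually identical names can fail an equality check or a join.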
string_ngrams(input, n=3)
Splits a given string into n-character n-grams to use later in cosine similarity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | str | any string. | required |
| n | int | the number of characters to use in each n-gram. Must be greater than 0. | 3 |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of strings, each n characters long. |
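A sliding-window sketch of character n-grams is shown below. `ngrams_sketch` is a hypothetical name, and any padding or cleaning the library applies before splitting is not reproduced here.

```python
def ngrams_sketch(text, n=3):
    """Return every overlapping n-character substring of `text`."""
    if n <= 0:
        raise ValueError("n must be greater than 0")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = ngrams_sketch("messi")       # ['mes', 'ess', 'ssi']
bigrams = ngrams_sketch("messi", 2)  # ['me', 'es', 'ss', 'si']
too_short = ngrams_sketch("ab")      # [] because the string is shorter than n
```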
string_remove_accents(input)
Uses unidecode to convert input (a Unicode object/string) into an ASCII-compliant string.
Please see unidecode.unidecode for more details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | str | any Unicode-compliant string. | required |
Returns:
| Type | Description |
|---|---|
| str | A string with only ASCII-compliant characters. |
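For readers without unidecode installed, a rough standard-library approximation is sketched below. It swaps in `unicodedata` decomposition for unidecode, so it is not equivalent: characters with no decomposition (for example "ø") are dropped here, whereas unidecode transliterates them. `remove_accents_sketch` is a hypothetical name.

```python
import unicodedata


def remove_accents_sketch(text):
    """Strip combining accents by decomposing, then discarding non-ASCII."""
    # NFKD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


plain = remove_accents_sketch("Kylian Mbappé")
```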
string_remove_youth_suffixes(input)
Replaces common youth team suffixes with empty strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | str | any string. | required |
Returns:
| Type | Description |
|---|---|
| str | A cleaned string without specific text indicating youth teams. |
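The list of suffixes the library recognises is not given here, so the sketch below assumes a single hypothetical pattern, a trailing age marker such as "U19" or "U-21", purely to illustrate the kind of replacement involved.

```python
import re

# Hypothetical pattern (an assumption, not the library's list): whitespace,
# then "U", an optional hyphen, and a two-digit age at the end of the name.
YOUTH_SUFFIX = re.compile(r"\s+U-?\d{2}$", flags=re.IGNORECASE)


def remove_youth_suffix_sketch(name):
    """Replace a trailing youth-team marker with an empty string."""
    return YOUTH_SUFFIX.sub("", name)


cleaned = remove_youth_suffix_sketch("Ajax U19")
```

Names without a matching suffix pass through unchanged.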
string_replace_common_womens_suffixes(input)
Replaces common women's club suffixes with empty strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input | str | any string. | required |
Returns:
| Type | Description |
|---|---|
| str | A cleaned string without specific text indicating women's teams. |