Core API Reference

This page is the reference for the core API of the StatCan Dialogue Dataset. It provides functions for downloading, loading, and processing the dataset.

Reference for `statcan_dialogue_dataset`

The core API for the StatCan Dialogue Dataset. To use this, start with:

import statcan_dialogue_dataset as sdd

Then, you can start using the functions in this module.

`download_dataverse`

sdd.download_dataverse(api_token, server_url="https://borealisdata.ca", persistent_id="doi:10.5683/SP3/NR0BMY", filename="task_data.zip", data_dir=None, overwrite=False, bundle_name="dataverse_files.zip", remove_bundle=False)

Description

Download a file from a Dataverse repository. By default, this downloads the file from the Borealis Data repository (McGill Dataverse) and uses the DOI for the StatCan Dialogue Dataset. You will need to provide the API token created from the account that has been granted access to the file.

Parameters

Name	Type	Default	Description
`api_token`	`str`		The API token for the account that has been granted access to the file. You can create an API token from the account settings page. See the Dataverse documentation for more information: https://guides.dataverse.org/en/latest/api/auth.html
`server_url`	`str`	`"https://borealisdata.ca"`	The URL of the Dataverse repository. By default, this is the URL for the Borealis Data repository (McGill Dataverse).
`persistent_id`	`str`	`"doi:10.5683/SP3/NR0BMY"`	The persistent identifier for the file. By default, this is the DOI for the StatCan Dialogue Dataset. You can find the persistent identifier for a file by going to the file’s page on the Dataverse repository and clicking the “Share” button. The persistent identifier can be found in the URL of the page.
`filename`	`str`	`"task_data.zip"`	The name of the file to download. By default, this is the name of the file in the StatCan Dialogue Dataset.
`data_dir`	`str or Path`	`None`	The directory to download the file to. By default, this is the directory returned by `utils.get_data_dir()`. If the directory does not exist, it will be created.
`overwrite`	`bool`	`False`	Whether to overwrite the file if it already exists. By default, this is False, which means that the file will not be downloaded if it already exists.
`bundle_name`	`str`	`"dataverse_files.zip"`	The name of the bundle file that contains the file to download. By default, this is the default name used by Dataverse. Note that the bundle can itself contain a ZIP file (a zip within a zip), which is the case by default: `task_data.zip` is contained in `dataverse_files.zip`. You generally don’t need to change this or the `filename` parameter above.
`remove_bundle`	`bool`	`False`	Whether to remove the bundle file after extracting the file.

Note

Once the file is downloaded, you can extract the task data using the `extract_task_data_zip` function.
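As a minimal sketch of the download-then-extract flow described above: the helper names and the `DATAVERSE_API_TOKEN` environment variable are assumptions made for this illustration and are not part of the package. Only `download_dataverse` and `extract_task_data_zip`, with the parameters documented on this page, come from the library.

```python
import os
from pathlib import Path


def read_api_token(env_var="DATAVERSE_API_TOKEN"):
    """Read the token from an environment variable (the variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Dataverse API token first.")
    return token


def fetch_task_data_from_dataverse(data_dir):
    """Download the Dataverse bundle, then extract task_data.zip into data_dir."""
    # Imported lazily so read_api_token can be used without the package installed.
    import statcan_dialogue_dataset as sdd

    data_dir = Path(data_dir)
    sdd.download_dataverse(
        read_api_token(),
        data_dir=data_dir,
        remove_bundle=True,  # remove dataverse_files.zip after task_data.zip is extracted from it
    )
    # Assumes task_data.zip now sits in data_dir, so the default filename applies.
    sdd.extract_task_data_zip(data_dir=data_dir)
    return data_dir
```

Keeping the token in an environment variable rather than in source code avoids committing it by accident; any other secret store would work equally well.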

`download_huggingface`

sdd.download_huggingface(api_token, repository_url="https://huggingface.co/datasets/McGill-NLP/statcan-dialogue-dataset", branch="main", filename="task_data.zip", data_dir=None, overwrite=False)

Description

This downloads the task data from HuggingFace. It requires an API token to be passed in, which can be generated from your HuggingFace account, after you have been granted access to the data repository.

Parameters

Name	Type	Default	Description
`api_token`	`str`		The API token for the account that has been granted access to the file. You can create an API token from the account settings page. See the HuggingFace documentation for more information: https://huggingface.co/docs/hub/security-tokens
`repository_url`	`str`	`"https://huggingface.co/datasets/McGill-NLP/statcan-dialogue-dataset"`	The URL of the HuggingFace repository. By default, this is the URL for the StatCan Dialogue Dataset repository.
`branch`	`str`	`"main"`	The branch of the repository to download the file from. By default, this is the “main” branch.
`filename`	`str`	`"task_data.zip"`	The name of the file to download. By default, this is the name of the file that contains the task data.
`data_dir`	`str or Path`	`None`	The directory to download the file to. By default, this is the directory returned by `utils.get_data_dir()`. If the directory does not exist, it will be created.
`overwrite`	`bool`	`False`	Whether to overwrite the file if it already exists in the directory. By default, this is False.

Note

Once the file is downloaded, you can extract the task data using the `extract_task_data_zip` function.
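The following sketch shows one way to call `download_huggingface` with the documented keyword arguments collected in one place. The two helper functions are hypothetical names introduced here; the directory passed in is whatever you choose, and the extraction step assumes the ZIP was saved under its default name in that directory.

```python
from pathlib import Path


def huggingface_download_kwargs(data_dir, branch="main", overwrite=False):
    """Collect the documented keyword arguments for download_huggingface."""
    return {
        "branch": branch,
        "data_dir": Path(data_dir),
        "overwrite": overwrite,
    }


def fetch_task_data_from_huggingface(api_token, data_dir, **options):
    """Download task_data.zip from HuggingFace, then extract it into data_dir."""
    # Imported lazily so the kwargs helper can be used without the package installed.
    import statcan_dialogue_dataset as sdd

    kwargs = huggingface_download_kwargs(data_dir, **options)
    sdd.download_huggingface(api_token, **kwargs)
    sdd.extract_task_data_zip(data_dir=kwargs["data_dir"])
    return kwargs["data_dir"]
```

Passing `overwrite=True` through `options` forces a fresh download when a stale copy of the file is already present.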

`download_full_tables`

sdd.download_full_tables(data_dir=None, lang="en", deposition_id="auto", show_progress=True)

Description

Download the full tables from Zenodo.

Parameters

Name	Type	Default	Description
`data_dir`	`str or Path`	`None`	The directory to download the tables to. If None, the default data directory is used. If the directory does not exist, it will be created.
`lang`	`str`	`"en"`	The language of the tables to download. Must be ‘en’ or ‘fr’.
`deposition_id`	`int or str`	`"auto"`	The Zenodo deposition ID to download the tables from. If ‘auto’, the appropriate deposition ID is used based on the value of `lang`.
`show_progress`	`bool`	`True`	Whether to show a progress bar while downloading the tables. If True, the `tqdm` package must be installed.

Returns

Path

The path to the downloaded tables.
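A short sketch of fetching the tables for both supported languages, using the returned path for each. The function `download_tables_for` and its `download` parameter are assumptions added for illustration; the parameter exists only so the loop can be exercised without network access.

```python
from pathlib import Path


def download_tables_for(langs=("en", "fr"), data_dir=None, download=None):
    """Download the full tables for each language and return {lang: path}."""
    if download is None:
        # Imported lazily; by default the real library function is used.
        import statcan_dialogue_dataset as sdd

        download = sdd.download_full_tables

    paths = {}
    for lang in langs:
        if lang not in ("en", "fr"):
            raise ValueError(f"lang must be 'en' or 'fr', got {lang!r}")
        # show_progress=False avoids the optional tqdm dependency.
        paths[lang] = Path(download(data_dir=data_dir, lang=lang, show_progress=False))
    return paths
```

Leaving `deposition_id` at its default of `"auto"` lets the library pick the deposition that matches each language.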

`extract_task_data_zip`

sdd.extract_task_data_zip(filename="task_data.zip", data_dir=None, load_dir=None, remove_zip=False)

Description

Extracts the ZIP file downloaded from HuggingFace or Dataverse, located at the given path. The file is extracted to the data directory. For instructions on how to download the data from Dataverse, visit: https://mcgill-nlp.github.io/statcan-dialogue-dataset

Parameters

Name	Type	Default	Description
`filename`	`str or Path`	`"task_data.zip"`	The path to the ZIP file to extract.
`data_dir`	`str or Path`	`None`	The path to the data directory. If None, the default data directory is used.
`load_dir`	`str or Path`	`None`	The path to the directory to load the data from. If None, the data directory is used. This is useful if load_dir is different from data_dir, e.g. if the data is extracted to a different directory.
`remove_zip`	`bool`	`False`	Whether to remove the ZIP file after extraction.

`load_table`

sdd.load_table(code, data_dir=None, load_from_zip=True, lang="en")

Description

Loads a table from the data directory.

Parameters

Name	Type	Default	Description
`code`	`str`		The PID code of the table to load.
`data_dir`	`str or Path`	`None`	The path to the data directory. If None, the default data directory is used.
`load_from_zip`	`bool`	`True`	Whether to load the table from the ZIP file. If False, it is assumed that the tables- directory exists in the data directory, which contains the `.csv.zip` files.
`lang`	`str`	`"en"`	The language of the table to load. Must be either “en” or “fr”.

Returns

pd.DataFrame

The table with the given PID code.

Note

The tables must be downloaded first using the `download_full_tables` function.
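A sketch of loading several tables at once, assuming the tables were already downloaded with `download_full_tables`. The helper `load_tables` and its `load` parameter are hypothetical; the codes you pass are placeholders for real PID codes, whose exact format is not specified on this page.

```python
def load_tables(codes, lang="en", data_dir=None, load=None):
    """Load each table by PID code and return {code: table}."""
    if load is None:
        # Imported lazily; by default the real library function is used.
        import statcan_dialogue_dataset as sdd

        load = sdd.load_table

    # Each value is whatever load returns (a pandas DataFrame for sdd.load_table).
    return {code: load(code, data_dir=data_dir, lang=lang) for code in codes}
```

With the default loader, each returned value is a DataFrame, so the usual pandas inspection tools such as `.head()` and `.shape` apply.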

`extract_full_tables`

sdd.extract_full_tables(data_dir=None, remove_zip=False, lang="en")

`load_task_data`

sdd.load_task_data(task="retrieval", lang="en", split="train", with_hn=False, with_augmented=False, data_dir=None)

Description

Loads the data for a given task, language and split. The data is returned as a pandas DataFrame.

Parameters

Name	Type	Default	Description
`task`	`str`	`"retrieval"`	The task to load the data for. Must be either ‘retrieval’ or ‘generation’.
`lang`	`str`	`"en"`	The language to load the data for. Must be either ‘en’ or ‘fr’.
`split`	`str`	`"train"`	The split to load the data for. Must be either ‘train’, ‘valid’ or ‘test’.
`with_hn`	`bool`	`False`	Whether to include the BM25 hard negatives for the retrieval task.
`with_augmented`	`bool`	`False`	Whether to include the augmented data for the generation task.
`data_dir`	`str or Path`	`None`	The path to the base data directory. If None, the default data directory is used. It must contain subdirectories “retrieval” and “generation”, which contain the data for the retrieval and generation tasks respectively.
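The sketch below validates the three string arguments against the values listed in the table above and then loads every split for one task and language. `check_task_args`, `load_all_splits`, and the `load` parameter are names invented for this example; whether the library performs its own validation is not stated here.

```python
VALID_TASKS = ("retrieval", "generation")
VALID_LANGS = ("en", "fr")
VALID_SPLITS = ("train", "valid", "test")


def check_task_args(task, lang, split):
    """Raise ValueError early if an argument is outside the documented values."""
    for name, value, allowed in (
        ("task", task, VALID_TASKS),
        ("lang", lang, VALID_LANGS),
        ("split", split, VALID_SPLITS),
    ):
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


def load_all_splits(task="retrieval", lang="en", data_dir=None, load=None):
    """Return {split: data} for the train, valid and test splits."""
    if load is None:
        # Imported lazily; by default the real library function is used.
        import statcan_dialogue_dataset as sdd

        load = sdd.load_task_data

    data = {}
    for split in VALID_SPLITS:
        check_task_args(task, lang, split)
        data[split] = load(task=task, lang=lang, split=split, data_dir=data_dir)
    return data
```

The `with_hn` and `with_augmented` flags are left at their defaults here; they apply to the retrieval and generation tasks respectively.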

`load_retrieval_metadata`

sdd.load_retrieval_metadata(data_dir=None)

Description

Loads the metadata for the retrieval task. The metadata is returned as a pandas DataFrame with the ‘pid’ column set as the index. Note that this is not the same as the `sdd.metadata` module.

`load_index_splits`

sdd.load_index_splits(data_dir=None)

Description

Loads the list of indices for each split (train, valid, test). These indices are used to divide the data into the different splits. This is only needed when you are using the JSON form of the original conversation data.

`load_task_conversations`

sdd.load_task_conversations(data_dir=None)

Description

Loads the conversations for the retrieval and generation tasks. The conversations are returned as a dictionary with the following structure:

{
    <index>: {
        "index": <index>,
        "conversation": [
            {
                "speaker": <'user' or 'agent'>,
                "name": <name of speaker>,
                "timestamp": <timestamp>,
                "urls": [...],
                "text": <the message by the speaker>
            },
            ...
        ],
        "language": {
            'automatic': <language selected by the user in the chat system>,
            'fasttext': {
                'detected': <'en' or 'fr'>,
                'confidence': <confidence of the language detection model>
            },
            'langid': {
                'detected': <'en' or 'fr'>,
                'confidence': <confidence of the language detection model>
            }
        },
        "urls": [...],
    },
    ...
}
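To illustrate how the structure above can be traversed, here is a small sketch that works on any dictionary of that shape. Both helper functions are hypothetical, and the toy record in the usage note is made up; it is not taken from the dataset.

```python
from collections import Counter


def user_messages(record):
    """Return the text of every turn spoken by the user in one conversation record."""
    return [
        turn["text"]
        for turn in record["conversation"]
        if turn["speaker"] == "user"
    ]


def count_by_detected_language(conversations, detector="fasttext"):
    """Count conversations per language, as detected by 'fasttext' or 'langid'."""
    counts = Counter(
        record["language"][detector]["detected"]
        for record in conversations.values()
    )
    return dict(counts)
```

For example, with `conversations = sdd.load_task_conversations()`, calling `count_by_detected_language(conversations, detector="langid")` would tally the conversations by the `langid` detection result.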

`find_pid`

sdd.find_pid(url)

Description

Given a URL, finds the pid value (or returns None if not found, or -1 if the PID is invalid).
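Because `find_pid` has three possible outcomes, callers may want to distinguish them explicitly. The sketch below does so; `classify_pid_result`, `resolve_pid`, the status labels, and the `find` parameter are all assumptions introduced for this example.

```python
def classify_pid_result(result):
    """Map a find_pid return value to 'not_found', 'invalid' or 'ok'."""
    if result is None:
        return "not_found"  # no PID could be found in the URL
    if result == -1:
        return "invalid"  # a PID was found but is not valid
    return "ok"


def resolve_pid(url, find=None):
    """Return (status, pid), where pid is None unless status is 'ok'."""
    if find is None:
        # Imported lazily; by default the real library function is used.
        import statcan_dialogue_dataset as sdd

        find = sdd.find_pid

    result = find(url)
    status = classify_pid_result(result)
    return status, (result if status == "ok" else None)
```

Checking `result is None` before comparing with `-1` keeps the two failure cases separate, which a plain truthiness test would not do.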
