Provenance

Dataset provenance underpins data integrity and quality in machine learning. Knowing a dataset's origin, context, and collection methods supports reliable and reproducible models, informs ethical review, and helps demonstrate compliance with data regulations. The provenance section of the datacard therefore covers the dataset's full lifecycle: its sources, transformations, annotations, and licensing.

Provenance pydantic type:

from typing import List, Literal, Optional, Union

from pydantic import BaseModel, HttpUrl

class Publisher(BaseModel):
    name: str
    publisher_url: Optional[HttpUrl]
    id: Optional[str] = None

class License(BaseModel):
    # 'UNSPECIFIED' only exists for the time being
    license_kind: Literal['OPEN', 'CLOSED', 'UNSPECIFIED', 'PROPRIETARY'] = 'UNSPECIFIED'
    license: str
    license_url: Optional[HttpUrl]

class DatasetProvenance(BaseModel):
    # 'DataCard', 'ValyuDataDID', and 'Sources' are forward references to types not shown here
    derived_from_datasets: List[Union[str, 'DataCard', 'ValyuDataDID', 'Sources', Literal['Base']]] = ['Base']
    children: List[str] = []
    ml_model_generated: Optional[List[str]]
    creators: List[str]
    human_annotation: str
    publisher_details: Optional[List[Publisher]]
    licenses: List[License]
    license_notes: Optional[str]
    license_verified_by: Optional[str]

Provenance Fields:

  • derived_from_datasets: Traces the dataset's lineage back to its origins, highlighting dependencies and sources. This field is fundamental for assessing the dataset's foundation and its evolution over time.

  • children: Identifies datasets that have been derived from the current dataset, offering insights into how the data has been utilized and transformed further.

  • ml_model_generated: Lists the machine learning models, if any, that generated or modified parts of the dataset, providing transparency into the data's augmentation and potential biases.

  • creators: Lists the individuals or organizations responsible for creating the dataset, underlining the expertise and intentions behind the dataset's compilation.

  • human_annotation: Details the human involvement in annotating the dataset, crucial for understanding the dataset's contextual accuracy and the nature of the annotations.

  • publisher_details: Describes each publisher of the dataset (see Publisher Details), adding credibility and context to the provenance record.

  • licenses: Enumerates the licenses governing the dataset's use, pivotal for legal compliance and clarifying permissible uses.

  • license_notes and license_verified_by: Record any special conditions attached to the dataset's licensing and who verified the licensing information.
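As a rough illustration of how derived_from_datasets and children relate, the sketch below walks a lineage graph built from plain dictionaries. It assumes every parent is recorded as a string id (the real field also accepts other types) and that 'Base' marks a dataset with no parent. The registry, the dataset ids, and both helper functions are hypothetical and not part of the datacard library.

```python
from typing import Dict, List

# Hypothetical in-memory registry: dataset id -> its derived_from_datasets list.
# The ids are invented for illustration; 'Base' is assumed to mean "no parent".
REGISTRY: Dict[str, List[str]] = {
    "raw-corpus": ["Base"],
    "label-set": ["Base"],
    "cleaned-corpus": ["raw-corpus"],
    "annotated-corpus": ["cleaned-corpus", "label-set"],
}


def ancestors(dataset_id: str, registry: Dict[str, List[str]]) -> List[str]:
    """Return every upstream dataset id, nearest parents first (breadth-first)."""
    seen: List[str] = []
    queue = list(registry.get(dataset_id, []))
    while queue:
        parent = queue.pop(0)
        if parent == "Base" or parent in seen:
            continue  # skip the 'Base' sentinel and already-visited parents
        seen.append(parent)
        queue.extend(registry.get(parent, []))
    return seen


def children_of(dataset_id: str, registry: Dict[str, List[str]]) -> List[str]:
    """Derive a children list by inverting derived_from_datasets."""
    return sorted(k for k, parents in registry.items() if dataset_id in parents)


print(ancestors("annotated-corpus", REGISTRY))
print(children_of("raw-corpus", REGISTRY))
```

Deriving children by inverting the parent lists, as done here, is one way to keep the two fields consistent; whether the library does this automatically is not stated in this section.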

Publisher Details:

  • name: The name of the entity or individual responsible for publishing the dataset. This field adds a layer of accountability and credibility, as knowing the publisher helps users assess the dataset's reliability.

  • publisher_url: A URL providing more information about the publisher. This can link to an institutional page, a dataset repository, or a personal webpage, offering additional context and validation for the dataset.

  • id: An optional identifier for the publisher, which could be useful in systems where publishers are indexed or need to be referenced uniquely.
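The following minimal sketch shows how the Publisher fields behave under validation. It redefines the model locally so that it runs on its own; the explicit `= None` default on publisher_url is an assumption made so the field can be omitted, and the helper function and sample values are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, HttpUrl, ValidationError


class Publisher(BaseModel):
    # Local copy of the Publisher model; the `= None` default on
    # publisher_url is an assumption made for this sketch.
    name: str
    publisher_url: Optional[HttpUrl] = None
    id: Optional[str] = None


def is_valid_publisher(data: dict) -> bool:
    """Return True if `data` passes Publisher validation (hypothetical helper)."""
    try:
        Publisher(**data)
        return True
    except ValidationError:
        return False


# A name plus a well-formed URL is accepted.
ok = is_valid_publisher({"name": "Example Lab", "publisher_url": "https://example.org/data"})
# A malformed URL is rejected by the HttpUrl type.
bad_url = is_valid_publisher({"name": "Example Lab", "publisher_url": "not a url"})
# name is the only field without a default, so omitting it fails.
no_name = is_valid_publisher({"publisher_url": "https://example.org/data"})
```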

License Details:

  • license_kind: Indicates the type of license under which the dataset is released. The value must be one of 'OPEN', 'CLOSED', 'PROPRIETARY', or 'UNSPECIFIED' (the default), providing a quick reference to the dataset's accessibility and usage restrictions.

  • license: The specific name or title of the license, which provides detailed information about the rights, restrictions, and obligations associated with the dataset.

  • license_url: A URL to the full text of the license, allowing users to review the legal terms in detail. This is essential for understanding the conditions under which the dataset can be used, shared, or modified.
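To make the license fields concrete, here is a small sketch of how the License model might be populated and validated. The model is redefined locally so the snippet runs on its own, and the `= None` default on license_url is an assumption; the sample license entries are illustrative only.

```python
from typing import Literal, Optional

from pydantic import BaseModel, HttpUrl, ValidationError


class License(BaseModel):
    # Local copy of the License model; the `= None` default on
    # license_url is an assumption made for this sketch.
    license_kind: Literal["OPEN", "CLOSED", "UNSPECIFIED", "PROPRIETARY"] = "UNSPECIFIED"
    license: str
    license_url: Optional[HttpUrl] = None


# A fully specified open license.
cc_by = License(
    license_kind="OPEN",
    license="CC BY 4.0",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# Omitting license_kind falls back to the 'UNSPECIFIED' default.
unknown = License(license="Custom terms")

# Values outside the Literal are rejected; matching is case-sensitive.
try:
    License(license_kind="open", license="MIT")
    rejected = False
except ValidationError:
    rejected = True
```

Because license_kind is a Literal, a typo or a lowercase value fails at construction time rather than producing a card with an unrecognized license category.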
