Using HuggingFace Argument Parser

Posted on 04/12/2024 in posts / python


When writing Python scripts, I rely a lot on command-line arguments to make a script reusable across different configurations.

To support this, I have been using Hugging Face's HfArgumentParser for some time.

The parser is a wrapper around Python's standard argparse.ArgumentParser that lets you specify arguments via dataclasses. After parsing the command-line arguments, it returns an instance of each dataclass rather than a single flat namespace.
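For example, a script can split its options across multiple dataclasses and get one parsed object back per class. A minimal sketch (ModelArgs and DataArgs here are made-up examples, not from the post below):

from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class ModelArgs:
    model_name_or_path: str = field(metadata={"help": "location of the model"})

@dataclass
class DataArgs:
    data_dir: str = field(default="data", metadata={"help": "input data directory"})

parser = HfArgumentParser([ModelArgs, DataArgs])
# One parsed dataclass instance is returned per class, in order
model_args, data_args = parser.parse_args_into_dataclasses(
    ["--model_name_or_path", "./model", "--data_dir", "./data"]
)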

An additional feature we get from HfArgumentParser is the ability to read the configuration from a JSON or YAML file via its parse_json_file and parse_yaml_file functions (see the sketch at the end of this post).

Overall, I really like being able to declare arguments via dataclasses and then use __post_init__ to derive custom fields that I can use in the model.

E.g.

from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class EvalArgs:
    model_name_or_path: str = field(metadata={"help": "location of the model"})
    file_suffix: str = field(default="", metadata={"help": "file_suffix for output folder"})
    batch_size: int = field(default=64, metadata={"help": "batch size"})
    max_length: int = field(default=512, metadata={"help": "max length"})
    dataset_paths: list[str] = field(default_factory=list, metadata={"help": "datasets used for the model"})
    dataset_names: list[str] = field(default_factory=list, metadata={"help": "datasets names for each dataset"})

    def __post_init__(self):
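        # Fill in default names when none are given, then expose a derived {name: path} mapping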
        if not self.dataset_names:
            self.dataset_names = [f"dataset_{i}" for i in range(len(self.dataset_paths))]
        assert len(self.dataset_paths) == len(self.dataset_names), f"{len(self.dataset_paths)=} != {len(self.dataset_names)=}"
        self.datasets = dict(zip(self.dataset_names, self.dataset_paths))

parser = HfArgumentParser([EvalArgs])
eval_args, unknown_args = parser.parse_args_into_dataclasses([
    "--model_name_or_path", "./model", 
    "--dataset_paths", "d1_path", "d2_path", 
    "--dataset_names", "d1", "d2", 
    "--max_length", "52"
], return_remaining_strings=True)
print(f"{eval_args=}, {unknown_args=}")
print(eval_args.datasets)

Running this prints the following output:

eval_args=EvalArgs(model_name_or_path='./model', file_suffix='', batch_size=64, max_length=52, dataset_paths=['d1_path', 'd2_path'], dataset_names=['d1', 'd2']), unknown_args=[]
{'d1': 'd1_path', 'd2': 'd2_path'}

This makes it easy to write training and evaluation scripts for models, and to read their configuration from JSON or YAML files rather than the command line.
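As a minimal sketch of the file-based variants (the config paths and contents below are made up for illustration; both functions return a tuple with one parsed dataclass per class registered with the parser):

# Hypothetical config files; config.json could contain, e.g.,
# {"model_name_or_path": "./model", "dataset_paths": ["d1_path"], "dataset_names": ["d1"]}
eval_args, = parser.parse_json_file("config.json")
eval_args, = parser.parse_yaml_file("config.yaml")
print(eval_args.datasets)  # __post_init__ still runs, so the derived mapping is available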