# Add a Custom Model

This guide provides a step-by-step process to integrate your own model into cz-benchmarks.

---

## Overview

To add a new model, you will:
1. Create a directory for your model.
2. Implement the necessary files and classes, extending the base classes for model implementation and input validation.
3. Test and integrate your model.

---

## Step 1: Create a Directory for Your Model

1. Navigate to the `docker/` directory in your project.
2. Create a new subdirectory for your model, e.g., `docker/your_model/`.
3. Structure the directory as follows:

    ```
    docker/your_model/
    ├── Dockerfile          # Define the container environment
    ├── model.py            # Implementation of your model inference code
    ├── config.yaml         # Configuration file for your model
    ├── requirements.txt    # (Optional) List Python dependencies
    └── assets/             # (Optional) Store model weights, vocabularies, etc.
    ```

---

## Step 2: Add a new ModelType

In the `src.czbenchmarks.models.types.ModelType` enum, add a value for your model:

     Example:

     ```
     class ModelType(Enum):
          ...
          YOUR_MODEL = "YOUR_MODEL"
     ```

---

## Step 3: Implement the Model Class

1. Create a model class that extends `BaseModelImplementation`.
2. Implement the required methods, such as `run_model`.

     Example:

     ```python
     import argparse
     from typing import Set
     from czbenchmarks.datasets.types import DataType
     from czbenchmarks.models.implementations.base_model_implementation import BaseModelImplementation
     from czbenchmarks.models.types import ModelType


     class YourModel(BaseModelImplementation):
          def create_parser(self):
               parser = argparse.ArgumentParser(description="Run YourModel on input dataset.")
               parser.add_argument("--your_param", type=int, default=32, help="Description of your_param")
               return parser

          model_type = ModelType.YOUR_MODEL

          @property
          def inputs(self) -> Set[DataType]:
                # Specify appropriate `DataType`s below
               return { }

          @property
          def outputs(self) -> Set[DataType]:
               # Specify appropriate `DataType`s below (embeddings are a typical model output)
               return { DataType.EMBEDDING }  

          def get_model_weights_subdir(self, dataset) -> str:
               return "your_model"

          def _download_model_weights(self, dataset) -> None:
               # Implement your model weight download or verification logic here.
               pass

          def run_model(self, dataset):
               # Implement inference logic:
               embeddings = ...  # Run inference to produce embeddings
               dataset.set_output(self.model_type, DataType.EMBEDDING, embeddings)

     if __name__ == "__main__":
           YourModel().run()
     ```

Note that you can access any arguments specified in the `config.yaml` or via command-line options using `self.args`.

---

## Step 4: (Optional) Add a display name for your Model

In `src.czbenchmarks.models.utils`, add a display name for your model if you would
like a prettier display name than the one you used in the `ModelType` enum.

As currently implemented, the display name can be customized based on the value in the
`ModelType` enum as well as the `model-variant` and (fine-tuning) `dataset` arguments
passed to your model (defined in the `create_parser` method).


    Example:

     ```python
     _MODEL_VARIANT_FINETUNE_TO_DISPLAY_NAME = {
         ...
         ("YOUR_MODEL", "1M10L", None): "YourModel (1 million cells, 10 layers)",
         ...
     }
     ```

---

## Step 5: (Optional) Extend `BaseSingleCellValidator`

If your model is a single-cell transcriptomic model and accepts AnnData objects as input, then it can extend `BaseSingleCellValidator`. This will enable the class to validate that the input `Dataset` provides the required organisms, obs keys, and var keys.

1. Add `BaseSingleCellValidator` as a parent class.
2. Specify the required organisms, obs keys, and var keys that are defined as class variables.
3. Specify `DataType.ANNDATA` as the model's input type via the `inputs()` method.
    Example:

    ```python
    ...
    from czbenchmarks.datasets.types import Organism
    from czbenchmarks.models.validators.base_single_cell_model_validator import BaseSingleCellValidator


    class YourModel(BaseModelImplementation, BaseSingleCellValidator):

         ...

         available_organisms = [Organism.HUMAN, Organism.MOUSE]  # Use appropriate Organism enums
         required_obs_keys = []  # Specify required obs keys, as needed
         required_var_keys = ["feature_name"]  # Use appropriate feature name

         @property
         def inputs(self) -> Set[DataType]:
              return { DataType.ANNDATA }

         ...
    ```

---

## Step 6: Create a Config File for the Model

1. Create a `config.yaml` file in your model's directory. This file will define the configuration parameters required for your model.
2. Include the `_target_` key to specify the model class and any additional parameters your model requires.

     Example:

     ```yaml
     _target_: model.YourModel
     your_param: 32
     another_param: "value"
     ```

    The config file may include any additional parameters required by your model.

---

## Step 7: Add `requirements.txt`

1. Create a `requirements.txt` file under `docker/your_model`.
2. Add required Python packages

---

## Step 8: Create a `Dockerfile`

1. Create a new file `docker/your_model/Dockerfile`
2. Specify Docker commands to build the Docker image, per the requirments of the model.

     Example:
     ```
     FROM nvidia/cuda:12.6.1-cudnn-runtime-ubuntu22.04

     WORKDIR /app

     RUN apt-get update && \
     apt-get install -y python3 python3-pip

     COPY docker/your_model/requirements.txt .
     RUN pip install --no-cache-dir -r requirements.txt

     COPY src /app/package/src
     COPY pyproject.toml /app/package/pyproject.toml
     COPY README.md /app/package/README.md

     RUN pip install -e /app/package[interactive]

     COPY docker/your_model/model.py .
     COPY docker/your_model/config.yaml .
     # Specify additional files here, as neeeded

     ENTRYPOINT ["python3", "-u", "/app/model.py"]
     ``` 

3. Add an entry for your model's Docker image location in `src/czbenchmarks/conf/models.yaml`:

     Example:

     ```
     models:
          YOUR_MODEL:
          model_image_uri: cz-benchmarks-models-public:YOUR_MODEL

     ...
     ```

---

## Step 9: Build Your Model

1. Build the Docker container using the `Dockerfile` you created. Run the following command, replacing `your_model` with the appropriate values:

     ```sh
     docker build -t cz-benchmarks-models:your_model -f docker/your_model/Dockerfile .
     ```

2. Optionally, add the Docker build command to your project's `Makefile` for easier execution. For example:

     ```makefile
     .PHONY: your_model
     your_model:
          docker build -t cz-benchmarks-models:your_model -f docker/your_model/Dockerfile .
     ```

---

## Step 10: Test Your Model

Test the Docker container to ensure it works as expected. You can run the container and verify its functionality by executing your model on a sample dataset using the `czbenchmarks.runner.run_inference()` method.

Example:

```python
import logging
import sys
from czbenchmarks.datasets.utils import load_dataset
from czbenchmarks.runner import run_inference

if __name__ == "__main__":
     logging.basicConfig(level=logging.INFO, stream=sys.stdout)
     dataset = load_dataset("tsv2_bone_marrow")  # Specify a dataset from models.yaml that can be used as input to your model

     dataset = run_inference("YOUR_MODEL", dataset)

     print(dataset.get_output("YOUR_MODEL", "EMBEDDING"))
```

For details on creating a custom dataset, refer to the [Add a Custom Dataset](../how_to_guides/add_custom_dataset.md) guide.
     
---

## Additional Notes

- For guidance, review existing implementations such as `docker/scvi` or `docker/scgpt`. These examples can help you understand best practices and common patterns.
- Use the `assets/` directory to store supplementary files your model might need, such as pre-trained weights, vocabularies, or other resources. Keeping these files organized ensures your model remains portable and easy to manage.