Package and Access External Data Files in your Python Wheels with PDM

When building a Python package that needs to include data files such as binaries or configuration files, the usual recommendation is to ship those files as part of the package itself, i.e. to keep them in the same parent directory as the rest of your source files. The main reason is that it makes the configuration of your package build system much simpler, but in exchange you end up mixing files that are not really part of your source code into the folders that contain your code.

If this isn’t something that bothers you, and you see no value in separation of concerns at the directory level, please soldier on. However, if you would prefer to keep your auxiliary data (and even specific scripts!) external to your source code, while still being able to incorporate it into (and access it from) your installable Python package, this is not only possible but also relatively simple with modern Python tools and standards.

The Scenario

For the example system we will be implementing here, we will be using the following src layout:

.
├── README.md
├── pyproject.toml
├── src/
│   └── awesome_package/
│       ├── __init__.py
│       └── awesome_module.py
├── awesome_auxiliary_data/
│   ├── awesome_image.png
│   └── awesome_binary.xls
└── awesome_auxiliary_scripts/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

We will be using pyproject.toml to configure the build system, and PDM to build the final Python package wheel, which will contain the awesome_auxiliary_data and awesome_auxiliary_scripts directories as data files separate from the source code.

You can download the final example files from the example GitHub repository.

Installing PDM

As we will be using PDM for this example, if you haven’t installed it yet, just run this in your terminal:

pip install --upgrade pdm

Configuring the Build System

If you don’t already have a pyproject.toml file, you can have PDM create it for you. Just run the following on your terminal from within your project’s directory, and follow the prompts that will ask you for your project’s details:

pdm init

Once you’ve done that, or if you already had a pyproject.toml file, we will need to add or modify a couple of sections in it. Open the file in your editor, and first we’ll make sure that PDM is configured as the build system:

# Inside your pyproject.toml file

[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

Next, we’ll mark the project as “distributable”, ensure the src directory is included and considered the “package directory”, and we’ll also add our auxiliary file directories as source includes so that they are also added to the sdist, as some build tools build the wheels directly from the sdist:

# Inside your pyproject.toml file

[tool.pdm]
distribution = true

[tool.pdm.build]
includes = [
    "src",
]
package-dir = "src"
source-includes = [
    "awesome_auxiliary_data",
    "awesome_auxiliary_scripts",
]

Lastly, we’ll include the wheel data file configuration:

# Inside your pyproject.toml file

[tool.pdm.build.wheel-data]
data = [    # install the auxiliary data dir at the python default data scheme dir
    {path = "awesome_auxiliary_data/*", relative-to = "."},
]
scripts = [    # install the auxiliary scripts dir at the python default scripts scheme dir
    {path = "awesome_auxiliary_scripts/*", relative-to = "."},
]

This syntax is specific to PDM, and allows you to choose which auxiliary files to install into which Python install scheme paths.

Just as a recap, Python’s sysconfig module currently defines 8 install scheme paths, each with a specific file type in mind. A wheel can install files into most of them, though not into stdlib or platstdlib, which belong to the interpreter itself:

  • stdlib: directory containing the standard Python library files that are not platform-specific.
  • platstdlib: directory containing the standard Python library files that are platform-specific.
  • platlib: directory for site-specific, platform-specific files.
  • purelib: directory for site-specific, non-platform-specific files (‘pure’ Python).
  • include: directory for non-platform-specific header files for the Python C-API.
  • platinclude: directory for platform-specific header files for the Python C-API.
  • scripts: directory for script files.
  • data: directory for data files.
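
To see where these scheme paths point on your own machine, here is a minimal sketch using the standard library's sysconfig module. The exact set of keys and their locations depend on your platform and Python version:

```python
import sysconfig

# get_paths() returns a dict that maps scheme path names to directories
# for the default install scheme of the running interpreter.
paths = sysconfig.get_paths()

for name, directory in sorted(paths.items()):
    print(f"{name:12} {directory}")
```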

In this example we’re only using the most common ones for auxiliary files, which are data and scripts. Also, you might have noticed that we’re defining the wheel data paths to be relative to the parent folder with relative-to = ".". This is because we want the package installation to copy the whole awesome_auxiliary_data and awesome_auxiliary_scripts directories into their respective install scheme directories as subdirectories, instead of dumping the files there directly. This is a common courtesy to make sure we’re not overwriting a file with the same name that a different package might have installed there.

After the package installation, these should be the end results inside the {data install scheme dir} and {scripts install scheme dir} directories (which will be in different locations depending on the platform or operating system):

{data install scheme dir}
└── awesome_auxiliary_data/
    ├── awesome_image.png
    └── awesome_binary.xls

{scripts install scheme dir}
└── awesome_auxiliary_scripts/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py
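
The effect of relative-to described above can be illustrated with plain pathlib arithmetic. This is only a sketch of the path mapping as this article describes it, not PDM's actual implementation; installed_path is a hypothetical helper and /opt/venv is a made-up scheme directory:

```python
from pathlib import PurePosixPath


def installed_path(scheme_dir: str, source: str, relative_to: str) -> PurePosixPath:
    """Hypothetical helper: where a source file would land under a scheme dir."""
    # Strip the relative-to prefix from the source path, then append what
    # remains to the install scheme directory.
    return PurePosixPath(scheme_dir) / PurePosixPath(source).relative_to(relative_to)


source = "awesome_auxiliary_data/awesome_image.png"

# relative-to = "." keeps the containing directory:
print(installed_path("/opt/venv", source, "."))
# → /opt/venv/awesome_auxiliary_data/awesome_image.png

# A relative-to of "awesome_auxiliary_data" would drop that directory level:
print(installed_path("/opt/venv", source, "awesome_auxiliary_data"))
# → /opt/venv/awesome_image.png
```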

Example Auxiliary Data and Scripts

The example GitHub repository contains some binary files in the awesome_auxiliary_data subdirectory and some Python modules in the awesome_auxiliary_scripts directory. The binary files are just a PNG image and an Excel file. As for the modules, they just include a test function each, to show that they are running from outside the installed package.
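
As an illustration of what such a script might look like, here is a minimal sketch. The function name report_location and its message are assumptions made for this example, not the contents of the repository's files:

```python
# Hypothetical contents for awesome_auxiliary_scripts/generate_awesomeness.py


def report_location() -> str:
    """Return a message that shows which file this function was loaded from."""
    # __file__ is set when the script is run or loaded from disk; the
    # fallback only covers interactive sessions.
    location = globals().get("__file__", "<unknown location>")
    return f"Generating awesomeness from: {location}"


if __name__ == "__main__":
    print(report_location())
```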

Accessing the Auxiliary Data and Scripts from within the Package

From within a Python module, you can get the path to any of the install scheme directories on the current machine with the sysconfig.get_path function. For this example, we can get the {data install scheme dir} and {scripts install scheme dir} paths like this:
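
A minimal sketch of that lookup, assuming the directory names from the layout above:

```python
import sysconfig
from pathlib import Path

# Base directories of the "data" and "scripts" install schemes
data_scheme_dir = Path(sysconfig.get_path("data"))
scripts_scheme_dir = Path(sysconfig.get_path("scripts"))

# Subdirectories that the wheel-data configuration is expected to create
auxiliary_data_dir = data_scheme_dir / "awesome_auxiliary_data"
auxiliary_scripts_dir = scripts_scheme_dir / "awesome_auxiliary_scripts"

print(auxiliary_data_dir)
print(auxiliary_scripts_dir)
```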

Of course, those additional files will not exist in those directories until the package is installed.

Finishing the Proof of Concept

To test that our installed package can effectively access those auxiliary files, we will add a module to our package that copies the auxiliary data files into your current directory and executes the functions inside the auxiliary scripts. This will be the main module of the package:
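
A minimal sketch of what such a module could look like. The helper names copy_auxiliary_data and run_auxiliary_script, and the assumption that each script exposes a function called report_location, are illustrative choices rather than the repository's actual code:

```python
# Hypothetical sketch of src/awesome_package/awesome_module.py
import importlib.util
import shutil
import sysconfig
from pathlib import Path

DATA_DIR = Path(sysconfig.get_path("data")) / "awesome_auxiliary_data"
SCRIPTS_DIR = Path(sysconfig.get_path("scripts")) / "awesome_auxiliary_scripts"


def copy_auxiliary_data(destination: Path, data_dir: Path = DATA_DIR) -> list[Path]:
    """Copy every file in data_dir into destination and return the new paths."""
    return [
        Path(shutil.copy2(item, destination))
        for item in sorted(data_dir.iterdir())
        if item.is_file()
    ]


def run_auxiliary_script(script: Path, function_name: str):
    """Load a script from its file path and call one of its functions."""
    spec = importlib.util.spec_from_file_location(script.stem, script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, function_name)()


if __name__ == "__main__":
    # The directories only exist once the wheel has been installed.
    if DATA_DIR.is_dir():
        for copied in copy_auxiliary_data(Path.cwd()):
            print(f"Copied {copied.name}")
    if SCRIPTS_DIR.is_dir():
        for script in sorted(SCRIPTS_DIR.glob("*.py")):
            # "report_location" is an assumed function name
            print(run_auxiliary_script(script, "report_location"))
```

Loading the scripts by file path keeps them outside the importable package, which is the point of the exercise: nothing under awesome_auxiliary_scripts needs to be on sys.path.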

Build the Distributables

Now that we have all our files in place, we can use PDM to build the package by running on the terminal:

pdm build

This will create a subdirectory called dist, build the wheel and the sdist, and put them there.
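
If you want to confirm that the auxiliary files made it into the wheel, keep in mind that a wheel is a zip archive and that, per the wheel format, files destined for install scheme directories live under a {name}-{version}.data/ folder inside it. A minimal sketch, where list_wheel_data is a hypothetical helper and the wheel file name is the one produced in this example (yours may differ):

```python
import zipfile
from pathlib import Path


def list_wheel_data(wheel_path: Path) -> list[str]:
    """Return the archive members stored under the wheel's .data/ folder."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            if name.split("/", 1)[0].endswith(".data")
        )


if __name__ == "__main__":
    wheel_file = Path("dist") / "awesome_package-1.0.0-py3-none-any.whl"
    if wheel_file.exists():
        for member in list_wheel_data(wheel_file):
            print(member)
```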

Install the Package

Now that you have the wheel created in the previous step, you can install it in any environment. In my case it was named awesome_package-1.0.0-py3-none-any.whl, so to install it I just run the following on the terminal:

pip install %PATH_TO_THE_DIST_DIR%/awesome_package-1.0.0-py3-none-any.whl

where %PATH_TO_THE_DIST_DIR% is the path of the directory containing the wheel.

Run this Puppy!

Now that we have the package installed, we can import it into our scripts like any other library. However, the easiest way to see it in action is to run the main module directly from the terminal:

python -m awesome_package.awesome_module

This should copy the auxiliary data files into the current working directory on your terminal, and print the outputs of the auxiliary scripts, showing that they were correctly loaded from their corresponding Python install scheme directories.

Conclusion

In this example, we accomplished the following:

  • Added auxiliary data files and scripts to a Python package wheel, without them being part of the source code.
  • Those data files and scripts were automatically copied to the corresponding Python install scheme paths when the package was installed with pip install.
  • Once installed, our package was able to directly access those installed data files and scripts.

With this, we were able to cleanly separate those auxiliary files from the rest of our source code. It also opens the door to more advanced uses, for example storing templates that we can use to set up project boilerplate or scaffolding for the users of our framework. (Whether that particular use case is a good idea or not is a whole other discussion, though.)

Did you find this useful? What would you use this for? Leave a comment and let me know!