Package and Access External Data Files in your Python Wheels with PDM

Posted on 2024/09/01 by R. Kazeno

When building a Python package that needs to include data files such as binaries, configuration files, etc., the usual recommendation is to include those files as part of the package you’re building, i.e. having them in the same parent directory as the rest of your source files. The only reason for that is making the configuration of your package build system much simpler, but in exchange for that, you end up mixing files that are not really part of your source code into the folders that actually contain your code.

If this isn’t something that bothers you, and you see no value in separation of concerns at the directory level, please soldier on. However, if you would prefer to keep your auxiliary data (and even specific scripts!) external to your source code, while still being able to incorporate it into (and access it from) your installable Python package, this is not only possible, but also relatively simple with modern Python tools and standards.

The Scenario

For the example system we will be implementing here, we will be using the following src layout:

.
├── README.md
├── pyproject.toml
├── src/
│   └── awesome_package/
│       ├── __init__.py
│       └── awesome_module.py
├── awesome_auxiliary_data/
│   ├── awesome_image.png
│   └── awesome_binary.xls
└── awesome_auxiliary_scripts/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

We will be using pyproject.toml to configure the build system, and PDM to actually build the final Python package wheel, which will contain the auxiliary_data and auxiliary_scripts as data files separate from the source code.

You can download the final example files from the example GitHub repository.

Installing PDM

As we will be using PDM for this example, if you haven’t installed it yet, just run this on your terminal:

pip install --upgrade pdm

Configuring the Build System

If you don’t already have a pyproject.toml file, you can have PDM create it for you. Just run the following on your terminal from within your project’s directory, and follow the prompts that will ask you for your project’s details:

pdm init

Once you’ve done that, or if you already had a pyproject.toml file, we will need to add or modify a couple of sections in it. Open the file in your editor, and first we’ll make sure that PDM is configured as the build system:

# Inside your pyproject.toml file

[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

Next, we’ll mark the project as “distributable”, ensure the src directory is included and considered the “package directory”, and we’ll also add our auxiliary file directories as source includes so that they are also added to the sdist, as some build tools build the wheels directly from the sdist:

# Inside your pyproject.toml file

[tool.pdm]
distribution = true

[tool.pdm.build]
includes = [
    "src",
]
package-dir = "src"
source-includes = [
    "awesome_auxiliary_data",
    "awesome_auxiliary_scripts",
]

Lastly, we’ll include the wheel data file configuration:

# Inside your pyproject.toml file

[tool.pdm.build.wheel-data]
data = [    # install the auxiliary data dir at the python default data scheme dir
    {path = "awesome_auxiliary_data/*", relative-to = "."},
]
scripts = [    # install the auxiliary scripts dir at the python default scripts scheme dir
    {path = "awesome_auxiliary_scripts/*", relative-to = "."},
]

This syntax is specific to PDM, and allows you to choose which auxiliary files to install into which Python install scheme paths.

Just as a recap, Python currently defines 8 install scheme paths, each with a specific file type in mind. Of these, PDM’s wheel-data table accepts purelib, platlib, include, platinclude, scripts and data as keys:

stdlib: directory containing the standard Python library files that are not platform-specific.
platstdlib: directory containing the standard Python library files that are platform-specific.
platlib: directory for site-specific, platform-specific files.
purelib: directory for site-specific, non-platform-specific files (‘pure’ Python).
include: directory for non-platform-specific header files for the Python C-API.
platinclude: directory for platform-specific header files for the Python C-API.
scripts: directory for script files.
data: directory for data files.
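
To see where these schemes resolve on your own machine, here is a minimal sketch using only the standard sysconfig module. The helper name scheme_paths is just for illustration:

```python
import sysconfig


def scheme_paths() -> dict:
    """Map every install scheme path name to its directory for this interpreter."""
    return {name: sysconfig.get_path(name) for name in sysconfig.get_path_names()}


# Print one line per scheme, e.g. the "data" and "scripts" directories used in this post
for name, path in scheme_paths().items():
    print(f"{name:12} {path}")
```

The exact directories depend on your platform and on whether you are inside a virtual environment.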

In this example we’re only using the most common ones for auxiliary files, which are data and scripts. Also, you might have noticed that we’re defining the wheel data paths to be relative to the parent folder with relative-to = ".": this is because we want the package installation to copy the whole awesome_auxiliary_data and awesome_auxiliary_scripts directories into their respective install scheme directories as subdirectories, instead of just dumping the files there directly. This is a common courtesy to make sure we’re not overwriting some file with the same name that might have been installed there by a different package.

After the package installation, these should be the end results inside the {data install scheme dir} and {scripts install scheme dir} directories (which will be in different locations depending on the platform or operating system):

{data install scheme dir}
└── awesome_auxiliary_data/
    ├── awesome_image.png
    └── awesome_binary.xls

{scripts install scheme dir}
└── awesome_auxiliary_scripts/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

Example Auxiliary Data and Scripts

The example GitHub repository contains some binary files in the awesome_auxiliary_data subdirectory and some Python modules in the awesome_auxiliary_scripts directory. The binary files are just a PNG image and an Excel file. As for the modules, they just include a test function each, to show that they are running from outside the installed package.


Accessing the Auxiliary Data and Scripts from within the Package

From within a Python module, you can get the path to any of the install scheme directories on the current machine with the sysconfig.get_path function. For this example, we can get the {data install scheme dir} and {scripts install scheme dir} paths like this:
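
A minimal sketch of that lookup, assuming the directory names from the layout above (the helper name auxiliary_dirs is hypothetical, and this is not necessarily the exact code from the example repository):

```python
import sysconfig
from pathlib import Path


def auxiliary_dirs():
    """Return the expected (data, scripts) directories of the installed package."""
    data_dir = Path(sysconfig.get_path("data")) / "awesome_auxiliary_data"
    scripts_dir = Path(sysconfig.get_path("scripts")) / "awesome_auxiliary_scripts"
    return data_dir, scripts_dir


data_dir, scripts_dir = auxiliary_dirs()
print(data_dir)
print(scripts_dir)
```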

Of course, those additional files will not exist in those directories until the package is installed.

Finishing the Proof of Concept

To test that our installed package can effectively access those auxiliary files, we will add a module to our package that copies the auxiliary data files into your current directory and executes the functions inside the auxiliary scripts. This will be the main module of the package:
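
As a rough sketch of what such a module could look like (not necessarily the code in the example repository): the helper names below are illustrative, and "test" is a hypothetical name for the function each auxiliary script exposes.

```python
"""Sketch of awesome_module.py; helper and function names are assumptions."""
import importlib.util
import shutil
import sysconfig
from pathlib import Path


def copy_data_files(src_dir: Path, dest_dir: Path) -> list:
    """Copy every regular file in src_dir into dest_dir and return the new paths."""
    if not src_dir.is_dir():  # e.g. the package has not been installed yet
        return []
    return [Path(shutil.copy2(f, dest_dir)) for f in sorted(src_dir.iterdir()) if f.is_file()]


def load_script(path: Path):
    """Import a Python file that lives outside any package, by its file path."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def main() -> None:
    data_dir = Path(sysconfig.get_path("data")) / "awesome_auxiliary_data"
    scripts_dir = Path(sysconfig.get_path("scripts")) / "awesome_auxiliary_scripts"
    for copied in copy_data_files(data_dir, Path.cwd()):
        print(f"Copied {copied.name}")
    for script in sorted(scripts_dir.glob("*.py")):
        func = getattr(load_script(script), "test", None)  # hypothetical function name
        if callable(func):
            print(func())


if __name__ == "__main__":
    main()
```

Loading the scripts by file path keeps them out of the package’s import namespace, which is the point of installing them separately.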


Build the Distributables

Now that we have all our files in place, we can use PDM to build the package by running on the terminal:

pdm build

This will create a subdirectory called dist, build the wheel and the sdist, and put them there.
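
If you want to check that the auxiliary files really ended up in the wheel, you can inspect it with the standard zipfile module, since a wheel is a zip archive that stores install scheme files under a "name-version.data/scheme/" directory. A minimal sketch, where the helper name wheel_data_entries is hypothetical:

```python
import zipfile
from collections import defaultdict


def wheel_data_entries(wheel_path):
    """Group files stored under '<name>-<version>.data/<scheme>/' by scheme name."""
    entries = defaultdict(list)
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            parts = name.split("/")
            # Skip regular package files and bare directory entries
            if len(parts) >= 3 and parts[0].endswith(".data") and parts[-1]:
                entries[parts[1]].append("/".join(parts[2:]))
    return dict(entries)


# Example call (the wheel filename depends on your project name and version):
# print(wheel_data_entries("dist/awesome_package-1.0.0-py3-none-any.whl"))
```

If the configuration took effect, the result should contain a data key and a scripts key listing the auxiliary files.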

Install the Package

Now that you have the wheel created in the previous step, you can install it in any environment. In my case it was named awesome_package-1.0.0-py3-none-any.whl, so to install it I just run the following on the terminal:

pip install %PATH_TO_THE_DIST_DIR%/awesome_package-1.0.0-py3-none-any.whl

where %PATH_TO_THE_DIST_DIR% is the path of the directory containing the wheel.

Run this Puppy!

Now that we have the package installed, we can import it into our scripts as any other library. However, the easiest way to run it and see it in action is just running the main module directly from the terminal:

python -m awesome_package.awesome_module

This should copy the auxiliary data files into the current working directory on your terminal, and print the outputs of the auxiliary scripts, showing that they were correctly loaded from their corresponding Python install scheme directories.

Conclusion

In this example, we accomplished the following:

Added auxiliary data files and scripts to a Python package wheel, without them being part of the source code.
Those data files and scripts were automatically copied to the corresponding Python install scheme paths when the package was installed with pip install.
Once installed, our package was able to directly access those installed data files and scripts.

With this, we were able to cleanly separate those auxiliary files from the rest of our source code, and it allows us to do more advanced things, for example storing templates that we can use for setting up project boiler plate or scaffolding for the users of our framework. (Whether that particular use case is a good idea or not, is a whole other discussion, though.)

Did you find this useful? What would you use this for? Leave a comment and let me know!

A Simple Yet Powerful Cache Tagging Strategy for Redis

Posted on 2024/03/09 by R. Kazeno

When using Redis for caching data related to specific entities, a basic requirement is being able to invalidate all the data related to any entity we update, so that any subsequent request related to that entity will refresh its cache. This is usually called “tagging”, as the basic idea is to tag the data you’re caching with the identifiers of the entities related to that data, so that when one of your entities changes, you (somehow) invalidate all the data tagged with that entity’s identifier.

Active Data Invalidation

The usual advice for implementing this on Redis is based on an active data invalidation strategy, where we actively invalidate (read: delete) all of the cache entries for any tag at the point where we’re updating its data.

Writing to the Cache

1. Insert a record with the data you’re caching, usually as a string, with some identifier of your data as the key and your serialized data as the value.
SET mydata1 "Serialized data for mydata1"
SET mydata2 "Serialized data for mydata2"
SET mydata3 "Serialized data for mydata3"
2. Take the data identifier key you used in (1) and add it to a set or list corresponding to each associated tag record.
SADD tag:mytag1 mydata1 mydata2
SADD tag:mytag2 mydata3

Reading from the Cache

1. Just get the cached data by key, using its identifier, and deserialize it.
GET mydata1
2. If there’s no data in the cache: retrieve it from persistent storage, generate it, etc. and then write it to cache.

Invalidating Tagged Data in the Cache

1. Get the record corresponding to the tag you want to invalidate.
SMEMBERS tag:mytag1
2. For each data identifier key in that record, delete the record by key.
DEL mydata1 mydata2
3. Delete the record for the tag, effectively invalidating it.
DEL tag:mytag1
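
The write and invalidation procedures above can be sketched in Python. This is only a sketch under stated assumptions: client is assumed to be a redis-py style client (for example one created with decode_responses=True), and the function names are hypothetical.

```python
def active_write(client, key, serialized, tags):
    """Store the serialized data and register its key under every tag set."""
    client.set(key, serialized)
    for tag in tags:
        client.sadd(f"tag:{tag}", key)


def active_read(client, key):
    """Return the cached value, or None when the caller must regenerate it."""
    return client.get(key)


def active_invalidate(client, tag):
    """Delete every key registered under the tag, then the tag set itself."""
    tag_key = f"tag:{tag}"
    keys = client.smembers(tag_key)
    client.delete(*keys, tag_key)  # one DEL call for the data keys and the tag record
    return len(keys)
```

Note that the SMEMBERS and DEL calls are separate round trips, so another client can still read a stale entry in between.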

Pros and Cons

Advantages

Relatively simple to implement

Disadvantages

Slow invalidation, as it first requires retrieving the list of keys associated with that tag, then deleting them as well as the tag record. This slows down any write involving updating a tag’s associated data.
Greater risk of inconsistent data reads, as another client might read stale cache data while keys are being invalidated.
If any of your tag records get evicted or deleted by mistake, there’s no easy way to invalidate those records.

Variants

There are slightly more elaborate variants of this basic idea for specific use cases. For example, when you need to easily access the tags corresponding to each cached data record, you can use primary and secondary indices.

The Alternative: Passive Data Invalidation

This tagging strategy is based on keeping numeric “versions” of each tag. The invalidation becomes much easier and quicker as the tag’s version number simply has to be increased, at the cost of making the cache retrieval slightly more complex since you need to check if the version number of the data corresponds to the tag’s current version.

For this we can use Redis hashes, which are a very handy data structure that works as a collection of field-value pairs.

Writing to the Cache

All the tag versions will be saved in a hash record with an easily remembered key, for example tags, @tags, etc. We’ll use @tags in the rest of this example.

1. Read your @tags record to get the current versions of your tags. If it hasn’t been created yet it will just be empty.
HGETALL @tags
2. Insert a record for the data you’re caching as another hash, with some identifier of your data as the key. The serialized data you’re caching would go into a specific field in that hash, for example let’s call it @data. Also, for each tag associated with this data we add a field named after the tag, with the current version of that tag, as read in step (1), as the value. For tags not present in @tags we use 0 as the version. (HSET accepts multiple field-value pairs since Redis 4.0, which deprecated HMSET.)
HSET mydata1 @data "Serialized data for mydata1" mytag1 1 mytag2 0

Reading from the Cache

1. Retrieve the current tag versions from @tags.
HGETALL @tags
2. Get the cached data hash by key, using its identifier. If it’s not in cache: retrieve it from persistent storage, generate it, etc., and then write it to cache.
HGETALL mydata1
3. Compare the version of each tag in your cached data hash with its current version in @tags. If it is not exactly the same for any of the tags then ignore the cached data, and instead: retrieve it from persistent storage, generate it, etc., and then write it to cache.
4. If all the tag versions in your cached data are current, then deserialize the @data field.

Invalidating Tagged Data in the Cache

Increase by 1 the value of the field named after your tag, inside your @tags record. Helpfully, Redis creates this field automatically if it doesn’t exist, and will end up with value 1 after the increment.
HINCRBY @tags mytag1 1
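
Putting the write, read and invalidate steps of this strategy together, here is a minimal Python sketch. It assumes a redis-py style client (for example one created with decode_responses=True so that hash values come back as strings) passed in as client; the function names are hypothetical, and a tag literally named @data would collide with the data field.

```python
TAGS_KEY = "@tags"    # hash holding the current version of every tag
DATA_FIELD = "@data"  # field holding the serialized payload in each cached hash


def write_tagged(client, key, serialized, tags):
    """Cache the data together with the current version of each of its tags."""
    current = client.hgetall(TAGS_KEY)
    record = {DATA_FIELD: serialized}
    for tag in tags:
        record[tag] = current.get(tag, "0")  # tags never invalidated start at 0
    client.hset(key, mapping=record)


def read_tagged(client, key):
    """Return the cached data, or None if it is missing or any tag is stale."""
    current = client.hgetall(TAGS_KEY)
    record = client.hgetall(key)
    if DATA_FIELD not in record:
        return None
    for field, version in record.items():
        if field != DATA_FIELD and str(version) != str(current.get(field, "0")):
            return None
    return record[DATA_FIELD]


def invalidate_tag(client, tag):
    """Bump the tag version; every record stored with the old version is now stale."""
    return client.hincrby(TAGS_KEY, tag, 1)
```

A caller that gets None from read_tagged regenerates the data and calls write_tagged again, which stamps the record with the new tag versions.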

Pros and Cons

Advantages

Invalidating tags is super quick and easy, as you just increment a value for each tag you updated.
Less risk of inconsistent data reads, as tag invalidations are atomic.
If you’re evicting by least recently or least frequently used, your @tags record should not be evicted as you will be accessing it all the time.

Disadvantages

It’s a bit more complex to implement, especially the reading part.
Reading data from cache requires also reading the @tags record. This can be mitigated by reading that record right after connecting to Redis and keeping it in memory for any subsequent cache retrievals during the same connection. Doing that also helps to keep the data consistent, as it’s less probable that you will mix stale and updated data in the same connection.

Conclusion

I have successfully used this approach for caching DB results while tagging them based on the tables the data was retrieved from. When writing to any table I’m also increasing its tag version, and all subsequent reads of the cached results based on statements such as SELECT or JOIN involving that table will automatically refresh that cached data. Systems that have a more robust entity tagging system can use the same idea for tagging individual entities inside the DB, like a particular user, blog post, etc.

If there’s interest, I might add an example of a PhpRedis implementation for setting, getting, and invalidating tagged data using this strategy. In the meantime, if you end up implementing this strategy, please let me know how you implemented and/or modified it for your particular use case, and what cool things you learned in the process!

MySQL Function for Calculating the Working Days between 2 Dates

Posted on 2019/08/07 by R. Kazeno

Lately I needed a MySQL stored function to calculate the working days (or business days) between 2 dates. However, all the solutions I found online were either not configurable in terms of which week days count as working days, or really hard to read and understand. So I rolled my own, and decided to post it here in case anyone else finds it useful.

Here’s the function declaration code, as well as a usage example, hosted on GitHub (you can find it at https://gist.github.com/kazeno/8bad9453d1e4d2aed33e6af14d1aa7a1 if it’s not showing in your browser):

The function accepts 2 dates, as well as a string that specifies which week days should count as working days. The week days are input as the integers corresponding to their WEEKDAY function representation, i.e.:

0 = Monday
1 = Tuesday
2 = Wednesday
3 = Thursday
4 = Friday
5 = Saturday
6 = Sunday

Thus if the working days are Monday to Friday, the workdays argument would be ‘01234’, and if for a more abstract example the working days are Tuesday to Thursday plus Saturday, the workdays argument would then become ‘1235’.

The function itself determines the start and end dates from the first 2 arguments (so you can use it with the earlier and later dates in any position), counts the number of whole weeks (Monday to Sunday) between the 2 dates, and then loops through the remaining days not belonging to a whole week and counts them if they are contained in the 3rd argument.
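
For readers who want to follow the logic outside MySQL, here is a Python sketch of the same idea. It is not a translation of the stored function: the name working_days is hypothetical, it assumes both end dates are counted, and it counts full 7-day blocks from the start date rather than Monday-to-Sunday weeks (any 7 consecutive days contain each week day exactly once, so the total is the same).

```python
from datetime import date, timedelta


def working_days(date1: date, date2: date, workdays: str = "01234") -> int:
    """Count the days from date1 to date2 (inclusive) whose weekday digit is in workdays."""
    start, end = sorted((date1, date2))  # accept the dates in either order
    total_days = (end - start).days + 1
    full_weeks, remainder = divmod(total_days, 7)
    count = full_weeks * len(set(workdays))  # each full week has every working day once
    for offset in range(remainder):
        day = start + timedelta(days=full_weeks * 7 + offset)
        # date.weekday() uses the same numbering as MySQL's WEEKDAY(): 0 = Monday
        if str(day.weekday()) in workdays:
            count += 1
    return count


print(working_days(date(2019, 8, 5), date(2019, 8, 11)))          # Monday to Sunday
print(working_days(date(2019, 8, 11), date(2019, 8, 5), "1235"))  # reversed order
```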

Hope you found it useful!

Display Custom Equation Numbers in Wolfram Mathematica

Posted on 2017/07/16 by R. Kazeno

If you would like to use the EquationNumbered style for typesetting, but using your own equation numbers or designations instead of the automatically generated ones, you can simply edit the cell expression by going into the menu Cell > Show Expression. If you do this on a blank EquationNumbered cell, you should get code like the following:

Cell[BoxData[ FormBox["", TraditionalForm]], "EquationNumbered"]

There, you need to override the CellFrameLabels option of the parent Cell, with the following code:

{{None, Cell[ TextData[{"(YOUR_DESIGNATION_HERE)"}], "EquationNumbered"]}, {None, None}}

Substitute the YOUR_DESIGNATION_HERE text with the number or designation you want to assign to your equation. For example if you want to call it A1, your cell should end up like this:

Cell[BoxData[ FormBox["", TraditionalForm]], "EquationNumbered", CellFrameLabels->{{None, Cell[ TextData[{"(A1)"}], "EquationNumbered"]}, {None, None}}]

Now you can disable the menu Cell > Show Expression option, and your cell is ready for you to type your equation in!

Laplace's equation in polar coordinates

Clear the Field of a non-Editable AngularUI Typeahead

Posted on 2015/03/03 by R. Kazeno

When working with non-editable AngularUI Typeaheads (i.e. ones that have the typeahead-editable attribute set to false), a logical thing to do would be to erase the typeahead input field’s view value if the user doesn’t select a valid typeahead option. However, there’s no built-in option to do just that, so here’s a workaround.

The main idea here is to use the ng-blur attribute to set a function that will clear the typeahead field if no valid option was selected. To do that, we first need to access by name the form controller object containing the typeahead. If the typeahead is not contained in a form element, what we can do is declare one of its parent elements as a form controller using ng-form, and give names to both of them:

<div ng-form name="myForm">
    <input type="text" name="myField" ng-model="myField" ng-blur="clearUnselected()" typeahead="item for item in myFields" typeahead-editable="false" />
</div>

Now we can access the typeahead’s form properties from the AngularJS controller, and reset its $viewValue property if no valid option was selected by the user. We wrap this into an AngularJS $timeout because there’s a delay between clicking on an option and the corresponding model updating. Don’t forget to inject the $timeout service into your controller!

//inside your controller
$scope.clearUnselected = function () {
    $timeout(function () {
        if (!$scope.myField) {   //the model was not set by the typeahead
            $scope.myForm.myField.$setViewValue('');
            $scope.myForm.myField.$render();
        }
    }, 250);    //a 250 ms delay should be safe enough
};

PrestaShop Module Development by Fabien Serny – a book review

Posted on 2015/02/18 by R. Kazeno

This book, written by one of the original members of the PrestaShop developer team, is an invaluable resource for anyone whose income depends at least in part on developing PrestaShop modules. The time savings alone, compared to browsing through the PrestaShop classes and native modules just to find an example of functionality similar to what you require, are definitely worth the price of admission.

The book outlines PrestaShop best practices, goes into detail on the different types of modules, and describes the usage of helper form and list classes, the context object, overrides, and how to handle module updates. Part of this information can also be found in PrestaShop’s developer guide, but the book goes into much more detail and lays out quite explanatory code samples. The most useful part of it all, however, might be the Native Hooks Appendix, which lists all 145 native hooks (as of PrestaShop 1.6) with their descriptions, parameters, and the files they are called from.

As for stuff missing from the book, I believe a section on Functional Testing of modules would have been really useful, since the PrestaShop developer community would surely benefit from incorporating automated testing into its development standards. Apart from this, the book’s a real time saver if you’re a PrestaShop beginner, and even if you’ve got quite a few modules under your belt, it should at least help to improve the quality of your code.

Thinking around the Box (and other Models)

Stuff I work on in e-commerce programming, mathematical modelling, and random musings
