Package and Access External Data Files in your Python Wheels with PDM

Posted on 2024/09/01 by R. Kazeno

When building a Python package that needs to include data files such as binaries, configuration files, etc., the usual recommendation is to include those files as part of the package you’re building, i.e. having them in the same parent directory as the rest of your source files. The only reason for that is making the configuration of your package build system much simpler, but in exchange for that, you end up mixing files that are not really part of your source code into the folders that actually contain your code.

If this isn’t something that bothers you, and you see no value in separation of concerns at the directory level, please soldier on. However, if you would prefer to keep your auxiliary data (and even specific scripts!) external to your source code, while still being able to incorporate it into your installable Python package and access it from there, this is not only possible, but also relatively simple with modern Python tools and standards.

The Scenario

For the example system we will be implementing here, we will be using the following src layout:

.
├── README.md
├── pyproject.toml
├── src/
│   └── awesome_package/
│       ├── __init__.py
│       └── awesome_module.py
├── awesome_auxiliary_data/
│   ├── awesome_image.png
│   └── awesome_binary.xls
└── awesome_auxiliary_scripts/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

We will be using pyproject.toml to configure the build system, and PDM to actually build the final Python package wheel, which will contain the auxiliary_data and auxiliary_scripts as data files separate from the source code.

You can download the final example files from the example GitHub repository.

Installing PDM

As we will be using PDM for this example, if you haven’t installed it yet, just run this on your terminal:

pip install --upgrade pdm

Configuring the Build System

If you don’t already have a pyproject.toml file, you can have PDM create it for you. Just run the following on your terminal from within your project’s directory, and follow the prompts that will ask you for your project’s details:

pdm init

Once you’ve done that, or if you already had a pyproject.toml file, we will need to add or modify a couple of sections in it. Open the file in your editor, and first we’ll make sure that PDM is configured as the build system:

# Inside your pyproject.toml file

[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

Next, we’ll mark the project as “distributable”, ensure the src directory is included and considered the “package directory”, and we’ll also add our auxiliary file directories as source includes so that they are also added to the sdist, as some build tools build the wheels directly from the sdist:

# Inside your pyproject.toml file

[tool.pdm]
distribution = true

[tool.pdm.build]
includes = [
    "src",
]
package-dir = "src"
source-includes = [
    "awesome_auxiliary_data",
    "awesome_auxiliary_scripts",
]

Lastly, we’ll include the wheel data file configuration:

# Inside your pyproject.toml file

[tool.pdm.build.wheel-data]
data = [    # install the auxiliary data dir at the python default data scheme dir
    {path = "awesome_auxiliary_data/*", relative-to = "."},
]
scripts = [    # install the auxiliary scripts dir at the python default scripts scheme dir
    {path = "awesome_auxiliary_scripts/*", relative-to = "."},
]

This syntax is specific to PDM, and allows you to choose which auxiliary files to install into which Python install scheme paths.

Just as a recap, Python currently uses 8 different install scheme paths, each with a specific file type in mind, and you can install stuff into any of them:

stdlib: directory containing the standard Python library files that are not platform-specific.
platstdlib: directory containing the standard Python library files that are platform-specific.
platlib: directory for site-specific, platform-specific files.
purelib: directory for site-specific, non-platform-specific files (‘pure’ Python).
include: directory for non-platform-specific header files for the Python C-API.
platinclude: directory for platform-specific header files for the Python C-API.
scripts: directory for script files.
data: directory for data files.

In this example we’re only using the most common ones for auxiliary files, which are data and scripts. Also you might have noticed that we’re defining the wheel data paths to be relative to the parent folder with relative-to = ".": this is because we want the package installation to copy the whole awesome_auxiliary_data and awesome_auxiliary_scripts directories into their respective install scheme directories as subdirectories, instead of just directly dumping the files. This is a common courtesy to make sure we’re not overwriting some file with the same name that might have been installed there by a different package.

After the package installation, these should be the end results inside the {data install scheme dir} and {scripts install scheme dir} directories (which will be in different locations depending on the platform or operating system):

{data install scheme dir}
└── awesome_auxiliary_data/
    ├── awesome_image.png
    └── awesome_binary.xls

{scripts install scheme dir}
└── awesome_auxiliary_scripts/
    ├── generate_awesomeness.py
    └── decrease_world_suck.py

Example Auxiliary Data and Scripts

The example GitHub repository contains some binary files in the awesome_auxiliary_data subdirectory and some Python modules in the awesome_auxiliary_scripts directory. The binary files are just a PNG image and an Excel file. As for the modules, they just include a test function each, to show that they are running from outside the installed package.


Accessing the Auxiliary Data and Scripts from within the Package

From within a Python module, you can get the path to any of the install scheme directories on the current machine with the sysconfig.get_path function. For this example, we can get the {data install scheme dir} and {scripts install scheme dir} paths like this:
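A minimal sketch of that lookup, using only the standard library. The variable names are just illustrative, and the subdirectory names assume the wheel-data configuration shown earlier:

```python
import sysconfig
from pathlib import Path

# Base directories of the "data" and "scripts" install schemes on this machine
data_dir = Path(sysconfig.get_path("data"))
scripts_dir = Path(sysconfig.get_path("scripts"))

# Our auxiliary files were installed as subdirectories of those base directories
auxiliary_data_dir = data_dir / "awesome_auxiliary_data"
auxiliary_scripts_dir = scripts_dir / "awesome_auxiliary_scripts"

print(auxiliary_data_dir)
print(auxiliary_scripts_dir)
```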

Of course, those additional files will not exist in those directories until the package is installed.

Finishing the Proof of Concept

To test that our installed package can effectively access those auxiliary files, we will add a module to our package that copies the auxiliary data files into your current directory and executes the functions inside the auxiliary scripts. This will be the main module of the package:
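The actual module lives in the example repository. As a rough sketch of what such a module might look like, the following copies the data files and runs the script functions. All function names here are hypothetical, and it assumes every public function in the auxiliary scripts can be called without arguments:

```python
import importlib.util
import inspect
import shutil
import sysconfig
from pathlib import Path

# Locations of the auxiliary files after installation (see the wheel-data config)
DATA_DIR = Path(sysconfig.get_path("data")) / "awesome_auxiliary_data"
SCRIPTS_DIR = Path(sysconfig.get_path("scripts")) / "awesome_auxiliary_scripts"


def copy_data_files(source_dir=DATA_DIR, target_dir=None):
    """Copy every file in source_dir into target_dir (default: current directory)."""
    target = Path(target_dir) if target_dir is not None else Path.cwd()
    return [
        Path(shutil.copy2(item, target))
        for item in sorted(Path(source_dir).iterdir())
        if item.is_file()
    ]


def load_script(path):
    """Import a Python file by path, since the scripts live outside any package."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_script_functions(scripts_dir=SCRIPTS_DIR):
    """Call each public function defined in each script and collect the results.

    Assumption: those functions take no arguments.
    """
    results = {}
    for script in sorted(Path(scripts_dir).glob("*.py")):
        module = load_script(script)
        for name, func in inspect.getmembers(module, inspect.isfunction):
            # Skip private helpers and functions imported from elsewhere
            if func.__module__ == module.__name__ and not name.startswith("_"):
                results[f"{script.stem}.{name}"] = func()
    return results
```

To run it with python -m, you would call copy_data_files() and run_script_functions() from under an if __name__ == "__main__": guard and print what they return.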


Build the Distributables

Now that we have all our files in place, we can use PDM to build the package by running on the terminal:

pdm build

This will create a subdirectory called dist, build the wheel and the sdist, and put them there.
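If you want to double check that the auxiliary files made it into the wheel before installing it, remember that a wheel is just a zip archive, and that files destined for an install scheme are stored under a {name}-{version}.data/{scheme}/ directory inside it. Here is a small sketch for listing them (the helper name is made up for this example):

```python
import zipfile


def wheel_data_files(wheel_path):
    """Return the files under the wheel's .data directory, grouped by install scheme."""
    grouped = {}
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            parts = name.split("/")
            # Expected shape: <name>-<version>.data/<scheme>/<relative path>
            if len(parts) >= 3 and parts[0].endswith(".data") and parts[-1]:
                grouped.setdefault(parts[1], []).append("/".join(parts[2:]))
    return grouped
```

Calling wheel_data_files("dist/awesome_package-1.0.0-py3-none-any.whl") should then show one entry for data and one for scripts, each listing the files of the corresponding auxiliary directory.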

Install the Package

Now that you have the wheel created in the previous step, you can install it in any environment. In my case it was named awesome_package-1.0.0-py3-none-any.whl, so to install it I just run the following on the terminal:

pip install %PATH_TO_THE_DIST_DIR%/awesome_package-1.0.0-py3-none-any.whl

where %PATH_TO_THE_DIST_DIR% is the path of the directory containing the wheel.

Run this Puppy!

Now that we have the package installed, we can import it into our scripts as any other library. However, the easiest way to run it and see it in action is just running the main module directly from the terminal:

python -m awesome_package.awesome_module

This should copy the auxiliary data files into the current working directory on your terminal, and print the outputs of the auxiliary scripts, showing that they were correctly loaded from their corresponding Python install scheme directories.

Conclusion

In this example, we accomplished the following:

Added auxiliary data files and scripts to a Python package wheel, without them being part of the source code.
Those data files and scripts were automatically copied to the corresponding Python install scheme paths when the package was installed with pip install.
Once installed, our package was able to directly access those installed data files and scripts.

With this, we were able to cleanly separate those auxiliary files from the rest of our source code. It also allows us to do more advanced things, for example storing templates that we can use for setting up project boilerplate or scaffolding for the users of our framework. (Whether that particular use case is a good idea or not is a whole other discussion, though.)

Did you find this useful? What would you use this for? Leave a comment and let me know!

A Simple Yet Powerful Cache Tagging Strategy for Redis

Posted on 2024/03/09 by R. Kazeno

When using Redis for caching data related to specific entities, a basic requirement is being able to invalidate all the data related to any entity we update, so that any subsequent request related to that entity will refresh its cache. This is usually called “tagging”, as the basic idea is to tag the data you’re caching with the identifiers of the entities related to that data, so that when one of your entities changes, you (somehow) invalidate all the data tagged with that entity’s identifier.

Active Data Invalidation

The usual advice for implementing this on Redis is based on an active data invalidation strategy, where we actively invalidate (read: delete) all of the cache entries for any tag at the point where we’re updating its data.

Writing to the Cache

1. Insert a record with the data you’re caching, usually as a string, with some identifier of your data as the key and your serialized data as the value.
SET mydata1 "Serialized data for mydata1"
SET mydata2 "Serialized data for mydata2"
SET mydata3 "Serialized data for mydata3"
2. Take the data identifier key you used in (1) and add it to a set or list corresponding to each associated tag record.
SADD tag:mytag1 mydata1 mydata2
SADD tag:mytag2 mydata3

Reading from the Cache

1. Just get the cached data by key, using its identifier, and deserialize it.
GET mydata1
2. If there’s no data in the cache: retrieve it from persistent storage, generate it, etc. and then write it to cache.

Invalidating Tagged Data in the Cache

1. Get the record corresponding to the tag you want to invalidate.
SMEMBERS tag:mytag1
2. For each data identifier key in that record, delete the record by key.
DEL mydata1 mydata2
3. Delete the record for the tag, effectively invalidating it.
DEL tag:mytag1
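To make the flow easier to follow, here is a rough Python sketch of the active strategy. So that it runs without a Redis server, it uses a tiny in-memory stand-in class (purely hypothetical, written for this example) that exposes set, get, sadd, smembers and delete. A redis-py client offers methods with the same names, so the two helper functions should carry over with little change:

```python
class InMemoryStore:
    """Hypothetical stand-in for a Redis client; only the commands used below."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def sadd(self, key, *members):
        self._data.setdefault(key, set()).update(members)

    def smembers(self, key):
        return set(self._data.get(key, set()))

    def delete(self, *keys):
        removed = 0
        for key in keys:
            if key in self._data:
                del self._data[key]
                removed += 1
        return removed


def write_tagged(client, key, serialized, tags):
    """Cache the data, then register its key under every associated tag."""
    client.set(key, serialized)
    for tag in tags:
        client.sadd(f"tag:{tag}", key)


def invalidate_tag_actively(client, tag):
    """Delete every cached record registered under the tag, plus the tag record."""
    tag_key = f"tag:{tag}"
    keys = client.smembers(tag_key)
    return client.delete(*keys, tag_key)
```

Note that the invalidation needs one round trip to read the tag's members and another to delete them, which is where the slowness mentioned below comes from.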

Pros and Cons

Advantages

Relatively simple to implement

Disadvantages

Slow invalidation, as it first requires retrieving the list of keys associated with that tag, then deleting them as well as the tag record. This slows down any write involving updating a tag’s associated data.
Greater risk of inconsistent data reads, as another client might read stale cache data while keys are being invalidated.
If any of your tag records get evicted or deleted by mistake, there’s no easy way to invalidate those records.

Variants

There are slightly more elaborate variations on this basic idea for specific use cases. For example, when you need to be able to easily access the tags corresponding to each cached data record, you can use primary and secondary indices.

The Alternative: Passive Data Invalidation

This tagging strategy is based on keeping numeric “versions” of each tag. The invalidation becomes much easier and quicker as the tag’s version number simply has to be increased, at the cost of making the cache retrieval slightly more complex since you need to check if the version number of the data corresponds to the tag’s current version.

For this we can use Redis hashes, which are a very handy data structure that works as a collection of field-value pairs.

Writing to the Cache

All the tag versions will be saved in a hash record with an easily remembered key, for example tags, @tags, etc. We’ll use @tags in the rest of this example.

1. Read your @tags record to get the current versions of your tags. If it hasn’t been created yet it will just be empty.
HGETALL @tags
2. Insert a record for the data you’re caching as another hash, with some identifier of your data as the key. The serialized data you’re caching would go into a specific field in that hash, for example let’s call it @data. Also, for each tag associated with this data we add a field named after the tag, with the current version of that tag, as read in step (1), as the value. For tags not present in @tags we use 0 as the version.
HMSET mydata1 @data "Serialized data for mydata1" mytag1 1 mytag2 0

Reading from the Cache

1. Retrieve the current tag versions from @tags.
HGETALL @tags
2. Get the cached data hash by key, using its identifier. If it’s not in cache: retrieve it from persistent storage, generate it, etc., and then write it to cache.
HGETALL mydata1
3. Compare the version of each tag in your cached data hash with its current version in @tags. If it is not exactly the same for any of the tags then ignore the cached data, and instead: retrieve it from persistent storage, generate it, etc., and then write it to cache.
4. If all the tag versions in your cached data are current, then deserialize the @data field.

Invalidating Tagged Data in the Cache

Increase by 1 the value of the field named after your tag, inside your @tags record. Helpfully, Redis creates this field automatically if it doesn’t exist, and will end up with value 1 after the increment.
HINCRBY @tags mytag1 1
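Putting the three operations together, here is a rough Python sketch of the passive strategy. As it should run without a Redis server, it relies on a small in-memory stand-in class (hypothetical, written just for this example) that mimics hgetall, hset and hincrby. A redis-py client has methods with those names too, though by default it returns bytes instead of strings unless it is created with decode_responses=True:

```python
TAGS_KEY = "@tags"    # hash holding the current version of every tag
DATA_FIELD = "@data"  # field holding the serialized payload in each record


class InMemoryHashes:
    """Hypothetical stand-in for a Redis client; only the hash commands used below."""

    def __init__(self):
        self._hashes = {}

    def hgetall(self, key):
        return dict(self._hashes.get(key, {}))

    def hset(self, key, mapping):
        fields = {field: str(value) for field, value in mapping.items()}
        self._hashes.setdefault(key, {}).update(fields)

    def hincrby(self, key, field, amount=1):
        record = self._hashes.setdefault(key, {})
        record[field] = str(int(record.get(field, "0")) + amount)
        return int(record[field])


def write_cache(client, key, serialized, tags):
    """Store the payload along with the current version of each of its tags."""
    versions = client.hgetall(TAGS_KEY)
    record = {DATA_FIELD: serialized}
    for tag in tags:
        record[tag] = versions.get(tag, "0")  # unknown tags start at version 0
    client.hset(key, mapping=record)


def read_cache(client, key):
    """Return the payload, or None on a cache miss or a stale tag version."""
    versions = client.hgetall(TAGS_KEY)
    record = client.hgetall(key)
    if not record:
        return None
    for field, version in record.items():
        if field != DATA_FIELD and versions.get(field, "0") != version:
            return None  # at least one tag was invalidated after the write
    return record[DATA_FIELD]


def invalidate_tag(client, tag):
    """Bump the tag's version; records tagged with the old version become stale."""
    return client.hincrby(TAGS_KEY, tag, 1)
```

When read_cache returns None, the caller is expected to regenerate the data and call write_cache again, which stores it with the new tag versions.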

Pros and Cons

Advantages

Invalidating tags is super quick and easy, as you just increment a value for each tag you updated.
Less risk of inconsistent data reads, as tag invalidations are atomic.
If you’re evicting by least recently or least frequently used, your @tags record should not be evicted as you will be accessing it all the time.

Disadvantages

It’s a bit more complex to implement, especially the reading part.
Reading data from cache requires also reading the @tags record. This can be mitigated by reading that record right after connecting to Redis and keeping it in memory for any subsequent cache retrievals during the same connection. Doing that also helps to keep the data consistent, as it’s less probable that you will mix stale and updated data in the same connection.

Conclusion

I have successfully used this approach for caching DB results while tagging them based on the tables the data was retrieved from. When writing to any table I’m also increasing its tag version, and all subsequent reads of the cached results based on statements such as SELECT or JOIN involving that table will automatically refresh that cached data. Systems that have a more robust entity tagging system can use the same idea for tagging individual entities inside the DB, like a particular user, blog post, etc.

If there’s interest, I might add an example of a PhpRedis implementation for setting, getting, and invalidating tagged data using this strategy. In the meantime, if you end up implementing this strategy, please let me know how you implemented and/or modified it for your particular use case, and what cool things you learned in the process!

MySQL Function for Calculating the Working Days between 2 Dates

Posted on 2019/08/07 by R. Kazeno

Lately I needed a MySQL stored function to calculate the working days (or business days) between 2 dates. However, all the solutions I found online were either not configurable in terms of which week days count as working days, or really hard to read/understand. So I rolled my own, and decided to post it here in case anyone else finds it useful.

Here’s the function declaration code, as well as a usage example, hosted on GitHub (or you can find it at https://gist.github.com/kazeno/8bad9453d1e4d2aed33e6af14d1aa7a1 if it’s not showing in your browser):

The function accepts 2 dates, as well as a string that specifies which week days should count as working days. The week days are input as the integers corresponding to their WEEKDAY function representation, i.e.:

0 = Monday
1 = Tuesday
2 = Wednesday
3 = Thursday
4 = Friday
5 = Saturday
6 = Sunday

Thus if the working days are Monday to Friday, the workdays argument would be ‘01234’, and if for a more abstract example the working days are Tuesday to Thursday plus Saturday, the workdays argument would then become ‘1235’.

The function itself determines the start and end dates from the first 2 arguments (so you can use it with the earlier and later dates in any position), counts the number of whole weeks (Monday to Sunday) between the 2 dates, and then loops through the remaining days not belonging to a whole week and counts them if they are contained in the 3rd argument.
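For readers who prefer to see the logic outside of SQL, here is a rough Python sketch of the same idea. It is not a port of the stored function: the name is made up, it assumes both end dates are counted, and it takes whole 7-day blocks starting at the start date instead of Monday-to-Sunday weeks (which gives the same count, since any 7 consecutive days contain each week day exactly once). Python’s date.weekday() uses the same numbering as WEEKDAY, with 0 for Monday:

```python
from datetime import date, timedelta


def working_days(date1, date2, workdays="01234"):
    """Count working days between two dates, both ends included (an assumption)."""
    start, end = sorted((date1, date2))  # accept the dates in any order
    total_days = (end - start).days + 1
    whole_weeks, remaining = divmod(total_days, 7)

    # Every whole week contributes each configured working day exactly once
    count = whole_weeks * len(set(workdays))

    # Check the leftover days one by one
    day = start + timedelta(days=whole_weeks * 7)
    for _ in range(remaining):
        if str(day.weekday()) in workdays:
            count += 1
        day += timedelta(days=1)
    return count


# Monday 2019-08-05 to Sunday 2019-08-11, working Monday to Friday
print(working_days(date(2019, 8, 5), date(2019, 8, 11)))  # 5
```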

Hope you found it useful!

Clear the Field of a non-Editable AngularUI Typeahead

Posted on 2015/03/03 by R. Kazeno

When working with non-editable AngularUI Typeaheads (i.e. ones that have the typeahead-editable attribute set to false), a logical thing to do would be to erase the typeahead input field’s view value if the user doesn’t select a valid typeahead option. However, there’s no built-in option to do just that, so here’s a workaround.

The main idea here is to use the ng-blur attribute to set a function that will clear the typeahead field if no valid option was selected. To do that, we first need to access by name the form controller object containing the typeahead. If the typeahead is not contained in a form element, what we can do is declare one of its parent elements as a form controller using ng-form, and give names to both of them:

<div ng-form name="myForm">
    <input type="text" name="myField" ng-model="myField" ng-blur="clearUnselected()" typeahead="myField in myFields" typeahead-editable="false" />
</div>

Now we can access the typeahead’s form properties from the AngularJS controller, and reset its $viewValue property if no valid option was selected by the user. We wrap this into an AngularJS $timeout because there’s a delay between clicking on an option and the corresponding model updating. Don’t forget to inject the $timeout service into your controller!

//inside your controller
$scope.clearUnselected = function () {
    $timeout(function () {
        if (!$scope.myField) {   //the model was not set by the typeahead
            $scope.myForm.myField.$setViewValue('');
            $scope.myForm.myField.$render();
        }
    }, 250);    //a 250 ms delay should be safe enough
}

How to fix blank screen in Preferences > Themes on the Prestashop Back Office

Posted on 2014/10/17 by R. Kazeno

The Symptoms

You’ve linked your shop to your Prestashop Addons account, where you’ve previously or recently bought a theme. You try opening the Preferences > Themes menu, just to find that after trying to load the page for a while, you either get a blank page (even with dev mode on) or a cryptic 500 error.

The Cause

When the Admin Themes Controller initializes, it checks if you’re logged into Prestashop Addons, and if so, whether there’s a theme in your account that hasn’t been installed in your store. If there is, it tries to download it automatically, and here’s where the problem lies: your server is timing out during this connection. If you go to your server error log, you’ll probably find an error such as mod_fcgid: read data timeout in xx seconds if your PHP installation is running on FastCGI.

The Solution

Unfortunately, if you are running on a shared host you almost certainly won’t have the access required to change the server’s timeout parameters. In that case the only course of action is to disable the theme auto-download functionality in AdminThemesController.php, and this can be cleanly done in an override. This is all the code you need in your override:

If you don’t know how to create overrides don’t worry, you can download the following zip with the override already in place. All you need to do is extract the folder within, upload it into your store’s root Prestashop folder, and delete the file cache/class_index.php on your store so the override is detected.

Download Admin Themes Override zip file

Access Private and Protected Properties of Objects in PHP

Posted on 2013/10/10 by R. Kazeno

The theory goes that forcibly overriding Class behaviour from outside its scope is bad practice. However, in the real world we developers oftentimes find ourselves having to adapt to, or work on top of, existing code outside of our control. That’s where hacks like these come in handy, though you should really use them with caution, and only when there’s no other choice available.

The Array Cast Override

This is the conceptually simplest method I’ve found, and the only one I know that works on PHP versions previous to 5.3. It involves casting an Object as an Array, and looking through the Array elements for the property you need. When doing this, the index of the Array element corresponding to each property is named according to special rules, depending on its visibility inside the Object. The next pseudo-code snippet shows these rules:

switch ($visibility) {
    case 'public':
        $array_index = $property_name;
        break;
    case 'protected':
        $array_index = "\0*\0" . $property_name;
        break;
    case 'private':
        $array_index = "\0" . $class_name . "\0" . $property_name;
        break;
}

This means that the way you access the property depends on its visibility. As you can see PHP sandwiches either an asterisk or the Class name (depending on if the visibility is protected or private, respectively) between two NULL bytes, and prepends this to the property name to form the Array index. You can thus access the hidden properties of an object as follows:

class MyExampleClass                 //example of a class with hidden properties
{
    public $myPublicVar = "I'm public!";
    protected $myProtectedVar = "I'm protected!";
    private $myPrivateVar = "I'm private!";
}

$obj = new MyExampleClass();
$obj_array = (Array)$obj;         //cast the object as an Array

echo $obj_array['myPublicVar'];                                        //echoes: I'm public!
echo $obj_array["\0*\0" . 'myProtectedVar'];                           //echoes: I'm protected!
echo $obj_array["\0" . 'MyExampleClass' . "\0" . 'myPrivateVar'];      //echoes: I'm private!

The type of the property doesn’t matter; I’ve even accessed resource handles this way.

The Reflection Override (PHP >= 5.3)

When talking about overriding a property’s visibility, this is the method most seasoned developers know. Using the Reflection Class, you reflect an existing object and change the accessibility of each property you need by using the ReflectionProperty::setAccessible method. Here is an example:

class MyExampleClass         //example of a class with hidden properties
{
    public $myPublicVar = "I'm public!";
    protected $myProtectedVar = "I'm protected!";
    private $myPrivateVar = "I'm private!";
}

$obj = new MyExampleClass();
$refObj  = new ReflectionObject($obj);
$refProp1 = $refObj->getProperty('myProtectedVar');
$refProp1->setAccessible(TRUE);
$refProp2 = $refObj->getProperty('myPrivateVar');
$refProp2->setAccessible(TRUE);

echo $refProp1->getValue($obj);               //echoes: I'm protected!
echo $refProp2->getValue($obj);               //echoes: I'm private!

Using Reflection is supposedly really slow, but the upside is that you can not only read the value of the property, but also change it using the ReflectionProperty::setValue method.

The Closure Bind Override (PHP >= 5.4)

This is the newest method I’ve come across, and it’s also the quickest and most painless one. It uses the bind static method of the Closure class to bind a Closure to our Object, which we’ll use to return the properties we want. The usage is really straightforward:

class MyExampleClass                 //example of a class with hidden properties
{
    public $myPublicVar = "I'm public!";
    protected $myProtectedVar = "I'm protected!";
    private $myPrivateVar = "I'm private!";
}

$obj = new MyExampleClass();
$propGetter = Closure::bind(  function($prop){return $this->$prop;}, $obj, $obj );

echo $propGetter('myProtectedVar');                     //echoes: I'm protected!
echo $propGetter('myPrivateVar');                       //echoes: I'm private!

It works almost like magic, the disadvantage being that most shared hosting providers are lazy and haven’t bothered to update their PHP versions since the Late Pleistocene. If you can use it, this is the method I would recommend the most.

Do you know any other methods to access hidden properties of objects? If so, I would love to read your comments below.

Acknowledgements and References

fmmarzoa at librexpresion dot org
Tobias Schlitt
PHP.Net