MINIMALLY PAINFUL PYTHON DEVELOPMENT
====================================

[Stef Countryman -- 2021/06/21]
[Stef Countryman -- 2022/08/29] (Updated)

This post is meant to help you quickly get a nice, maintainable python environment up and running. You can just follow the instructions in the TL;DR section, referencing the extended explanations as needed.

TL;DR
-----

  1. Set up your interpreter. Install Anaconda [MacOS] [Win] [Linux]. If it asks you whether you'd like to install PyCharm as well, say yes if you don't have it installed already (in which case you can skip that step).
  2. Track your dependencies. Compile a list of all of your dependencies (ideally by hand) and put them in a conda environment YAML file called something like your-project-env.yml. Update this file as you install/remove packages for your project.
  3. Organize your project into a module/package. Keep all of your shared code and scripts in a single directory and use a tool like Flit to handle installing and distributing it.
  4. Pick an IDE/Editor. Install PyCharm (if you didn't already do it while installing Anaconda).

MOTIVATION
----------

Developing multiple python projects with arcane dependencies using the naive approach (install one copy of python, pip install a bunch of packages by hand, lose track of which projects need which packages, face a dependency conflict, suffer, despair) will ruin your computer and your life. When it comes to dependency management, packaging, writing extensions, and even choosing an interpreter, a web search will show dozens of ways to do the same thing, most of them obsolete, overwrought, unmaintainable, slow, bug-ridden, inextensible, or otherwise terrible.

This presents a problem for the novice programmer, who has heard that python is easy to learn and has many useful libraries that you can freely use, and therefore starts a project without setting up a decent dev environment, not realizing what horrors await them. Fortunately, practices are improving somewhat, and with a few reasonable choices, managing python projects can be almost as forgiving to novice programmers as the syntax of the code itself.

The aim of this post is to provide such a guide to my friends and colleagues in a central place.

SET UP YOUR INTERPRETER
-----------------------

The first thing to do is to install an actual python interpreter. Python is a scripting language, meaning that programs are just text files which are executed by a program called an interpreter. The language itself changes roughly every year, so you will probably want an interpreter that runs a recent version of the language, as well as the ability to manage multiple versions (in case some of your programs require an older or newer version than the one you typically run).

MacOS computers come with a very old version of python pre-installed. Linux machines sometimes come with a newer version pre-installed, though very minimal distributions come without a python interpreter altogether. MacOS, most Linux distributions, and Windows all have system package managers (either official, like apt on Debian Linux, or third party, like MacPorts or Homebrew on MacOS and Chocolatey on Windows) that let you quickly install a version of python; don't use them. Those python installations are tied into various tools throughout your system, so breaking them can break system-wide tools you depend on, and they tend to be incomplete and out of date. Installing python and libraries [1] using the system package manager means possibly having to remove and reinstall everything in your fragile dev environment at some point in the future, which can easily take over a day for complex environments, assuming it's possible at all. This violates what is, in my opinion, the most important rule of a development environment:

It should be easy to destroy and rebuild your dev environment.

Having this property means you will never live in fear of the day when you inevitably mess up your configuration and need to delete it all and start over. You will not fear adding new team members or switching development computers, and chances are your deployments will become much easier.

The (likely) easiest solution is to use Anaconda's python distribution, which includes a fast, cross-platform python package manager (conda) and a simple development environment management system, which I'll discuss in the next section.

TRACK YOUR DEPENDENCIES (IN A CONDA ENVIRONMENT)
------------------------------------------------

You can use Anaconda to create distinct Python environments for each project you are working on. You can create a new environment (with no packages yet installed) using conda create -n my-environment-name, list your existing python environments with conda env list, and activate the one you want (e.g. my-environment-name) with conda activate my-environment-name.
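
For example, that whole lifecycle in a terminal might look like this (my-environment-name is a placeholder; use your own project's name):

conda create -n my-environment-name    # create a new, empty environment
conda env list                         # list all existing environments
conda activate my-environment-name     # switch to the new environment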

In general, the easiest way to manage your environments is to have a single file which specifies all the dependencies, allowing you to install all of them with a single command. pip allows you to do this with what is usually referred to as a requirements file, called requirements.txt by convention. The Conda approach is more powerful; it uses a YAML file which optionally includes information about the python version and environment name as well as the ability to specify pip dependencies (since more niche packages usually need to be installed that way).
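
For reference, a pip requirements file is just a newline-separated list of package specifiers, with optional version pins; a minimal (hypothetical) example:

numpy
nptyping==1.4.4

You would install everything in it with pip install -r requirements.txt.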

If you already have a conda environment file called, e.g., env.yml for an environment called my-environment-name, you can simply run conda env create -f env.yml, and it will install all required packages; you can then activate the environment with conda activate my-environment-name and start using it. Such an environment file's contents might look as simple as:

name: my-environment-name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - numpy
  - pip
  - pip:
    - nptyping==1.4.4

In this example, we are allowing Anaconda to install Python version 3.9, as well as packages from conda-forge (which is community-maintained and contains most packages); more precisely, we are installing Numpy using the conda package manager, and we are installing a more niche package, nptyping, using pip, since it's not available on conda-forge. We specify our pip dependencies as a list under the pip: entry, but we also need to explicitly install the pip package manager itself using conda in order to do so. We've pinned nptyping to a specific version (1.4.4) while allowing Numpy to default to the latest version; in general, it's easiest to leave version numbers unspecified until a specific version is required to avoid a bug you've found. Finally, we have also specified the name of the environment; both this and the Python version are optional, but specifying them makes the environment more reproducible across machines.
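
If you later add or remove dependencies in the YAML file, you can synchronize an existing environment with it instead of rebuilding from scratch; the --prune flag removes any packages no longer listed in the file:

conda env update -f env.yml --prune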

You can get a list of all top-level imported packages in your codebase by running this command (for regular .py python files):

find . -name '*.py' -exec sed -n 's/^import *\([^ .]*\).*/\1/p;s/^from *\([^ .]*\) *import.*/\1/p' {} \; | sort | uniq

or this command (for .ipynb Jupyter notebook files), which requires you to install jq for JSON parsing [2]:

find . -name '*.ipynb' -exec jq -jr '.cells | .[].source | .[]' {} \; | sed -n 's/^import *\([^ .]*\).*/\1/p;s/^from *\([^ .]*\) *import.*/\1/p' | sort | uniq

or handle both types of files in one fell swoop:

(
    find . -name '*.ipynb' -exec jq -jr '.cells | .[].source | .[]' {} \;
    find . -name '*.py' -exec cat {} \;
) \
    | sed -n 's/^import *\([^ .]*\).*/\1/p;s/^from *\([^ .]*\) *import.*/\1/p' \
    | sort \
    | uniq

Note that the names of packages on conda or pip might differ from the imported package names, and this won't tell you which versions are required (which can be vital for getting complex numerical environments working properly). If you can't get your code to work in a newly-built environment, you might want to see exactly which versions of each package you have installed by running pip freeze in whichever environment you are already using. You should see a bunch of items that look like e.g. typish==1.9.3. Alternatively, if you've already been using conda to manage your dependencies, you can directly export an environment file with conda env export >my-env.yml (substituting my-env.yml with whatever you want to call your environment file); this output will include information about the source of each package, i.e. whether it came from PyPI (via pip) or from conda-forge (via conda).
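
An exported file looks like the hand-written one above, but with every package pinned to the exact installed version. A trimmed sketch (version numbers here are just placeholders; conda also records exact build strings and a machine-specific prefix: line, which you can safely delete for portability):

name: my-environment-name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9.5
  - numpy=1.20.3
  - pip=21.1.2
  - pip:
    - nptyping==1.4.4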

In both cases, you'll end up with a full list of installed packages; most of these will probably be dependencies of the packages you are actually using. If you're taking this approach, it's good to delete everything that doesn't appear in the output of the find commands above, unless you really need a specific version or want an exact record of the installed packages that worked for your project. [3]

ORGANIZE YOUR PROJECT INTO A MODULE/PACKAGE
-------------------------------------------

You may have already noticed that, if you have a file called example.py and are running a python script or interpreter in the same directory, you can run import example to import the contents of that file into your current environment. This is a module, and it's the simplest way to turn your code into a reusable library. If you have a bunch of scripts or Jupyter notebooks in a project using the same functions repeatedly, it might make sense to put those functions into a module and import them (like you would any other libraries) in the scripts/notebooks themselves.
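
For instance, a (hypothetical) example.py containing a shared function:

# example.py
def greet(name):
    """Return a friendly greeting."""
    return f"Hello, {name}!"

can be reused from any script or notebook in the same directory:

import example

print(example.greet("world"))  # prints: Hello, world!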

If you want to reuse a lot of code, it makes sense to break your library up into smaller submodules; Numpy does this with its less commonly used linear algebra tools, which can be accessed via e.g. import numpy.linalg or from numpy import linalg. Such a library composed of multiple source files (modules) bundled together is called a package. If we want to turn our example module into a package, we can simply run

mkdir example
mv example.py example/__init__.py

We can then run import example again within python just as before. This works because Python treats a directory containing an __init__.py file as if it were a module of the same name. Unlike a module, however, a package can contain additional submodules and subpackages; for example, if we add a submodule example.constants and a subpackage example.utils, our entire package's directory structure will look like:

example
├── __init__.py
├── constants.py
└── utils
    └── __init__.py
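
With this layout, all of the following imports work:

import example                 # loads example/__init__.py
from example import constants  # the constants.py submodule
from example import utils      # the utils subpackage (i.e. utils/__init__.py)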

You can continue this recursively as needed, building out a nicely organized library.
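
To make the package installable and distributable (the Flit step from the TL;DR), all it takes is a small pyproject.toml file in the directory containing the example package. A minimal sketch (the name, version, and description are placeholders for your own project's metadata):

[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"

[project]
name = "example"
version = "0.1.0"
description = "A small example package."

With that in place, flit install -s symlinks the package into your active environment (so your edits take effect immediately), and flit publish builds and uploads it to PyPI.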

PICK AN IDE/EDITOR
------------------

You don't really need an integrated development environment (IDE, i.e. a program for editing and running your code) for python dev, since ipython, Jupyter, ripgrep, and your favorite text editor can handle 98% of what you'd need an IDE for. If you're command-line savvy and have a favorite text editor, you can keep using your favorite tools quite easily.

If you're not yet a command-line guru, however, you should try a nice IDE like PyCharm. It has great code exploration, debugging, and running/testing tools built in, with full support for various arcane interpreter setups (in case you need to deviate in the future from the setup I suggest in this post), and is cross-platform. The community version is free and adequate for most purposes.

------------------------------------------------------------

[1] Seasoned python devs might correctly observe that you could simply use a virtualenv. This is a valid technique, but it does not solve the related problems of picking and installing an interpreter (let alone managing multiple versions), the limitations of pip, or the inconvenience of having virtual environments scattered all over your dev folders. pipenv uses virtualenv and tracks your dependencies automatically, but anything that pipenv is capable of automatically tracking is going to be very easy to track or reconstruct manually, reducing its utility (not to mention it was slow and buggy last I used it) [4].

[2] You can easily install jq with e.g. brew install jq (MacOS using Homebrew) or choco install jq (Windows using Chocolatey).

[3] If the latter is critical to getting your code to work, however, you might want to go one step further and immortalize your environment in a docker image. It's more work, but for some numerical/scientific environments, it's the only convenient way to have a portable and reusable environment for running your code.

[4] Update: this still seems to be true in 2022.