[Stef Countryman -- 2021/06/21]
[Stef Countryman -- 2022/08/29] (Updated)
This post is meant to help you quickly get a nice, maintainable python environment up and running. Click on code blocks to copy them. You can just follow the instructions in the TL;DR section, referencing the extended explanations as needed.
TL;DR: install Anaconda and keep each project's dependencies in a conda environment file, e.g. your-project-env.yml. Update this environment file as you
install/remove new packages for your project.
Developing multiple python projects with arcane dependencies using the
naive approach (install one copy of python, pip install a
bunch of packages by hand, lose track of which projects need which
packages, face a dependency conflict, suffer, despair) will ruin your
computer and your life.
When it comes to dependency management, packaging, writing extensions, and
even choosing an interpreter, a web search will show dozens of ways to do
the same thing, most of them obsolete, overwrought, unmaintainable, slow,
bug-ridden, inextensible, or otherwise terrible.
This presents a problem to the novice programmer, who has heard that python is easy to learn and has many useful libraries that you can freely use, and therefore starts a project without setting up a decent dev environment, not realizing what horrors await them. Fortunately, practices are improving somewhat, and with some reasonable choices, managing python projects can be almost as forgiving to novice programmers as the syntax of the code itself.
The aim of this post is to provide such a guide to my friends and colleagues in a central place.
The first thing to do is to install an actual python interpreter. Python is a scripting language, meaning that the programs are just text files which are executed by a program called an interpreter. The language itself changes roughly every year, so you will probably want an interpreter that runs a recent version of the language, as well as the ability to manage multiple versions (in case you find that certain of your programs require older/newer versions than what you typically run).
MacOS computers come with a very old version of python pre-installed.
Linux machines sometimes come with a newer version pre-installed, though
very minimal distributions will come without a python interpreter
altogether.
MacOS, most Linux distributions, and Windows all have system
package managers (either official, like apt
on Debian Linux,
or third party, like MacPorts or Homebrew on MacOS/Chocolatey on Windows)
that let you quickly install a version of python;
don't use them.
They are tied in to various tools throughout your system, so that breaking
them might break system-wide tools you depend on, and they tend to be
incomplete and out of date.
Installing python and libraries
[1]
using the system package manager means
possibly having to remove and reinstall everything in your fragile dev
environment at some point in the future, which can easily take over a day
for complex environments, assuming it's possible at all.
This violates what is, in my opinion, the most important rule of a
development environment:
It should be easy to destroy and rebuild your dev environment.
Having this property means you will never live in fear of the day when you inevitably mess up your configuration and need to delete it all and start over. You will not fear adding new team members or switching development computers, and chances are your deployments will become much easier.
The (likely) easiest solution is to use Anaconda's python distribution,
which includes a nice, fast, cross-platform python package manager
(conda
) and a nice development environment management system,
which I'll discuss in the next section.
You can use Anaconda to create distinct Python environments for each
project you are working on.
You can create a new environment (with no packages yet installed) using
conda create -n my-environment-name, list your existing python
environments with conda env list, and activate the one you want
(e.g. my-environment-name) with conda activate my-environment-name.
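For example, a typical round trip through those commands looks like the following (the deactivate step at the end just reverses the activate step):
conda create -n my-environment-name   # make a new, empty environment
conda env list                        # list all environments on this machine
conda activate my-environment-name    # switch into it
conda deactivate                      # switch back out when you're done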
In general, the easiest way to manage your environments is to have a single
file which specifies all the dependencies, allowing you to install all of
them with a single command.
pip
allows you to do this with what is usually referred to as
a requirements file, called requirements.txt
by convention.
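As a minimal sketch of that workflow (the package names here are just the ones used later in this post):
# a requirements.txt is just a list of packages, one per line, optionally pinned:
#     numpy
#     nptyping==1.4.4
# install everything it lists into the current environment with:
pip install -r requirements.txt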
The Conda approach is more powerful; it uses a
YAML file which optionally includes
information about the python version and environment name as well as the
ability to specify pip
dependencies (since more niche packages
usually need to be installed that way).
If you already have a conda environment file called, e.g., env.yml
for an environment called my-environment-name, you can simply run
conda env create -f env.yml, and it will install all required
packages; you can then activate the environment with
conda activate my-environment-name and start using it.
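Keeping everything in one file is also what makes the environment cheap to destroy and rebuild; a sketch of the full cycle, using the file and environment name from above:
conda env create -f env.yml               # build the environment from its file
conda activate my-environment-name        # start using it
conda deactivate                          # stop using it
conda env remove -n my-environment-name   # destroy it; env.yml lets you recreate it any time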
Such an environment file's contents might look as simple as:
name: my-environment-name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - numpy
  - pip
  - pip:
      - nptyping==1.4.4
In this example, we are allowing Anaconda to install Python version 3.9, as
well as packages from
conda-forge
(which is user-maintained and contains most
packages); more precisely, we are installing Numpy using the
conda
package manager; and we are installing a more niche
package, nptyping
, using pip
, since it's not
available on conda-forge
.
We specify our pip
dependencies as a list under
the - pip: entry
, but we also need to explicitly install the
pip
package manager itself using conda
in order
to be able to do so.
We've required nptyping
to use a specific version (1.4.4)
while allowing Numpy to default to the latest version; in general, it's
usually easiest to leave the version numbers unspecified until the point
where specific versions are required to avoid bugs you've found.
Finally, we have also specified the name of the environment; both this and
the Python version are optional, but it's nice to be extra specific with
these just to make sure things work.
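When you later add or remove dependencies in this file, you can (as far as I know) sync an existing environment to it instead of rebuilding from scratch:
# apply changes from the file to the existing environment;
# --prune also removes packages you've deleted from the file
# (its behavior has varied between conda versions, so double-check the result)
conda env update -f env.yml --prune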
You can get a list of all top-level imported packages in your
codebase by running this command (for regular .py
python
files):
find . -name '*py' -exec sed -n 's/^import *\([^ .]*\).*/\1/p;s/^from *\([^ .]*\) *import.*/\1/p' {} \; | sort | uniq
or this command (for .ipynb
Jupyter notebook files), which
requires you to install
jq
for JSON
parsing [2]:
find . -name '*ipynb' -exec jq -jr '.cells | .[].source | .[]' {} \; | sed -n 's/^import *\([^ .]*\).*/\1/p;s/^from *\([^ .]*\) *import.*/\1/p' | sort | uniq
or handle both types of files in one fell swoop:
(
    find . -name '*ipynb' -exec jq -jr '.cells | .[].source | .[]' {} \;
    find . -name '*py' -exec cat {} \;
) \
    | sed -n 's/^import *\([^ .]*\).*/\1/p;s/^from *\([^ .]*\) *import.*/\1/p' \
    | sort \
    | uniq
Note that the names of packages on conda or pip might differ from the
imported package names, and this won't tell you which versions are required
(which can be vital for getting complex numerical environments working
properly).
If you can't get your code to work on a newly-built environment,
you might want to see exactly which versions of each package you
have installed by running pip freeze
in whichever
environment you are already using. You should see a bunch of items
that look like e.g. typish==1.9.3
.
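If you want to keep that snapshot around, redirect it into a file (the file name here is arbitrary):
# record the exact versions installed in the currently active environment
pip freeze > frozen-requirements.txt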
Alternatively, if you've already been using conda
to manage
your dependencies, you can directly export an environment file with
conda env export >my-env.yml
(substituting
my-env.yml
with whatever you want to call your environment file);
this output will include information about the source of each package, i.e.
whether it was from
PyPI (using
pip
) or from
conda-forge
(using conda
).
In both cases, you'll end up with a full list of installed
packages; most of these will probably be dependencies of the packages you
are actually using.
If you're taking this approach, it's good to delete everything that doesn't
appear in the above find
command, unless you really need a
specific version or want a specific record of installed packages
that worked for your project.
[3]
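Relatedly, if you'd rather not prune the exported file by hand, conda has an option to export only the packages you explicitly asked it to install; as far as I know it leaves out pip-installed packages, so you would need to add those back yourself:
# export only explicitly-requested conda packages (no fully-pinned dependency tree);
# pip dependencies are not included, so re-add them by hand if you use this
conda env export --from-history > my-env.yml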
You may have already noticed that, if you have a file called
example.py
and are running a python script or interpreter in
the same directory, you can run import example
to import the
contents of that file into your current environment.
This is a module, and it's the simplest way to turn your code into
a reusable library.
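A tiny demonstration, with a throwaway function of my own choosing as the module's contents:
# put a function in example.py...
echo 'def greet(): return "hello from example"' > example.py
# ...then import it from any python process started in the same directory
python -c 'import example; print(example.greet())'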
If you have a bunch of scripts or Jupyter notebooks in a project using the
same functions repeatedly, it might make sense to put those functions into
a module and import them (like you would any other libraries) in the
scripts/notebooks themselves.
If you want to reuse a lot of code, it makes sense to break your library up
into smaller submodules; Numpy does this with its less common linear
algebra tools, which can be accessed via e.g.
import numpy.linalg
or from numpy import linalg
.
Such a library composed of multiple source files (modules) bundled together
is called a package.
If we want to make an example package, we can simply run
mkdir example
mv example.py example/__init__.py
We can then run import example
again within python just as
before.
This works because Python treats a directory containing an
__init__.py
file as if it were a module of the same name.
Unlike a module, however, we can add additional submodules and subpackages
to a package; for example, if we add a submodule
example.constants
and a subpackage
example.utils
, our entire package's directory structure will
now look like:
example
├── __init__.py
├── constants.py
└── utils
    └── __init__.py
You can continue this recursively as needed, building out a nicely organized library.
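With that directory structure in place, the subpackage and submodule import just like Numpy's do:
# both import styles from the numpy.linalg example above work here too
python -c 'import example.constants; from example import utils; print(example.constants, utils)'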
You don't really need an
integrated development environment
(IDE, i.e. a program for editing and running your code)
for python dev, since ipython
, Jupyter, ripgrep
,
and your favorite text editor can handle 98% of what you'd need an IDE for.
If you're command-line savvy and have a favorite text editor, you can keep
using your favorite tools quite easily.
If you're not yet a command-line guru, however, you should try a nice IDE like PyCharm. It has great code exploration, debugging, and running/testing tools built in, with full support for various arcane interpreter setups (in case you need to deviate in the future from the setup I suggest in this post), and is cross-platform. The community version is free and adequate for most purposes.
------------------------------------------------------------
[1]
Seasoned python devs might correctly observe that you could simply use a
virtualenv
.
This is a valid technique, but it does not solve the related problem of
picking and installing an interpreter (let alone managing multiple
versions), nor of the limitations of
pip
or the inconvenience of having virtual environments
scattered all over your dev folders.
pipenv
uses virtualenv
and tracks your
dependencies automatically, but anything that pipenv
is
capable of automatically tracking is going to be very easy to track or
reconstruct manually, reducing its utility (not to mention it was slow and
buggy last I used it) [4].
[2]
You can easily install jq
with e.g. brew install jq
(MacOS using Homebrew) or choco install jq
(Windows using
Chocolatey); see react 0 to 1
for relevant links.
[3] If the latter is critical to getting your code to work, however, you might want to go one step further and immortalize your environment in a docker image. It's more work, but for some numerical/scientific environments, it's the only convenient way to have a portable and reusable environment for running your code.
[4] Update: this still seems to be true in 2022.