Developer Setup
Welcome to Data Engineering at DCP! This guide is intended to help you get set up to contribute to our codebase.
This repository is our primary location for code, issues, and automated workflows.
- If you don't already have a GitHub account, create one, and have a team member add you to the NYCPlanning organization. You can either link a personal account to the organization or make one for DCP purposes (some of us on the team do each).
- Generate SSH keys and add your key to your GitHub account.
- Create a `.env` file in the local `data-engineering` directory. Add environment variables to the `.env` file; they will be used when creating a Docker dev container (see docker). A few others are included, but the basic ones needed for most of our pipelines are `BUILD_ENGINE`, `AWS_S3_ENDPOINT`, `AWS_SECRET_ACCESS_KEY`, and `AWS_ACCESS_KEY_ID`. Most of the relevant secrets can be found in 1Password.
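As a quick sanity check, a small Python snippet (illustrative only - `missing_vars` is not a helper that exists in the repo) can confirm those basic variables made it into your environment:

```python
import os

# The basic variables this guide says to define in .env.
REQUIRED_VARS = [
    "BUILD_ENGINE",
    "AWS_S3_ENDPOINT",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ACCESS_KEY_ID",
]


def missing_vars(env=os.environ):
    """Return the required variable names absent from the given environment mapping."""
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    # Inside the dev container the .env values are loaded into the
    # environment, so this should print an empty list.
    print(missing_vars())
```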
Most developers at DCP use VSCode.
Extensions you should definitely install:
- Python
- Dev Containers
- Pylance
- Docker
Other potentially useful ones:
- Jupyter
- GitLens
- CodeRunner
- Rainbow CSV
- Data Wrangler
- Power User for dbt
We store secrets, credentials, etc. in 1Password. Talk to a teammate to get set up.
- Homebrew (if on Mac)
- IPython
  - If running notebooks in VSCode, extensions can take care of install/setup
- Postgres or Postgres.app
- QGIS
- R
- Poetry (Python package manager)
This section describes the general workflow for running code (QA app, data pipeline, etc.) locally.
The simplest way to develop and run pipelines locally is using a dev container. This is a dockerized environment that VSCode can connect to. While it's an effective way to quickly set up a complete, production-ready environment (and ensure that code runs the same locally as it does in the cloud), it's often less performant than running locally outside of a container. For now though, it's certainly still the best place to start (and generally we try to avoid running computationally expensive jobs on our own machines anyway).
All files needed for the container are stored in the data-engineering/.devcontainer/ directory:
- `Dockerfile` describes how to build the "initial" image for our container. It's largely setting variables that VSCode expects in order to run in the container properly.
- `docker-compose.yml` describes how to set up our dev container. It also specifies that we need to build from the `Dockerfile` prior to initiating the container. We used to specify a postgres service as well, but have moved away from that in favor of a lighter-weight container that connects to our persisted cloud dbs even when running locally. Now this file mainly exposes a port for running streamlit from inside the container and makes sure volumes are properly mounted.
- `devcontainer.json` is specifically used to create our dev container in VSCode. We don't need this file if we create the dev container from a terminal. It handles things like expected VSCode extensions while running in the container, commands that should be run before or after starting the container, etc.
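For orientation, a `devcontainer.json` along these lines covers the responsibilities described above. This is an illustrative sketch, not the repo's actual file: the `de` service name matches the `docker exec` example later in this guide, while `workspaceFolder`, the extension IDs, and the `postCreateCommand` are placeholders.

```jsonc
{
  "name": "data-engineering",
  // Build and run via the compose file in this directory
  "dockerComposeFile": "docker-compose.yml",
  "service": "de",
  "workspaceFolder": "/workspace",
  "customizations": {
    "vscode": {
      // Extensions VSCode should install inside the container
      "extensions": ["ms-python.python", "ms-python.vscode-pylance"]
    }
  },
  // Example of a command run after the container is created
  "postCreateCommand": "python3 -m pip install --editable ."
}
```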
There are (at least) 2 ways to spin up the container:
- From VSCode (which will then also run VSCode within the container):
  - Open VSCode.
  - Open the cloned `data-engineering` directory.
  - VSCode will detect an existing container config file. Click on "Reopen in Container".
  - VSCode may ask for the passphrase associated with your GitHub SSH key. If you don't remember the passphrase but saved it in the Keychain Access app upon creation, you can find it there.
  - If the container was started successfully, VSCode will show that it is connected to the dev container.
- From terminal:
  - Navigate to the `data-engineering/.devcontainer/` directory.
  - Run `docker-compose up` (optionally with `-d`). This command will use the existing `.yml` file to set up the container; with `-d`, it will keep running in the background.
  - If you go this route, you can run VSCode outside of the container or within. Both have tradeoffs: inside the container, you may run into performance issues; outside the container, you need just a little more decoration around running commands inside it.
  - Running `docker exec -ti de bash` opens a terminal prompt in the container.
Outside of a dev container, we use tools like Homebrew and Python virtual environments. There are many ways to do this; we typically use venv or pyenv (pyenv repo, usage, tutorial). If you're familiar with conda, it would probably work fine as well; most of us just don't use it.
With Homebrew, install:
- `gdal` - if possible, the same version as in `admin/run_environment/requirements.txt`
- postgres (latest version)
To install our Python packages, you will need a virtualenv of your choice set up and activated:

```
python3 -m venv venv
source venv/bin/activate
```
To install this repo's Python packages and the dcpy package:

```
python3 -m pip install --requirement ./admin/run_environment/requirements.txt
python3 -m pip install --editable . --constraint ./admin/run_environment/constraints.txt
```

Create and activate a Python 3.13 environment:

```
conda create -n myenv python=3.13
conda activate myenv
```
With Conda (Miniconda/Anaconda) in a Git Bash terminal, install:
- `gdal`, `libgdal`, and `libgdal-pg` from the conda-forge channel; run the following to ensure PostGIS support:

  ```
  conda install -c conda-forge gdal libgdal libgdal-pg
  ```
- PostgreSQL (latest Windows installer)
(Optional but handy) Expose the PostgreSQL CLI tools to your shell:

```
echo 'export PATH="$PATH:/c/Program Files/PostgreSQL/17/bin"' >> ~/.bashrc
source ~/.bashrc
```

Fix psql encoding (UTF-8): the Windows psql client defaults to WIN1252, but our dumps are UTF-8. Add one of the following:

```
# Option 1 - shell-level (add to ~/.bashrc or ~/.bash_profile)
export PGCLIENTENCODING=UTF8
```

```
-- Option 2 - psql-level (add to ~/.psqlrc)
SET client_encoding = 'UTF8';
```

To install this repo's Python packages and the dcpy package:
```
python -m pip install --upgrade pip
python -m pip install --requirement ./admin/run_environment/requirements.txt
python -m pip install --editable . --constraint ./admin/run_environment/constraints.txt
```