
DEV: reduce the size of the dev environment.yml #49998

Open
jorisvandenbossche opened this issue Dec 1, 2022 · 26 comments
Labels
CI (Continuous Integration), Dependencies (Required and optional dependencies), Needs Discussion (Requires discussion from core team before further action)

Comments

jorisvandenbossche (Member) commented Dec 1, 2022

The main entry point for setting up a development environment for people to start contributing is to use the environment.yml file. Currently, this includes everything you would potentially need to run every single test, build every page of the docs, etc. This is of course nice (so people aren't surprised about something not working), but it also creates a huge and complex environment (more to download, taking up a lot of space, being slower to solve with conda).

I think that 99% of contributors don't necessarily need all optional (test) dependencies. For example, we have some packages installed that are used for just a single test (e.g. geopandas, statsmodels, scikit-learn). pytorch was recently removed for this reason (#49796).
But the problem is that we don't only use the environment.yml for contributors setting up a dev env, but also for testing in our CI. So of course, if we make the environment.yml slimmer, then we also need to address that CI usage.

There is a PR to remove geopandas, but we should probably have a more general discussion about this, since it's not just geopandas. From that PR, a comment by @mroeschke in #49994 (comment):

Yeah I would be partial to be consistent and remove geopandas everywhere if we also want to remove it from the environment.yml for consistency (CI dependency files, unit tests, any ASVs, any doc reference)

cc @MarcoGorelli @noatamir

jorisvandenbossche (Member, Author) commented Dec 1, 2022

consistency (CI dependency files, unit tests, any ASVs, any doc reference)

Looking at one set of test dependencies: the ones that are only used for test_downstream.py. I don't think that consistency is very important here, as those dependencies being present or not should not affect linting or asv. They are only used for a single test (and those will be skipped if the package is not installed). In addition, for the downstream tests, we already have a separate env yml file (actions-38-downstream_compat.yaml).
So I would argue that those dependencies could easily be removed from the main environment.yml:

  • statsmodels (also used for a few old whatsnew files, but those ipython code blocks can be converted to verbatim ones)
  • scikit-learn
  • seaborn (also used for a colormap in style.ipynb, but that could be replaced with a matplotlib one)
  • pandas_datareader
  • pandas_gbq (this is also used for our read_gbq/to_gbq, but those don't seem to have tests, I suppose they are only tested in the downstream package since they are slim wrappers)

Apart from those, the downstream tests also use:

  • dask: used for downstream tests, and for a small section of the docs (in scale.rst) and a few array compat tests
  • xarray: also used for a few tests for to_xarray
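
For illustration, a minimal sketch of what a separate add-on file for these downstream-test dependencies could look like (the file name environment_downstream.yml and the exact package set are just an example, not an existing file in the repo):

```yaml
# environment_downstream.yml -- hypothetical add-on file, not part of the repo;
# contains only the test_downstream.py dependencies listed above
name: pandas-dev
channels:
  - conda-forge
dependencies:
  - statsmodels
  - scikit-learn
  - seaborn
  - pandas-datareader
  - pandas-gbq
  - dask-core
  - xarray
```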

MarcoGorelli (Member) commented Dec 1, 2022

I'd be all for removing those from the main environment file and just keeping them in the downstream compat file

If it allows 99% of contributors to run all the tests they need, I think that's good enough

jorisvandenbossche (Member, Author) commented Dec 1, 2022

Seems this is somewhat a duplicate of #48828, although that issue goes further towards a really minimal yml file (which could be something in addition to the environment.yml file), while removing some deps might already be a good improvement for environment.yml usage as well.

(that said, I think having an actual environment_minimal.yaml would also be useful)

MarcoGorelli (Member) commented:

Let's discuss in the next dev call, there's differing opinions, it'll be easier to resolve them

jorisvandenbossche (Member, Author) commented:

Let's discuss in the next dev call, there's differing opinions, it'll be easier to resolve them

Let's first try to just resolve it here, maybe those differing opinions turn out to be not that different if they are written down ;)

mroeschke (Member) commented:

Agreed with reducing the size of environment.yml, but I would prefer to see it broken up into various dependency files that can be installed on top of each other #49798 (comment).

For consistency and given that the CI and the dependencies installed there are the ultimate gatekeeper for pull requests, it would be nice to make the sets of dependencies available in some form.

jorisvandenbossche (Member, Author) commented:

it would be nice to make the sets of dependencies available in some form.

All the deps needed to run test_downstream.py would still be available through the specific yml file used for those tests (i.e. currently actions-38-downstream_compat.yaml).
So you can still install them that way right now, and we could certainly also create an additional dep file specifically with the downstream deps to install on top of the main environment.yml.

mroeschke (Member) commented:

Yeah if it's as simple as reorganizing our CI deps and environment.yml into a set of dependency files that both contributors and CI could use I would be very happy with that. It would allow contributors to install a smaller environment scoped to a contribution area and align with what the CI installs

lithomas1 added the CI, Needs Discussion, and Dependencies labels on Dec 2, 2022
MarcoGorelli (Member) commented:

Just to check I've understood, would your aim be:

  • you start with mamba env create --file environment.yml
  • if you want to run downstream deps tests as well, you do mamba env update --file envs/test_downstream.yml
  • if you want to run database ones, you do mamba env update --file envs/test_database.yml?
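
As a sketch of that flow (the envs/*.yml file names are hypothetical, taken from the bullets above; none of them exist yet):

```bash
# base dev environment (the existing entry point)
mamba env create --file environment.yml
conda activate pandas-dev

# optional add-ons, layered onto the same environment
mamba env update --name pandas-dev --file envs/test_downstream.yml
mamba env update --name pandas-dev --file envs/test_database.yml
```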

WillAyd (Member) commented Dec 2, 2022

If we wanted to leverage dockerhub we could create images like:

  • pandas-dev:3.X-alpine
  • pandas-dev:3.X-minimal
  • pandas-dev:3.X-all
  • pandas-dev:3.X-mamba-minimal
  • pandas-dev:3.X-mamba-all

And some other variations. When I set up an alpine image for pandas with numpy, pytz, dateutil, hypothesis, cython and pytest it was around a 300 MB image. The 3.X-all is akin to our current Dockerfile and clocks in at under 3 GB. mamba-all is what we have on gitpod, which clocks in at a little over 6 GB.

The advantage of docker would be that users can pick what they want. I have a main development environment that has all of our dependencies, but setting that up again for one-off tests in other versions is a pain. Having an alpine image for another Python version would make it trivial to hop around.
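
A very rough sketch of what such a minimal image could look like (the Python version, package set, and alpine base are assumptions drawn from the comment above; a real image would need more care around the build toolchain):

```dockerfile
# Hypothetical minimal pandas-dev image -- a sketch, not an official Dockerfile
FROM python:3.11-alpine

# musl has no prebuilt wheels for everything, so a compiler toolchain is needed
RUN apk add --no-cache build-base

# the minimal dependency set mentioned above
RUN python -m pip install --no-cache-dir \
    numpy python-dateutil pytz cython hypothesis pytest

WORKDIR /home/pandas
# the pandas checkout would be bind-mounted (or copied) here and built in-place,
# e.g. with `python -m pip install -e . --no-build-isolation`
```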

mroeschke (Member) commented:

Just to check I've understood, would your aim be:

Yeah that was the flow I was imagining, such that users could install a subset of dependencies but also know that extra dependencies for development are specified in a certain place

The advantage of docker would be users can pick what they want.

Although it would be a larger development change, I would be supportive of moving in this direction. Especially since it allows users to use the same environment as what could be tested on the CI

jorisvandenbossche (Member, Author) commented:

I think docker can be very nice if you want to exactly replicate one of the environments on CI (version combos), but for local development I want an easy way to set up without docker, using environment.yml.
But if docker can be a way to ensure you can still replicate CI without environment.yml having to include everything needed for that (and so allowing us to slim down environment.yml), that sounds good to me.

WillAyd (Member) commented Dec 2, 2022

I can look at setting those up. We probably would want an official pandas-dev dockerhub account. I can apply for this unless there are any concerns:

https://www.docker.com/community/open-source/application/

jorisvandenbossche (Member, Author) commented:

The main goal of this issue (at least the one for which I opened it) is to simplify environment.yml, i.e. simplify the main entry point for people to set up their local dev environment (without docker).

I find it important that this works without docker, but if the concern is that people should still be able to set up an environment that is close to the one on CI, and docker can help alleviate that concern, that sounds great. But maybe we should then open a dedicated discussion about "dockerizing our CI", since that is a bigger change.

@WillAyd there are already several pandas dockerhub accounts: there are "pandas" and "pandas-dev" accounts, but I don't know who owns them (I asked on the core mailing list a few months ago without an answer), so I created the "pythonpandas" org (https://hub.docker.com/u/pythonpandas) to which I can add people (at the moment we only use it for gitpod). But it would be nice to have "pandas-dev".

WillAyd (Member) commented Dec 2, 2022

Hmm, interesting, and thanks for the heads up. I posted in the form that we would like the pandas name. Will update when I hear back.

jorisvandenbossche (Member, Author) commented:

On the original topic of the environment.yml file: from a code search, it seems we currently use this file on CI for the following actions:

  • asv-bot
  • code-checks (Docstring validation and typing + ASV benchmarks jobs)
  • docbuild-and-upload

The other actions use a specific env file from /ci/deps/. So AFAIU, we don't actually use the environment.yml file for actual tests (pytest tests). Is that correct?

For docstring validation, typing and asvs, we don't need any of the "downstream" packages, I think. Only for the docs, we would need dask out of those packages (and currently seaborn for a color map and statsmodels for an old whatsnew, but that would be easy to replace).
Specifically for dask, we could also limit ourselves to dask-core for conda/mamba, which would avoid bokeh/distributed (which we don't need for the docs).

So based on the above, a concrete proposal as a first step of reducing environment.yml: remove the downstream packages (cftime, seaborn, scikit-learn, statsmodels, pandas-datareader, and replace dask with dask-core).
Those aren't actually needed for CI. And we can provide instructions in the contributing docs about how to install those in case you need to run test_downstream.py.

(those instructions could also be: create a separate env based on ci/deps/actions-38-downstream_compat.yaml, instead of providing a conda install ... instruction or providing a new to-be-created downstream.yml file with those packages)
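
For example, the contributing-docs instruction could look something like the following (the exact package list follows the proposal above; the command form is just a suggestion):

```bash
# install the downstream-test packages on top of a slimmed pandas-dev environment
conda install -c conda-forge cftime seaborn scikit-learn statsmodels pandas-datareader dask

# or create a dedicated environment from the existing CI file
conda env create -n pandas-downstream --file ci/deps/actions-38-downstream_compat.yaml
```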

WillAyd (Member) commented Dec 2, 2022

I think that's a good first step

MarcoGorelli (Member) commented:

big +1 to that

mroeschke (Member) commented:

Sure. Although this breaks my hope that "environment.yml (local install point) contains everything you need that the CI has", it's not a hill I'll die on.

MarcoGorelli (Member) commented:

Would "environment.yml (local install point) contains everything 99% of contributors need that the CI has" be an acceptable compromise? Very few contributors are running downstream tests.

mroeschke (Member) commented:

Would "environment.yml (local install point) contains everything 99% of contributors need that the CI has" be an acceptable compromise?

Yeah, removing the downstream test dependencies (and other rarely used ones too) from environment.yml is definitely okay with me.

WillAyd (Member) commented Dec 5, 2022

To not lose sight of it, maybe we split those off from requirements_dev.txt into requirements_downstream.txt or something similar? I believe pip allows you to specify multiple install files; not sure about conda.
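
Indeed, pip accepts several -r flags in one call, so the split could be used like this (requirements_downstream.txt being the hypothetical new file mentioned above):

```bash
# hypothetical split: core dev deps plus the downstream extras
python -m pip install -r requirements_dev.txt -r requirements_downstream.txt
```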

noatamir (Member) commented Dec 6, 2022

I like the idea of split files as well. It would make it easier to explain in documentation how to install the base, and how to install the full environment, with context on why you'd want to use either.

mroeschke (Member) commented:

Split files sounds good.

jorisvandenbossche (Member, Author) commented:

Regarding splitting files, I am not sure you can do that with conda if you want to use environment.yml-like files and create the environment in a single step.

In arrow, we do have a set of different files with dependencies for conda (so you can choose which set to install), but they are plain txt files (like requirements.txt files, but with conda dependencies), and then the instruction to create the environment looks like conda create -n pandas-dev -c conda-forge --file conda_deps_core.txt --file conda_deps_extra.txt ... (with potentially multiple files). See https://arrow.apache.org/docs/dev/developers/python.html#using-conda for the example in the arrow docs.

An alternative with multiple files is that you would update the environment created with the base environment.yml in a second step: conda env update --file environment_extra.yml

One of those ways of splitting into multiple files would certainly be useful if we want to further reduce the main environment.yml (after #50157).
For now, when just removing a few of the downstream deps (as #50157 does), we can also mention in the contributing docs that you need to additionally pip/conda install seaborn scikit-learn statsmodels cftime pandas-datareader to be able to run test_downstream.py.
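
Concretely, the two variants would look roughly like this (the conda_deps_*.txt and environment_extra.yml file names mirror the arrow example and are hypothetical for pandas):

```bash
# arrow-style: plain txt files with conda package specs, combined at create time
conda create -n pandas-dev -c conda-forge \
    --file conda_deps_core.txt \
    --file conda_deps_extra.txt

# alternative: layer an extra yml onto the env created from environment.yml
conda env update -n pandas-dev --file environment_extra.yml
```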

tdhock (Contributor) commented Feb 28, 2024

Hi! @agroce and I are having a similar issue. I am trying to run the test suite and get the same results on my machine as on GitHub Actions. I read https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#running-the-test-suite and followed the instructions to install mamba, activate the pandas-dev environment, and then run pytest, but I observed many more errors/failures than on GitHub Actions. All of the error messages have to do with databases, so I guess my system is missing the required database setup, and I would need to install the testing databases in order to get those tests to pass.
So I was wondering if it would please be possible to make the following changes?

  1. add the required packages for database test support to environment.yml
  2. add documentation on https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#running-the-test-suite which explains how to set up the test databases.
    Currently on https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#running-the-test-suite the only mention of databases is that "not db" can be used to skip database tests.
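
For reference, the marker mentioned there can be used to deselect the database tests when no databases are set up locally (the exact invocation may differ from the docs):

```bash
# run the test suite while skipping database-marked tests
python -m pytest pandas -m "not db"
```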
