DEV: reduce the size of the dev environment.yml #49998
Looking at one set of test dependencies: the ones that are only used for test_downstream.py.
Apart from those, the downstream tests also use:
I'd be all for removing those from the main environment file and just keeping them in the downstream compat file. If it allows 99% of contributors to run all the tests they need, I think that's good enough.
Seems this is somewhat a duplicate of #48828, although that issue goes further towards a really minimal yml file (which could be something in addition to the environment.yml file), while removing some deps might already be a good improvement for environment.yml usage as well. (That said, I think having an actual environment_minimal.yaml would also be useful.)
Let's discuss in the next dev call; there are differing opinions, and it'll be easier to resolve them there.
Let's first try to just resolve it here, maybe those differing opinions turn out to be not that different if they are written down ;)
Agreed with reducing the size of environment.yml. For consistency, and given that the CI and the dependencies installed there are the ultimate gatekeeper for pull requests, it would be nice to make the sets of dependencies available in some form.
All the deps needed to run test_downstream.py would still be available through the specific yml file used for those tests (i.e. currently actions-38-downstream_compat.yaml).
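For the few contributors who do want to run those downstream tests locally, a minimal sketch of reusing that CI file directly (this assumes the file still lives under ci/deps/ and keeps its current name; the environment name pandas-downstream is just an example):

```bash
# Create a dedicated environment from the CI downstream-compat file and activate it.
# (Path and file name assumed from the comment above; adjust if they change.)
mamba env create -n pandas-downstream -f ci/deps/actions-38-downstream_compat.yaml
conda activate pandas-downstream
```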
Yeah if it's as simple as reorganizing our CI deps and |
Just to check I've understood, would your aim be:
If we wanted to leverage dockerhub we could create images like:
And some other variations. When I set up an alpine image for pandas with numpy, pytz, dateutil, hypothesis, cython and pytest, it was around a 300 MB image. The 3.X-all is akin to our current Dockerfile and clocks in under 3 GB; mamba-all is what we have on gitpod, which clocks in at a little over 6 GB. The advantage of docker would be that users can pick what they want. I have a main development environment that has all of our dependencies, but setting that up for one-off tests in other versions again is a pain. Having an alpine image for another Python version would make it trivial to hop around.
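As a rough illustration of that workflow (the image tag and the /home/pandas mount point below are assumptions, not published names; check the repository's Dockerfile for its actual WORKDIR):

```bash
# Build a development image from the Dockerfile at the repository root.
docker build -t pandas-dev .

# Run it with the local checkout mounted, so edits on the host are visible inside the container.
docker run --rm -it -v "$(pwd)":/home/pandas pandas-dev
```

Published variants on Docker Hub would let contributors skip the build step and simply pull the flavour they need.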
Yeah that was the flow I was imagining, such that users could install a subset of dependencies but also know that extra dependencies for development are specified in a certain place
Although it would be a larger development change, I would be supportive of moving in this direction, especially since it allows users to use the same environment as what could be tested on the CI.
I think docker can be very nice if you want to be able to replicate exactly one of the environments on CI (versions combo), but for local development I want an easy way to set up without docker, using environment.yml.
I can look at setting those up. We probably would want an official pandas-dev dockerhub account. I can apply for this unless there are any concerns:
The main goal of this issue (at least the reason I opened it) is to simplify environment.yml. I find it important that this works without docker, but if the concern is that people should still be able to set up an environment that is close to the one on CI, and docker can help to alleviate that concern, that sounds great. But then maybe we should open a dedicated discussion about "dockerizing our CI", since that is a bigger change. @WillAyd there are already several pandas dockerhub accounts: there are "pandas" and "pandas-dev" accounts that I don't know who owns (I asked on the core mailing list a few months ago without an answer), so I created the "pythonpandas" org (https://hub.docker.com/u/pythonpandas), to which I can add people (at the moment we only use it for gitpod). But it would be nice to have "pandas-dev".
Hmm, interesting, and thanks for the heads up. I posted in the form that we would like the pandas name. Will update when I hear back.
Back to the original topic of the environment.yml file: from a code search, it seems we currently use this file on CI for the following actions:
The other actions use a specific env file from the ci/deps directory. For docstring validation, typing and the asv benchmarks we don't need any of the "downstream" packages, I think; only for the docs would we need some of them. So based on the above, a concrete proposal as a first step of reducing environment.yml: remove the downstream packages, and add instructions for the few contributors who need them (those instructions could also be: create a separate env based on the downstream-compat CI file; see the sketch below).
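A sketch of what those instructions could look like; the package names are just examples mentioned elsewhere in this thread and the CI file name is the one cited above, so both may differ in practice:

```bash
# Option A: install the removed packages on top of an existing pandas-dev environment.
mamba install -n pandas-dev geopandas statsmodels scikit-learn

# Option B: layer the CI downstream-compat file onto the same environment in a second step.
conda env update -n pandas-dev -f ci/deps/actions-38-downstream_compat.yaml
```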
I think that's a good first step.
Big +1 to that.
Sure, although this breaks my hope that "environment.yml (local install point) contains everything the CI has".
Would "envrionment.yml (local install point) contains everything 99% of contributors need that the CI has" be an acceptable compromise? Very few contributors are running downstream tests |
Yeah, that's definitely okay with me to remove the downstream test dependencies (and other rarely used ones too) from environment.yml.
To not lose sight of it, maybe we split those off from requirements_dev.txt into requirements_downstream.txt or something similar to that? I believe pip allows you to specify multiple install files; not sure about conda.
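pip does accept repeated -r flags in one invocation, so a split along those lines could be installed like this (the file names are the hypothetical ones from the comment above):

```bash
# Base development dependencies only.
pip install -r requirements_dev.txt

# Base plus the downstream-test extras in a single call (pip allows multiple -r flags).
pip install -r requirements_dev.txt -r requirements_downstream.txt
```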
I like the idea of split files as well. It would make it easier to explain in documentation how to install the base, and how to install the full environment, with context on why you'd want to use either.
Split files sounds good.
Regarding splitting files, I am not sure you can do that with conda if you want to use environment.yml-like files and create the environment in a single step. In arrow, we do have a set of different files with dependencies for conda (so you can choose which set to install), but they are plain txt files (like requirements.txt files, but with conda dependencies), and then the instruction to create the environment passes several of those files at once. An alternative with multiple files is to update the environment created from the base environment.yml in a second step (both approaches are sketched below). One of those ways of splitting into multiple files would certainly be useful if we want to further reduce the main environment.yml (after #50157).
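A sketch of both approaches, with hypothetical file names: the single-step variant relies on conda's repeatable --file option with plain package-spec lists, and the two-step variant layers an extra file onto the environment created from the base environment.yml:

```bash
# Single step: combine plain-text dependency lists (conda package specs) at creation time.
conda create -n pandas-dev --file requirements-base.txt --file requirements-test.txt

# Two steps: create from the base environment.yml first, then add the extras on top.
conda env create -f environment.yml
conda env update -n pandas-dev -f environment_downstream.yml
```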
Hi! @agroce and I are having a similar issue. I am trying to run the test suite and get the same results on my machine as on GitHub Actions. I read https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#running-the-test-suite and followed the instructions to install mamba, activate the pandas-dev environment, and then run pytest, but I observed many more errors/failures than on GitHub Actions. All of the error messages have to do with databases, so I guess my system is missing the required database setup and I would need to install the testing databases in order to get those tests to pass.
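One way to get closer to the CI results locally without setting up databases is to deselect those tests by marker; a sketch, assuming the database-backed tests carry pandas' db marker (adjust the expression if they are marked differently):

```bash
# Skip tests that need a running MySQL/PostgreSQL instance.
python -m pytest pandas -m "not db"
```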
The main entry point for setting up a development environment for people to start contributing is to use the environment.yml file. Currently, this includes everything you would potentially need to run every single test, build every page of the docs, etc. This is of course nice (so people aren't surprised about something not working), but it also creates a huge and complex environment (more to download, taking up a lot of space, and slower to solve with conda).
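For reference, the current full-environment flow looks roughly like this (a sketch based on the contributing guide; pandas-dev is the environment name defined in environment.yml):

```bash
# Create and activate the complete development environment from the repository root.
mamba env create -f environment.yml
conda activate pandas-dev
```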
I think that for 99% of contributors you don't necessarily need all optional (test) dependencies. For example, we have some packages installed that are used for just a single test (e.g. geopandas, statsmodels, scikit-learn). pytorch was recently removed for this reason (#49796).

But the problem is that we don't only use the environment.yml for contributors setting up a dev env, but also for testing in our CI. So of course, if we make the environment.yml slimmer, then we need to solve this issue.

There is a PR to remove geopandas, but we should probably have a more general discussion about this, since it is not just geopandas. From that PR, a comment by @mroeschke in #49994 (comment).

cc @MarcoGorelli @noatamir