Table of Contents
- Introduction
- What This Proposal is Not About
- Why Git, and Why GitHub?
- Straw Man Migration Plan
- One or Multiple Repositories?
- Workflow Before/After
- Moving Local Branches to the Monorepo
- References
This is a proposal to move our current revision control system from our own hosted Subversion to GitHub. Below are the financial and technical arguments as to why we are proposing such a move and how people (and validation infrastructure) will continue to work with a Git-based LLVM.
There will be a survey pointing at this document which we'll use to gauge the community's reaction and, if we collectively decide to move, the time-frame. Be sure to make your view count.
Additionally, we will discuss this during a BoF at the next US LLVM Developer meeting (http://llvm.org/devmtg/2016-11/).
Changing the development policy.
This proposal relates only to moving the hosting of our source-code repository from SVN hosted on our own servers to Git hosted on GitHub. We are not proposing using GitHub's issue tracker, pull-requests, or code-review.
Contributors will continue to earn commit access on demand under the Developer Policy, except that that a GitHub account will be required instead of SVN username/password-hash.
This discussion began because we currently host our own Subversion server and Git mirror on a voluntary basis. The LLVM Foundation sponsors the server and provides limited support, but there is only so much it can do.
Volunteers are not sysadmins themselves, but compiler engineers that happen to know a thing or two about hosting servers. We also don't have 24/7 support, and we sometimes wake up to see that continuous integration is broken because the SVN server is either down or unresponsive.
We should take advantage of one of the services out there (GitHub, GitLab, and BitBucket, among others) that offer better service (24/7 stability, disk space, Git server, code browsing, forking facilities, etc) for free.
Many new coders nowadays start with Git, and a lot of people have never used SVN, CVS, or anything else. Websites like GitHub have changed the landscape of open source contributions, reducing the cost of first contribution and fostering collaboration.
Git is also the version control many LLVM developers use. Despite the sources being stored in a SVN server, these developers are already using Git through the Git-SVN integration.
Git allows you to:
- Commit, squash, merge, and fork locally without touching the remote server.
- Maintain local branches, enabling multiple threads of development.
- Collaborate on these branches (e.g. through your own fork of llvm on GitHub).
- Inspect the repository history (blame, log, bisect) without Internet access.
- Maintain remote forks and branches on Git hosting services and integrate back to the main repository.
In addition, because Git seems to be replacing many OSS projects' version control systems, there are many tools that are built over Git. Future tooling may support Git first (if not only).
GitHub, like GitLab and BitBucket, provides free code hosting for open source projects. Any of these could replace the code-hosting infrastructure that we have today.
These services also have a dedicated team to monitor, migrate, improve and distribute the contents of the repositories depending on region and load.
GitHub has one important advantage over GitLab and BitBucket: it offers read-write SVN access to the repository (https://github.com/blog/626-announcing-svn-support). This would enable people to continue working post-migration as though our code were still canonically in an SVN repository.
In addition, there are already multiple LLVM mirrors on GitHub, indicating that part of our community has already settled there.
The current SVN repository hosts all the LLVM sub-projects alongside each other. A single revision number (e.g. r123456) thus identifies a consistent version of all LLVM sub-projects.
Git does not use sequential integer revision number but instead uses a hash to identify each commit.
The loss of a sequential integer revision number has been a sticking point in past discussions about Git:
- "The 'branch' I most care about is mainline, and losing the ability to say 'fixed in r1234' (with some sort of monotonically increasing number) would be a tragic loss." [LattnerRevNum]
- "I like those results sorted by time and the chronology should be obvious, but timestamps are incredibly cumbersome and make it difficult to verify that a given checkout matches a given set of results." [TrickRevNum]
- "There is still the major regression with unreadable version numbers. Given the amount of Bugzilla traffic with 'Fixed in...', that's a non-trivial issue." [JSonnRevNum]
- "Sequential IDs are important for LNT and llvmlab bisection tool." [MatthewsRevNum].
However, Git can emulate this increasing revision number:
git rev-list --count <commit-hash>
. This identifier is unique only
within a single branch, but this means the tuple (num, branch-name) uniquely
identifies a commit.
We can thus use this revision number to ensure that e.g. clang -v reports a user-friendly revision number (e.g. master-12345 or 4.0-5321), addressing the objections raised above with respect to this aspect of Git.
In contrast to SVN, Git makes branching easy. Git's commit history is represented as a DAG, a departure from SVN's linear history. However, we propose to mandate making merge commits illegal in our canonical Git repository.
Unfortunately, GitHub does not support server side hooks to enforce such a policy. We must rely on the community to avoid pushing merge commits.
GitHub offers a feature called Status Checks: a branch protected by status checks requires commits to be whitelisted before the push can happen. We could supply a pre-push hook on the client side that would run and check the history, before whitelisting the commit being pushed [statuschecks]. However this solution would be somewhat fragile (how do you update a script installed on every developer machine?) and prevents SVN access to the repository.
We will need a new bot to send emails for each commit. This proposal leaves the email format unchanged besides the commit URL.
- Update docs to mention the move, so people are aware of what is going on.
- Set up a read-only version of the GitHub project, mirroring our current SVN repository.
- Add the required bots to implement the commit emails, as well as the umbrella repository update (if the multirepo is selected) or the read-only Git views for the sub-projects (if the monorepo is selected).
- Update the buildbots to pick up updates and commits from the GitHub repository. Not all bots have to migrate at this point, but it'll help provide infrastructure testing.
- Update Phabricator to pick up commits from the GitHub repository.
- LNT and llvmlab have to be updated: they rely on unique monotonically increasing integer across branch [MatthewsRevNum].
- Instruct downstream integrators to pick up commits from the GitHub repository.
- Review and prepare an update for the LLVM documentation.
Until this point nothing has changed for developers, it will just boil down to a lot of work for buildbot and other infrastructure owners.
The migration will pause here until all dependencies have cleared, and all problems have been solved.
- Collect developers' GitHub account information, and add them to the project.
- Switch the SVN repository to read-only and allow pushes to the GitHub repository.
- Update the documentation.
- Mirror Git to SVN.
- Archive the SVN repository.
- Update links on the LLVM website pointing to viewvc/klaus/phab etc. to point to GitHub instead.
There are two major variants for how to structure our Git repository: The "multirepo" and the "monorepo".
This variant recommends moving each LLVM sub-project to a separate Git repository. This mimics the existing official read-only Git repositories (e.g., http://llvm.org/git/compiler-rt.git), and creates new canonical repositories for each sub-project.
This will allow the individual sub-projects to remain distinct: a developer interested only in compiler-rt can checkout only this repository, build it, and work in isolation of the other sub-projects.
A key need is to be able to check out multiple projects (i.e. lldb+clang+llvm or clang+llvm+libcxx for example) at a specific revision.
A tuple of revisions (one entry per repository) accurately describes the state across the sub-projects. For example, a given version of clang would be <LLVM-12345, clang-5432, libcxx-123, etc.>.
To make this more convenient, a separate umbrella repository will be provided. This repository will be used for the sole purpose of understanding the sequence in which commits were pushed to the different repositories and to provide a single revision number.
This umbrella repository will be read-only and continuously updated to record the above tuple. The proposed form to record this is to use Git [submodules], possibly along with a set of scripts to help check out a specific revision of the LLVM distribution.
A regular LLVM developer does not need to interact with the umbrella repository -- the individual repositories can be checked out independently -- but you would need to use the umbrella repository to bisect multiple sub-projects at the same time, or to check-out old revisions of LLVM with another sub-project at a consistent state.
This umbrella repository will be updated automatically by a bot (running on notice from a webhook on every push, and periodically) on a per commit basis: a single commit in the umbrella repository would match a single commit in a sub-project.
Downstream SVN users can use the read/write SVN bridges with the following caveats:
- Be prepared for a one-time change to the upstream revision numbers.
- The upstream sub-project revision numbers will no longer be in sync.
Downstream Git users can continue without any major changes, with the minor change of upstreaming using git push instead of git svn dcommit.
Git users also have the option of adopting an umbrella repository downstream. The tooling for the upstream umbrella can easily be reused for downstream needs, incorporating extra sub-projects and branching in parallel with sub-project branches.
As a preview (disclaimer: this rough prototype, not polished and not representative of the final solution), you can look at the following:
- Repository: https://github.com/llvm-beanz/llvm-submodules
- Update bot: http://beanz-bot.com:8180/jenkins/job/submodule-update/
- Because GitHub does not allow server-side hooks, and because there is no "push timestamp" in Git, the umbrella repository sequence isn't totally exact: commits from different repositories pushed around the same time can appear in different orders. However, we don't expect it to be the common case or to cause serious issues in practice.
- You can't have a single cross-projects commit that would update both LLVM and other sub-projects (something that can be achieved now). It would be possible to establish a protocol whereby users add a special token to their commit messages that causes the umbrella repo's updater bot to group all of them into a single revision.
- Another option is to group commits that were pushed closely enough together in the umbrella repository. This has the advantage of allowing cross-project commits, and is less sensitive to mis-ordering commits. However, this has the potential to group unrelated commits together, especially if the bot goes down and needs to catch up.
- This variant relies on heavier tooling. But the current prototype shows that it is not out-of-reach.
- Submodules don't have a good reputation / are complicating the command line. However, in the proposed setup, a regular developer will seldom interact with submodules directly, and certainly never update them.
- Refactoring across projects is not friendly: taking some functions from clang to make it part of a utility in libSupport wouldn't carry the history of the code in the llvm repo, preventing recursively applying git blame for instance. However, this is not very different than how most people are Interacting with the repository today, by splitting such change in multiple commits.
- :ref:`Checkout/Clone a Single Project, without Commit Access <workflow-checkout-commit>`.
- :ref:`Checkout/Clone a Single Project, with Commit Access <workflow-multicheckout-nocommit>`.
- :ref:`Checkout/Clone Multiple Projects, with Commit Access <workflow-multicheckout-multicommit>`.
- :ref:`Commit an API Change in LLVM and Update the Sub-projects <workflow-cross-repo-commit>`.
- :ref:`Branching/Stashing/Updating for Local Development or Experiments <workflow-multi-branching>`.
- :ref:`Bisecting <workflow-multi-bisecting>`.
This variant recommends moving all LLVM sub-projects to a single Git repository, similar to https://github.com/llvm-project/llvm-project. This would mimic an export of the current SVN repository, with each sub-project having its own top-level directory. Not all sub-projects are used for building toolchains. In practice, www/ and test-suite/ will probably stay out of the monorepo.
Putting all sub-projects in a single checkout makes cross-project refactoring naturally simple:
- New sub-projects can be trivially split out for better reuse and/or layering (e.g., to allow libSupport and/or LIT to be used by runtimes without adding a dependency on LLVM).
- Changing an API in LLVM and upgrading the sub-projects will always be done in a single commit, designing away a common source of temporary build breakage.
- Moving code across sub-project (during refactoring for instance) in a single commit enables accurate git blame when tracking code change history.
- Tooling based on git grep works natively across sub-projects, allowing to easier find refactoring opportunities across projects (for example reusing a datastructure initially in LLDB by moving it into libSupport).
- Having all the sources present encourages maintaining the other sub-projects when changing API.
Finally, the monorepo maintains the property of the existing SVN repository that the sub-projects move synchronously, and a single revision number (or commit hash) identifies the state of the development across all projects.
Nobody will be forced to build unnecessary projects. The exact structure is TBD, but making it trivial to configure builds for a single sub-project (or a subset of sub-projects) is a hard requirement.
As an example, it could look like the following:
mkdir build && cd build # Configure only LLVM (default) cmake path/to/monorepo # Configure LLVM and lld cmake path/to/monorepo -DLLVM_ENABLE_PROJECTS=lld # Configure LLVM and clang cmake path/to/monorepo -DLLVM_ENABLE_PROJECTS=clang
With the Monorepo, the existing single-subproject mirrors (e.g. http://llvm.org/git/compiler-rt.git) with git-svn read-write access would continue to be maintained: developers would continue to be able to use the existing single-subproject git repositories as they do today, with no changes to workflow. Everything (git fetch, git svn dcommit, etc.) could continue to work identically to how it works today. The monorepo can be set-up such that the SVN revision number matches the SVN revision in the GitHub SVN-bridge.
Downstream SVN users can use the read/write SVN bridge. The SVN revision number can be preserved in the monorepo, minimizing the impact.
Downstream Git users can continue without any major changes, by using the git-svn mirrors on top of the SVN bridge.
Git users can also work upstream with monorepo even if their downstream fork has split repositories. They can apply patches in the appropriate subdirectories of the monorepo using, e.g., git am --directory=..., or plain diff and patch.
Alternatively, Git users can migrate their own fork to the monorepo. As a demonstration, we've migrated the "CHERI" fork to the monorepo in two ways:
- Using a script that rewrites history (including merges) so that it looks like the fork always lived in the monorepo [LebarCHERI]. The upside of this is when you check out an old revision, you get a copy of all llvm sub-projects at a consistent revision. (For instance, if it's a clang fork, when you check out an old revision you'll get a consistent version of llvm proper.) The downside is that this changes the fork's commit hashes.
- Merging the fork into the monorepo [AminiCHERI]. This preserves the fork's commit hashes, but when you check out an old commit you only get the one sub-project.
As a preview (disclaimer: this rough prototype, not polished and not representative of the final solution), you can look at the following:
- Full Repository: https://github.com/joker-eph/llvm-project
- Single sub-project view with SVN write access to the full repo: https://github.com/joker-eph/compiler-rt
- Using the monolithic repository may add overhead for those contributing to a standalone sub-project, particularly on runtimes like libcxx and compiler-rt that don't rely on LLVM; currently, a fresh clone of libcxx is only 15MB (vs. 1GB for the monorepo), and the commit rate of LLVM may cause more frequent git push collisions when upstreaming. Affected contributors can continue to use the SVN bridge or the single-subproject Git mirrors with git-svn for read-write.
- Using the monolithic repository may add overhead for those integrating a standalone sub-project, even if they aren't contributing to it, due to the same disk space concern as the point above. The availability of the sub-project Git mirror addresses this, even without SVN access.
- Preservation of the existing read/write SVN-based workflows relies on the GitHub SVN bridge, which is an extra dependency. Maintaining this locks us into GitHub and could restrict future workflow changes.
- :ref:`Checkout/Clone a Single Project, without Commit Access <workflow-checkout-commit>`.
- :ref:`Checkout/Clone a Single Project, with Commit Access <workflow-monocheckout-nocommit>`.
- :ref:`Checkout/Clone Multiple Projects, with Commit Access <workflow-monocheckout-multicommit>`.
- :ref:`Commit an API Change in LLVM and Update the Sub-projects <workflow-cross-repo-commit>`.
- :ref:`Branching/Stashing/Updating for Local Development or Experiments <workflow-mono-branching>`.
- :ref:`Bisecting <workflow-mono-bisecting>`.
This variant recommends moving only the LLVM sub-projects that are rev-locked to LLVM into a monorepo (clang, lld, lldb, ...), following the multirepo proposal for the rest. While neither variant recommends combining sub-projects like www/ and test-suite/ (which are completely standalone), this goes further and keeps sub-projects like libcxx and compiler-rt in their own distinct repositories.
- This has most disadvantages of multirepo and monorepo, without bringing many of the advantages.
- Downstream have to upgrade to the monorepo structure, but only partially. So they will keep the infrastructure to integrate the other separate sub-projects.
- All projects that use LIT for testing are effectively rev-locked to LLVM. Furthermore, some runtimes (like compiler-rt) are rev-locked with Clang. It's not clear where to draw the lines.
This section goes through a few examples of workflows, intended to illustrate how end-users or developers would interact with the repository for various use-cases.
Except the URL, nothing changes. The possibilities today are:
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm # or with Git git clone http://llvm.org/git/llvm.git
After the move to GitHub, you would do either:
git clone https://github.com/llvm-project/llvm.git # or using the GitHub svn native bridge svn co https://github.com/llvm-project/llvm/trunk
The above works for both the monorepo and the multirepo, as we'll maintain the existing read-only views of the individual sub-projects.
# direct SVN checkout svn co https://user@llvm.org/svn/llvm-project/llvm/trunk llvm # or using the read-only Git view, with git-svn git clone http://llvm.org/git/llvm.git cd llvm git svn init https://llvm.org/svn/llvm-project/llvm/trunk --username=<username> git config svn-remote.svn.fetch :refs/remotes/origin/master git svn rebase -l # -l avoids fetching ahead of the git mirror.
Commits are performed using svn commit or with the sequence git commit and git svn dcommit.
With the multirepo variant, nothing changes but the URL, and commits can be performed using svn commit or git commit and git push:
git clone https://github.com/llvm/llvm.git llvm # or using the GitHub svn native bridge svn co https://github.com/llvm/llvm/trunk/ llvm
With the monorepo variant, there are a few options, depending on your constraints. First, you could just clone the full repository:
git clone https://github.com/llvm/llvm-projects.git llvm # or using the GitHub svn native bridge svn co https://github.com/llvm/llvm-projects/trunk/ llvm
At this point you have every sub-project (llvm, clang, lld, lldb, ...), which :ref:`doesn't imply you have to build all of them <build_single_project>`. You can still build only compiler-rt for instance. In this way it's not different from someone who would check out all the projects with SVN today.
You can commit as normal using git commit and git push or svn commit, and read the history for a single project (git log libcxx for example).
Secondly, there are a few options to avoid checking out all the sources.
Using the GitHub SVN bridge
The GitHub SVN native bridge allows to checkout a subdirectory directly:
svn co https://github.com/llvm/llvm-projects/trunk/compiler-rt compiler-rt —username=...
This checks out only compiler-rt and provides commit access using "svn commit", in the same way as it would do today.
Using a Subproject Git Mirror
You can use git-svn and one of the sub-project mirrors:
# Clone from the single read-only Git repo git clone http://llvm.org/git/llvm.git cd llvm # Configure the SVN remote and initialize the svn metadata $ git svn init https://github.com/joker-eph/llvm-project/trunk/llvm —username=... git config svn-remote.svn.fetch :refs/remotes/origin/master git svn rebase -l
In this case the repository contains only a single sub-project, and commits can be made using git svn dcommit, again exactly as we do today.
Using a Sparse Checkouts
You can hide the other directories using a Git sparse checkout:
git config core.sparseCheckout true echo /compiler-rt > .git/info/sparse-checkout git read-tree -mu HEAD
The data for all sub-projects is still in your .git directory, but in your checkout, you only see compiler-rt. Before you push, you'll need to fetch and rebase (git pull --rebase) as usual.
Note that when you fetch you'll likely pull in changes to sub-projects you don't care about. If you are using spasre checkout, the files from other projects won't appear on your disk. The only effect is that your commit hash changes.
You can check whether the changes in the last fetch are relevant to your commit by running:
git log origin/master@{1}..origin/master -- libcxx
This command can be hidden in a script so that git llvmpush would perform all these steps, fail only if such a dependent change exists, and show immediately the change that prevented the push. An immediate repeat of the command would (almost) certainly result in a successful push. Note that today with SVN or git-svn, this step is not possible since the "rebase" implicitly happens while committing (unless a conflict occurs).
Let's look how to assemble llvm+clang+libcxx at a given revision.
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm -r $REVISION cd llvm/tools svn co http://llvm.org/svn/llvm-project/clang/trunk clang -r $REVISION cd ../projects svn co http://llvm.org/svn/llvm-project/libcxx/trunk libcxx -r $REVISION
Or using git-svn:
git clone http://llvm.org/git/llvm.git cd llvm/ git svn init https://llvm.org/svn/llvm-project/llvm/trunk --username=<username> git config svn-remote.svn.fetch :refs/remotes/origin/master git svn rebase -l git checkout `git svn find-rev -B r258109` cd tools git clone http://llvm.org/git/clang.git cd clang/ git svn init https://llvm.org/svn/llvm-project/clang/trunk --username=<username> git config svn-remote.svn.fetch :refs/remotes/origin/master git svn rebase -l git checkout `git svn find-rev -B r258109` cd ../../projects/ git clone http://llvm.org/git/libcxx.git cd libcxx git svn init https://llvm.org/svn/llvm-project/libcxx/trunk --username=<username> git config svn-remote.svn.fetch :refs/remotes/origin/master git svn rebase -l git checkout `git svn find-rev -B r258109`
Note that the list would be longer with more sub-projects.
With the multirepo variant, the umbrella repository will be used. This is where the mapping from a single revision number to the individual repositories revisions is stored.:
git clone https://github.com/llvm-beanz/llvm-submodules cd llvm-submodules git checkout $REVISION git submodule init git submodule update clang llvm libcxx # the list of sub-project is optional, `git submodule update` would get them all.
At this point the clang, llvm, and libcxx individual repositories are cloned and stored alongside each other. There are CMake flags to describe the directory structure; alternatively, you can just symlink clang to llvm/tools/clang, etc.
Another option is to checkout repositories based on the commit timestamp:
git checkout `git rev-list -n 1 --before="2009-07-27 13:37" master`
The repository contains natively the source for every sub-projects at the right revision, which makes this straightforward:
git clone https://github.com/llvm/llvm-projects.git llvm-projects cd llvm-projects git checkout $REVISION
As before, at this point clang, llvm, and libcxx are stored in directories alongside each other.
Today this is possible, even though not common (at least not documented) for subversion users and for git-svn users. For example, few Git users try to update LLD or Clang in the same commit as they change an LLVM API.
The multirepo variant does not address this: one would have to commit and push separately in every individual repository. It would be possible to establish a protocol whereby users add a special token to their commit messages that causes the umbrella repo's updater bot to group all of them into a single revision.
The monorepo variant handles this natively.
SVN does not allow this use case, but developers that are currently using git-svn can do it. Let's look in practice what it means when dealing with multiple sub-projects.
To update the repository to tip of trunk:
git pull cd tools/clang git pull cd ../../projects/libcxx git pull
To create a new branch:
git checkout -b MyBranch cd tools/clang git checkout -b MyBranch cd ../../projects/libcxx git checkout -b MyBranch
To switch branches:
git checkout AnotherBranch cd tools/clang git checkout AnotherBranch cd ../../projects/libcxx git checkout AnotherBranch
The multirepo works the same as the current Git workflow: every command needs to be applied to each of the individual repositories. However, the umbrella repository makes this easy using git submodule foreach to replicate a command on all the individual repositories (or submodules in this case):
To create a new branch:
git submodule foreach git checkout -b MyBranch
To switch branches:
git submodule foreach git checkout AnotherBranch
Regular Git commands are sufficient, because everything is in a single repository:
To update the repository to tip of trunk:
git pull
To create a new branch:
git checkout -b MyBranch
To switch branches:
git checkout AnotherBranch
Assuming a developer is looking for a bug in clang (or lld, or lldb, ...).
SVN does not have builtin bisection support, but the single revision across sub-projects makes it possible to script around.
Using the existing Git read-only view of the repositories, it is possible to use the native Git bisection script over the llvm repository, and use some scripting to synchronize the clang repository to match the llvm revision.
With the multi-repositories variant, the cross-repository synchronization is achieved using the umbrella repository. This repository contains only submodules for the other sub-projects. The native Git bisection can be used on the umbrella repository directly. A subtlety is that the bisect script itself needs to make sure the submodules are updated accordingly.
For example, to find which commit introduces a regression where clang-3.9 crashes but not clang-3.8 passes, one should be able to simply do:
git bisect start release_39 release_38 git bisect run ./bisect_script.sh
With the bisect_script.sh script being:
#!/bin/sh cd $UMBRELLA_DIRECTORY git submodule update llvm clang libcxx #.... cd $BUILD_DIR ninja clang || exit 125 # an exit code of 125 asks "git bisect" # to "skip" the current commit ./bin/clang some_crash_test.cpp
When the git bisect run command returns, the umbrella repository is set to the state where the regression is introduced. The commit diff in the umbrella indicate which submodule was updated, and the last commit in this sub-projects is the one that the bisect found.
Bisecting on the monorepo is straightforward, and very similar to the above, except that the bisection script does not need to include the git submodule update step.
The same example, finding which commit introduces a regression where clang-3.9 crashes but not clang-3.8 passes, will look like:
git bisect start release_39 release_38 git bisect run ./bisect_script.sh
With the bisect_script.sh script being:
#!/bin/sh cd $BUILD_DIR ninja clang || exit 125 # an exit code of 125 asks "git bisect" # to "skip" the current commit ./bin/clang some_crash_test.cpp
Also, since the monorepo handles commits update across multiple projects, you're less like to encounter a build failure where a commit change an API in LLVM and another later one "fixes" the build in clang.
Suppose you have been developing against the existing LLVM git mirrors. You have one or more git branches that you want to migrate to the "final monorepo".
The simplest way to migrate such branches is with the
migrate-downstream-fork.py
tool at
https://github.com/jyknight/llvm-git-migration.
Basic instructions for migrate-downstream-fork.py
are in the
Python script and are expanded on below to a more general recipe:
# Make a repository which will become your final local mirror of the # monorepo. mkdir my-monorepo git -C my-monorepo init # Add a remote to the monorepo. git -C my-monorepo remote add upstream/monorepo https://github.com/llvm/llvm-project.git # Add remotes for each git mirror you use, from upstream as well as # your local mirror. All projects are listed here but you need only # import those for which you have local branches. my_projects=( clang clang-tools-extra compiler-rt debuginfo-tests libcxx libcxxabi libunwind lld lldb llvm openmp polly ) for p in ${my_projects[@]}; do git -C my-monorepo remote add upstream/split/${p} https://github.com/llvm-mirror/${p}.git git -C my-monorepo remote add local/split/${p} https://my.local.mirror.org/${p}.git done # Pull in all the commits. git -C my-monorepo fetch --all # Run migrate-downstream-fork to rewrite local branches on top of # the upstream monorepo. ( cd my-monorepo migrate-downstream-fork.py \ refs/remotes/local \ refs/tags \ --new-repo-prefix=refs/remotes/upstream/monorepo \ --old-repo-prefix=refs/remotes/upstream/split \ --source-kind=split \ --revmap-out=monorepo-map.txt ) # Octopus-merge the resulting local split histories to unify them. # Assumes local work on local split mirrors is on master (and # upstream is presumably represented by some other branch like # upstream/master). my_local_branch="master" git -C my-monorepo branch --no-track local/octopus/master \ $(git -C my-monorepo merge-base refs/remotes/upstream/monorepo/master \ refs/remotes/local/split/llvm/${my_local_branch}) git -C my-monorepo checkout local/octopus/${my_local_branch} subproject_branches=() for p in ${my_projects[@]}; do subproject_branch=${p}/local/monorepo/${my_local_branch} git -C my-monorepo branch ${subproject_branch} \ refs/remotes/local/split/${p}/${my_local_branch} if [[ "${p}" != "llvm" ]]; then subproject_branches+=( ${subproject_branch} ) fi done git -C my-monorepo merge ${subproject_branches[@]} for p in ${my_projects[@]}; do subproject_branch=${p}/local/monorepo/${my_local_branch} git -C my-monorepo branch -d ${subproject_branch} done # Create local branches for upstream monorepo branches. for ref in $(git -C my-monorepo for-each-ref --format="%(refname)" \ refs/remotes/upstream/monorepo); do upstream_branch=${ref#refs/remotes/upstream/monorepo/} git -C my-monorepo branch upstream/${upstream_branch} ${ref} done
The above gets you to a state like the following:
U1 - U2 - U3 <- upstream/master \ \ \ \ \ - Llld1 - Llld2 - \ \ \ \ - Lclang1 - Lclang2-- Lmerge <- local/octopus/master \ / - Lllvm1 - Lllvm2-----
Each branched component has its branch rewritten on top of the monorepo and all components are unified by a giant octopus merge.
If additional active local branches need to be preserved, the above
operations following the assignment to my_local_branch
should be
done for each branch. Ref paths will need to be updated to map the
local branch to the corresponding upstream branch. If local branches
have no corresponding upstream branch, then the creation of
local/octopus/<local branch>
need not use git-merge-base
to
pinpont its root commit; it may simply be branched from the
appropriate component branch (say, llvm/local_release_X
).
The octopus merge is suboptimal for many cases, because walking back through the history of one component leaves the other components fixed at a history that likely makes things unbuildable.
Some downstream users track the order commits were made to subprojects with some kind of "umbrella" project that imports the project git mirrors as submodules, similar to the multirepo umbrella proposed above. Such an umbrella repository looks something like this:
UM1 ---- UM2 -- UM3 -- UM4 ---- UM5 ---- UM6 ---- UM7 ---- UM8 <- master | | | | | | | Lllvm1 Llld1 Lclang1 Lclang2 Lllvm2 Llld2 Lmyproj1
The vertical bars represent submodule updates to a particular local
commit in the project mirror. UM3
in this case is a commit of
some local umbrella repository state that is not a submodule update,
perhaps a README
or project build script update. Commit UM8
updates a submodule of local project myproj
.
The tool zip-downstream-fork.py
at
https://github.com/greened/llvm-git-migration/tree/zip can be used to
convert the umbrella history into a monorepo-based history with
commits in the order implied by submodule updates:
U1 - U2 - U3 <- upstream/master \ \ \ \ -----\--------------- local/zip--. \ \ \ | - Lllvm1 - Llld1 - UM3 - Lclang1 - Lclang2 - Lllvm2 - Llld2 - Lmyproj1 <-'
The U*
commits represent upstream commits to the monorepo master
branch. Each submodule update in the local UM*
commits brought in
a subproject tree at some local commit. The trees in the L*1
commits represent merges from upstream. These result in edges from
the U*
commits to their corresponding rewritten L*1
commits.
The L*2
commits did not do any merges from upstream.
Note that the merge from U2
to Lclang1
appears redundant, but
if, say, U3
changed some files in upstream clang, the Lclang1
commit appearing after the Llld1
commit would actually represent a
clang tree earlier in the upstream clang history. We want the
local/zip
branch to accurately represent the state of our umbrella
history and so the edge U2 -> Lclang1
is a visual reminder of what
clang's tree actually looks like in Lclang1
.
Even so, the edge U3 -> Llld1
could be problematic for future
merges from upstream. git will think that we've already merged from
U3
, and we have, except for the state of the clang tree. One
possible migitation strategy is to manually diff clang between U2
and U3
and apply those updates to local/zip
. Another,
possibly simpler strategy is to freeze local work on downstream
branches and merge all submodules from the latest upstream before
running zip-downstream-fork.py
. If downstream merged each project
from upstream in lockstep without any intervening local commits, then
things should be fine without any special action. We anticipate this
to be the common case.
The tree for Lclang1
outside of clang will represent the state of
things at U3
since all of the upstream projects not participating
in the umbrella history should be in a state respecting the commit
U3
. The trees for llvm and lld should correctly represent commits
Lllvm1
and Llld1
, respectively.
Commit UM3
changed files not related to submodules and we need
somewhere to put them. It is not safe in general to put them in the
monorepo root directory because they may conflict with files in the
monorepo. Let's assume we want them in a directory local
in the
monorepo.
Example 1: Umbrella looks like the monorepo
For this example, we'll assume that each subproject appears in its own
top-level directory in the umbrella, just as they do in the monorepo .
Let's also assume that we want the files in directory myproj
to
appear in local/myproj
.
Given the above run of migrate-downstream-fork.py
, a recipe to
create the zipped history is below:
# Import any non-LLVM repositories the umbrella references. git -C my-monorepo remote add localrepo \ https://my.local.mirror.org/localrepo.git git fetch localrepo subprojects=( clang clang-tools-extra compiler-rt debuginfo-tests libclc libcxx libcxxabi libunwind lld lldb llgo llvm openmp parallel-libs polly pstl ) # Import histories for upstream split projects (this was probably # already done for the ``migrate-downstream-fork.py`` run). for project in ${subprojects[@]}; do git remote add upstream/split/${project} \ https://github.com/llvm-mirror/${subproject}.git git fetch umbrella/split/${project} done # Import histories for downstream split projects (this was probably # already done for the ``migrate-downstream-fork.py`` run). for project in ${subprojects[@]}; do git remote add local/split/${project} \ https://my.local.mirror.org/${subproject}.git git fetch local/split/${project} done # Import umbrella history. git -C my-monorepo remote add umbrella \ https://my.local.mirror.org/umbrella.git git fetch umbrella # Put myproj in local/myproj echo "myproj local/myproj" > my-monorepo/submodule-map.txt # Rewrite history ( cd my-monorepo zip-downstream-fork.py \ refs/remotes/umbrella \ --new-repo-prefix=refs/remotes/upstream/monorepo \ --old-repo-prefix=refs/remotes/upstream/split \ --revmap-in=monorepo-map.txt \ --revmap-out=zip-map.txt \ --subdir=local \ --submodule-map=submodule-map.txt \ --update-tags ) # Create the zip branch (assuming umbrella master is wanted). git -C my-monorepo branch --no-track local/zip/master refs/remotes/umbrella/master
Note that if the umbrella has submodules to non-LLVM repositories,
zip-downstream-fork.py
needs to know about them to be able to
rewrite commits. That is why the first step above is to fetch commits
from such repositories.
With --update-tags
the tool will migrate annotated tags pointing
to submodule commits that were inlined into the zipped history. If
the umbrella pulled in an upstream commit that happened to have a tag
pointing to it, that tag will be migrated, which is almost certainly
not what is wanted. The tag can always be moved back to its original
commit after rewriting, or the --update-tags
option may be
discarded and any local tags would then be migrated manually.
Example 2: Nested sources layout
The tool handles nested submodules (e.g. llvm is a submodule in
umbrella and clang is a submodule in llvm). The file
submodule-map.txt
is a list of pairs, one per line. The first
pair item describes the path to a submodule in the umbrella
repository. The second pair item secribes the path where trees for
that submodule should be written in the zipped history.
Let's say your umbrella repository is actually the llvm repository and
it has submodules in the "nested sources" layout (clang in
tools/clang, etc.). Let's also say projects/myproj
is a submodule
pointing to some downstream repository. The submodule map file should
look like this (we still want myproj mapped the same way as
previously):
tools/clang clang tools/clang/tools/extra clang-tools-extra projects/compiler-rt compiler-rt projects/debuginfo-tests debuginfo-tests projects/libclc libclc projects/libcxx libcxx projects/libcxxabi libcxxabi projects/libunwind libunwind tools/lld lld tools/lldb lldb projects/openmp openmp tools/polly polly projects/myproj local/myproj
If a submodule path does not appear in the map, the tools assumes it should be placed in the same place in the monorepo. That means if you use the "nested sources" layout in your umrella, you must provide map entries for all of the projects in your umbrella (except llvm). Otherwise trees from submodule updates will appear underneath llvm in the zippped history.
Because llvm is itself the umbrella, we use --subdir to write its
content into llvm
in the zippped history:
# Import any non-LLVM repositories the umbrella references. git -C my-monorepo remote add localrepo \ https://my.local.mirror.org/localrepo.git git fetch localrepo subprojects=( clang clang-tools-extra compiler-rt debuginfo-tests libclc libcxx libcxxabi libunwind lld lldb llgo llvm openmp parallel-libs polly pstl ) # Import histories for upstream split projects (this was probably # already done for the ``migrate-downstream-fork.py`` run). for project in ${subprojects[@]}; do git remote add upstream/split/${project} \ https://github.com/llvm-mirror/${subproject}.git git fetch umbrella/split/${project} done # Import histories for downstream split projects (this was probably # already done for the ``migrate-downstream-fork.py`` run). for project in ${subprojects[@]}; do git remote add local/split/${project} \ https://my.local.mirror.org/${subproject}.git git fetch local/split/${project} done # Import umbrella history. We want this under a different refspec # so zip-downstream-fork.py knows what it is. git -C my-monorepo remote add umbrella \ https://my.local.mirror.org/llvm.git git fetch umbrella # Create the submodule map. echo "tools/clang clang" > my-monorepo/submodule-map.txt echo "tools/clang/tools/extra clang-tools-extra" >> my-monorepo/submodule-map.txt echo "projects/compiler-rt compiler-rt" >> my-monorepo/submodule-map.txt echo "projects/debuginfo-tests debuginfo-tests" >> my-monorepo/submodule-map.txt echo "projects/libclc libclc" >> my-monorepo/submodule-map.txt echo "projects/libcxx libcxx" >> my-monorepo/submodule-map.txt echo "projects/libcxxabi libcxxabi" >> my-monorepo/submodule-map.txt echo "projects/libunwind libunwind" >> my-monorepo/submodule-map.txt echo "tools/lld lld" >> my-monorepo/submodule-map.txt echo "tools/lldb lldb" >> my-monorepo/submodule-map.txt echo "projects/openmp openmp" >> my-monorepo/submodule-map.txt echo "tools/polly polly" >> my-monorepo/submodule-map.txt echo "projects/myproj local/myproj" >> my-monorepo/submodule-map.txt # Rewrite history ( cd my-monorepo zip-downstream-fork.py \ refs/remotes/umbrella \ --new-repo-prefix=refs/remotes/upstream/monorepo \ --old-repo-prefix=refs/remotes/upstream/split \ --revmap-in=monorepo-map.txt \ --revmap-out=zip-map.txt \ --subdir=llvm \ --submodule-map=submodule-map.txt \ --update-tags ) # Create the zip branch (assuming umbrella master is wanted). git -C my-monorepo branch --no-track local/zip/master refs/remotes/umbrella/master
Comments at the top of zip-downstream-fork.py
describe in more
detail how the tool works and various implications of its operation.
You may have additional repositories that integrate with the LLVM ecosystem, essentially extending it with new tools. If such repositories are tightly coupled with LLVM, it may make sense to import them into your local mirror of the monorepo.
If such repositores participated in the umbrella repository used
during the zipping process above, they will automatically be added to
the monorepo. For downstream repositories that don't participate in
an umbrella setup, the import-downstream-repo.py
tool at
https://github.com/greened/llvm-git-migration/tree/import can help with
getting them into the monorepo. A recipe follows:
# Import downstream repo history into the monorepo. git -C my-monorepo remote add myrepo https://my.local.mirror.org/myrepo.git git fetch myrepo my_local_tags=( refs/tags/release refs/tags/hotfix ) ( cd my-monorepo import-downstream-repo.py \ refs/remotes/myrepo \ ${my_local_tags[@]} \ --new-repo-prefix=refs/remotes/upstream/monorepo \ --subdir=myrepo \ --tag-prefix="myrepo-" ) # Preserve release braches. for ref in $(git -C my-monorepo for-each-ref --format="%(refname)" \ refs/remotes/myrepo/release); do branch=${ref#refs/remotes/myrepo/} git -C my-monorepo branch --no-track myrepo/${branch} ${ref} done # Preserve master. git -C my-monorepo branch --no-track myrepo/master refs/remotes/myrepo/master # Merge master. git -C my-monorepo checkout local/zip/master # Or local/octopus/master git -C my-monorepo merge myrepo/master
You may want to merge other corresponding branches, for example
myrepo
release branches if they were in lockstep with LLVM project
releases.
--tag-prefix
tells import-downstream-repo.py
to rename
annotated tags with the given prefix. Due to limitations with
fast_filter_branch.py
, unannotated tags cannot be renamed
(fast_filter_branch.py
considers them branches, not tags). Since
the upstream monorepo had its tags rewritten with an "llvmorg-"
prefix, name conflicts should not be an issue. --tag-prefix
can
be used to more clearly indicate which tags correspond to various
imported repositories.
Given this repository history:
R1 - R2 - R3 <- master ^ | release/1
The above recipe results in a history like this:
U1 - U2 - U3 <- upstream/master \ \ \ \ -----\--------------- local/zip--. \ \ \ | - Lllvm1 - Llld1 - UM3 - Lclang1 - Lclang2 - Lllvm2 - Llld2 - Lmyproj1 - M1 <-' / R1 - R2 - R3 <-. ^ | | | myrepo-release/1 | | myrepo/master--'
Commits R1
, R2
and R3
have trees that only contain blobs
from myrepo
. If you require commits from myrepo
to be
interleaved with commits on local project branches (for example,
interleaved with llvm1
, llvm2
, etc. above) and myrepo doesn't
appear in an umbrella repository, a new tool will need to be
developed. Creating such a tool would involve:
- Modifying
fast_filter_branch.py
to optionally take a revlist directly rather than generating it itself - Creating a tool to generate an interleaved ordering of local
commits based on some criteria (
zip-downstream-fork.py
uses the umbrella history as its criterion) - Generating such an ordering and feeding it to
fast_filter_branch.py
as a revlist
Some care will also likely need to be taken to handle merge commits, to ensure the parents of such commits migrate correctly.
Once all of the migrating, zipping and importing is done, it's time to
clean up. The python tools use git-fast-import
which leaves a lot
of cruft around and we want to shrink our new monorepo mirror as much
as possible. Here is one way to do it:
git -C my-monorepo checkout master # Delete branches we no longer need. Do this for any other branches # you merged above. git -C my-monorepo branch -D local/zip/master || true git -C my-monorepo branch -D local/octopus/master || true # Remove remotes. git -C my-monorepo remote remove upstream/monorepo for p in ${my_projects[@]}; do git -C my-monorepo remote remove upstream/split/${p} git -C my-monorepo remote remove local/split/${p} done git -C my-monorepo remote remove localrepo git -C my-monorepo remote remove umbrella git -C my-monorepo remote remove myrepo # Add anything else here you don't need. refs/tags/release is # listed below assuming tags have been rewritten with a local prefix. # If not, remove it from this list. refs_to_clean=( refs/original refs/remotes refs/tags/backups refs/tags/release ) git -C my-monorepo for-each-ref --format="%(refname)" ${refs_to_clean[@]} | xargs -n1 --no-run-if-empty git -C my-monorepo update-ref -d git -C my-monorepo reflog expire --all --expire=now # fast_filter_branch.py might have gc running in the background. while ! git -C my-monorepo \ -c gc.reflogExpire=0 \ -c gc.reflogExpireUnreachable=0 \ -c gc.rerereresolved=0 \ -c gc.rerereunresolved=0 \ -c gc.pruneExpire=now \ gc --prune=now; do continue done # Takes a LOOOONG time! git -C my-monorepo repack -A -d -f --depth=250 --window=250 git -C my-monorepo prune-packed git -C my-monorepo prune
You should now have a trim monorepo. Upload it to your git server and happy hacking!
[LattnerRevNum] | Chris Lattner, http://lists.llvm.org/pipermail/llvm-dev/2011-July/041739.html |
[TrickRevNum] | Andrew Trick, http://lists.llvm.org/pipermail/llvm-dev/2011-July/041721.html |
[JSonnRevNum] | Joerg Sonnenberg, http://lists.llvm.org/pipermail/llvm-dev/2011-July/041688.html |
[MatthewsRevNum] | (1, 2) Chris Matthews, http://lists.llvm.org/pipermail/cfe-dev/2016-July/049886.html |
[submodules] | Git submodules, https://git-scm.com/book/en/v2/Git-Tools-Submodules) |
[statuschecks] | GitHub status-checks, https://help.github.com/articles/about-required-status-checks/ |
[LebarCHERI] | Port CHERI to a single repository rewriting history, http://lists.llvm.org/pipermail/llvm-dev/2016-July/102787.html |
[AminiCHERI] | Port CHERI to a single repository preserving history, http://lists.llvm.org/pipermail/llvm-dev/2016-July/102804.html |