Cross-channel VCS Graph

We opted to create a commit graph that reflects the original history. We initially tried a number of different algorithms to create a linearized history, but the results reflected the original attribution of content poorly.

To do so, we take the original graph, pick only the revisions that are of interest to l10n files, and create content for each of those revisions. It’s then committed with the original commit message and commit metadata, while amending the message with metadata about which original commits went into the generated content.

If you’re only interested in a high-level overview of cross-channel localization, you may want to skip forward to Commits and Metadata.

The history of mozilla-central is rather involved. There are a few aspects to be aware of:

  1. The graph has multiple roots.

  2. Files move in and out of the source paths configured by project configs.

  3. Project configs change to expose existing files with history.

  4. Merge days work differently in mozilla-central and comm-central.

  5. Merge days don’t have correct file change metadata.

The overall process to update the graph for a new revision REV in the target goes as follows.

  1. Find the revisions that are currently covered in the target repository.

  2. Find all revisions that are ancestors of REV, but not ancestors of already converted revisions from the previous step.

  3. For all new revisions, find the ones that touch l10n. Keep track of new files.

  4. For all new files, track their revisions, including moves and deletions.

  5. Skip known-bad revisions.

  6. Create a graph of the relevant revisions.

  7. Fix the graph.

  8. Ensure there’s a single head.

  9. Find the right entry points for the fixed graph onto the target graph.

The first few steps are straightforward, but the actual graph needs work.

Sparse Graph

Let’s start with an example. Common graphs in mozilla-central look like the following, with “green” nodes being the ones affecting localization that we want to transform.

digraph full_tree { graph [ rankdir=LR ]; "red0" -> "green1" ; "red0" -> "red1" ; "red1" -> "merge-red1" ; "green1" -> "merge-red1" ; "merge-red1" -> "green2" ; "merge-red1" -> "red2" ; "red2" -> "merge-red2" ; "green2" -> "merge-red2" ; "merge-red2" -> "green3" ; "merge-red2" -> "red3" ; "green3" -> "merge-red3" ; "red3" -> "merge-red3" ; "merge-red3" -> "green4" ; "merge-red3" -> "red4" ; "red4" -> "merge-red4" ; "green4" -> "merge-red4" ; "merge-red4" -> "green5" ; "merge-red4" -> "red5" ; "red5" -> "green4.5" ; "green4.5" -> "merge-red5" ; "green5" -> "merge-red5" ; "green1" [ color=green ] ; "green2" [ color=green ] ; "green3" [ color=green ] ; "green4" [ color=green ] ; "green4.5" [ color=green ] ; "green5" [ color=green ] ; }

What we’d expect to get would be a graph that just goes through our nodes, in this case effectively a linear graph.

digraph target { graph [ rankdir=LR ]; "green1" -> "green2" ; "green2" -> "green3" ; "green3" -> "green4" ; "green4" -> "green5" ; "green4" -> "green4.5" ; }

All Arcs

Natively, Mercurial creates a graph that shows all paths from each node to another that can be taken in the full graph, creating, in this case, a graph that connects all nodes.

digraph mesh { graph [ rankdir=LR ]; "green1" -> "green2" ; "green1" -> "green3" ; "green2" -> "green3" ; "green3" -> "green4" ; "green1" -> "green4" ; "green2" -> "green4" ; "green3" -> "green5" ; "green1" -> "green5" ; "green4" -> "green5" ; "green2" -> "green5" ; "green3" -> "green4.5" ; "green1" -> "green4.5" ; "green4" -> "green4.5" ; "green2" -> "green4.5" ; }

Removing Shortcuts

The code generating our target repository needs to strip that graph down. This is the step 7 in our list above. To do this, we find and remove shortcuts in the graph. I.e., the arc from green1 to green3 is a shortcut for the connection green1green2green3. The algorithm removes the shortcut arc from green1 to green3. Applying this algorithm over the full graph yields a simplifed graph as follows.

digraph mesh { graph [ rankdir=LR ]; "green1" -> "green2" ; "green2" -> "green3" ; "green3" -> "green4" ; "green1" -> "green5" ; "green4" -> "green5" ; "green1" -> "green4.5" ; "green4" -> "green4.5" ; }

This is almost the graph we’re looking for. There are still two more arcs which shortcut, each from the root to each head of our graph here. They’re pretty hard to find by extending the algorithm we used in the first simplification. You need to check two, three, four, and five intermediate nodes to discover them already in this example, and in practice, these shortcuts can span much wider ranges. So to find these, we use a second algorithm, starting with merge nodes, i.e., nodes with more than one parent. For each merge node, we try to find a non-trivial path to the merge node from any of its parents. If we find one, we drop the arc from that parent to the merge node. This code is efficient at this point as we only have a much smaller list of merge nodes, compared to the initial graph. And the code can rely on the fact that the numeric IDs of Mercurial changesets of parents are always smaller than the child. That allows to abort the search when the algorithm walks “past” the given merge node.

This results in the expected sparse graph that we started out with.

Single Head

To have a single repository state that we can translate, we ensure that we have a single head in the target repository. As you can see in our example, that doesn’t necessarily need to be in our sparse set of nodes. If that’s the case, we pick the oldest merge commit that is a descendant of all our heads. Oldest in this case means the one with the lowest numeric ID. There’s not necessarily a unique choice for this, and given that the numeric IDs depend on how you added remote changesets to your local clone, this choice might be only unique to your local clone.

Entry Points

Generally, the sparse graph here will be an incremental update to an existing sparse graph. The increment can also have roots that are not children of the known converted revisions in the target repository.

This used to be very frequent back when we had multiple integration repositories like mozilla-inbound that merged into mozilla-central. The string changes that landed there were based on a much older version of mozilla-central than the last conversion.

To find the right parent in the target repository, there’s an algorithm that finds a node in the target that corresponds to a parent in the source. That’s run for each root of the incremental sparse graph.

Part of this algorithm is also to skip over comm-* changesets in the target when searching for entry points for mozilla-* and vice versa. That way, we linearize the history across comm and mozilla.

Maximum Parent Count

In Mercurial, a changeset can have one or two parents, but no more. There’s no octopus merges like in git. The sparse graphs can contain merge nodes with more than two parents, though, so when we encounter those merge nodes, we ingest additional merge commits. The algorithm for finding good candidates for this corresponds to the algorithm we used when ensuring a single head on the update.

With a known state of the changeset to convert, we’ll describe the context with which the content is generated in the next section.