PATH WALK IV: Add 'git survey' command #1821

Open
wants to merge 16 commits into
base: api-upstream
from

Commits on Nov 8, 2024

  1. path-walk: introduce an object walk by path

    In anticipation of a few planned applications, introduce the most basic form
    of a path-walk API. It currently assumes that there are no UNINTERESTING
    objects, and does not include any complicated filters. It calls a function
    pointer on groups of tree and blob objects as grouped by path. This only
    includes objects the first time they are discovered, so an object that
    appears at multiple paths will not be included in two batches.
    
    These batches are collected in 'struct type_and_oid_list' objects, which
    store an object type and an oid_array of objects.
    
    The data structures are documented in 'struct path_walk_context', but in
    summary the most important are:
    
      * 'paths_to_lists' is a strmap that connects a path to a
        type_and_oid_list for that path. To avoid conflicts in path names,
        we make sure that tree paths end in "/" (except the root path, which
        is an empty string) and blob paths do not end in "/".
    
      * 'path_stack' is a string list that is added to in an append-only
        way. This stores the stack of our depth-first search on the heap
        instead of using recursion.
    
      * 'path_stack_pushed' is a strmap that stores path names that were
        already added to 'path_stack', to avoid repeating paths in the
        stack. Mostly, this saves us from quadratic lookups from doing
        unsorted checks into the string_list.
    
    The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
    push_to_stack() method. Call this instead of inserting into these
    structures directly.
    
    The walk_objects_by_path() method initializes these structures and
    starts walking commits from the given rev_info struct. The commits are
    used to find the list of root trees which populate the start of our
    depth-first search.
    
    The core of our depth-first search is in a while loop that continues
    while we have not indicated an early exit and our 'path_stack' still has
    entries in it. The loop body pops a path off of the stack and "visits"
    the path via the walk_path() method.
    
    The walk_path() method gets the list of OIDs from the 'paths_to_lists'
    strmap and executes the callback method on that list with the given path
    and type. If the OIDs correspond to tree objects, then iterate over all
    trees in the list and run add_children() to add the child objects to
    their own lists, adding new entries to the stack if necessary.
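
As a rough illustration of how the structures above interact, here is a minimal Python sketch. It is a toy model under stated assumptions, not the C implementation in this commit: the in-memory `trees` dictionary, the `PathWalk` class, the `add_to_list()` helper, and the `seen` set are hypothetical stand-ins; only the names 'paths_to_lists', 'path_stack', 'path_stack_pushed', push_to_stack(), walk_path(), and add_children() come from the message.

```python
class PathWalk:
    """Toy model of the path-walk bookkeeping; not Git's C implementation."""

    def __init__(self, trees, callback):
        # Hypothetical object store: tree oid -> list of (name, type, oid).
        self.trees = trees
        self.callback = callback
        self.paths_to_lists = {}         # path -> (type, [oids])
        self.path_stack = []             # DFS stack kept on the heap
        self.path_stack_pushed = set()   # paths already pushed to the stack
        self.seen = set()                # objects are reported only once

    def push_to_stack(self, path):
        # The only place that touches both path_stack and path_stack_pushed.
        if path not in self.path_stack_pushed:
            self.path_stack_pushed.add(path)
            self.path_stack.append(path)

    def add_to_list(self, path, obj_type, oid):
        if oid in self.seen:
            return
        self.seen.add(oid)
        self.paths_to_lists.setdefault(path, (obj_type, []))[1].append(oid)
        self.push_to_stack(path)

    def add_children(self, base_path, tree_oid):
        for name, obj_type, oid in self.trees[tree_oid]:
            # Tree paths end in "/", blob paths do not.
            suffix = "/" if obj_type == "tree" else ""
            self.add_to_list(base_path + name + suffix, obj_type, oid)

    def walk_path(self, path):
        obj_type, oids = self.paths_to_lists.pop(path)
        self.callback(path, obj_type, list(oids))
        if obj_type == "tree":
            for oid in oids:
                self.add_children(path, oid)

    def walk(self, root_tree_oids):
        for oid in root_tree_oids:
            self.add_to_list("", "tree", oid)  # the root path is ""
        while self.path_stack:
            self.walk_path(self.path_stack.pop())
```

In this model, every tree at a given path is visited in one batch, so all children of that path are queued before any child path is popped.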
    
    In testing, this depth-first search approach was the one that used the
    least memory while iterating over the object lists. There is still a
    chance that repositories with too-wide path patterns could cause memory
    pressure issues. Limiting the stack size could be done in the future by
    limiting how many objects are being considered in-progress, or by
    visiting blob paths earlier than trees.
    
    There are many future adaptations that could be made, but they are left for
    future updates when consumers are ready to take advantage of those features.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    b7e9b81
  2. test-lib-functions: add test_cmp_sorted

    This test helper reduces repeated logic in t6601-path-walk.sh and may be
    useful elsewhere, too.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    cf2ed61
  3. t6601: add helper for testing path-walk API

    Add some tests based on the current behavior, doing interesting checks
    for different sets of branches, ranges, and the --boundary option. This
    sets a baseline for the behavior and we can extend it as new options are
    introduced.
    
    Store and output a 'batch_nr' value so we can demonstrate that the paths are
    grouped together in a batch and not following some other ordering. This
    allows us to test the depth-first behavior of the path-walk API. However, we
    purposefully do not test the order of the objects in the batch, so the
    output is compared to the expected output through a sort.
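
To illustrate why a batch number plus a sorted comparison is enough to check grouping without pinning down in-batch order, here is a small Python sketch. The line format `<batch_nr>:<type>:<path>:<oid>` and both function names are assumptions made for this example, not necessarily what the test helper prints.

```python
def format_batches(batches):
    """Render one line per object, tagged with the number of its batch."""
    lines = []
    for batch_nr, (path, obj_type, oids) in enumerate(batches):
        for oid in oids:
            # Hypothetical line format, chosen for this sketch only.
            lines.append(f"{batch_nr}:{obj_type}:{path}:{oid}")
    return lines


def cmp_sorted(expected_lines, actual_lines):
    """Order-insensitive comparison: sort both sides, then compare."""
    return sorted(expected_lines) == sorted(actual_lines)
```

Reordering objects inside a batch leaves the sorted output unchanged, while moving an object to a different batch changes its line and makes the comparison fail.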
    
    It is important to mention that the behavior of the API will change soon as
    we start to handle UNINTERESTING objects differently, but these tests will
    demonstrate the change in behavior.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    a3c754d
  4. path-walk: allow consumer to specify object types

    We add the ability to filter the object types in the path-walk API so
    the callback function is called fewer times.
    
    This adds the ability to ask for the commits in a list, as well. We
    re-use the empty string for this set of objects because these are passed
    directly to the callback function instead of being part of the
    'path_stack'.
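
The effect of the type filter on the callback can be sketched as follows. This is a simplified Python model that filters batches which were already produced; the function name and the `commits`, `trees`, and `blobs` keyword arguments are assumptions for illustration. A real walk would presumably still need to read trees to discover blobs even when trees are not reported.

```python
def filtered_walk(batches, callback, commits=True, trees=True, blobs=True):
    """Invoke callback only for batches whose type was requested.

    Returns the number of callback invocations.
    """
    wanted = {"commit": commits, "tree": trees, "blob": blobs}
    calls = 0
    for path, obj_type, oids in batches:
        if wanted.get(obj_type, False):
            callback(path, obj_type, oids)
            calls += 1
    return calls
```

In this sketch the commit batch uses the empty string as its path, mirroring the message above.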
    
    Future changes will add the ability to visit annotated tags.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    83b746f
  5. path-walk: visit tags and cached objects

    The rev_info that is specified for a path-walk traversal may specify
    visiting tag refs (both lightweight and annotated) and also may specify
    indexed objects (blobs and trees). Update the path-walk API to walk
    these objects as well.
    
    When walking tags, we need to peel the annotated objects until reaching
    a non-tag object. If we reach a commit, then we can add it to the
    pending objects to make sure we visit in the commit walk portion. If we
    reach a tree, then we will assume that it is a root tree. If we reach a
    blob, then we have no good path name and so add it to a new list of
    "tagged blobs".
    
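The peeling rule described above can be sketched in Python as follows. The `objects` dictionary (oid mapped to a type and, for tags, the target oid) and both function names are hypothetical; the sketch only models the three outcomes named in the message.

```python
def peel_tag(objects, oid):
    """Follow annotated tags until a non-tag object is reached."""
    obj_type, target = objects[oid]
    while obj_type == "tag":
        oid = target
        obj_type, target = objects[oid]
    return obj_type, oid


def classify_tag_targets(objects, tag_oids):
    """Sort peeled tag targets into the three buckets from the message."""
    pending_commits, root_trees, tagged_blobs = [], [], []
    for tag_oid in tag_oids:
        obj_type, oid = peel_tag(objects, tag_oid)
        if obj_type == "commit":
            pending_commits.append(oid)   # visited later by the commit walk
        elif obj_type == "tree":
            root_trees.append(oid)        # assumed to be a root tree
        elif obj_type == "blob":
            tagged_blobs.append(oid)      # no good path name is available
    return pending_commits, root_trees, tagged_blobs
```
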
    When the rev_info includes the "--indexed-objects" flag, then the
    pending set includes blobs and trees found in the cache entries and
    cache-tree. The cache entries are usually blobs, though they could be
    trees in the case of a sparse index. The cache-tree stores
    previously-hashed tree objects but these are cleared out when staging
    objects below those paths. We add tests that demonstrate this.
    
    The indexed objects come with a non-NULL 'path' value in the pending
    item. This allows us to prepopulate the 'paths_to_lists' strmap with
    lists for these paths.
    
    The tricky thing about this walk is that we will want to combine the
    indexed objects walk with the commit walk, especially in the future case
    of walking objects during a command like 'git repack'.
    
    Whenever possible, we want the objects from the index to be grouped with
    similar objects in history. We don't want to miss any paths that appear
    only in the index and not in the commit history.
    
    Thus, we need to be careful to let the path stack be populated initially
    with only the root tree path (and possibly tags and tagged blobs) and go
    through the normal depth-first search. Afterwards, if there are other
    paths that are remaining in the paths_to_lists strmap, we should then
    iterate through the stack and visit those objects recursively.
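
A minimal Python sketch of this two-phase order, under simplifying assumptions: `paths_to_lists` maps a path directly to a list of OIDs, and `children_of` is a hypothetical callback that yields (child path, child OID) pairs for a visited path. Neither matches the real data structures exactly.

```python
def walk_root_then_leftovers(paths_to_lists, children_of, visit):
    """Walk from the root path first, then visit paths that remain.

    paths_to_lists may be prepopulated with paths found only in the index.
    """
    def drain(stack):
        while stack:
            path = stack.pop()
            oids = paths_to_lists.pop(path, None)
            if oids is None:
                continue  # this path was already visited
            visit(path, oids)
            for child_path, child_oid in children_of(path, oids):
                paths_to_lists.setdefault(child_path, []).append(child_oid)
                stack.append(child_path)

    drain([""])  # normal depth-first search from the root tree path
    # Whatever is left was never reached from the root: walk it as well.
    drain(sorted(paths_to_lists, reverse=True))
```

An indexed object whose path is also reached from history ends up in the same batch as the historical objects at that path, and a path that exists only in the index is still visited in the second phase.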
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    97765aa
  6. path-walk: mark trees and blobs as UNINTERESTING

    When the input rev_info has UNINTERESTING starting points, we want to be
    sure that the UNINTERESTING flag is passed appropriately through the
    objects. To match how this is done in places such as 'git pack-objects', we
    use the mark_edges_uninteresting() method.
    
    This method has an option for using the "sparse" walk, which is similar in
    spirit to the path-walk API's walk. To be sure to keep it independent, add a
    new 'prune_all_uninteresting' option to the path_walk_info struct.
    
    To check how the UNINTERESTING flag is spread through our objects, extend the
    'test-tool path-walk' command to output whether or not an object has that
    flag. This changes our tests significantly, including the removal of some
    objects that were previously visited due to the incomplete implementation.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    a4aaa3b
  7. survey: stub in new experimental 'git-survey' command

    Start work on a new 'git survey' command to scan the repository
    for monorepo performance and scaling problems.  The goal is to
    measure the various known "dimensions of scale" and serve as a
    foundation for adding additional measurements as we learn more
    about Git monorepo scaling problems.
    
    The initial goal is to complement the scanning and analysis performed
    by the Go-based 'git-sizer' (https://github.com/github/git-sizer) tool.
    It is hoped that by creating a builtin command, we may be able to take
    advantage of internal Git data structures and code that is not
    accessible from Go to gain further insight into potential scaling
    problems.
    
    Co-authored-by: Derrick Stolee <[email protected]>
    Signed-off-by: Jeff Hostetler <[email protected]>
    Signed-off-by: Derrick Stolee <[email protected]>
    jeffhostetler and derrickstolee committed Nov 8, 2024
    512f033
  8. survey: add command line opts to select references

    By default we will scan all references in "refs/heads/", "refs/tags/"
    and "refs/remotes/".
    
    Add command line options that let the user ask for all refs or a subset
    of them, and to include a detached HEAD.
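
The default selection can be pictured as a prefix filter. The Python sketch below is illustrative only: the function name and its parameters are hypothetical and do not correspond to the actual option names added by this commit.

```python
DEFAULT_PREFIXES = ("refs/heads/", "refs/tags/", "refs/remotes/")


def select_refs(all_refs, prefixes=DEFAULT_PREFIXES, include_detached_head=False):
    """Pick the refs to scan: those under the given prefixes, plus HEAD on request."""
    selected = [ref for ref in all_refs if ref.startswith(tuple(prefixes))]
    if include_detached_head and "HEAD" in all_refs:
        selected.append("HEAD")
    return selected
```

Passing `prefixes=("refs/",)` models the "all refs" case.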
    
    Signed-off-by: Jeff Hostetler <[email protected]>
    Signed-off-by: Derrick Stolee <[email protected]>
    jeffhostetler authored and derrickstolee committed Nov 8, 2024
    842d126
  9. survey: start pretty printing data in table form

    When 'git survey' provides information to the user, this will be presented
    in one of two formats: plaintext and JSON. The JSON implementation will be
    delayed until the functionality is complete for the plaintext format.
    
    The most important parts of the plaintext format are headers specifying the
    different sections of the report and tables providing concrete data.
    
    Create a custom table data structure that allows specifying a list of
    strings for the row values. When printing the table, check each column for
    the maximum width so we can create a table of the correct size from the
    start.
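
The width computation can be sketched in a few lines of Python. This is an assumption-laden illustration (the `format_table` name, right-aligned cells, and the ` | ` and `-+-` separators are choices made for this example), not the C table structure added by the commit.

```python
def format_table(header, rows):
    """Return table lines with each column padded to its widest cell."""
    table = [header] + rows
    # One pass over every row to find the maximum width of each column.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]

    def fmt(row):
        return " | ".join(cell.rjust(width) for cell, width in zip(row, widths))

    lines = [fmt(header), "-+-".join("-" * width for width in widths)]
    lines.extend(fmt(row) for row in rows)
    return lines
```

Because the widths are known before the first line is emitted, the header and separator already have their final size.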
    
    The table structure is designed to be flexible to the different kinds of
    output that will be implemented in future changes.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    d2b33be
  10. survey: add object count summary

    At the moment, the reason for using the path-walk API is not obvious, but
    it will become more apparent in future iterations. For now, use the
    path-walk API to sum up the counts of each kind of object.
    
    For example, this is the reachable object summary output for my local repo:
    
    REACHABLE OBJECT SUMMARY
    ========================
    Object Type |  Count
    ------------+-------
           Tags |   1343
        Commits | 179344
          Trees | 314350
          Blobs | 184030
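
A summary like this can be accumulated from per-path batches. The Python sketch below assumes each batch is a (path, type, list of OIDs) triple, which is a simplification chosen for illustration.

```python
from collections import Counter


def count_objects(batches):
    """Sum the number of objects of each type across all batches."""
    counts = Counter()
    for _path, obj_type, oids in batches:
        counts[obj_type] += len(oids)
    return counts
```
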
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    9b211cf
  11. survey: summarize total sizes by object type

    Now that we have explored objects by count, we can expand that a bit more to
    summarize the on-disk and inflated sizes of those objects. This information
    is helpful for diagnosing both why disk space (and perhaps clone or fetch
    times) is growing and why certain operations are slow: the inflated size of
    the objects that must be processed is so large.
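
To make the two size notions concrete, here is a hedged Python sketch that uses plain zlib compression as a stand-in for on-disk size. This is an assumption for illustration only: the real on-disk size in a packfile can also reflect delta compression, which this sketch ignores.

```python
import zlib


def size_summary(objects):
    """objects: iterable of (type, raw content bytes).

    Returns type -> totals, with zlib output length standing in for disk size.
    """
    totals = {}
    for obj_type, content in objects:
        entry = totals.setdefault(
            obj_type, {"count": 0, "disk_size": 0, "inflated_size": 0}
        )
        entry["count"] += 1
        entry["disk_size"] += len(zlib.compress(content))
        entry["inflated_size"] += len(content)
    return totals
```

Highly repetitive content shows a large gap between the two totals, which is the gap the summary is meant to expose.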
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    0c15c1e
  12. survey: show progress during object walk

    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    7e80a8b
  13. survey: add ability to track prioritized lists

    In future changes, we will make use of these methods. The intention is to
    keep track of the top contributors according to some metric. We don't want
    to store all of the entries and do a sort at the end, so track a
    constant-size table and remove rows that get pushed out depending on the
    chosen sorting algorithm.
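
The idea of a constant-size table can be sketched in Python as follows. The `TopTable` class and its `offer()` method are hypothetical names, and "largest value first" is just one possible ordering.

```python
class TopTable:
    """Keep at most `capacity` rows, ordered from largest to smallest value."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []  # list of (name, value), always sorted descending

    def offer(self, name, value):
        """Insert the row if it ranks within the table; return whether it did."""
        pos = 0
        while pos < len(self.rows) and self.rows[pos][1] >= value:
            pos += 1
        if pos >= self.capacity:
            return False  # does not rank high enough to be kept
        self.rows.insert(pos, (name, value))
        del self.rows[self.capacity:]  # drop the row that got pushed out
        return True
```

Memory use stays bounded by the table capacity no matter how many entries are offered.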
    
    Co-authored-by: Jeff Hostetler <[email protected]>
    Signed-off-by: Jeff Hostetler <[email protected]>
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee and Jeff Hostetler committed Nov 8, 2024
    cce275f
  14. survey: add report of "largest" paths

    Since we are already walking our reachable objects using the path-walk API,
    let's now collect lists of the paths that contribute most to different
    metrics. Specifically, we care about
    
     * Number of versions.
     * Total size on disk.
     * Total inflated size (no delta or zlib compression).
    
    This information can be critical to discovering which parts of the
    repository are causing the most growth, especially on-disk size. Different
    packing strategies might help compress data more efficiently, but the total
    inflated size is a representation of the raw size of all snapshots of those
    paths. Even when stored efficiently on disk, that size represents how much
    information must be processed to complete a command such as 'git blame'.
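
The three metrics can be illustrated with a short Python sketch. It assumes each batch is a path plus a list of (on-disk size, inflated size) pairs, one per version; for brevity it sorts all paths at the end instead of maintaining a bounded table.

```python
def summarize_paths(batches, top=3):
    """Return the top paths by number of versions, disk size, and inflated size."""
    per_path = {}
    for path, sizes in batches:
        stats = per_path.setdefault(path, {"versions": 0, "disk": 0, "inflated": 0})
        for disk_size, inflated_size in sizes:
            stats["versions"] += 1
            stats["disk"] += disk_size
            stats["inflated"] += inflated_size

    def top_by(metric):
        # Largest value first; ties are broken by path name for stable output.
        ranked = sorted(per_path, key=lambda p: (-per_path[p][metric], p))
        return ranked[:top]

    return {metric: top_by(metric) for metric in ("versions", "disk", "inflated")}
```

A path can lead one table and not another: a single large binary may dominate on-disk size while a frequently edited source file dominates the version count.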
    
    Since the on-disk size is likely to be fragile, stop testing the exact
    output of 'git survey' and check that the correct set of headers is
    output.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    b70eee6
  15. survey: add --top=<N> option and config

    The 'git survey' builtin provides several detail tables, such as "top
    files by on-disk size". The size of these tables defaults to 100,
    currently.
    
    Allow the user to specify this number via a new --top=<N> option or the
    new survey.top config key.
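
One plausible way to resolve the value is sketched below in Python. The precedence shown (command line over config over the built-in default of 100) is an assumption based on how Git options commonly behave; the `resolve_top` function and the dictionary standing in for Git config are hypothetical.

```python
import argparse

DEFAULT_TOP = 100


def resolve_top(argv, config):
    """Pick the table size: --top=<N>, else survey.top, else the default."""
    parser = argparse.ArgumentParser(prog="survey-sketch")
    parser.add_argument("--top", type=int, default=None)
    args = parser.parse_args(argv)
    if args.top is not None:
        return args.top
    if "survey.top" in config:
        return int(config["survey.top"])
    return DEFAULT_TOP
```
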
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 8, 2024
    dd54ff7
  16. survey: clearly note the experimental nature in the output

    While this command is definitely something we _want_, chances are that
    upstreaming this will require substantial changes.
    
    We still want to be able to experiment with this before that, to focus
    on what we need out of this command: to assist with diagnosing issues
    with large repositories, as well as to help monitor the growth and
    the associated pain points of such repositories.
    
    To that end, we are about to integrate this command into
    `microsoft/git`, to get the tool into the hands of users who need it
    most, with the idea to iterate in close collaboration between these
    users and the developers familiar with Git's internals.
    
    However, we will definitely want to avoid letting anybody have the
    impression that this command, its exact inner workings, as well as its
    output format, are anywhere close to stable. To make that fact utterly
    clear (and thereby protect the freedom to iterate and innovate freely
    before upstreaming the command), let's mark its output as experimental
    in all-caps, as the first thing we do.
    
    Signed-off-by: Johannes Schindelin <[email protected]>
    dscho authored and derrickstolee committed Nov 8, 2024
    38e1168