    Commit

  • Hash : 4dfcc50f
    Author : Patrick Steinhardt
    Date : 2020-04-01T15:16:18

    merge: cache negative results for similarity metrics
    
    When computing renames, we cache the hash signatures for each of the
    potentially conflicting entries so that we do not need to repeatedly
    read the file and can determine with reasonable efficiency whether
    two files are similar enough to be deemed a rename. For the hash
    signatures to be meaningful, we require at least four lines of data
    to be present, resulting in at least four different hashes that can
    be compared. Files that are deemed too small are not cached at all
    and will thus be repeatedly re-hashed, which is usually not a huge
    issue.
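
    As a rough sketch of that signature step, the following goes through
    libgit2's public hashsig API (git2/sys/hashsig.h); the cache plumbing
    and error handling are simplified, and the helper name
    similarity_score is made up for this illustration:

    #include <stddef.h>
    #include <git2/sys/hashsig.h>

    /* Hypothetical helper: compute a similarity score in [0, 100] for
     * two buffers. git_hashsig_create() fails with a negative error
     * code when its input holds too few lines to produce a meaningful
     * signature. */
    static int similarity_score(int *out, const char *a, size_t alen,
                                const char *b, size_t blen)
    {
        git_hashsig *sig_a = NULL, *sig_b = NULL;
        int error;

        if ((error = git_hashsig_create(&sig_a, a, alen, GIT_HASHSIG_NORMAL)) < 0 ||
            (error = git_hashsig_create(&sig_b, b, blen, GIT_HASHSIG_NORMAL)) < 0)
            goto done;

        if ((error = git_hashsig_compare(sig_a, sig_b)) >= 0) {
            *out = error; /* similarity score in [0, 100] */
            error = 0;
        }

    done:
        git_hashsig_free(sig_a);
        git_hashsig_free(sig_b);
        return error;
    }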
    
    The issue with the above heuristic arises when a file does _not_
    have at least four lines, where a line is anything separated by a
    consecutive run of "\n" or "\0" characters. For example, "a\nb" is
    two lines, but "a\0\0b" is also just two lines, as the consecutive
    NUL bytes collapse into a single separator. Taken to the extreme, a
    file consisting of megabytes of nothing but newline or NUL bytes
    may also be deemed too small and thus never get cached. As a
    result, we repeatedly load its blob and calculate its hash
    signature, just to throw it away once we notice it is of no value.
    When you have a comparatively big file that you compare against a
    big set of potentially renamed files, the cost simply explodes.
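
    To make the line-counting rule concrete, here is a self-contained
    sketch (not the actual implementation) that counts lines as
    non-empty runs of bytes delimited by one or more "\n" or "\0"
    separators:

    #include <stdio.h>
    #include <stddef.h>

    static size_t count_lines(const char *buf, size_t len)
    {
        size_t lines = 0, i = 0;

        while (i < len) {
            /* a run of consecutive separators is a single boundary */
            while (i < len && (buf[i] == '\n' || buf[i] == '\0'))
                i++;
            if (i == len)
                break;
            lines++; /* start of a new line */
            while (i < len && buf[i] != '\n' && buf[i] != '\0')
                i++;
        }

        return lines;
    }

    int main(void)
    {
        static char big[1 << 20]; /* 1 MiB of '\0' bytes */

        printf("%zu\n", count_lines("a\nb", 3));        /* prints 2 */
        printf("%zu\n", count_lines("a\0\0b", 4));      /* prints 2 */
        printf("%zu\n", count_lines(big, sizeof(big))); /* prints 0 */

        return 0;
    }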
    
    The issue can be trivially fixed by introducing negative cache
    entries. Whenever we determine that a given blob does not have a
    meaningful representation via a hash signature, we store this
    negative cache marker and from then on neither hash the blob again
    nor consider it as a potential rename target. This already helps
    the "normal" case, where you have a lot of small files as rename
    candidates, but in the above scenario the savings are
    extraordinarily high.
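
    The idea can be sketched as follows, with all names hypothetical
    (the real change lives in the merge rename code): a sentinel value
    is stored for blobs whose signature computation failed, so the
    lookup distinguishes "known useless" from "not yet computed":

    #include <stdio.h>
    #include <string.h>

    /* Sentinel marking "this blob is known to have no useful signature". */
    static const char NEGATIVE_MARKER;
    #define SIG_TOO_SMALL ((const void *)&NEGATIVE_MARKER)

    struct cache_entry { const char *key; const void *sig; };
    static struct cache_entry cache[64];
    static int cache_len;

    static const void *cache_get(const char *key)
    {
        for (int i = 0; i < cache_len; i++)
            if (strcmp(cache[i].key, key) == 0)
                return cache[i].sig;
        return NULL;
    }

    static void cache_put(const char *key, const void *sig)
    {
        if (cache_len < 64) {
            cache[cache_len].key = key;
            cache[cache_len].sig = sig;
            cache_len++;
        }
    }

    /* Stand-in for the expensive load-and-hash step; returns NULL when
     * the blob has fewer than four lines and thus no meaningful
     * signature. */
    static const void *compute_signature(const char *key)
    {
        printf("hashing %s (expensive)\n", key);
        return NULL; /* pretend every blob in this demo is too small */
    }

    static const void *signature_for(const char *key)
    {
        const void *sig = cache_get(key);

        if (sig == SIG_TOO_SMALL)
            return NULL; /* known useless: skip without loading the blob */
        if (sig != NULL)
            return sig;  /* positive cache hit */

        sig = compute_signature(key);
        cache_put(key, sig ? sig : SIG_TOO_SMALL); /* cache the failure, too */
        return sig;
    }

    int main(void)
    {
        signature_for("big-nul-blob"); /* hashed once... */
        signature_for("big-nul-blob"); /* ...then served from the negative cache */
        return 0;
    }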
    
    To verify that we no longer hit the issue with the described
    solution, this commit adds a test that uses the exact same setup
    described above, with one 50 megabyte blob of '\0' characters and
    1000 other files that get renamed. Without the negative cache:
    
    $ time ./libgit2_clar -smerge::trees::renames::cache_recomputation >/dev/null
    real    11m48.377s
    user    11m11.576s
    sys     0m35.187s
    
    And with the negative cache:
    
    $ time ./libgit2_clar -smerge::trees::renames::cache_recomputation >/dev/null
    real    0m1.972s
    user    0m1.851s
    sys     0m0.118s
    
    So this represents a roughly 350-fold performance improvement, but
    it obviously depends on how many files you have and how big the
    blob is. The test numbers were chosen such that one will
    immediately notice should the bug resurface.