tests/merge/trees/renames.c


Log

Author Commit Date CI Message
Edward Thomson f0e693b1 2021-09-07T17:53:49 str: introduce `git_str` for internal, `git_buf` is external libgit2 has two distinct requirements that were previously solved by `git_buf`. We require: 1. A general purpose string class that provides a number of utility APIs for manipulating data (eg, concatenating, truncating, etc). 2. A structure that we can use to return strings to callers that they can take ownership of. By using a single class (`git_buf`) for both of these purposes, we have confused the API to the point that refactorings are difficult and reasoning about correctness is also difficult. Move the utility class `git_buf` to be called `git_str`: this represents its general purpose, as an internal string buffer class. The name also is an homage to Junio Hamano ("gitstr"). The public API remains `git_buf`, and has a much smaller footprint. It is generally only used as an "out" param with strict requirements that follow the documentation. (Exceptions exist for some legacy APIs to avoid breaking callers unnecessarily.) Utility functions exist to convert a user-specified `git_buf` to a `git_str` so that we can call internal functions, then converting it back again.
Patrick Steinhardt 0cf9b666 2020-05-12T11:41:44 tests: merge: fix printf formatter on 32 bit arches We currently use `PRIuMAX` to print an integer of type `size_t` in merge::trees::rename::cache_recomputation. While this works just fine on 64 bit arches, it doesn't on 32 bit ones. As a result, our nightly builds on x86 and arm32 fail. Fix the issue by using `PRIuZ` instead.
Patrick Steinhardt 4dfcc50f 2020-04-01T15:16:18 merge: cache negative cache results for similarity metrics When computing renames, we cache the hash signatures for each of the potentially conflicting entries so that we do not need to repeatedly read the file and can at least halfway efficiently determine whether two files are similar enough to be deemed a rename. In order to make the hash signatures meaningful, we require at least four lines of data to be present, resulting in at least four different hashes that can be compared. Files that are deemed too small are not cached at all and will thus be repeatedly re-hashed, which is usually not a huge issue. The issue with above heuristic is in case a file does _not_ have at least four lines, where a line is anything separated by a consecutive run of "\n" or "\0" characters. For example "a\nb" is two lines, but "a\0\0b" is also just two lines. Taken to the extreme, a file that has megabytes of consecutive space- or NUL-only may also be deemed as too small and thus not get cached. As a result, we will repeatedly load its blob, calculate its hash signature just to finally throw it away as we notice it's not of any value. When you've got a comparitively big file that you compare against a big set of potentially renamed files, then the cost simply expodes. The issue can be trivially fixed by introducing negative cache entries. Whenever we determine that a given blob does not have a meaningful representation via a hash signature, we store this negative cache marker and will from then on not hash it again, but also ignore it as a potential rename target. This should help the "normal" case already where you have a lot of small files as rename candidates, but in the above scenario it's savings are extraordinarily high. To verify we do not hit the issue anymore with described solution, this commit adds a test that uses the exact same setup described above with one 50 megabyte blob of '\0' characters and 1000 other files that get renamed. Without the negative cache: $ time ./libgit2_clar -smerge::trees::renames::cache_recomputation >/dev/null real 11m48.377s user 11m11.576s sys 0m35.187s And with the negative cache: $ time ./libgit2_clar -smerge::trees::renames::cache_recomputation >/dev/null real 0m1.972s user 0m1.851s sys 0m0.118s So this represents a ~350-fold performance improvement, but it obviously depends on how many files you have and how big the blob is. The test number were chosen in a way that one will immediately notice as soon as the bug resurfaces.
Patrick Steinhardt e54343a4 2019-06-29T09:17:32 fileops: rename to "futils.h" to match function signatures Our file utils functions all have a "futils" prefix, e.g. `git_futils_touch`. One would thus naturally guess that their definitions and implementation would live in files "futils.h" and "futils.c", respectively, but in fact they live in "fileops.h". Rename the files to match expectations.
Patrick Steinhardt 9994cd3f 2018-06-25T11:56:52 treewide: remove use of C++ style comments C++ style comment ("//") are not specified by the ISO C90 standard and thus do not conform to it. While libgit2 aims to conform to C90, we did not enforce it until now, which is why quite a lot of these non-conforming comments have snuck into our codebase. Do a tree-wide conversion of all C++ style comments to the supported C style comments to allow us enforcing strict C90 compliance in a later commit.
Edward Thomson 49806e9b 2017-02-09T16:52:03 merge_trees: introduce test for submodule renames Test that shows that submodules are incorrectly considered in renames, and `git_merge_trees` will fail to lookup the submodule as a blob.
Edward Thomson 19ed4d0c 2017-01-01T22:19:23 merge: set default rename threshold When `GIT_MERGE_FIND_RENAMES` is set, provide a default for `rename_threshold` when it is unset.
Edward Thomson 5aa2ac6d 2014-03-11T22:47:39 Update git_merge_tree_opts to git_merge_options
Ben Straub 17820381 2013-11-14T14:05:52 Rename tests-clar to tests