src/diff_tform.c


Log

Author Commit Date CI Message
Edward Thomson 0f35efeb 2020-05-23T10:15:51 git_pool_init: handle failure cases Propagate failures caused by pool initialization errors.
Edward Thomson b59c71d8 2020-01-18T14:11:01 iterator: update enum type name for consistency libgit2 does not use `type_t` suffixes as it's redundant; thus, rename `git_iterator_type_t` to `git_iterator_t` for consistency.
Edward Thomson 4334b177 2019-06-23T15:43:38 blob: use `git_object_size_t` for object size Instead of using a signed type (`off_t`) use a new `git_object_size_t` for the sizes of objects.
Patrick Steinhardt e54343a4 2019-06-29T09:17:32 fileops: rename to "futils.h" to match function signatures Our file utils functions all have a "futils" prefix, e.g. `git_futils_touch`. One would thus naturally guess that their definitions and implementation would live in files "futils.h" and "futils.c", respectively, but in fact they live in "fileops.h". Rename the files to match expectations.
Edward Thomson 5d92e547 2019-06-08T17:28:35 oid: `is_zero` instead of `iszero` The only function that is named `issomething` (without underscore) was `git_oid_iszero`. Rename it to `git_oid_is_zero` for consistency with the rest of the library.
Edward Thomson f673e232 2018-12-27T13:47:34 git_error: use new names in internal APIs and usage Move to the `git_error` name in the internal API for error-related functions.
Etienne Samson 0a8745f2 2018-10-24T01:26:48 diff: assert that we're passed a valid git_diff object CID 1386176, 1386177, 1388219
Edward Thomson 168fe39b 2018-11-28T14:26:57 object_type: use new enumeration names Use the new object_type enumeration names within the codebase.
Patrick Steinhardt c65568d8 2018-08-09T12:48:26 diff: fix OOM on AIX when finding similar deltas in empty diff The function `git_diff_find_similar` keeps a function of cache similarity metrics signatures, whose size depends on the number of deltas passed in via the `diff` parameter. In case where the diff is empty and thus doesn't have any deltas at all, we may end up allocating this cache via a call to `git__calloc(0, sizeof(void *))`. At least on AIX, allocating 0 bytes will result in a `NULL` pointer being returned, which causes us to erroneously return an OOM error. Fix this situation by simply returning early in case where we are being passed an empty diff, as we cannot find any similarities in that case anyway.
Patrick Steinhardt ecf4f33a 2018-02-08T11:14:48 Convert usage of `git_buf_free` to new `git_buf_dispose`
Patrick Steinhardt ce7080a0 2018-02-20T10:38:27 diff_tform: fix rename detection with rewrite/delete pair A rewritten file can either be classified as a modification of its contents or of a delete of the complete file followed by an addition of the new content. This distinction becomes important when we want to detect renames for rewrites. Given a scenario where a file "a" has been deleted and another file "b" has been renamed to "a", this should be detected as a deletion of "a" followed by a rename of "a" -> "b". Thus, splitting of the original rewrite into a delete/add pair is important here. This splitting is represented by a flag we can set at the current delta. While the flag is already being set in case we want to break rewrites, we do not do so in case where the `GIT_DIFF_FIND_RENAMES_FROM_REWRITES` flag is set. This can trigger an assert when we try to match the source and target deltas. Fix the issue by setting the `GIT_DIFF_FLAG__TO_SPLIT` flag at the delta when it is a rename target and `GIT_DIFF_FIND_RENAMES_FROM_REWRITES` is set.
Patrick Steinhardt 0c7f49dd 2017-06-30T13:39:01 Make sure to always include "common.h" first Next to including several files, our "common.h" header also declares various macros which are then used throughout the project. As such, we have to make sure to always include this file first in all implementation files. Otherwise, we might encounter problems or even silent behavioural differences due to macros or defines not being defined as they should be. So in fact, our header and implementation files should make sure to always include "common.h" first. This commit does so by establishing a common include pattern. Header files inside of "src" will now always include "common.h" as its first other file, separated by a newline from all the other includes to make it stand out as special. There are two cases for the implementation files. If they do have a matching header file, they will always include this one first, leading to "common.h" being transitively included as first file. If they do not have a matching header file, they instead include "common.h" as first file themselves. This fixes the outlined problems and will become our standard practice for header and source files inside of the "src/" from now on.
Edward Thomson 191474a1 2017-02-09T18:28:19 diff: don't do rename detection on submodules
Edward Thomson 909d5494 2016-12-29T12:25:15 giterr_set: consistent error messages Error messages should be sentence fragments, and therefore: 1. Should not begin with a capital letter, 2. Should not conclude with punctuation, and 3. Should not end a sentence and begin a new one
Edward Thomson 9be638ec 2016-04-19T15:12:18 git_diff_generated: abstract generated diffs
Patrick Steinhardt 1a8c11f4 2016-03-10T10:40:47 diff_tform: fix potential NULL pointer access When the user passes in a diff which has no repository associated we may call `git_config__get_int_force` with a NULL-pointer configuration. Even though `git_config__get_int_force` is designed to swallow errors, it is not intended to be called with a NULL pointer configuration. Fix the issue by only calling `git_config__get_int_force` only when configuration could be retrieved from the repository.
Patrick Steinhardt 32f07984 2016-02-23T11:07:03 diff_tform: fix potential NULL pointer access The `normalize_find_opts` function in theory allows for the incoming diff to have no repository. When the caller does not pass in diff find options or if the GIT_DIFF_FIND_BY_CONFIG value is set, though, we try to derive the configuration from the diff's repository configuration without first verifying that the repository is actually set to a non-NULL value. Fix this issue by explicitly checking if the repository is set and if it is not, fall back to a default value of GIT_DIFF_FIND_RENAMES.
Vicent Marti 1e5e02b4 2015-10-27T17:26:04 pool: Simplify implementation
Edward Thomson 90177111 2015-06-23T16:27:33 stash: save the workdir file when deleted in index When stashing the workdir tree, examine the index as well. Using a mechanism similar to `git_diff_tree_to_workdir_with_index` allows us to determine that a file was added in the index and subsequently modified in the working directory. Without examining the index, we would erroneously believe that this file was untracked and fail to include it in the working directory tree. Use a slightly modified `git_diff_tree_to_workdir_with_index` in order to avoid some of the behavior custom to `git diff`. In particular, be sure to include the working directory side of a file when it was deleted in the index.
Edward Thomson 5ef43d41 2015-06-23T10:29:59 git_diff__merge: allow pluggable diff merges
Edward Thomson 83ba5e36 2015-06-22T18:43:13 diff_tform: remove reversed copy of delta merger Drop `git_diff__merge_like_cgit_reversed`, since it's a copy and paste mess of slightly incompatible changes.
Pierre-Olivier Latour cb63e7e8 2015-06-17T08:55:09 Explicitly handle GIT_DELTA_CONFLICTED in git_diff_merge() This fixes a bug where if a file was in conflicted state in either diff, it would not always remain in conflicted state in the merged diff.
Pierre-Olivier Latour 50456801 2015-06-10T10:09:10 Fixed handling of GIT_DELTA_CONFLICTED in git_diff_find_similar() git_diff_find_similar() now ignores git_diff_delta records with a status of GIT_DELTA_CONFLICTED, which fixes a crash due to assert() being hit.
Pierre-Olivier Latour b9780823 2015-03-30T14:06:21 Make sure to also update delta->nfiles when merging diffs When diffs are generated, the value for the 'nfiles' field of 'git_diff_delta' will be consistent with the value in the 'status' field. Merging diffs can modify the 'status' field of some deltas and the 'nfiles' field needs to be updated accordingly.
Carlos Martín Nieto fe21d708 2015-03-04T00:29:37 Plug a few leaks
Edward Thomson f1453c59 2015-02-12T12:19:37 Make our overflow check look more like gcc/clang's Make our overflow checking look more like gcc and clang's, so that we can substitute it out with the compiler instrinsics on platforms that support it. This means dropping the ability to pass `NULL` as an out parameter. As a result, the macros also get updated to reflect this as well.
Edward Thomson 392702ee 2015-02-09T23:41:13 allocations: test for overflow of requested size Introduce some helper macros to test integer overflow from arithmetic and set error message appropriately.
Pierre-Olivier Latour 36fc5497 2014-12-02T05:11:12 Added GIT_HASHSIG_ALLOW_SMALL_FILES to allow computing signatures for small files The implementation of the hashsig API disallows computing a signature on small files containing only a few lines. This new flag disables this behavior. git_diff_find_similar() sets this flag by default which means that rename / copy detection of small files will now work. This in turn affects the behavior of the git_status and git_blame APIs which will now detect rename of small files assuming the right options are passed.
Pierre-Olivier Latour ea66215d 2014-10-26T10:29:19 Removed some useless variable assignments
Vicent Marti 737b5051 2014-10-01T12:03:24 hashsig: Export as a `sys` header
Alan Rogers 61bef72d 2014-05-20T23:57:40 Start adding GIT_DELTA_UNREADABLE and GIT_STATUS_WT_UNREADABLE.
Russell Belfer 240f4af3 2014-04-28T14:04:29 Add build option for diff internal statistics
Ben Straub c8893d1f 2014-02-24T13:39:04 Use a portable cast
Ben Straub 8ac7da79 2014-02-22T10:15:11 Avoid casting warning
Carlos Martín Nieto 9950bb4e 2014-01-24T20:23:17 diff: rename the file's 'oid' to 'id' In the same vein as the previous commits in this series.
Edward Thomson 410a8e6f 2014-01-22T18:31:25 Sometimes a zero byte file is just a zero byte file Don't go to the ODB to resolve zero byte files in the workdir
Russell Belfer 9cfce273 2013-12-12T12:11:38 Cleanups, renames, and leak fixes This renames git_vector_free_all to the better git_vector_free_deep and also contains a couple of memory leak fixes based on valgrind checks. The fixes are specifically: failure to free global dir path variables when not compiled with threading on and failure to free filters from the filter registry that had not be initialized fully.
Russell Belfer 7e3ed419 2013-12-11T16:56:17 Fix up some valgrind leaks and warnings
Russell Belfer 9f77b3f6 2013-11-25T14:21:34 Add config read fns with controlled error behavior This adds `git_config__lookup_entry` which will look up a key in a config and return either the entry or NULL if the key was not present. Optionally, it can either suppress all errors or can return them (although not finding the key is not an error for this function). Unlike other accessors, this does not normalize the config key string, so it must only be used when the key is known to be in normalized form (i.e. all lower-case before the first dot and after the last dot, with no invalid characters). This also adds three high-level helper functions to look up config values with no errors and a fallback value. The three functions are for string, bool, and int values, and will resort to the fallback value for any error that arises. They are: * `git_config__get_string_force` * `git_config__get_bool_force` * `git_config__get_int_force` None of them normalize the config `key` either, so they can only be used for internal cases where the key is known to be in normal format.
Russell Belfer fcd324c6 2013-12-06T15:04:31 Add git_vector_free_all There are a lot of places that we call git__free on each item in a vector and then call git_vector_free on the vector itself. This just wraps that up into one convenient helper function.
Ben Straub 5a52d6be 2013-12-11T06:43:17 Check version earlier
Ben Straub 7fb4147f 2013-12-06T13:38:59 Don't clobber whitespace settings
Ben Straub 628e92cd 2013-12-05T14:47:04 Don't use weird return codes
Ben Straub c56c6d69 2013-12-05T14:13:46 Implement GIT_DIFF_FIND_BY_CONFIG
Russell Belfer f62c174d 2013-12-02T13:49:58 GIT_DIFF_FIND_REMOVE_UNMODIFIED sounds better
Russell Belfer 97ad85b8 2013-12-02T13:30:05 Add GIT_DIFF_FIND_DELETE_UNMODIFIED flag When doing copy detection, it is often necessary to include UNMODIFIED records in the git_diff so they are available as source records for GIT_DIFF_FIND_COPIES_FROM_UNMODIFIED. Yet in the final diff, often you will not want to have these UNMODIFIED records. This adds a flag which marks these UNMODIFIED records for deletion from the diff list so they will be removed after the rename detect phase is over.
Russell Belfer 2123a17f 2013-12-02T13:27:06 Fix bug making split deltas a COPIED targets When FIND_COPIES is used in combination with BREAK_REWRITES for rename detection, there was a bug where the split MODIFIED delta was only used as a target for RENAME records and not for COPIED records. This fixes that, converting the split into a pair of DELETED and COPIED deltas when that circumstance arises.
Russell Belfer e7c85120 2013-11-01T11:39:37 More tests and fixed for merging reversed diffs There were a lot more cases to deal with to make sure that our merged (i.e. workdir-to-tree-to-index) diffs were matching the output of core Git.
Russell Belfer 3940310e 2013-10-30T13:56:42 Fix some of the glaring errors in GIT_DIFF_REVERSE These changes fix the basic problem with GIT_DIFF_REVERSE being broken for text diffs. The reversed diff entries were getting added to the git_diff correctly, but some of the metadata was kept incorrectly in a way that prevented the text diffs from being generated correctly. Once I fixed that, it became clear that it was not possible to merge reversed diffs correctly. This has a first pass at fixing that problem. We probably need more tests to make sure that is really fixed thoroughly.
Russell Belfer 74a627f0 2013-10-21T09:07:19 Tweak to git_diff_delta structure for nfiles While the base git_diff_delta structure always contains two files, when we introduce conflict data, it will be helpful to have an indicator when an additional file is involved.
Russell Belfer 10672e3e 2013-10-15T15:10:07 Diff API cleanup This lays groundwork for separating formatting options from diff creation options. This groups the formatting flags separately from the diff list creation flags and reorders the options. This also tweaks some APIs to further separate code that uses patches from code that just looks at git_diffs.
Russell Belfer 3ff1d123 2013-10-11T14:51:54 Rename diff objects and split patch.h This makes no functional change to diff but renames a couple of the objects and splits the new git_patch (formerly git_diff_patch) into a new header file.
Vicent Martí 6208bd49 2013-09-03T12:29:18 Merge pull request #1804 from ethomson/rewrites Minor changes for rewrites
Edward Thomson 17c7fbf6 2013-08-21T14:07:53 Split rewrites, status doesn't return rewrites Ensure that we apply splits to rewrites, even if we're not interested in examining it closely for rename/copy detection. In keeping with core git, status should not display rewrites, it should simply show files as "modified".
Russell Belfer b6ac07b5 2013-08-22T14:45:10 Trying to fix Win32 warnings
Russell Belfer 7edb74d3 2013-08-04T14:06:13 Update rename src map for any split src When using a rename source that is actually a to-be-split record, we have to update the best-fit mapping data in both the case where the target is also a split record and the case where the target is a simple added record. Before this commit, we were only doing the update when the target was itself a split record (and even in that case, the test was slightly wrong).
Russell Belfer d730d3f4 2013-07-31T16:40:42 Major rename detection changes After doing further profiling, I found that a lot of time was being spent attempting to insert hashes into the file hash signature when using the rolling hash because the rolling hash approach generates a hash per byte of the file instead of one per run/line of data. To optimize this, I decided to convert back to a run-based file signature algorithm which would be more like core Git. After changing this, a number of the existing tests started to fail. In some cases, this appears to have been because the test was coded to be too specific to the particular results of the file similarity metric and in some cases there appear to have been bugs in the core rename detection code where only by the coincidence of the file similarity scoring were the expected results being generated. This renames all the variables in the core rename detection code to be more consistent and hopefully easier to follow which made it a bit easier to reason about the behavior of that code and fix the problems that I was seeing. I think it's in better shape now. There are a couple of tests now that attempt to stress test the rename detection code and they are quite slow. Most of the time is spent setting up the test data on disk and in the index. When we roll out performance improvements for index insertion, it should also speed up these tests I hope.
Russell Belfer a16e4172 2013-07-25T12:27:39 Fix rename detection to use actual blob size The size data in the index may not reflect the actual size of the blob data from the ODB when content filtering comes into play. This commit fixes rename detection to use the actual blob size when calculating data signatures instead of the value from the index. Because of a misunderstanding on my part, I first converted the git_index_add_bypath API to use the post-filtered blob data size in creating the index entry. I backed that change out, but I kept the overall refactoring of that routine and the new internal git_blob__create_from_paths API because it eliminates an extra stat() call from the code that adds a file to the index. The existing tests actually cover this code path, at least when running on Windows, so at this point I'm not adding new tests to cover the changes.
Russell Belfer effdbeb3 2013-07-24T17:48:37 Make rename detection file size fix better The previous fix for checking file sizes with rename detection always loads the blob. In this version, if the odb backend can get the object header without loading the whole thing into memory, then we'll just use that, so that we can eliminate possible rename sources & targets without loading them.
Russell Belfer a5140f4d 2013-07-24T17:11:49 Fix rename detection for tree-to-tree diffs The performance improvements I introduced for rename detection were not able to run successfully for tree-to-tree diffs because the blob size was not known early enough and so the file signature always had to be calculated nonetheless. This change separates loading blobs into memory from calculating the signature. I can't avoid having to load the large blobs into memory, but by moving it forward, I'm able to avoid the signature calculation if the blob won't come into play for renames.
Russell Belfer f5c4d022 2013-07-24T13:44:35 Fix incorrect comment
Russell Belfer 18e9efc4 2013-07-24T13:10:16 Don't check rename if file size difference is huge
Edward Thomson d55bed1a 2013-07-17T16:55:00 don't include ignored as rename candidates
nulltoken c4ac556e 2013-06-29T12:48:58 Fix compilation warnings
Russell Belfer e4acc3ba 2013-06-18T16:14:35 Fix rename looped reference issues This makes the diff rename tracking code more careful about the order in which it processes renames and more thorough in updating the mapping of correct renames when an earlier rename update alters the index of a later matched pair.
Russell Belfer a1683f28 2013-06-14T16:18:04 More tests and bug fixes for status with rename This changes the behavior of the status RENAMED flags so that they will be combined with the MODIFIED flags if appropriate. If a file is modified in the index and also renamed, then the status code will have both the GIT_STATUS_INDEX_MODIFIED and INDEX_RENAMED bits set. If it is renamed but the OID has not changed, then just the GIT_STATUS_INDEX_RENAMED bit will be set. Similarly, the flags GIT_STATUS_WT_MODIFIED and GIT_STATUS_WT_RENAMED can both be set independently of one another. This fixes a serious bug where the check for unmodified files that was done at data load time could end up erasing the RENAMED state of a file that was renamed with no changes. Lastly, this contains a bunch of new tests for status with renames, including tests where the only rename changes are case changes. The expected results of these tests have to vary by whether the platform uses a case sensitive filesystem or not, so the expected data covers those platform differences separately.
Russell Belfer 37f66e82 2013-06-12T15:21:21 Fix Windows warnings This fixes problems with missing function prototypes and 64-bit data issues on Windows.
Vicent Martí 88c401be 2013-06-12T14:54:32 Merge pull request #1643 from ethomson/rename_source Keep data about source of similarity
Edward Thomson 690bf41c 2013-06-10T15:16:44 keep source similarity in rename detection
Russell Belfer 114f5a6c 2013-06-10T10:10:39 Reorganize diff and add basic diff driver This is a significant reorganization of the diff code to break it into a set of more clearly distinct files and to document the new organization. Hopefully this will make the diff code easier to understand and to extend. This adds a new `git_diff_driver` object that looks of diff driver information from the attributes and the config so that things like function content in diff headers can be provided. The full driver spec is not implemented in the commit - this is focused on the reorganization of the code and putting the driver hooks in place. This also removes a few #includes from src/repository.h that were overbroad, but as a result required extra #includes in a variety of places since including src/repository.h no longer results in pulling in the whole world.
Russell Belfer 49f70f2c 2013-05-23T15:48:06 Fill out diff rename test coverage This extends the rename tests to make sure that every rename scenario in the inner loop of git_diff_find_similar is actually exercised. Also, fixes an incorrect assert that was in one of the clauses that was not previously being exercised.
Russell Belfer 67db583d 2013-05-23T15:06:07 More diff rename tests; better split swap handling This adds a couple more tests of different rename scenarios. Also, this fixes a problem with the case where you have two "split" deltas and the left half of one matches the right half of the other. That case was already being handled, but in the wrong order in a way that could result in bad output. Also, if the swap also happened to put the other two halves into the correct place (i.e. two files exchanged places with each other), then the second delta was left with the SPLIT flag set when it really should be cleared.
Russell Belfer c68b09dc 2013-05-23T11:52:34 Fix dereference of freed delta I was accidentally using a value that I had just freed. This moves the clearing of the delta internal flags into a better place.
Russell Belfer a21cbb12 2013-05-22T10:37:12 Significant rename detection rewrite This flips rename detection around so instead of creating a forward mapping from deltas to possible rename targets, instead it creates a reverse mapping, looking at possible targets and trying to find a source that they could have been renamed or copied from. This is important because each output can only have a single source, but a given source could map to multiple outputs (in the form of COPIED records). Additionally, this makes a couple of tweaks to the public rename detection APIs, mostly renaming a couple of options that control the behavior to make more sense and to be more like core Git. I walked through the tests looking at the exact results and updated the expectations based on what I saw. The new code is different from the old because it cannot give some nonsense results (like A was renamed to both B and C) which were part of the outputs previously.
Russell Belfer 9be5be47 2013-05-20T13:37:21 More git_diff_find_similar improvements - Add new GIT_DIFF_FIND_EXACT_MATCH_ONLY flag to do similarity matching without using the similarity metric (i.e. only compare the SHA). - Clean up the similarity measurement code to more rigorously distinguish between files that are not similar and files that are not comparable (previously, a 0 could either mean that the files could not be compared or that they were totally different) - When splitting a MODIFIED file into a DELETE/ADD pair, actually make a DELETED/UNTRACKED pair if the right side of the diff is from the working directory. This prevents an odd mix of ADDED and UNTRACKED files on workdir diffs.
Russell Belfer d958e37a 2013-05-17T17:21:45 Fix issues with git_diff_find_similar There are a number of bugs in the rename code that only were obvious when I started testing it against large old repos with more complex patterns. (The code to do that testing is not ready to merge with libgit2, but I do plan to add more thorough tests.) This contains a significant number of changes and also tweaks the public API slightly to make emulating core git easier. Most notably, this separates the GIT_DIFF_FIND_AND_BREAK_REWRITES flag into FIND_REWRITES (which adds a self-similarity score to every modified file) and BREAK_REWRITES (which splits the modified deltas into add/remove pairs in the diff list). When you do a raw output of core git, rewrites show up as M090 or such, not at A and D output, so I wanted to be able to emulate that. Publicly, this also changes the flags to be uint16_t since we don't need values out of that range. Internally, this contains significant changes from a number of small bug fixes (like using the wrong side of the diff to decide if the object could be found in the ODB vs the workdir) to larger issues about which files can and should be compared and how the various edge cases of similarity scores should be treated. Honestly, I don't think this is the last update that will have to be made to this code, but I think this moves us closer to correct behavior and I tried to document the code so it would be easier to follow..
Vicent Martí 71596200 2013-05-15T15:47:46 Merge pull request #1588 from arrbee/fixes-for-checkout-and-diff Bug fixes for checkout and diff
Russell Belfer 09fae31d 2013-05-15T14:58:26 Improve robustness of diff rename detection Under some strange circumstances, diffs can end up listing files that we can't actually open successfully. Instead of aborting the git_diff_find_similar, this makes it so that those files just won't be considered as valid rename/copy targets instead.
nulltoken 1fed6b07 2013-05-13T21:57:37 Fix trailing whitespaces
Edward Thomson 0462fba5 2013-04-30T14:56:41 renames!
Russell Belfer b7f167da 2013-04-29T13:52:12 Make git_oid_cmp public and add git_oid__cmp
Russell Belfer 8cfd54f0 2013-03-26T12:27:15 Fix Windows/Win32 warning
Edward Thomson aa408cbf 2013-03-11T11:18:00 handle small files in similarity metrics
Russell Belfer 0a008913 2013-02-22T10:21:02 Minor improvements to find_similar code This moves a couple of checks outside of the inner loop of the find_similar rename/copy detection phase that are only dependent on the "from" side of a detection. Also, this replaces the inefficient initialization of the options structure when a value is not provided explicitly by the user.
Russell Belfer f8275890 2013-02-22T10:19:50 Replace static data with configured metric Instead of creating three git_diff_similarity_metric statically for the various config options, just create the metric structure on demand and populate it, using the payload to specific the extra flags that should be passed to the hashsig. This removes a level of obfuscation from the code, I think.
Russell Belfer d4b747c1 2013-02-21T16:44:44 Add diff rename tests with partial similarity This adds some new tests that actually exercise the similarity metric between files to detect renames, copies, and split modified files that are too heavily modified. There is still more testing to do - these tests are just partially covering the cases. There is also one bug fix in this where a change set with only MODIFY being broken into ADD/DELETE (due to low self-similarity) without any additional RENAMED entries would end up not processing the split requests (because the num_rewrites counter got reset).
Russell Belfer 960a04dd 2013-02-21T12:40:33 Initial integration of similarity metric to diff This is the initial integration of the similarity metric into the `git_diff_find_similar()` code path. The existing tests all pass, but the new functionality isn't currently well tested. The integration does go through the pluggable metric interface, so it should be possible to drop in an alternative to the internal metric that libgit2 implements. This comes along with a behavior change for an existing interface; namely, passing two NULLs to git_diff_blobs (or passing NULLs to git_diff_blob_to_buffer) will now call the file_cb parameter zero times instead of one time. I know it's strange that that change is paired with this other change, but it emerged from some initialization changes that I ended up making.
Russell Belfer 71a3d27e 2013-02-08T10:06:47 Replace diff delta binary with flags Previously the git_diff_delta recorded if the delta was binary. This replaces that (with no net change in structure size) with a full set of flags. The flag values that were already in use for individual git_diff_file objects are reused for the delta flags, too (along with renaming those flags to make it clear that they are used more generally). This (a) makes things somewhat more consistent (because I was using a -1 value in the "boolean" binary field to indicate unset, whereas now I can just use the flags that are easier to understand), and (b) will make it easier for me to add some additional flags to the delta object in the future, such as marking the results of a copy/rename detection or other deltas that might want a special indicator. While making this change, I officially moved some of the flags that were internal only into the private diff header. This also allowed me to remove a gross hack in rename/copy detect code where I was overwriting the status field with an internal value.
Russell Belfer 9bc8be3d 2013-02-19T10:25:41 Refine pluggable similarity API This plugs in the three basic similarity strategies for handling whitespace via internal use of the pluggable API. In so doing, I realized that the use of git_buf in the hashsig API was not needed and actually just made it harder to use, so I tweaked that API as well. Note that the similarity metric is still not hooked up in the find_similarity code - this is just setting out the function that will be used.
Russell Belfer 5e5848eb 2013-02-14T17:25:10 Change similarity metric to sampled hashes This moves the similarity metric code out of buf_text and into a new file. Also, this implements a different approach to similarity measurement based on a Rabin-Karp rolling hash where we only keep the top 100 and bottom 100 hashes. In theory, that should be sufficient samples to given a fairly accurate measurement while limiting the amount of data we keep for file signatures no matter how large the file is.
Russell Belfer 99ba8f23 2013-01-22T15:27:08 wip: adding metric to diff
Philip Kelley 11d9f6b3 2013-01-27T14:17:07 Vector improvements and their fallout
Edward Thomson 359fc2d2 2013-01-08T17:07:25 update copyrights
Ben Straub c7231c45 2012-11-30T16:31:42 Deploy GITERR_CHECK_VERSION
Ben Straub ca901e7b 2012-11-29T15:16:19 Deploy GIT_DIFF_FIND_OPTIONS_INIT
Russell Belfer db106d01 2012-10-30T09:40:50 Move rename detection into new file This improves the naming for the rename related functionality moving it to be called `git_diff_find_similar()` and renaming all the associated constants, etc. to make more sense. I also moved the new code (plus the existing `git_diff_merge`) into a new file `diff_tform.c` where I can put new functions related to manipulating git diff lists. This also updates the implementation significantly from the last revision fixing some ordering issues (where break-rewrite needs to be handled prior to copy and rename detection) and improving config option handling.