kmx git

Commit	Date	Message
d730d3f4	2013-07-31T16:40:42	Major rename detection changes After doing further profiling, I found that a lot of time was being spent attempting to insert hashes into the file hash signature when using the rolling hash because the rolling hash approach generates a hash per byte of the file instead of one per run/line of data. To optimize this, I decided to convert back to a run-based file signature algorithm which would be more like core Git. After changing this, a number of the existing tests started to fail. In some cases, this appears to have been because the test was coded to be too specific to the particular results of the file similarity metric and in some cases there appear to have been bugs in the core rename detection code where only by the coincidence of the file similarity scoring were the expected results being generated. This renames all the variables in the core rename detection code to be more consistent and hopefully easier to follow which made it a bit easier to reason about the behavior of that code and fix the problems that I was seeing. I think it's in better shape now. There are a couple of tests now that attempt to stress test the rename detection code and they are quite slow. Most of the time is spent setting up the test data on disk and in the index. When we roll out performance improvements for index insertion, it should also speed up these tests I hope.
3658e81e	2013-03-25T14:20:07	Move crlf conversion into buf_text This adds crlf/lf conversion functions into buf_text with more efficient implementations that bypass the high level buffer functions. They attempt to minimize the number of reallocations done and they directly write the buffer data as needed if they know that there is enough memory allocated to memcpy data. Tests are added for these new functions. The crlf.c code is updated to use the new functions. Removed the include of buf_text.h from filter.h and just include it more narrowly in the places that need it.
9bc8be3d	2013-02-19T10:25:41	Refine pluggable similarity API This plugs in the three basic similarity strategies for handling whitespace via internal use of the pluggable API. In so doing, I realized that the use of git_buf in the hashsig API was not needed and actually just made it harder to use, so I tweaked that API as well. Note that the similarity metric is still not hooked up in the find_similarity code - this is just setting out the function that will be used.
aa643260	2013-02-15T11:08:02	More tests of file signatures with whitespace opts Seems to be working pretty well...
5e5848eb	2013-02-14T17:25:10	Change similarity metric to sampled hashes This moves the similarity metric code out of buf_text and into a new file. Also, this implements a different approach to similarity measurement based on a Rabin-Karp rolling hash where we only keep the top 100 and bottom 100 hashes. In theory, that should be sufficient samples to given a fairly accurate measurement while limiting the amount of data we keep for file signatures no matter how large the file is.
9c454b00	2013-01-11T22:13:02	Initial implementation of similarity scoring algo This adds a new `git_buf_text_hashsig` type and functions to generate these hash signatures and compare them to give a similarity score. This can be plugged into diff similarity scoring.
17c92bea	2013-01-29T12:13:24	Test buf join with NULL behavior explicitly
0d65acad	2013-01-11T11:24:26	Match binary file check of core git in diff Core git just looks for NUL bytes in files when deciding about is-binary inside diff (although it uses a better algorithm in checkout, when deciding if CRLF conversion should be done). Libgit2 was using the better algorithm in both places, but that is causing some confusion. For now, this makes diff just look for NUL bytes to decide if a file is binary by content in diff.
7bf87ab6	2012-11-28T09:58:48	Consolidate text buffer functions There are many scattered functions that look into the contents of buffers to do various text manipulations (such as escaping or unescaping data, calculating text stats, guessing if content is binary, etc). This groups all those functions together into a new file and converts the code to use that. This has two enhancements to existing functionality. The old text stats function is significantly rewritten and the BOM detection code was extended (although largely we can't deal with anything other than a UTF8 BOM).
2d3579be	2012-10-10T14:54:31	Add git_buf_put_base64 to buffer API
e9ca852e	2012-08-23T09:20:17	Fix warnings and merge issues on Win64
02a0d651	2012-07-12T16:31:59	Add git_buf_unescape and git__unescape to unescape all characters in a string (in-place)
465092ce	2012-07-12T11:56:50	Fix memory leak in test
039fc406	2012-07-10T15:10:14	Add a couple of useful git_buf utilities * `git_buf_rfind` (with tests and tests for `git_buf_rfind_next`) * `git_buf_puts_escaped` and `git_buf_puts_escaped_regex` (with tests) to copy strings into a buffer while injecting an escape sequence (e.g. '\') in front of particular characters.
41a82592	2012-05-15T14:17:39	Ranged iterators and rewritten git_status_file The goal of this work is to rewrite git_status_file to use the same underlying code as git_status_foreach. This is done in 3 phases: 1. Extend iterators to allow ranged iteration with start and end prefixes for the range of file names to be covered. 2. Improve diff so that when there is a pathspec and there is a common non-wildcard prefix of the pathspec, it will use ranged iterators to minimize excess iteration. 3. Rewrite git_status_file to call git_status_foreach_ext with a pathspec that covers just the one file being checked. Since ranged iterators underlie the status & diff implementation, this is actually fairly efficient. The workdir iterator does end up loading the contents of all the directories down to the single file, which should ideally be avoided, but it is pretty good.
1a6e8f8a	2012-04-13T10:42:00	Update clar and remove old helpers This updates to the latest clar which includes the helpers `cl_assert_equal_s` and `cl_assert_equal_i`. Convert the code over to use those and remove the old libgit2-only helpers.
a4c291ef	2012-03-20T21:57:38	Convert reflog to new errors Cleaned up some other issues.
13224ea4	2012-02-27T04:28:31	buffer: Unify `git_fbuffer` and `git_buf` This makes so much sense that I can't believe it hasn't been done before. Kill the old `git_fbuffer` and read files straight into `git_buf` objects. Also: In order to fully support 4GB files in 32-bit systems, the `git_buf` implementation has been changed from using `ssize_t` for storage and storing negative values on allocation failure, to using `size_t` and changing the buffer pointer to a magical pointer on allocation failure. Hopefully this won't break anything.
3fd1520c	2012-01-24T20:35:15	Rename the Clay test suite to Clar Clay is the name of a programming language on the makings, and we want to avoid confusions. Sorry for the huge diff!

d730d3f4

2013-07-31T16:40:42

Major rename detection changes After doing further profiling, I found that a lot of time was being spent attempting to insert hashes into the file hash signature when using the rolling hash because the rolling hash approach generates a hash per byte of the file instead of one per run/line of data. To optimize this, I decided to convert back to a run-based file signature algorithm which would be more like core Git. After changing this, a number of the existing tests started to fail. In some cases, this appears to have been because the test was coded to be too specific to the particular results of the file similarity metric and in some cases there appear to have been bugs in the core rename detection code where only by the coincidence of the file similarity scoring were the expected results being generated. This renames all the variables in the core rename detection code to be more consistent and hopefully easier to follow which made it a bit easier to reason about the behavior of that code and fix the problems that I was seeing. I think it's in better shape now. There are a couple of tests now that attempt to stress test the rename detection code and they are quite slow. Most of the time is spent setting up the test data on disk and in the index. When we roll out performance improvements for index insertion, it should also speed up these tests I hope.

3658e81e

2013-03-25T14:20:07

Move crlf conversion into buf_text This adds crlf/lf conversion functions into buf_text with more efficient implementations that bypass the high level buffer functions. They attempt to minimize the number of reallocations done and they directly write the buffer data as needed if they know that there is enough memory allocated to memcpy data. Tests are added for these new functions. The crlf.c code is updated to use the new functions. Removed the include of buf_text.h from filter.h and just include it more narrowly in the places that need it.

9bc8be3d

2013-02-19T10:25:41

Refine pluggable similarity API This plugs in the three basic similarity strategies for handling whitespace via internal use of the pluggable API. In so doing, I realized that the use of git_buf in the hashsig API was not needed and actually just made it harder to use, so I tweaked that API as well. Note that the similarity metric is still not hooked up in the find_similarity code - this is just setting out the function that will be used.

aa643260

2013-02-15T11:08:02

More tests of file signatures with whitespace opts Seems to be working pretty well...

5e5848eb

2013-02-14T17:25:10

Change similarity metric to sampled hashes This moves the similarity metric code out of buf_text and into a new file. Also, this implements a different approach to similarity measurement based on a Rabin-Karp rolling hash where we only keep the top 100 and bottom 100 hashes. In theory, that should be sufficient samples to given a fairly accurate measurement while limiting the amount of data we keep for file signatures no matter how large the file is.

9c454b00

2013-01-11T22:13:02

Initial implementation of similarity scoring algo This adds a new `git_buf_text_hashsig` type and functions to generate these hash signatures and compare them to give a similarity score. This can be plugged into diff similarity scoring.

17c92bea

2013-01-29T12:13:24

Test buf join with NULL behavior explicitly

0d65acad

2013-01-11T11:24:26

Match binary file check of core git in diff Core git just looks for NUL bytes in files when deciding about is-binary inside diff (although it uses a better algorithm in checkout, when deciding if CRLF conversion should be done). Libgit2 was using the better algorithm in both places, but that is causing some confusion. For now, this makes diff just look for NUL bytes to decide if a file is binary by content in diff.

7bf87ab6

2012-11-28T09:58:48

Consolidate text buffer functions There are many scattered functions that look into the contents of buffers to do various text manipulations (such as escaping or unescaping data, calculating text stats, guessing if content is binary, etc). This groups all those functions together into a new file and converts the code to use that. This has two enhancements to existing functionality. The old text stats function is significantly rewritten and the BOM detection code was extended (although largely we can't deal with anything other than a UTF8 BOM).

2d3579be

2012-10-10T14:54:31

Add git_buf_put_base64 to buffer API

e9ca852e

2012-08-23T09:20:17

Fix warnings and merge issues on Win64

02a0d651

2012-07-12T16:31:59

Add git_buf_unescape and git__unescape to unescape all characters in a string (in-place)

465092ce

2012-07-12T11:56:50

Fix memory leak in test

039fc406

2012-07-10T15:10:14

Add a couple of useful git_buf utilities * `git_buf_rfind` (with tests and tests for `git_buf_rfind_next`) * `git_buf_puts_escaped` and `git_buf_puts_escaped_regex` (with tests) to copy strings into a buffer while injecting an escape sequence (e.g. '\') in front of particular characters.

41a82592

2012-05-15T14:17:39

Ranged iterators and rewritten git_status_file The goal of this work is to rewrite git_status_file to use the same underlying code as git_status_foreach. This is done in 3 phases: 1. Extend iterators to allow ranged iteration with start and end prefixes for the range of file names to be covered. 2. Improve diff so that when there is a pathspec and there is a common non-wildcard prefix of the pathspec, it will use ranged iterators to minimize excess iteration. 3. Rewrite git_status_file to call git_status_foreach_ext with a pathspec that covers just the one file being checked. Since ranged iterators underlie the status & diff implementation, this is actually fairly efficient. The workdir iterator does end up loading the contents of all the directories down to the single file, which should ideally be avoided, but it is pretty good.

1a6e8f8a

2012-04-13T10:42:00

Update clar and remove old helpers This updates to the latest clar which includes the helpers `cl_assert_equal_s` and `cl_assert_equal_i`. Convert the code over to use those and remove the old libgit2-only helpers.

a4c291ef

2012-03-20T21:57:38

Convert reflog to new errors Cleaned up some other issues.

13224ea4

2012-02-27T04:28:31

buffer: Unify `git_fbuffer` and `git_buf` This makes so much sense that I can't believe it hasn't been done before. Kill the old `git_fbuffer` and read files straight into `git_buf` objects. Also: In order to fully support 4GB files in 32-bit systems, the `git_buf` implementation has been changed from using `ssize_t` for storage and storing negative values on allocation failure, to using `size_t` and changing the buffer pointer to a magical pointer on allocation failure. Hopefully this won't break anything.

3fd1520c

2012-01-24T20:35:15

Rename the Clay test suite to Clar Clay is the name of a programming language on the makings, and we want to avoid confusions. Sorry for the huge diff!

thodg/libgit2/tests-clar/core/buffer.c

tests-clar/core/buffer.c

Log