src/indexer.c


Log

Author Commit Date CI Message
Edward Thomson f673e232 2018-12-27T13:47:34 git_error: use new names in internal APIs and usage Move to the `git_error` name in the internal API for error-related functions.
Edward Thomson 168fe39b 2018-11-28T14:26:57 object_type: use new enumeration names Use the new object_type enumeration names within the codebase.
Patrick Steinhardt 852bc9f4 2018-11-23T19:26:24 khash: remove intricate knowledge of khash types Instead of using the `khiter_t`, `git_strmap_iter` and `khint_t` types, simply use `size_t` instead. This decouples code from the khash stuff and makes it possible to move the khash includes into the implementation files.
Edward Thomson 50186ce8 2018-08-26T11:26:45 Merge pull request #4374 from pks-t/pks/pack-file-verify Pack file verification
Nelson Elhage 32810348 2018-07-20T08:43:54 Use UINT32_MAX as the default object limit This replicates the old behavior of limiting to 2³² by default.
Nelson Elhage bfe34242 2018-07-16T03:12:01 See if this fixes 32-bit build
Nelson Elhage efe3f37d 2018-07-12T04:20:15 Add a git_libgit2_opts option to set the max indexer object count
Nelson Elhage 912c59c9 2018-06-24T06:51:08 while fuzzing, limit # objects read
Patrick Steinhardt 6b51f380 2018-06-22T13:19:31 indexer: correctly initialize struct with {0}
Patrick Steinhardt 5ec4aee9 2017-11-12T10:35:18 indexer: add ability to select connectivity checks Right now, we simply turn on connectivity checks in the indexer as soon as we have access to an object database. But seeing that the connectivity checks may incur additional overhead, we do want the user to decide for himself whether he wants to allow those checks. Furthermore, it might also be desirable to check connectivity in case where no object database is given at all, e.g. in case where a fully connected pack file is expected. Add a flag `verify` to `git_indexer_options` to enable additional verification checks. Also avoid to query the ODB in case none is given to allow users to enable checks when they do not have an ODB.
Patrick Steinhardt c16556aa 2017-11-12T10:31:48 indexer: introduce options struct to `git_indexer_new` We strive to keep an options structure to many functions to be able to extend options in the future without breaking the API. `git_indexer_new` doesn't have one right now, but we want to be able to add an option for enabling strict packfile verification. Add a new `git_indexer_options` structure and adjust callers to use that.
Patrick Steinhardt a616fb16 2017-10-13T13:53:05 indexer: check pack file connectivity When passing `--strict` to `git-unpack-objects`, core git will verify the pack file that is currently being read. In addition to the typical checksum verification, this will especially cause it to verify object connectivity of the received pack file. So it checks, for every received object, if all the objects it references are either part of the local object database or part of the pack file. In libgit2, we currently have no such mechanism, which leaves us unable to verify received pack files prior to writing them into our local object database. This commit introduce the concept of `expected_oids` to the indexer. When pack file verification is turned on by a new flag, the indexer will try to parse each received object first. If the object has any links to other objects, it will check if those links are already satisfied by known objects either part of the object database or objects it has already seen as part of that pack file. If not, it will add them to the list of `expected_oids`. Furthermore, the indexer will remove the current object from the `expected_oids` if it is currently being expected. Like this, we are able to verify whether all object links are being satisfied. As soon as we hit the end of the object stream and have resolved all objects as well as deltified objects, we assert that `expected_oids` is in fact empty. This should always be the case for a valid pack file with full connectivity.
Patrick Steinhardt be41c384 2017-11-12T09:25:49 indexer: extract function reading stream objects The loop inside of `git_indexer_append` iterates over every object that is to be stored as part of the index. While the logic to retrieve every object from the packfile stream is rather involved, it currently just part of the loop, making it unnecessarily hard to follow. Move the logic into its own function `read_stream_object`, which unpacks a single object from the stream. Note that there is some subtletly here involving the special error `GIT_EBUFS`, which indicates to the indexer that no more data is currently available. So instead of returning an error and aborting the whole loop in that case, we do have to catch that value and return successfully to wait for more data to be read.
Patrick Steinhardt 6568f374 2017-10-11T13:20:19 indexer: remove useless local variable The `processed` variable local to `git_indexer_append` counts how many objects have already been processed. But actually, whenever it gets assigned to, we are also assigning the same value to the `stats->indexed_objects` struct member. So in fact, it is being quite useless due to always having the same value as the `indexer_objects` member and makes it a bit harder to understand the code. We can just remove the variable to fix that.
Patrick Steinhardt ecf4f33a 2018-02-08T11:14:48 Convert usage of `git_buf_free` to new `git_buf_dispose`
Patrick Steinhardt c8ee5270 2017-12-08T09:05:58 pack: rename `git_packfile_stream_free` The function `git_packfile_stream_free` frees all state of the packfile stream without freeing the structure itself. This naming makes it hard to spot whether it will try to free the pointer itself or not, causing potential future errors. Due to this reason, we have decided to name a function freeing state without freeing the actual struture a "dispose" function. Rename `git_packfile_stream_free` to `git_packfile_stream_dispose` as a first example following this rule.
Edward Thomson 619f61a8 2018-02-01T06:22:36 odb: error when we can't create object header Return an error to the caller when we can't create an object header for some reason (printf failure) instead of simply asserting.
lhchavez c3514b0b 2017-12-23T14:59:07 Fix unpack double free If an element has been cached, but then the call to packfile_unpack_compressed() fails, the very next thing that happens is that its data is freed and then the element is not removed from the cache, which frees the data again. This change sets obj->data to NULL to avoid the double-free. It also stops trying to resolve deltas after two continuous failed rounds of resolution, and adds a test for this.
lhchavez c8aaba24 2017-12-06T03:03:18 libFuzzer: Fix missing trailer crash This change fixes an invalid memory access when the trailer is missing / corrupt. Found using libFuzzer.
lhchavez 400caed3 2017-12-06T03:22:58 libFuzzer: Fix a git_packfile_stream leak This change ensures that the git_packfile_stream object in git_indexer_append() does not leak when the stream has errors. Found using libFuzzer.
Patrick Steinhardt 0c7f49dd 2017-06-30T13:39:01 Make sure to always include "common.h" first Next to including several files, our "common.h" header also declares various macros which are then used throughout the project. As such, we have to make sure to always include this file first in all implementation files. Otherwise, we might encounter problems or even silent behavioural differences due to macros or defines not being defined as they should be. So in fact, our header and implementation files should make sure to always include "common.h" first. This commit does so by establishing a common include pattern. Header files inside of "src" will now always include "common.h" as its first other file, separated by a newline from all the other includes to make it stand out as special. There are two cases for the implementation files. If they do have a matching header file, they will always include this one first, leading to "common.h" being transitively included as first file. If they do not have a matching header file, they instead include "common.h" as first file themselves. This fixes the outlined problems and will become our standard practice for header and source files inside of the "src/" from now on.
Edward Thomson 6f960b55 2017-06-11T10:37:46 Merge pull request #4088 from chescock/packfile-name-using-complete-hash Ensure packfiles with different contents have different names
Patrick Steinhardt 6c23704d 2017-06-08T21:40:18 settings: rename `GIT_OPT_ENABLE_SYNCHRONOUS_OBJECT_CREATION` Initially, the setting has been solely used to enable the use of `fsync()` when creating objects. Since then, the use has been extended to also cover references and index files. As the option is not yet part of any release, we can still correct this by renaming the option to something more sensible, indicating not only correlation to objects. This commit renames the option to `GIT_OPT_ENABLE_FSYNC_GITDIR`. We also move the variable from the object to repository source code.
Chris Hescock c0e54155 2017-01-11T10:39:59 indexer: name pack files after trailer hash Upstream git.git has changed the way how packfiles are named. Previously, they were using a hash of the contained object's OIDs, which has then been changed to use the hash of the complete packfile instead. See 1190a1acf (pack-objects: name pack files after trailer hash, 2013-12-05) in the git.git repository for more information on this change. This commit changes our logic to match the behavior of core git.
Edward Thomson 1c04a96b 2017-02-28T12:29:29 Honor `core.fsyncObjectFiles`
Edward Thomson 2a5ad7d0 2017-02-17T16:42:40 fsync: call it "synchronous" object writing Rename `GIT_OPT_ENABLE_SYNCHRONIZED_OBJECT_CREATION` -> `GIT_OPT_ENABLE_SYNCHRONOUS_OBJECT_CREATION`.
Edward Thomson 1229e1c4 2017-02-17T16:36:53 fsync parent directories when fsyncing When fsync'ing files, fsync the parent directory in the case where we rename a file into place, or create a new file, to ensure that the directory entry is flushed correctly.
Edward Thomson 1c2c0ae2 2016-12-14T12:51:40 packbuilder: honor git_object__synchronized_writing Honor `git_object__synchronized_writing` when creating a packfile and corresponding index.
Patrick Steinhardt 0d716905 2017-01-27T15:23:15 oidmap: remove GIT__USE_OIDMAP macro
Patrick Steinhardt 85d2748c 2017-01-27T14:05:10 khash: avoid using `kh_key`/`kh_val` as lvalue
Patrick Steinhardt f31cb45a 2017-01-25T15:31:12 khash: avoid using `kh_put` directly
Patrick Steinhardt cb18386f 2017-01-25T14:26:58 khash: avoid using `kh_val`/`kh_value` directly
Patrick Steinhardt 036daa59 2017-01-25T14:11:42 khash: use `git_map_exists` where applicable
Patrick Steinhardt 9694d9ba 2017-01-25T14:09:17 khash: avoid using `kh_foreach`/`kh_foreach_value` directly
Edward Thomson 048c5ea7 2017-01-21T23:55:21 Merge pull request #4053 from chescock/extend-packfile-by-pages Extend packfile in increments of page_size.
Edward Thomson 87b7a705 2017-01-21T15:44:57 indexer: avoid warning about `idx->pack` It must be non-NULL to have a valid `git_indexer`.
Edward Thomson bf339ab0 2017-01-21T14:51:31 indexer: introduce `git_packfile_close` Encapsulation!
Edward Thomson 52949c80 2017-01-21T18:30:12 Merge branch 'pr/4060'
Edward Thomson d030bba9 2017-01-21T17:15:33 indexer: only delete temp file if it was unused Only try to `unlink` our temp file when we know that we didn't copy it into its permanent location.
lhchavez f5586f5c 2017-01-14T16:37:00 Addressed review feedback
lhchavez 96df833b 2017-01-03T19:15:09 Close the file before unlinking I forgot that Windows chokes while trying to delete open files.
lhchavez db535d0a 2017-01-01T12:45:02 Delete temporary packfile in indexer This change deletes the temporary packfile that the indexer creates to avoid littering the pack/ directory with garbage.
Chris Hescock c7a1535f 2016-12-29T11:47:52 Extend packfile in increments of page_size. This improves performance by reducing the number of I/O operations.
Edward Thomson 909d5494 2016-12-29T12:25:15 giterr_set: consistent error messages Error messages should be sentence fragments, and therefore: 1. Should not begin with a capital letter, 2. Should not conclude with punctuation, and 3. Should not end a sentence and begin a new one
Carlos Martín Nieto d53cc13e 2016-03-31T04:12:46 Merge pull request #3575 from pmq20/master-13jan16 Remove duplicated calls to git_mwindow_close
Carlos Martín Nieto e50a49ee 2016-03-22T01:54:49 Merge pull request #3559 from yongthecoder/master Add a sanity check in git_indexer_commit to avoid subtraction overflow.
Carlos Martín Nieto 87c18197 2016-03-16T19:05:11 Split the page size from the mmap alignment While often similar, these are not the same on Windows. We want to use the page size on Windows for the pools, but for mmap we need to use the allocation granularity as the alignment. On the other platforms these values remain the same.
P.S.V.R d4e4f272 2016-01-13T11:07:14 Remove duplicated calls to git_mwindow_close
Yong Li b3eb2cde 2015-12-24T10:04:44 Avoid subtraction overflow in git_indexer_commit
Stefan Widgren c369b379 2015-07-31T16:23:11 Remove extra semicolon outside of a function Without this change, compiling with gcc and pedantic generates warning: ISO C does not allow extra ‘;’ outside of a function.
Edward Thomson 3e8c5e45 2015-06-10T16:43:48 Merge pull request #3174 from libgit2/cmn/idx-fill-hole indexer: use lseek to extend the packfile
Carlos Martín Nieto 02980bdc 2015-06-09T16:53:07 Initialize a few variables Coverity complains about the git_rawobj ones because we use a loop in which we keep remembering the old version, and we end up copying our object as the base, so we want to have the data pointer be NULL.
Carlos Martín Nieto aa57231f 2015-06-02T10:25:22 indexer: use lseek to extend the packfile We've been using `p_ftruncate()` to extend the packfile in order to mmap it and write the new data into it. This works well in the general case, but as truncation does not allocate space in the filesystem, it must do so when we write data to it. The only way the OS has to indicate a failure to allocate space is via SIGBUS which means we tried to write outside the file. This will cause everyone to crash as they don't expect to handle this signal. Switch to using `p_lseek()` and `p_write()` to extend the file in a way which tells the filesystem to allocate the space for the missing data. We can then be sure that we have space to write into.
Edward Thomson e2dd3735 2015-05-22T11:20:47 indexer: avoid loading already existent bases When thickening a pack, avoid loading already loaded bases and trying to insert them all over again.
Edward Thomson 7800048a 2015-03-17T10:06:50 Merge pull request #2972 from libgit2/cmn/pack-objects-walk [WIP] Smarter pack-building
Carlos Martín Nieto 7c63a33f 2015-03-13T19:41:40 indexer: bring back the error message on duplcate commits It turns out that erroring out on duplicate commits is the right thing to do, but git was not hitting the bug on the server-side. Bring back a descriptive error message in case of duplicate entries and error out.
Carlos Martín Nieto dccf59ad 2015-03-13T18:28:07 indexer: don't worry about duplicate objects If a packfile includes duplicate objects, we can choose to use the secon copy instead of the first by using the same logic as if it were the first. Change the error condition from 0 to -1, which indicates a bad resize, and set the OOM message in that case. This does mean we will leak the first copy of the object. We can deal with that later, but making fetches work is more important.
Carlos Martín Nieto a34692c4 2015-03-13T18:00:15 indexer: set an error message on duplicate objects in pack While this is not even close to a fix, we can at least set an error message so we know which error we are facing. Up to know we just returned an error without a message.
Carlos Martín Nieto b63b76e0 2014-10-12T11:42:31 Reorder some khash declarations Keep the definitions in the headers, while putting the declarations in the C files. Putting the function definitions in headers causes them to be duplicated if you include two headers with them.
Edward Thomson c251f3bb 2014-12-08T16:05:47 win32: remember to cleanup our hash_ctx
Ravindra Patel ec7e680c 2014-11-20T12:07:55 Fix for misleading "missing delta bases" error - Fix #2721.
Ravindra Patel 7561f98d 2014-11-19T14:54:30 Fix for memory leak issue in indexer.c, that surfaces on windows
Carlos Martín Nieto 177a29d8 2014-10-27T10:39:45 Merge commit 'refs/pull/2366/head' of github.com:libgit2/libgit2
William Swanson 01b432cf 2014-07-09T14:12:30 Properly report failure when expanding a packfile
Philip Kelley bc8a0886 2014-06-27T11:51:35 Fix assert when receiving uncommon sideband packet
Carlos Martín Nieto b3b66c57 2014-06-18T17:13:12 Share packs across repository instances Opening the same repository multiple times will currently open the same file multiple times, as well as map the same region of the file multiple times. This is not necessary, as the packfile data is immutable. Instead of opening and closing packfiles directly, introduce an indirection and allocate packfiles globally. This does mean locking on each packfile open, but we already use this lock for the global mwindow list so it doesn't introduce a new contention point.
Albert Meltzer 62e562f9 2014-05-18T07:54:41 Fix compiler warning (git_off_t cast to size_t). Use size_t for page size, instead of long. Check result of sysconf. Use size_t for page offset so no cast to size_t (second arg to p_mmap). Use mod instead div/mult pair, so no cast to size_t is necessary.
Albert Meltzer 9c4feef9 2014-05-17T12:44:21 Fix warning on uninitialized variable.
Carlos Martín Nieto 0731a5b4 2014-05-14T19:12:48 indexer: mmap fixes for Windows Windows has its own ftruncate() called _chsize_s(). p_mkstemp() is changed to use p_open() so we can make sure we open for writing; the addition of exclusive create is a good thing to do regardless, as we want a temporary path for ourselves. Lastly, MSVC doesn't quite know how to add two numbers if one of them is a void pointer, so let's alias it to unsigned char.C
Carlos Martín Nieto f7310540 2014-05-13T02:41:48 indexer: use mmap for writing Some OSs cannot keep their ideas about file content straight when mixing standard IO with file mapping. As we use mmap for reading from the packfile, let's make writing to the pack file use mmap.
Linquize b3f27c43 2014-05-13T21:08:50 Initialize local variable
Carlos Martín Nieto 2dde1e0c 2014-05-08T22:31:59 indexer: avoid memory moves Our vector does a move of the rest of the array when we remove an item. Doing this repeatedly can be expensive, and we do this a lot in the indexer. Instead, set the value to NULL and skip those entries. perf reported around 30% of `index-pack` time was going into memmove. With this change, that goes away and we spent most of the time hashing and inflating data.
Jacques Germishuys 48e60ae7 2014-04-21T11:23:29 Don't redefine the same callback types, their signatures may change
Russell Belfer e9d5e5f3 2014-01-28T16:25:42 Some fixes for Windows x64 warnings
Vicent Marti 557bd1f4 2014-01-14T10:27:57 Merge pull request #2043 from arthurschreiber/arthur/fix-memory-leaks Fix a bunch of memory leaks.
Arthur Schreiber 24953757 2014-01-14T19:08:58 Incorporate @arrbee's suggestions.
Edward Thomson c6f26b48 2013-12-13T18:26:46 Refactor zlib for easier deflate streaming
Arthur Schreiber ac44b3d2 2014-01-13T23:28:03 Incorporate @ethomson's suggestions.
Arthur Schreiber ddf1b1ff 2014-01-13T22:33:10 Fix a memory leak in `hash_and_save` and `inject_object`.
Russell Belfer 9cfce273 2013-12-12T12:11:38 Cleanups, renames, and leak fixes This renames git_vector_free_all to the better git_vector_free_deep and also contains a couple of memory leak fixes based on valgrind checks. The fixes are specifically: failure to free global dir path variables when not compiled with threading on and failure to free filters from the filter registry that had not be initialized fully.
Russell Belfer 7697e541 2013-12-11T15:02:20 Test cancel from indexer progress callback This adds tests that try canceling an indexer operation from within the progress callback. After writing the tests, I wanted to run this under valgrind and had a number of errors in that situation because mmap wasn't working. I added a CMake option to force emulation of mmap and consolidated the Amiga-specific code into that new place (so we don't actually need separate Amiga code now, just have to turn on -DNO_MMAP). Additionally, I made the indexer code propagate error codes more reliably than it used to.
Russell Belfer 26c1cb91 2013-12-09T09:44:03 One more rename/cleanup for callback err functions
Russell Belfer fcd324c6 2013-12-06T15:04:31 Add git_vector_free_all There are a lot of places that we call git__free on each item in a vector and then call git_vector_free on the vector itself. This just wraps that up into one convenient helper function.
Russell Belfer 25e0b157 2013-12-06T15:07:57 Remove converting user error to GIT_EUSER This changes the behavior of callbacks so that the callback error code is not converted into GIT_EUSER and instead we propagate the return value through to the caller. Instead of using the giterr_capture and giterr_restore functions, we now rely on all functions to pass back the return value from a callback. To avoid having a return value with no error message, the user can call the public giterr_set_str or some such function to set an error message. There is a new helper 'giterr_set_callback' that functions can invoke after making a callback which ensures that some error message was set in case the callback did not set one. In places where the sign of the callback return value is meaningful (e.g. positive to skip, negative to abort), only the negative values are returned back to the caller, obviously, since the other values allow for continuing the loop. The hardest parts of this were in the checkout code where positive return values were overloaded as meaningful values for checkout. I fixed this by adding an output parameter to many of the internal checkout functions and removing the overload. This added some code, but it is probably a better implementation. There is some funkiness in the network code where user provided callbacks could be returning a positive or a negative value and we want to rely on that to cancel the loop. There are still a couple places where an user error might get turned into GIT_EUSER there, I think, though none exercised by the tests.
Russell Belfer dab89f9b 2013-12-04T21:22:57 Further EUSER and error propagation fixes This continues auditing all the places where GIT_EUSER is being returned and making sure to clear any existing error using the new giterr_user_cancel helper. As a result, places that relied on intercepting GIT_EUSER but having the old error preserved also needed to be cleaned up to correctly stash and then retrieve the actual error. Additionally, as I encountered places where error codes were not being propagated correctly, I tried to fix them up. A number of those fixes are included in the this commit as well.
Jameson Miller db4cbfe5 2013-12-02T14:09:12 Updates to cancellation logic during download and indexing of packfile.
Edward Thomson 1e60e5f4 2013-11-07T12:03:44 Allow callers to set mode on packfile creation
Edward Thomson 1d3a8aeb 2013-11-04T18:28:57 move mode_t to filebuf_open instead of _commit
Russell Belfer 948f00b4 2013-11-01T09:38:03 Merge pull request #1933 from libgit2/vmg/gcc-warnings Warnings for Windows x64 (MSVC) and GCC on Linux
Vicent Marti 51a3dfb5 2013-11-01T16:31:02 pack: `__object_header` always returns unsigned values
Linquize 3343b5ff 2013-10-31T22:59:42 Fix warning on win64
Carlos Martín Nieto a6154f21 2013-10-30T15:00:05 indexer: remove the stream infix It was there to keep it apart from the one which read in from a file on disk. This other indexer does not exist anymore, so there is no need for anything other than git_indexer to refer to it. While here, rename _add() function to _append() and _finalize() to _commit(). The former change is cosmetic, while the latter avoids talking about "finalizing", which OO languages use to mean something completely different.
Vicent Martí 5c50f22a 2013-10-28T09:25:44 Merge pull request #1891 from libgit2/cmn/fix-thin-packs Add support for thin packs
Carlos Martín Nieto ab46b1d8 2013-10-23T15:08:18 indexer: include the delta stats The user is unable to derive the number of deltas in the pack, as that would require them to capture the stats exactly in the moment between download and final processing, which is abstracted away in the fetch. Capture these numbers for the user and expose them in the progress struct. The clone and fetch examples now also present this information to the user.
Carlos Martín Nieto 893055f2 2013-10-11T17:24:29 indexer: clearer stats for thin packs Don't increase the number of total objects, as it can produce suprising progress output. The only addition compared to pre-thin is the addition of local_objects to allow an output similar to git's "completed with %d local objects".
Carlos Martín Nieto 7fb6eb27 2013-10-08T11:54:50 indexer: inject one base at a time There may be multiple deltas referencing the same base as well as OFS deltas which rely on a thin delta. Deal with both at the same time by injecting a single object and going back up to the main delta-resolving loop.
Carlos Martín Nieto 0b33fca0 2013-10-02T13:39:35 indexer: fix thin packs When given an ODB from which to read objects, the indexer will attempt to inject the missing bases at the end of the pack and update the header and trailer to reflect the new contents.
Carlos Martín Nieto cf0582b4 2013-10-02T12:22:54 indexer: do multiple passes over the delta list Though unusual, a packfile may contain a delta whose base is a delta that comes later. In order index such a packfile, we must not give up on the first failure to resolve a delta, but keep it around. If there is a pass which makes no progress, this indicates that the packfile is broken, so fail accordingly.
Russell Belfer af302aca 2013-10-02T14:13:11 Clean up annoying warnings The indexer code was generating warnings on Windows 64-bit. I looked closely at the logic and was able to simplify it a bit. Also this fixes some other Windows and Linux warnings.
Jameson Miller 5b188225 2013-10-02T13:45:32 Support cancellation in push operation This commit adds cancellation for the push operation. This work consists of: 1) Support cancellation during push operation - During object counting phase - During network transfer phase - Propagate GIT_EUSER error code out to caller 2) Improve cancellation support during fetch - Handle cancellation request during network transfer phase - Clear error string when cancelled during indexing 3) Fix error handling in git_smart__download_pack Cancellation during push is still only handled in the pack building and network transfer stages of push (and not during packbuilding).