lib/pack_create.c


Log

Author Commit Date CI Message
Stefan Sperling fae7e038 2022-05-07T11:50:56 run the search for deltas to reuse in got-read-pack This significantly speeds up the deltification step of packing by avoiding imsg traffic. gotadmin no longer requests individual raw deltas from got-read-pack to check whether it can reuse them. Instead, got-read-pack obtains a list of objects we want to pack, and hands back the list of all deltas in its pack file which can be reused. Messages are now batched such that imsg buffers are filled as much as possible. Another advantage is that deltas we are not going to reuse will no longer be written to the delta cache file, saving disk space. Before this patch, any raw delta candidate was written to the delta cache file by got-read-pack, and the decision whether to reuse the delta happened afterwards in the gotadmin process. Code for reading individual raw deltas is now unused and could be removed at some point. ok op@
Stefan Sperling f5e78e05 2022-05-04T15:39:15 avoid loop over the ID set which removes object IDs with reused deltas ok op@
Stefan Sperling 2f8438b0 2022-05-04T15:39:15 avoid 'remove unused' loop by storing excluded objects in a separate set ok op@
Stefan Sperling 2d9e6abf 2022-05-04T13:43:24 store deltas in compressed form while packing, both in memory and cache file This reduces memory and disk space consumption during packing. with tweaks + memleak on error fix from op@ ok op@
Stefan Sperling 611e8e31 2022-05-01T11:47:21 avoid subtraction of values larger than int in qsort(3) comparison callbacks tweak + ok tb@
Stefan Sperling d7b5a0e8 2022-04-20T14:00:12 inline struct got_object_id in struct got_object_qid Saves us from doing a malloc/free call for every item on the list. ok op@
Stefan Sperling cbc287dc 2022-04-19T20:08:41 reimplement object-ID set data structure on top of a hash table Siphash suggested by jrick as a better alternative to murmurhash for this use case. with small fixes from and ok op@
Stefan Sperling 70f8f24d 2022-04-14T15:05:19 speed up initial stage of packing by adding a "skip" commit color

    The skip color marks boundary commits and their ancestors. Boundary commits are reachable both via references which we want to exclude from the pack, and via references which we want to include in the pack. We continue processing commit history up to the point we are left with only skip commits on the queue.

    This can speed up findtwixt() significantly and avoids wrong results produced by the old algorithm which made no distinction between "drop" and "skip".

    This idea was first implemented by Michael Forney for git9: https://git.9front.org/plan9front/plan9front/2e47badb88312c5c045a8042dc2ef80148e5ab47/commit.html

    Michael's log message for git9 is reproduced below:

    git/query: refactor graph painting algorithm (findtwixt, lca)

    We now keep track of 3 sets during traversal:
    - keep: commits we've reached from head commits
    - drop: commits we've reached from tail commits
    - skip: ancestors of commits in both 'keep' and 'drop'

    Commits in 'keep' and/or 'drop' may be added later to the 'skip' set if we discover later that they are part of a common subgraph of the head and tail commits.

    From these sets we can calculate the commits we are interested in:
    lca commits are those in 'keep' and 'drop', but not in 'skip'.
    findtwixt commits are those in 'keep', but not in 'drop' or 'skip'.

    The "LCA" commit returned is a common ancestor such that there are no other common ancestors that can reach that commit. Although there can be multiple commits that meet these criteria, where one is technically lower on the commit-graph than the other, these cases only happen in complex merge arrangements and any choice is likely a decent merge base.

    Repainting is now done in paint() directly. When we find a boundary commit, we switch our paint color to 'skip'. 'skip' painting does not stop when it hits another color; we continue until we are left with only 'skip' commits on the queue.

    This fixes several mishandled cases in the current algorithm:
    1. If we hit the common subgraph from tail commits first (if the tail commit was newer than the head commit), we ended up traversing the entire commit graph. This is because we couldn't distinguish between 'drop' commits that were part of the common subgraph, and those that were still looking for it.
    2. If we traversed through an initial part of the common subgraph from head commits before reaching it from tail commits, these commits were returned from findtwixt even though they were also reachable from tail commits.
    3. In the same case as 2, we might end up choosing an incorrect commit as the LCA, which is an ancestor of the real LCA.
Theo Buehler bb6672b6 2022-04-14T11:51:32 make sure callers of got_object_idset_add() free data.
Stefan Sperling fbafdecf 2022-04-10T13:03:29 revert 03c03172 "drop a commit right away if it matches an excluded commit" This change resulted in a full history walk even when no objects will be added to the pack file. Fix this regression by reverting the change.
Stefan Sperling 14dbbf48 2022-04-10T12:15:46 for clarity, move the coloring loop from findtwixt() into a separate function
Stefan Sperling 1d765da3 2022-04-10T12:13:02 remove a pointless object-id dup/free dance in findtwixt()
Stefan Sperling 57bc7b6d 2022-04-10T12:10:52 don't forget to call the cancel callback while coloring commits in findtwixt()
Stefan Sperling 03c03172 2022-04-10T12:08:45 in findtwixt(), drop a commit right away if it matches an excluded commit
Stefan Sperling 912a163e 2022-04-10T11:35:53 the obj_types array in pack_create.c is no longer useful, remove it
Stefan Sperling 29e0594f 2022-04-09T17:34:51 make gotadmin pack -x option work with tag arguments
Stefan Sperling 9d34261e 2022-04-07T20:55:39 in load_object_ids(), process "their" commits and tags in the same loop No functional change, the end result is the same.
Stefan Sperling 6863cbf9 2022-03-21T19:59:03 fix pack progress object counter for loose objects Move pack progress object accounting to a single place. This makes it easier to account for the case where only loose objects are packed. A wrong amount of objects was reported before when packing loose ones.
Stefan Sperling c4e796b2 2022-03-21T16:08:41 in pack progress output, remove excluded objects from 'found' objects counter
Stefan Sperling cdeb891a 2022-03-21T15:52:15 fix a bug where 'gotadmin pack' packed too many objects unless -a was used
Christian Weisgerber bfc73a47 2022-03-19T14:53:07 explicitly include <unistd.h> for close(2)
Stefan Sperling b8af7c06 2022-03-15T10:45:02 print additional progress information while packing ok op@
Stefan Sperling 9b576444 2022-03-14T13:22:20 cache a list of known pack index files when the repository is opened Avoids overhead due to readdir calls while searching a pack index. ok op@
Christian Weisgerber e3f86256 2022-02-18T20:23:32 explicitly include <endian.h> for be32toh()
Stefan Sperling 28526235 2022-02-13T00:12:04 fix pack.sh test failure from reuse-deltas patch by tweaking progress output
Stefan Sperling 67fd6849 2022-02-13T00:10:25 reuse existing deltas when creating pack files tested by thomas, naddy, and myself
Stefan Sperling 72840534 2022-01-19T12:04:58 compress delta data from delta_cache directly into pack file
Stefan Sperling 402a5ec1 2022-01-10T13:13:16 set a cap on the amount of memory we use to store encoded deltas
Stefan Sperling 5060d5a1 2022-01-10T11:09:25 encode short deltas in memory instead of writing them to a temporary file
Stefan Sperling 64a8571e 2022-01-07T23:32:27 map raw object files into memory while packing if possible
Stefan Sperling 59b21794 2022-01-07T14:33:52 only open raw objects if necessary while writing out pack file data significantly speeds up the "writing pack: " step of gotadmin pack
Stefan Sperling 211cfef0 2022-01-05T19:57:10 use time-based rate-limiting for gotadmin progress output Suggested by naddy some time ago. ok tracey
Stefan Sperling 22edbce7 2021-10-24T09:41:04 use up to 128 delta chain elements again; creates smaller packs at same speed
Stefan Sperling 4f4d853e 2021-10-24T09:41:04 try only 3 delta base candidates instead of 10 to speed up packing Tests by kn, thomas_adam and myself made on various repositories indicate that 3 is a good choice. Trying 10 deltas is much slower and does not result in significantly smaller pack files.
Stefan Sperling a319ca8c 2021-10-15T10:36:12 move encode_delta() in pack_create.c to eliminate a forward declaration
Stefan Sperling 74881701 2021-10-15T10:34:44 while packing, store encoded deltas in a temporary file instead of in memory
Stefan Sperling dc20764a 2021-10-15T09:30:29 limit delta chain length in newly created pack files to 32 deltas Our former limit was 128 which is fairly high. Git uses 50 by default. A smaller limit results in slightly larger pack files but makes both packing and unpacking faster.
Stefan Sperling 94dac27c 2021-10-15T09:24:56 raw object blocksize and read buffer were unused; remove them
Stefan Sperling d3c116bf 2021-10-15T09:10:14 cache raw objects in order to speed up gotadmin pack
Stefan Sperling cc7a354a 2021-10-15T07:15:00 reuse temporary files which were not used by got_object_raw_open()
Stefan Sperling 600b755e 2021-10-14T20:30:26 avoid opening delta base objects in genpack() just to find their size
Stefan Sperling 08347b73 2021-10-14T17:27:26 encode deltas in temporary files to avoid high memory usage
Stefan Sperling 1d19226a 2021-10-13T18:48:15 fix two more error strings in pack_create.c using the wrong function name
Stefan Sperling f8b19efd 2021-10-13T11:09:15 use RB_TREE instead of STAILQ to manage packindex bloom filters; much faster
Stefan Sperling 3af9de88 2021-09-22T13:32:37 fix 'got send' with tree objects which contain symlinks; reported by Omar
Stefan Sperling 26960ff7 2021-09-14T09:52:49 make 'got send' properly send commits which are referenced only by tags Problem reported by Omar Polo.
Stefan Sperling eca70f98 2021-09-03T09:51:31 fix 'got send' adding too many objects to the pack file in some cases Load server-side tags before loading local commits. Otherwise objects which are reachable via server-side tags will not be filtered out.
Stefan Sperling f8a36e22 2021-08-26T12:30:42 add 'got send' command for sending changes to remote repositories Known to work against git-daemon and github Git server implementations. Tests by abieber, naddy, jrick, and myself. Man page additions reviewed by Lucas.
Stefan Sperling dc7edd42 2021-08-22T12:58:34 fix miscalculation of the final pack file size reported by got_pack_create()
Stefan Sperling 07165b17 2021-07-01T14:57:10 cache object type in memory to speed up packing of objects referenced by tags
Stefan Sperling f4a2ff2d 2021-07-01T14:10:33 fix out-of-bounds access in 'gotadmin pack'; wrong array pointer in read_meta()
Christian Weisgerber dbdddfee 2021-06-23T20:48:35 switch from SIMPLEQ to equivalent STAILQ macros The singly-linked tail queue macros were added to OpenBSD 6.9 and are more widely available on other systems. ok stsp
Stefan Sperling 08736cf9 2021-06-23T10:16:23 fix imsg header includes in pack_create.c
Stefan Sperling 05118f5a 2021-06-22T19:37:20 implement gotadmin pack, indexpack, and listpack commands
Stefan Sperling e6bcace5 2021-06-22T19:34:53 initial port of git9's pack file creation code to gameoftrees; thank you, Ori!
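
Note on commit 611e8e31 ("avoid subtraction of values larger than int in qsort(3) comparison callbacks"): the general pitfall can be illustrated with a minimal sketch. The struct and function names below are hypothetical and are not taken from pack_create.c; the sketch only assumes that the values being compared are wider than int.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record; not the struct used in pack_create.c. */
struct entry {
	uint64_t size;
};

/*
 * Returning (int)(a->size - b->size) would truncate the 64-bit
 * difference to int and can report the wrong sign. Explicit
 * comparisons return -1, 0, or 1 regardless of magnitude.
 */
static int
entry_cmp(const void *pa, const void *pb)
{
	const struct entry *a = pa;
	const struct entry *b = pb;

	if (a->size < b->size)
		return -1;
	if (a->size > b->size)
		return 1;
	return 0;
}

/* Sort entries in ascending order of size. */
static void
sort_entries(struct entry *entries, size_t n)
{
	qsort(entries, n, sizeof(entries[0]), entry_cmp);
}
```

With a subtraction-based callback, two sizes that differ by exactly 2^32 would compare as equal after truncation to a 32-bit int; the explicit comparisons above do not have that problem.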