lib/pack_create.c


Log

Author Commit Date CI Message
Stefan Sperling db9b9b1c 2022-06-14T20:26:15 let got-read-pack be explicit about whether it could enumerate all objects This allows the main process to avoid looping over all object IDs again in case the pack file used for enumeration is complete. ok op@
Stefan Sperling eb7b30a1 2022-06-13T17:13:59 fix error handling in find_pack_for_enumeration(); pointed out by op@
Stefan Sperling 0ab4c957 2022-06-13T17:13:59 Bring back object enumeration inside got-read-pack as a fast path. The problem that was found in the earlier version has been fixed. ok op@
Stefan Sperling e44d9391 2022-06-07T19:20:01 revert object enumeration in got-read-pack for now; needs more work This implementation marked commits and trees as enumerated before all trees which they depend on were enumerated. This behaviour leads to incomplete pack files when a tree is only partially packed and got-read-pack hits a missing tree entry as a result. The algorithm must be reworked such that packed leaf nodes are marked enumerated first, then bubble-up. Found by op@
Stefan Sperling 9f4f302a 2022-06-07T16:04:15 free id and path in load_packed_tree_ids() on error, else they would leak pointed out by op@
Stefan Sperling cee6a7ea 2022-06-07T15:56:46 implement object enumeration support in got-read-pack ok op@
Stefan Sperling ce2bf7b7 2022-05-29T17:51:33 fix a bug in findtwixt() which caused pack files with missing parent commits The 'nskip' variable is supposed to reflect commits which are waiting on the queue and have the 'skip' color. Only increment 'nskip' when adding such commits to the queue. Problem observed with got send -T and a tag pointing to a deleted branch. Test to reproduce the bug written by op@.
Omar Polo d6a28ffe 2022-05-20T21:21:42 use random seeds for murmurhash2 change the three hardcoded seeds to fresh ones generated on demand via arc4random. Suggested/fixed by and ok stsp@
Omar Polo 17cfdba6 2022-05-20T21:19:30 include header
Stefan Sperling 411cbec1 2022-05-20T09:31:25 shrink struct got_pack_meta a bit by removing the have_reused_delta flag This flag can be expressed as m->reused_delta_offset != 0 because all deltas in valid pack files will be written at a non-zero offset. We allocate a huge number of these structs during packing, so every little bit helps.
Stefan Sperling adb4bbb2 2022-05-20T08:40:46 reduce the amount of memory used for caching deltas during deltification With files sorted properly for deltification we produce better deltas but end up consuming more memory and risk running into OpenBSD ulimits during packing. To compensate, reduce the threshold for the amount of delta data we store in memory, spooling more deltas into the cache file. ok op@
Stefan Sperling f8174ca5 2022-05-20T08:40:46 store a path hash instead of a verbatim path in pack meta data This reduces memory use by gotadmin pack. The goal is to sort files which share a path next to each other for deltification. A hash of the path is good enough for this purpose and consumes less memory than a verbatim copy of the path. Git does something similar. ok op@
Stefan Sperling 3e6ceea0 2022-05-20T08:40:46 fix paths stored in pack meta data, improving file deltification The old code was broken and stored an empty path or filenames, instead of a repository-relative path. Which means we didn't sort files for deltification as was intended. Fixing this provides much better deltas in large pack files written by gotadmin pack -a. In my test case, pack size changed from 2GB to 1.5GB. ok op@
Stefan Sperling 17259bfa 2022-05-19T09:26:13 plug a small memleak on error in got_pack_create()
Stefan Sperling e93fb944 2022-05-10T11:34:16 map delta cache file into memory if possible while writing a pack file with a fix from + ok op@
Stefan Sperling dc3fe1bf 2022-05-10T11:24:12 fix load_object_ids() such that packing tags works if zero commits are packed reported by jrick and op
Stefan Sperling fae7e038 2022-05-07T11:50:56 run the search for deltas to reuse in got-read-pack This significantly speeds up the deltification step of packing by avoiding imsg traffic. gotadmin no longer requests individual raw deltas from got-read-pack to check whether it can reuse them. Instead, got-read-pack obtains a list of objects we want to pack, and hands back the list of all deltas in its pack file which can be reused. Messages are now batched such that imsg buffers are filled as much as possible. Another advantage is that deltas we are not going to reuse will no longer be written to the delta cache file, saving disk space. Before this patch, any raw delta candidate was written to the delta cache file by got-read-pack, and the decision whether to reuse the delta happened afterwards in the gotadmin process. Code for reading individual raw deltas is now unused and could be removed at some point. ok op@
Stefan Sperling 2f8438b0 2022-05-04T15:39:15 avoid 'remove unused' loop by storing excluded objects in a separate set ok op@
Stefan Sperling f5e78e05 2022-05-04T15:39:15 avoid loop over the ID set which removes objects IDs with reused deltas ok op@
Stefan Sperling 2d9e6abf 2022-05-04T13:43:24 store deltas in compressed form while packing, both in memory and cache file This reduces memory and disk space consumption during packing. with tweaks + memleak on error fix from op@ ok op@
Stefan Sperling 611e8e31 2022-05-01T11:47:21 avoid subtraction of values larger than int in qsort(3) comparison callbacks tweak + ok tb@
Stefan Sperling d7b5a0e8 2022-04-20T14:00:12 inline struct got_object_id in struct got_object_qid Saves us from doing a malloc/free call for every item on the list. ok op@
Stefan Sperling cbc287dc 2022-04-19T20:08:41 reimplement object-ID set data structure on top of a hash table Siphash suggested by jrick as a better alternative to murmurhash for this use case. with small fixes from and ok op@
Stefan Sperling 70f8f24d 2022-04-14T15:05:19 speed up initial stage of packing by adding a "skip" commit color The skip color marks boundary commits and their ancestors. Boundary commits are reachable both via references which we want to exclude from the pack, and via references which we want to include in the pack. We continue processing commit history up to the point we are left with only skip commits on the queue. This can speed up findtwixt() significantly and avoids wrong results produced by the old algorithm which made no distinction between "drop" and "skip". This idea was first implemented by Michael Forney for git9: https://git.9front.org/plan9front/plan9front/2e47badb88312c5c045a8042dc2ef80148e5ab47/commit.html Michael's log message for git9 is reproduced below: git/query: refactor graph painting algorithm (findtwixt, lca) We now keep track of 3 sets during traversal: - keep: commits we've reached from head commits - drop: commits we've reached from tail commits - skip: ancestors of commits in both 'keep' and 'drop' Commits in 'keep' and/or 'drop' may be added later to the 'skip' set if we discover later that they are part of a common subgraph of the head and tail commits. From these sets we can calculate the commits we are interested in: lca commits are those in 'keep' and 'drop', but not in 'skip'. findtwixt commits are those in 'keep', but not in 'drop' or 'skip'. The "LCA" commit returned is a common ancestor such that there are no other common ancestors that can reach that commit. Although there can be multiple commits that meet this criteria, where one is technically lower on the commit-graph than the other, these cases only happen in complex merge arrangements and any choice is likely a decent merge base. Repainting is now done in paint() directly. When we find a boundary commit, we switch our paint color to 'skip'. 'skip' painting does not stop when it hits another color; we continue until we are left with only 'skip' commits on the queue. 
This fixes several mishandled cases in the current algorithm: 1. If we hit the common subgraph from tail commits first (if the tail commit was newer than the head commit), we ended up traversing the entire commit graph. This is because we couldn't distinguish between 'drop' commits that were part of the common subgraph, and those that were still looking for it. 2. If we traversed through an initial part of the common subgraph from head commits before reaching it from tail commits, these commits were returned from findtwixt even though they were also reachable from tail commits. 3. In the same case as 2, we might end up choosing an incorrect commit as the LCA, which is an ancestor of the real LCA.
Theo Buehler bb6672b6 2022-04-14T11:51:32 make sure callers of got_object_idset_add() free data.
Stefan Sperling fbafdecf 2022-04-10T13:03:29 revert 03c03172 "drop a commit right away if it matches an excluded commit" This change resulted in a full history walk even when no objects will be added to the pack file. Fix this regression by reverting the change.
Stefan Sperling 14dbbf48 2022-04-10T12:15:46 for clarity, move the coloring loop from findtwixt() into a separate function
Stefan Sperling 1d765da3 2022-04-10T12:13:02 remove a pointless object-id dup/free dance in findtwixt()
Stefan Sperling 57bc7b6d 2022-04-10T12:10:52 don't forget to call the cancel callback while coloring commits in findtwixt()
Stefan Sperling 03c03172 2022-04-10T12:08:45 in findtwixt(), drop a commit right away if it matches an excluded commit
Stefan Sperling 912a163e 2022-04-10T11:35:53 the obj_types array in pack_create.c is no longer useful, remove it
Stefan Sperling 29e0594f 2022-04-09T17:34:51 make gotadmin pack -x option work with tag arguments
Stefan Sperling 9d34261e 2022-04-07T20:55:39 in load_object_ids(), process "their" commits and tags in the same loop No functional change, the end result is the same.
Stefan Sperling 6863cbf9 2022-03-21T19:59:03 fix pack progress object counter for loose objects Move pack progress object accounting to a single place. This makes it easier to account for the case where only loose objects are packed. A wrong amount of objects was reported before when packing loose ones.
Stefan Sperling c4e796b2 2022-03-21T16:08:41 in pack progress output, remove excluded objects from 'found' objects counter
Stefan Sperling cdeb891a 2022-03-21T15:52:15 fix a bug where 'gotadmin pack' packed too many objects unless -a was used
Christian Weisgerber bfc73a47 2022-03-19T14:53:07 explicitly include <unistd.h> for close(2)
Stefan Sperling b8af7c06 2022-03-15T10:45:02 print additional progress information while packing ok op@
Stefan Sperling 9b576444 2022-03-14T13:22:20 cache a list of known pack index files when the repository is opened Avoids overhead due to readdir calls while searching a pack index. ok op@
Christian Weisgerber e3f86256 2022-02-18T20:23:32 explicitly include <endian.h> for be32toh()
Stefan Sperling 28526235 2022-02-13T00:12:04 fix pack.sh test failure from reuse-deltas patch by tweaking progress output
Stefan Sperling 67fd6849 2022-02-13T00:10:25 reuse existing deltas when creating pack files tested by thomas, naddy, and myself
Stefan Sperling 72840534 2022-01-19T12:04:58 compress delta data from delta_cache directly into pack file
Stefan Sperling 402a5ec1 2022-01-10T13:13:16 set a cap on the amount of memory we use to store encoded deltas
Stefan Sperling 5060d5a1 2022-01-10T11:09:25 encode short deltas in memory instead of writing them to a temporary file
Stefan Sperling 64a8571e 2022-01-07T23:32:27 map raw object files into memory while packing if possible
Stefan Sperling 59b21794 2022-01-07T14:33:52 only open raw objects if necessary while writing out pack file data significantly speeds up the "writing pack: " step of gotadmin pack
Stefan Sperling 211cfef0 2022-01-05T19:57:10 use time-based rate-limiting for gotadmin progress output Suggested by naddy some time ago. ok tracey
Stefan Sperling 22edbce7 2021-10-24T09:41:04 use up to 128 delta chain elements again; creates smaller packs at same speed
Stefan Sperling 4f4d853e 2021-10-24T09:41:04 try only 3 delta base candidates instead of 10 to speed up packing Tests by kn, thomas_adam and myself made on various repositories indicate that 3 is a good choice. Trying 10 deltas is much slower and does not result in significantly smaller pack files.
Stefan Sperling a319ca8c 2021-10-15T10:36:12 move encode_delta() in pack_create.c to eliminate a forward declaration
Stefan Sperling 74881701 2021-10-15T10:34:44 while packing, store encoded deltas in a temporary file instead of in memory
Stefan Sperling dc20764a 2021-10-15T09:30:29 limit delta chain length in newly created pack files to 32 deltas Our former limit was 128 which is fairly high. Git uses 50 by default. A smaller limit results in slightly larger pack files but makes both packing and unpacking faster.
Stefan Sperling 94dac27c 2021-10-15T09:24:56 raw object blocksize and read buffer were unused; remove them
Stefan Sperling d3c116bf 2021-10-15T09:10:14 cache raw objects in order to speed up gotadmin pack
Stefan Sperling cc7a354a 2021-10-15T07:15:00 reuse temporary files which were not used by got_object_raw_open()
Stefan Sperling 600b755e 2021-10-14T20:30:26 avoid opening delta base objects in genpack() just to find their size
Stefan Sperling 08347b73 2021-10-14T17:27:26 encode deltas in temporary files to avoid high memory usage
Stefan Sperling 1d19226a 2021-10-13T18:48:15 fix two more error strings in pack_create.c using the wrong function name
Stefan Sperling f8b19efd 2021-10-13T11:09:15 use RB_TREE instead of STAILQ to manage packindex bloom filters; much faster
Stefan Sperling 3af9de88 2021-09-22T13:32:37 fix 'got send' with tree objects which contain symlinks; reported by Omar
Stefan Sperling 26960ff7 2021-09-14T09:52:49 make 'got send' properly send commits which are referenced only by tags Problem reported by Omar Polo.
Stefan Sperling eca70f98 2021-09-03T09:51:31 fix 'got send' adding too many objects to the pack file in some cases Load server-side tags before loading local commits. Otherwise objects which are reachable via server-side tags will not be filtered out.
Stefan Sperling f8a36e22 2021-08-26T12:30:42 add 'got send' command for sending changes to remote repositories Known to work against git-daemon and github Git server implementations. Tests by abieber, naddy, jrick, and myself. Man page additions reviewed by Lucas.
Stefan Sperling dc7edd42 2021-08-22T12:58:34 fix miscalculation of the final pack file size reported by got_pack_create()
Stefan Sperling 07165b17 2021-07-01T14:57:10 cache object type in memory to speed up packing of objects referenced by tags
Stefan Sperling f4a2ff2d 2021-07-01T14:10:33 fix out-of-bounds access in 'gotadmin pack'; wrong array pointer in read_meta()
Christian Weisgerber dbdddfee 2021-06-23T20:48:35 switch from SIMPLEQ to equivalent STAILQ macros The singly-linked tail queue macros were added to OpenBSD 6.9 and are more widely available on other systems. ok stsp
Stefan Sperling 08736cf9 2021-06-23T10:16:23 fix imsg header includes in pack_create.c
Stefan Sperling 05118f5a 2021-06-22T19:37:20 implement gotadmin pack, indexpack, and listpack commands
Stefan Sperling e6bcace5 2021-06-22T19:34:53 initial port of git9's pack file creation code to gameoftrees; thank you, Ori!
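
Note on 611e8e31 (qsort(3) comparison callbacks): the log entry above only says that subtracting values larger than int was avoided. The sketch below shows the general technique and is not the code from pack_create.c. The struct entry type, its size field, and the function names are hypothetical, chosen only for illustration. Returning the difference of two 64-bit values from a callback that returns int can truncate the result, so the sign (or a zero) may be wrong; comparing explicitly avoids that.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record; only the 64-bit key matters for this sketch. */
struct entry {
	uint64_t size;
};

/*
 * Comparison callback for qsort(3).
 * Writing "return a->size - b->size;" would convert a 64-bit
 * difference to int and could report a wrong ordering.
 * Explicit comparisons always yield -1, 0, or 1.
 */
int
cmp_by_comparison(const void *a, const void *b)
{
	const struct entry *ea = a;
	const struct entry *eb = b;

	if (ea->size < eb->size)
		return -1;
	if (ea->size > eb->size)
		return 1;
	return 0;
}

/* Sort an array of entries in ascending order of size. */
void
sort_entries(struct entry *v, size_t n)
{
	assert(v != NULL || n == 0);
	if (n > 1)
		qsort(v, n, sizeof(*v), cmp_by_comparison);
}
```

With keys such as 0 and 2^32, a subtraction-based callback that truncates to a 32-bit int would treat the two as equal, while the explicit comparison orders them correctly.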