Notes on Building Git

Published on Saturday, March 14, 2020

I’m reading “Building Git” by James Coglan along with some fellow and former co-workers. Here’s my notes on each chapter.

Chapter 1: Intro

  • Git as a decentralized version control model
  • Git has a bad rap (hard to learn)
    • Confusing user interface
    • Many ways of doing the same thing
    • Unclear separation of responsibilities between commands
    • Unclear what’s going on under the hood
  • What are we gonna learn?
    • Unix system concepts
      • How to work with files
      • How processes communicate
      • How to present output in a terminal
    • Data structures used by Git to keep history companct + searchable
    • Diffing and merging algorithms
    • How Git implements a form of “distributed database”
    • How Git supports concurrent editing
    • Why merge conflicts occur (and how the prevent race conditions)

Questions

  • What are some of the most confusing parts of the Git interface?

Resources

  • This talk by Emily Xie is basically the cliff notes version of this book: https://youtu.be/Y2Msq90ZknI

Chapter 2: Getting to know .git

  • Contents of a .git repository
    • Works just fine w/o most of the files created on git init
    • So why create extra unnecessary stuff?
    • .git/config
      • filemode: whether the file is executable
      • reflog: log changes to files in .git/refs
        • refs: pointers to objects in .git/objects database
    • .git/HEAD
      • current commit as an id or symbolic reference (symref) to current branch
      • defaults to symref to master branch
    • .git/info/exclude: like a private .gitignore
    • .git/hooks/: custom hooks for custom behavior
    • .git/objects: Git’s database
      • Git compresses every file it stores using DEFLATE compression algorithm (from zlib)
    • .git/logs
      • text files
      • log every time a ref changes value
        • make a commit
        • check out a branch
        • merge
        • rebase
  • Commits
    • “root commit”: commit with no parents
    • .git/index is in binary
    • .git/logs are text files
    • git show command: prints out info about the HEAD commit
    • git cat-file -p <oid> command: show an object from Git database
    • Trees
      • tree id: hex value built from file contents, so same regardless of timestamp, committer, etc
      • Creates one tree for every diretory in your project, including root
      • Each entry is anotehr tree (aka subdirectory) or a blob
      • file mode: numeric representation of its type and permissions
      • Git stores literal contents for blobs (so git cat-file command returns the actual contents, not a tree id)
    • Git stores blobs by:
      • Prepending w/ the word “blob”
      • Then a whitespace character
      • Then length of blob
      • Then a null byte
      • Then compress with zlib
    • Git stores trees by:
      • Prepending w/ the word “tree”
      • Then a whitespace character
      • Then length of string below
      • Then string made up of:
        • Mode in text
        • Then a whitespace character
        • Then filename
        • Then a null byte
        • Then its ID packed into binary
    • Git stores commits as series of headers followed by commit message
  • How does Git calculate the filename for each object?
    • Uses SHA-1 hash of uncompressed file content
    • 2005: theoretical attack against SHA-1 “proven”
    • 2006: started discussing moving to SHA-256
    • 2017: Google announced an attack on SHA-1
      • “SHAttered”: a method for manufacturing collisions (so you could target a specific commit/file)
      • Easy to detect, and Git has issued patches to check for manufactured collisions and reject them
    • Git maintainers insist it doesn’t rely on SHA-1 as “proof of trust”
      • It’s an integrity check, not a security measure
  • Handy command line tools:
    • ruby -r zlib -e "STDOUT.write Zlib::Inflate.inflate(STDIN.read)": inflate zlib compressed file
    • wc: count bytes
    • hexdump: print out hexadecimal representation of bytes in a file

Questions

  • Why does git init create files that’s aren’t necessary for a fully functioning git repository?
  • Do we agree with Git maintainers that “security is a social problem” and not something they need to build into their hashing function?

Discussion notes

  • More on SHA-1 vs SHA-256: https://lwn.net/Articles/811068/

Part I: Storing Changes

Chapter 3: The first commit

OH boy we building stuff.

What we will build:

  • Just enough code to store a valid Git commit.

What we won’t build (yet):

  • Subdirectories
  • Executable fiels
  • Symlinks
  • add command
  • index
  • Command line argument processing
  • Named branches

Only 3 items in .git are essential:

  • objects directory
  • refs directory
  • HEAD file (symref to file in refs)
    • Note: valid to just contain a commit id
  • Helpful to know where those low level Ruby error names come from (ex. Errno::EACCES)
  • Book throws some shade at Ruby: “[Ruby’s filesytem interfaces File, Dir, Pathname, FileUtils design] is a little haphazard, and there is not a clear separation between methods for manipulating paths and methods for reading and writing files…”
    • Why is it designed this way?

Questions

  • What’s another example where you’d receive relative information from a user then want to convert it to an absolute value before further processing?
    • Time (convert to UTC)
  • Why didn’t we TDD this first part?
  • Did anyone else run into issues w/ piping cat ../COMMIT_EDITMSG to a commit message?

Discussion notes

  • Why are we skipping symrefs? Are they difficult to implement?
    • https://git-scm.com/docs/git-symbolic-ref
  • Should I write this in Elixir instead? Then I can’t cheat. Update: see my awkwardly named JitEx project.

Chapter 4: Making history

Goal: put the commits in order

More efficient to use parent > child than timestamps. Also avoids time conflicts.

Updating .git/HEAD:

  • Need to avoid dirty reads / writes
  • Need to at least appear to be atomic

Lockfile only used for HEAD (for now). Used for refs in the future?

Questions

  • How strong is the guarantee that this Lockfile gets us?
    • Only as good as how we check for it
      • Good approach: system level calls (machine can’t lie to itself)
      • Bad approach: put the check at the application level; now you’re at the mercy of race conditions between instances
    • What are other bad approaches you could take?
  • Why was this chapter so short? Why isolate these points in their own chapter?
  • Where did this concept of a Lockfile originate? Something Linus thought up, or insired by something else?
  • How much of a performance hit are we getting from implementing Lockfile?

Chapter 5: Growing trees

Time to run some executables.

Questions

  • Why are we storing things in octals? (and by “things”, I mean file modes). More compact than a string, sure.
  • “Sticky bits”… who named this?
  • Need to research Merkle trees. Behind Git and Bitcoin. Blockchain in general or just Bitcoin?
    • Answer: Blockchain.
  • Did anyone implement the unideal Workspace refactors? Learn anything interesting?
  • Why isn’t lockfile.rb organized under database directory?
  • Why is author.rb organized under database directory?

Chapter 6: The index

  • The add command and index. Why pair these two? How does one relate to the other? Answer: add adds files to the “index” aka staging area.
  • SortedSet: this seems like an object I should be using more often

Questions

  • Naming: why “index” instead of staging area? How is “index” an index in the traditional sense?

Discussion notes

  • Typo at the end of the chapter: OrderedSet instead of SortedSet

Chapter 7: Incremental change

  • Necessary improvements
    • Right now, our index overwrites itself. We need to be able to make incremental changes and nto lose everything
    • Commit command doesn’t read from the index. It still reads from the working tree
  • We initialize a new index every time we call add… seems like that needs to change.
  • But WAIT, we can rely on our Lockfile
  • Oh NOW we start a test suite. Better late than never, I suppose.

Questions

  • I don’t understand why we’re using Index#clear… we initialize a new Index instance everytime we call add, so shouldn’t we already have a fresh state?
  • Why wait until now to start testing? This was the first time we tried to break the system, but surely we could have broken it much sooner (in ways we’d want to protect against with tests). Or was this the “right time” to introduce tests?

Discussion notes

  • Typo on page 89: “wil signal whether…”

Chapter 8: First-class commands

  • Progress check: we have init, add and commit
  • Time to abstract!
  • Abstraction + decoupling from global objects/system calls will also make this more testable
  • Repository:
    • Interface for getting Database, Index, Refs, and Workspace
    • Knows filesystem locations
    • Use memoization for consistency (not performance, in this case) because some of the objects are stateful (Index)
  • DRY is not eliminating literal duplication. It’s about removing opportunity for maintainer to introduce inconsistency (change an implicit agreement that’s not enforced)
  • Hooray! Command classes!

Questions

  • What are other things we could test at this point?
    • Example: added files are written to the database (because we can’t read from the database yet… is that true?)
    • Error paths in add command
  • What assumptions made by the code aren’t expressed in the test suite?

Chapter 9: Status report

  • Using TDD to implement git status --porcelain. Why didn’t we use TDD from the start? Because “[t]he add and commit commands…are fairly straightforward tools whose code we execute frequently while developing the codebase. We wrote tests for the edge cases but by-and-large if these commands were not working we would quickly notice. In contrast, a command like status needs to deal with lots of possible combinations of states, not all of which we’ll just bump into in the normal course of our work.”
  • Remember: only write as much code as you need to make the tests pass. Cleanup later. Bad code that makes the tests pass can also inspire additional tests.
  • Next up: finally reading objects out of the .git/objects database

Questions

  • What was the difference between Index#load and Index#load_for_update again? Answer: locks.

Discussion

  • Many of the cited sources are from Wikipedia. Is that cool these days? Why not link to a non-Wiki source?

Chapter 10: The next commit

  • Previously on Building Git: we got most of the way through a fully functioning git status command.
  • But still need to be able to display difference between last commit and current index
  • To do so, we need to be able to:
    • Read the complete tree associated with HEAD commit
    • aka get the blob id for every item in the HEAD commit tree
    • Then compare to entries in index and detect differences
  • Step 1: script that prints all the files in HEAD commit
  • Finally, we’ll be able to read from the db (.git/objects)
  • Oh boy, escape codes. I’m well familiar with those from this post: How My Bash Color Settings Broke edeliver
  • End of TDD? Huh.

Questions

  • Why do we need the blob ids to detect differences? I feel like I’m missing something fundamental about how this all works. Answer: oh yeah, because the oid changes when there’s a change to the blob.
  • Why haven’t I used Ruby StringScanner before? Seems like a helpful library.
  • Are the “detect changes” methods order dependent? What happens if we change that order?

Discussion notes

  • Building objects from parsed strings always makes me nervous, but I guess in this case it’s fine, because the format is so strictly enforced.
  • Separate Tree objects for Index and Database. Good for duck typing, hiding “write” methods from Index. Any potential for confusion/unexpected breakages?
  • Does Ruby have an OrderedHash? Answer: nope, Rails: https://api.rubyonrails.org/v3.2/classes/ActiveSupport/OrderedHash.html
  • Should I be using more Structs in my Ruby code? Reaching for classes too soon?

Chapter 11: The Myers diff algorithm

  • What we’re missing:
    • See what lines we’ve changed from last commit
    • See commit log
    • Create and merge branches
  • High level diff summary
    • Display smallest number of edits possible
      • Easier to read
      • Less likely to produce conflicts
    • Deletions followed by insertions (minimize interleaving)
    • Align with code structure (example of where the end shows up when adding a new ruby method)
  • Myers algorithm is fast and greedy
    • Prefers deletions over insertions
    • Consumes as many lines that are the same before making a chnage (avoids the “wrong end” problem)
    • “SES”: shortest edit script
      • Can be modeled as a graph search

Questions

  • We need tests for this. More than anything else. Something to confirm we’re returning the diff we expect.

Discussion notes

  • More Git resources: Oh Shit Git zine by Katie Sylor-Miller and Julia Evans
  • Xgit: Git actually implemented in Elixir (with helpful blog posts about the process)

Chapter 12: Spot the difference

Questions

Discussion notes

Part II: Branching and merging

Chapter 13: Branching out

Questions

Discussion notes

Chapter 14: Migrating between trees

Questions

Discussion notes

Chapter 15: Switching branches

Questions

Discussion notes

Chapter 16: Reviewing history

Questions

Discussion notes

Chapter 17: Basic merging

Questions

Discussion notes

Chapter 18: When merges fail

Questions

Discussion notes

Chapter 19: Conflict resolution

Questions

Discussion notes

Chapter 20: Merging inside files

Questions

Discussion notes

Chapter 21: Correcting mistakes

Questions

Discussion notes

Chapter 22: Editing messages

Questions

Discussion notes

Chapter 23: Cherry-picking

Questions

Discussion notes

Chapter 24: Reshaping history

Questions

Discussion notes

Part III: Distribution

Chapter 25: Configuration

Questions

Discussion notes

Chapter 26: Remote repositories

Questions

Discussion notes

Chapter 27: The network protocol

Questions

Discussion notes

Chapter 28: Fetching content

Questions

Discussion notes

Chapter 29: Pushing changes

Questions

Discussion notes

Chapter 30: Delta compression

Questions

Discussion notes

Chapter 31: Compressing packs

Questions

Discussion notes

Chapter 32: Packs in the database

Questions

Discussion notes

Chapter 33: Working with remote branches

Questions

Discussion notes

Chapter 34: …and everything else

Questions

Discussion notes