Implement Git by yourself (1: Introduction)

Git is arguably the most popular version control system (VCS). As a developer, you probably use Git’s porcelain commands in your daily work and treat Git as a black box. But:

If you can’t make one, you don’t know how it works.

Git’s source code is publicly available, but reading through the repository today is quite a challenge. What’s worse, Git is written in C, which can be hard for many of us to follow, and as Linus Torvalds’s masterpiece it contains a lot of tricky code.

So we are going to implement Git ourselves. This time we will use C# on .NET Core, which is a cross-platform runtime.

This blog series has two major references:

  1. ugit: DIY Git in Python
  2. Git Internals

1 Structure of .git folder

The first step in using Git is to run git init in the target folder. This creates a sub-folder named .git.

In this first version, we’re going to dive into the HEAD, objects, and refs items, which are the core parts of Git.

In general, the .git folder is a file-based database, which means we can restore the codebase as long as the .git folder is intact.
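
To make the layout concrete, here is a minimal sketch of what an init step could create. It is written in Python rather than the C# used in this series, purely for brevity. The function name init_repo is hypothetical, the branch name master is an assumption (it depends on configuration), and the real git init also writes items such as config, description, and hooks that are left out here.

```python
import os
import tempfile


def init_repo(root):
    """Create a bare-bones .git layout: HEAD, objects/ and refs/."""
    git_dir = os.path.join(root, ".git")
    os.makedirs(os.path.join(git_dir, "objects"), exist_ok=True)
    os.makedirs(os.path.join(git_dir, "refs", "heads"), exist_ok=True)
    os.makedirs(os.path.join(git_dir, "refs", "tags"), exist_ok=True)
    # HEAD starts as a symbolic reference to a branch that has no commit yet.
    # The branch name "master" is an assumption, not something Git guarantees.
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    return git_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(sorted(os.listdir(init_repo(tmp))))
```

Everything else in this post is about reading and writing files under these three items.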

2 Git Object

The plumbing command git hash-object takes some data, stores it in the .git/objects directory (when the -w option is given), and then prints the unique key that maps to this data object.

Let’s see what happened in the .git/objects directory:

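It creates a file named 70460b4b4aece5915caf5c68d12f560a9fe3e4 in the d6 folder: the first two characters of the 40-character key become the folder name and the remaining 38 become the file name. Where does this value come from?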

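It is the SHA-1 digest of the stored data, which consists of a header followed by the content. The header is the object type, a space, the content size in bytes, and a null byte.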

The type could be

  • blob: the content of a regular file
  • tree: a folder
  • commit: a commit record
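
The key computation described above can be sketched as follows. This is Python rather than the C# of this series, and hash_object here is a hypothetical helper, not Git’s own code. It assumes the object file is the zlib-compressed header plus content, which is how Git stores loose objects.

```python
import hashlib
import os
import tempfile
import zlib


def hash_object(data, obj_type="blob", git_dir=None):
    """Return the 40-character key for data; store it if git_dir is given."""
    # Header: object type, a space, the content size in bytes, a null byte.
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    store = header + data
    key = hashlib.sha1(store).hexdigest()
    if git_dir is not None:
        # The first two hex characters name the folder, the other 38 the file.
        folder = os.path.join(git_dir, "objects", key[:2])
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, key[2:]), "wb") as f:
            f.write(zlib.compress(store))  # loose objects are zlib-deflated
    return key


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(hash_object(b"test content\n", git_dir=tmp))
```

For the input b"test content\n", the example used in the Git Internals chapter, this should print d670460b4b4aece5915caf5c68d12f560a9fe3e4, which is the d6 folder and file name seen above.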

With the SHA-1 value, we can also restore the content of the blob easily.
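
Restoring works by reversing the steps used for storing. The Python sketch below (again a stand-in for the C# of this series) uses a hypothetical read_object helper and assumes the zlib-compressed "header, null byte, content" layout of loose objects. In real Git, the plumbing command git cat-file -p does this job.

```python
import hashlib
import os
import tempfile
import zlib


def read_object(git_dir, key):
    """Return (type, content) for the object stored under key."""
    path = os.path.join(git_dir, "objects", key[:2], key[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    # Everything before the first null byte is the "<type> <size>" header.
    header, _, content = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Write one object by hand so that there is something to read back.
        store = b"blob 6\x00hello\n"
        key = hashlib.sha1(store).hexdigest()
        os.makedirs(os.path.join(tmp, "objects", key[:2]))
        with open(os.path.join(tmp, "objects", key[:2], key[2:]), "wb") as f:
            f.write(zlib.compress(store))
        print(read_object(tmp, key))
```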

As we have seen, a blob does not include a file name or any attribute information. All of that is tracked in tree objects.

From the output above, the structure looks like this:

With this nested structure, it is possible to restore every file with its correct file name and folder.
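
As a rough illustration of how a tree ties names to keys, the Python sketch below builds and parses the body of a tree object. Each entry is a mode, a space, a name, a null byte, and the 20 raw bytes of the SHA-1 key. The helper names are hypothetical, and sorting entries by plain name is a simplification of Git’s real ordering rule.

```python
import hashlib


def serialize_tree(entries):
    """Build a tree body from (mode, name, key) tuples."""
    body = b""
    # Sorting by plain name is a simplification of Git's ordering rule.
    for mode, name, key in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(key)
    return body


def parse_tree(body):
    """Split a tree body back into (mode, name, key) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        end = body.index(b"\x00", pos)
        mode, name = body[pos:end].decode().split(" ", 1)
        key = body[end + 1:end + 21].hex()  # 20 raw SHA-1 bytes
        entries.append((mode, name, key))
        pos = end + 21
    return entries


def tree_key(body):
    """Hash a tree body like a blob, but with the type 'tree' in the header."""
    header = f"tree {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    entries = [
        ("100644", "a.txt", "d670460b4b4aece5915caf5c68d12f560a9fe3e4"),
        ("40000", "src", tree_key(b"")),  # a sub-folder is itself a tree
    ]
    print(parse_tree(serialize_tree(entries)))
```

A sub-folder is just another entry whose key points to a second tree, which is what makes the structure nest.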

Every git commit creates an object as well, which includes:

  1. The tree that captures the current working directory
  2. The previous (parent) commit object
  3. Author and committer information
  4. The commit message
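
The four parts listed above map onto the plain-text body of a commit object. The Python sketch below assembles such a body. The build_commit and commit_key helpers are hypothetical, the author string is a made-up example, and the timezone is fixed to +0000 for simplicity.

```python
import hashlib
import time


def build_commit(tree, parent, author, message, timestamp=None):
    """Return the body of a commit object as bytes."""
    when = int(time.time()) if timestamp is None else timestamp
    lines = [f"tree {tree}"]
    if parent:  # the very first commit has no parent line
        lines.append(f"parent {parent}")
    stamp = f"{author} {when} +0000"  # timezone fixed to UTC for simplicity
    lines.append(f"author {stamp}")
    lines.append(f"committer {stamp}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def commit_key(body):
    """Hash a commit body with the type 'commit' in the header."""
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    body = build_commit(
        "4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # key of an empty tree
        None,
        "A U Thor <author@example.com>",  # hypothetical identity
        "first commit",
    )
    print(body.decode())
    print(commit_key(body))
```

Because the parent key is part of the hashed body, every commit key depends on the whole history before it.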

3 Git Reference

We can travel through the commit history using the commit ID (the SHA-1 value), but such a long value is too hard to remember. Git provides a readable mechanism, references, for reaching a specific commit.

The .git/refs directory contains two directories, heads and tags. Each file in .git/refs/heads represents an individual branch, and each file in .git/refs/tags represents a tag you created.

Let’s look at the content of .git/refs/heads/master. The file holds a single commit ID, which here resolves to the commit with this message:

second commit

The same goes for a file under .git/refs/tags, which here resolves to:

first commit
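
Since each branch or tag is just a small file holding an ID, listing them takes only a directory walk. The Python sketch below uses a hypothetical list_refs helper and assumes every ref is stored as a loose file. Real Git may also keep refs in a .git/packed-refs file, which this sketch ignores.

```python
import os
import tempfile


def list_refs(git_dir, kind):
    """Map each ref name to the ID it stores, for kind 'heads' or 'tags'."""
    base = os.path.join(git_dir, "refs", kind)
    refs = {}
    for folder, _, files in os.walk(base):
        for name in files:
            path = os.path.join(folder, name)
            # Ref names may contain '/', so keep the path relative to base.
            ref = os.path.relpath(path, base).replace(os.sep, "/")
            with open(path) as f:
                refs[ref] = f.read().strip()
    return refs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        heads = os.path.join(tmp, "refs", "heads")
        os.makedirs(heads)
        with open(os.path.join(heads, "master"), "w") as f:
            f.write("28cd27cccf7ed33b4556e2ea66d06cdbbac038fc\n")
        print(list_refs(tmp, "heads"))
```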

How does Git know which branch we are on? The answer is the HEAD file, which normally contains a symbolic reference such as ref: refs/heads/master.

When you add a commit, the HEAD file tells Git which branch head needs to be updated.
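
That lookup can be sketched in a few lines of Python. The update_head helper is hypothetical and assumes HEAD holds either a line of the form "ref: <path>" or a raw commit ID.

```python
import os
import tempfile


def update_head(git_dir, new_commit):
    """Record new_commit where HEAD says it belongs and return that path."""
    head_path = os.path.join(git_dir, "HEAD")
    with open(head_path) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        # HEAD names a branch, so that branch's file is the one to update.
        target = os.path.join(git_dir, *head[len("ref: "):].split("/"))
    else:
        # HEAD holds a raw commit ID, so only HEAD itself moves.
        target = head_path
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:
        f.write(new_commit + "\n")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "HEAD"), "w") as f:
            f.write("ref: refs/heads/master\n")
        print(update_head(tmp, "28cd27cccf7ed33b4556e2ea66d06cdbbac038fc"))
```

Note that HEAD itself is left untouched when it names a branch; only the branch file changes.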

What if you run git checkout <commit id>?

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
  git checkout -b <new-branch-name>

HEAD is now at 28cd27c first commit
$ cat .git/HEAD
28cd27cccf7ed33b4556e2ea66d06cdbbac038fc

Now HEAD no longer points to a ref, just to a commit ID. We call this state a detached HEAD. It is risky: commits made in this state belong to no branch, so once you switch to another branch they are hard to find again unless you noted their IDs.
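
Telling the two states apart only requires looking at the content of HEAD. The Python sketch below uses a hypothetical head_state helper and assumes 40-character SHA-1 IDs.

```python
import os
import re
import tempfile


def head_state(git_dir):
    """Return ('branch', ref path) or ('detached', commit ID)."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        return ("branch", head[len("ref: "):])
    if re.fullmatch(r"[0-9a-f]{40}", head):
        return ("detached", head)
    raise ValueError(f"unrecognised HEAD content: {head!r}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "HEAD"), "w") as f:
            f.write("28cd27cccf7ed33b4556e2ea66d06cdbbac038fc\n")
        print(head_state(tmp))
```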

4 Conclusion

These are the basics of Git internals, and a good starting point for implementing the basic features ourselves.

A software developer at Microsoft in Suzhou. Most of my articles are written in Chinese. I will try writing in English when I’m ready.