Git: an introduction for programmers
If you're like me, you just want to know how something works, along with a basic understanding of the motivation behind the decisions that went into the design of the software. I could not find a good introduction to Git written this way, so I will try to write one. I am not going to walk through how to install Git; you can find good instructions on the Internet, or just download it from the Git website.
Git is not like CVS, Subversion, Perforce, Bazaar, etc. It was designed to be a distributed version control system, so there is no need for a centralized server. The way Junio Hamano and Linus Torvalds solved this problem was to create a database that stores all the commits as atomic files, together with an acyclic graph that holds the commit states. I will describe this in detail later, but for now the key point is that Git does not think of its history as individual changes to files, or deltas, the way traditional version control systems do.
Traditional systems store information as delta points. For example, file A has two delta points, one at version 2 and another at version 4. Each delta stores how the file was changed (any new, deleted, or modified lines), not the entire file. If you want file A at version 5, you have to start from the original file and replay the two deltas in order, applying the changes stored at version 2 and then those stored at version 4.
Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini file system. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again; it stores just a link to the previous identical file it has already stored. The figure below shows how Git thinks about its data.
The nodes with dotted borders are not stored; they are links that point to the contents of a file that has not changed. For example, A1 with the dotted border points to the node labeled A1 with the solid border. All the green nodes with a solid border are stored in the Git repo. Each version is just a list of pointers, which we will explain later.
Each repository has a full copy of this database and graph, which is why Git is so fast: most operations never have to leave your machine. When you create a new repo or clone a repo, Git creates a hidden directory named .git that contains the complete database and all supporting files, such as configuration. Let's start by creating an empty repo and adding a file.
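The commands for this step are not in this copy of the post, so here is a minimal sketch of what creating a repo and adding a file might look like. The directory name `my-empty-repo`, the file name `A`, and the identity settings are assumptions made for illustration.

```shell
# Work in a throwaway directory so nothing existing is touched.
cd "$(mktemp -d)"

# Create an empty repo; this is what creates the hidden .git directory.
git init -q my-empty-repo
cd my-empty-repo
ls -a                      # .git shows up next to . and ..

# Placeholder identity (an assumption) so the commit works on a fresh machine.
git config user.name "Example User"
git config user.email "user@example.com"

# Add one file and commit it.
echo "hello" > A
git add A
git commit -q -m 'version 1'

# The database and supporting files live under .git
ls .git
git log --oneline
```

The working directory holds only the file A; the history is kept entirely inside .git.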
Now it's important to talk about how Git stores its versions. All the magic is in the .git/objects directory. You can think of this directory as a key-value store where the key is a SHA-1 hash of the value and the value is the content of a file. When you commit, or save the state, Git hashes all the files it is tracking and checks whether it needs to add them to the key-value store. So the store will always have a complete copy of the original file, and you can always look it up with its hash code.
For example, in the scenario above, when file A is committed in version 1, Git hashes file A and checks whether it is in the store. In this case it’s new, so Git adds it to the store. In the next commit, file A is modified and committed in version 2, so file A’s hash is different and the file needs to be added to the store because the hash code is new. In version 3, file A is not modified, so its hash is the same as version 2’s hash and Git does not need to store the file again because it already has a copy under that hash.
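The three-version scenario can be sketched with `git hash-object`, which prints the key Git would use for a piece of content. The sample contents are made up for illustration. Note that Git hashes the content together with a short header, so the key is not the same as a plain SHA-1 of the file.

```shell
# With --stdin, git hash-object reads content from standard input
# and only prints the hash; nothing is written to any repository.
v1=$(printf 'first draft\n'  | git hash-object --stdin)
v2=$(printf 'second draft\n' | git hash-object --stdin)   # A was modified
v3=$(printf 'second draft\n' | git hash-object --stdin)   # A is unchanged since version 2

echo "version 1: $v1"
echo "version 2: $v2"
echo "version 3: $v3"

# Same content gives the same key, so there is nothing new to store.
if [ "$v2" = "$v3" ]; then
    echo "version 3 reuses the object stored for version 2"
fi
```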
If you look at the directory, you will also see that the SHA-1 hashes are partitioned into subdirectories: the first two characters of the hash are the directory name and the rest are the file name. If you take the directory name and the file name and join them, you will have the full hash. The files are compressed, so you cannot just view them, but you can use the command below to view a file's contents.
$ git cat-file -p {sha-1-hash}
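As a rough end-to-end sketch of the lookup described above, the script below stores one file, finds its object on disk, and prints it back. The repo name `demo` and the file contents are placeholders chosen for this example.

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo

echo "hello git" > A
git add A                          # staging writes A's content into the store

hash=$(git hash-object A)          # the key for A's content
dir=$(echo "$hash" | cut -c1-2)    # first two characters: subdirectory name
file=$(echo "$hash" | cut -c3-)    # remaining characters: file name

ls ".git/objects/$dir/$file"       # the compressed object on disk
git cat-file -t "$hash"            # the object's type
git cat-file -p "$hash"            # the decompressed contents
```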
Now that we understand how Git stores all its files, let's talk about how it tracks versions. As you can see, everything in Git is referenced by a SHA-1 hash, so building an acyclic graph that represents a sequence of commits is easy. Just look at the figure below.
As you can see, each commit points to a tree that has a list of pointers to the files Git stored in its key-value store. The tree node holds additional metadata because Git stores a file's contents, not its path and metadata, so that information has to be saved elsewhere. If you're observant, you will notice that the commit and tree nodes link to each other using hash codes. This is because Git stores the version graph the same way it stores files. Each node in the graph is just a file, and each file/node can reference other nodes, so all you need to start walking the graph is a hash code.
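The commit, tree, and file links can be followed by hand with `git cat-file`. This is a small sketch in a scratch repo; the repo name, file contents, and identity settings are assumptions for illustration.

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "hello git" > A
git add A
git commit -q -m 'version 1'

commit=$(git rev-parse HEAD)               # hash of the commit node
tree=$(git rev-parse 'HEAD^{tree}')        # hash of the tree the commit points to
blob=$(git rev-parse 'HEAD:A')             # hash of A's content, as listed in the tree

git cat-file -p "$commit"    # includes a "tree <hash>" line, the author, and the message
git cat-file -p "$tree"      # lists mode, type, hash, and file name for A
git cat-file -p "$blob"      # prints the content of A
```

Starting from nothing but the commit hash, each step reads one object and finds the hash of the next.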
When we need to go back to a snapshot, we just locate its tree, get the list of all its pointers, and start copying the files from the Git store to the disk. This brings us to another big difference between Git and traditional version control systems: Git needs to be told what it is tracking. Git is just a database with the ability to take snapshots of your work, so you need to let it know what to track. Unlike traditional systems, if you have modified 5 files in your work area, you can tell Git to snapshot only 2 of them in a commit. This is very useful but sometimes confusing, so let's take an example.
If you create an empty directory, add 5 files A, B, C, D, and E, and ask Git for the status of the directory, Git will list all 5 files as untracked. In other words, you have not told Git that you care about these files, so it is not tracking any history for them. To start tracking a file you need to add it to a snapshot. If you add A, B, and E to a snapshot and commit it, Git will start tracking those three files. After that, if you modify files A and E and ask Git for the status, it will tell you that out of the tracked files A, B, and E, only A and E have changed.
It is important to understand that the status of tracked files is different from what is in a snapshot. At any point you can ask Git whether it sees any differences between its active snapshot and the current directory, which it finds by comparing hash codes. Git will not automatically add changes to the next snapshot; you have to explicitly add what goes into it. So in the previous example, even though you edited A and E, you can tell Git to add only E's change to the snapshot.
$ cd /tmp/
$ git init my-empty-repo2
$ cd my-empty-repo2
$ touch A B C D E
$ git status
$ git add A B E # tell Git to add A, B, E to the next snapshot
$ git commit -m 'version 1' # Git now tracks A, B, E (they are part of a commit)
$ git status
$ echo "EDIT A" > A
$ echo "EDIT E" > E
$ git status
$ git status -s # note that the M is in the second column, i.e. " M" not "M "
$ git add E
$ git status -s # E's M is now in the first (staged) column, i.e. "M " instead of " M"
$ git commit -m 'version 2' # Git saves only E's change; A stays modified
$ git status
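For scripts, the same three states can be listed separately instead of being read off the status columns. The sketch below rebuilds the example in a scratch repo (the repo name and identity settings are assumptions) and stops just before the second commit.

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

touch A B C D E
git add A B E
git commit -q -m 'version 1'

echo "EDIT A" > A
echo "EDIT E" > E
git add E                                   # only E goes into the next snapshot

staged=$(git diff --cached --name-only)     # changes already added to the next snapshot
unstaged=$(git diff --name-only)            # tracked and modified, but not added
untracked=$(git ls-files --others)          # files Git was never told about

echo "staged:    $staged"
echo "unstaged:  $unstaged"
echo "untracked:" $untracked
```

Here E is staged, A is modified but left out, and C and D remain untracked.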
At this point I hope you are seeing how Git's history is an acyclic graph, where each snapshot is just a node. This brings us to the real power of Git: branching. Because your history is just a graph, a branch is just a fork in the graph. You create a new node that points to its parent. That is it!
Let's briefly talk about sharing your work. Git is just a key-value store of files named by their SHA-1 hash codes, so if two developers have forked the same Git repo and need to merge their work, syncing the stores is easy. Remember that files in Git are identified by their SHA-1 hashes and the graph nodes are stored the same way, so if two files or nodes are not identical they will have different hashes. That makes it safe to merge/sync the two Git stores, and once you have a pointer to a node in the graph you can start walking the graph.
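Both ideas can be checked in a scratch repo. In this sketch the branch name `feature`, the clone directory `copy`, and the identity settings are hypothetical names chosen for the example.

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "one" > A
git add A
git commit -q -m 'version 1'
base=$(git rev-parse HEAD)

# Creating a branch only records a new name for the current commit hash.
git branch feature
git rev-parse feature                      # same hash as $base

# Committing on the branch adds one node whose parent is $base.
git checkout -q feature
echo "two" > A
git commit -q -a -m 'version 2 on feature'
tip=$(git rev-parse feature)
parent=$(git rev-parse 'feature^')
echo "tip:    $tip"
echo "parent: $parent"

# A clone copies the same objects, so the hashes match in both repos.
git clone -q . ../copy
git -C ../copy rev-parse HEAD              # same hash as $tip
```

Because identical nodes always have identical hashes, the two repos agree on every commit they share without any central server assigning version numbers.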
I plan to write more on Git, so stay tuned! If you have a request or things you would like me to cover next, please leave a comment on this post.
- demetriusj-blog posted this