This blog post explains how Git uses content addressing to manage versions efficiently.
The principle of software version management using Git
Collaboration is essential in modern software development. When multiple developers modify and update code at the same time, their changes frequently conflict or produce unwanted results. A version control system is the key tool for handling this. It does more than simply record changes to files: it minimizes the conflicts that arise during collaboration and can restore a project to an earlier point in time. Tracking and recording every change is an essential function, and without it large-scale development projects cannot be carried out systematically.
When writing a long document on a computer, there are times when you want to go back to a previous state. In a word processor such as Hangul, you can press Ctrl+Z to undo. But if you keep writing after undoing, you may later decide that the state before the undo was better. At that point you cannot go back to it, because the history of that state was not retained.
The same problem arises in software development, and it becomes more serious because software must keep up with requirements that change over time. To solve it, software engineers use version control systems. A version control system is software that manages the history of documents (files), and Git is among the most popular with programmers. In this article, we will look at the characteristics of software version control and how Git’s internal structure is designed.
Version control systems and file systems
To understand version control systems, you first need to know what form software takes. Software is written in a programming language that computers can understand. The text written this way is called code. It may fit in a single file, but as the software grows it is divided into multiple files, and you then need to manage the history of all of them.
Computers use a file system to manage files. A file system is the part of a computer’s operating system that stores and manages files. In the commonly used Windows operating system, the structure of files and folders is easy to see through the desktop or Explorer. Files are located in specific folders and have names. The combination of a file’s location and name is called its path, and when there are multiple files, the path of each one distinguishes it from the others. A file system of this kind is called a location-addressed file system. In other words, the location of a file acts as the address that points to that file.
Git’s content-addressed file system
Git introduces a content-addressed file system to overcome a limitation of the location-addressed file system described above. In a location-addressed file system, the location where a file is stored is the only way to distinguish it. The location stays fixed while the contents stored there may change over time. Each change completely overwrites the previous contents at that path, so no history remains.
In the content-addressed file system used by Git, by contrast, the content of the file itself acts as the ID that distinguishes it. When the content changes, a new ID is created and the new version is saved as a separate object alongside the old one. This leaves a change history for every file and is what makes thorough version control possible.
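The contrast between the two kinds of addressing can be sketched in a few lines of Python. This is a toy model rather than Git’s actual storage code: the ContentStore class and its put and get methods are hypothetical names, and the ID is computed with SHA-1, the hash function covered in the next section.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the content itself."""

    def __init__(self):
        self._objects = {}  # ID (hex string) -> content bytes

    def put(self, content: bytes) -> str:
        object_id = hashlib.sha1(content).hexdigest()
        self._objects[object_id] = content  # the same content always maps to the same ID
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


# Location addressing: one path, and the new content overwrites the old.
by_path = {}
by_path["report.txt"] = b"first draft"
by_path["report.txt"] = b"second draft"  # the first draft is gone

# Content addressing: each version gets its own ID, so both survive.
store = ContentStore()
first_id = store.put(b"first draft")
second_id = store.put(b"second draft")
```

After this runs, by_path holds only the second draft, while store.get(first_id) still returns the first one.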
However, if every version of every file is saved, a large number of objects accumulate and take up a lot of storage space. Git keeps this under control in two ways: identical contents get the same ID and are therefore stored only once, and stored objects are compressed.
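As a rough illustration of how much compression can save, the sketch below compresses some repetitive sample text with Python’s zlib module. Git also compresses its stored objects with zlib, but the sample data here is made up, and real savings depend on the content.

```python
import zlib

# Made-up sample content: source code and text are often repetitive.
content = b"the same line of text\n" * 200

compressed = zlib.compress(content)
restored = zlib.decompress(compressed)

print(len(content), "bytes before compression")
print(len(compressed), "bytes after compression")
```

Decompression returns exactly the original bytes, so nothing in the history is lost by storing it compressed.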
Hash functions and how Git works
Git uses hash functions to implement its content-addressed file system. A hash function returns a fixed-length value for an input of arbitrary length. By default Git uses the SHA-1 hash function: it takes the contents of a file (together with a short header recording the object type and size) as input and outputs a 40-digit hexadecimal string that serves as the ID. If this string changes, the contents of the file have changed.
Hash functions have another useful property. Even a slight difference in the input produces a completely different result, so any change to a file, however small, is detected. For example, the strings “Hello” and “hello” produce completely different hashes. This lets Git keep a strict record of file changes.
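To make the ID concrete, here is a small Python sketch that computes the ID Git gives to a file’s contents. One detail goes beyond the description above: Git does not hash the raw contents alone but first prepends a short header of the form "blob <size>" followed by a zero byte. The function name git_blob_id is a hypothetical helper, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the 40-digit hexadecimal ID Git assigns to file contents."""
    # Git hashes a header ("blob <size in bytes>" plus a zero byte)
    # followed by the contents themselves.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
print(empty_id)       # ID of an empty file
print(len(empty_id))  # always 40 hexadecimal digits
```

If Git is installed, the result for any file can be compared with the output of the git hash-object command run on that file.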
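The “Hello” and “hello” example is easy to try with Python’s hashlib module. This sketch hashes the bare strings with SHA-1, without the header Git adds to file contents; sha1_hex is a hypothetical helper name.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """SHA-1 of a string, as 40 hexadecimal digits."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


upper = sha1_hex("Hello")
lower = sha1_hex("hello")

# Count the positions at which the two 40-digit results differ.
differing = sum(a != b for a, b in zip(upper, lower))

print(upper)
print(lower)
print(differing, "of 40 hex digits differ")
```

The two inputs differ only in the case of one letter, yet the two outputs look unrelated.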
Let’s take a look at how Git works. When you run the git init command in a folder, Git creates a content-addressed file system there, in a hidden .git subfolder. When you create a file and stage it with the git add command, Git computes the ID of the file’s contents with the hash function and stores the contents under that ID. If you modify the file and run git add again, the new contents are stored under a new ID, and the file’s history accumulates.
You can also save the changes to multiple files at once, which is called a commit. The git commit command saves the changes you have added so far as a single bundle, and you can later revert files to a specific point in time based on that commit.
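The flow of git add followed by git commit can be imitated with a toy model in Python. Everything here is a simplification for illustration: the functions add, commit, and restore and the three containers are hypothetical, commits are numbered rather than hashed, and real Git uses its own on-disk formats.

```python
import hashlib

objects = {}   # object ID -> file contents
index = {}     # path -> ID of the contents added most recently
commits = []   # each commit is a snapshot of the index


def add(path: str, content: bytes) -> str:
    """Store the contents under their ID and remember them for the next commit."""
    blob_id = hashlib.sha1(content).hexdigest()
    objects[blob_id] = content
    index[path] = blob_id
    return blob_id


def commit(message: str) -> int:
    """Save everything added so far as one bundle; return the commit number."""
    commits.append({"message": message, "files": dict(index)})
    return len(commits) - 1


def restore(commit_number: int) -> dict:
    """Return the contents of every file as of the given commit."""
    files = commits[commit_number]["files"]
    return {path: objects[blob_id] for path, blob_id in files.items()}


add("main.py", b"print('v1')\n")
add("README", b"docs v1\n")
first = commit("first version")

add("main.py", b"print('v2')\n")
second = commit("update main.py")

old_files = restore(first)
new_files = restore(second)
```

Because old contents are never overwritten in objects, restore(first) still returns the first version of main.py even after the second commit.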
Reference and commit management
Each commit records the state of the files at that moment and is assigned a unique ID. Like a file ID, a commit ID is a 40-digit string, which is difficult to remember. To solve this problem, Git provides references. A reference is a short, friendly name for a commit; the default one is traditionally named master (newer setups often use main). References let you manage commits using memorable names instead of commit IDs. Git is also designed so that each commit references the previous commit. This means that even if you only know the reference to the last commit, you can reach all the previous commits by following that chain.
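The way a single reference leads back through the whole history can be sketched as a linked chain in Python. This is a toy model under simplifying assumptions: commit_on and history are hypothetical helpers, and a commit here holds only a message and a parent ID, whereas a real Git commit also records the files, author, and date.

```python
import hashlib

commits = {}   # commit ID -> {"message": ..., "parent": ...}
refs = {}      # reference name -> ID of the latest commit


def commit_on(ref_name: str, message: str) -> str:
    """Create a commit whose parent is the reference's current commit."""
    parent = refs.get(ref_name)  # None for the very first commit
    commit_id = hashlib.sha1(f"{parent}\n{message}".encode("utf-8")).hexdigest()
    commits[commit_id] = {"message": message, "parent": parent}
    refs[ref_name] = commit_id   # the reference now names the new commit
    return commit_id


def history(ref_name: str) -> list:
    """Follow parent links from the reference back to the first commit."""
    messages = []
    commit_id = refs.get(ref_name)
    while commit_id is not None:
        messages.append(commits[commit_id]["message"])
        commit_id = commits[commit_id]["parent"]
    return messages


commit_on("master", "add login page")
commit_on("master", "fix typo")
commit_on("master", "add tests")

log = history("master")
print(log)  # newest commit first
```

Only the name master is needed to recover all three commits; the 40-digit IDs never have to be typed or remembered.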
Usability and scalability of Git
Git can be used in many fields, not just software development. For tasks that require collaboration among several people, such as writing research reports or managing project materials, it records the changes in each version and lets you revert to any earlier point when needed. Git is therefore an effective tool not only for software developers but also for document work and data management.
Conclusion
Git overcomes the limitations of the location-addressed file system through its content-addressed file system and can efficiently manage the history of files. It also provides features such as commits and references to help users manage versions easily. Although it operates in a complex manner internally, once you understand the principle you will find that Git is a powerful and useful tool for software version management.