Blueprint Series #4: Revision Control and Git
In this, the fourth episode of TEN7's Blueprint for Operations series, we’re taking a deep dive into revision control in general and Git in particular: the process by which you take code that starts as an idea on a developer’s computer and promote it, check it, rework it, promote it again and finally land it for use by a user somewhere, anywhere.
Here's what we're discussing in this podcast:
- Managing revision control with Git
- It all started with Zardoz
- Writing, revising and sharing code safely
- CVS, concurrent versions system
- VCS, version control system
- Managing a history of your code
- RCS, revision control system
- IDE, integrated development environment
- Push & Pull
- Data integrity laws
- GitHub & GitLab
- Nix the binary files
- Nix the file directory
- Distributing risk
- Etymology of Git
IVAN STEGIC: Hey Everybody! You’re listening to the TEN7 Podcast, where we get together to talk about technology, business and the humans in it. I’m your host Ivan Stegic. TEN7 is at its core a technology company. We create and care for Drupal-powered websites and the bulk of what we do, amongst other things like strategy and design, is somehow code related. As sites grow and become more complex, and as our own capabilities as a firm become deeper, it naturally becomes harder to keep track of the changes that a whole team of people make to a code base. The more people work on a project, the greater the likelihood of one-offs and custom code, and the greater the chance that you’ll stray from an optimal solution. We like to talk about this particular problem as technical debt. We know we’re creating it, but it’s our responsibility to minimize the amount. So, when we talk about things like version control and branching and releases, we’re really looking to put a repeatable formalism around the creative process of making a website or a web app or, honestly, anything that a user sees. It’s these ideas and the tools that go with them that form some components of TEN7’s Blueprint for Operations. Today we’re going to dive a little deeper into revision control itself and Git more specifically. A process with which you take code that starts as an idea on a developer’s computer and promote it, check it, rework it, promote it again and finally land it for use by a user somewhere, requires us to know not only about revision control, but deeper ideas like branches and environments and even releases. We’re going to start with the fundamental part, the revision control itself. Joining me once more is Tess Flynn, our DevOps Engineer here at TEN7. Tess, thanks for joining me again.
TESS FLYNN: Yep.
IVAN: It’s lovely to have you on these. I feel like you’re our star guest.
TESS: Well, I’ve done podcasts before, so I’m very comfortable with the format. (laughter).
IVAN: Ok, good.
TESS: I used to run my own for a while, but that fell apart years ago.
IVAN: What was that called?
TESS: It was a bad movie review podcast. The only claim to fame that we had is that we reviewed Zardoz before reviewing Zardoz became a thing, which is impressive because apparently no one knew about that film until then.
IVAN: I don’t know about that film. Should I ask about it (laughter) or should I refer to a podcast episode of yours?
TESS: Briefly, Zardoz is a high concept, early seventies, Sean Connery sci-fi film.
IVAN: Sean Connery.
TESS: You probably never heard of it because it was universally panned. It was really long. It feels slower than 2001: A Space Odyssey and the concept is patently ridiculous, and it wouldn’t even get an R rating today, it would have to be even more censored than that given what goes on in the film.
IVAN: Oh my.
TESS: It’s just a ridiculous post-apocalyptic film, that was actually kind of fascinating in a train wrecky kind of way. Yea (laughter).
IVAN: Is it on Netflix?
TESS: No, it is not. I don’t think that it’s on any streaming service, but you can get the DVDs pretty cheaply, because they have a batch of them and they want to get rid of them all. (laughter)
IVAN: Nobody’s buying them.
TESS: (laughter) Exactly.
IVAN: (laughter) Wow. Well I might take a look at Amazon then. Is that Zardoz with a Z or an X?
TESS: With a Z.
IVAN: Ok. We’ll link to it...
TESS: Fortunately not an X, that would’ve been an eighties film. (laughter)
IVAN: We’ll link to it in the transcripts for those of you that are listening. Ok. Let’s get back to revision control. So, I want to kind of take a step back and remind ourselves what we’re actually trying to accomplish. The idea here is that we are writing software that we want to get into the hands of a user. We’re either updating our website, or creating a web app of some sort. Maybe it’s an iOS app. We’re doing something creative, writing some sort of code and pushing it to some environment somewhere where a user can interact with it. Doing that as a single developer, you could probably come up with your own workflow to do that. But as soon as there’s more than one person, things start to get a little hairy, or a lot hairy. And, so, revision control turns out to be something that’s really important because it allows us to reduce the complexity of working together. What would you Tess, define revision control or version control as? We’ll start with the basics.
TESS: I would like to actually back up a little bit and take more of a story approach, which is why would you even need to care about this “version control” thing you might have heard of? If you’re a single developer and you’re only working with yourself, you’re only beholden to yourself, you probably have never needed anything like a version control system or Git or CVS or Subversion or any of these other weirdo acronyms that they keep throwing at you when you work with computers a lot. And, that’s probably fine, but then you start running into some problems. It’s Friday, it’s late, you’re working on some change really quickly. You go and you bang that change out on your code, and then you up and push it to the server. You use SFTP to upload it, and then you walk away, and you’re fine. And, Monday morning you discover that your site has been broken for the whole weekend, and you don’t remember what you did on Friday. That’s a bit of a problem.
IVAN: I’ll say.
TESS: This can actually happen a lot. That’s a big glaring example but there are lots of little examples. You might spend days troubleshooting a problem, only to discover it's this one line, and you don’t remember who added it, why it was added, when it was added, what was the motivation behind it and all this other stuff. You can’t situationally place it, because it’s just a line of code in your code base. And human beings can have pretty good memories, but most human beings have a natural limitation to the things that they can remember and the things that they can associate with particular things, particularly lines of code. So, it gets to be a bit of a problem. Then there are other situations where you’re working on a project, and you’re making changes throughout the project in multiple different locations. And then halfway through it you’re like “I hate this. I don’t like the way this is going. I want to undo all of this.” There is no undo button. If you’re using an IDE, an Integrated Development Environment, you might be fortunate enough to be able to Control Z all the way back to where you were at the beginning of the day. But if you don’t have that, or if your IDE crashes, you’re out of luck. You’ll have to download the code again from your server, renormalize it with your local environment and set the whole thing up all over again. And this is another case where it’s just a mess, when you’re just working with just files. So, the problem gets more compounded when you decide “well, I’ve been working with this site for a while. Everything’s going great, but we have a crunch. I need to bring in another person.” So you bring in another person and they start working on the site as well. Then you start running into a new problem that you’ve never had before. 
You’ll try making a change to one piece of code and upload it, and then your partner will do a change to some other code which happens in the same file and upload their version, and suddenly your changes on that one file are gone and then everything is broken. How do you resolve this? And, this became such a problem that we invented something called a version control system. It’s usually called a VCS, it’s just one of those acronyms that you hear a lot of, but a version control system attempts to solve all of these problems by looking at these files in the sense of what was changed and creating a historical record of changes, because the world does not have Control Z.
IVAN: I wish it did. (laughter)
TESS: Oh man, do I wish it did right now. (laughter). So, that’s kind of the idea behind version control, is to solve these problems, to create a history of your code so that you know what’s changed, when it was changed and who changed it. But it also does a few other things. It can help you work with multiple people by allowing the tool to perform file specific merges. So if two people change the same file at the same time, but on different lines, the tool can resolve those changes, and you can get both of them in the same place. That’s another useful tactic. It also provides an undo mechanism. You make a bunch of changes, then you go “nah, I don’t like any of those.” You reset, and you go all the way back to where you were before you started and everything’s fine again. Version control lets you do all of these things.
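The history and undo ideas in this answer can be sketched with Git itself. This is only an illustration in a throwaway directory: the file name site.php and the "Demo Dev" identity are made-up placeholders, not anything from the conversation.

```shell
set -e  # stop at the first failing command

# Throwaway repository; the identity below is a placeholder for the demo.
demo=$(mktemp -d)
cd "$demo"
git init -q
export GIT_AUTHOR_NAME="Demo Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Demo Dev" GIT_COMMITTER_EMAIL="dev@example.com"

# Record a known-good state.
echo "working code" > site.php
git add site.php
git commit -q -m "Add working version of site.php"

# Friday evening: a hasty edit that turns out to be a mistake.
echo "broken experiment" > site.php

# The history records who changed what, and when.
git --no-pager log --format='%h %an %ad %s'

# Undo: discard the uncommitted edit and return to the last commit.
git reset -q --hard HEAD
cat site.php
```

Note that a hard reset throws uncommitted work away for good, so it suits the "I don’t like any of those" case and little else.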
IVAN: You talked about different kinds of version controls, and you mentioned some acronyms. You said Subversion and CVS. We’ve almost used the word Git synonymously as version control, because it is so ubiquitous. Can you describe the differences between these acronyms? What they are and why are we using Git as the de facto version control standard?
TESS: Would you like a history lesson?
IVAN: Yea, why not!
TESS: So, let’s go all the way back in time to I think the seventies. Zardoz was still in theatres…
IVAN: (laughter) Nice way to bring it around.
TESS: …and we had RCS, the revision control system. That particular one was the original version control system. I believe it was created by IBM to handle a lot of these problems, because as a large software company they were working with multiple individuals that had to deal with multiple problems. They also needed to do one more thing that we haven’t talked about, which is they needed to create a de facto point of what is the canonical source code. Who says this is the real code and everybody else is just copies? And that was one of the motivations behind it. Now, RCS got updated and fixed and I think rewritten a few times, and they eventually called it the concurrent version system or CVS.
IVAN: Oh, and I remember RCS. That must be revision control system right? That’s the one you were talking about.
TESS: It was a simpler mechanism before concurrent versions.
IVAN: So CVS is actually a rebranded RCS?
TESS: I think it might be a rewrite and an expansion.
IVAN: I see.
TESS: If I’m recalling my history correctly. It’s been a long time, and I don’t have Wikipedia in front of me. I could be entirely misremembering, see point earlier about human beings have limited memory capacity. (laughter) Now, CVS was the standard with a big "S" for decades afterwards. They had us use CVS in college, and I hated it. But CVS was a well-known standard, had a well-defined list of features, worked in small and large teams, and was reimplemented so many times that it became open source. So we have this version control system that generally worked. But, it had problems. CVS was never particularly good at handling two different changes to the same file at the same time. It could resolve them, but more often than not you would have to manually merge those changes.
IVAN: I recall that. I especially recall that using Visual Basic with colleagues at a former job in CVS. It was not fun. You literally had to do a manual merge, like you said, every time.
TESS: Mhm! And it also had another problem, which was that it was very server centric. You didn’t get a whole copy of the repository, you checked out the repository, and even checked out individual files, in order to make those changes and then push those changes back up. This was not a very good way of handling things, but it dealt with some limitations with technology that we had at the time, and it worked. So, fine. But CVS had another big failing that was REALLY hurting people, especially as websites became a thing, which was CVS sucked for binary files. It hated them. It couldn’t stand them. It didn’t know what to do with them. So then someone came along and they created Subversion, sometimes called SVN. Now Subversion is mostly the same thing as CVS, but it also handled binary files, and it was also generally open source, and a lot of projects used it. SourceForge.net was the largest hosting provider of Subversion Control repositories as a service and became the de facto standard in the late nineties, early 2000s for open source distribution for that reason. That was still the standard for a few years until you get to about the mid-2000s, and there are other alternate version control systems that are still out there and one of them is called Mercurial.
IVAN: Oh, yea.
TESS: Mercurial was a very different version control system. It was a distributed version control system. It did all the same things as CVS and SVN, but you didn’t really check out individual files from a server, you cloned an entire repository, and then you worked with all the code there, with all the changes you need to make. If your internet goes out, it doesn’t matter you can make commit, after commit, after commit and it doesn’t matter because you don’t need to talk to the server until you do a push, where you push any changes from your repository up to some remote, somewhere. We don’t care where it is. We don’t care if there’s one or more of them. Just somewhere. Now that was fine, and it became really popular in the Linux kernel community because it was used to power the Linux Kernel repository, until the company behind Mercurial wanted to charge some exorbitant licensing fees. [Editor’s note: the version control system the Linux kernel used at the time was BitKeeper, a proprietary tool, not Mercurial. Git and Mercurial were both started in 2005, after BitKeeper’s free-of-charge license for kernel developers was withdrawn.] That’s when Linus Torvalds in a very "Torvalds-ian" style decided to write his own, over a weekend and largely pulled it off, and I admit to his credit, it’s still a very good system, to the point it has basically taken over as the de facto standard. Like CVS, it allows you to check out files. It allows you to get code, store code, create historical record of code, merge changes together. It handles binary files very well, and it’s distributed so that you don’t have to have a central canonical server, unless you want to. You don’t have to check out one file as if you had constant access to the server, 'cause it’s literally down the floor that you’re on, because you’re a university, or a large company. The server is somewhere out there on the internet, and you’re working on a laptop. So, Git is kind of an evolution of all of these different changes, but it had the advantage that it was open source from the start. Its modality worked really, really well for developers, because if your connection goes down, doesn’t matter. 
If you want to make five or six commits and then push all of those up as one block, you can do that. If you want to look at the entire history of code, it’s fine. It also has a different modality in how it resolves changes, because Git doesn’t see files. Git sees changes. So, if you add a file to Git, you’re adding lines, that’s it. The exception is binary files but that’s another problem. (laughter)
IVAN: So you’ve kind of described the history of how we got to using Git as the de facto standard and you referred to CVS and to RCS as essentially client-server products. And you first talked about Mercurial as a distributed product, and of course Git is as well. Clear up for the audience, if you will, why we call it a distributed system, when we effectively use it in a client-server way. And, I think this leads into a question of what is GitHub and how is that different from Git and GitLab and all these other things? I think that’s where my question is going.
TESS: CVS and Subversion were mostly designed around the university model, where you have a bunch of students or researchers that are working in a computer lab, and the computer is literally on the same network, if not on the same physical floor, as everyone else. So, you can assume interactivity in which you just go “I need to work on X file.” “I check out X file.” “No one else can check out X file until I am done and push all of my version of X file back up to the server.” A Git control system works in a very different modality. A distributed version control system says “you want a repo, here, download the entire repository, all of its history and everything. You now have the entire project's history on your local system. You don’t need to talk to me anymore. Bye.” This is great in the internet era, because if you’re on a laptop, if you’re going to be on a plane, you can keep using commits instead of having to wait until you land, wait until you get to your hotel, wait until you check in, wait until you connect to the wi-fi and then do your push or changes. You can put them to your individual repository, because you are in essence an island now. If you want to share that code, there’s kind of a regressive standard. We go back to the university model, kind of. What happens is that we designate a particular Git repository as the canonical Git repository. That one is going to be the source of truth for your entire team, and then anything that happens in your local copies of the repositories should be pushed from your repository up to that canonical repository. And then other team members can pull those changes from that canonical repository down to their local copies.
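A rough sketch of the "island" workflow described here, using a local bare repository as a stand-in for the team’s canonical repository. The directory names (seed, canonical.git, laptop), the file app.txt and the identity are assumptions made up for the demo.

```shell
set -e  # stop at the first failing command

# Placeholder identity so the demo commits can be created anywhere.
export GIT_AUTHOR_NAME="Demo Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Demo Dev" GIT_COMMITTER_EMAIL="dev@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-in for the team's canonical repository, seeded with two commits.
git init -q seed
cd seed
echo "v1" > app.txt
git add app.txt
git commit -q -m "First version"
echo "v2" > app.txt
git commit -q -am "Second version"
cd ..
git clone -q --bare seed canonical.git

# Cloning copies the whole history, not just the latest files.
git clone -q canonical.git laptop
cd laptop
git rev-list --count HEAD   # number of commits now available locally

# "On the plane": committing needs no network connection.
echo "v3" > app.txt
git commit -q -am "Third version, written offline"
echo "v4" > app.txt
git commit -q -am "Fourth version, written offline"

# Back online: a single push sends both commits to the canonical repository.
git push -q
git -C ../canonical.git rev-list --count HEAD
```

With a hosted service the only difference would be the URL passed to git clone; everything between the clone and the push stays local.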
IVAN: You’re using those words very carefully – push and pull.
TESS: Yea, I am. (laughter)
IVAN: Maybe we can get to those in a little bit here. You didn’t answer my question about GitHub and GitLab.
TESS: With distributed companies, with companies that have clients that need to look at their own code and with open source projects, you run into the problem of “well, we need to put the canonical repository somewhere.” Who’s going to do that? Who’s going to own that? Where does it go? And then when they do that, well “how do I get authority?” “Who owns the system?” “How does that work?” And then it gets really complicated and messy. A lot of organizations, particularly the one behind GitHub went “you know what? What if we were to create a site where you have a nice web-based UI that gets you to your projects, that can package your projects, but also deal with other things that are related to software projects, like dealing with particular issues, or making Wiki pages for documentation, or bundling up and making releases available for download, or storing things like change logs so that people can read those in a standardized format?” Having a nice website to go to that handles that as well as – and this is key – identity management, having a user account that you can sign up for and have the service transparently manage that for you, so that you don’t have to ask Bob who has the server in his basement that’s running Git, to do all these things. It makes it a lot easier for these companies, organizations and open source projects to continue to exist. Because now you’re asking another party to go “can you be our source of truth?” “Can you be our identity manager?” And then after that, the project no longer has to worry about those concerns. They have outsourced those concerns to a third-party, and GitHub was the most popular, but it was not the first. SourceForge was before it and largely maligned because it had a lot of garbage on it too. I know because I put some stuff on there too. (laughter) The problem with GitHub is that it’s a single instance. It’s one of those unfortunately 'too big to fail' pieces of infrastructure now, and it’s really frustrating. 
The problem is that it’s free to use, if your code is public. There are good reasons to not have your code public, or to actually keep it so that other users have to have a particular access credential in order to get to the code. And, while GitHub can provide that, it doesn’t necessarily meet all the data integrity laws, depending on what organization you are, what kind of company you are, what regulations you need to follow and what country or economic zone you’re in. It’s arguable that if you are a company in Europe, you probably shouldn’t push code to GitHub, unless you can be assured that it goes to a server located in the EU.
IVAN: Is that a result of the recent changes in the European law?
TESS: That kind of data retention law is actually even older than GDPR.
TESS: But, for some companies they’re like “no, we don’t want to do this.” Very traditionally IT managed companies, they want to self-host everything, because to them self-hosting is security, and you’ll get companies like HP and all of these other organizations that are fairly large that would just not be able to handle putting their code on GitHub. Or they don’t own the server that it runs on. And, that’s where you get another competitor to GitHub – GitLab. GitLab does pretty much all the same things as GitHub does. In fact, it does some things better in my opinion. And, I happen to like GitLab a lot, but GitLab does one other thing that GitHub does not. It is, by default, an open source, self-hostable product.
TESS: Everything that happens is code that you can get, that you could put on your own server, that you manage yourself, and you could even keep it behind a corporate firewall and everything if you want to. But, this becomes an internal source of truth, that it has a nice UI that provides a whole bunch of other integrations, which are very common and also standardized with software projects. And that’s why these things exist. There are plenty of other proprietary and for pay solutions to this, like, I forget what Microsoft’s was called, but they had one. Google even had one for a while but they kind of gave up on that. And, a lot of other companies had these different products, but GitLab has been steadily eating them all, because Git is much more a preferred mental modality for a lot of developers. And if they could migrate their projects from CVS or Subversion to Git, they generally try to do so.
IVAN: Thank you for clearing up the differences and talking about GitHub and GitLab and Git. So, now we understand that we’re using Git, and that GitHub and GitLab are products that we might use that are built on top of Git. It sounds to me like one of the big changes that Git brought from a conceptual point of view was to think about the changes in files, not the files themselves, and to track change sets. So, maybe we should talk about how that aspect works. The fact that a change set is what we’re tracking, and how would a developer make a change set and account for one?
TESS: Ok. So, let’s talk workflow. That’s what we’re really getting at here, is “alright, I have this Git thing. What do I do with it?” Well, the first thing you need to do is you need to initialize a repository. And there are generally two different ways that you can do this. You can do it the old fashioned, Unix Beard way by doing git init, on an existing directory. But more often than not, a lot of people tend to do it the cheap way, which is they go to their canonical repository, like GitHub or GitLab, they look for wherever the repository link is and they copy it, and then they do an initial git clone. That repository could have code in it already or it could be completely empty. I have to admit, I happen to like this method a lot better than the git init version, because I never could remember the format for how to set a Git remote.
TESS: I keep forgetting every time.
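For reference, here is a small sketch of the two starting points side by side. A local bare repository (hosted.git) stands in for a repository created through a hosting service’s web UI, and the directory names by-init and by-clone are invented for the demo; with a real service the URL would be the clone link copied from its UI.

```shell
set -e  # stop at the first failing command

work=$(mktemp -d)
cd "$work"

# Local stand-in for a repository created on a hosting service.
git init -q --bare hosted.git
url="$work/hosted.git"

# Way 1: git init on an existing directory, then register the remote by hand.
mkdir by-init
cd by-init
git init -q
git remote add origin "$url"   # format: git remote add <name> <url>
git remote -v
cd ..

# Way 2: git clone creates the directory and sets up "origin" in one step.
git clone -q "$url" by-clone
git -C by-clone remote -v
```

Cloning an empty repository prints a warning but still succeeds, which matches the remark that the repository "could be completely empty."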
IVAN: I would’ve thought you had been hardcore git init on an empty directory?
TESS: Nah, nah. I do it the other way around. I like that part, because in one command, in a single git clone, it already has the remote server set up, it already has the default branch name set up, it already has everything set up as it should be on the server. And, it’s kind of a pain, because the sources of truth, like GitHub and GitLab, have some built-in default assumptions, and if you try to use just raw Git at first, you can quickly go outside those default assumptions and make things more complicated for yourself. So it’s a lot easier to use a UI to create the repo first, and then just clone it. But we’re already talking about a few different operations. So, a git init takes a regular directory, that’s on your system, and turns it into a version control system, a version control managed directory. And, if you’re using Git, what it really seems to do is add a .git directory. It doesn’t seem like it does anything else, but that .git directory has a lot going on inside of it, and most of the time you don’t ever want to see what’s going on in there, because it’s a mess. (laughter) If you already created the repository on GitHub or GitLab, you want to do a clone. And what a clone does is it goes “I don’t have a copy of your repository, but I know which URL it’s available on. Can you copy it from that remote source to my system?” So a lot of projects start out with going to UI and making a project or repository, a something rather, that somewhere in it ends up with a Git repository. And then you clone that to copy it from that remote server down to your local system. So that’s initialization. Now that we’ve started that, we have our Git directory, our repository, on our system. The next thing is we need to add files. So, how do you add a file? How do you make Git aware that something is “tracked”? What you do is you use the git add command, and the add command says “take this file and I want you to pay attention to it.” That’s it. That’s what git add does. 
There’s a few different switches to add multiple files at once, and you might also wonder, "why don’t I want to track everything?" That seems like a natural assumption at first, when you’re first using Git, but there are numerous places where that’s actually a bad idea. If you’re doing a Drupal site, you don’t want to track your settings.local.php file, because it has your database credentials in it. That’s not something you want to end up on GitHub or GitLab. Another problem is you might have your files directory. You don’t want to add your files directory to your repository. That’s a completely different thing. It’s not something that should be tracked in code, it’s something that should just be a directory somewhere. And so what you end up doing is you only selectively track files. In order to keep Git from trying to re-add these files when they appear in a repository, you can .gitignore them. And, there’s a whole other topic about how to do a .gitignore, and those are two sides of the same coin – add and ignore. When you do adds, what happens is that you are adding change sets. You are adding changes to files. It could be that the file just wasn’t known to the repository before add. So, the entire contents of the file become a huge add change, a huge insert operation that gets added to the repository. Once you actually have all the files added, all your changes added, you want to say “ok, I’m done with whatever I’m doing. I want you to remember everything as it is right now and assign it some unique identifier, so I can recall it later.” This is what we have when we have a change set. So all those files we’ve added become a change set, and we need to take that change set and say “this is one point in time that I want to remember.” And, you do that with a git commit. A git commit takes a change set, and says “this is a defined point in history that I am remembering. This is a snapshot of these files in time.” That’s what a commit is. 
So, that’s the basic operations for how you work locally. Then the problem is “well I do all these adds and changes on my system. What happens if I change a file that I already added and committed before?” Well, Git is actually pretty smart and it will go “oh, hey that file that you added earlier, it’s different now. Here’s what changed. Do you want me to remember that?” And then you do the same operation again. An add to add the changes, and then a commit to save those changes. And, again, it’s all about changes, not about files. The files don’t really exist in terms of Git. It’s really a path and a series of inserts or removes. That’s what really Git does. Once you understand that, it gets a little easier to understand its mindset, that it doesn’t see files, just changes. Alright, so we have all these changes. We’ve done a bunch of commits, and now you want to share those changes with your team. Or, you want to back up your code to some third-party server just in case your cat decides to throw up on your laptop.
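The local workflow described in this answer (add, ignore, commit, then change and commit again) might look roughly like this. The file names follow the Drupal examples mentioned above, but their contents, the theme.php file and the identity are placeholders invented for the sketch.

```shell
set -e  # stop at the first failing command

# Placeholder identity so the demo commits can be created anywhere.
export GIT_AUTHOR_NAME="Demo Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Demo Dev" GIT_COMMITTER_EMAIL="dev@example.com"

site=$(mktemp -d)
cd "$site"
git init -q

# One source file to track, plus two things that should stay out of the repo.
echo "<?php // theme code" > theme.php
echo "<?php // database credentials" > settings.local.php
mkdir files
echo "user upload" > files/photo.txt

# .gitignore lists paths Git should never offer to track.
printf 'settings.local.php\nfiles/\n' > .gitignore

git add .              # stages everything that is not ignored
git status --short     # shows what will go into the commit
git commit -q -m "Initial commit"

# Change a committed file: Git reports it as lines removed and lines added.
echo "<?php // theme code, revised" > theme.php
git --no-pager diff
git add theme.php
git commit -q -m "Revise theme code"

git ls-files           # tracked paths only
```

The .gitignore file is itself committed, so every clone of the repository inherits the same ignore rules.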
IVAN: (laughter) It’s been known to happen.
TESS: It has been known to happen. (laughter) What you do then is you need the opposite of a clone. You need to be able to say “I have changes locally, remote repository, can you copy them to you and incorporate them into yourself?” And this is called a "push". What a push does is it finds out the last commit on the server, on the remote repository, and compares it to your current commit on your local repository. It creates another change set and goes “oh, hey, all these commits have it” and it pushes those changes up to the remote server, and the remote server incorporates them into itself. And that is where the code gets stored. And now, when you have your other assistant or another coworker who needs to download your changes from the remote repository, what do they do? Do they need to reclone and start all over again? No, no, that’s ridiculous! Instead, what they do is they do the opposite of a push. They do a "pull". So, a push takes code from your local system up to the server, and a pull takes any changes that are on the remote server and pulls them down to your local repository, and then reincorporates the history to your local repository. And this is how code can be shared between multiple team members who are working asynchronously, geographically distributed, time differentiated, in order to actually create a unique and canonical single source of code truth between everyone who’s working in the same team.
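As a sketch of push and pull between two teammates, the script below uses a local bare repository as the shared remote. The names alice and bob, the file page.txt and the identity are hypothetical. Each person edits a different line of the same file, so Git can merge both changes when Bob pulls; the --no-rebase flag explicitly asks for a merge, since newer Git versions refuse to guess when histories have diverged.

```shell
set -e  # stop at the first failing command

# Placeholder identity shared by both demo users.
export GIT_AUTHOR_NAME="Demo Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Demo Dev" GIT_COMMITTER_EMAIL="dev@example.com"

team=$(mktemp -d)
cd "$team"

# Seed the shared repository with one five-line file.
git init -q seed
cd seed
printf 'title\nintro\nbody\nnotes\nfooter\n' > page.txt
git add page.txt
git commit -q -m "Add page"
cd ..
git clone -q --bare seed canonical.git

# Each teammate gets a full local copy.
git clone -q canonical.git alice
git clone -q canonical.git bob

# Alice edits the first line and pushes.
cd alice
sed 's/^title$/Title (Alice)/' page.txt > page.tmp
mv page.tmp page.txt
git commit -q -am "Alice: update title"
git push -q
cd ..

# Bob edits the last line, unaware of Alice's change.
cd bob
sed 's/^footer$/Footer (Bob)/' page.txt > page.tmp
mv page.tmp page.txt
git commit -q -am "Bob: update footer"

# Pull fetches Alice's commit and merges it with Bob's; then Bob pushes.
git pull -q --no-rebase --no-edit
git push -q
cat page.txt
```

Had both edits touched the same line, the pull would have stopped with a conflict for Bob to resolve by hand.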
IVAN: So, even though you have a distributed version control system, you’re effectively using the client-server model because it’s an easier way to get things to all of your team members. It’s not really a client-server relationship, you’re just taking advantage of the internet and the fact that we can do that?
TESS: Right. This could also work if it’s a server in a closet inside the same house. It could be Bob’s laptop is the source of truth for everyone else – hopefully Bob does not go on vacation (laughter) or lose his laptop, 'cause that would be bad. But, more often than not, it’s one of these third-party services, which make it their business to provide a code repository system with value-added features.
IVAN: In theory, if you Tess and I were working on our laptops, in a conference room somewhere, and we were all in the same local subnet, and my machine could see your machine, and we didn’t have firewall rules set up that would block particular Git activity…
TESS: and we operated in a Utopia where we didn’t need to worry about security (laughter)…
IVAN: Right, if we did. If those things were true, in theory I wouldn’t really have to push to that server, right? I could push to your machine and you could push to mine, and I could pull from yours...
IVAN: ... it would become a little messy but it would be a way of doing it, if we really needed to.
TESS: Right. That’s correct. But that’s why it’s really popular to have one of these third-party services. Now, it’s important to note that you can actually have multiple “remotes”. That’s what these third-party canonical servers are called. They’re called a “remote” in Git lingo, and you could have multiples of them. You could have one that’s GitHub, one that’s Drupal.org, one that’s BitBucket, Pantheon would be another one. Those are different ways that you can do that, and you can push to those as well and pull from those as well, as needed.
IVAN: And in that case you distribute your risk across many different servers as well, and perhaps you have different reasons for having different systems.
TESS: Mm hmm.
IVAN: Ok. So it sounds like we’ve gone through the major commands that you need to share your own code as a single developer. And, to also work with another team member. What do you think the threshold is of actually starting to use Git as a single developer?
TESS: Do you have more than two files?
IVAN: Yes. (laughter) Ok.
TESS: It’s a very different answer than you were probably expecting. You were probably thinking “oh, it’s the number of people” or “It’s going to be the complexity of the code.” Honestly, if it’s more than two files, you might as well use it. There’s little reason to not is the problem. It’s a really versatile tool. It’s also a really lightweight and fast tool to even manage very large projects. So, as a result, there’s not a lot of reason to not use it. In fact the biggest barrier to entry for most people is they don’t know about it, or understand it, or are intimidated by it.
IVAN: I think intimidation is a big part of it as well, as you’ve alluded to. I think just not knowing about it and thinking that it’s this big, ugly monster that you’re never going to get a hold of, that you’re never going to grok, that makes it intimidating, and it doesn’t have to be. It doesn't have to be.
TESS: It’s also possible they could’ve been like me, and they had horror stories of CVS and SVN in college and hated all revision control systems for ages, because they were taught really badly as a thing that you needed to learn, but not given any motivation or reason or framing as to why this was a thing.
IVAN: But like you, they have the opportunity to get past that and to become an expert. I realize now how much of the things that we work on everyday are text files. It sounds ridiculous to think about it, but, honestly everything we’re doing is a text file, or likely an image. And if it’s not an image, it’s another binary file that’s an application that you really shouldn’t put in Git anyway. So, Git is keeping track of the change sets in text files. What’s the deal with images and change sets?
TESS: Images and other binary files are treated as this weird exception in Git. It doesn’t really track the individual changes in them, it treats them kind of as a whole unit. I think there’s a separate binary file mode, and I believe that what it does is either track the file size, or the MD5 sum, or something of the individual files. And if they change, it assumes the whole file needs to be uploaded. And this is good, because it preserves file integrity, but it’s also bad if you’ve done something like added Photoshop files to a repository that are multiple gigabytes in size. That was probably not the best choice in that case, because the problem is Git remembers. And even if you remove something from the repository now, it’s still in the history, and it will always be there, unless you do some invasive hacking in order to completely get rid of it. And so that could make the process very, very difficult. And this is why you want to be careful about what binary files you add, because it’s a complete copy each time.
IVAN: And even if you delete it, it’s still there!
TESS: Mm hmm.
IVAN: Yeah, that is definitely something to keep in mind. Okay, so, we’ve talked about kind of the history of Git, we’ve defined some terms, we’ve talked about ways that you can manage code locally and push code, and we’ve even gone through some of the very basic commands that would start you off as a developer, being able to work with team members. There are a number of concepts that we haven’t talked about, and I think I want to save those for the next episode in the Blueprint series. Branches are one thing. How we separate code between ourselves, so we can work on different things at the same time. Releases and merging are different concepts as well that I think we should talk about. Tags, yet another thing we should mention. I think we’ll talk about this in the next episode. What else should we talk about?
TESS: Probably branch strategies is going to be the most important thing. You might’ve heard of something called GitFlow and that’s something that we’ll talk about in the next episode.
IVAN: I think that’s a great idea. Let’s talk about GitFlow in the next episode, and we’ll go from there. Anything else we should say about Git before we wrap it up here?
TESS: Anyone wonder where the name came from?
IVAN: Oh, that’s a good story, yes tell us. (laughter) I love this story by the way.
TESS: It was a bit of a self-deprecating comment on the part of Torvalds. (laughter)
IVAN: So, the etymology was something to do with Linus Torvalds talking about the thing he was building and it being...
TESS: Git is a piece of Commonwealth slang for someone who’s kind of either not particularly bright, or a jerk or a few different things. I forget exactly.
IVAN: An unpleasant person.
TESS: Unpleasant. That was it. (laughter) So, yes it was a bit of a self-deprecating comment on Torvalds’s part.
IVAN: Well, he has the right to do that, so good on him. Thank you Tess for being a non-git and helping me with this Git episode. I’m looking forward to talking to you in the next one.
TESS: Mm hmm.
IVAN: You’ve been listening to the TEN7 Podcast. Find us online at ten7.com/podcast. And if you have a second, do send us a message, we love hearing from you. Our email address is firstname.lastname@example.org. Until next time, this is Ivan Stegic. Thank you for listening.