
The Open-Source Approach To Collaboration

By Bryce Chamberlain, ASA

This article was originally published in Contingencies, March/April 2019 and first appeared on LinkedIn, January 2, 2019.


Have you heard of Git? GitHub, maybe? Git was created by Linus Torvalds in 2005 to support the development of Linux[i], which in 2016 ran an estimated 67 percent of web servers. Android runs on Linux. Facebook, Google, and Wikipedia all run on Linux.[ii] It is hard to overstate the impact of Git, GitHub, and the open-source revolution. To give another sense of the scope of this space, GitHub was recently purchased by Microsoft for $7.5 billion.[iii] There are 96 million projects (repositories, in Git lingo) and 31 million developers on GitHub[iv], myself included.[v]

My purpose with this article is to share my experience with Git and open source, and encourage you to consider using it at your company to enhance collaboration.

A year ago, I had only heard of Git. I hadn’t used it. Today I rely on it heavily at work for multiple projects daily. I use it to manage my homework between my Linux-enabled Chromebook and home PC. I can feel the cortisol flowing—that’s the stress hormone—any time I think of not having it as a tool.

I credit two influences for my reliance on Git and passion for sharing the message: Rajesh Sahasrabuddhe and an open-source hackathon by Uptake[vi] in Chicago.

Raj is a thought leader in our company and was one of the few people there using Git when I joined in 2017. He introduced me to Bitbucket. Raj is a Bitbucket evangelist because he understands the power of using it. It wasn’t easy at first. Using Git is very different from the typical way of managing project files on network drives. It took me a few months to build the intuition and start trusting the technology. Today, it is second nature and requires much less thinking than my prior approach.

After using Git at the office, I felt like I could take on my first hackathon. I got word through the University of Chicago that an event was taking place. Uptake sent an open invite: Come to their office for free snacks (and beer) and make some contributions to open source. After a brief introduction to working with open source, we started looking for ways to contribute.

It wasn’t hard: Go to the GitHub page of your favorite project and look through the issues page for something you’d like to work on. This feature alone is a game-changer.

I found an open issue on an R package I use daily, lubridate. The task was to remove dependence on another package, stringr, so that lubridate could load faster and not have the risk of requiring separate code. My first submission was a partial solution, which as it turns out isn’t really that helpful, but it was all I had time to finish. I was really nervous—who am I to contribute to this important project?

Figure 1 

The project admins were kind and patient and educated me on how to be a better contributor. I worked with them through a few iterations and finally settled on a complete solution a few weeks later.

I was so excited when my pull request was merged and my code became part of the software that is the go-to for working with dates for more than 2 million R users!

I highly recommend attempting a contribution. You can see this whole process, including our back-and-forth and the pull request, on GitHub.[vii] I’d recommend taking a few minutes to skim it; it will give you a sense for how this works in the wild, and how it might work at your company.

Wait, What Just Happened?

Now, a moment for reflection. This is wild, right? I’m some random person, and I just logged on to GitHub, forked a copy of the code, made some changes, and requested the admins pull my code into a project that potentially 2 million people rely on for their code to work in businesses, at universities, and on personal computers.

How does this not break?

This is the power of open source. It is built around this need for things to work when they are built by people who don’t necessarily know each other or the history of the project. I didn’t review all the code in the project. I just found the specific spots that needed work.

This is why open source is taking over. It removes every barrier to collaboration.

My code will work because I made changes to a separate copy of the code; before my changes were even reviewed by a person, over a thousand automated tests were run to ensure I didn’t break or change anything I wasn’t supposed to. It took me a few tries to get past that part. Then it was reviewed by an expert who made suggestions using the GitHub platform. We went back and forth, and once it looked good and passed all the tests, it was easy to bring the changes in without any ambiguity or mistakes, thanks to Git.

Collaborating the Open-Source Way

Keep the above story in mind as I discuss how collaboration works with open source. 

  • Code is stored online (remote) in a repository (repo) with one or more users set up as owners. This is typically on GitHub or Bitbucket. Repos can be private or public. No one but the owners can change the code directly, but everyone with access can see it, review documentation, and report known issues. People can follow the code to get emailed when the code changes or a new issue is added.
  • Anyone with access can copy (fork) the code to their personal account and download (clone) it to their computer to work on it. They create a new branch to clearly separate this code as they attempt to make a specific change, and then go to town making changes. There is zero risk they’ll affect the main project. Git technology ensures this. Changes are tracked so when they are committed online, others can easily review just the lines that changed. A commit is a set of changes that gets its own page online so people can review and comment. A .gitignore file identifies what shouldn’t be tracked, like passwords and keys.
  • When a change is ready, a pull request to the original project is created. This is a request for the project owner to pull the change into the source code. The request goes through automated testing (if it is set up) and an admin is alerted to review, approve, and merge the change in. The change is easy to review, thanks to Git/GitHub. When the code is merged, it happens as a commit to the original project and anyone following it gets an email that something changed, with a link to the commit page in case they want to review the specific file changes.
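
To make "review just the lines that changed" concrete, here is a minimal Python sketch using the standard difflib module. It is only an illustration of a line-level diff, not how Git or GitHub compute theirs, and the file name and R snippets are made up for the example.

```python
import difflib


def changed_lines(old_text: str, new_text: str, filename: str = "analysis.R") -> str:
    """Return a unified diff between two versions of a text file."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{filename}",  # hypothetical file name, for display only
        tofile=f"b/{filename}",
        n=1,  # show one line of unchanged context around each change
    )
    return "".join(diff)


# Two made-up versions of a small R script: the second drops one line.
old = "library(lubridate)\nlibrary(stringr)\nx <- ymd('2019-01-02')\n"
new = "library(lubridate)\nx <- ymd('2019-01-02')\n"

print(changed_lines(old, new))
```

Lines starting with a minus sign were removed and lines starting with a plus sign were added. A reviewer only has to read those, plus a little surrounding context, rather than the whole file.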
 

Complete and efficient transparency of changes. Reporting of issues, including email alerts, and resolution tracking. Anyone can safely implement fixes. Easy reverting back to prior versions. Automatic updates for anyone interested in being alerted when a change happens. Full and mature toolset for collaboration. And all of this with almost no extra work on the part of the developer. This is why I love Git. I don’t have to think about any of this—it just happens as I work on my code.

Compare this to the traditional way of managing files on a network drive where multiple people have access and anyone can silently make changes. When a change is made, it requires the developer to alert users. Often the full change is not communicated, and not always to the right people. The developer is stuck managing a file of version notes, a mailing list of users, and an ever-growing folder of archives and prior versions. When an issue is found, the developer gets an email and other users aren’t made aware until the issue is resolved.

You can see why this stresses me out. And this is how I used to manage my projects. Sure, it worked, but it can be so much better.

Ready, Set, Collaborate!

The open-source approach has the potential to invite the entire organization to collaborate on projects. Anyone can add new features, report or resolve an issue, create a copy of the project for their own use, get alerted when changes happen, and easily bring those changes into their own code. Talk about breaking down silos. It’s the data-lake approach to code with a full collaboration toolbox.

I get really excited when I think about the potential. In my organization, I want to get to the point where we are all working on the same code, hosted in Bitbucket, and everyone is contributing features from their daily work doing customizations for clients. We are starting to make headway—more on this shortly.

Git Alone 

A few notes about Git. Git is the technology for tracking changes that is used by GitHub, but the two are quite different: Git is the version-control tool that runs on your computer, and GitHub is a website for hosting and collaborating on Git repositories. Like HTML, Git is not tied to a single program; you can use it through many. I use a combination of Git Bash and Visual Studio Code.

Even if you don’t push the code online or are working on a project solo, you can use Git on your projects. You make changes, then review and commit them to keep each change separate. Each commit provides a backup point for files and cleans your workspace, ready to start the next change. You focus on the change at hand, ignoring all the other code, without having to worry about breaking anything. If you decide you don’t like a change, just revert or copy/paste from the prior version.
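
One way to picture commits as backup points is the toy Python model below. It is a sketch under simplifying assumptions, not Git's actual storage format: the ToyHistory class, its method names, and the file contents are all hypothetical, and each "commit" is just a full copy of the files at that moment.

```python
from copy import deepcopy


class ToyHistory:
    """A toy stand-in for a commit history: each commit is a full snapshot."""

    def __init__(self):
        self.commits = []  # list of (message, {filename: content}) tuples

    def commit(self, message, files):
        # Store a copy so later edits to `files` cannot alter history.
        self.commits.append((message, deepcopy(files)))
        return len(self.commits) - 1  # the index serves as a simple commit id

    def restore_file(self, commit_id, filename, working_files):
        # Bring one file back to how it looked at an earlier commit.
        _, snapshot = self.commits[commit_id]
        working_files[filename] = snapshot[filename]


history = ToyHistory()
work = {"model.R": "rate <- 0.03\n", "report.md": "# Draft\n"}
first = history.commit("Initial version", work)

work["model.R"] = "rate <- 0.30\n"  # a change we later regret
history.commit("Update rate", work)

# Revert just the one file; the regretted commit stays in the history.
history.restore_file(first, "model.R", work)
print(work["model.R"])
```

The point of the sketch is that nothing is lost: the unwanted version is still recorded as its own commit, and the working copy can be moved back to any earlier backup point one file at a time.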

Figure 2 shows the view of my latest changes to my Fantasy Football project in Visual Studio Code.  

Figure 2 

The green box is the files I’ve changed since the last commit, each with a button to undo the change. Purple is the prior version of the code with red highlights for what I’ve removed. I can copy/paste this across if I want to get it back. Yellow is the current version including new parts highlighted in green. 

You can undo entire commits, or just revert a specific file to a prior commit. When I am done with a change, I quickly review the specific changes, undo any that were for testing only, and commit and push the code. Followers get emailed about the change, with a link to the commit to review exactly what changed. They can add or drop themselves from alerts. If they have a copy of my code, they can merge the change into their code with a few lines of code in Git Bash.

Git Together

The Git technology makes it easy to collaborate without worrying about stepping on one another’s toes. 

When collaborating on code, each person has their own copy of the project. When one person pushes a change to the online repository, others must download the change and resolve any conflicts before they can push their own changes. Each user is forced to keep their version up to date! 

Commits are saved online and you can see who changed what and when, and revert changes to the whole project or a specific file. All while automated emails are going out to keep everyone up to speed. 

This is the power of Git: it won’t let you break the source code (easily). With Git, you can move fast and make changes knowing it’ll alert you if there is a conflict with someone else’s work. Developers are incentivized to keep the online repo updated; otherwise they may have to resolve more conflicts.
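
The idea behind a conflict can be sketched in a few lines of Python. This is a deliberately simplified three-way comparison that assumes lines are only edited in place (no lines added or removed), which real Git merges do not require; the function name and sample values are hypothetical.

```python
def merge_lines(base, mine, theirs):
    """Line-by-line three-way merge for versions with the same number of lines.

    Returns (merged_lines, conflicts), where conflicts lists the 1-based line
    numbers that both sides changed in different ways.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this toy merge only handles in-place line edits")
    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both sides agree (or neither changed the line)
            merged.append(m)
        elif m == b:      # only the other person changed it
            merged.append(t)
        elif t == b:      # only I changed it
            merged.append(m)
        else:             # both changed it differently: a person must decide
            merged.append(b)
            conflicts.append(number)
    return merged, conflicts


base = ["trend <- 0.02", "limit <- 100", "years <- 5"]
mine = ["trend <- 0.03", "limit <- 100", "years <- 5"]
theirs = ["trend <- 0.02", "limit <- 250", "years <- 5"]
print(merge_lines(base, mine, theirs))  # edits to different lines combine cleanly

theirs_conflict = ["trend <- 0.04", "limit <- 250", "years <- 5"]
print(merge_lines(base, mine, theirs_conflict))  # line 1 is flagged as a conflict
```

When two people change different lines, both edits are kept automatically. Only when both change the same line in different ways does the merge stop and ask a person to choose.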

If an issue is identified, anyone can add it, everyone gets alerted about it, and anyone can fix it. 

Not a Silver Bullet

There are cases where the open-source approach will not work. Git is made for code. For proprietary or binary files such as Microsoft Office documents, PDFs, and pictures, it can’t show the specific changes, only that the file changed. It is made for relatively small, text-based code files, not large data files, so I save those somewhere else and refer to them in the code instead of saving them in the repo. This also prevents sensitive data from being stored in the repo.
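
A rough way to see why binary files only register as "changed" is to look at how Git labels file contents. In its default SHA-1 object format, Git gives each file's content an id computed from a short header plus the raw bytes. The Python sketch below reproduces that calculation; the byte strings standing in for two workbook versions are made up, and the function name is hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Id Git assigns to file content in its default SHA-1 object format."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Made-up bytes standing in for two saved versions of a binary workbook.
old_workbook = bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00])
new_workbook = bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x08])

print(blob_id(old_workbook))
print(blob_id(new_workbook))
print(blob_id(old_workbook) != blob_id(new_workbook))
```

Different ids tell you the file changed, and that is all. With a text file there are lines to compare, so the change can be shown precisely; with a binary file there is nothing readable to show.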

This doesn’t mean that if you rely on Excel you can’t use the open-source approach. You definitely can—you just need to transition your work from Excel to a code-based approach. I could write a whole other article on the benefits of a code-based approach. Using R, I capture all my work from raw data to finished product in a format that is easy to follow, modify, and replicate. It’s faster, once you are through the learning curve, and honestly just more fun. That’s an aspect of code we don’t talk about enough. It’s creative and flow-y and really a blast.

My strong preference for using Git has led me to abandon compiled formats for those based on text: Excel for CSV, and Word and PDF for Markdown, so that I can complete projects and easily track changes using Git.

Git can be difficult to learn. It is a paradigm shift—which is part of the reason it is so impactful—so there is a learning curve that takes about a month to get through.

Git is open-ended; you choose how you want to use it. There is no best way to use Git, although many helpful guides can be found online. It requires some creativity to tailor it to your workflow and culture. As an actuarial consultant, I use Git quite differently than someone at a software company. And that’s OK because with Git, best practice depends on context.

The Future of Collaboration, and How I Use Git

With all these features, I expect it is clear that the open-source approach is a game-changer. Look for projects where the approach could be used. It works best with code-based projects, but each year more and more projects rely on code. I’m an actuary and I write code all day. Granted, that isn’t normal for an actuary. Many college graduates today know and want to use R and Python. It’s just more fun to have the freedom to code custom solutions. 

If you are working in a software company, I expect you are already working this way. If you are in a traditional office, you have an uphill battle pushing adoption of this new technology. It’s like any change; people will resist it and it will take extra time at first.

You’ll need to start using it and get very comfortable with it yourself. You’ll need to create guides and make yourself available for questions, and identify super-users and early adopters to start with. You’ll need to clearly identify the value proposition and sell it with real-world examples from your work.

In my work, it started slow, with me working on Bitbucket alone and using it as a way to track issues for people who rely on my projects, and to learn the technology. Things really sped up when I stumbled onto my killer app: a project I built that many other people in the organization use and customize. I made the decision to only publish on Bitbucket, effectively forcing users to start using the open-source approach (unknowingly). This may sound harsh, but for this project it was by far the best approach both for me and my collaborators. For me: I get to use Git and don’t have to manage a list of email addresses or versions. Because we are using code forks on Bitbucket, I can see where my code has been copied, and review and follow the copies as they change. For them: They can easily and safely merge fixes and new features as I add them. Just run three lines of code in Git Bash when you get an email that I’ve made a change.

I hope in the future that these collaborators will add their own features and send me a pull request. We’ll merge these into the source code and allow all the other people using that code to bring it in. Colleagues will add issues and anyone working with the code can go in and make fixes.

How to Get Started

It’s easy: Pick a project and start using Git. Then push it to GitHub or Bitbucket. GitHub offers unlimited private repos and has a wider feature set.

Eventually you’ll need to collaborate with someone on a project. Introduce them to open source! Project by project, the culture will organically move to the open-source approach. It’s just that effective.

 

Endnotes

 

[i] https://en.wikipedia.org/wiki/Git

[ii] https://www.wired.com/2016/08/linux-took-web-now-taking-world/

[iii] https://news.microsoft.com/2018/06/04/microsoft-to-acquire-github-for-7-5-billion/

[iv] https://octoverse.github.com/

[v] https://github.com/superchordate

[vi] https://www.linkedin.com/company/uptake/

[vii] https://github.com/tidyverse/lubridate/pull/725