How GitHub uses CodeQL to secure GitHub

GitHub’s Product Security Engineering team writes code and implements tools that help secure the code that powers GitHub. We use GitHub Advanced Security (GHAS) to discover, track, and remediate vulnerabilities and enforce secure coding standards at scale. One tool we rely on heavily to analyze our code is CodeQL.

CodeQL is GitHub’s static analysis engine that powers automated security analyses. You can use it to query code in much the same way you would query a database. It provides a much more robust way to analyze code and uncover problems than an old-fashioned text search through a codebase.

This post details how we use CodeQL to keep GitHub secure and how you can apply these lessons to your own organization. You will learn why and how we use:

Custom query packs (and how we create and manage them).

Custom queries.

Variant analysis to uncover potentially insecure programming practices.

Enabling CodeQL at scale

We employ CodeQL in a variety of ways at GitHub.

Default setup, using the default and security-extended query suites

Advanced setup with a custom query pack

Multi-repository variant analysis (MRVA)

The specific custom Actions workflow step we use on our monolith is pretty simple. It looks like this:

    - name: Initialize CodeQL
      uses: github/codeql-action/init@v3
      with:
        languages: ${{ matrix.language }}
        config-file: ./.github/codeql/${{ matrix.language }}/codeql-config.yml

Our Ruby configuration is pretty standard, but advanced setup offers a variety of configuration options using custom configuration files. The interesting part is the packs option, which is how we enable our custom query pack as part of the CodeQL analysis. This pack contains a collection of CodeQL queries we have written for Ruby, specifically for the GitHub codebase.
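For illustration, a per-language configuration file that enables a custom pack can contain little more than a packs entry. The Ruby sketch below builds and prints such a file. The helper name codeql_config is hypothetical, the pack name is the one from the qlpack.yml shown later in this post, and the optional @version suffix is CodeQL's syntax for pinning a pack. It is included only to show the alternative, not because GitHub's configuration uses it.

```ruby
require "yaml"

# Hypothetical helper: returns the text of a minimal codeql-config.yml that
# enables one query pack, optionally pinned to a version.
def codeql_config(pack, version: nil)
  entry = version ? "#{pack}@#{version}" : pack
  { "packs" => [entry] }.to_yaml
end

# Unpinned: the analysis resolves the latest published version of the pack.
latest = codeql_config("github/internal-ruby-codeql")

# Pinned: the analysis keeps using the named version until the config changes.
pinned = codeql_config("github/internal-ruby-codeql", version: "0.2.3")

puts latest
puts pinned
```

The unpinned form trades reproducibility for faster rollout: a newly published pack version takes effect without any change to the repository that consumes it.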

So, let’s dive deeper into why we did that—and how!

Publishing our CodeQL query pack

Initially, we published CodeQL query files directly to the GitHub monolith repository, but we moved away from this approach for several reasons:

Queries published as loose files are not pre-compiled.

The test suite for our CodeQL queries was tied to the monolith repository.

By switching to publishing a query pack to GitHub Container Registry (GCR), we’ve simplified our process and eliminated many of these pain points, making it easier to ship and maintain our CodeQL queries. So while it’s possible to deploy custom CodeQL query files directly to a repository, we recommend publishing CodeQL queries as a query pack to the GCR for easier deployment and faster iteration.

Creating our query pack

When setting up our custom query pack, we faced several considerations, particularly around managing dependencies like the ruby-all package.

To keep our custom queries concise and maintainable, we extend classes from CodeQL’s standard libraries, such as the ruby-all library pack, rather than reinventing the wheel. However, changes to the CodeQL library API can be breaking, deprecating classes our queries rely on or causing errors. Since CodeQL runs as part of our CI, we wanted to minimize the chance of this happening, as it can lead to frustration and loss of trust from developers.

We develop our queries against the latest version of the ruby-all package, ensuring we’re always working with the most up-to-date functionality. To mitigate the risk of breaking changes affecting CI, we pin the ruby-all version when we’re ready to release, locking it in the codeql-pack.lock.yml file. This guarantees that when our queries are deployed, they will run with the specific version of ruby-all we’ve tested, avoiding potential issues from unintentional updates.

Here’s how we manage this setup:

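In qlpack.yml, we set the ruby-all dependency to "*".

During development, this configuration pulls in the latest version of ruby-all when we run codeql pack init.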

    # Our custom query pack's qlpack.yml
    library: false
    name: github/internal-ruby-codeql
    version: 0.2.3
    extractor: 'ruby'
    dependencies:
      codeql/ruby-all: "*"
    tests: 'test'
    description: "Ruby CodeQL queries used internally at GitHub"

Before releasing, we pin the tested ruby-all version in codeql-pack.lock.yml:

    # Our custom query pack's codeql-pack.lock.yml
    lockVersion: 1.0.0
    dependencies:
      ...
      codeql/ruby-all:
        version: 1.0.6

This approach allows us to balance developing against the latest features of the ruby-all package while ensuring stability when we release.

We also have a set of CodeQL unit tests that exercise our queries against sample code snippets, which helps us quickly determine if any query will cause errors before we publish our pack. These tests are run as part of the CI process in our query pack repository, providing an early check for issues. We strongly recommend writing unit tests for your custom CodeQL queries to ensure stability and reliability.

Altogether, the basic flow for releasing new CodeQL queries via our pack is as follows:

codeql pack init

We have found this flow balances our team’s development experience while ensuring stability in our published query pack.

Configuring our repository to use our custom query pack

We won’t provide a general recommendation on configuration here, given that it ultimately depends on how your organization deploys code. We opted against locking our pack to a particular version in our CodeQL configuration file (see above). Instead, we chose to manage our versioning by publishing the CodeQL package in GCR. This results in the GitHub monolith retrieving the latest published version of the query pack. To roll back changes, we simply have to republish the package. In one instance, we released a query that had a high number of false positives and we were able to publish a new version of the pack that removed that query in less than 15 minutes. This is faster than the time it would have taken us to merge a pull request on the monolith repository to roll back the version in the CodeQL configuration file.

One of the problems we encountered with publishing the query pack in GCR was how to easily make the package available to multiple repositories within our enterprise. There are several approaches we explored.

Grant access permissions for individual repositories.

Mint a personal access token for the CodeQL action runner.

Provide access permissions via a linked repository: link a repository to the package, then have the package inherit access permissions from the linked repository.

CodeQL query pack queries

We write a variety of custom queries to be used in our custom query packs. These cover GitHub-specific patterns that aren’t included in the default CodeQL query pack. This allows us to tailor the analysis to patterns and preferences that are specific to our company and codebase. Some of the types of things we alert on using our custom query pack include:

High-risk APIs specific to GitHub’s code that can be dangerous if they receive unsanitized user input.

Use of specific built-in Rails methods for which we have safer, custom methods or functions.

Required authorization methods not being used in our REST API endpoint definitions and GraphQL object/mutation definitions.

REST API endpoints and GraphQL mutations that require engineers to define access control methods to determine which actors can access them. (Specifically, the query detects the absence of this method definition to ensure that the actors’ permissions are being checked for these endpoints.)

Use of signed tokens so we can nudge engineers to include Product Security as a reviewer when using them.

Custom queries can also serve an educational purpose rather than block code from shipping. For example, we want to alert engineers when they use the ActiveRecord::decrypt method. This method should generally not be used in production code, as it decrypts an encrypted column and saves it unencrypted. We set the recommendation severity in the query metadata so that these alerts are treated as informational. That means the query may trigger an alert in a pull request, but it won’t cause the CodeQL CI job to fail. The lower severity lets engineers assess the impact of new queries without being blocked immediately. Additionally, this alert level isn’t tracked through our Fundamentals program, meaning it doesn’t require immediate action, which reflects the query’s maturity as we continue to refine its relevance and risk assessment.

    /**
     * @id rb/github/use-of-activerecord-decrypt
     * @description Do not use the .decrypt method on AR models, this will decrypt all encrypted attributes and save
     * them unencrypted, effectively undoing encryption and possibly making the attributes inaccessible.
     * If you need to access the unencrypted value of any attribute, you can do so by calling my_model.attribute_name.
     * @kind problem
     * @problem.severity recommendation
     * @name Use of ActiveRecord decrypt method
     * @tags security
     *      github-internal
     */

    import ruby
    import DataFlow
    import codeql.ruby.DataFlow
    import codeql.ruby.frameworks.ActiveRecord

    /** Match against .decrypt method calls where the receiver may be an ActiveRecord object */
    class ActiveRecordDecryptMethodCall extends ActiveRecordInstanceMethodCall {
      ActiveRecordDecryptMethodCall() { this.getMethodName() = "decrypt" }
    }

    from ActiveRecordDecryptMethodCall call
    select call,
      "Do not use the .decrypt method on AR models, this will decrypt all encrypted attributes and save them unencrypted."
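To make the distinction in the query's description concrete, here is a minimal Ruby sketch of why reading an attribute is preferred over calling decrypt. It does not use Rails: FakeEncryptedModel, its STORE hash, and the Base64 "cipher" are hypothetical stand-ins (Base64 is an encoding, not encryption). They only mimic the behavior described above, where reading an attribute returns the plaintext and leaves storage alone, while decrypt rewrites the stored value unencrypted.

```ruby
# Hypothetical stand-in for an Active Record model with one encrypted column.
# Base64 (via pack/unpack) is NOT encryption; it only marks "encrypted at rest".
class FakeEncryptedModel
  STORE = {} # simulates the database row: column name => stored value

  def initialize(email)
    STORE[:email] = [email].pack("m0")
  end

  # Reading the attribute decodes in memory and leaves storage untouched.
  def email
    STORE[:email].unpack1("m0")
  end

  # Mimics the discouraged call: writes the plaintext back to storage.
  def decrypt
    STORE[:email] = email
    self
  end
end

user = FakeEncryptedModel.new("octocat@example.com")
plaintext = user.email                            # preferred: read the attribute
stored_before = FakeEncryptedModel::STORE[:email] # still encoded at rest
user.decrypt                                      # the pattern the query alerts on
stored_after = FakeEncryptedModel::STORE[:email]  # now plaintext at rest
```

In this toy model the attribute reader would also fail after decrypt, because it would try to decode a value that is no longer encoded. That loosely mirrors the description's warning that decrypted attributes can become inaccessible.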

Another educational query is the one mentioned above in which we detect the absence of the `control_access` method in a class that defines a REST API endpoint. If a pull request introduces a new endpoint without `control_access`, a comment will appear on the pull request saying that the `control_access` method wasn’t found and it’s a requirement for REST API endpoints. This will notify the reviewer of a potential issue and prompt the developer to fix it.

    /**
     * @id rb/github/api-control-access
     * @name Rest API Without 'control_access'
     * @description All REST API endpoints must call the 'control_access' method, to ensure that only specified actor types are able to access the given endpoint.
     * @kind problem
     * @tags security
     * github-internal
     * @precision high
     * @problem.severity recommendation
     */

    import codeql.ruby.AST
    import codeql.ruby.DataFlow
    import codeql.ruby.TaintTracking
    import codeql.ruby.ApiGraphs

    // Api::App REST API endpoints should generally call the control_access method
    private DataFlow::ModuleNode appModule() {
      result = API::getTopLevelMember("Api").getMember("App").getADescendentModule() and
      not result = protectedApiModule() and
      not result = staffAppApiModule()
    }

    // Api::Admin, Api::Staff, Api::Internal, and Api::ThirdParty REST API endpoints do not need to call the control_access method
    private DataFlow::ModuleNode protectedApiModule() {
      result =
        API::getTopLevelMember(["Api"])
            .getMember(["Admin", "Staff", "Internal", "ThirdParty"])
            .getADescendentModule()
    }

    // Api::Staff::App REST API endpoints do not need to call the control_access method
    private DataFlow::ModuleNode staffAppApiModule() {
      result =
        API::getTopLevelMember(["Api"]).getMember("Staff").getMember("App").getADescendentModule()
    }

    private class ApiRouteWithoutControlAccess extends DataFlow::CallNode {
      ApiRouteWithoutControlAccess() {
        this = appModule().getAModuleLevelCall(["get", "post", "delete", "patch", "put"]) and
        not performsAccessControl(this.getBlock())
      }
    }

    predicate performsAccessControl(DataFlow::BlockNode blocknode) {
      accessControlCalled(blocknode.asExpr().getExpr())
    }

    predicate accessControlCalled(Block block) {
      // the method `control_access` is called somewhere inside `block`
      block.getAStmt().getAChild*().(MethodCall).getMethodName() = "control_access"
    }

    from ApiRouteWithoutControlAccess api
    select api.getLocation(),
      "The control_access method was not detected in this REST API endpoint. All REST API endpoints must call this method to ensure that the endpoint is only accessible to the specified actor types."
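The shape of code this query looks for can be sketched with a small, self-contained Ruby example. Everything here is hypothetical and is not GitHub's real code: the miniature routing DSL, the Api::Repositories class, the route paths, and the :repo_reader argument are invented for illustration. The only names taken from the query are the Api::App base class, the get route method, and control_access. Snippets like these are also the kind of sample code a CodeQL unit test for the query could use.

```ruby
# Hypothetical miniature of a Sinatra-style routing DSL; not GitHub's real code.
module Api
  class App
    def self.routes
      @routes ||= {}
    end

    # Registers a handler block for a GET path.
    def self.get(path, &handler)
      routes[path] = handler
    end

    attr_reader :access_checked

    # Stand-in for the real authorization helper; the argument is illustrative.
    def control_access(actor_type)
      @access_checked = actor_type
    end

    # Runs the handler for `path` and reports whether control_access was called.
    def self.call(path)
      app = new
      body = app.instance_exec(&routes.fetch(path))
      [body, !app.access_checked.nil?]
    end
  end

  class Repositories < App
    # control_access is called inside the route block, so we would expect
    # the query not to report this endpoint.
    get "/repositories/1" do
      control_access :repo_reader
      "repository details"
    end

    # No control_access call anywhere in the block, so we would expect
    # the query to report this endpoint.
    get "/repositories/1/secrets" do
      "repository secrets"
    end
  end
end

checked = Api::Repositories.call("/repositories/1")
unchecked = Api::Repositories.call("/repositories/1/secrets")
```

The runtime flag exists only so the example can be executed. The query itself works statically, on the syntax of each route block, which is why it can comment on a pull request before the code ever runs.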

Variant analysis

Variant analysis (VA) refers to the process of searching for variants of security vulnerabilities. This is particularly useful when we’re responding to a bug bounty submission or a security incident. We use a combination of tools to do this, including GitHub’s code search functionality, custom scripts, and CodeQL. We will often start by using code search to find patterns similar to the one that caused a particular vulnerability across numerous repositories. This is sometimes not good enough, as code search is not semantically aware, meaning that it cannot determine whether a given variable is an Active Record object or whether it is being used in an `if` expression. To answer those types of questions we turn to CodeQL.

When we write CodeQL queries for variant analysis we are much less concerned about false positives, since the goal is to provide results for security engineers to analyze. The quality of the code is also not quite as important, as these queries will only be used for the duration of the VA effort. Some of the types of things we use CodeQL for during VAs are:

Where are we using SHA1 hashes?

One of our internal API endpoints was vulnerable to SQLi according to a recent bug bounty report. Where are we passing user input to that API endpoint?

There is a problem with how some HTTP request libraries in Ruby handle the proxy setting. Can we look at places we are instantiating our HTTP request libraries with a proxy setting?

One recent example involved a subtle vulnerability in Rails. We wanted to detect when the following condition was present in our code:

A parameter was used to look up an Active Record object.

That parameter is later reused after the Active Record object is looked up.

The concern with this condition is that it could lead to an insecure direct object reference (IDOR) vulnerability because Active Record finder methods can accept an array. If the code looks up an Active Record object in one call to determine if a given entity has access to a resource, but later uses a different element from that array to find an object reference, that can lead to an IDOR vulnerability. It would be difficult to write a query to detect all vulnerable instances of this pattern, but we were able to write a query that found potential vulnerabilities that gave us a list of code paths to manually analyze. We ran the query against a large number of our Ruby codebases using CodeQL’s MRVA.
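A minimal Ruby sketch of this pattern follows, assuming nothing about GitHub's actual code. FakeModel is a hypothetical in-memory stand-in for an Active Record model, and its find_by imitates only one behavior: an array value matches any of the listed ids and a single row comes back. The action names, the data, and the choice to reuse the last array element are all invented for illustration.

```ruby
# Hypothetical in-memory stand-in for an Active Record model (no Rails needed).
Record = Struct.new(:id, :owner)

class FakeModel
  ROWS = [Record.new(1, "alice"), Record.new(2, "bob")].freeze

  # Imitates a finder that accepts a scalar or an array: an array matches any
  # listed id, and the first matching row is returned.
  def self.find_by(id:)
    wanted = Array(id)
    ROWS.find { |row| wanted.include?(row.id) }
  end
end

# Vulnerable shape: the parameter is used for the lookup, then reused afterwards.
def vulnerable_action(params, current_user)
  record = FakeModel.find_by(id: params[:id]) # lookup used for the access check
  return :forbidden unless record && record.owner == current_user

  # The raw parameter is reused after the lookup. If it is an array, a
  # different element can select a row that was never checked.
  FakeModel.find_by(id: Array(params[:id]).last)
end

# Safer shape: keep using the object that passed the access check.
def safer_action(params, current_user)
  record = FakeModel.find_by(id: params[:id])
  return :forbidden unless record && record.owner == current_user

  record
end

leaked = vulnerable_action({ id: [1, 2] }, "alice")
kept = safer_action({ id: [1, 2] }, "alice")
denied = vulnerable_action({ id: 2 }, "alice")
```

Here alice passes the check with her own record (id 1) but receives bob's record (id 2) from the vulnerable action, while a direct request for id 2 is refused. A text search for params[:id] would match both actions equally, which is why a semantically aware query is needed to narrow down the candidates.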

The query, which is a bit hacky and not quite production grade, is below:

    /**
     * @name wip array query
     * @description an array is passed to an AR finder object
     */

    import ruby
    import codeql.ruby.AST
    import codeql.ruby.ApiGraphs
    import codeql.ruby.frameworks.Rails
    import codeql.ruby.frameworks.ActiveRecord
    import codeql.ruby.frameworks.ActionController
    import codeql.ruby.DataFlow
    import codeql.ruby.Frameworks
    import codeql.ruby.TaintTracking

    // Gets the "final" receiver in a chain of method calls.
    // For example, in `Foo.bar`, this would give the `Foo` access, and in
    // `foo.bar.baz("arg")` it would give the `foo` variable access
    private Expr getUltimateReceiver(MethodCall call) {
      exists(Expr recv |
        recv = call.getReceiver() and
        (
          result = getUltimateReceiver(recv)
          or
          not recv instanceof MethodCall and result = recv
        )
      )
    }

    // Names of class methods on ActiveRecord models that may return one or more
    // instances of that model. This also includes the `initialize` method.
    // See https://api.rubyonrails.org/classes/ActiveRecord/FinderMethods.html
    private string staticFinderMethodName() {
      exists(string baseName |
        baseName = ["find_by", "find_or_create_by", "find_or_initialize_by", "where"] and
        result = baseName + ["", "!"]
      )
      // or
      // result = ["new", "create"]
    }

    private class ActiveRecordModelFinderCall extends ActiveRecordModelInstantiation, DataFlow::CallNode
    {
      private ActiveRecordModelClass cls;

      ActiveRecordModelFinderCall() {
        exists(MethodCall call, Expr recv |
          call = this.asExpr().getExpr() and
          recv = getUltimateReceiver(call) and
          (
            // The receiver refers to an `ActiveRecordModelClass` by name
            recv.(ConstantReadAccess).getAQualifiedName() = cls.getAQualifiedName()
            or
            // The receiver is self, and the call is within a singleton method of
            // the `ActiveRecordModelClass`
            recv instanceof SelfVariableAccess and
            exists(SingletonMethod callScope |
              callScope = call.getCfgScope() and
              callScope = cls.getAMethod()
            )
          ) and
          (
            call.getMethodName() = staticFinderMethodName()
            or
            // dynamically generated finder methods
            call.getMethodName().indexOf("find_by_") = 0
          )
        )
      }

      final override ActiveRecordModelClass getClass() { result = cls }
    }

    class FinderCallArgument extends DataFlow::Node {
      private ActiveRecordModelFinderCall finderCallNode;

      FinderCallArgument() { this = finderCallNode.getArgument(_) }
    }

    class ParamsHashReference extends DataFlow::CallNode {
      private Rails::ParamsCall params;

      // TODO: only direct element references against `params` calls are considered
      ParamsHashReference() { this.getReceiver().asExpr().getExpr() = params }

      string getArgString() {
        result = this.getArgument(0).asExpr().getConstantValue().getStringlikeValue()
      }
    }

    class ArrayPassedToActiveRecordFinder extends TaintTracking::Configuration {
      ArrayPassedToActiveRecordFinder() { this = "ArrayPassedToActiveRecordFinder" }

      override predicate isSource(DataFlow::Node source) { source instanceof ParamsHashReference }

      override predicate isSink(DataFlow::Node sink) {
        sink instanceof FinderCallArgument
      }

      string getParamsArg(DataFlow::CallNode paramsCall) {
        result = paramsCall.getArgument(0).asExpr().getConstantValue().getStringlikeValue()
      }

      // this doesn't check for anything fancy like whether it's reuse in a if/else
      // only intended for quick manual audit filtering of interesting candidates
      // so remains fairly broad to not induce false negatives
      predicate paramsUsedAfterLookups(DataFlow::Node source) {
        exists(DataFlow::CallNode y | y instanceof ParamsHashReference
        and source.getEnclosingMethod() = y.getEnclosingMethod()
        and source != y
        and getParamsArg(source) = getParamsArg(y)
        // we only care if it's used again AFTER an object lookup
        and y.getLocation().getStartLine() > source.getLocation().getStartLine())
      }
    }

    from ArrayPassedToActiveRecordFinder config, DataFlow::Node source, DataFlow::Node sink
    where config.hasFlow(source, sink) and config.paramsUsedAfterLookups(source)
    select source, sink.getLocation()

Conclusion

CodeQL can be very useful for product security engineering teams to detect and prevent vulnerabilities at scale. We use a combination of queries that run in CI using our query pack and one-off queries run through MRVA to find potential vulnerabilities and communicate them to engineers. CodeQL isn’t only useful for finding security vulnerabilities, though; it is also useful for detecting the presence or absence of security controls that are defined in code. This saves our security team time by surfacing certain security problems automatically, and saves our engineers time by detecting them earlier in the development process.

Writing custom CodeQL queries

Tips for getting started

We have a large number of articles and resources for writing custom CodeQL queries. If you haven’t written custom CodeQL queries before, here are some resources to help get you started:

CodeQL zero to hero part 1: The fundamentals of static analysis for vulnerability research - The GitHub Blog

Writing CodeQL queries

Use CodeQL inside Visual Studio Code - GitHub Docs

CodeQL workshops for GitHub Universe

GitHub Satellite 2020 workshops on finding security vulnerabilities with CodeQL for Java/JavaScript.

A beginner’s guide to running and managing custom CodeQL queries

Improve the security of your applications today by enabling CodeQL for free on your public repositories, or try GitHub Advanced Security for your organization.

Michael Recachinas, GitHub Staff Security Engineer, also contributed to this blog post.

The post How GitHub uses CodeQL to secure GitHub appeared first on The GitHub Blog.