The GitHub Blog, December 18, 2024
CodeQL zero to hero part 4: Gradio framework case study

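This article takes an in-depth look at how to use CodeQL to analyze the security of Gradio, a popular Python web framework, an effort that uncovered 11 potential vulnerabilities. By modeling the Gradio framework, the author explains in detail how to identify the framework's sources (user input points) and sinks (potentially dangerous functionality), and how to look for security issues through dynamic testing and code review. The article also recaps the basic concepts of CodeQL and emphasizes the role of data flow analysis in vulnerability research. Besides presenting the security issues the author found in popular open source projects such as AUTOMATIC1111/stable-diffusion-webui, it gives readers a practical guide to modeling new frameworks and libraries with CodeQL.

💡 Gradio is a Python web framework for demoing machine learning applications, and its popularity keeps growing. By modeling the Gradio framework, this article reveals its potential security risks.

🛠️ CodeQL finds vulnerabilities through data flow analysis, which requires models of the sources (such as HTTP request parameters) and sinks (such as functions that execute SQL queries) in a framework. If a data flow path exists between them and there is no sanitizer on the way, CodeQL reports a vulnerability.

🔍 The research process included reviewing the documentation and code to identify sources and sinks, dynamically testing key elements, and checking for security issues related to Gradio. Gradio's Interface and Blocks classes are an important part of the attack surface, and their input parameters and event handling deserve attention.

⚠️ In the Interface example, the textbox accepts arbitrary strings, and the slider, which in theory should be limited to integers, in practice accepts non-integer values, which can lead to runtime errors. These gaps in input validation create openings for potential vulnerabilities.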

Gradio is a Python web framework for demoing machine learning applications, which in the past few years has exploded in popularity. In this blog, you’ll follow along with the process in which I modeled (that is, added support for) the Gradio framework, finding 11 vulnerabilities to date in a number of open source projects, including AUTOMATIC1111/stable-diffusion-webui, one of the most popular projects on GitHub, which was included in the 2023 Octoverse report and 2024 Octoverse report. Check out the vulnerabilities I’ve found on the GitHub Security Lab’s website.

Following the process outlined in this blog, you will learn how to model new frameworks and libraries in CodeQL and scale your research to find more vulnerabilities.

This blog is written to be read standalone; however, if you are new to CodeQL or would like to dig deeper into static analysis and CodeQL, you may want to check out the previous parts of my CodeQL zero to hero blog series. Each deals with a different topic: static analysis fundamentals, writing CodeQL, and using CodeQL for security research.

    CodeQL zero to hero part 1: The fundamentals of static analysis for vulnerability research
    CodeQL zero to hero part 2: Getting started with CodeQL
    CodeQL zero to hero part 3: Security research with CodeQL

Each also has accompanying exercises, which are in the above blogs, and in the CodeQL zero to hero repository.

Quick recap

CodeQL uses data flow analysis to find vulnerabilities. It uses models of sources (for example, an HTTP GET request parameter) and sinks in libraries and frameworks that could cause a vulnerability (for example, cursor.execute from MySQLdb, which executes SQL queries), and checks if there is a data flow path between the two. If there is, and there is no sanitizer on the way, CodeQL will report a vulnerability in the form of an alert.
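
To make the source, sink, and sanitizer vocabulary concrete, here is a minimal sketch of the kind of flow a SQL injection query looks for. It uses Python's built-in sqlite3 module as a stand-in for MySQLdb, and the function names and table are hypothetical, chosen only for illustration.

```python
import sqlite3

def make_db():
    # Small in-memory database used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn

def find_user_unsafe(conn, name):
    # `name` plays the role of a source: it is attacker-controlled.
    # String concatenation carries it, unchanged, into the SQL text.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()  # sink: executes the SQL string

def find_user_safe(conn, name):
    # A parameterized query passes the input as data, so the untrusted
    # value never becomes part of the SQL text that reaches the sink.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

if __name__ == "__main__":
    conn = make_db()
    payload = "x' OR '1'='1"
    print(len(find_user_unsafe(conn, payload)))  # 2: the condition matches every row
    print(len(find_user_safe(conn, payload)))    # 0: no user has that literal name
```

In data flow terms, the first function has an unbroken path from the parameter to the executed query string, while the second one does not.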

Check out the first blog of the CodeQL zero to hero series to learn more about sources, sinks, and data flow analysis. See the second blog to learn about the basics of writing CodeQL. Head on to the third blog of the CodeQL zero to hero series to implement your own data flow analysis on certain code elements in CodeQL.

Motivation

CodeQL has models for the majority of the most popular libraries and frameworks, and new ones are continuously being added to improve the detection of vulnerabilities.

One such example is Gradio. We will go through the process of analyzing it and modeling it today. Looking into a new framework or a library and modeling it with CodeQL is a perfect chance to do research on the projects using that framework, and potentially find multiple vulnerabilities at once.

Hat tip to my coworker, Alvaro Munoz, for giving me a tip about Gradio, and for his guide on researching Apache Dubbo and writing models for it, which served as inspiration for my research (if you are learning CodeQL and haven’t checked it out yet, you should!).

Research and CodeQL modeling process

Frameworks have sources and sinks, and that’s what we are interested in identifying, and later modeling in CodeQL. A framework may also provide user input sanitizers, which we are also interested in.

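The process consisted, among others, of:

    Reviewing the documentation and code to identify sources and sinks.
    Dynamically testing interesting elements.
    Checking for security issues related to Gradio.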

Let me preface here that it’s natural that a framework like Gradio has sources and sinks, and there’s nothing inherently wrong about it. All frameworks have sources and sinks—Django, Flask, Tornado, and so on. The point is that if the classes and functions provided by the frameworks are not used in a secure way, they may lead to vulnerabilities. And that’s what we are interested in catching here, for applications that use the Gradio framework.

Gradio

Gradio is a Python web framework for demoing machine learning applications, which in the past few years has become increasingly popular.

Gradio’s documentation is thorough and gives a lot of good examples to get started with using Gradio. We will use some of them, modifying them a little bit, where needed.

Gradio Interface

We can create a simple interface in Gradio by using the Interface class.

import gradio as gr

def greet(name, intensity):
    return "Hello, " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=[gr.Textbox(), gr.Slider()],
    outputs=[gr.Textbox()])

demo.launch()

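In this example, the Interface class takes three arguments:

    fn: the function that processes the inputs, here greet.
    inputs: the input components, here a textbox and a slider.
    outputs: the output components, here a textbox.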

Running the code will start an application with the following interface. We provide example inputs, “Sylwia” in the textbox and “3” in the slider, and submit them, which results in an output, “Hello, Sylwia!!!”.

Example app written using Gradio Interface class

Gradio Blocks

Another popular way of creating applications is by using the gr.Blocks class with a number of components; for example, a dropdown list, a set of radio buttons, checkboxes, and so on. Gradio documentation describes the Blocks class in the following way:

Blocks offers more flexibility and control over: (1) the layout of components (2) the events that trigger the execution of functions (3) data flows (for example, inputs can trigger outputs, which can trigger the next level of outputs).

With gr.Blocks, we can use certain components as event listeners, for example, a click of a given button, which will trigger execution of functions using the input components we provided to them.

In the following code we create a number of input components: a slider, a dropdown, a checkbox group, a radio buttons group, and a checkbox. Then, we define a button, which will execute the logic of the sentence_builder function on a click of the button and output the results of it as a textbox.

import gradio as gr

def sentence_builder(quantity, animal, countries, place, morning):
    return f"""The {quantity} {animal}s from {" and ".join(countries)} went to the {place} in the {"morning" if morning else "night"}"""

with gr.Blocks() as demo:
    gr.Markdown("Choose the options and then click **Run** to see the output.")
    with gr.Row():
        quantity = gr.Slider(2, 20, value=4, label="Count", info="Choose between 2 and 20")
        animal = gr.Dropdown(["cat", "dog", "bird"], label="Animal", info="Will add more animals later!")
        countries = gr.CheckboxGroup(["USA", "Japan", "Pakistan"], label="Countries", info="Where are they from?")
        place = gr.Radio(["park", "zoo", "road"], label="Location", info="Where did they go?")
        morning = gr.Checkbox(label="Morning", info="Did they do it in the morning?")
    btn = gr.Button("Run")
    btn.click(
        fn=sentence_builder,
        inputs=[quantity, animal, countries, place, morning],
        outputs=gr.Textbox(label="Output")
    )

if __name__ == "__main__":
    demo.launch(debug=True)

Running the code and providing example inputs will give us the following results:

Example app written using Gradio Blocks

Identifying attack surface in Gradio

Given the code examples, the next step is to identify how a Gradio application might be written in a vulnerable way, and what we could consider a source or a sink. A good point to start is to run a few code examples, use the application the way it is meant to be used, observe the traffic, and then poke at the application for any interesting areas.

The first points that stood out to me for investigation are the variables passed to the inputs keyword argument in the Interface example app above, and the button click event handler in the Blocks example.

Let’s start by running the above Gradio Interface example app and observing its traffic through your favorite proxy (or in your browser’s DevTools). Filling out the form with example values (string "Sylwia" and integer 3) shows the data being sent:

Traffic in the Gradio Interface example app, observed in Firefox DevTools

The values we set are sent as a string “Sylwia” and an integer 3 in a JSON, in the value of the “data” key. Here are the values as seen in Burp Suite:

JSON in the request to the example Interface application
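
As a rough illustration of why this matters, the sketch below round-trips two payloads through JSON. Only the "data" key is taken from the request described above; everything else in a real request is left out, and the helper name decoded_types is hypothetical. The point is simply that JSON carries whatever types the client chooses to send.

```python
import json

def decoded_types(payload):
    """Round-trip a payload through JSON and report the type of each value."""
    body = json.dumps(payload)
    values = json.loads(body)["data"]
    return [type(v).__name__ for v in values]

# Only the "data" key comes from the request described above; real requests
# may carry other fields, which are deliberately left out of this sketch.
as_intended = {"data": ["Sylwia", 3]}
unexpected = {"data": [1000, "high"]}

print(decoded_types(as_intended))  # ['str', 'int']
print(decoded_types(unexpected))   # ['int', 'str']
```

Nothing in the wire format ties a position in the list to the type of the component that produced it, which is what the following tests probe.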

The text box naturally allows for setting any string value we would like, and that data will later be processed by the application as a string. What if I try to set it to something other than a string, for example, an integer 1000?

Testing setting the textbox value to an integer

Turns out that’s allowed. What about a slider? You might expect that we can only set the values that are restricted in the code (so, here these should be integer values from 2 to 20). Could we send something else, for example, a string "high"?

Testing setting the slider value to a string

We see an error:

File "/**/**/**/example.py", line 4, in greet
    return "Hello, " + name + "!" * int(intensity)
                                    ^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'high'

That’s very interesting.

The error didn’t come up from us setting a string value on a slider (which should only allow integer values from 2 to 20), but from the int function, which converts a value to an integer in Python. Meaning, until that point, the value from the slider can be set to anything, and can be used in any dangerous functions (sinks) or otherwise sensitive functionality. All in all, perfect candidate for sources.
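
Seen from the application side, this means the handler function is the first place where the value can be checked. Below is a minimal sketch of a hardened handler, assuming the 2 to 20 range mentioned above; greet_checked and its low and high parameters are hypothetical and not part of Gradio.

```python
def greet_checked(name, intensity, low=2, high=20):
    # Hypothetical hardened variant of `greet`; the bounds are assumptions
    # borrowed from the range mentioned in the text, not Gradio defaults.
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(intensity, bool) or not isinstance(intensity, (int, float)):
        raise ValueError("intensity must be a number")
    if not low <= intensity <= high:
        raise ValueError("intensity is out of range")
    return "Hello, " + name + "!" * int(intensity)

print(greet_checked("Sylwia", 3))  # Hello, Sylwia!!!
```

With checks like these, a value such as "high" is rejected before it can reach any sensitive functionality.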

We can do a similar check with a more complex example with gr.Blocks. We run the example code from the previous section and observe our data being sent:

Observed values in the request

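The values sent correspond to values given in the inputs list, which come from the components:

    quantity: gr.Slider
    animal: gr.Dropdown
    countries: gr.CheckboxGroup
    place: gr.Radio
    morning: gr.Checkbox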

Then, we test what we can send to the application. We pass values that are not expected for the given components:

Observed values in the request

No issues reported—we can set the source values to anything no matter which component they come from. That makes them perfect candidates for modeling as sources and later doing research at scale.

Except.

Echo of sources past: gr.Dropdown example

Not long after I had a look at Gradio version 4.x.x and wrote models for its sources, the Gradio team asked Trail of Bits (ToB) to conduct a security audit of the framework, which resulted in Maciej Domański and Vasco Franco creating a report on Gradio’s security with a lot of cool findings. The fixes for the issues reported were incorporated into Gradio 5.0, which was released on October 9, 2024. One of the issues that ToB’s team reported was TOB-GRADIO-15: Dropdown component pre-process step does not limit the values to those in the dropdown list.

TOB-GRADIO-15 from Trail of Bits’ Gradio review.

The issue was subsequently fixed in Gradio 5.0 and so, submitting values that are not valid choices in gr.Dropdown, gr.Radio, and gr.CheckboxGroup results in an error, namely:

gradio.exceptions.Error: "Value: turtle is not in the list of choices: ['cat', 'dog', 'bird']"

In this case, the vulnerabilities that may result from these sources may not be exploitable in Gradio version 5.0 and later. There were also a number of other security-related changes to the Gradio framework in version 5.0, which can be explored in the ToB report on Gradio security.

The change made me ponder whether to update the CodeQL models I have written and added to CodeQL. However, since the sources can still be misused in applications running Gradio versions below 5.0, I decided to leave them as they are.
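
For applications that still run on versions below 5.0, a comparable check can be done in the handler itself. The following is a minimal sketch under that assumption; require_choice and require_choices are hypothetical helpers and do not reproduce Gradio's own implementation.

```python
ANIMALS = ["cat", "dog", "bird"]

def require_choice(value, choices):
    # Hypothetical helper: reject any value outside the allow-list,
    # in the spirit of the check described above.
    if value not in choices:
        raise ValueError(f"Value: {value} is not in the list of choices: {choices}")
    return value

def require_choices(values, choices):
    # Variant for multi-select inputs such as a checkbox group.
    if not isinstance(values, list):
        raise ValueError("expected a list of values")
    return [require_choice(v, choices) for v in values]

print(require_choice("dog", ANIMALS))  # dog
```

Calling require_choice("turtle", ANIMALS) raises a ValueError instead of letting the value flow further into the application.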

Modeling Gradio with CodeQL

We have now identified a number of potential sources. We can then write CodeQL models for them, and later use these sources with existing sinks to find vulnerabilities in Gradio applications at scale. But let’s go back to the beginning: how do we model these Gradio sources with CodeQL?

Preparing example CodeQL database for testing

Recall that to run CodeQL queries, we first need to create a CodeQL database from the source code that we are interested in. Then, we can run our CodeQL queries on that database to find vulnerabilities in the code.

How CodeQL works – create a CodeQL database, and run queries on that database

We start with intentionally vulnerable source code using gr.Interface that we want to use as our test case for finding Gradio sources. The code is vulnerable to command injection via both folder and logs arguments, which end in the first argument to an os.system call.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)

folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Add verbose logs")

demo = gr.Interface(fn=execute_cmd, inputs=[folder, logs])

if __name__ == "__main__":
    demo.launch(debug=True)
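
To see why this string construction is dangerous, the sketch below builds the same command with a hostile folder value, without executing anything. The helper names and the payload are hypothetical; the argument-list variant is one common mitigation, assuming the command would be run with subprocess.run and without shell=True.

```python
import shlex

def build_cmd_unsafe(folder, logs):
    # Same string construction as in the vulnerable example.
    return f"python caption.py --dir={folder} --logs={logs}"

def build_argv_safer(folder, logs):
    # An argument list intended for subprocess.run(argv) without shell=True,
    # so shell metacharacters in `folder` are not interpreted by a shell.
    return ["python", "caption.py", f"--dir={folder}", f"--logs={logs}"]

payload = "images; echo INJECTED"

# A POSIX shell would treat everything after ';' as a second command.
print(build_cmd_unsafe(payload, True))

# The payload stays inside a single argument.
print(build_argv_safer(payload, True))

# If a shell string cannot be avoided, quoting keeps the value in one word.
print(shlex.quote(payload))  # 'images; echo INJECTED'
```

Neither mitigation replaces validating the value itself; for example, a folder name starting with a dash could still be read as an option by the called script.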

Let’s create another example using gr.Blocks and gr.Button.click to mimic our earlier examples. Similarly, the code is vulnerable to command injection via both folder and logs arguments. The code is a simplified version of a vulnerability I found in an open source project.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)

with gr.Blocks() as demo:
    gr.Markdown("Create caption files for images in a directory")
    with gr.Row():
        folder = gr.Textbox(placeholder="Directory to caption")
        logs = gr.Checkbox(label="Add verbose logs")
    btn = gr.Button("Run")
    btn.click(fn=execute_cmd, inputs=[folder, logs])

if __name__ == "__main__":
    demo.launch(debug=True)

I also added two more code snippets which use positional arguments instead of keyword arguments. The database and the code snippets are available in the CodeQL zero to hero repository.

Now that we have the code, we can create a CodeQL database for it by using the CodeQL CLI. It can be installed either as an extension to the gh tool (recommended) or as a binary. Using the gh tool with the CodeQL CLI makes it much easier to update CodeQL.

CodeQL CLI installation instructions
    Install GitHub’s command line tool, gh, using the installation instructions for your system.
    Install CodeQL CLI:
    gh extensions install github/gh-codeql
    (optional, but recommended) To use the CodeQL CLI directly in the terminal (without having to type gh), run:
    gh codeql install-stub
    Make sure to regularly update the CodeQL CLI with:
    codeql set-version latest

After installing CodeQL CLI, we can create a CodeQL database. First, we move to the folder where all our source code is located. Then, to create a CodeQL database called gradio-cmdi-db for the Python code in the folder gradio-tests, run:

codeql database create gradio-cmdi-db --language=python --source-root='./gradio-tests'

This command creates a new folder gradio-cmdi-db with the extracted source code elements. We will use this database to test and run CodeQL queries on.

I assume here you already have the VS Code CodeQL starter workspace set up. If not, follow the instructions to create a starter workspace and come back when you are finished.

To run queries on the database we created, we need to add it to the VS Code CodeQL extension. We can do it with the “Choose Database from Folder” button and pointing it to the gradio-cmdi-db folder.

Now that we are all set up, we can move on to actual CodeQL modeling.

Identifying code elements to query for

Let’s have another look at our intentionally vulnerable code.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"

folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Save verbose logs")
output = gr.Textbox()

demo = gr.Interface(
    fn=execute_cmd,
    inputs=[folder, logs],
    outputs=output)

if __name__ == "__main__":
    demo.launch(debug=True)

In the first example with the Interface class, we had several components on the website. In fact, on the left side we have one source component which takes user input – a textbox. On the right side, we have an output text component. So, it’s not enough that a component is, for example, of a gr.Textbox() type to be considered a source.

To be considered a source of an untrusted input, a component has to be passed to the inputs keyword argument, which takes an input component or a list of input components that will be used by the function passed to fn and processed by the logic of the application in a potentially vulnerable way. So, not all components are sources. In our case, any values passed to inputs in the gr.Interface class are sources. We could go a step further, and say that if gr.Interface is used, then anything passed to the execute_cmd function is a source – so here the folder and logs.

The same situation happens in the second example with the gr.Blocks class and the gr.Button.click event listener. Any arguments passed to inputs in the Button.click method, and so to the execute_cmd function, are sources.

Modeling gr.Interface

Let’s start by looking at the gr.Interface class.

Since we are interested in identifying any values passed to inputs in the gr.Interface class, we first need to identify any calls to gr.Interface.

CodeQL for Python has a library called ApiGraphs for finding references to classes and functions defined in library code, which we can use to identify any calls to gr.Interface. See CodeQL zero to hero part 2 for a refresher on writing CodeQL and CodeQL zero to hero part 3 for a refresher on using ApiGraphs.

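We can get all references to gr.Interface calls with the following query. Note that:

    The @kind problem metadata formats each result of the select clause as an alert.
    In from, we declare a variable node of type API::CallNode, that is, a node representing a call to an API.
    In where, we restrict node to calls to the Interface member of the gradio module.
    In select, we output node together with a message string.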

/**
 * @id codeql-zero-to-hero/4-1
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs

from API::CallNode node
where node =
    API::moduleImport("gradio").getMember("Interface").getACall()
select node, "Call to gr.Interface"

Run the query by right-clicking and selecting “CodeQL: Run Query on Selected Database”. If you are using the same CodeQL database, you should see two results, which proves that the query is working as expected (recall that I added more code snippets to the test database. The database and the code snippets are available in the CodeQL zero to hero repository).

Query results in two alerts

Next, we want to identify values passed to inputs in the gr.Interface class, which are passed to the execute_cmd function. We could do it in two ways: by identifying the values passed to inputs and then linking them to the function referenced in fn, or by looking at the parameters to the function referenced in fn directly. The latter is a bit easier, so let’s focus on that solution. If you’re interested in the first solution, check out the Taint step section.

To sum up, we are interested in getting the folder and logs parameters.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"

folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Save verbose logs")
output = gr.Textbox()

demo = gr.Interface(
    fn=execute_cmd,
    inputs=[folder, logs],
    outputs=output)

if __name__ == "__main__":
    demo.launch(debug=True)

We can get folder and logs with the query below:

/**
 * @id codeql-zero-to-hero/4-2
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs

from API::CallNode node
where node =
    API::moduleImport("gradio").getMember("Interface").getACall()
select node.getParameter(0, "fn").getParameter(_), "Gradio sources"

To get the function reference in fn (or in the 1st positional argument) we use the getParameter(0, "fn") predicate: 0 refers to the 1st positional argument and "fn" refers to the fn keyword argument. Then, to get the parameters themselves, we use the getParameter(_) predicate. Note that an underscore here is a wildcard, meaning it will output all of the parameters to the function referenced in fn. Running the query results in 3 alerts.
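
For readers who think more easily in Python than in QL, the lookup can be loosely mimicked with Python's ast module. This is only an analogy under simplifying assumptions (the module is imported under the alias gr, and fn is a plain function name); it is not how CodeQL or ApiGraphs work internally, and interface_fn_params is a hypothetical helper.

```python
import ast

SAMPLE = '''
import gradio as gr
import os

def execute_cmd(folder, logs):
    os.system(f"python caption.py --dir={folder} --logs={logs}")

demo = gr.Interface(fn=execute_cmd, inputs=[gr.Textbox(), gr.Checkbox()])
'''

def interface_fn_params(source, alias="gr"):
    """Return parameter names of functions passed as `fn` to <alias>.Interface."""
    tree = ast.parse(source)
    # Map function names to their definitions.
    functions = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    params = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        is_interface = (
            isinstance(callee, ast.Attribute)
            and callee.attr == "Interface"
            and isinstance(callee.value, ast.Name)
            and callee.value.id == alias
        )
        if not is_interface:
            continue
        # `fn` may be the first positional argument or the `fn` keyword,
        # mirroring getParameter(0, "fn").
        fn = node.args[0] if node.args else None
        for kw in node.keywords:
            if kw.arg == "fn":
                fn = kw.value
        # Collect every parameter, mirroring getParameter(_).
        if isinstance(fn, ast.Name) and fn.id in functions:
            params += [a.arg for a in functions[fn.id].args.args]
    return params

print(interface_fn_params(SAMPLE))  # ['folder', 'logs']
```

Unlike this sketch, ApiGraphs also follows import aliases and values that flow through variables, which is why the CodeQL version is so much shorter.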

Query results in three alerts

We can also encapsulate the logic of the query into a class to make it more portable. This query will give us the same results. If you need a refresher on classes and using the exists mechanism, see CodeQL zero to hero part 2.

/**
 * @id codeql-zero-to-hero/4-3
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs
import semmle.python.dataflow.new.RemoteFlowSources

class GradioInterface extends RemoteFlowSource::Range {
    GradioInterface() {
        exists(API::CallNode n |
            n = API::moduleImport("gradio").getMember("Interface").getACall() |
            this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }
}

from GradioInterface inp
select inp, "Gradio sources"

Note that for GradioInterface class we start with the RemoteFlowSource::Range supertype. This allows us to add the sources contained in the query to the RemoteFlowSource abstract class.

An abstract class is a union of all its subclasses, for example, the GradioInterface class we just modeled as well as the sources already added to CodeQL, for example, from Flask, Django or Tornado web frameworks. An abstract class is useful if you want to group multiple existing classes together under a common name.

Meaning, if we now query for all sources using the RemoteFlowSource class, the results will include the results produced from our class above. Try it!

/**
 * @id codeql-zero-to-hero/4-4
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs
import semmle.python.dataflow.new.RemoteFlowSources

class GradioInterface extends RemoteFlowSource::Range {
    GradioInterface() {
        exists(API::CallNode n |
            n = API::moduleImport("gradio").getMember("Interface").getACall() |
            this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }
}

from RemoteFlowSource rfs
select rfs, "All python sources"

For a refresher on RemoteFlowSource, and how to use it in a query, head to CodeQL zero to hero part 3.

Note that since we modeled the new sources using the RemoteFlowSource abstract class, all Python queries that already use RemoteFlowSource will automatically use our new sources if we add them to library files, like I did in this pull request to add Gradio models. Almost all CodeQL queries use RemoteFlowSource. For example, if you run the SQL injection query, it will also include vulnerabilities that use the sources we’ve modeled. See how to run prewritten queries in CodeQL zero to hero part 3.

Modeling gr.Button.click

We model gr.Button.click in a very similar way.

/**
 * @id codeql-zero-to-hero/4-5
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs

from API::CallNode node
where node =
    API::moduleImport("gradio").getMember("Button").getReturn()
    .getMember("click").getACall()
select node.getParameter(0, "fn").getParameter(_), "Gradio sources"

Note that in the code we first create a Button object with gr.Button() and then call the click() event listener on it. Due to that, we need to use API::moduleImport("gradio").getMember("Button").getReturn() to first get the nodes representing the result of calling gr.Button() and then we continue with .getMember("click").getACall() to get all calls to gr.Button.click. Running the query results in 3 alerts.
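
The two-step shape of this lookup (first the result of calling gr.Button(), then the click call on that result) can again be loosely mimicked with Python's ast module. This is an analogy only, assuming the alias gr and simple assignments; is_button_call and button_click_calls are hypothetical helpers.

```python
import ast

SAMPLE = '''
import gradio as gr

def execute_cmd(folder, logs):
    pass

with gr.Blocks() as demo:
    folder = gr.Textbox()
    logs = gr.Checkbox()
    btn = gr.Button("Run")
    btn.click(fn=execute_cmd, inputs=[folder, logs])
'''

def is_button_call(node, alias="gr"):
    # True for an expression of the form <alias>.Button(...).
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "Button"
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == alias
    )

def button_click_calls(source):
    """Return line numbers of `.click(...)` calls made on gr.Button() results."""
    tree = ast.parse(source)
    # Step 1: names bound to the result of calling gr.Button(...),
    # loosely playing the role of getMember("Button").getReturn().
    buttons = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and is_button_call(node.value):
            buttons.update(t.id for t in node.targets if isinstance(t, ast.Name))
    # Step 2: `.click(...)` calls whose receiver is one of those names or a
    # direct gr.Button(...) call, like getMember("click").getACall().
    lines = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "click"):
            receiver = node.func.value
            if is_button_call(receiver) or (
                isinstance(receiver, ast.Name) and receiver.id in buttons
            ):
                lines.append(node.lineno)
    return lines

print(button_click_calls(SAMPLE))
```

A click call on an object that did not come from gr.Button() is ignored by this sketch, which is the same distinction the getReturn() step makes in the query.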

Query results in three alerts

We can also encapsulate the logic of this query into a class too.

/**
 * @id codeql-zero-to-hero/4-6
 * @severity error
 * @kind problem
 */

import python
import semmle.python.ApiGraphs
import semmle.python.dataflow.new.RemoteFlowSources

class GradioButton extends RemoteFlowSource::Range {
    GradioButton() {
        exists(API::CallNode n |
            n = API::moduleImport("gradio").getMember("Button").getReturn()
                .getMember("click").getACall() |
            this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }
}

from GradioButton inp
select inp, "Gradio sources"

Vulnerabilities using Gradio sources

We can now use our two classes as sources in a taint tracking query, to detect vulnerabilities that have a Gradio source, and, continuing with our command injection example, an os.system sink (the first argument to the os.system call is the sink). See CodeQL zero to hero part 3 to learn more about taint tracking queries.

The os.system call is defined in the OsSystemSink class and the sink, that is the first argument to the os.system sink call, is defined in the isSink predicate.

/**
 * @id codeql-zero-to-hero/4-7
 * @severity error
 * @kind path-problem
 */

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
import semmle.python.ApiGraphs
import semmle.python.dataflow.new.RemoteFlowSources
import MyFlow::PathGraph

class GradioButton extends RemoteFlowSource::Range {
    GradioButton() {
        exists(API::CallNode n |
            n = API::moduleImport("gradio").getMember("Button").getReturn()
                .getMember("click").getACall() |
            this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }
}

class GradioInterface extends RemoteFlowSource::Range {
    GradioInterface() {
        exists(API::CallNode n |
            n = API::moduleImport("gradio").getMember("Interface").getACall() |
            this = n.getParameter(0, "fn").getParameter(_).asSource())
    }

    override string getSourceType() { result = "Gradio untrusted input" }
}

class OsSystemSink extends API::CallNode {
    OsSystemSink() {
        this = API::moduleImport("os").getMember("system").getACall()
    }
}

private module MyConfig implements DataFlow::ConfigSig {
    predicate isSource(DataFlow::Node source) {
        source instanceof GradioButton
        or
        source instanceof GradioInterface
    }

    predicate isSink(DataFlow::Node sink) {
        exists(OsSystemSink call |
            sink = call.getArg(0)
        )
    }
}

module MyFlow = TaintTracking::Global<MyConfig>;

from MyFlow::PathNode source, MyFlow::PathNode sink
where MyFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Data Flow from a Gradio source to `os.system`"

Running the query results in 6 alerts, which show us the path from source to sink. Note that the os.system sink we used is already modeled in CodeQL, but we are using it here to illustrate the example.

Query results in six alerts

Similarly to the GradioInterface class, since we have already written the sources, we can actually use them (as well as the query we’ve written above) on any Python project. We just have to add them to library files, like I did in this pull request to add Gradio models.

We can actually run any query on up to 1000 projects at once using a tool called Multi-Repository Variant Analysis (MRVA).

But before that.

Other sources in Gradio

Based on the tests we did in the Identifying attack surface in Gradio section, we can identify other sources that behave in a similar way. For example, there’s gr.LoginButton.click, an event listener that also takes inputs and could be considered a source. I’ve modeled these cases and added them to CodeQL, which you can see in the pull request to add Gradio models. The modeling of these event listeners is very similar to what we’ve done in the previous section.

Taint step

We’ve mentioned that there are two ways to model gr.Interface and other Gradio sources—by identifying the values passed to inputs and then linking them to the function referenced in fn, or by looking at the parameters to the function referenced in fn directly.

import gradio as gr
import os

def execute_cmd(folder, logs):
    cmd = f"python caption.py --dir={folder} --logs={logs}"
    os.system(cmd)
    return f"Command: {cmd}"

folder = gr.Textbox(placeholder="Directory to caption")
logs = gr.Checkbox(label="Save verbose logs")
output = gr.Textbox()

demo = gr.Interface(
    fn=execute_cmd,
    inputs=[folder, logs],
    outputs=output)

if __name__ == "__main__":
    demo.launch(debug=True)

As it turns out, machine learning applications written using Gradio often use a lot of input variables, which are later processed by the application. In this case, the inputs argument gets a list of variables, which at times can be very long. I’ve found several cases which used a list with 10+ elements.

In these cases, it would be nice to be able to track the source all the way to the component that introduces it, in our case gr.Textbox and gr.Checkbox.

To do that, we need to use a taint step. Taint steps are usually used in case taint analysis stops at a specific code element, and we want to make it propagate forward. In our case, however, we are going to write a taint step to track a variable in inputs, that is an element of a list, and track it back to the component.

The Gradio.qll file in CodeQL upstream contains all the Gradio source models and the taint step if you’d like to see the whole modeling.

We start by identifying the variables passed to inputs in, for example, gr.Interface:

class GradioInputList extends RemoteFlowSource::Range {
    GradioInputList() {
      exists(GradioInput call |
        // limit only to lists of parameters given to `inputs`.
        (
          (
            call.getKeywordParameter("inputs").asSink().asCfgNode() instanceof ListNode
            or
            call.getParameter(1).asSink().asCfgNode() instanceof ListNode
          ) and
          (
            this = call.getKeywordParameter("inputs").getASubscript().getAValueReachingSink()
            or
            this = call.getParameter(1).getASubscript().getAValueReachingSink()
          )
        )
      )
    }

    override string getSourceType() { result = "Gradio untrusted input" }
}

Next, we identify the function in fn and link the elements of the list of variables in inputs to the parameters of the function referenced in fn.

class ListTaintStep extends TaintTracking::AdditionalTaintStep {
  override predicate step(DataFlow::Node nodeFrom, DataFlow::Node nodeTo) {
    exists(GradioInput node, ListNode inputList |
      inputList = node.getParameter(1, "inputs").asSink().asCfgNode() |
      exists(int i |
        nodeTo = node.getParameter(0, "fn").getParameter(i).asSource() |
        nodeFrom.asCfgNode() = inputList.getElement(i)
      )
    )
  }
}

Let’s explain the taint step, step by step.

exists(GradioInput node, ListNode inputList |

In the taint step, we define two temporary variables in the exists mechanism—node of type GradioInput and inputList of type ListNode.

inputList = node.getParameter(1, "inputs").asSink().asCfgNode() |

Then, we set our inputList to the value of inputs. Note that because inputList has type ListNode, we are looking only for lists.

exists(int i |
  nodeTo = node.getParameter(0, "fn").getParameter(i).asSource() |
  nodeFrom.asCfgNode() = inputList.getElement(i))

Next, we identify the function in fn and link the parameters of the function referenced in fn to the elements of the list of variables in inputs, by using a temporary variable i.

All in all, the taint step provides us with a nicer display of the paths, from the component used as a source to a potential sink.
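
The index-based linking at the heart of the taint step can be pictured with a few lines of Python. This is a loose analogy, not the CodeQL implementation: link_inputs_to_params is a hypothetical helper, and the component names folder_box and logs_box are made up for the illustration.

```python
import inspect

def execute_cmd(folder, logs):
    return f"python caption.py --dir={folder} --logs={logs}"

def link_inputs_to_params(fn, input_names):
    """Pair the i-th element of `inputs` with the i-th parameter of `fn`."""
    params = list(inspect.signature(fn).parameters)
    # zip stops at the shorter sequence, so an index i is only linked when
    # it exists on both sides, like the shared `i` in the taint step.
    return list(zip(input_names, params))

print(link_inputs_to_params(execute_cmd, ["folder_box", "logs_box"]))
# [('folder_box', 'folder'), ('logs_box', 'logs')]
```

Each pair corresponds to one edge that the taint step adds: from the list element at position i to the function parameter at position i.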

Scaling the research to thousands of repositories with MRVA

CodeQL zero to hero part 3 introduced Multi-Repository Variant Analysis (MRVA) and variant analysis. Head over there if you need a refresher on the topics.

In short, MRVA allows you to run a query on up to 1000 projects hosted on GitHub at once. It comes preconfigured with dynamic lists of the top 10, 100, and 1000 most popular repositories for each language. You can configure your own lists of repositories to run CodeQL queries on and potentially find more variants of vulnerabilities that use our new models. @maikypedia wrote a neat case study about using MRVA to find SSTI vulnerabilities in Ruby and Unsafe Deserialization vulnerabilities in Python.

MRVA is used together with the VS Code CodeQL extension and can be configured in the extension, in the “Variant Analysis” section. It uses GitHub Actions to run, so you need a repository, which will be used as a controller to run these actions. You can create a public repository, in which case running the queries will be free, but in this case, you can run MRVA only on public repositories. The docs contain more information about MRVA and its setup.

Using MRVA, I’ve found 11 vulnerabilities to date in several Gradio projects. Check out the vulnerability reports on GitHub Security Lab’s website.

Reach out!

Today, we learned how to model a new framework in CodeQL, using Gradio as an example, and how to use those models for finding vulnerabilities at scale. I hope that this post helps you with finding new cool vulnerabilities!

If CodeQL and this post helped you to find a vulnerability, we would love to hear about it! Reach out to us on GitHub Security Lab on Slack or tag us @ghsecuritylab on X.

If you have any questions, issues with challenges or with writing a CodeQL query, feel free to join and ask on the GitHub Security Lab server on Slack. The Slack server is open to anyone and gives you access to ask questions about issues with CodeQL, CodeQL modeling or anything else CodeQL related, and receive answers from a number of CodeQL engineers. If you prefer to stay off Slack, feel free to ask any questions in CodeQL repository discussions or in GitHub Security Lab repository discussions.

The post CodeQL zero to hero part 4: Gradio framework case study appeared first on The GitHub Blog.
