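Operator implementations for NVIDIA GPUs in TVM are written as templates. Each template exposes many tunable knobs (tile factors, unrolling, and so on). Tuning all convolution and depthwise convolution operators in a neural network produces a log file that stores the best knob values for every required operator. When the TVM compiler compiles these operators, it queries this log file to obtain the best knob values.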
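The idea of "store the best knob values per operator, then look them up at compile time" can be illustrated with a toy model. This is only a sketch: the record layout `(workload, config, cost)`, the workload names, and the knob names `tile_x` and `unroll` are hypothetical, and real AutoTVM log files carry more information than this.

```python
# Hypothetical record layout: (workload, config, cost_in_seconds).
# This is a toy model of a tuning log, not the real AutoTVM format.
records = [
    ("conv2d_3x3_64", {"tile_x": 8, "unroll": 0}, 0.0021),
    ("conv2d_3x3_64", {"tile_x": 16, "unroll": 1}, 0.0013),
    ("conv2d_1x1_128", {"tile_x": 4, "unroll": 1}, 0.0009),
    ("conv2d_1x1_128", {"tile_x": 32, "unroll": 0}, 0.0015),
]


def pick_best_per_workload(records):
    """Keep only the lowest-cost config for each workload."""
    best = {}
    for workload, config, cost in records:
        if workload not in best or cost < best[workload][1]:
            best[workload] = (config, cost)
    return best


best = pick_best_per_workload(records)

# At "compile time", look up the tuned config and fall back to a default
# when the workload was never tuned.
default_config = {"tile_x": 1, "unroll": 0}
config, _ = best.get("conv2d_3x3_64", (default_config, None))
print(config)
```

The fallback mirrors what happens in practice when an operator has no entry in the log: compilation still works, only with untuned parameters.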

We have also released pre-tuned parameters for some NVIDIA GPUs. See NVIDIA GPU Benchmark for details.

Note that this tutorial does not run on Windows or on recent versions of macOS as written. To run it there, wrap the body of the tutorial in an if __name__ == "__main__": block.
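The guard mentioned above has the following shape. This is a minimal sketch: `main` is a hypothetical wrapper name, and the explanation in the comments assumes the restriction comes from worker processes re-importing the script, which is how Python starts workers by default on Windows and recent macOS.

```python
def main():
    # The tutorial body (get_network, tune_tasks, tune_and_evaluate, ...)
    # would be called from here instead of running at import time.
    return "tuning finished"


if __name__ == "__main__":
    # Only the process started as a script enters this branch.
    # A worker process that re-imports this module skips it, so the
    # tuning code is not executed a second time inside each worker.
    print(main())
```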

To use the autotvm package in TVM, install some extra dependencies (if you use Python 2, change "3" to "2"):

pip3 install --user psutil xgboost tornado cloudpickle

To make TVM run faster during tuning, we recommend using Cython as the TVM FFI. In the root directory of TVM, run:

pip3 install --user cython
sudo make cython3

Now return to the Python code and import the packages:

import os

import numpy as np

import tvm
from tvm import relay, autotvm
import tvm.relay.testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
import tvm.contrib.graph_executor as runtime

First, define the network in the Relay frontend API. You can load one of the pre-defined networks from tvm.relay.testing, or load a model from MXNet, ONNX, or TensorFlow.

def get_network(name, batch_size):
    """Get the symbol definition and random weights of a network."""
    # Note: dtype is a module-level variable set in the tuning options.
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)

    if "resnet" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif "vgg" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.vgg.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "squeezenet_v1.1":
        mod, params = relay.testing.squeezenet.get_workload(
            batch_size=batch_size, version="1.1", dtype=dtype
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # An example for an MXNet model
        from mxnet.gluon.model_zoo.vision import get_model

        block = get_model("resnet18_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
    else:
        raise ValueError("Unsupported network: " + name)

    return mod, params, input_shape, output_shape

Before tuning, apply some configurations:

#### DEVICE CONFIG ####
target = tvm.target.cuda()

#### TUNING OPTION ####
network = "resnet-18"
log_file = "%s.log" % network
dtype = "float32"

tuning_option = {
    "log_filename": log_file,
    "tuner": "xgb",
    "n_trial": 2000,
    "early_stopping": 600,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
    ),
}

Output:

/workspace/python/tvm/target/target.py:389: UserWarning: Try specifying cuda arch by adding 'arch=sm_xx' to your target.
  warnings.warn("Try specifying cuda arch by adding 'arch=sm_xx' to your target.")

If you have enough time for tuning, you can set n_trial and early_stopping to larger values so that tuning runs longer.
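How the two settings interact can be sketched with a toy loop. This is a simplified illustration under stated assumptions, not AutoTVM's actual implementation: `run_trials` is a hypothetical helper, costs are given up front instead of being measured, and lower cost is treated as better.

```python
def run_trials(costs, n_trial, early_stopping=None):
    """Toy tuning loop over pre-computed costs (lower is better).

    Stops after n_trial measurements, or earlier once the best result has
    not improved for `early_stopping` consecutive trials.
    """
    best_cost = float("inf")
    best_index = -1
    used = 0
    for i, cost in enumerate(costs[:n_trial]):
        used = i + 1
        if cost < best_cost:
            best_cost, best_index = cost, i
        elif early_stopping is not None and i - best_index >= early_stopping:
            break
    return best_cost, used


costs = [5.0, 3.0, 4.0, 3.5, 3.2, 3.1, 2.0, 2.5]
print(run_trials(costs, n_trial=8, early_stopping=3))     # stops before the 2.0 result
print(run_trials(costs, n_trial=8, early_stopping=None))  # uses all 8 trials
```

In this toy run, a small early_stopping value ends the search before the best configuration is ever tried, which is why larger values can pay off when time allows.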

If you have multiple devices, you can use all of them for measurement to speed up tuning (see the Scale up measurement section below).

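Now extract the tuning tasks from the network and start tuning. The following is a simple utility function for doing so. It is only an initial implementation that tunes the task list sequentially. A more sophisticated tuning scheduler will be introduced in the future.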

# You can skip the implementation of this function.
def tune_tasks(
    tasks,
    measure_option,
    tuner="xgb",
    n_trial=1000,
    early_stopping=None,
    log_filename="tuning.log",
    use_transfer_learning=True,
):
    # create a tmp log file
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create the tuner
        if tuner == "xgb":
            tuner_obj = XGBTuner(tsk, loss_type="reg")
        elif tuner == "xgb_knob":
            tuner_obj = XGBTuner(tsk, loss_type="reg", feature_type="knob")
        elif tuner == "xgb_itervar":
            tuner_obj = XGBTuner(tsk, loss_type="reg", feature_type="itervar")
        elif tuner == "xgb_curve":
            tuner_obj = XGBTuner(tsk, loss_type="reg", feature_type="curve")
        elif tuner == "xgb_rank":
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "xgb_rank_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="knob")
        elif tuner == "xgb_rank_itervar":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="itervar")
        elif tuner == "xgb_rank_curve":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="curve")
        elif tuner == "xgb_rank_binary":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary")
        elif tuner == "xgb_rank_binary_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary", feature_type="knob")
        elif tuner == "xgb_rank_binary_itervar":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary", feature_type="itervar")
        elif tuner == "xgb_rank_binary_curve":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary", feature_type="curve")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=100)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        # reuse records from earlier tasks as history for the tuner
        if use_transfer_learning:
            if os.path.isfile(tmp_log_file):
                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

        # start tuning; never request more trials than the config space holds
        tsk_trial = min(n_trial, len(tsk.config_space))
        tuner_obj.tune(
            n_trial=tsk_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                autotvm.callback.log_to_file(tmp_log_file),
            ],
        )

    # pick the best records into the final log file
    autotvm.record.pick_best(tmp_log_file, log_filename)
    os.remove(tmp_log_file)

Finally, launch the tuning jobs and evaluate the end-to-end performance:

def tune_and_evaluate(tuning_opt):
    # extract workloads from the relay program
    print("Extract tasks...")
    mod, params, input_shape, out_shape = get_network(network, batch_size=1)
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
    )

    # run the tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)

    # compile kernels with the best records from the tuning history
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(mod, target=target, params=params)

        # load parameters
        dev = tvm.device(str(target), 0)
        module = runtime.GraphModule(lib["default"](dev))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("data", data_tvm)

        # evaluate
        print("Evaluate inference time cost...")
        print(module.benchmark(dev, number=1, repeat=600))


# Tuning is not run on the web page server because it takes too long.
# Uncomment the following line to run it yourself.

# tune_and_evaluate(tuning_option)

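Tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. A sample output is listed below. It takes about 4 hours to get this output on a 32-thread AMD Ryzen Threadripper, with an NVIDIA 1080 Ti as the tuning target. (You may see some errors during compilation. If tuning keeps running, they can be ignored.)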

Extract tasks...
Tuning...
[Task  1/12]  Current/Best:  541.83/3570.66 GFLOPS | Progress: (960/2000) | 1001.31 s Done.
[Task  2/12]  Current/Best:    0.56/ 803.33 GFLOPS | Progress: (704/2000) | 608.08 s Done.
[Task  3/12]  Current/Best:  103.69/1141.25 GFLOPS | Progress: (768/2000) | 702.13 s Done.
[Task  4/12]  Current/Best: 2905.03/3925.15 GFLOPS | Progress: (864/2000) | 745.94 sterminate called without an active exception
[Task  4/12]  Current/Best: 2789.36/3925.15 GFLOPS | Progress: (1056/2000) | 929.40 s Done.
[Task  5/12]  Current/Best:   89.06/1076.24 GFLOPS | Progress: (704/2000) | 601.73 s Done.
[Task  6/12]  Current/Best:   40.39/2129.02 GFLOPS | Progress: (1088/2000) | 1125.76 s Done.
[Task  7/12]  Current/Best: 4090.53/5007.02 GFLOPS | Progress: (800/2000) | 903.90 s Done.
[Task  8/12]  Current/Best:    4.78/1272.28 GFLOPS | Progress: (768/2000) | 749.14 s Done.
[Task  9/12]  Current/Best: 1391.45/2325.08 GFLOPS | Progress: (992/2000) | 1084.87 s Done.
[Task 10/12]  Current/Best: 1995.44/2383.59 GFLOPS | Progress: (864/2000) | 862.60 s Done.
[Task 11/12]  Current/Best: 4093.94/4899.80 GFLOPS | Progress: (224/2000) | 240.92 sterminate called without an active exception
[Task 11/12]  Current/Best: 3487.98/4909.91 GFLOPS | Progress: (480/2000) | 534.96 sterminate called without an active exception
[Task 11/12]  Current/Best: 4636.84/4912.17 GFLOPS | Progress: (1184/2000) | 1381.16 sterminate called without an active exception
[Task 11/12]  Current/Best:   50.12/4912.17 GFLOPS | Progress: (1344/2000) | 1602.81 s Done.
[Task 12/12]  Current/Best: 3581.31/4286.30 GFLOPS | Progress: (736/2000) | 943.52 s Done.
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 1.07 ms (0.05 ms)

As a reference baseline, MXNet + TensorRT takes 1.30 ms on ResNet-18, so the tuned result (1.07 ms) is somewhat faster.
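Progress lines like the ones in the sample output can be summarized with a small script, for example to see the best throughput per task or to spot a task whose best value never leaves 0.00. This is a sketch under stated assumptions: the regular expression is written against the line format shown in this tutorial's sample output only, and `summarize` is a hypothetical helper, not part of TVM.

```python
import re

# Matches lines such as:
# [Task  2/12]  Current/Best:    0.56/ 803.33 GFLOPS | Progress: (704/2000) | 608.08 s Done.
PROGRESS_RE = re.compile(
    r"\[Task\s+(\d+)/\s*(\d+)\]\s+Current/Best:\s*([\d.]+)/\s*([\d.]+) GFLOPS"
    r" \| Progress: \((\d+)/(\d+)\) \| ([\d.]+) s"
)


def summarize(lines):
    """Return {task_id: (best_gflops, trials_done, seconds)}.

    When a task appears on several lines, the last line wins.
    """
    summary = {}
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            task = int(m.group(1))
            summary[task] = (float(m.group(4)), int(m.group(5)), float(m.group(7)))
    return summary


sample = [
    "[Task  1/12]  Current/Best:  541.83/3570.66 GFLOPS | Progress: (960/2000) | 1001.31 s Done.",
    "[Task  4/12]  Current/Best: 2905.03/3925.15 GFLOPS | Progress: (864/2000) | 745.94 s",
    "[Task  4/12]  Current/Best: 2789.36/3925.15 GFLOPS | Progress: (1056/2000) | 929.40 s Done.",
]
summary = summarize(sample)

# Tasks whose best throughput is still zero deserve a closer look.
stalled = [task for task, (best, _, _) in summary.items() if best == 0.0]
print(summary, stalled)
```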

The auto-tuning module is error-prone. If you always see "0.00/ 0.00 GFLOPS", something is wrong.

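First, make sure the device configuration is set correctly. Then add the following lines at the beginning of the script to print debug information. Every measurement result is printed, and you can look for useful error messages there.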

import logging
logging.getLogger('autotvm').setLevel(logging.DEBUG)
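Debug output from a long tuning run can be large, so it may help to attach a handler that captures it somewhere searchable. The following sketch uses only the Python standard library. The in-memory buffer and the message text are stand-ins chosen for illustration, and the file name mentioned in the comment is hypothetical.

```python
import io
import logging

logger = logging.getLogger("autotvm")
logger.setLevel(logging.DEBUG)

# A StringIO stands in for a file here. On disk, the equivalent would be
# logging.FileHandler("autotvm_debug.log") (hypothetical file name).
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Stand-in for a message emitted during tuning.
logger.debug("example debug message")
print(buffer.getvalue())
```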

Feel free to ask the community for help at https://discuss.tvm.apache.org.

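If you have multiple devices, you can use all of them for measurement. TVM uses an RPC Tracker (a centralized controller node) to manage distributed devices. If you have 10 GPU cards, for example, you can register all of them with the tracker and run 10 measurements in parallel, which speeds up tuning.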
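A rough feel for the benefit comes from simple arithmetic. The sketch below is an idealized estimate under stated assumptions: every trial takes the same time, trials are spread evenly over devices, and build time and RPC overhead are ignored. `estimated_measure_time` and the one-second trial time are hypothetical.

```python
import math


def estimated_measure_time(n_trial, seconds_per_trial, n_devices):
    """Idealized wall-clock time when trials are spread evenly over devices."""
    rounds = math.ceil(n_trial / n_devices)
    return rounds * seconds_per_trial


one_gpu = estimated_measure_time(2000, 1.0, 1)
ten_gpus = estimated_measure_time(2000, 1.0, 10)
print(one_gpu, ten_gpus)
```

Real speedups are usually smaller than this estimate, because compilation on the host and model updates in the tuner do not scale with the number of devices.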

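To start an RPC tracker, run the following command on the host machine. The tracker is required for the whole tuning process, so open a new terminal for this command: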

python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
INFO:RPCTracker:bind to 0.0.0.0:9190

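Then open a new terminal for each device and start a dedicated RPC server. A key is used to distinguish device types. (Note: for the rocm backend, the compiler has some internal errors, so you need to add --no-fork to the argument list.)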

python -m tvm.exec.rpc_server --tracker=127.0.0.1:9190 --key=1080ti

After registering the devices, confirm that registration succeeded by querying the rpc_tracker:

python -m tvm.exec.query_rpc_tracker --host=127.0.0.1 --port=9190

For example, with 4 1080ti, 2 titanx, and 1 gfx900, the output is:

Queue Status
----------------------------------
key          total  free  pending
----------------------------------
1080ti       4      4     0
titanx       2      2     0
gfx900       1      1     0
----------------------------------
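Before launching a long tuning job, it can be useful to check this table programmatically, for example to confirm that the device key you plan to use has free devices. The sketch below parses the text format shown above. `parse_queue_status` is a hypothetical helper, and it assumes data rows always consist of a key followed by three integers.

```python
def parse_queue_status(text):
    """Return {key: {"total": int, "free": int, "pending": int}} from the table."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly four columns, the last three being integers.
        # The title, header, and separator lines fail this check and are skipped.
        if len(parts) == 4 and all(p.isdigit() for p in parts[1:]):
            key, total, free, pending = parts
            status[key] = {"total": int(total), "free": int(free), "pending": int(pending)}
    return status


table = """Queue Status
----------------------------------
key          total  free  pending
----------------------------------
1080ti       4      4     0
titanx       2      2     0
gfx900       1      1     0
----------------------------------"""

status = parse_queue_status(table)
has_free_1080ti = status.get("1080ti", {}).get("free", 0) > 0
print(status, has_free_1080ti)
```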

Finally, change the tuning option to use RPCRunner. Replace the corresponding part above with the following code:

tuning_option = {
    "log_filename": log_file,
    "tuner": "xgb",
    "n_trial": 2000,
    "early_stopping": 600,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.RPCRunner(
            "1080ti",  # change the device key to your key
            "127.0.0.1",
            9190,
            number=20,
            repeat=3,
            timeout=4,
            min_repeat_ms=150,
        ),
    ),
}

