[Previous] [Up] [Next]

This example describes the API of the code generated by the "PTX" backend for compiled materials and shows how a renderer can call this generated code to evaluate sub-expressions of multiple materials using CUDA.

New Topics

MDL material state (PTX)
Execution of generated code (PTX)
Loading textures (PTX)
Texture access functions (PTX)

Detailed Description

MDL material state (PTX)

The MDL material state structure mi::neuraylib::Shading_state_material is a representation of the renderer state as defined in section 19 "Renderer state" in the MDL specification. It is used to make the state of the renderer (like the position of an intersection point on the surface, the shading normal and the texture coordinates) available to the generated code.

Here's a version of the material state structure making use of the types from CUDA's vector_types.h:

struct Shading_state_material {
    float3                 normal;                  // state::normal() result
    float3                 geom_normal;             // state::geom_normal() result
    float3                 position;                // state::position() result
    float                  animation_time;          // state::animation_time() result
    const float3          *text_coords;             // state::texture_coordinate() table
    const float3          *tangent_u;               // state::texture_tangent_u() table
    const float3          *tangent_v;               // state::texture_tangent_v() table
    float4                *text_results;            // texture results lookup table
    const char            *ro_data_segment;         // read-only data segment
    const float4          *world_to_object;         // world-to-object transform matrix
    const float4          *object_to_world;         // object-to-world transform matrix
    int                    object_id;               // state::object_id() result
};

Please refer to the structure documentation for more information.

In this example, we fill the material state structure with some example values and only use one texture space. For the world-to-object and object-to-world transformation matrices we use identity matrices. We will iterate the position and text_coords fields over a 2x2 quad around the center of the world with position x and y coordinates ranging from -1 to 1 and the texture uv-coordinates ranging from 0 to 1, respectively.

Execution of generated code (PTX)

For the non-native backends, the generated code has to be called directly from the corresponding framework, so we need to know the prototypes of the functions generated via mi::neuraylib::IMdl_backend::translate_material_expression(). With "NAME" being the function name you provided as fname parameter and "T" being the result type, they look like this:

void NAME(
    T                             *result,
    Shading_state_material const  *state,
    Resource_data const           *res_data,
    void const                    *exception_state,
    char const                    *captured_args);

or written as a PTX prototype:

.visible .func NAME(
        .param .b64 result,
        .param .b64 state,
        .param .b64 res_data,
        .param .b64 exception_state,
        .param .b64 captured_args
);

The res_data parameter is used to provide access to resources like textures depending on the way how those resources are accessed (see "Texture access functions" below). If it is not used, the pointers inside the structure may be nullptr:

struct Resource_data {
    const void                 *shared_data;
    const Texture_handler_base *texture_handler;
};

The exception_state parameter allows to provide handlers for out-of-bounds array access exceptions and division-by-zero exceptions. But for the PTX backend, this is not supported and the parameter should be set to nullptr.

The captured_args parameter is used to provide the data of the mi::neuraylib::ITarget_argument_block object for class-compiled materials. The data can either be manually created using the information from mi::neuraylib::ITarget_value_layout or by using mi::neuraylib::ITarget_code::get_argument_block() or mi::neuraylib::ITarget_code::create_argument_block(). For instance-compiled materials, this parameter should be set to nullptr. See Instance-compilation and class-compilation for more details about instance and class compilation.

To make the generated functions available to our CUDA kernel, we have to link them with the kernel. We could just declare the generated functions with the corresponding names as extern "C" in the CUDA source code of the kernel, but we may want to decide at runtime how many materials will be available. So we will add an indirection through an array of pointers to the generated functions which we provide as an additional PTX source code buffer to the CUDA linker.

Note: We currently have to add a dummy function to the PTX code containing this function pointer array, because the CUDA linker will otherwise just resolve the function addresses to zero.; Also, we need to compile the CUDA kernel with the option -rdc=true (relocatable device code), otherwise the extern declared function array will be treated as a definition resulting in two arrays.

In this example, we bake multiple materials into a texture with a user-configurable checkerboard pattern by executing a material for every texel updating the material state accordingly. At the end, we write the texture to disk.

Loading textures (PTX)

When the nv_openimageio plugin has been loaded via mi::neuraylib::IPlugin_configuration::load_plugin_library() before starting the MDL SDK, the SDK will automatically load textures on the host side for many common image formats and make them available via mi::neuraylib::ITarget_code::get_texture(). Note, that the first texture is always the invalid texture, so only if there is more than just one texture according to mi::neuraylib::ITarget_code::get_texture_count(), there will be real referenced textures available.

Here's a small code snippet showing how to access the mi::neuraylib::ICanvas of the texture at index i.

mi::base::Handle<const mi::neuraylib::ITexture> texture(
    transaction->access<mi::neuraylib::ITexture>(target_code->get_texture(i)));
mi::base::Handle<const mi::neuraylib::IImage> image(
    transaction->access<mi::neuraylib::IImage>(texture->get_image()));
mi::base::Handle<const mi::neuraylib::ICanvas> canvas(image->get_canvas());

The textures still have to be copied to the GPU and possibly they have to be gamma corrected and converted to a format understood by the texture access functions you provide. In this example, we use the mi::neuraylib::IImage_api to apply the gamma correction and to convert the image format to a float32 RGBA format.

Depending on the texture shape returned by mi::neuraylib::ITarget_code::get_texture_shape() the texture image data has to be copied to a CUDA array (2D textures), a 3D array (3D textures) or a 3D array with the cudaArrayCubemap flag set (cube textures). The textures could then be made available via CUDA texture objects.

Note: For cube textures, you should use the cudaAddressModeClamp address mode for the texture objects to avoid visual artifacts in the corners.

Texture access functions (PTX)

For non-native backends, the generated code requires a set of methods implementing texture access functionality:

tex_lookup_float4_2d
tex_lookup_float3_2d
tex_texel_float4_2d
tex_lookup_float4_3d
tex_lookup_float3_3d
tex_texel_float4_3d
tex_lookup_float4_cube
tex_lookup_float3_cube
tex_resolution_2d

Except for the last one, these correspond directly to the functions described in section 20.3 "Standard library functions - Texture" in the MDL specification.

The tex_lookup_* functions receive floating-point texture coordinates and should provide a sampled value, whereas tex_texel_* functions receive integer texture coordinates and should provide a raw texture value. There are variants for texture lookups with and without alpha channel (float4 / float3) and for the different texture shapes (2d / 3d / cube) as described in section 6.12 "Variables and data types - Textures" in the MDL specification. Note, that PTEX textures are currently not supported by the backends.

tex_resolution_2d retrieves the width and height at the given uv-tile coordinates for a texture_2d. For non-uv-tile textures, the uv-tile coordinates are always (0, 0).

For the PTX backend, there are different ways how the methods can be provided, which can be chosen by setting the "tex_lookup_call_mode" option via the mi::neuraylib::IMdl_backend::set_option() method to the corresponding mode. Please refer to Texture lookup call modes of the PTX backend for more details.

In this example, you can switch from the "direct_call" mode to the slower "vtable" mode by commenting out this line in example_cuda_shared.h:

check_success(be_cuda_ptx->set_option("tex_lookup_call_mode", "direct_call") == 0);

Example Source

To compile the source code, you need to install the "CUDA Toolkit 9" available at https://developer.nvidia.com/cuda-downloads.

For Windows, you should install the toolkit with enabled Visual Studio integration to be able to use the provided project files.

For Linux and Mac OS X, you have to provide the path to the CUDA Toolkit installation via a "CUDA_PATH" environment variable or by setting the path in the Makefile.

Source Code Location: examples/mdl_sdk/execution_cuda/example_execution_cuda.cpp

/******************************************************************************
 * Copyright 2024 NVIDIA Corporation. All rights reserved.
 *****************************************************************************/
 
// examples/mdl_sdk/execution_cuda/example_execution_cuda.cpp
//
// Introduces execution of the generated code for compiled material sub-expressions
// for the PTX backend with CUDA.
 
#include <iostream>
#include <vector>
 
// Enable this to dump the generated PTX code to files in the current directory.
// #define DUMP_PTX
 
#include "example_cuda_shared.h"
 
// Command line options structure.
struct Options {
    // The CUDA device ID.
    int cuda_device;
 
    // An result output file name.
    std::string outputfile;
 
    // The pattern number representing the combination of materials to display.
    unsigned material_pattern;
 
    // The resolution of the display / image.
    unsigned res_x, res_y;
 
    // Whether class compilation should be used for the materials.
    bool use_class_compilation;
 
    // Disables pixel oversampling.
    bool no_aa;
 
    // Whether derivative support should be enabled.
    bool enable_derivatives;
 
    // Whether terninary operators on *df types are executed at runtime or folded at compile time.
    bool fold_ternary_on_df;
 
    // List of materials to use.
    std::vector<std::string> material_names;
 
    // The constructor.
    Options()
        : cuda_device(0)
        , outputfile()
        , material_pattern(0)
        , res_x(700)
        , res_y(520)
        , use_class_compilation(false)
        , no_aa(false)
        , enable_derivatives(false)
        , fold_ternary_on_df(false)
    {
    }
};
 
// Bake the material sub-expressions created with the PTX backend into a canvas with the given
// resolution and the given number of samples for super-sampling.
mi::neuraylib::ICanvas *bake_expression_cuda_ptx(
    mi::neuraylib::ITransaction       *transaction,
    mi::neuraylib::IImage_api         *image_api,
    std::vector<mi::base::Handle<const mi::neuraylib::ITarget_code> > const &target_codes,
    std::vector<size_t> const         &arg_block_indices,
    Options                           &options,
    mi::Uint32                         num_samples)
{
    // Build the full CUDA kernel with all the generated code
    CUfunction  cuda_function;
    char const *ptx_name = options.enable_derivatives ?
        "example_execution_cuda_derivatives.ptx" : "example_execution_cuda.ptx";
    CUmodule    cuda_module = build_linked_kernel(
        target_codes,
        (mi::examples::io::get_executable_folder() + "/" + ptx_name).c_str(),
        "evaluate_mat_expr",
        &cuda_function);
 
    // Prepare the needed data of all target codes for the GPU
    Material_gpu_context material_gpu_context(options.enable_derivatives);
    for (size_t i = 0, num_target_codes = target_codes.size(); i < num_target_codes; ++i) {
        if (!material_gpu_context.prepare_target_code_data(
                transaction, image_api, target_codes[i].get(), arg_block_indices))
            return nullptr;
    }
    CUdeviceptr device_tc_data_list = material_gpu_context.get_device_target_code_data_list();
    CUdeviceptr device_arg_block_list =
        material_gpu_context.get_device_target_argument_block_list();
 
    // Allocate GPU output buffer
    CUdeviceptr device_outbuf;
    check_cuda_success(cuMemAlloc(&device_outbuf, options.res_x * options.res_y * sizeof(float3)));
 
    // Launch kernel for the whole image
    dim3 threads_per_block(16, 16);
    dim3 num_blocks((options.res_x + 15) / 16, (options.res_y + 15) / 16);
    void *kernel_params[] = {
        &device_outbuf,
        &device_tc_data_list,
        &device_arg_block_list,
        &options.res_x,
        &options.res_y,
        &num_samples
    };
 
    check_cuda_success(cuLaunchKernel(
        cuda_function,
        num_blocks.x, num_blocks.y, num_blocks.z,
        threads_per_block.x, threads_per_block.y, threads_per_block.z,
        0, nullptr, kernel_params, nullptr));
 
    // Create a canvas (with only one tile) and copy the result image to it
    mi::base::Handle<mi::neuraylib::ICanvas> canvas(
        image_api->create_canvas("Rgb_fp", options.res_x, options.res_y));
    mi::base::Handle<mi::neuraylib::ITile> tile(canvas->get_tile());
    float3 *data = static_cast<float3 *>(tile->get_data());
    check_cuda_success(cuMemcpyDtoH(
        data, device_outbuf, options.res_x * options.res_y * sizeof(float3)));
 
    // Cleanup resources not handled by Material_gpu_context
    check_cuda_success(cuMemFree(device_outbuf));
    check_cuda_success(cuModuleUnload(cuda_module));
 
    canvas->retain();
    return canvas.get();
}
 
void usage(char const *prog_name)
{
    std::cout
        << "Usage: " << prog_name << " [options] [(<material_pattern | (<material_name1> ...)]\n"
        << "Options:\n"
        << "  --device <id>        run on CUDA device <id> (default: 0)\n"
        << "  --res <x> <y>        resolution (default: 700x520)\n"
        << "  --cc                 use class compilation\n"
        << "  --noaa               disable pixel oversampling\n"
        << "  -d                   enable use of derivatives\n"
        << "  -o <outputfile>      image file to write result to\n"
        << "                       (default: example_cuda_<material_pattern>.png)\n"
        << "  --mdl_path <path>    mdl search path, can occur multiple times.\n"
        << "  --fold_ternary_on_df fold all ternary operators on *df types\n"
        << "  <material_pattern>   a number from 1 to 2 ^ num_materials - 1 choosing which\n"
        << "                       material combination to use (default: 2 ^ num_materials - 1)\n"
        << "  <material_name*>     qualified name of materials to use. The example will try to\n"
        << "                       access the path \"surface.scattering.tint\"."
        << std::endl;
    exit_failure();
}
 
 
//------------------------------------------------------------------------------
//
// Main function
//
//------------------------------------------------------------------------------
 
int MAIN_UTF8(int argc, char* argv[])
{
    // Parse command line options
    Options options;
    mi::examples::mdl::Configure_options configure_options;
 
    for (int i = 1; i < argc; ++i) {
        char const *opt = argv[i];
        if (opt[0] == '-') {
            if (strcmp(opt, "-o") == 0 && i < argc - 1) {
                options.outputfile = argv[++i];
            } else if (strcmp(opt, "--device") == 0 && i < argc - 2) {
                options.cuda_device = atoi(argv[++i]);
            } else if (strcmp(opt, "--res") == 0 && i < argc - 2) {
                options.res_x = std::max(atoi(argv[++i]), 1);
                options.res_y = std::max(atoi(argv[++i]), 1);
            } else if (strcmp(opt, "--cc") == 0) {
                options.use_class_compilation = true;
            } else if (strcmp(opt, "--noaa") == 0) {
                options.no_aa = true;
            } else if (strcmp(opt, "-d") == 0) {
                options.enable_derivatives = true;
            } else if (strcmp(opt, "--mdl_path") == 0 && i < argc - 1) {
                configure_options.additional_mdl_paths.push_back(argv[++i]);
            } else if (strcmp(opt, "--fold_ternary_on_df") == 0) {
                options.fold_ternary_on_df = true;
            } else {
                std::cout << "Unknown option: \"" << opt << "\"" << std::endl;
                usage(argv[0]);
            }
        } else if (opt[0] >= '0' && opt[0] <= '9') {
            options.material_pattern = unsigned(atoi(opt));
        } else
            options.material_names.push_back(std::string(opt));
    }
 
    // Use default materials, if none was provided via command line
    if (options.material_names.empty()) {
        options.material_names.push_back("::nvidia::sdk_examples::tutorials::example_execution1");
        options.material_names.push_back("::nvidia::sdk_examples::tutorials::example_execution2");
        options.material_names.push_back("::nvidia::sdk_examples::tutorials::example_execution3");
    }
 
    if (options.material_pattern == 0)
        options.material_pattern = (1 << options.material_names.size()) - 1;
    else if (options.material_pattern < 1 ||
            options.material_pattern > unsigned(1 << options.material_names.size()) - 1) {
        std::cerr << "Invalid material_pattern parameter." << std::endl;
        usage(argv[0]);
    }
 
    if (options.outputfile.empty())
        options.outputfile = "example_cuda_" + to_string(options.material_pattern) + ".png";
 
    // Access the MDL SDK
    mi::base::Handle<mi::neuraylib::INeuray> neuray(mi::examples::mdl::load_and_get_ineuray());
    if (!neuray.is_valid_interface())
        exit_failure("Failed to load the SDK.");
 
    // Configure the MDL SDK
    if (!mi::examples::mdl::configure(neuray.get(), configure_options))
        exit_failure("Failed to initialize the SDK.");
 
    // Start the MDL SDK
    mi::Sint32 ret = neuray->start();
    if (ret != 0)
        exit_failure("Failed to initialize the SDK. Result code: %d", ret);
 
    {
        // Create a transaction
        mi::base::Handle<mi::neuraylib::IDatabase> database(
            neuray->get_api_component<mi::neuraylib::IDatabase>());
        mi::base::Handle<mi::neuraylib::IScope> scope(database->get_global_scope());
        mi::base::Handle<mi::neuraylib::ITransaction> transaction(scope->create_transaction());
 
        // Access needed API components
        mi::base::Handle<mi::neuraylib::IMdl_factory> mdl_factory(
            neuray->get_api_component<mi::neuraylib::IMdl_factory>());
 
        mi::base::Handle<mi::neuraylib::IMdl_impexp_api> mdl_impexp_api(
            neuray->get_api_component<mi::neuraylib::IMdl_impexp_api>());
 
        mi::base::Handle<mi::neuraylib::IMdl_backend_api> mdl_backend_api(
            neuray->get_api_component<mi::neuraylib::IMdl_backend_api>());
 
        mi::base::Handle<mi::neuraylib::IMdl_execution_context> context(
            mdl_factory->create_execution_context());
 
        {
            // Generate code for material sub-expressions of different materials
            // according to the requested material pattern
            std::vector<mi::base::Handle<const mi::neuraylib::ITarget_code> > target_codes;
 
            Material_compiler mc(
                mdl_impexp_api.get(),
                mdl_backend_api.get(),
                mdl_factory.get(),
                transaction.get(),
                /*num_texture_results=*/ 0,
#if !defined(MDL_SOURCE_RELEASE) && defined(MDL_ENABLE_INTERPRETER)
                /*use_df_interpreter=*/ false,
#endif
                options.enable_derivatives,
                options.fold_ternary_on_df,
                /*enable_axuiliary_output=*/ false,
                /*enable_pdf=*/ true,
                /*use_adapt_normal=*/ false,
                /*enable_bsdf_flags=*/ false,
                /*enable_mdl_next=*/ false,
                /*df_handle_mode=*/ "none",
                /*lambda_return_mode=*/ "sret");
 
            for (std::size_t i = 0, n = options.material_names.size(); i < n; ++i) {
                if ((options.material_pattern & (1 << i)) != 0) {
 
                    // split module and material name
                    std::string module_name, material_simple_name;
                    if (!mi::examples::mdl::parse_cmd_argument_material_name(
                        options.material_names[i], module_name, material_simple_name, true))
                            continue;
 
                    // load the module.
                    mdl_impexp_api->load_module(transaction.get(), module_name.c_str(), context.get());
                    if (!print_messages(context.get()))
                        exit_failure("Loading module '%s' failed.", module_name.c_str());
 
                    // get the database name for the module we loaded
                    mi::base::Handle<const mi::IString> module_db_name(
                        mdl_factory->get_db_module_name(module_name.c_str()));
                    mi::base::Handle<const mi::neuraylib::IModule> module(
                        transaction->access<mi::neuraylib::IModule>(module_db_name->get_c_str()));
                    if (!module)
                        exit_failure("Failed to access the loaded module.");
 
                    // Attach the material name
                    std::string material_db_name
                        = std::string(module_db_name->get_c_str()) + "::" + material_simple_name;
                    material_db_name = mi::examples::mdl::add_missing_material_signature(
                        module.get(), material_db_name);
                    if (material_db_name.empty())
                        exit_failure("Failed to find the material %s in the module %s.",
                            material_simple_name.c_str(), module_name.c_str());
 
                    // add the sub expression
                    mc.add_material_subexpr(
                        module_name, material_db_name,
                        "surface.scattering.tint", ("tint_" + to_string(i)).c_str(),
                        options.use_class_compilation);
                }
            }
 
            // Generate target code for link unit
            target_codes.push_back(mc.generate_cuda_ptx());
 
            // Acquire image API needed to prepare the textures and to create a canvas for baking
            mi::base::Handle<mi::neuraylib::IImage_api> image_api(
                neuray->get_api_component<mi::neuraylib::IImage_api>());
 
            // Bake the material sub-expressions into a canvas
            CUcontext cuda_context = init_cuda(options.cuda_device);
            mi::base::Handle<mi::neuraylib::ICanvas> canvas(
                bake_expression_cuda_ptx(
                    transaction.get(),
                    image_api.get(),
                    target_codes,
                    mc.get_argument_block_indices(),
                    options,
                    options.no_aa ? 1 : 8));
            uninit_cuda(cuda_context);
 
            // Export the canvas to an image on disk
            if (canvas)
                mdl_impexp_api->export_canvas(options.outputfile.c_str(), canvas.get());
        }
 
        transaction->commit();
    }
 
    // Shut down the MDL SDK
    if (neuray->shutdown() != 0)
        exit_failure("Failed to shutdown the SDK.");
 
    // Unload the MDL SDK
    neuray = nullptr;
    if (!mi::examples::mdl::unload())
        exit_failure("Failed to unload the SDK.");
 
    exit_success();
}
 
// Convert command line arguments to UTF8 on Windows
COMMANDLINE_TO_UTF8

Source Code Location: examples/mdl_sdk/shared/texture_support_cuda.h

/******************************************************************************
 * Copyright 2024 NVIDIA Corporation. All rights reserved.
 *****************************************************************************/
 
// examples/mdl_sdk/shared/texture_support_cuda.h
//
// This file contains the implementations and the vtables of the texture access functions.
 
#ifndef TEXTURE_SUPPORT_CUDA_H
#define TEXTURE_SUPPORT_CUDA_H
 
#include <cuda.h>
#include <cuda_runtime.h>
#include <math.h>
 
#include <mi/neuraylib/target_code_types.h>
 
#define USE_SMOOTHERSTEP_FILTER
#ifndef M_PI
    #define M_PI            3.14159265358979323846
#endif
#define M_ONE_OVER_PI       0.318309886183790671538
 
 
typedef mi::neuraylib::tct_deriv_float                     tct_deriv_float;
typedef mi::neuraylib::tct_deriv_float2                    tct_deriv_float2;
typedef mi::neuraylib::tct_deriv_arr_float_2               tct_deriv_arr_float_2;
typedef mi::neuraylib::tct_deriv_arr_float_3               tct_deriv_arr_float_3;
typedef mi::neuraylib::tct_deriv_arr_float_4               tct_deriv_arr_float_4;
typedef mi::neuraylib::Shading_state_material_with_derivs  Shading_state_material_with_derivs;
typedef mi::neuraylib::Shading_state_material              Shading_state_material;
typedef mi::neuraylib::Texture_handler_base                Texture_handler_base;
typedef mi::neuraylib::Tex_wrap_mode                       Tex_wrap_mode;
typedef mi::neuraylib::Mbsdf_part                          Mbsdf_part;
 
 
// Custom structure representing an MDL texture, containing filtered and unfiltered CUDA texture
// objects and the size of the texture.
struct Texture
{
    explicit Texture()
        : filtered_object(0)
        , unfiltered_object(0)
        , size(make_uint3(0, 0, 0))
        , inv_size(make_float3(0.0f, 0.0f, 0.0f))
    {}
 
    explicit Texture(
        cudaTextureObject_t  filtered_object,
        cudaTextureObject_t  unfiltered_object,
        uint3                size)
        : filtered_object(filtered_object)
        , unfiltered_object(unfiltered_object)
        , size(size)
        , inv_size(make_float3(1.0f / size.x, 1.0f / size.y, 1.0f / size.z))
    {}
 
    cudaTextureObject_t  filtered_object;    // uses filter mode cudaFilterModeLinear
    cudaTextureObject_t  unfiltered_object;  // uses filter mode cudaFilterModePoint
    uint3                size;               // size of the texture, needed for texel access
    float3               inv_size;           // the inverse values of the size of the texture
};
 
// Custom structure representing an MDL BSDF measurement.
struct Mbsdf
{
    unsigned            has_data[2];            // true if there is a measurement for this part
    cudaTextureObject_t eval_data[2];           // uses filter mode cudaFilterModeLinear
    float               max_albedo[2];          // max albedo used to limit the multiplier
    float*              sample_data[2];         // CDFs for sampling a BSDF measurement
    float*              albedo_data[2];         // max albedo for each theta (isotropic)
 
    uint2           angular_resolution[2];      // size of the dataset, needed for texel access
    float2          inv_angular_resolution[2];  // the inverse values of the size of the dataset
    unsigned        num_channels[2];            // number of color channels (1 or 3)
};
 
// Structure representing a Light Profile
struct Lightprofile
{
    explicit Lightprofile()
        : angular_resolution(make_uint2(0, 0))
        , inv_angular_resolution(make_float2(0.0f, 0.0f))
        , theta_phi_start(make_float2(0.0f, 0.0f))
        , theta_phi_delta(make_float2(0.0f, 0.0f))
        , theta_phi_inv_delta(make_float2(0.0f, 0.0f))
        , candela_multiplier(0.0f)
        , total_power(0.0f)
        , eval_data(0)
    {
    }
 
    uint2           angular_resolution;     // angular resolution of the grid
    float2          inv_angular_resolution; // inverse angular resolution of the grid
    float2          theta_phi_start;        // start of the grid
    float2          theta_phi_delta;        // angular step size
    float2          theta_phi_inv_delta;    // inverse step size
    float           candela_multiplier;     // factor to rescale the normalized data
    float           total_power;
 
    cudaTextureObject_t eval_data;          // normalized data sampled on grid
    float*              cdf_data;           // CDFs for sampling a light profile
};
 
 
// The texture handler structure required by the MDL SDK with custom additional fields.
struct Texture_handler : Texture_handler_base {
    // additional data for the texture access functions can be provided here
 
    size_t         num_textures;        // the number of textures used by the material
                                        // (without the invalid texture)
    Texture const *textures;            // the textures used by the material
                                        // (without the invalid texture)
 
    size_t         num_mbsdfs;          // the number of mbsdfs used by the material
                                        // (without the invalid mbsdf)
    Mbsdf const    *mbsdfs;             // the mbsdfs used by the material
                                        // (without the invalid mbsdf)
 
    size_t          num_lightprofiles;  // number of elements in the lightprofiles field
                                        // (without the invalid light profile)
    Lightprofile const *lightprofiles;  // a device pointer to a list of mbsdfs objects, if used
                                        // (without the invalid light profile)
};
 
// The texture handler structure required by the MDL SDK with custom additional fields.
struct Texture_handler_deriv : mi::neuraylib::Texture_handler_deriv_base {
    // additional data for the texture access functions can be provided here
 
    size_t         num_textures;        // the number of textures used by the material
                                        // (without the invalid texture)
    Texture const *textures;            // the textures used by the material
                                        // (without the invalid texture)
 
    size_t         num_mbsdfs;          // the number of mbsdfs used by the material
                                        // (without the invalid texture)
    Mbsdf const    *mbsdfs;             // the mbsdfs used by the material
                                        // (without the invalid texture)
 
    size_t          num_lightprofiles;  // number of elements in the lightprofiles field
                                        // (without the invalid light profile)
    Lightprofile const *lightprofiles;  // a device pointer to a list of mbsdfs objects, if used
                                        // (without the invalid light profile)
};
 
 
#if defined(__CUDACC__)
 
// Stores a float4 in a float[4] array.
__device__ inline void store_result4(float res[4], const float4 &v)
{
    res[0] = v.x;
    res[1] = v.y;
    res[2] = v.z;
    res[3] = v.w;
}
 
// Stores a float in all elements of a float[4] array.
__device__ inline void store_result4(float res[4], float s)
{
    res[0] = res[1] = res[2] = res[3] = s;
}
 
// Stores the given float values in a float[4] array.
__device__ inline void store_result4(
    float res[4], float v0, float v1, float v2, float v3)
{
    res[0] = v0;
    res[1] = v1;
    res[2] = v2;
    res[3] = v3;
}
 
// Stores a float3 in a float[3] array.
__device__ inline void store_result3(float res[3], float3 const&v)
{
    res[0] = v.x;
    res[1] = v.y;
    res[2] = v.z;
}
 
// Stores a float4 in a float[3] array, ignoring v.w.
__device__ inline void store_result3(float res[3], const float4 &v)
{
    res[0] = v.x;
    res[1] = v.y;
    res[2] = v.z;
}
 
// Stores a float in all elements of a float[3] array.
__device__ inline void store_result3(float res[3], float s)
{
    res[0] = res[1] = res[2] = s;
}
 
// Stores the given float values in a float[3] array.
__device__ inline void store_result3(float res[3], float v0, float v1, float v2)
{
    res[0] = v0;
    res[1] = v1;
    res[2] = v2;
}
 
 
// ------------------------------------------------------------------------------------------------
// Textures
// ------------------------------------------------------------------------------------------------
 
// Applies wrapping and cropping to the given coordinate.
// Note: This macro returns if wrap mode is clip and the coordinate is out of range.
#define WRAP_AND_CROP_OR_RETURN_BLACK(val, inv_dim, wrap_mode, crop_vals, store_res_func)    \
  do {                                                                                       \
    if ( (wrap_mode) == mi::neuraylib::TEX_WRAP_REPEAT &&                                    \
        (crop_vals)[0] == 0.0f && (crop_vals)[1] == 1.0f ) {                                 \
      /* Do nothing, use texture sampler default behavior */                                 \
    }                                                                                        \
    else                                                                                     \
    {                                                                                        \
      if ( (wrap_mode) == mi::neuraylib::TEX_WRAP_REPEAT )                                   \
        val = val - floorf(val);                                                             \
      else {                                                                                 \
        if ( (wrap_mode) == mi::neuraylib::TEX_WRAP_CLIP && (val < 0.0f || val >= 1.0f) ) {  \
          store_res_func(result, 0.0f);                                                      \
          return;                                                                            \
        }                                                                                    \
        else if ( (wrap_mode) == mi::neuraylib::TEX_WRAP_MIRRORED_REPEAT ) {                 \
          float floored_val = floorf(val);                                                   \
          if ( (int(floored_val) & 1) != 0 )                                                 \
            val = 1.0f - (val - floored_val);                                                \
          else                                                                               \
            val = val - floored_val;                                                         \
        }                                                                                    \
        float inv_hdim = 0.5f * (inv_dim);                                                   \
        val = fminf(fmaxf(val, inv_hdim), 1.f - inv_hdim);                                   \
      }                                                                                      \
      val = val * ((crop_vals)[1] - (crop_vals)[0]) + (crop_vals)[0];                        \
    }                                                                                        \
  } while ( 0 )
 
 
#ifdef USE_SMOOTHERSTEP_FILTER
// Modify texture coordinates to get better texture filtering,
// see http://www.iquilezles.org/www/articles/texture/texture.htm
#define APPLY_SMOOTHERSTEP_FILTER()                                                         \
    do {                                                                                    \
        u = u * tex.size.x + 0.5f;                                                          \
        v = v * tex.size.y + 0.5f;                                                          \
                                                                                            \
        float u_i = floorf(u), v_i = floorf(v);                                             \
        float u_f = u - u_i;                                                                \
        float v_f = v - v_i;                                                                \
        u_f = u_f * u_f * u_f * (u_f * (u_f * 6.f - 15.f) + 10.f);                          \
        v_f = v_f * v_f * v_f * (v_f * (v_f * 6.f - 15.f) + 10.f);                          \
        u = u_i + u_f;                                                                      \
        v = v_i + v_f;                                                                      \
                                                                                            \
        u = (u - 0.5f) * tex.inv_size.x;                                                    \
        v = (v - 0.5f) * tex.inv_size.y;                                                    \
    } while ( 0 )
#else
#define APPLY_SMOOTHERSTEP_FILTER()
#endif
 
 
// Implementation of tex::lookup_float4() for a texture_2d texture.
extern "C" __device__ void tex_lookup_float4_2d(
    float                       result[4],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float const                 coord[2],
    Tex_wrap_mode const         wrap_u,
    Tex_wrap_mode const         wrap_v,
    float const                 crop_u[2],
    float const                 crop_v[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result4(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
    float u = coord[0], v = coord[1];
    WRAP_AND_CROP_OR_RETURN_BLACK(u, tex.inv_size.x, wrap_u, crop_u, store_result4);
    WRAP_AND_CROP_OR_RETURN_BLACK(v, tex.inv_size.y, wrap_v, crop_v, store_result4);
 
    APPLY_SMOOTHERSTEP_FILTER();
 
    store_result4(result, tex2D<float4>(tex.filtered_object, u, v));
}
 
// Implementation of tex::lookup_float4() for a texture_2d texture.
extern "C" __device__ void tex_lookup_deriv_float4_2d(
    float                       result[4],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    tct_deriv_float2 const     *coord,
    Tex_wrap_mode const         wrap_u,
    Tex_wrap_mode const         wrap_v,
    float const                 crop_u[2],
    float const                 crop_v[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result4(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
    float u = coord->val.x, v = coord->val.y;
    WRAP_AND_CROP_OR_RETURN_BLACK(u, tex.inv_size.x, wrap_u, crop_u, store_result4);
    WRAP_AND_CROP_OR_RETURN_BLACK(v, tex.inv_size.y, wrap_v, crop_v, store_result4);
 
    APPLY_SMOOTHERSTEP_FILTER();
 
    store_result4(result, tex2DGrad<float4>(tex.filtered_object, u, v, coord->dx, coord->dy));
}
 
// Implementation of tex::lookup_float3() for a texture_2d texture.
extern "C" __device__ void tex_lookup_float3_2d(
    float                       result[3],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float const                 coord[2],
    Tex_wrap_mode const         wrap_u,
    Tex_wrap_mode const         wrap_v,
    float const                 crop_u[2],
    float const                 crop_v[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result3(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
    float u = coord[0], v = coord[1];
    WRAP_AND_CROP_OR_RETURN_BLACK(u, tex.inv_size.x, wrap_u, crop_u, store_result3);
    WRAP_AND_CROP_OR_RETURN_BLACK(v, tex.inv_size.y, wrap_v, crop_v, store_result3);
 
    APPLY_SMOOTHERSTEP_FILTER();
 
    store_result3(result, tex2D<float4>(tex.filtered_object, u, v));
}
 
// Implementation of tex::lookup_float3() for a texture_2d texture.
extern "C" __device__ void tex_lookup_deriv_float3_2d(
    float                       result[3],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    tct_deriv_float2 const     *coord,
    Tex_wrap_mode const         wrap_u,
    Tex_wrap_mode const         wrap_v,
    float const                 crop_u[2],
    float const                 crop_v[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result3(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
    float u = coord->val.x, v = coord->val.y;
    WRAP_AND_CROP_OR_RETURN_BLACK(u, tex.inv_size.x, wrap_u, crop_u, store_result3);
    WRAP_AND_CROP_OR_RETURN_BLACK(v, tex.inv_size.y, wrap_v, crop_v, store_result3);
 
    APPLY_SMOOTHERSTEP_FILTER();
 
    store_result3(result, tex2DGrad<float4>(tex.filtered_object, u, v, coord->dx, coord->dy));
}
 
// Implementation of tex::texel_float4() for a texture_2d texture.
// Note: uvtile and/or animated textures are not supported
extern "C" __device__ void tex_texel_float4_2d(
    float                       result[4],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    int const                   coord[2],
    int const                   /*uv_tile*/[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result4(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
 
    store_result4(result, tex2D<float4>(
        tex.unfiltered_object,
        float(coord[0]) * tex.inv_size.x,
        float(coord[1]) * tex.inv_size.y));
}
 
// Implementation of tex::lookup_float4() for a texture_3d texture.
extern "C" __device__ void tex_lookup_float4_3d(
    float                       result[4],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float const                 coord[3],
    Tex_wrap_mode               wrap_u,
    Tex_wrap_mode               wrap_v,
    Tex_wrap_mode               wrap_w,
    float const                 crop_u[2],
    float const                 crop_v[2],
    float const                 crop_w[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result4(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
 
    float u = coord[0], v = coord[1], w = coord[2];
    WRAP_AND_CROP_OR_RETURN_BLACK(u, tex.inv_size.x, wrap_u, crop_u, store_result4);
    WRAP_AND_CROP_OR_RETURN_BLACK(v, tex.inv_size.y, wrap_v, crop_v, store_result4);
    WRAP_AND_CROP_OR_RETURN_BLACK(w, tex.inv_size.z, wrap_w, crop_w, store_result4);
 
    store_result4(result, tex3D<float4>(tex.filtered_object, u, v, w));
}
 
// Implementation of tex::lookup_float3() for a texture_3d texture.
extern "C" __device__ void tex_lookup_float3_3d(
    float                       result[3],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float const                 coord[3],
    Tex_wrap_mode               wrap_u,
    Tex_wrap_mode               wrap_v,
    Tex_wrap_mode               wrap_w,
    float const                 crop_u[2],
    float const                 crop_v[2],
    float const                 crop_w[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result3(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
 
    float u = coord[0], v = coord[1], w = coord[2];
    WRAP_AND_CROP_OR_RETURN_BLACK(u, tex.inv_size.x, wrap_u, crop_u, store_result3);
    WRAP_AND_CROP_OR_RETURN_BLACK(v, tex.inv_size.y, wrap_v, crop_v, store_result3);
    WRAP_AND_CROP_OR_RETURN_BLACK(w, tex.inv_size.z, wrap_w, crop_w, store_result3);
 
    store_result3(result, tex3D<float4>(tex.filtered_object, u, v, w));
}
 
// Implementation of tex::texel_float4() for a texture_3d texture.
extern "C" __device__ void tex_texel_float4_3d(
    float                       result[4],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    const int                   coord[3],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result4(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
 
    store_result4(result, tex3D<float4>(
        tex.unfiltered_object,
        float(coord[0]) * tex.inv_size.x,
        float(coord[1]) * tex.inv_size.y,
        float(coord[2]) * tex.inv_size.z));
}
 
// Implementation of tex::lookup_float4() for a texture_cube texture.
extern "C" __device__ void tex_lookup_float4_cube(
    float                       result[4],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float const                 coord[3])
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result4(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
 
    store_result4(result, texCubemap<float4>(tex.filtered_object, coord[0], coord[1], coord[2]));
}
 
// Implementation of tex::lookup_float3() for a texture_cube texture.
extern "C" __device__ void tex_lookup_float3_cube(
    float                       result[3],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float const                 coord[3])
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        store_result3(result, 0.0f);
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
 
    store_result3(result, texCubemap<float4>(tex.filtered_object, coord[0], coord[1], coord[2]));
}
 
// Implementation of resolution_2d function needed by generated code.
// Note: uvtile and/or animated textures are not supported
extern "C" __device__ void tex_resolution_2d(
    int                         result[2],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    int const                   /*uv_tile*/[2],
    float                       /*frame*/)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if ( texture_idx == 0 || texture_idx - 1 >= self->num_textures ) {
        // invalid texture returns zero
        result[0] = 0;
        result[1] = 0;
        return;
    }
 
    Texture const &tex = self->textures[texture_idx - 1];
    result[0] = tex.size.x;
    result[1] = tex.size.y;
}
 
// Implementation of resolution_3d function needed by generated code.
extern "C" __device__ void tex_resolution_3d(
    int                         result[3],
    Texture_handler_base const *self_base,
    unsigned                    texture_idx,
    float                       /*frame*/)
{
    Texture_handler const* self = static_cast<Texture_handler const*>(self_base);
 
    if (texture_idx == 0 || texture_idx - 1 >= self->num_textures) {
        // invalid texture returns zero
        result[0] = 0;
        result[1] = 0;
        result[2] = 0;
        return;
    }
 
    Texture const& tex = self->textures[texture_idx - 1];
    result[0] = tex.size.x;
    result[1] = tex.size.y;
    result[2] = tex.size.z;
}
 
// Implementation of texture_isvalid().
extern "C" __device__ bool tex_texture_isvalid(
    Texture_handler_base const *self_base,
    unsigned                    texture_idx)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    return texture_idx != 0 && texture_idx - 1 < self->num_textures;
}
 
// Implementation of frame function needed by generated code.
// Note: frames are not implemented for the MDL SDK examples.
extern "C" __device__ void tex_frame(
    int                         result[2],
    Texture_handler_base const */*self_base*/,
    unsigned                    /*texture_idx*/)
{
    result[0] = 0;
    result[1] = 0;
}
 
 
// ------------------------------------------------------------------------------------------------
// Light Profiles
// ------------------------------------------------------------------------------------------------
 
 
// Implementation of light_profile_power() for a light profile.
extern "C" __device__ float df_light_profile_power(
    Texture_handler_base const *self_base,
    unsigned                    light_profile_idx)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (light_profile_idx == 0 || light_profile_idx - 1 >= self->num_lightprofiles)
        return 0.0f; // invalid light profile returns zero
 
    const Lightprofile& lp = self->lightprofiles[light_profile_idx - 1];
    return lp.total_power;
}
 
// Implementation of light_profile_maximum() for a light profile.
extern "C" __device__ float df_light_profile_maximum(
    Texture_handler_base const *self_base,
    unsigned                    light_profile_idx)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (light_profile_idx == 0 || light_profile_idx - 1 >= self->num_lightprofiles)
        return 0.0f; // invalid light profile returns zero
 
    const Lightprofile& lp = self->lightprofiles[light_profile_idx - 1];
    return lp.candela_multiplier;
}
 
// Implementation of light_profile_isvalid() for a light profile.
extern "C" __device__ bool df_light_profile_isvalid(
    Texture_handler_base const *self_base,
    unsigned                    light_profile_idx)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    return light_profile_idx != 0 && light_profile_idx - 1 < self->num_lightprofiles;
}
 
// binary search through CDF
__device__ inline unsigned sample_cdf(
    const float* cdf,
    unsigned cdf_size,
    float xi)
{
    unsigned li = 0;
    unsigned ri = cdf_size - 1;
    unsigned m = (li + ri) / 2;
    while (ri > li)
    {
        if (xi < cdf[m])
            ri = m;
        else
            li = m + 1;
 
        m = (li + ri) / 2;
    }
 
    return m;
}
 
 
// Implementation of df::light_profile_evaluate() for a light profile.
extern "C" __device__ float df_light_profile_evaluate(
    Texture_handler_base const  *self_base,
    unsigned                    light_profile_idx,
    float const                 theta_phi[2])
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (light_profile_idx == 0 || light_profile_idx - 1 >= self->num_lightprofiles)
        return 0.0f;  // invalid light profile returns zero
 
    const Lightprofile& lp = self->lightprofiles[light_profile_idx - 1];
 
    // map theta to 0..1 range
    float u = (theta_phi[0] - lp.theta_phi_start.x) *
        lp.theta_phi_inv_delta.x * lp.inv_angular_resolution.x;
 
    // converting input phi from -pi..pi to 0..2pi
    float phi = (theta_phi[1] > 0.0f) ? theta_phi[1] : (float(2.0 * M_PI) + theta_phi[1]);
 
    // floorf wraps phi range into 0..2pi
    phi = phi - lp.theta_phi_start.y -
        floorf((phi - lp.theta_phi_start.y) * float(0.5 / M_PI)) * float(2.0 * M_PI);
 
    // (phi < 0.0f) is no problem, this is handled by the (black) border
    // since it implies lp.theta_phi_start.y > 0 (and we really have "no data" below that)
    float v = phi * lp.theta_phi_inv_delta.y * lp.inv_angular_resolution.y;
 
    // half pixel offset
    // see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#linear-filtering
    u += 0.5f * lp.inv_angular_resolution.x;
    v += 0.5f * lp.inv_angular_resolution.y;
 
    // wrap_mode: border black would be an alternative (but it produces artifacts at low res)
    if (u < 0.0f || u > 1.0f || v < 0.0f || v > 1.0f) return 0.0f;
 
    return tex2D<float>(lp.eval_data, u, v) * lp.candela_multiplier;
}
 
// Implementation of df::light_profile_sample() for a light profile.
extern "C" __device__ void df_light_profile_sample(
    float                       result[3],          // output: theta, phi, pdf
    Texture_handler_base const  *self_base,
    unsigned                    light_profile_idx,
    float const                 xi[3])              // uniform random values
{
    result[0] = -1.0f;  // negative theta means no emission
    result[1] = -1.0f;
    result[2] = 0.0f;
 
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (light_profile_idx == 0 || light_profile_idx - 1 >= self->num_lightprofiles)
        return;  // invalid light profile returns zero
 
    const Lightprofile& lp = self->lightprofiles[light_profile_idx - 1];
    uint2 res = lp.angular_resolution;
    if (res.x <= 2 || res.y <= 2)
        return;  // invalid resolution
 
 
    // sample theta_out
    //-------------------------------------------
    float xi0 = xi[0];
    const float* cdf_data_theta = lp.cdf_data;                          // CDF theta
    unsigned idx_theta = sample_cdf(cdf_data_theta, res.x - 1, xi0);    // binary search
 
    float prob_theta = cdf_data_theta[idx_theta];
    if (idx_theta > 0)
    {
        const float tmp = cdf_data_theta[idx_theta - 1];
        prob_theta -= tmp;
        xi0 -= tmp;
    }
    xi0 /= prob_theta;  // rescale for re-usage
 
    // sample phi_out
    //-------------------------------------------
    float xi1 = xi[1];
    const float* cdf_data_phi = cdf_data_theta + (res.x - 1)            // CDF theta block
                              + (idx_theta * (res.y - 1));              // selected CDF for phi
    const unsigned idx_phi = sample_cdf(cdf_data_phi, res.y - 1, xi1);  // binary search
 
    float prob_phi = cdf_data_phi[idx_phi];
    if (idx_phi > 0)
    {
        const float tmp = cdf_data_phi[idx_phi - 1];
        prob_phi -= tmp;
        xi1 -= tmp;
    }
    xi1 /= prob_phi;  // rescale for re-usage
 
    // compute theta and phi
    //-------------------------------------------
    // sample uniformly within the patch (grid cell)
    const float2 start = lp.theta_phi_start;
    const float2 delta = lp.theta_phi_delta;
 
    const float cos_theta_0 = cosf(start.x + float(idx_theta)      * delta.x);
    const float cos_theta_1 = cosf(start.x + float(idx_theta + 1u) * delta.x);
 
    //               n = \int_{\theta_0}^{\theta_1} \sin{\theta} \delta \theta
    //                 = 1 / (\cos{\theta_0} - \cos{\theta_1})
    //
    //             \xi = n * \int_{\theta_0}^{\theta_1} \sin{\theta} \delta \theta
    // => \cos{\theta} = (1 - \xi) \cos{\theta_0} + \xi \cos{\theta_1}
 
    const float cos_theta = (1.0f - xi1) * cos_theta_0 + xi1 * cos_theta_1;
    result[0] = acosf(cos_theta);
    result[1] = start.y + (float(idx_phi) + xi0) * delta.y;
 
    // align phi
    if (result[1] > float(2.0 * M_PI)) result[1] -= float(2.0 * M_PI);              // wrap
    if (result[1] > float(1.0 * M_PI)) result[1] = float(-2.0 * M_PI) + result[1];  // to [-pi, pi]
 
    // compute pdf
    //-------------------------------------------
    result[2] = prob_theta * prob_phi / (delta.y * (cos_theta_0 - cos_theta_1));
}
 
 
// Implementation of df::light_profile_pdf() for a light profile.
extern "C" __device__ float df_light_profile_pdf(
    Texture_handler_base const  *self_base,
    unsigned                    light_profile_idx,
    float const                 theta_phi[2])
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (light_profile_idx == 0 || light_profile_idx - 1 >= self->num_lightprofiles)
        return 0.0f;  // invalid light profile returns zero
 
    const Lightprofile& lp = self->lightprofiles[light_profile_idx - 1];
    const uint2 res = lp.angular_resolution;
    const float* cdf_data_theta = lp.cdf_data;
 
    // map theta to 0..1 range
    const float theta = theta_phi[0] - lp.theta_phi_start.x;
    const int idx_theta = int(theta * lp.theta_phi_inv_delta.x);
 
    // converting input phi from -pi..pi to 0..2pi
    float phi = (theta_phi[1] > 0.0f) ? theta_phi[1] : (float(2.0 * M_PI) + theta_phi[1]);
 
    // floorf wraps phi range into 0..2pi
    phi = phi - lp.theta_phi_start.y -
        floorf((phi - lp.theta_phi_start.y) * float(0.5 / M_PI)) * float(2.0 * M_PI);
 
    // (phi < 0.0f) is no problem, this is handled by the (black) border
    // since it implies lp.theta_phi_start.y > 0 (and we really have "no data" below that)
    const int idx_phi = int(phi * lp.theta_phi_inv_delta.y);
 
    // wrap_mode: border black would be an alternative (but it produces artifacts at low res)
    if (idx_theta < 0 || idx_theta > res.x - 2 || idx_phi < 0 || idx_phi > res.y - 2)
        return 0.0f;
 
 
    // get probability for theta
    //-------------------------------------------
    float prob_theta = cdf_data_theta[idx_theta];
    if (idx_theta > 0)
    {
        const float tmp = cdf_data_theta[idx_theta - 1];
        prob_theta -= tmp;
    }
 
    // get probability for phi
    //-------------------------------------------
    const float* cdf_data_phi = cdf_data_theta
        + (res.x - 1)                             // CDF theta block
        + (idx_theta * (res.y - 1));              // selected CDF for phi
 
    float prob_phi = cdf_data_phi[idx_phi];
    if (idx_phi > 0)
    {
        const float tmp = cdf_data_phi[idx_phi - 1];
        prob_phi -= tmp;
    }
 
    // compute probability to select a position in the sphere patch
    const float2 start = lp.theta_phi_start;
    const float2 delta = lp.theta_phi_delta;
 
    const float cos_theta_0 = cosf(start.x + float(idx_theta)      * delta.x);
    const float cos_theta_1 = cosf(start.x + float(idx_theta + 1u) * delta.x);
 
    return prob_theta * prob_phi / (delta.y * (cos_theta_0 - cos_theta_1));
}
 
 
// ------------------------------------------------------------------------------------------------
// BSDF Measurements
// ------------------------------------------------------------------------------------------------
 
// Implementation of bsdf_measurement_isvalid() for an MBSDF.
extern "C" __device__ bool df_bsdf_measurement_isvalid(
    Texture_handler_base const *self_base,
    unsigned                    bsdf_measurement_idx)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    return bsdf_measurement_idx != 0 && bsdf_measurement_idx - 1 < self->num_mbsdfs;
}
 
// Implementation of df::bsdf_measurement_resolution() function needed by generated code,
// which retrieves the angular and chromatic resolution of the given MBSDF.
// The returned triple consists of: number of equi-spaced steps of theta_i and theta_o,
// number of equi-spaced steps of phi, and number of color channels (1 or 3).
extern "C" __device__ void df_bsdf_measurement_resolution(
    unsigned                    result[3],
    Texture_handler_base const  *self_base,
    unsigned                    bsdf_measurement_idx,
    Mbsdf_part                  part)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if (bsdf_measurement_idx == 0 || bsdf_measurement_idx - 1 >= self->num_mbsdfs)
    {
        // invalid MBSDF returns zero
        result[0] = 0;
        result[1] = 0;
        result[2] = 0;
        return;
    }
 
    Mbsdf const &bm = self->mbsdfs[bsdf_measurement_idx - 1];
    const unsigned part_idx = static_cast<unsigned>(part);
 
    // check for the part
    if (part_idx > 1 || bm.has_data[part_idx] == 0)
    {
        result[0] = 0;
        result[1] = 0;
        result[2] = 0;
        return;
    }
 
    // pass out the information
    result[0] = bm.angular_resolution[part_idx].x;
    result[1] = bm.angular_resolution[part_idx].y;
    result[2] = bm.num_channels[part_idx];
}
 
 
__device__ inline float3 bsdf_compute_uvw(const float theta_phi_in[2],
                                          const float theta_phi_out[2])
{
    // assuming each phi is between -pi and pi
    float u = theta_phi_out[1] - theta_phi_in[1];
    if (u < 0.0) u += float(2.0 * M_PI);
    if (u > float(1.0 * M_PI)) u = float(2.0 * M_PI) - u;
    u *= float(M_ONE_OVER_PI);
 
    const float v = theta_phi_out[0] * float(2.0 / M_PI);
    const float w = theta_phi_in[0] * float(2.0 / M_PI);
 
    return make_float3(u, v, w);
}
 
template<typename T>
__device__ inline T bsdf_measurement_lookup(const cudaTextureObject_t& eval_volume,
                                            const float theta_phi_in[2],
                                            const float theta_phi_out[2])
{
    // 3D volume on the GPU (phi_delta x theta_out x theta_in)
    const float3 uvw = bsdf_compute_uvw(theta_phi_in, theta_phi_out);
    return tex3D<T>(eval_volume, uvw.x, uvw.y, uvw.z);
}
 
// Implementation of df::bsdf_measurement_evaluate() for an MBSDF.
extern "C" __device__ void df_bsdf_measurement_evaluate(
    float                       result[3],
    Texture_handler_base const  *self_base,
    unsigned                    bsdf_measurement_idx,
    float const                 theta_phi_in[2],
    float const                 theta_phi_out[2],
    Mbsdf_part                  part)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if (bsdf_measurement_idx == 0 || bsdf_measurement_idx - 1 >= self->num_mbsdfs)
    {
        // invalid MBSDF returns zero
        store_result3(result, 0.0f);
        return;
    }
 
    const Mbsdf& bm = self->mbsdfs[bsdf_measurement_idx - 1];
    const unsigned part_idx = static_cast<unsigned>(part);
 
    // check for the parts
    if (part_idx > 1 || bm.has_data[part_idx] == 0)
    {
        store_result3(result, 0.0f);
        return;
    }
 
    // handle channels
    if (bm.num_channels[part_idx] == 3)
    {
        const float4 sample = bsdf_measurement_lookup<float4>(
            bm.eval_data[part_idx], theta_phi_in, theta_phi_out);
        store_result3(result, sample.x, sample.y, sample.z);
    }
    else
    {
        const float sample = bsdf_measurement_lookup<float>(
            bm.eval_data[part_idx], theta_phi_in, theta_phi_out);
        store_result3(result, sample);
    }
}
 
// Implementation of df::bsdf_measurement_sample() for an MBSDF.
extern "C" __device__ void df_bsdf_measurement_sample(
    float                       result[3],          // output: theta, phi, pdf
    Texture_handler_base const  *self_base,
    unsigned                    bsdf_measurement_idx,
    float const                 theta_phi_out[2],
    float const                 xi[3],              // uniform random values
    Mbsdf_part                  part)
{
    result[0] = -1.0f;  // negative theta means absorption
    result[1] = -1.0f;
    result[2] = 0.0f;
 
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (bsdf_measurement_idx == 0 || bsdf_measurement_idx - 1 >= self->num_mbsdfs)
        return;  // invalid MBSDFs returns zero
 
    const Mbsdf& bm = self->mbsdfs[bsdf_measurement_idx - 1];
    unsigned part_idx = static_cast<unsigned>(part);
 
    if (part_idx > 1 || bm.has_data[part_idx] == 0)
        return;  // check for the part
 
    // CDF data
    uint2 res = bm.angular_resolution[part_idx];
    const float* sample_data = bm.sample_data[part_idx];
    if (res.x < 1 || res.y < 1)
        return;  // invalid resolution
 
    unsigned idx_theta_out = unsigned(theta_phi_out[0] * float(M_ONE_OVER_PI * 2.0f) * float(res.x));
    idx_theta_out = min(idx_theta_out, res.x - 1);
 
    // sample theta_in
    //-------------------------------------------
    float xi0 = xi[0];
    const float* cdf_theta = sample_data + idx_theta_out * res.x;
    unsigned idx_theta_in = sample_cdf(cdf_theta, res.x, xi0);       // binary search
 
    float prob_theta = cdf_theta[idx_theta_in];
    if (idx_theta_in > 0)
    {
        const float tmp = cdf_theta[idx_theta_in - 1];
        prob_theta -= tmp;
        xi0 -= tmp;
    }
    xi0 /= prob_theta;  // rescale for re-usage
 
    // sample phi
    //-------------------------------------------
    float xi1 = xi[1];
    const float* cdf_phi = sample_data +
                           (res.x * res.x) +                                // CDF theta block
                           (idx_theta_out * res.x + idx_theta_in) * res.y;  // selected CDF phi
 
    // select which half-circle to choose with probability 0.5
    const bool flip = (xi1 > 0.5f);
    if (flip)
        xi1 = 1.0f - xi1;
    xi1 *= 2.0f;
 
    unsigned idx_phi = sample_cdf(cdf_phi, res.y, xi1);           // binary search
    float prob_phi = cdf_phi[idx_phi];
    if (idx_phi > 0)
    {
        const float tmp = cdf_phi[idx_phi - 1];
        prob_phi -= tmp;
        xi1 -= tmp;
    }
    xi1 /= prob_phi;  // rescale for re-usage
 
    // compute direction
    //-------------------------------------------
    const float2 inv_res = bm.inv_angular_resolution[part_idx];
 
    const float s_theta = float(0.5 * M_PI) * inv_res.x;
    const float s_phi   = float(1.0 * M_PI) * inv_res.y;
 
    const float cos_theta_0 = cosf(float(idx_theta_in)      * s_theta);
    const float cos_theta_1 = cosf(float(idx_theta_in + 1u) * s_theta);
 
    const float cos_theta = cos_theta_0 * (1.0f - xi1) + cos_theta_1 * xi1;
    result[0] = acosf(cos_theta);
    result[1] = (float(idx_phi) + xi0) * s_phi;
 
    if (flip)
        result[1] = float(2.0 * M_PI) - result[1];  // phi \in [0, 2pi]
 
    // align phi
    result[1] += (theta_phi_out[1] > 0) ? theta_phi_out[1] : (float(2.0 * M_PI) + theta_phi_out[1]);
    if (result[1] > float(2.0 * M_PI)) result[1] -= float(2.0 * M_PI);
    if (result[1] > float(1.0 * M_PI)) result[1] = float(-2.0 * M_PI) + result[1];  // to [-pi, pi]
 
    // compute pdf
    //-------------------------------------------
    result[2] = prob_theta * prob_phi * 0.5f
                / (s_phi * (cos_theta_0 - cos_theta_1));
}
 
// Implementation of df::bsdf_measurement_pdf() for an MBSDF.
extern "C" __device__ float df_bsdf_measurement_pdf(
    Texture_handler_base const  *self_base,
    unsigned                    bsdf_measurement_idx,
    float const                 theta_phi_in[2],
    float const                 theta_phi_out[2],
    Mbsdf_part                  part)
{
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
 
    if (bsdf_measurement_idx == 0 || bsdf_measurement_idx - 1 >= self->num_mbsdfs)
        return 0.0f;  // invalid MBSDF returns zero
 
    const Mbsdf& bm = self->mbsdfs[bsdf_measurement_idx - 1];
    unsigned part_idx = static_cast<unsigned>(part);
 
    // check for the part
    if (part_idx > 1 || bm.has_data[part_idx] == 0)
        return 0.0f;
 
    // CDF data and resolution
    const float* sample_data = bm.sample_data[part_idx];
    uint2 res = bm.angular_resolution[part_idx];
 
    // compute indices in the CDF data
    float3 uvw = bsdf_compute_uvw(theta_phi_in, theta_phi_out); // phi_delta, theta_out, theta_in
    unsigned idx_theta_in = unsigned(uvw.z * float(res.x));
    unsigned idx_theta_out  = unsigned(uvw.y * float(res.x));
    unsigned idx_phi   = unsigned(uvw.x * float(res.y));
    idx_theta_in  = min(idx_theta_in, res.x - 1);
    idx_theta_out = min(idx_theta_out, res.x - 1);
    idx_phi   = min(idx_phi, res.y - 1);
 
    // get probability to select theta_in
    const float* cdf_theta = sample_data + idx_theta_out * res.x;
    float prob_theta = cdf_theta[idx_theta_in];
    if (idx_theta_in > 0)
    {
        const float tmp = cdf_theta[idx_theta_in - 1];
        prob_theta -= tmp;
    }
 
    // get probability to select phi
    const float* cdf_phi = sample_data +
        (res.x * res.x) +                                // CDF theta block
        (idx_theta_out * res.x + idx_theta_in) * res.y;  // selected CDF phi
    float prob_phi = cdf_phi[idx_phi];
    if (idx_phi > 0)
    {
        const float tmp = cdf_phi[idx_phi - 1];
        prob_phi -= tmp;
    }
 
    // compute probability to select a position in the sphere patch
    float2 inv_res = bm.inv_angular_resolution[part_idx];
 
    const float s_theta = float(0.5 * M_PI) * inv_res.x;
    const float s_phi = float(1.0 * M_PI) * inv_res.y;
 
    const float cos_theta_0 = cosf(float(idx_theta_in)      * s_theta);
    const float cos_theta_1 = cosf(float(idx_theta_in + 1u) * s_theta);
 
    return prob_theta * prob_phi * 0.5f
        / (s_phi * (cos_theta_0 - cos_theta_1));
}
 
 
__device__ inline void df_bsdf_measurement_albedo(
    float                       result[2],          // output: max (in case of color) albedo
                                                    // for the selected direction ([0]) and
                                                    // global ([1])
    Texture_handler const       *self,
    unsigned                    bsdf_measurement_idx,
    float const                 theta_phi[2],
    Mbsdf_part                  part)
{
    const Mbsdf& bm = self->mbsdfs[bsdf_measurement_idx - 1];
    const unsigned part_idx = static_cast<unsigned>(part);
 
    // check for the part
    if (part_idx > 1 || bm.has_data[part_idx] == 0)
        return;
 
    const uint2 res = bm.angular_resolution[part_idx];
    if (res.x < 1)
        return;
 
    unsigned idx_theta = unsigned(theta_phi[0] * float(2.0 / M_PI) * float(res.x));
    idx_theta = min(idx_theta, res.x - 1u);
    result[0] = bm.albedo_data[part_idx][idx_theta];
    result[1] = bm.max_albedo[part_idx];
}
 
// Implementation of df::bsdf_measurement_albedos() for an MBSDF.
extern "C" __device__ void df_bsdf_measurement_albedos(
    float                       result[4],          // output: [0] albedo refl. for theta_phi
                                                    //         [1] max albedo refl. global
                                                    //         [2] albedo trans. for theta_phi
                                                    //         [3] max albedo trans. global
    Texture_handler_base const  *self_base,
    unsigned                    bsdf_measurement_idx,
    float const                 theta_phi[2])
{
    result[0] = 0.0f;
    result[1] = 0.0f;
    result[2] = 0.0f;
    result[3] = 0.0f;
 
    Texture_handler const *self = static_cast<Texture_handler const *>(self_base);
    if (bsdf_measurement_idx == 0 || bsdf_measurement_idx - 1 >= self->num_mbsdfs)
        return;  // invalid MBSDF returns zero
 
    df_bsdf_measurement_albedo(
        &result[0],
        self,
        bsdf_measurement_idx,
        theta_phi,
        mi::neuraylib::MBSDF_DATA_REFLECTION);
 
    df_bsdf_measurement_albedo(
        &result[2],
        self,
        bsdf_measurement_idx,
        theta_phi,
        mi::neuraylib::MBSDF_DATA_TRANSMISSION);
}
 
 
// ------------------------------------------------------------------------------------------------
// Normal adaption (dummy functions)
//
// Can be enabled via backend option "use_renderer_adapt_normal".
// ------------------------------------------------------------------------------------------------
 
#ifndef TEX_SUPPORT_NO_DUMMY_ADAPTNORMAL
 
// Implementation of adapt_normal().
extern "C" __device__ void adapt_normal(
    float                                  result[3],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    float const                            normal[3])
{
    // just return original normal
    result[0] = normal[0];
    result[1] = normal[1];
    result[2] = normal[2];
}
 
#endif  // TEX_SUPPORT_NO_DUMMY_ADAPTNORMAL
 
 
// ------------------------------------------------------------------------------------------------
// Scene data (dummy functions)
// ------------------------------------------------------------------------------------------------
 
#ifndef TEX_SUPPORT_NO_DUMMY_SCENEDATA
 
// Implementation of scene_data_isvalid().
extern "C" __device__ bool scene_data_isvalid(
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id)
{
    return false;
}
 
// Implementation of scene_data_lookup_float4().
extern "C" __device__ void scene_data_lookup_float4(
    float                                  result[4],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    float const                            default_value[4],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
    result[2] = default_value[2];
    result[3] = default_value[3];
}
 
// Implementation of scene_data_lookup_float3().
extern "C" __device__ void scene_data_lookup_float3(
    float                                  result[3],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    float const                            default_value[3],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
    result[2] = default_value[2];
}
 
// Implementation of scene_data_lookup_color().
extern "C" __device__ void scene_data_lookup_color(
    float                                  result[3],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    float const                            default_value[3],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
    result[2] = default_value[2];
}
 
// Implementation of scene_data_lookup_float2().
extern "C" __device__ void scene_data_lookup_float2(
    float                                  result[2],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    float const                            default_value[2],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
}
 
// Implementation of scene_data_lookup_float().
extern "C" __device__ float scene_data_lookup_float(
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    float const                            default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    return default_value;
}
 
// Implementation of scene_data_lookup_int4().
extern "C" __device__ void scene_data_lookup_int4(
    int                                    result[4],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    int const                              default_value[4],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
    result[2] = default_value[2];
    result[3] = default_value[3];
}
 
// Implementation of scene_data_lookup_int3().
extern "C" __device__ void scene_data_lookup_int3(
    int                                    result[3],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    int const                              default_value[3],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
    result[2] = default_value[2];
}
 
// Implementation of scene_data_lookup_int2().
extern "C" __device__ void scene_data_lookup_int2(
    int                                    result[2],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    int const                              default_value[2],
    bool                                   uniform_lookup)
{
    // just return default value
    result[0] = default_value[0];
    result[1] = default_value[1];
}
 
// Implementation of scene_data_lookup_int().
extern "C" __device__ int scene_data_lookup_int(
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    int                                    default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    return default_value;
}
 
// Implementation of scene_data_lookup_float4x4().
extern "C" __device__ void scene_data_lookup_float4x4(
    float                                  result[16],
    Texture_handler_base const            *self_base,
    Shading_state_material                *state,
    unsigned                               scene_data_id,
    float const                            default_value[16],
    bool                                   uniform_lookup)
{
    // just return default value
    for (int i = 0; i < 16; ++i)
        result[i] = default_value[i];
}
 
// Implementation of scene_data_lookup_float4() with derivatives.
extern "C" __device__ void scene_data_lookup_deriv_float4(
    tct_deriv_arr_float_4                 *result,
    Texture_handler_base const            *self_base,
    Shading_state_material_with_derivs    *state,
    unsigned                               scene_data_id,
    tct_deriv_arr_float_4 const           *default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    *result = *default_value;
}
 
// Implementation of scene_data_lookup_float3() with derivatives.
extern "C" __device__ void scene_data_lookup_deriv_float3(
    tct_deriv_arr_float_3                 *result,
    Texture_handler_base const            *self_base,
    Shading_state_material_with_derivs    *state,
    unsigned                               scene_data_id,
    tct_deriv_arr_float_3 const           *default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    *result = *default_value;
}
 
// Implementation of scene_data_lookup_color() with derivatives.
extern "C" __device__ void scene_data_lookup_deriv_color(
    tct_deriv_arr_float_3                 *result,
    Texture_handler_base const            *self_base,
    Shading_state_material_with_derivs    *state,
    unsigned                               scene_data_id,
    tct_deriv_arr_float_3 const           *default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    *result = *default_value;
}
 
// Implementation of scene_data_lookup_float2() with derivatives.
extern "C" __device__ void scene_data_lookup_deriv_float2(
    tct_deriv_arr_float_2                 *result,
    Texture_handler_base const            *self_base,
    Shading_state_material_with_derivs    *state,
    unsigned                               scene_data_id,
    tct_deriv_arr_float_2 const           *default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    *result = *default_value;
}
 
// Implementation of scene_data_lookup_float() with derivatives.
extern "C" __device__ void scene_data_lookup_deriv_float(
    tct_deriv_float                       *result,
    Texture_handler_base const            *self_base,
    Shading_state_material_with_derivs    *state,
    unsigned                               scene_data_id,
    tct_deriv_float const                 *default_value,
    bool                                   uniform_lookup)
{
    // just return default value
    *result = *default_value;
}
 
#endif  // TEX_SUPPORT_NO_DUMMY_SCENEDATA
 
 
// ------------------------------------------------------------------------------------------------
// Vtables
// ------------------------------------------------------------------------------------------------
 
#ifndef TEX_SUPPORT_NO_VTABLES
// The vtable containing all texture access handlers required by the generated code
// in "vtable" mode.
__device__ mi::neuraylib::Texture_handler_vtable tex_vtable = {
    tex_lookup_float4_2d,
    tex_lookup_float3_2d,
    tex_texel_float4_2d,
    tex_lookup_float4_3d,
    tex_lookup_float3_3d,
    tex_texel_float4_3d,
    tex_lookup_float4_cube,
    tex_lookup_float3_cube,
    tex_resolution_2d,
    tex_resolution_3d,
    tex_texture_isvalid,
    tex_frame,
    df_light_profile_power,
    df_light_profile_maximum,
    df_light_profile_isvalid,
    df_light_profile_evaluate,
    df_light_profile_sample,
    df_light_profile_pdf,
    df_bsdf_measurement_isvalid,
    df_bsdf_measurement_resolution,
    df_bsdf_measurement_evaluate,
    df_bsdf_measurement_sample,
    df_bsdf_measurement_pdf,
    df_bsdf_measurement_albedos,
    adapt_normal,
    scene_data_isvalid,
    scene_data_lookup_float,
    scene_data_lookup_float2,
    scene_data_lookup_float3,
    scene_data_lookup_float4,
    scene_data_lookup_int,
    scene_data_lookup_int2,
    scene_data_lookup_int3,
    scene_data_lookup_int4,
    scene_data_lookup_color,
    scene_data_lookup_float4x4,
};
 
// The vtable containing all texture access handlers required by the generated code
// in "vtable" mode with derivatives.
__device__ mi::neuraylib::Texture_handler_deriv_vtable tex_deriv_vtable = {
    tex_lookup_deriv_float4_2d,
    tex_lookup_deriv_float3_2d,
    tex_texel_float4_2d,
    tex_lookup_float4_3d,
    tex_lookup_float3_3d,
    tex_texel_float4_3d,
    tex_lookup_float4_cube,
    tex_lookup_float3_cube,
    tex_resolution_2d,
    tex_resolution_3d,
    tex_texture_isvalid,
    tex_frame,
    df_light_profile_power,
    df_light_profile_maximum,
    df_light_profile_isvalid,
    df_light_profile_evaluate,
    df_light_profile_sample,
    df_light_profile_pdf,
    df_bsdf_measurement_isvalid,
    df_bsdf_measurement_resolution,
    df_bsdf_measurement_evaluate,
    df_bsdf_measurement_sample,
    df_bsdf_measurement_pdf,
    df_bsdf_measurement_albedos,
    adapt_normal,
    scene_data_isvalid,
    scene_data_lookup_float,
    scene_data_lookup_float2,
    scene_data_lookup_float3,
    scene_data_lookup_float4,
    scene_data_lookup_int,
    scene_data_lookup_int2,
    scene_data_lookup_int3,
    scene_data_lookup_int4,
    scene_data_lookup_color,
    scene_data_lookup_float4x4,
    scene_data_lookup_deriv_float,
    scene_data_lookup_deriv_float2,
    scene_data_lookup_deriv_float3,
    scene_data_lookup_deriv_float4,
    scene_data_lookup_deriv_color,
};
#endif  // TEX_SUPPORT_NO_VTABLES
 
#endif  // __CUDACC__
 
#endif  // TEXTURE_SUPPORT_CUDA_H

Source Code Location: examples/mdl_sdk/shared/example_cuda_shared.h

/******************************************************************************
 * Copyright 2024 NVIDIA Corporation. All rights reserved.
 *****************************************************************************/
 
// Code shared by CUDA MDL SDK examples
 
#ifndef EXAMPLE_CUDA_SHARED_H
#define EXAMPLE_CUDA_SHARED_H
 
#include <string>
#include <vector>
#include <sstream>
#include <iostream>
 
#define _USE_MATH_DEFINES
#include <math.h>
 
#include "example_shared.h"
#include "compiled_material_traverser_base.h"
 
#include <cuda.h>
#ifdef OPENGL_INTEROP
#include <GL/glew.h>
#include <GLFW/glfw3.h>
#include <cudaGL.h>
#endif
#include <cuda_runtime.h>
#include <vector_functions.h>
 
#include "utils/profiling.h"
 
// Structure representing an MDL texture, containing filtered and unfiltered CUDA texture
// objects and the size of the texture.
struct Texture
{
    explicit Texture(cudaTextureObject_t  filtered_object,
            cudaTextureObject_t  unfiltered_object,
            uint3                size)
        : filtered_object(filtered_object)
        , unfiltered_object(unfiltered_object)
        , size(size)
        , inv_size(make_float3(1.0f / size.x, 1.0f / size.y, 1.0f / size.z))
    {}
 
    cudaTextureObject_t  filtered_object;    // uses filter mode cudaFilterModeLinear
    cudaTextureObject_t  unfiltered_object;  // uses filter mode cudaFilterModePoint
    uint3                size;               // size of the texture, needed for texel access
    float3               inv_size;           // the inverse values of the size of the texture
};
 
// Structure representing an MDL bsdf measurement.
struct Mbsdf
{
    explicit Mbsdf()
    {
        for (unsigned i = 0; i < 2; ++i) {
            has_data[i] = 0u;
            eval_data[i] = 0;
            sample_data[i] = 0;
            albedo_data[i] = 0;
            this->max_albedo[i] = 0.0f;
            angular_resolution[i] = make_uint2(0u, 0u);
            inv_angular_resolution[i] = make_float2(0.0f, 0.0f);
            num_channels[i] = 0;
        }
    }
 
    void Add(mi::neuraylib::Mbsdf_part part,
             const uint2& angular_resolution,
             unsigned num_channels)
    {
        unsigned part_idx = static_cast<unsigned>(part);
 
        this->has_data[part_idx] = 1u;
        this->angular_resolution[part_idx] = angular_resolution;
        this->inv_angular_resolution[part_idx] = make_float2(1.0f / float(angular_resolution.x),
                                                             1.0f / float(angular_resolution.y));
        this->num_channels[part_idx] = num_channels;
    }
 
    unsigned            has_data[2];            // true if there is a measurement for this part
    cudaTextureObject_t eval_data[2];           // uses filter mode cudaFilterModeLinear
    float               max_albedo[2];          // max albedo used to limit the multiplier
    float*              sample_data[2];         // CDFs for sampling a BSDF measurement
    float*              albedo_data[2];         // max albedo for each theta (isotropic)
 
    uint2           angular_resolution[2];      // size of the dataset, needed for texel access
    float2          inv_angular_resolution[2];  // the inverse values of the size of the dataset
    unsigned        num_channels[2];            // number of color channels (1 or 3)
};
 
 
// Structure representing a Light Profile
struct Lightprofile
{
    explicit Lightprofile(
        uint2               angular_resolution = make_uint2(0, 0),
        float2              theta_phi_start = make_float2(0.0f, 0.0f),
        float2              theta_phi_delta = make_float2(0.0f, 0.0f),
        float               candela_multiplier = 0.0f,
        float               total_power = 0.0f,
        cudaTextureObject_t eval_data = 0,
        float               *cdf_data = nullptr)
    : angular_resolution(angular_resolution)
    , inv_angular_resolution(make_float2(
        1.0f / float(angular_resolution.x),
        1.0f / float(angular_resolution.y)))
    , theta_phi_start(theta_phi_start)
    , theta_phi_delta(theta_phi_delta)
        , theta_phi_inv_delta(make_float2(0.0f, 0.0f))
    , candela_multiplier(candela_multiplier)
    , total_power(total_power)
    , eval_data(eval_data)
    , cdf_data(cdf_data)
    {
        theta_phi_inv_delta.x = theta_phi_delta.x ? (1.f / theta_phi_delta.x) : 0.f;
        theta_phi_inv_delta.y = theta_phi_delta.y ? (1.f / theta_phi_delta.y) : 0.f;
    }
 
    uint2           angular_resolution;     // angular resolution of the grid
    float2          inv_angular_resolution; // inverse angular resolution of the grid
    float2          theta_phi_start;        // start of the grid
    float2          theta_phi_delta;        // angular step size
    float2          theta_phi_inv_delta;    // inverse step size
    float           candela_multiplier;     // factor to rescale the normalized data
    float           total_power;
 
    cudaTextureObject_t eval_data;          // normalized data sampled on grid
    float*              cdf_data;           // CDFs for sampling a light profile
};
 
// Structure representing the resources used by the generated code of a target code.
struct Target_code_data
{
    Target_code_data(
        size_t num_textures,
        CUdeviceptr textures,
        size_t num_mbsdfs,
        CUdeviceptr mbsdfs,
        size_t num_lightprofiles,
        CUdeviceptr lightprofiles,
                     CUdeviceptr ro_data_segment)
        : num_textures(num_textures)
        , textures(textures)
        , num_mbsdfs(num_mbsdfs)
        , mbsdfs(mbsdfs)
        , num_lightprofiles(num_lightprofiles)
        , lightprofiles(lightprofiles)
        , ro_data_segment(ro_data_segment)
    {}
 
    size_t      num_textures;           // number of elements in the textures field
    CUdeviceptr textures;               // a device pointer to a list of Texture objects, if used
 
    size_t      num_mbsdfs;             // number of elements in the mbsdfs field
    CUdeviceptr mbsdfs;                 // a device pointer to a list of mbsdfs objects, if used
 
    size_t      num_lightprofiles;     // number of elements in the lightprofiles field
    CUdeviceptr lightprofiles;         // a device pointer to a list of mbsdfs objects, if used
 
    CUdeviceptr ro_data_segment;        // a device pointer to the read-only data segment, if used
};
 
 
//------------------------------------------------------------------------------
//
// Helper functions
//
//------------------------------------------------------------------------------
 
// Return a textual representation of the given value.
template <typename T>
std::string to_string(T val)
{
    std::ostringstream stream;
    stream << val;
    return stream.str();
}
 
// Collects the handles in a compiled material
class Handle_collector : public Compiled_material_traverser_base
{
public:
    // add all handle appearing in the provided material to the collectors handle list.
    explicit Handle_collector(
        mi::neuraylib::ITransaction* transaction,
        const mi::neuraylib::ICompiled_material* material)
    : Compiled_material_traverser_base()
    {
        traverse(material, transaction);
    }
 
    // get the collected handles.
    const std::vector<std::string>& get_handles() const { return m_handles; }
 
private:
    // Called when the traversal reaches a new element.
    void visit_begin(const mi::neuraylib::ICompiled_material* material,
                     const Compiled_material_traverser_base::Traversal_element& element,
                     void* context) override
    {
        // look for direct calls
        if (!element.expression ||
            element.expression->get_kind() != mi::neuraylib::IExpression::EK_DIRECT_CALL)
            return;
 
        // check if it is a distribution function
        auto transaction = static_cast<mi::neuraylib::ITransaction*>(context);
 
        const mi::base::Handle<const mi::neuraylib::IExpression_direct_call> expr_dcall(
            element.expression->get_interface<const mi::neuraylib::IExpression_direct_call
            >());
        const mi::base::Handle<const mi::neuraylib::IExpression_list> args(
            expr_dcall->get_arguments());
        const mi::base::Handle<const mi::neuraylib::IFunction_definition> func_def(
            transaction->access<mi::neuraylib::IFunction_definition>(
            expr_dcall->get_definition()));
        const mi::neuraylib::IFunction_definition::Semantics semantic = func_def->
            get_semantic();
 
        if (semantic < mi::neuraylib::IFunction_definition::DS_INTRINSIC_DF_FIRST
            || semantic > mi::neuraylib::IFunction_definition::DS_INTRINSIC_DF_LAST)
            return;
 
        // check if the last argument is a handle
        const mi::base::Handle<const mi::neuraylib::IExpression_list> arguments(
            expr_dcall->get_arguments());
 
        mi::Size arg_count = arguments->get_size();
        const char* name = arguments->get_name(arg_count - 1);
        if (strcmp(name, "handle") != 0)
            return;
 
        // get the handle value
        mi::base::Handle<const mi::neuraylib::IExpression> expr(
            arguments->get_expression(arg_count - 1));
 
        if (expr->get_kind() != mi::neuraylib::IExpression::EK_CONSTANT)
            return; // is an error if 'handle' is a reserved parameter name
 
        const mi::base::Handle<const mi::neuraylib::IExpression_constant> expr_const(
            expr->get_interface<const mi::neuraylib::IExpression_constant>());
 
        const mi::base::Handle<const mi::neuraylib::IValue> value(expr_const->get_value());
        if (value->get_kind() != mi::neuraylib::IValue::VK_STRING)
            return;
 
        const mi::base::Handle<const mi::neuraylib::IValue_string> handle(
            value->get_interface<const mi::neuraylib::IValue_string>());
 
        std::string handle_value = handle->get_value() ? std::string(handle->get_value()) : "";
 
        if (std::find(m_handles.begin(), m_handles.end(), handle_value) == m_handles.end())
            m_handles.push_back(handle_value);
    }
 
    std::vector<std::string> m_handles;
};
 
 
//------------------------------------------------------------------------------
//
// CUDA helper functions
//
//------------------------------------------------------------------------------
 
// Helper macro. Checks whether the expression is cudaSuccess and if not prints a message and
// resets the device and exits.
 
#ifdef ENABLE_DEPRECATED_UTILIY_FUNCTIONS
 
#define check_cuda_success(expr) \
    do { \
        int err = (expr); \
        if (err != 0) { \
            fprintf(stderr, "CUDA error %d in file %s, line %u: \"%s\".\n", \
                err, __FILE__, __LINE__, #expr); \
            keep_console_open(); \
            cudaDeviceReset(); \
            exit(EXIT_FAILURE); \
        } \
    } while (false)
 
#else
 
#define check_cuda_success(expr) \
    do { \
        int err = (expr); \
        if (err != 0) { \
            cudaDeviceReset(); \
            exit_failure( "Error in file %s, line %u: \"%s\".\n", __FILE__, __LINE__, #expr); \
        } \
    } while (false)
 
#endif
 
// Initialize CUDA.
CUcontext init_cuda(
    int ordinal
#ifdef OPENGL_INTEROP
    , const bool opengl_interop
#endif
    )
{
    CUdevice cu_device;
    CUcontext cu_context;
 
    check_cuda_success(cuInit(0));
#if defined(OPENGL_INTEROP) && !defined(__APPLE__)
    if (opengl_interop) {
        // Use first device used by OpenGL context
        unsigned int num_cu_devices;
        check_cuda_success(cuGLGetDevices(&num_cu_devices, &cu_device, 1, CU_GL_DEVICE_LIST_ALL));
    }
    else
#endif
    {
        // Use given device
        check_cuda_success(cuDeviceGet(&cu_device, ordinal));
    }
 
    check_cuda_success(cuCtxCreate(&cu_context, 0, cu_device));
 
    // For this example, increase printf CUDA buffer size to support a larger number
    // of MDL debug::print() calls per CUDA kernel launch
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
 
    return cu_context;
}
 
// Uninitialize CUDA.
void uninit_cuda(CUcontext cuda_context)
{
    check_cuda_success(cuCtxDestroy(cuda_context));
}
 
template<typename T> struct Resource_deleter {
    /*compile error*/
};
 
template<> struct Resource_deleter<cudaArray_t> {
    void operator()(cudaArray_t res) { check_cuda_success(cudaFreeArray(res)); }
};
 
template<> struct Resource_deleter<cudaMipmappedArray_t> {
    void operator()(cudaMipmappedArray_t res) { check_cuda_success(cudaFreeMipmappedArray(res)); }
};
 
template<> struct Resource_deleter<Texture> {
    void operator()(Texture &res) {
        check_cuda_success(cudaDestroyTextureObject(res.filtered_object));
        check_cuda_success(cudaDestroyTextureObject(res.unfiltered_object));
    }
};
 
template<> struct Resource_deleter<Mbsdf> {
    void operator()(Mbsdf &res) {
        for (size_t i = 0; i < 2; ++i) {
            if (res.has_data[i] != 0u) {
                check_cuda_success(cudaDestroyTextureObject(res.eval_data[i]));
                check_cuda_success(cuMemFree(reinterpret_cast<CUdeviceptr>(res.sample_data[i])));
                check_cuda_success(cuMemFree(reinterpret_cast<CUdeviceptr>(res.albedo_data[i])));
            }
        }
    }
};
 
template<> struct Resource_deleter<Lightprofile> {
    void operator()(Lightprofile res) {
        if (res.cdf_data)
            check_cuda_success(cuMemFree((CUdeviceptr)res.cdf_data));
    }
};
 
template<> struct Resource_deleter<Target_code_data> {
    void operator()(Target_code_data &res) {
        if (res.textures)
            check_cuda_success(cuMemFree(res.textures));
        if (res.ro_data_segment)
            check_cuda_success(cuMemFree(res.ro_data_segment));
    }
};
 
template<> struct Resource_deleter<CUdeviceptr> {
    void operator()(CUdeviceptr res) {
        if (res != 0)
            check_cuda_success(cuMemFree(res));
    }
};
 
template<typename T, typename D = Resource_deleter<T> >
struct Resource_handle {
    Resource_handle(T res) : m_res(res) {}
 
    ~Resource_handle() {
        D deleter;
        deleter(m_res);
    }
 
    T &get() { return m_res; }
 
    T const &get() const { return m_res; }
 
    void set(T res) { m_res = res; }
 
private:
    // No copy possible.
    Resource_handle(Resource_handle const &);
    Resource_handle &operator=(Resource_handle const &);
 
private:
    T m_res;
};
 
template<typename T, typename C = std::vector<T>, typename D = Resource_deleter<T> >
struct Resource_container {
    Resource_container() : m_cont() {}
 
    ~Resource_container() {
        D deleter;
        typedef typename C::iterator I;
        for (I it(m_cont.begin()), end(m_cont.end()); it != end; ++it) {
            T &r = *it;
            deleter(r);
        }
    }
 
    C &operator*() { return m_cont; }
 
    C const &operator*() const { return m_cont; }
 
    C *operator->() { return &m_cont; }
 
    C const *operator->() const { return &m_cont; }
 
private:
    // No copy possible.
    Resource_container(Resource_container const &);
    Resource_container &operator=(Resource_container const &);
 
private:
    C m_cont;
};
 
// Allocate memory on GPU and copy the given data to the allocated memory.
CUdeviceptr gpu_mem_dup(void const *data, size_t size)
{
    CUdeviceptr device_ptr;
    check_cuda_success(cuMemAlloc(&device_ptr, size));
    check_cuda_success(cuMemcpyHtoD(device_ptr, data, size));
    return device_ptr;
}
 
// Allocate memory on GPU and copy the given data to the allocated memory.
template <typename T>
CUdeviceptr gpu_mem_dup(Resource_handle<T> const *data, size_t size)
{
    return gpu_mem_dup((void *)data->get(), size);
}
 
// Allocate memory on GPU and copy the given data to the allocated memory.
template<typename T>
CUdeviceptr gpu_mem_dup(std::vector<T> const &data)
{
    return gpu_mem_dup(&data[0], data.size() * sizeof(T));
}
 
// Allocate memory on GPU and copy the given data to the allocated memory.
template<typename T, typename C>
CUdeviceptr gpu_mem_dup(Resource_container<T,C> const &cont)
{
    return gpu_mem_dup(*cont);
}
 
 
//------------------------------------------------------------------------------
//
// Material_gpu_context class
//
//------------------------------------------------------------------------------
 
// Helper class responsible for making textures and read-only data available to the GPU
// by generating and managing a list of Target_code_data objects.
class Material_gpu_context
{
public:
    Material_gpu_context(bool enable_derivatives)
        : m_enable_derivatives(enable_derivatives)
        , m_device_target_code_data_list(0)
        , m_device_target_argument_block_list(0)
    {
        // Use first entry as "not-used" block
        m_target_argument_block_list->push_back(0);
    }
 
    // Prepare the needed data of the given target code.
    bool prepare_target_code_data(
        mi::neuraylib::ITransaction          *transaction,
        mi::neuraylib::IImage_api            *image_api,
        mi::neuraylib::ITarget_code const    *target_code,
        std::vector<size_t> const            &arg_block_indices);
 
    // Get a device pointer to the target code data list.
    CUdeviceptr get_device_target_code_data_list();
 
    // Get a device pointer to the target argument block list.
    CUdeviceptr get_device_target_argument_block_list();
 
    // Get a device pointer to the i'th target argument block.
    CUdeviceptr get_device_target_argument_block(size_t i)
    {
        // First entry is the "not-used" block, so start at index 1.
        if (i + 1 >= m_target_argument_block_list->size())
            return 0;
        return (*m_target_argument_block_list)[i + 1];
    }
 
    // Get the number of target argument blocks.
    size_t get_argument_block_count() const
    {
        return m_own_arg_blocks.size();
    }
 
    // Get the argument block of the i'th BSDF.
    // If the BSDF has no target argument block, size_t(~0) is returned.
    size_t get_bsdf_argument_block_index(size_t i) const
    {
        if (i >= m_bsdf_arg_block_indices.size()) return size_t(~0);
        return m_bsdf_arg_block_indices[i];
    }
 
    // Get a writable copy of the i'th target argument block.
    mi::base::Handle<mi::neuraylib::ITarget_argument_block> get_argument_block(size_t i)
    {
        if (i >= m_own_arg_blocks.size())
            return mi::base::Handle<mi::neuraylib::ITarget_argument_block>();
        return m_own_arg_blocks[i];
    }
 
    // Get the layout of the i'th target argument block.
    mi::base::Handle<mi::neuraylib::ITarget_value_layout const> get_argument_block_layout(size_t i)
    {
        if (i >= m_arg_block_layouts.size())
            return mi::base::Handle<mi::neuraylib::ITarget_value_layout const>();
        return m_arg_block_layouts[i];
    }
 
    // Update the i'th target argument block on the device with the data from the corresponding
    // block returned by get_argument_block().
    void update_device_argument_block(size_t i);
private:
    // Copy the image data of a canvas to a CUDA array.
    void copy_canvas_to_cuda_array(cudaArray_t device_array, mi::neuraylib::ICanvas const *canvas);
 
    // Prepare the texture identified by the texture_index for use by the texture access functions
    // on the GPU.
    bool prepare_texture(
        mi::neuraylib::ITransaction       *transaction,
        mi::neuraylib::IImage_api         *image_api,
        mi::neuraylib::ITarget_code const *code_ptx,
        mi::Size                           texture_index,
        std::vector<Texture>              &textures);
 
    // Prepare the mbsdf identified by the mbsdf_index for use by the bsdf measurement access
    // functions on the GPU.
    bool prepare_mbsdf(
        mi::neuraylib::ITransaction       *transaction,
        mi::neuraylib::ITarget_code const *code_ptx,
        mi::Size                           mbsdf_index,
        std::vector<Mbsdf>                &mbsdfs);
 
    // Prepare the mbsdf identified by the mbsdf_index for use by the bsdf measurement access
    // functions on the GPU.
    bool prepare_lightprofile(
        mi::neuraylib::ITransaction       *transaction,
        mi::neuraylib::ITarget_code const *code_ptx,
        mi::Size                           lightprofile_index,
        std::vector<Lightprofile>        &lightprofiles);
 
    // If true, mipmaps will be generated for all 2D textures.
    bool m_enable_derivatives;
 
    // The device pointer of the target code data list.
    Resource_handle<CUdeviceptr> m_device_target_code_data_list;
 
    // List of all target code data objects owned by this context.
    Resource_container<Target_code_data> m_target_code_data_list;
 
    // The device pointer of the target argument block list.
    Resource_handle<CUdeviceptr> m_device_target_argument_block_list;
 
    // List of all target argument blocks owned by this context.
    Resource_container<CUdeviceptr> m_target_argument_block_list;
 
    // List of all local, writable copies of the target argument blocks.
    std::vector<mi::base::Handle<mi::neuraylib::ITarget_argument_block> > m_own_arg_blocks;
 
    // List of argument block indices per material BSDF.
    std::vector<size_t> m_bsdf_arg_block_indices;
 
    // List of all target argument block layouts.
    std::vector<mi::base::Handle<mi::neuraylib::ITarget_value_layout const> > m_arg_block_layouts;
 
    // List of all Texture objects owned by this context.
    Resource_container<Texture> m_all_textures;
 
    // List of all MBSDFs objects owned by this context.
    Resource_container<Mbsdf> m_all_mbsdfs;
 
    // List of all Light profiles objects owned by this context.
    Resource_container<Lightprofile> m_all_lightprofiles;
 
    // List of all CUDA arrays owned by this context.
    Resource_container<cudaArray_t> m_all_texture_arrays;
 
    // List of all CUDA mipmapped arrays owned by this context.
    Resource_container<cudaMipmappedArray_t> m_all_texture_mipmapped_arrays;
};
 
// Get a device pointer to the target code data list.
CUdeviceptr Material_gpu_context::get_device_target_code_data_list()
{
    if (!m_device_target_code_data_list.get())
        m_device_target_code_data_list.set(gpu_mem_dup(m_target_code_data_list));
    return m_device_target_code_data_list.get();
}
 
// Get a device pointer to the target argument block list.
CUdeviceptr Material_gpu_context::get_device_target_argument_block_list()
{
    if (!m_device_target_argument_block_list.get())
        m_device_target_argument_block_list.set(gpu_mem_dup(m_target_argument_block_list));
    return m_device_target_argument_block_list.get();
}
 
// Copy the image data of a canvas to a CUDA array.
void Material_gpu_context::copy_canvas_to_cuda_array(
    cudaArray_t device_array,
    mi::neuraylib::ICanvas const *canvas)
{
    mi::base::Handle<const mi::neuraylib::ITile> tile(canvas->get_tile());
    mi::Float32 const *data = static_cast<mi::Float32 const *>(tile->get_data());
    check_cuda_success(cudaMemcpy2DToArray(
        device_array, 0, 0, data,
        canvas->get_resolution_x() * sizeof(float) * 4,
        canvas->get_resolution_x() * sizeof(float) * 4,
        canvas->get_resolution_y(),
        cudaMemcpyHostToDevice));
}
 
// Prepare the texture identified by the texture_index for use by the texture access functions
// on the GPU.
bool Material_gpu_context::prepare_texture(
    mi::neuraylib::ITransaction       *transaction,
    mi::neuraylib::IImage_api         *image_api,
    mi::neuraylib::ITarget_code const *code_ptx,
    mi::Size                           texture_index,
    std::vector<Texture>              &textures)
{
    // Get access to the texture data by the texture database name from the target code.
    mi::base::Handle<const mi::neuraylib::ITexture> texture(
        transaction->access<mi::neuraylib::ITexture>(code_ptx->get_texture(texture_index)));
    mi::base::Handle<const mi::neuraylib::IImage> image(
        transaction->access<mi::neuraylib::IImage>(texture->get_image()));
    mi::base::Handle<const mi::neuraylib::ICanvas> canvas(image->get_canvas(0, 0, 0));
    mi::Uint32 tex_width = canvas->get_resolution_x();
    mi::Uint32 tex_height = canvas->get_resolution_y();
    mi::Uint32 tex_layers = canvas->get_layers_size();
    char const *image_type = image->get_type(0, 0);
 
    if (image->is_uvtile() || image->is_animated()) {
        std::cerr << "The example does not support uvtile and/or animated textures!" << std::endl;
        return false;
    }
 
    // For simplicity, the texture access functions are only implemented for float4 and gamma
    // is pre-applied here (all images are converted to linear space).
 
    // Convert to linear color space if necessary
    if (texture->get_effective_gamma(0, 0) != 1.0f) {
        // Copy/convert to float4 canvas and adjust gamma from "effective gamma" to 1.
        mi::base::Handle<mi::neuraylib::ICanvas> gamma_canvas(
            image_api->convert(canvas.get(), "Color"));
        gamma_canvas->set_gamma(texture->get_effective_gamma(0, 0));
        image_api->adjust_gamma(gamma_canvas.get(), 1.0f);
        canvas = gamma_canvas;
    } else if (strcmp(image_type, "Color") != 0 && strcmp(image_type, "Float32<4>") != 0) {
        // Convert to expected format
        canvas = image_api->convert(canvas.get(), "Color");
    }
 
    cudaChannelFormatDesc channel_desc = cudaCreateChannelDesc<float4>();
    cudaResourceDesc res_desc;
    memset(&res_desc, 0, sizeof(res_desc));
 
    // Copy image data to GPU array depending on texture shape
    mi::neuraylib::ITarget_code::Texture_shape texture_shape =
        code_ptx->get_texture_shape(texture_index);
    if (texture_shape == mi::neuraylib::ITarget_code::Texture_shape_cube ||
        texture_shape == mi::neuraylib::ITarget_code::Texture_shape_3d ||
        texture_shape == mi::neuraylib::ITarget_code::Texture_shape_bsdf_data) {
        // Cubemap and 3D texture objects require 3D CUDA arrays
 
        if (texture_shape == mi::neuraylib::ITarget_code::Texture_shape_cube &&
            tex_layers != 6) {
            std::cerr << "Invalid number of layers (" << tex_layers
                << "), cubemaps must have 6 layers!" << std::endl;
            return false;
        }
 
        // Allocate a 3D array on the GPU
        cudaExtent extent = make_cudaExtent(tex_width, tex_height, tex_layers);
        cudaArray_t device_tex_array;
        check_cuda_success(cudaMalloc3DArray(
            &device_tex_array, &channel_desc, extent,
            texture_shape == mi::neuraylib::ITarget_code::Texture_shape_cube ?
            cudaArrayCubemap : 0));
 
        // Prepare the memcpy parameter structure
        cudaMemcpy3DParms copy_params;
        memset(&copy_params, 0, sizeof(copy_params));
        copy_params.dstArray = device_tex_array;
        copy_params.extent = make_cudaExtent(tex_width, tex_height, 1);
        copy_params.kind = cudaMemcpyHostToDevice;
 
        // Copy the image data of all layers (the layers are not consecutive in memory)
        for (mi::Uint32 layer = 0; layer < tex_layers; ++layer) {
            mi::base::Handle<const mi::neuraylib::ITile> tile(
                canvas->get_tile(layer));
            float const *data = static_cast<float const *>(tile->get_data());
 
            copy_params.srcPtr = make_cudaPitchedPtr(
                const_cast<float *>(data), tex_width * sizeof(float) * 4,
                tex_width, tex_height);
            copy_params.dstPos = make_cudaPos(0, 0, layer);
 
            check_cuda_success(cudaMemcpy3D(&copy_params));
        }
 
        res_desc.resType = cudaResourceTypeArray;
        res_desc.res.array.array = device_tex_array;
 
        m_all_texture_arrays->push_back(device_tex_array);
    } else if (m_enable_derivatives) {
        // mipmapped textures use CUDA mipmapped arrays
        mi::Uint32 num_levels = image->get_levels(0, 0);
        cudaExtent extent = make_cudaExtent(tex_width, tex_height, 0);
        cudaMipmappedArray_t device_tex_miparray;
        check_cuda_success(cudaMallocMipmappedArray(
            &device_tex_miparray, &channel_desc, extent, num_levels));
 
        // create all mipmap levels and copy them to the CUDA arrays in the mipmapped array
        mi::base::Handle<mi::IArray> mipmaps(image_api->create_mipmap(canvas.get(), 1.0f));
 
        for (mi::Uint32 level = 0; level < num_levels; ++level) {
            mi::base::Handle<mi::neuraylib::ICanvas const> level_canvas;
            if (level == 0)
                level_canvas = canvas;
            else {
                mi::base::Handle<mi::IPointer> mipmap_ptr(mipmaps->get_element<mi::IPointer>(level - 1));
                level_canvas = mipmap_ptr->get_pointer<mi::neuraylib::ICanvas>();
            }
            cudaArray_t device_level_array;
            cudaGetMipmappedArrayLevel(&device_level_array, device_tex_miparray, level);
            copy_canvas_to_cuda_array(device_level_array, level_canvas.get());
        }
 
        res_desc.resType = cudaResourceTypeMipmappedArray;
        res_desc.res.mipmap.mipmap = device_tex_miparray;
 
        m_all_texture_mipmapped_arrays->push_back(device_tex_miparray);
    } else {
        // 2D texture objects use CUDA arrays
        cudaArray_t device_tex_array;
        check_cuda_success(cudaMallocArray(
            &device_tex_array, &channel_desc, tex_width, tex_height));
 
        copy_canvas_to_cuda_array(device_tex_array, canvas.get());
 
        res_desc.resType = cudaResourceTypeArray;
        res_desc.res.array.array = device_tex_array;
 
        m_all_texture_arrays->push_back(device_tex_array);
    }
 
    // For cube maps we need clamped address mode to avoid artifacts in the corners
    cudaTextureAddressMode addr_mode =
        texture_shape == mi::neuraylib::ITarget_code::Texture_shape_cube
        ? cudaAddressModeClamp
        : cudaAddressModeWrap;
 
    // Create filtered texture object
    cudaTextureDesc tex_desc;
    memset(&tex_desc, 0, sizeof(tex_desc));
    tex_desc.addressMode[0]   = addr_mode;
    tex_desc.addressMode[1]   = addr_mode;
    tex_desc.addressMode[2]   = addr_mode;
    tex_desc.filterMode       = cudaFilterModeLinear;
    tex_desc.readMode         = cudaReadModeElementType;
    tex_desc.normalizedCoords = 1;
    if (res_desc.resType == cudaResourceTypeMipmappedArray) {
        tex_desc.mipmapFilterMode = cudaFilterModeLinear;
        tex_desc.maxAnisotropy = 16;
        tex_desc.minMipmapLevelClamp = 0.f;
        tex_desc.maxMipmapLevelClamp = 1000.f;  // default value in OpenGL
    }
 
    cudaTextureObject_t tex_obj = 0;
    check_cuda_success(cudaCreateTextureObject(&tex_obj, &res_desc, &tex_desc, nullptr));
 
    // Create unfiltered texture object if necessary (cube textures have no texel functions)
    cudaTextureObject_t tex_obj_unfilt = 0;
    if (texture_shape != mi::neuraylib::ITarget_code::Texture_shape_cube) {
        // Use a black border for access outside of the texture
        tex_desc.addressMode[0]   = cudaAddressModeBorder;
        tex_desc.addressMode[1]   = cudaAddressModeBorder;
        tex_desc.addressMode[2]   = cudaAddressModeBorder;
        tex_desc.filterMode       = cudaFilterModePoint;
 
        check_cuda_success(cudaCreateTextureObject(
            &tex_obj_unfilt, &res_desc, &tex_desc, nullptr));
    }
 
    // Store texture infos in result vector
    textures.push_back(Texture(
        tex_obj,
        tex_obj_unfilt,
        make_uint3(tex_width, tex_height, tex_layers)));
    m_all_textures->push_back(textures.back());
 
    return true;
}
 
namespace
{
    bool prepare_mbsdfs_part(mi::neuraylib::Mbsdf_part part, Mbsdf& mbsdf_cuda_representation,
                             const mi::neuraylib::IBsdf_measurement* bsdf_measurement)
    {
        mi::base::Handle<const mi::neuraylib::Bsdf_isotropic_data> dataset;
        switch (part)
        {
            case mi::neuraylib::MBSDF_DATA_REFLECTION:
                dataset = bsdf_measurement->get_reflection<mi::neuraylib::Bsdf_isotropic_data>();
                break;
            case mi::neuraylib::MBSDF_DATA_TRANSMISSION:
                dataset = bsdf_measurement->get_transmission<mi::neuraylib::Bsdf_isotropic_data>();
                break;
        }
 
        // no data, fine
        if (!dataset)
            return true;
 
        // get dimensions
        uint2 res;
        res.x = dataset->get_resolution_theta();
        res.y = dataset->get_resolution_phi();
        unsigned num_channels = dataset->get_type() == mi::neuraylib::BSDF_SCALAR ? 1 : 3;
        mbsdf_cuda_representation.Add(part, res, num_channels);
 
 
        // get data
        mi::base::Handle<const mi::neuraylib::IBsdf_buffer> buffer(dataset->get_bsdf_buffer());
        // {1,3} * (index_theta_in * (res_phi * res_theta) + index_theta_out * res_phi + index_phi)
 
        const mi::Float32* src_data = buffer->get_data();
 
        // ----------------------------------------------------------------------------------------
        // prepare importance sampling data:
        // - for theta_in we will be able to perform a two stage CDF, first to select theta_out,
        //   and second to select phi_out
        // - maximum component is used to "probability" in case of colored measurements
 
        // CDF of the probability to select a certain theta_out for a given theta_in
        const unsigned int cdf_theta_size = res.x * res.x;
 
        // for each of theta_in x theta_out combination, a CDF of the probabilities to select a
        // a certain theta_out is stored
        const unsigned sample_data_size = cdf_theta_size + cdf_theta_size * res.y;
        float* sample_data = new float[sample_data_size];
 
        float* albedo_data = new float[res.x]; // albedo for sampling reflection and transmission
 
        float* sample_data_theta = sample_data;                // begin of the first (theta) CDF
        float* sample_data_phi = sample_data + cdf_theta_size; // begin of the second (phi) CDFs
 
        const float s_theta = (float) (M_PI * 0.5) / float(res.x);  // step size
        const float s_phi = (float) (M_PI) / float(res.y);          // step size
 
        float max_albedo = 0.0f;
        for (unsigned int t_in = 0; t_in < res.x; ++t_in)
        {
            float sum_theta = 0.0f;
            float sintheta0_sqd = 0.0f;
            for (unsigned int t_out = 0; t_out < res.x; ++t_out)
            {
                const float sintheta1 = sinf(float(t_out + 1) * s_theta);
                const float sintheta1_sqd = sintheta1 * sintheta1;
 
                // BSDFs are symmetric: f(w_in, w_out) = f(w_out, w_in)
                // take the average of both measurements
 
                // area of two the surface elements (the ones we are averaging)
                const float mu = (sintheta1_sqd - sintheta0_sqd) * s_phi * 0.5f;
                sintheta0_sqd = sintheta1_sqd;
 
                // offset for both the thetas into the measurement data (select row in the volume)
                const unsigned int offset_phi  = (t_in * res.x + t_out) * res.y;
                const unsigned int offset_phi2 = (t_out * res.x + t_in) * res.y;
 
                // build CDF for phi
                float sum_phi = 0.0f;
                for (unsigned int p_out = 0; p_out < res.y; ++p_out)
                {
                    const unsigned int idx  = offset_phi  + p_out;
                    const unsigned int idx2 = offset_phi2 + p_out;
 
                    float value = 0.0f;
                    if (num_channels == 3)
                    {
                        value = fmax(fmaxf(src_data[3 * idx  + 0], src_data[3 * idx  + 1]),
                                     fmaxf(src_data[3 * idx  + 2], 0.0f))
                              + fmax(fmaxf(src_data[3 * idx2 + 0], src_data[3 * idx2 + 1]),
                                     fmaxf(src_data[3 * idx2 + 2], 0.0f));
                    }
                    else /* num_channels == 1 */
                    {
                        value = fmaxf(src_data[idx], 0.0f) + fmaxf(src_data[idx2], 0.0f);
                    }
 
                    sum_phi += value * mu;
                    sample_data_phi[idx] = sum_phi;
                }
 
                // normalize CDF for phi
                for (unsigned int p_out = 0; p_out < res.y; ++p_out)
                {
                    const unsigned int idx = offset_phi + p_out;
                    sample_data_phi[idx] = sample_data_phi[idx] / sum_phi;
                }
 
                // build CDF for theta
                sum_theta += sum_phi;
                sample_data_theta[t_in * res.x + t_out] = sum_theta;
            }
 
            if (sum_theta > max_albedo)
                max_albedo = sum_theta;
 
            albedo_data[t_in] = sum_theta;
 
            // normalize CDF for theta
            for (unsigned int t_out = 0; t_out < res.x; ++t_out)
            {
                const unsigned int idx = t_in * res.x + t_out;
                sample_data_theta[idx] = sample_data_theta[idx] / sum_theta;
            }
        }
 
        // copy entire CDF data buffer to GPU
        CUdeviceptr sample_obj = 0;
        check_cuda_success(cuMemAlloc(&sample_obj, sample_data_size * sizeof(float)));
        check_cuda_success(cuMemcpyHtoD(sample_obj, sample_data, sample_data_size * sizeof(float)));
        delete[] sample_data;
 
        CUdeviceptr albedo_obj = 0;
        check_cuda_success(cuMemAlloc(&albedo_obj, res.x * sizeof(float)));
        check_cuda_success(cuMemcpyHtoD(albedo_obj, albedo_data, res.x * sizeof(float)));
        delete[] albedo_data;
 
 
        mbsdf_cuda_representation.sample_data[part] = reinterpret_cast<float*>(sample_obj);
        mbsdf_cuda_representation.albedo_data[part] = reinterpret_cast<float*>(albedo_obj);
        mbsdf_cuda_representation.max_albedo[part] = max_albedo;
 
        // ----------------------------------------------------------------------------------------
        // prepare evaluation data:
        // - simply store the measured data in a volume texture
        // - in case of color data, we store each sample in a vector4 to get texture support
        unsigned lookup_channels = (num_channels == 3) ? 4 : 1;
 
        // make lookup data symmetric
        float* lookup_data = new float[lookup_channels * res.y * res.x * res.x];
        for (unsigned int t_in = 0; t_in < res.x; ++t_in)
        {
            for (unsigned int t_out = 0; t_out < res.x; ++t_out)
            {
                const unsigned int offset_phi = (t_in * res.x + t_out) * res.y;
                const unsigned int offset_phi2 = (t_out * res.x + t_in) * res.y;
                for (unsigned int p_out = 0; p_out < res.y; ++p_out)
                {
                    const unsigned int idx = offset_phi + p_out;
                    const unsigned int idx2 = offset_phi2 + p_out;
 
                    if (num_channels == 3)
                    {
                        lookup_data[4*idx+0] = (src_data[3*idx+0] + src_data[3*idx2+0]) * 0.5f;
                        lookup_data[4*idx+1] = (src_data[3*idx+1] + src_data[3*idx2+1]) * 0.5f;
                        lookup_data[4*idx+2] = (src_data[3*idx+2] + src_data[3*idx2+2]) * 0.5f;
                        lookup_data[4*idx+3] = 1.0f;
                    }
                    else
                    {
                        lookup_data[idx] = (src_data[idx] + src_data[idx2]) * 0.5f;
                    }
                }
            }
        }
 
        // Copy data to GPU array
        cudaArray_t device_mbsdf_data;
        cudaChannelFormatDesc channel_desc = (num_channels == 3
            ? cudaCreateChannelDesc<float4>() // float3 is not supported
            : cudaCreateChannelDesc<float>());
 
        // Allocate a 3D array on the GPU (phi_delta x theta_out x theta_in)
        cudaExtent extent = make_cudaExtent(res.y, res.x, res.x);
        check_cuda_success(cudaMalloc3DArray(&device_mbsdf_data, &channel_desc, extent, 0));
 
        // prepare and copy
        cudaMemcpy3DParms copy_params;
        memset(&copy_params, 0, sizeof(copy_params));
        copy_params.srcPtr = make_cudaPitchedPtr(
            (void*)(lookup_data),                                   // base pointer
            res.y * lookup_channels * sizeof(float),                // row pitch
            res.y,                                                  // width of slice
            res.x);                                                 // height of slice
        copy_params.dstArray = device_mbsdf_data;
        copy_params.extent = extent;
        copy_params.kind = cudaMemcpyHostToDevice;
        check_cuda_success(cudaMemcpy3D(&copy_params));
        delete[] lookup_data;
 
        cudaResourceDesc    texRes;
        memset(&texRes, 0, sizeof(cudaResourceDesc));
        texRes.resType = cudaResourceTypeArray;
        texRes.res.array.array = device_mbsdf_data;
 
        cudaTextureDesc     texDescr;
        memset(&texDescr, 0, sizeof(cudaTextureDesc));
        texDescr.normalizedCoords = 1;
        texDescr.filterMode = cudaFilterModeLinear;
        texDescr.addressMode[0] = cudaAddressModeClamp;
        texDescr.addressMode[1] = cudaAddressModeClamp;
        texDescr.addressMode[2] = cudaAddressModeClamp;
        texDescr.readMode = cudaReadModeElementType;
 
        cudaTextureObject_t eval_tex_obj;
        check_cuda_success(cudaCreateTextureObject(&eval_tex_obj, &texRes, &texDescr, nullptr));
        mbsdf_cuda_representation.eval_data[part] = eval_tex_obj;
 
        return true;
    }
}
 
bool Material_gpu_context::prepare_mbsdf(
    mi::neuraylib::ITransaction       *transaction,
    mi::neuraylib::ITarget_code const *code_ptx,
    mi::Size                           mbsdf_index,
    std::vector<Mbsdf>                &mbsdfs)
{
    // Get access to the texture data by the texture database name from the target code.
    mi::base::Handle<const mi::neuraylib::IBsdf_measurement> mbsdf(
        transaction->access<mi::neuraylib::IBsdf_measurement>(
        code_ptx->get_bsdf_measurement(mbsdf_index)));
 
    Mbsdf mbsdf_cuda;
 
    // handle reflection and transmission
    if (!prepare_mbsdfs_part(mi::neuraylib::MBSDF_DATA_REFLECTION, mbsdf_cuda, mbsdf.get()))
        return false;
    if (!prepare_mbsdfs_part(mi::neuraylib::MBSDF_DATA_TRANSMISSION, mbsdf_cuda, mbsdf.get()))
        return false;
 
    mbsdfs.push_back(mbsdf_cuda);
    m_all_mbsdfs->push_back(mbsdfs.back());
    return true;
}
 
bool Material_gpu_context::prepare_lightprofile(
    mi::neuraylib::ITransaction       *transaction,
    mi::neuraylib::ITarget_code const *code_ptx,
    mi::Size                           lightprofile_index,
    std::vector<Lightprofile>         &lightprofiles)
{
 
    // Get access to the texture data by the texture database name from the target code.
    mi::base::Handle<const mi::neuraylib::ILightprofile> lprof_nr(
        transaction->access<mi::neuraylib::ILightprofile>(
        code_ptx->get_light_profile(lightprofile_index)));
 
    uint2 res = make_uint2(lprof_nr->get_resolution_theta(), lprof_nr->get_resolution_phi());
    float2 start = make_float2(lprof_nr->get_theta(0), lprof_nr->get_phi(0));
    float2 delta = make_float2(lprof_nr->get_theta(1) - start.x, lprof_nr->get_phi(1) - start.y);
 
    // phi-mayor: [res.x x res.y]
    const float* data = lprof_nr->get_data();
 
    // --------------------------------------------------------------------------------------------
    // compute total power
    // compute inverse CDF data for sampling
    // sampling will work on cells rather than grid nodes (used for evaluation)
 
    // first (res.x-1) for the cdf for sampling theta
    // rest (rex.x-1) * (res.y-1) for the individual cdfs for sampling phi (after theta)
    size_t cdf_data_size = (res.x - 1) + (res.x - 1) * (res.y - 1);
    float* cdf_data = new float[cdf_data_size];
 
    float debug_total_erea = 0.0f;
    float sum_theta = 0.0f;
    float total_power = 0.0f;
    float cos_theta0 = cosf(start.x);
    for (unsigned int t = 0; t < res.x - 1; ++t)
    {
        const float cos_theta1 = cosf(start.x + float(t + 1) * delta.x);
 
        // area of the patch (grid cell)
        // \mu = int_{theta0}^{theta1} sin{theta} \delta theta
        const float mu = cos_theta0 - cos_theta1;
        cos_theta0 = cos_theta1;
 
        // build CDF for phi
        float* cdf_data_phi = cdf_data + (res.x - 1) + t * (res.y - 1);
        float sum_phi = 0.0f;
        for (unsigned int p = 0; p < res.y - 1; ++p)
        {
            // the probability to select a patch corresponds to the value times area
            // the value of a cell is the average of the corners
            // omit the *1/4 as we normalize in the end
            float value = data[p * res.x + t]
                + data[p * res.x + t + 1]
                + data[(p + 1) * res.x + t]
                + data[(p + 1) * res.x + t + 1];
 
            sum_phi += value * mu;
            cdf_data_phi[p] = sum_phi;
            debug_total_erea += mu;
        }
 
        // normalize CDF for phi
        for (unsigned int p = 0; p < res.y - 2; ++p)
            cdf_data_phi[p] = sum_phi ? (cdf_data_phi[p] / sum_phi) : 0.0f;
 
        cdf_data_phi[res.y - 2] = 1.0f;
 
        // build CDF for theta
        sum_theta += sum_phi;
        cdf_data[t] = sum_theta;
    }
    total_power = sum_theta * 0.25f * delta.y;
 
    // normalize CDF for theta
    for (unsigned int t = 0; t < res.x - 2; ++t)
        cdf_data[t] = sum_theta ? (cdf_data[t] / sum_theta) : cdf_data[t];
 
    cdf_data[res.x - 2] = 1.0f;
 
    // copy entire CDF data buffer to GPU
    CUdeviceptr cdf_data_obj = 0;
    check_cuda_success(cuMemAlloc(&cdf_data_obj, cdf_data_size * sizeof(float)));
    check_cuda_success(cuMemcpyHtoD(cdf_data_obj, cdf_data, cdf_data_size * sizeof(float)));
    delete[] cdf_data;
 
    // --------------------------------------------------------------------------------------------
    // prepare evaluation data
    //  - use a 2d texture that allows bilinear interpolation
    // Copy data to GPU array
    cudaArray_t device_lightprofile_data;
    cudaChannelFormatDesc channel_desc = cudaCreateChannelDesc<float>();
 
    // 2D texture objects use CUDA arrays
    check_cuda_success(cudaMallocArray(&device_lightprofile_data, &channel_desc, res.x, res.y));
    check_cuda_success(cudaMemcpy2DToArray(
        device_lightprofile_data, 0, 0, data,
        res.x * sizeof(float), res.x * sizeof(float), res.y, cudaMemcpyHostToDevice));
 
    // Create filtered texture object
    cudaResourceDesc res_desc;
    memset(&res_desc, 0, sizeof(res_desc));
    res_desc.resType = cudaResourceTypeArray;
    res_desc.res.array.array = device_lightprofile_data;
 
    cudaTextureDesc tex_desc;
    memset(&tex_desc, 0, sizeof(tex_desc));
    tex_desc.addressMode[0] = cudaAddressModeClamp;
    tex_desc.addressMode[1] = cudaAddressModeClamp;
    tex_desc.addressMode[2] = cudaAddressModeClamp;
    tex_desc.borderColor[0] = 1.0f;
    tex_desc.borderColor[1] = 1.0f;
    tex_desc.borderColor[2] = 1.0f;
    tex_desc.borderColor[3] = 1.0f;
    tex_desc.filterMode = cudaFilterModeLinear;
    tex_desc.readMode = cudaReadModeElementType;
    tex_desc.normalizedCoords = 1;
 
    cudaTextureObject_t tex_obj = 0;
    check_cuda_success(cudaCreateTextureObject(&tex_obj, &res_desc, &tex_desc, nullptr));
 
    double multiplier = lprof_nr->get_candela_multiplier();
    Lightprofile lprof(
        res,
        start,
        delta,
        float(multiplier),
        float(total_power * multiplier),
        tex_obj,
        reinterpret_cast<float*>(cdf_data_obj));
 
    lightprofiles.push_back(lprof);
    m_all_lightprofiles->push_back(lightprofiles.back());
    return true;
}
 
// Prepare the needed target code data of the given target code.
bool Material_gpu_context::prepare_target_code_data(
    mi::neuraylib::ITransaction          *transaction,
    mi::neuraylib::IImage_api            *image_api,
    mi::neuraylib::ITarget_code const    *target_code,
    std::vector<size_t> const            &arg_block_indices)
{
    // Target code data list may not have been retrieved already
    check_success(m_device_target_code_data_list.get() == 0);
 
    // Handle the read-only data segments if necessary.
    // They are only created, if the "enable_ro_segment" backend option was set to "on".
    CUdeviceptr device_ro_data = 0;
    if (target_code->get_ro_data_segment_count() > 0) {
        device_ro_data = gpu_mem_dup(
            target_code->get_ro_data_segment_data(0),
            target_code->get_ro_data_segment_size(0));
    }
 
    // Copy textures to GPU if the code has more than just the invalid texture
    CUdeviceptr device_textures = 0;
    mi::Size num_textures = target_code->get_texture_count();
    if (num_textures > 1) {
        std::vector<Texture> textures;
 
        // Loop over all textures skipping the first texture,
        // which is always the invalid texture
        for (mi::Size i = 1; i < num_textures; ++i) {
            if (!prepare_texture(
                    transaction, image_api, target_code, i, textures))
                return false;
        }
 
        // Copy texture list to GPU
        device_textures = gpu_mem_dup(textures);
    }
 
    // Copy MBSDFs to GPU if the code has more than just the invalid mbsdf
    CUdeviceptr device_mbsdfs = 0;
    mi::Size num_mbsdfs = target_code->get_bsdf_measurement_count();
    if (num_mbsdfs > 1) {
        std::vector<Mbsdf> mbsdfs;
 
        // Loop over all mbsdfs skipping the first mbsdf,
        // which is always the invalid mbsdf
        for (mi::Size i = 1; i < num_mbsdfs; ++i) {
            if (!prepare_mbsdf(
                transaction, target_code, i, mbsdfs))
                return false;
        }
 
        // Copy mbsdf list to GPU
        device_mbsdfs = gpu_mem_dup(mbsdfs);
    }
 
    // Copy light profiles to GPU if the code has more than just the invalid light profile
    CUdeviceptr device_lightprofiles = 0;
    mi::Size num_lightprofiles = target_code->get_light_profile_count();
    if (num_lightprofiles > 1) {
        std::vector<Lightprofile> lightprofiles;
 
        // Loop over all profiles skipping the first profile,
        // which is always the invalid profile
        for (mi::Size i = 1; i < num_lightprofiles; ++i) {
            if (!prepare_lightprofile(
                transaction, target_code, i, lightprofiles))
                return false;
        }
 
        // Copy light profile list to GPU
        device_lightprofiles = gpu_mem_dup(lightprofiles);
    }
 
    (*m_target_code_data_list).push_back(
        Target_code_data(num_textures, device_textures,
                         num_mbsdfs, device_mbsdfs,
                         num_lightprofiles, device_lightprofiles,
                         device_ro_data));
 
    for (mi::Size i = 0, num = target_code->get_argument_block_count(); i < num; ++i) {
        mi::base::Handle<mi::neuraylib::ITarget_argument_block const> arg_block(
            target_code->get_argument_block(i));
        CUdeviceptr dev_block = gpu_mem_dup(arg_block->get_data(), arg_block->get_size());
        m_target_argument_block_list->push_back(dev_block);
        m_own_arg_blocks.push_back(mi::base::make_handle(arg_block->clone()));
        m_arg_block_layouts.push_back(
            mi::base::make_handle(target_code->get_argument_block_layout(i)));
    }
 
    for (size_t arg_block_index : arg_block_indices) {
        m_bsdf_arg_block_indices.push_back(arg_block_index);
    }
 
    return true;
}
 
// Update the i'th target argument block on the device with the data from the corresponding
// block returned by get_argument_block().
void Material_gpu_context::update_device_argument_block(size_t i)
{
    CUdeviceptr device_ptr = get_device_target_argument_block(i);
    if (device_ptr == 0) return;
 
    mi::base::Handle<mi::neuraylib::ITarget_argument_block> arg_block(get_argument_block(i));
    check_cuda_success(cuMemcpyHtoD(
        device_ptr, arg_block->get_data(), arg_block->get_size()));
}
 
 
//------------------------------------------------------------------------------
//
// MDL material compilation code
//
//------------------------------------------------------------------------------
 
class Material_compiler {
public:
    // Constructor.
    Material_compiler(
        mi::neuraylib::IMdl_impexp_api* mdl_impexp_api,
        mi::neuraylib::IMdl_backend_api* mdl_backend_api,
        mi::neuraylib::IMdl_factory* mdl_factory,
        mi::neuraylib::ITransaction* transaction,
        unsigned num_texture_results,
#if !defined(MDL_SOURCE_RELEASE) && defined(MDL_ENABLE_INTERPRETER)
        bool use_df_interpreter,
#endif
        bool enable_derivatives,
        bool fold_ternary_on_df,
        bool enable_auxiliary,
        bool enable_pdf,
        bool use_adapt_normal,
        bool enable_bsdf_flags,
        bool enable_mdl_next,
        const std::string& df_handle_mode,
        const std::string& lambda_return_mode);
 
    // Loads an MDL module and returns the module DB.
    std::string load_module(const std::string& mdl_module_name);
 
    // Add a subexpression of a given material to the link unit.
    // path is the path of the sub-expression.
    // fname is the function name in the generated code.
    // If class_compilation is true, the material will use class compilation.
    bool add_material_subexpr(
        const std::string& qualified_module_name,
        const std::string& material_db_name,
        const char* path,
        const char* fname,
        bool class_compilation=false);
 
    // Add a distribution function of a given material to the link unit.
    // path is the path of the sub-expression.
    // fname is the function name in the generated code.
    // If class_compilation is true, the material will use class compilation.
    bool add_material_df(
        const std::string& qualified_module_name,
        const std::string& material_db_name,
        const char* path,
        const char* base_fname,
        bool class_compilation=false);
 
    // Add (multiple) MDL distribution function and expressions of a material to this link unit.
    // For each distribution function it results in four functions, suffixed with \c "_init",
    // \c "_sample", \c "_evaluate", and \c "_pdf". Functions can be selected by providing a
    // a list of \c Target_function_descriptions. Each of them needs to define the \c path, the root
    // of the expression that should be translated. After calling this function, each element of
    // the list will contain information for later usage in the application,
    // e.g., the \c argument_block_index and the \c function_index.
    bool add_material(
        const std::string& qualified_module_name,
        const std::string& material_db_name,
        mi::neuraylib::Target_function_description* function_descriptions,
        mi::Size description_count,
        bool class_compilation);
 
    // Generates CUDA PTX target code for the current link unit.
    mi::base::Handle<const mi::neuraylib::ITarget_code> generate_cuda_ptx();
 
    typedef std::vector<mi::base::Handle<mi::neuraylib::IFunction_definition const> >
        Material_definition_list;
 
    // Get the list of used material definitions.
    // There will be one entry per add_* call.
    Material_definition_list const &get_material_defs()
    {
        return m_material_defs;
    }
 
    typedef std::vector<mi::base::Handle<mi::neuraylib::ICompiled_material const> >
        Compiled_material_list;
 
    // Get the list of compiled materials.
    // There will be one entry per add_* call.
    Compiled_material_list const &get_compiled_materials()
    {
        return m_compiled_materials;
    }
 
    std::vector<size_t> const &get_argument_block_indices() const {
        return m_arg_block_indexes;
    }
 
    const std::vector<std::string>& get_handles() const {
        return m_handles;
    }
 
private:
    // Creates an instance of the given material.
    mi::neuraylib::IFunction_call* create_material_instance(
        const std::string& qualified_module_name,
        const std::string& material_db_name);
 
    // Compiles the given material instance in the given compilation modes.
    mi::neuraylib::ICompiled_material* compile_material_instance(
        mi::neuraylib::IFunction_call* material_instance,
        bool class_compilation);
 
private:
    mi::base::Handle<mi::neuraylib::IMdl_impexp_api> m_mdl_impexp_api;
    mi::base::Handle<mi::neuraylib::IMdl_backend>    m_be_cuda_ptx;
    mi::base::Handle<mi::neuraylib::IMdl_factory>    m_mdl_factory;
    mi::base::Handle<mi::neuraylib::ITransaction>    m_transaction;
 
    mi::base::Handle<mi::neuraylib::IMdl_execution_context> m_context;
    mi::base::Handle<mi::neuraylib::ILink_unit>             m_link_unit;
 
    Material_definition_list  m_material_defs;
    Compiled_material_list    m_compiled_materials;
    std::vector<size_t>       m_arg_block_indexes;
    std::vector<std::string>  m_handles;
};
 
// Constructor.
Material_compiler::Material_compiler(
        mi::neuraylib::IMdl_impexp_api* mdl_impexp_api,
        mi::neuraylib::IMdl_backend_api* mdl_backend_api,
        mi::neuraylib::IMdl_factory* mdl_factory,
        mi::neuraylib::ITransaction* transaction,
        unsigned num_texture_results,
#if !defined(MDL_SOURCE_RELEASE) && defined(MDL_ENABLE_INTERPRETER)
        bool use_df_interpreter,
#endif
        bool enable_derivatives,
        bool fold_ternary_on_df,
        bool enable_auxiliary,
        bool enable_pdf,
        bool use_adapt_normal,
        bool enable_bsdf_flags,
        bool enable_mdl_next,
        const std::string& df_handle_mode,
        const std::string& lambda_return_mode)
    : m_mdl_impexp_api(mdl_impexp_api, mi::base::DUP_INTERFACE)
    , m_be_cuda_ptx(mdl_backend_api->get_backend(mi::neuraylib::IMdl_backend_api::MB_CUDA_PTX))
    , m_mdl_factory(mdl_factory, mi::base::DUP_INTERFACE)
    , m_transaction(transaction, mi::base::DUP_INTERFACE)
    , m_context(mdl_factory->create_execution_context())
    , m_link_unit()
{
    check_success(m_be_cuda_ptx->set_option("num_texture_spaces", "1") == 0);
 
    // Option "enable_ro_segment": Default is disabled.
    // If you have a lot of big arrays, enabling this might speed up compilation.
    // check_success(m_be_cuda_ptx->set_option("enable_ro_segment", "on") == 0);
 
    if (enable_derivatives) {
        // Option "texture_runtime_with_derivs": Default is disabled.
        // We enable it to get coordinates with derivatives for texture lookup functions.
        check_success(m_be_cuda_ptx->set_option("texture_runtime_with_derivs", "on") == 0);
    }
 
    // Option "tex_lookup_call_mode": Default mode is vtable mode.
    // You can switch to the slower vtable mode by commenting out the next line.
    check_success(m_be_cuda_ptx->set_option("tex_lookup_call_mode", "direct_call") == 0);
 
    // Option "num_texture_results": Default is 0.
    // Set the size of a renderer provided array for texture results in the MDL SDK state in number
    // of float4 elements processed by the init() function.
    check_success(m_be_cuda_ptx->set_option(
        "num_texture_results",
        to_string(num_texture_results).c_str()) == 0);
 
    if (enable_auxiliary) {
        // Option "enable_auxiliary": Default is disabled.
        // We enable it to create an additional 'auxiliary' function that can be called on each
        // distribution function to fill an albedo and normal buffer e.g. for denoising.
        check_success(m_be_cuda_ptx->set_option("enable_auxiliary", "on") == 0);
    }
    if (!enable_pdf) {
        // Option "enable_pdf": Default is enabled.
        check_success(m_be_cuda_ptx->set_option("enable_pdf", "off") == 0);
    }
 
#if !defined(MDL_SOURCE_RELEASE) && defined(MDL_ENABLE_INTERPRETER)
    // Option "enable_df_interpreter": Default is disabled.
    // Using the interpreter allows to reuse the same code for multiple materials
    // reducing code divergence, if your scene shows many materials at the same time.
    if (use_df_interpreter) {
        check_success(m_be_cuda_ptx->set_option("enable_df_interpreter", "on") == 0);
    }
#endif
 
    // Option "df_handle_slot_mode": Default is "none".
    // When using light path expressions, individual parts of the distribution functions can be
    // selected using "handles". The contribution of each of those parts has to be evaluated during
    // rendering. This option controls how many parts are evaluated with each call into the
    // generated "evaluate" and "auxiliary" functions and how the data is passed.
    // The CUDA backend supports pointers, which means an externally managed buffer of arbitrary
    // size is used to transport the contributions of each part.
    check_success(m_be_cuda_ptx->set_option("df_handle_slot_mode", df_handle_mode.c_str()) == 0);
 
    // Option "lambda_return_mode": Default is "default".
    // Selects how generated lambda functions return their results. For PTX, the default
    // is equivalent to "sret" mode, where all values are returned in a buffer provided as
    // first argument. In "value" mode, base types and vector types are directly returned
    // by the functions, other types are returned as for the "sret" mode.
    check_success(m_be_cuda_ptx->set_option("lambda_return_mode", lambda_return_mode.c_str()) == 0);
 
    // Option "scene_data_names": Default is "".
    // Uncomment the line below to enable calling the scene data runtime functions
    // for any scene data names or specify a comma-separated list of names for which
    // you may provide scene data. The example runtime functions always return the
    // default values, which is the same as not supporting any scene data.
    //     m_be_cuda_ptx->set_option("scene_data_names", "*");
 
    if (use_adapt_normal) {
        // Option "use_renderer_adapt_normal": Default is "off".
        // If enabled, the renderer can adapt the normal of BSDFs before use.
        check_success(m_be_cuda_ptx->set_option("use_renderer_adapt_normal", "on") == 0);
    }
 
    if (enable_bsdf_flags) {
        // Option "libbsdf_flags_in_bsdf_data": Default is "off".
        // If enabled, the generated code will use the optional "flags" field
        // in the BSDF data structures.
        check_success(m_be_cuda_ptx->set_option("libbsdf_flags_in_bsdf_data", "on") == 0);
    }
 
    if (enable_mdl_next) {
        m_context->set_option("mdl_next", true);
    }
 
    // force experimental to true for now
    m_context->set_option("experimental", true);
 
    m_context->set_option("fold_ternary_on_df", fold_ternary_on_df);
 
    // After we set the options, we can create the link unit
    m_link_unit = mi::base::make_handle(m_be_cuda_ptx->create_link_unit(transaction, m_context.get()));
}
 
std::string Material_compiler::load_module(const std::string& mdl_module_name)
{
    // load module
    m_mdl_impexp_api->load_module(m_transaction.get(), mdl_module_name.c_str(), m_context.get());
    if (!print_messages(m_context.get()))
        exit_failure("Failed to load module: %s", mdl_module_name.c_str());
 
    // get and return the DB name
    mi::base::Handle<const mi::IString> db_module_name(
        m_mdl_factory->get_db_module_name(mdl_module_name.c_str()));
    return db_module_name->get_c_str();
}
 
// Creates an instance of the given material.
mi::neuraylib::IFunction_call* Material_compiler::create_material_instance(
    const std::string& qualified_module_name,
    const std::string& material_db_name)
{
    // Create a material instance from the material definition
    // with the default arguments.
    mi::base::Handle<const mi::neuraylib::IFunction_definition> material_definition(
        m_transaction->access<mi::neuraylib::IFunction_definition>(
            material_db_name.c_str()));
    if (!material_definition) {
        // material with given name does not exist
        print_message(
            mi::base::details::MESSAGE_SEVERITY_ERROR,
            mi::neuraylib::IMessage::MSG_COMPILER_DAG,
            (
                "Material '" +
                material_db_name +
                "' does not exist in '" +
                qualified_module_name + "'").c_str());
        return nullptr;
    }
 
    m_material_defs.push_back(material_definition);
 
    mi::Sint32 result;
    mi::base::Handle<mi::neuraylib::IFunction_call> material_instance(
        material_definition->create_function_call(0, &result));
    check_success(result == 0);
 
    material_instance->retain();
    return material_instance.get();
}
 
// Compiles the given material instance in the given compilation modes.
mi::neuraylib::ICompiled_material *Material_compiler::compile_material_instance(
    mi::neuraylib::IFunction_call* material_instance,
    bool class_compilation)
{
    // convert to target type SID_MATERIAL
    mi::base::Handle<mi::neuraylib::IType_factory> tf(
        m_mdl_factory->create_type_factory(m_transaction.get()));
    mi::base::Handle<const mi::neuraylib::IType> standard_material_type(
        tf->get_predefined_struct(mi::neuraylib::IType_struct::SID_MATERIAL));
    m_context->set_option("target_type", standard_material_type.get());
 
 
    mi::Uint32 flags = class_compilation
        ? mi::neuraylib::IMaterial_instance::CLASS_COMPILATION
        : mi::neuraylib::IMaterial_instance::DEFAULT_OPTIONS;
    mi::base::Handle<const mi::neuraylib::IMaterial_instance> material_instance2(
        material_instance->get_interface<const mi::neuraylib::IMaterial_instance>());
    mi::base::Handle<mi::neuraylib::ICompiled_material> compiled_material(
        material_instance2->create_compiled_material(flags, m_context.get()));
    check_success(print_messages(m_context.get()));
 
    m_compiled_materials.push_back(compiled_material);
 
    compiled_material->retain();
    return compiled_material.get();
}
 
// Generates CUDA PTX target code for the current link unit.
mi::base::Handle<const mi::neuraylib::ITarget_code> Material_compiler::generate_cuda_ptx()
{
    mi::base::Handle<const mi::neuraylib::ITarget_code> code_cuda_ptx(
        m_be_cuda_ptx->translate_link_unit(m_link_unit.get(), m_context.get()));
    check_success(print_messages(m_context.get()));
    check_success(code_cuda_ptx);
 
#ifdef DUMP_PTX
    FILE *file = fopen("target_code.ptx", "wt");
    if (file)
    {
        fwrite(code_cuda_ptx->get_code(), code_cuda_ptx->get_code_size(), 1, file);
        fclose(file);
    }
#endif
 
    return code_cuda_ptx;
}
 
// Add a subexpression of a given material to the link unit.
// path is the path of the sub-expression.
// fname is the function name in the generated code.
bool Material_compiler::add_material_subexpr(
    const std::string& qualified_module_name,
    const std::string& material_db_name,
    const char* path,
    const char* fname,
    bool class_compilation)
{
    mi::neuraylib::Target_function_description desc;
    desc.path = path;
    desc.base_fname = fname;
    add_material(qualified_module_name, material_db_name, &desc, 1, class_compilation);
    return desc.return_code == 0;
}
 
// Add a distribution function of a given material to the link unit.
// path is the path of the sub-expression.
// fname is the function name in the generated code.
bool Material_compiler::add_material_df(
    const std::string& qualified_module_name,
    const std::string& material_db_name,
    const char* path,
    const char* base_fname,
    bool class_compilation)
{
    mi::neuraylib::Target_function_description desc;
    desc.path = path;
    desc.base_fname = base_fname;
    add_material(qualified_module_name, material_db_name, &desc, 1, class_compilation);
    return desc.return_code == 0;
}
 
// Add (multiple) MDL distribution function and expressions of a material to this link unit.
// For each distribution function it results in four functions, suffixed with \c "_init",
// \c "_sample", \c "_evaluate", and \c "_pdf". Functions can be selected by providing a
// a list of \c Target_function_description. Each of them needs to define the \c path, the root
// of the expression that should be translated. After calling this function, each element of
// the list will contain information for later usage in the application,
// e.g., the \c argument_block_index and the \c function_index.
bool Material_compiler::add_material(
    const std::string& qualified_module_name,
    const std::string& material_db_name,
    mi::neuraylib::Target_function_description* function_descriptions,
    mi::Size description_count,
    bool class_compilation)
{
    if (description_count == 0)
        return false;
 
    // Load the given module and create a material instance
    mi::base::Handle<mi::neuraylib::IFunction_call> material_instance(
        create_material_instance(qualified_module_name, material_db_name));
    if (!material_instance)
        return false;
 
    // Compile the material instance in instance compilation mode
    mi::base::Handle<mi::neuraylib::ICompiled_material> compiled_material(
        compile_material_instance(material_instance.get(), class_compilation));
 
    m_link_unit->add_material(
        compiled_material.get(), function_descriptions, description_count,
        m_context.get());
 
    // Note: the same argument_block_index is filled into all function descriptions of a
    //       material, if any function uses it
    m_arg_block_indexes.push_back(function_descriptions[0].argument_block_index);
 
    return print_messages(m_context.get());
}
 
 
//------------------------------------------------------------------------------
//
// Material execution code
//
//------------------------------------------------------------------------------
 
// Helper function to create PTX source code for a non-empty 32-bit value array.
void print_array_u32(
    std::string &str, std::string const &name, unsigned count, std::string const &content)
{
    str += ".visible .const .align 4 .u32 " + name + "[";
    if (count == 0) {
        // PTX does not allow empty arrays, so use a dummy entry
        str += "1] = { 0 };\n";
    } else {
        str += to_string(count) + "] = { " + content + " };\n";
    }
}
 
// Helper function to create PTX source code for a non-empty function pointer array.
void print_array_func(
    std::string &str, std::string const &name, unsigned count, std::string const &content)
{
    str += ".visible .const .align 8 .u64 " + name + "[";
    if (count == 0) {
        // PTX does not allow empty arrays, so use a dummy entry
        str += "1] = { dummy_func };\n";
    } else {
        str += to_string(count) + "] = { " + content + " };\n";
    }
}
 
// Generate PTX array containing the references to all generated functions.
std::string generate_func_array_ptx(
    const std::vector<mi::base::Handle<const mi::neuraylib::ITarget_code> > &target_codes)
{
    // Create PTX header and mdl_expr_functions_count constant
    std::string src =
        ".version 4.0\n"
        ".target sm_20\n"
        ".address_size 64\n";
 
    // Workaround needed to let CUDA linker resolve the function pointers in the arrays.
    // Also used for "empty" function arrays.
    src += ".func dummy_func() { ret; }\n";
 
    std::string tc_offsets;
    std::string function_names;
    std::string tc_indices;
    std::string ab_indices;
    unsigned f_count = 0;
 
    // Iterate over all target codes
    for (size_t tc_index = 0, num = target_codes.size(); tc_index < num; ++tc_index)
    {
        mi::base::Handle<const mi::neuraylib::ITarget_code> const &target_code =
            target_codes[tc_index];
 
        // in case of multiple target codes, we need to address the functions by a pair of
        // target_code_index and function_index.
        // the elements in the resulting function array can then be index by offset + func_index.
        if(!tc_offsets.empty())
            tc_offsets += ", ";
        tc_offsets += to_string(f_count);
 
        // Collect all names and prototypes of callable functions within the current target code
        for (size_t func_index = 0, func_count = target_code->get_callable_function_count();
             func_index < func_count; ++func_index)
        {
            // add to function list
            if (!tc_indices.empty())
            {
                tc_indices += ", ";
                function_names += ", ";
                ab_indices += ", ";
            }
 
            // target code index in case of multiple link units
            tc_indices += to_string(tc_index);
 
            // name of the function
            function_names += target_code->get_callable_function(func_index);
 
            // Get argument block index and translate to 1 based list index (-> 0 = not-used)
            mi::Size ab_index = target_code->get_callable_function_argument_block_index(func_index);
            ab_indices += to_string(ab_index == mi::Size(~0) ? 0 : (ab_index + 1));
            f_count++;
 
            // Add prototype declaration
            src += target_code->get_callable_function_prototype(
                func_index, mi::neuraylib::ITarget_code::SL_PTX);
            src += '\n';
        }
    }
 
    // infos per target code (link unit)
    src += std::string(".visible .const .align 4 .u32 mdl_target_code_count = ")
        + to_string(target_codes.size()) + ";\n";
    print_array_u32(
        src, std::string("mdl_target_code_offsets"), unsigned(target_codes.size()), tc_offsets);
 
    // infos per function
    src += std::string(".visible .const .align 4 .u32 mdl_functions_count = ")
        + to_string(f_count) + ";\n";
    print_array_func(src, std::string("mdl_functions"), f_count, function_names);
    print_array_u32(src, std::string("mdl_arg_block_indices"), f_count, ab_indices);
    print_array_u32(src, std::string("mdl_target_code_indices"), f_count, tc_indices);
 
    return src;
}
 
// Build a linked CUDA kernel containing our kernel and all the generated code, making it
// available to the kernel via an added "mdl_expr_functions" array.
CUmodule build_linked_kernel(
    std::vector<mi::base::Handle<const mi::neuraylib::ITarget_code> > const &target_codes,
    const char *ptx_file,
    const char *kernel_function_name,
    CUfunction *out_kernel_function)
{
    // Generate PTX array containing the references to all generated functions.
    // The linker will resolve them to addresses.
 
    std::string ptx_func_array_src = generate_func_array_ptx(target_codes);
#ifdef DUMP_PTX
    FILE *file = fopen("func_array.ptx", "wt");
    if (file)
    {
        fwrite(ptx_func_array_src.c_str(), ptx_func_array_src.size(), 1, file);
        fclose(file);
    }
#endif
 
    // Link all generated code, our generated PTX array and our kernel together
 
    CUlinkState   cuda_link_state;
    CUmodule      cuda_module;
    void         *linked_cubin;
    size_t        linked_cubin_size;
    char          error_log[8192], info_log[8192];
    CUjit_option  options[4];
    void         *optionVals[4];
 
    mi::examples::profiling::Timing timing("PTX to SASS");
 
    // Setup the linker
 
    // Pass a buffer for info messages
    options[0] = CU_JIT_INFO_LOG_BUFFER;
    optionVals[0] = info_log;
    // Pass the size of the info buffer
    options[1] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    optionVals[1] = reinterpret_cast<void *>(uintptr_t(sizeof(info_log)));
    // Pass a buffer for error messages
    options[2] = CU_JIT_ERROR_LOG_BUFFER;
    optionVals[2] = error_log;
    // Pass the size of the error buffer
    options[3] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
    optionVals[3] = reinterpret_cast<void *>(uintptr_t(sizeof(error_log)));
 
    check_cuda_success(cuLinkCreate(4, options, optionVals, &cuda_link_state));
 
    CUresult link_result = CUDA_SUCCESS;
    do {
        // Add all code generated by the MDL PTX backend
        for (size_t i = 0, num_target_codes = target_codes.size(); i < num_target_codes; ++i) {
            link_result = cuLinkAddData(
                cuda_link_state, CU_JIT_INPUT_PTX,
                const_cast<char *>(target_codes[i]->get_code()),
                target_codes[i]->get_code_size(),
                nullptr, 0, nullptr, nullptr);
            if (link_result != CUDA_SUCCESS) break;
        }
        if (link_result != CUDA_SUCCESS) break;
 
        // Add the "mdl_expr_functions" array PTX module
        link_result = cuLinkAddData(
            cuda_link_state, CU_JIT_INPUT_PTX,
            const_cast<char *>(ptx_func_array_src.c_str()),
            ptx_func_array_src.size(),
            nullptr, 0, nullptr, nullptr);
        if (link_result != CUDA_SUCCESS) break;
 
        // Add our kernel
        link_result = cuLinkAddFile(
            cuda_link_state, CU_JIT_INPUT_PTX,
            ptx_file, 0, nullptr, nullptr);
        if (link_result != CUDA_SUCCESS) break;
 
        // Link everything to a cubin
        link_result = cuLinkComplete(cuda_link_state, &linked_cubin, &linked_cubin_size);
    } while (false);
    if (link_result != CUDA_SUCCESS) {
        std::cerr << "PTX linker error:\n" << error_log << std::endl;
        check_cuda_success(link_result);
    }
 
    timing.stop();
 
    std::cout << "CUDA link completed." << std::endl;
    if (info_log[0])
        std::cout << "Linker output:\n" << info_log << std::endl;
 
#ifdef DUMP_PTX
    file = fopen("target_code.cubin", "wb");
    if (file)
    {
        fwrite(linked_cubin, linked_cubin_size, 1, file);
        fclose(file);
    }
#endif
 
    // Load the result and get the entrypoint of our kernel
    check_cuda_success(cuModuleLoadData(&cuda_module, linked_cubin));
    check_cuda_success(cuModuleGetFunction(
        out_kernel_function, cuda_module, kernel_function_name));
 
    int regs = 0;
    check_cuda_success(
        cuFuncGetAttribute(&regs, CU_FUNC_ATTRIBUTE_NUM_REGS, *out_kernel_function));
    int lmem = 0;
    check_cuda_success(
        cuFuncGetAttribute(&lmem, CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES, *out_kernel_function));
    std::cout << "Kernel uses " << regs << " registers and " << lmem << " lmem and has a size of "
        << linked_cubin_size << " bytes." << std::endl;
 
    // Cleanup
    check_cuda_success(cuLinkDestroy(cuda_link_state));
 
    return cuda_module;
}
 
#endif // EXAMPLE_CUDA_SHARED_H
 

Source Code Location: examples/mdl_sdk/execution_cuda/example_execution_cuda.cu

/******************************************************************************
 * Copyright 2024 NVIDIA Corporation. All rights reserved.
 *****************************************************************************/
 
// examples/mdl_sdk/execution_cuda/example_execution_cuda.cu
//
// This file contains the CUDA kernel used to evaluate the material sub-expressions.
 
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <math.h>
 
#include "texture_support_cuda.h"
 
// To reuse this sample code for the MDL SDK and MDL Core the corresponding namespaces are used.
 
// when this CUDA code is used in the context of an SDK sample.
#if defined(MI_NEURAYLIB_BSDF_USE_MATERIAL_IOR)
    #define BSDF_USE_MATERIAL_IOR MI_NEURAYLIB_BSDF_USE_MATERIAL_IOR
    using namespace mi::neuraylib;
// when this CUDA code is used in the context of an Core sample.
#elif defined(MDL_CORE_BSDF_USE_MATERIAL_IOR)
    #define BSDF_USE_MATERIAL_IOR MDL_CORE_BSDF_USE_MATERIAL_IOR
    using namespace mi::mdl;
#endif
 
#ifdef ENABLE_DERIVATIVES
typedef Material_expr_function_with_derivs Mat_expr_func;
typedef Shading_state_material_with_derivs Mdl_state;
typedef Texture_handler_deriv Tex_handler;
#define TEX_VTABLE tex_deriv_vtable
#else
typedef Material_expr_function Mat_expr_func;
typedef Shading_state_material Mdl_state;
typedef Texture_handler Tex_handler;
#define TEX_VTABLE tex_vtable
#endif
 
 
// Custom structure representing the resources used by the generated code of a target code object.
struct Target_code_data
{
    size_t       num_textures;      // number of elements in the textures field
    Texture     *textures;          // a list of Texture objects, if used
    char const  *ro_data_segment;   // the read-only data segment, if used
};
 
 
// The number of generated MDL sub-expression functions available.
extern __constant__ unsigned int     mdl_functions_count;
 
// The target argument block indices for the generated MDL sub-expression functions.
// Note: the original indices are incremented by one to allow us to use 0 as "not-used".
extern __constant__ unsigned int     mdl_arg_block_indices[];
 
// The function pointers of the generated MDL sub-expression functions.
// In this example it is assumed that only expressions are added to the link unit.
// For a more complex use case, see also example df_cuda.
extern __constant__ Mat_expr_func    *mdl_functions[];
 
// The target code indices for the generated MDL sub-expression functions.
// In contrast to the df_cuda sample, this example simply iterates over all generated expressions.
// Therefore, no target_code_indices and function_indices are passed from the host side.
// Instead, this additional array allows the mapping to target_code_index. 
extern __constant__ unsigned int     mdl_target_code_indices[];
 
// Identity matrix.
// The last row is always implied to be (0, 0, 0, 1).
__constant__ const tct_float4 identity[3] = {
    {1.0f, 0.0f, 0.0f, 0.0f},
    {0.0f, 1.0f, 0.0f, 0.0f},
    {0.0f, 0.0f, 1.0f, 0.0f}
};
 
 
// Calculate radical inverse with base 2.
__device__ float radinv2(unsigned int bits)
{
    bits = (bits << 16) | (bits >> 16);
    bits = ((bits & 0x00ff00ff) << 8) | ((bits & 0xff00ff00) >> 8);
    bits = ((bits & 0x0f0f0f0f) << 4) | ((bits & 0xf0f0f0f0) >> 4);
    bits = ((bits & 0x33333333) << 2) | ((bits & 0xcccccccc) >> 2);
    bits = ((bits & 0x55555555) << 1) | ((bits & 0xaaaaaaaa) >> 1);
 
    return float(bits) / float(0x100000000ULL);
}
 
// CUDA kernel evaluating the MDL sub-expression for one texel.
extern "C" __global__ void evaluate_mat_expr(
    float3 *out_buf,
    Target_code_data *tc_data_list,
    char const **arg_block_list,
    unsigned int width,
    unsigned int height,
    unsigned int num_samples)
{
    // Determine x and y coordinates of texel to be evaluated and check
    // whether it is out of bounds (due to block padding)
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;
 
    // Calculate position and texture coordinates for a 2x2 quad around the center of the world
    float step_x = 1.f / width;
    float step_y = 1.f / height;
    float pos_x = 2.0f * x * step_x - 1.0f;  // [-1, 1)
    float pos_y = 2.0f * y * step_y - 1.0f;  // [-1, 1)
    float tex_x = float(x) * step_x;         // [0, 1)
    float tex_y = float(y) * step_y;         // [0, 1)
 
    // Assign materials in a checkerboard pattern
    unsigned int material_index =
        ((unsigned int)(tex_x * 4) ^ (unsigned int)(tex_y * 4)) % mdl_functions_count;
    unsigned int tc_idx = mdl_target_code_indices[material_index];
    char const *arg_block = arg_block_list[mdl_arg_block_indices[material_index]];
 
    // Setup MDL material state (with only one texture space)
#ifdef ENABLE_DERIVATIVES
    tct_deriv_float3 texture_coords[1] = {
        { { tex_x, tex_y, 0.0f }, { step_x, 0.0f, 0.0f }, { 0.0f, step_y, 0.0f } } };
#else
    tct_float3 texture_coords[1]    = { { tex_x, tex_y, 0.0f } };
#endif
    tct_float3 texture_tangent_u[1] = { { 1.0f, 0.0f, 0.0f } };
    tct_float3 texture_tangent_v[1] = { { 0.0f, 1.0f, 0.0f } };
 
    Mdl_state mdl_state = {
        /*normal=*/                { 0.0f, 0.0f, 1.0f },
        /*geom_normal=*/           { 0.0f, 0.0f, 1.0f },
#ifdef ENABLE_DERIVATIVES
        /*position=*/
        {
            { pos_x, pos_y, 0.0f },
            { 2 * step_x, 0.0f, 0.0f },
            { 0.0f, 2 * step_y, 0.0f }
        },
#else
        /*position=*/              { pos_x, pos_y, 0.0f },
#endif
        /*animation_time=*/        0.0f,
        /*texture_coords=*/        texture_coords,
        /*tangent_u=*/             texture_tangent_u,
        /*tangent_v=*/             texture_tangent_v,
        /*text_results=*/          NULL,
        /*ro_data_segment=*/       tc_data_list[tc_idx].ro_data_segment,
        /*world_to_object=*/       identity,
        /*object_to_world=*/       identity,
        /*object_id=*/             0,
        /*meters_per_scene_unit=*/ 1.0f
    };
    
    Tex_handler tex_handler;
    tex_handler.vtable       = &;TEX_VTABLE;   // only required in 'vtable' mode, otherwise NULL
    tex_handler.num_textures = tc_data_list[tc_idx].num_textures;
    tex_handler.textures     = tc_data_list[tc_idx].textures;
 
    Resource_data res_data_pair = {
        NULL, reinterpret_cast<Texture_handler_base *>(&tex_handler) };
 
    // Super-sample the current texel with the given number of samples
    float3 res = make_float3(0, 0, 0);
    for (unsigned int i = 0; i < num_samples; ++i) {
        // Calculate the offset for the current sample
        float offs_x = float(i) / num_samples * step_x;
        float offs_y = radinv2(i) * step_y;
 
        // Update the position and the texture coordinate
#ifdef ENABLE_DERIVATIVES
        mdl_state.position.val.x = pos_x + 2 * offs_x;
        mdl_state.position.val.y = pos_y + 2 * offs_y;
        texture_coords[0].val.x = tex_x + offs_x;
        texture_coords[0].val.y = tex_y + offs_y;
#else
        mdl_state.position.x = pos_x + 2 * offs_x;
        mdl_state.position.y = pos_y + 2 * offs_y;
        texture_coords[0].x = tex_x + offs_x;
        texture_coords[0].y = tex_y + offs_y;
#endif
 
        // Add result for current sample
        float3 cur_res;
        mdl_functions[material_index](&cur_res, &mdl_state, &res_data_pair, arg_block);
        res.x += cur_res.x;
        res.y += cur_res.y;
        res.z += cur_res.z;
    }
 
    // Calculate average over all samples and apply gamma correction
    res.x = powf(res.x / num_samples, 1.f / 2.2f);
    res.y = powf(res.y / num_samples, 1.f / 2.2f);
    res.z = powf(res.z / num_samples, 1.f / 2.2f);
 
    // Write result to output buffer
    out_buf[y * width + x] = res;
}

[Previous] [Up] [Next]