Manage global Spark resources for Pullup jobs

When you run a Pullup job, it requires Spark resources. The resource requirements vary depending on factors such as column and row count, data types, and operational complexity. The resources a job receives are capped by the global job limits. If the required resources are not immediately available on the Spark engine, the job is queued until resources become free.

Resource allocation directly affects job runtime. Providing more resources typically results in faster processing. Conversely, if you allocate too few resources, the job may experience Out-of-Memory (OOM) errors when it runs.
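To see why under-allocation leads to OOM errors, it helps to estimate how much of an executor's memory is actually available for processing. The sketch below uses Spark's documented defaults (roughly 300 MB of reserved memory, and `spark.memory.fraction` defaulting to 0.6); the helper name and the example value are illustrative, not Collibra settings.

```python
def usable_execution_memory_mb(executor_memory_mb, memory_fraction=0.6, reserved_mb=300):
    """Rough estimate of the memory an executor can use for execution and storage.

    Spark sets aside ~300 MB of reserved memory, then spark.memory.fraction
    (default 0.6) of the remainder is shared by execution and storage.
    """
    return (executor_memory_mb - reserved_mb) * memory_fraction

# A 4 GB executor leaves only ~2.2 GB for actual processing:
print(usable_execution_memory_mb(4096))
```

If a job's shuffle or aggregation state exceeds this usable fraction and cannot spill effectively, the executor is killed with an OOM error, which is why headroom matters more than the nominal executor size.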

Note The available Spark engine resources depend on your Edge site installation. Collibra recommends using scalable managed Kubernetes for optimal job execution.

The out-of-the-box defaults for Spark resources are designed for stability across various infrastructures. While these settings ensure that even the simplest jobs launch and run, they might not suit the size of the workloads you run in your environment.

Important For simplicity when getting started, Collibra recommends using the default Spark resource settings.

Job profile and workload complexity

A job profile determines the complexity of the workload. To establish a baseline for resource allocation, you must consider the following factors:

  • Complexity of the operation (rules and code): This is the most crucial factor. A simple row filter requires minimal compute resources. A full outer join or a complex window function requires substantial shuffle memory, CPU, and potentially more driver memory.
  • Data size (rows and bytes): This directly impacts the number of partitions and the executor memory required for storage and processing blocks.
  • Columns and data type: This impacts memory usage. For example, wide tables consume more RAM, and complex data types such as JSON or nested arrays increase processing overhead.
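The factors above can be combined into a rough sizing heuristic. The sketch below is a hypothetical illustration of such a baseline calculation; the complexity factors, the bytes-per-cell estimate, and the function name are assumptions for the example, not values from Collibra or Spark.

```python
# Illustrative multipliers: heavier operations need proportionally more memory
# for shuffle and intermediate state (assumed values, not product defaults).
COMPLEXITY_FACTOR = {"filter": 1.0, "join": 2.0, "window": 3.0}

def estimate_executor_memory_gb(row_count, column_count, operation, bytes_per_cell=8):
    """Back-of-the-envelope memory estimate for a job profile.

    raw data size = rows x columns x average bytes per cell,
    scaled by how shuffle-heavy the operation is.
    """
    raw_gb = row_count * column_count * bytes_per_cell / 1024**3
    return raw_gb * COMPLEXITY_FACTOR[operation]

# 10M rows x 100 columns with a join roughly doubles the raw footprint:
print(round(estimate_executor_memory_gb(10_000_000, 100, "join"), 1))
```

A heuristic like this is only a starting point for Phase 1; observed consumption from real runs should always override the estimate.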

Implement a phased configuration approach

You should implement configuration adjustments using a phased, iterative approach. Base these adjustments on observed workload characteristics and the total available infrastructure capacity.

Phase 1: Establish a baseline
  • Action: Start with the out-of-the-box global limits, or set calculated limits based on the smallest common job profile.
  • Goal: Establish a working baseline and understand typical resource consumption for standard jobs. Consider CPUs, memory, and the average duration of your jobs.

Phase 2: Categorize your workload
  • Action: Identify the tables that attract resource-intensive jobs, such as those with large volumes of data to process or those against which you expect to write custom rules with complex join queries, and identify their peak requirements.
  • Goal: Determine the maximum capacity required by the most resource-intensive jobs.

Phase 3: Set global limits
  • Action: Set the global maximum values for your Spark resource settings based on the largest expected workload profile, constrained by the total available infrastructure capacity of your Edge site.
  • Goal: Ensure that the largest jobs can run without hitting artificial constraints, while preventing a single job from consuming the entire cluster.

Phase 4: Optimize and refine your settings
  • Action: Monitor resource utilization metrics frequently, using tools such as the Infrastructure logs and the Spark History Server. Adjust the global limits downward if your jobs are consistently over-provisioned, or upward if your jobs frequently hit the limits and fail.
  • Goal: Continuously tune the configuration to maximize throughput and minimize resource waste.
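As a sanity check for Phase 3, you can verify that the largest permitted job still fits within the site's total capacity. The sketch below uses standard Spark property names (`spark.executor.memory`, `spark.executor.cores`, `spark.dynamicAllocation.maxExecutors`, `spark.driver.memory`); the specific values and the capacity figures are assumptions for illustration, not Collibra defaults.

```python
# Assumed total Edge site capacity (example values).
TOTAL_SITE_CORES = 32
TOTAL_SITE_MEMORY_GB = 128

# Candidate global limits for the largest expected workload profile.
spark_limits = {
    "spark.driver.memory": "4g",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.dynamicAllocation.maxExecutors": "3",
}

def job_footprint(limits):
    """Worst-case (cores, memory_gb) a single job can claim under these limits."""
    executors = int(limits["spark.dynamicAllocation.maxExecutors"])
    cores = executors * int(limits["spark.executor.cores"])
    memory_gb = (executors * int(limits["spark.executor.memory"].rstrip("g"))
                 + int(limits["spark.driver.memory"].rstrip("g")))
    return cores, memory_gb

cores, memory_gb = job_footprint(spark_limits)
# The largest single job must leave room for other jobs on the cluster.
assert cores <= TOTAL_SITE_CORES and memory_gb <= TOTAL_SITE_MEMORY_GB
print(cores, memory_gb)
```

A check like this makes the Phase 3 trade-off explicit: the limits must be high enough for the largest job, yet low enough that one job cannot monopolize the site.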