[Issue]: Validation failed issue on MI300X #2071

Open
seungmanhan opened this issue Dec 18, 2024 · 16 comments


@seungmanhan

seungmanhan commented Dec 18, 2024

Problem Description

  • Validation fails on MI300X when running Tensile for a specific size (B=32, M=8192, N=8192, K=128).
  • Similar sizes also fail, mainly when the batch size B is 32.
  • The same configuration passes on MI250X/MI250. My guess is that this is related to the issue where MI300 shows a slight error, especially in rocBLAS and some GEMMs.
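
To put the reported size in perspective, here is a minimal arithmetic sketch of the output tensor for this problem. It assumes that `DataType: B` in the config means bfloat16 (2 bytes per element); the helper names are hypothetical and not part of Tensile. For B=32 the element count is exactly 2^31 and the byte count is exactly 2^32, which matches the `mem-write-bytes` value (4294967296) in the run output posted later in this thread. This is arithmetic only and does not establish the cause of the failure.

```python
# Sizes taken from the report: batch B and matrix dimensions M, N, K.
B, M, N, K = 32, 8192, 8192, 128

# Assumption: DataType "B" in the Tensile config is bfloat16 (2 bytes/element).
BYTES_PER_ELEM = 2

INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1


def output_elements(b, m, n):
    """Number of elements in the batched output tensor (m x n per batch)."""
    return b * m * n


def output_bytes(b, m, n, elem_size=BYTES_PER_ELEM):
    """Size in bytes of the batched output tensor."""
    return output_elements(b, m, n) * elem_size


# Compare a few batch sizes against the 32-bit limits.
for b in (8, 16, 32):
    elems = output_elements(b, M, N)
    nbytes = output_bytes(b, M, N)
    print(
        f"B={b:2d}: elements={elems} (exceeds int32: {elems > INT32_MAX}), "
        f"bytes={nbytes} (exceeds uint32: {nbytes > UINT32_MAX})"
    )
```

With M=N=8192, only B=32 in this loop crosses both 32-bit limits, so it may be worth testing batch sizes just below and above 32 to see whether the failure tracks that boundary.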

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9474F 48-Core Processor

GPU

gfx942 AMD Instinct MI300X amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-

ROCm Version

ROCm 6.2.0

ROCm Component

Tensile

Steps to Reproduce

cd (Tensile root)/Tensile/bin
mkdir config build tmp
vi config/bgemm_tn_normal.yaml

./Tensile --runtime-language HIP config/bgemm_tn_normal.yaml build > tmp/bgemm_tn_normal.txt

Contents of bgemm_tn_normal.yaml:

GlobalParameters:
  NumElementsToValidate: 16384
  KernelTime: True
  DataInitTypeAlpha: 1
  DataInitTypeBeta: 0
  MaxWorkspaceSize: 8388608

BenchmarkProblems:
  ########################################
  # TN - standard
  ########################################
  -
    - # ProblemType
      OperationType: GEMM
      DataType: B
      DestDataType: B
      ComputeDataType: s
      HighPrecisionAccumulate: True
      TransposeA: True
      TransposeB: False
      UseBeta: True
      Batched: True

    - # BenchmarkProblemSizeGroup - Standard - non-multiple of 8 M,N
      InitialSolutionParameters:
      BenchmarkCommonParameters:
        - KernelLanguage: ["Assembly"]
        - EdgeType: ["ShiftPtr"]
      ForkParameters:
        - MatrixInstruction:
          # If tensile is not found, comment it out
          # - [32, 32, 4, 1,  1,  1,1,  2,2]  # 64x64
          # - [32, 32, 4, 1,  1,  1,2,  2,2]  # 64x128
          # - [32, 32, 4, 1,  1,  1,4,  2,2]  # 64x256
          # - [32, 32, 4, 1,  1,  2,1,  2,2]  # 128x64
          # - [32, 32, 4, 1,  1,  2,2,  2,2]  # 128x128
          # - [32, 32, 4, 1,  1,  2,4,  2,2]  # 128x256
          # - [32, 32, 4, 1,  1,  4,1,  2,2]  # 256x64
          - [32, 32, 4, 1,  1,  4,2,  2,2]  # 256x128
          - [32, 32, 8, 1,  1,  4,2,  2,2]  # 256x256
          # - [32, 32, 8, 1,  1,  4,4,  2,2]
          # - [16, 16, 8, 1,  1,  1,1,  2,2] # 32x32
          # - [16, 16, 8, 1,  1,  1,2,  2,2] # 32x64
          # - [16, 16, 8, 1,  1,  2,1,  2,2] # 64x32
          # - [32, 32, 1, 2]
          # - [32, 32, 2, 1]

          # - [16, 16, 8, 1,  1,  2,2,  2,2] # 64x64
          # - [16, 16, 8, 1,  1,  4,1,  2,2] # 128x32
          # - [16, 16, 8, 1,  1,  4,2,  2,2] # 128x64
          # - [16, 16, 8, 1,  1,  8,1,  2,2] # 256x32
          # - [16, 16, 8, 1,  1,  1,4,  2,2] # 32x128
          # - [16, 16, 8, 1,  1,  2,4,  2,2] # 64x128
          # - [16, 16, 8, 1,  1,  1,8,  2,2] # 32x256

          # Specialized MT
          # - [32, 32, 4, 1,  1,  1,3,  4,1]  # 128x96
          # - [32, 32, 4, 1,  1,  3,1,  1,4]  # 96x128
          # - [32, 32, 4, 1,  1,  1,5,  4,1]  # 128x160
          # - [32, 32, 4, 1,  1,  5,1,  1,4]  # 160x128
          # - [32, 32, 4, 1,  1,  2,5,  4,1]  # 256x160
          # - [32, 32, 4, 1,  1,  5,2,  1,4]  # 160x256
          # - [16, 16, 8, 1,  1,  2,2,  2,2] # 64x64
          # - [16, 16, 8, 1,  1,  5,2,  2,2] # 160x64
          # - [16, 16, 8, 1,  1,  2,5,  2,2] # 64x160
          # - [16, 16, 8, 1,  1,  3,3,  2,2] # 96x96
          # - [16, 16, 8, 1,  1,  5,5,  2,2] # 160x160

          # - [32, 32, 4, 1,  1,  1,1,  2,2]  # 64x64 
          # - [32, 32, 4, 1,  1,  5,1,  1,2]  # 160x64
          # - [32, 32, 4, 1,  1,  1,5,  2,1]  # 64x160
          # - [32, 32, 4, 1,  1,  5,5,  1,1]  # 160x160

          # - [16, 16, 8, 1,  1,  3,3,  2,2] # 96x96
          # - [16, 16, 8, 1,  1,  5,5,  2,2] # 160x160

          # - [16, 16, 8, 1,  1,  2,9,  4,1] # 128x144
          # - [16, 16, 8, 1,  1,  9,2,  1,4] # 144x128
        - ThreadTile:
          - [ 1, 32 ]
          - [ 2, 32 ]
          - [ 4, 32 ]
          - [ 1, 64 ]
          - [ 2, 64 ]
        - WorkGroup:
          - [ 64, 4, 1 ]
        - AssertFree0ElementMultiple: [1]
        - AssertFree1ElementMultiple: [1]
        - AssertSummationElementMultiple: [1]
        - PrefetchGlobalRead: [1]
        # - PrefetchLocalRead: [5,9,17,32,33]
        - PrefetchLocalRead: [1]
        - DepthU: [32, 64, 128]
        # - DepthU: [32]
        - VectorWidth: [2]
        - GlobalReadVectorWidth: [2,4,8]
        # - LocalReadVectorWidth: [4]
        - SuppressNoLoadLoop: [0]
        - OptNoLoadLoop: [1]
        - ScheduleLocalWrite: [1]
        - ScheduleGlobalRead: [1]
        - ScheduleIterAlg: [3]
        - InnerUnroll: [1]
        - ExpandPointerSwap: [1]
        # - TransposeLDS: [1]
        # - LdsBlockSizePerPadA: [-1]
        # - LdsBlockSizePerPadB: [-1]
        # - LdsPadA: [-1]
        # - LdsPadB: [-1]
        # - StoreRemapVectorWidth: [-1]
        - StoreRemapVectorWidth: [0,4]
        # - StaggerUMapping: [0,3]
        - StaggerUStride: [128,256]
        # - StaggerU: [0,32]
        - StaggerU: [0]
        # - WorkGroupMapping: [1,2,15]
        - WorkGroupMapping: [0]
        # - WorkGroupMapping: [0,1,2,3,4,5,8,15]
        # - WorkGroupMapping: [0]
        - WaveSeparateGlobalReadA: [1]
        - WaveSeparateGlobalReadB: [1]
        - 1LDSBuffer: [0,1]
        # - GlobalSplitU: [1,2,3,5,7,15]
        # - GlobalSplitU: [1]
        # - GlobalSplitU: [1,2]
        - GlobalSplitU: [ 1 ]
      BenchmarkJoinParameters:
      BenchmarkFinalParameters:
        - ProblemSizes:
          - Exact: [ 8192, 8192, 32, 128 ] #gemm generated 


########################################
LibraryLogic:
    ScheduleName: "aquavanjaram"
    DeviceNames: ["Device 0049", "Device 0050"]
    ArchitectureName: "gfx942"

LibraryClient:

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded

HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD EPYC 9474F 48-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 9474F 48-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 48
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 1188762324(0x46db12d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 1188762324(0x46db12d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 1188762324(0x46db12d4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: AMD EPYC 9474F 48-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 9474F 48-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 1
Compute Unit: 48
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 1188987428(0x46de8224) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 1188987428(0x46de8224) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 1188987428(0x46de8224) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 3


Name: gfx942
Uuid: GPU-e3bbfe66a72d73c7
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 1536
Internal Node ID: 2
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 4


Name: gfx942
Uuid: GPU-2adb59d7c4e0ef46
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 9984
Internal Node ID: 3
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 5


Name: gfx942
Uuid: GPU-52953e0b8b5117e5
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 4
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 17920
Internal Node ID: 4
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 6


Name: gfx942
Uuid: GPU-ccfee17c6febc894
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 5
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 26112
Internal Node ID: 5
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 7


Name: gfx942
Uuid: GPU-10eb49b12e70b220
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 6
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 34304
Internal Node ID: 6
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 8


Name: gfx942
Uuid: GPU-2d0180c6aa816f7a
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 7
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 42496
Internal Node ID: 7
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 9


Name: gfx942
Uuid: GPU-055b27e9e4eb6d54
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 8
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 50688
Internal Node ID: 8
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32


Agent 10


Name: gfx942
Uuid: GPU-a896d90ffb14d65d
Marketing Name: AMD Instinct MI300X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 9
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 262144(0x40000) KB
Chip ID: 29857(0x74a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 58880
Internal Node ID: 9
Compute Unit: 304
SIMDs per CU: 4
Shader Engines: 32
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 150
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 201310208(0xbffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

cmake version 3.28.4

@ppanchad-amd

Hi @seungmanhan. An internal ticket has been created to investigate your issue. Thanks!

@tcgu-amd

Hi @seungmanhan, thanks for reaching out! Would you be able to provide the exact error message you are seeing? Thanks!

@seungmanhan
Author

Here are the results I got. If there is a way to output this in more detail, please let me know.
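
A log like the one that follows can be condensed with a small post-processing script. The sketch below is not a Tensile feature; it only parses the text format visible in this output (mismatch lines such as `[1]  elem=131101 idx=131101: 0!=72` and CSV result rows containing `FAILED`). The function names are hypothetical, the column positions are assumed from the header row in the log, and `SolutionA` in the sample is a shortened placeholder for the real solution name.

```python
import csv
import io
import re

# Matches mismatch lines such as "[1]  elem=131101 idx=131101: 0!=72",
# where the left value is the device result and the right is the reference.
MISMATCH = re.compile(r"^\[(\d+)\]\s+elem=(\d+)\s+idx=(\d+):\s*(\S+?)!=(\S+)$")


def parse_mismatches(text):
    """Return (elem, device_value, reference_value) tuples found in the log."""
    found = []
    for line in text.splitlines():
        m = MISMATCH.match(line.strip())
        if m:
            found.append((int(m.group(2)), float(m.group(4)), float(m.group(5))))
    return found


def failed_solutions(text):
    """Return solution names from CSV rows whose validation column is FAILED."""
    names = []
    for line in text.splitlines():
        if ",FAILED," not in line:
            continue
        row = next(csv.reader(io.StringIO(line)))
        # Assumed column order (from the log header):
        # run, problem-progress, solution-progress, operation,
        # problem-sizes, solution, validation, ...
        names.append(row[5])
    return names


# Shortened sample; "SolutionA" is a placeholder, not a real kernel name.
sample = """\
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
0,0/0,0/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",SolutionA,FAILED,4640.43,118471
"""

mismatches = parse_mismatches(sample)
all_device_zero = all(dev == 0 for _, dev, _ in mismatches)
print(f"{len(mismatches)} mismatches, device all zero: {all_device_zero}")
print("failed:", failed_solutions(sample))
```

Running this over the full redirected file (for example `tmp/bgemm_tn_normal.txt`) would show at a glance whether every reported device value is zero and which solutions failed.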


################################################################################
#
#  Tensile v4.43.0
#  Config: /home/seungman/workspace/Tensile/Tensile/bin/config/bgemm_tn_normal.yaml
#  Date & Time: 18/12/2024 15:11:45
#
################################################################################

# Restoring default globalParameters
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
> UserWarning: HardwareMonitor currently disabled for gfx941, gfx942, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201
# Found hipcc version 6.2.41133-dd7f95766
> UserWarning: ISA (12, 0, 0) isn't supported for ROCm stack 6.2, skipping...
> UserWarning: ISA (12, 0, 1) isn't supported for ROCm stack 6.2, skipping...
# Command-line override: RuntimeLanguage

Overriding RuntimeLanguage=HIP
Overriding CxxCompiler=amdclang++

################################################################################
# Converting Config to BenchmarkProcess Object
################################################################################

# Filling in Parameters With Defaults
# Convert Parameters to Benchmark Step(s)
# Benchmark Final
# NumBenchmarkSteps: 1

################################################################################
# Done Creating BenchmarkProcess Object
################################################################################


################################################################################
# Benchmark Step: Cijk_Alik_Bljk_BBS_BH_00 - 00_Final 12.004s
# Num Sizes: 1
# Fork Parameters:
#     1LDSBuffer: [0, 1]
#     DepthU: [32, 64, 128]
#     GlobalReadVectorWidth: [2, 4, 8]
#     MatrixInstruction: [[32, 32, 4, 1, 1, 1, 1, 2, 2], [32, 32, 4, 1, 1, 1, 2, 2, 2], [32, 32, 4, 1, 1, 1, 4, 2, 2], [32, 32, 4, 1, 1, 2, 1, 2, 2], [32, 32, 4, 1, 1, 2, 2, 2, 2], [32, 32, 4, 1, 1, 2, 4, 2, 2], [32, 32, 4, 1, 1, 4, 1, 2, 2], [32, 32, 4, 1, 1, 4, 2, 2, 2], [32, 32, 8, 1, 1, 4, 2, 2, 2], [32, 32, 8, 1, 1, 4, 4, 2, 2], [16, 16, 8, 1, 1, 1, 1, 2, 2], [16, 16, 8, 1, 1, 1, 2, 2, 2], [16, 16, 8, 1, 1, 2, 1, 2, 2], [32, 32, 1, 2], [32, 32, 2, 1], [16, 16, 8, 1, 1, 2, 2, 2, 2], [16, 16, 8, 1, 1, 4, 1, 2, 2], [16, 16, 8, 1, 1, 4, 2, 2, 2], [16, 16, 8, 1, 1, 8, 1, 2, 2], [16, 16, 8, 1, 1, 1, 4, 2, 2], [16, 16, 8, 1, 1, 2, 4, 2, 2], [16, 16, 8, 1, 1, 1, 8, 2, 2], [32, 32, 4, 1, 1, 1, 3, 4, 1], [32, 32, 4, 1, 1, 3, 1, 1, 4], [32, 32, 4, 1, 1, 1, 5, 4, 1], [32, 32, 4, 1, 1, 5, 1, 1, 4], [32, 32, 4, 1, 1, 2, 5, 4, 1], [32, 32, 4, 1, 1, 5, 2, 1, 4], [16, 16, 8, 1, 1, 2, 2, 2, 2], [16, 16, 8, 1, 1, 5, 2, 2, 2], [16, 16, 8, 1, 1, 2, 5, 2, 2], [16, 16, 8, 1, 1, 3, 3, 2, 2], [16, 16, 8, 1, 1, 5, 5, 2, 2], [32, 32, 4, 1, 1, 1, 1, 2, 2], [32, 32, 4, 1, 1, 5, 1, 1, 2], [32, 32, 4, 1, 1, 1, 5, 2, 1], [32, 32, 4, 1, 1, 5, 5, 1, 1]]
#     StaggerUStride: [128, 256]
#     StoreRemapVectorWidth: [0, 4]
#     ThreadTile: [[1, 32], [2, 32], [4, 32], [1, 64], [2, 64]]
# Using cached solution data
loading config file /home/seungman/workspace/Tensile/Tensile/bin/build/1_BenchmarkProblems/Cijk_Alik_Bljk_BBS_BH_00/00_Final/build/../source/ClientParameters.ini
Loading /home/seungman/workspace/Tensile/Tensile/bin/build/1_BenchmarkProblems/Cijk_Alik_Bljk_BBS_BH_00/00_Final/source/library/Kernels.so-000-gfx942.hsaco
Loading /home/seungman/workspace/Tensile/Tensile/bin/build/1_BenchmarkProblems/Cijk_Alik_Bljk_BBS_BH_00/00_Final/source/library/TensileLibrary_gfx942.co
Log level: Debug
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
run,problem-progress,solution-progress,operation,problem-sizes,solution,validation,time-us,gflops,empty,total-gran,tiles-per-cu,num-cus,tile0-gran,tile1-gran,cu-gran,wave-gran,mem-read-bytes,mem-write-bytes,temp-edge,clock-sys,clock-soc,clock-mem,fan-rpm,hardware-samples,gfx-frequency(median),power(median),hotspot-temperature(median),enqueue-time
0,0/0,0/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA2048_LBSPPB1024_LPA2_LPB2_SRVW0_TT4_64_VW2_VWB2,FAILED,4640.43,118471,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2024-12-18 15:12:15.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,1/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA2048_LBSPPB1024_LPA2_LPB2_SRVW0_TT4_64_VW2_VWB2,FAILED,4737.64,116040,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2024-12-18 15:12:15.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,2/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA2048_LBSPPB1024_LPA2_LPB2_SRVW0_TT4_64_VW2_VWB2,FAILED,4758.58,115529,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2024-12-18 15:12:15.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,3/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW0_TT4_128_VW2_VWB2,FAILED,4436.9,123905,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:15.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,4/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW0_TT4_128_VW2_VWB2,FAILED,3794.51,144882,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,5/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW0_TT4_128_VW2_VWB2,FAILED,4290.25,128141,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,6/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_AMAS3_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA2048_LBSPPB2048_LPA2_LPB2_SRVW0_TT4_128_VW2_VWB2,FAILED,4568.75,120330,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,7/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_AMAS3_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA2048_LBSPPB2048_LPA2_LPB2_SRVW0_TT4_128_VW2_VWB2,FAILED,4695.04,117093,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,8/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_AMAS3_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA2048_LBSPPB2048_LPA2_LPB2_SRVW0_TT4_128_VW2_VWB2,FAILED,4566.8,120381,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,9/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x64_MI32x32x8x1_SN_1LDSB1_AMAS3_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW0_TT4_128_VW2_VWB2,FAILED,6525.58,84246.3,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,10/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x64_MI32x32x8x1_SN_1LDSB1_AMAS3_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW0_TT4_128_VW2_VWB2,FAILED,4909.53,111977,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,11/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x64_MI32x32x8x1_SN_1LDSB1_AMAS3_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW0_TT4_128_VW2_VWB2,FAILED,4609.08,119277,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:16.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,12/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA2048_LBSPPB1024_LPA1_LPB1_SRVW4_TT4_64_VW1_VWB1,FAILED,3302.28,166477,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,13/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA2048_LBSPPB1024_LPA1_LPB1_SRVW4_TT4_64_VW1_VWB1,FAILED,3148.68,174599,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,14/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA2048_LBSPPB1024_LPA1_LPB1_SRVW4_TT4_64_VW1_VWB1,FAILED,2947.19,186536,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,15/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW4_TT4_128_VW1_VWB1,FAILED,3666.14,149955,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,16/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW4_TT4_128_VW1_VWB1,FAILED,2925.38,187927,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,17/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW4_TT4_128_VW1_VWB1,FAILED,2576.74,213353,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,18/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_AMAS0_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA2048_LBSPPB2048_LPA1_LPB1_SRVW4_TT4_128_VW1_VWB1,FAILED,2803.2,196118,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,19/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_AMAS0_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA2048_LBSPPB2048_LPA1_LPB1_SRVW4_TT4_128_VW1_VWB1,FAILED,2764.54,198860,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,20/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x32_MI32x32x8x1_SN_1LDSB1_AMAS0_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA2048_LBSPPB2048_LPA1_LPB1_SRVW4_TT4_128_VW1_VWB1,FAILED,2674.1,205586,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,21/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x64_MI32x32x8x1_SN_1LDSB1_AMAS0_GLVWA2_GLVWB2_GRVW2_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW4_TT4_128_VW1_VWB1,FAILED,5448.88,100893,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:17.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,22/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x64_MI32x32x8x1_SN_1LDSB1_AMAS0_GLVWA4_GLVWB4_GRVW4_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW4_TT4_128_VW1_VWB1,FAILED,3750.28,146591,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:18.
Index:  Device | Reference
[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63
0,0/0,23/23,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x256x64_MI32x32x8x1_SN_1LDSB1_AMAS0_GLVWA8_GLVWB8_GRVW8_K1_LBSPPA0_LBSPPB0_LPA0_LPB0_SRVW4_TT4_128_VW1_VWB1,FAILED,3074.47,178813,,0.998051,107.789,304,1,1,0.998051,1,8589934592,4294967296,,,,,,,,,,2024-12-18 15:12:18.
> UserWarning: ClientWriter Benchmark Process exited with code 24
> UserWarning: BenchmarkProblems: Benchmark Process exited with code 24
################################################################################
# Cijk_Alik_Bljk_BBS_BH_00
# 00_Final: End - 35.873s
################################################################################

clientExit=1 (ERROR) for ['/home/seungman/workspace/Tensile/Tensile/bin/config/bgemm_tn_normal.yaml']
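
A log this long is easier to compare across machines once it is reduced to a pass/fail count. The following is a minimal sketch, assuming the client output keeps the CSV layout shown above (a header line beginning with `run,problem-progress` and a `validation` column); the function name `summarize_validation` is made up for illustration and is not part of Tensile.

```python
import csv
from collections import Counter


def summarize_validation(log_text):
    """Count the values of the 'validation' column in Tensile client output.

    Assumes result rows have the same number of CSV fields as the most recent
    header line; every other line (mismatch dumps, warnings) is ignored.
    """
    header = None
    counts = Counter()
    for line in log_text.splitlines():
        # csv handles the quoted problem-sizes field, e.g. "(8192,8192,32,128)"
        fields = next(csv.reader([line]), [])
        if fields[:2] == ["run", "problem-progress"]:
            header = fields
            continue
        if header is not None and len(fields) == len(header):
            counts[fields[header.index("validation")]] += 1
    return counts
```

Calling it on the text saved in `tmp/bgemm_tn_normal.txt` should give one count per validation status, which can then be compared between systems.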

@tcgu-amd

tcgu-amd commented Jan 2, 2025

Hi @seungmanhan, thanks for your patience! Unfortunately, we are currently unable to reproduce the error you are getting -- I am wondering if you can share the exact commit of the version of Tensile you are using? Thanks!

@seungmanhan
Author

@tcgu-amd I'm using the latest develop branch and this issue only occurs on gfx942 (commit f2eef8e)

@tcgu-amd

tcgu-amd commented Jan 3, 2025

Thanks! We will test this specific commit!

@tcgu-amd

tcgu-amd commented Jan 6, 2025

@seungmanhan, unfortunately, we are still unable to reproduce the issue. Do you by chance have any external environment configuration? Also, just to confirm: it looks like you have multiple agents on your system. Have you had the chance to try the validation test on the other MI300X agents as well? Thanks!
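
One way to repeat the test on each agent is to expose a single GPU per run. The following is a sketch only: it assumes the HIP runtime's `HIP_VISIBLE_DEVICES` environment variable is honoured by the Tensile client, reuses the command line from Steps to Reproduce, and uses hypothetical per-GPU build directory names (`build_gpu0`, `build_gpu1`, ...) so that runs do not share a build directory.

```python
import os
import subprocess


def per_device_runs(num_devices, config="config/bgemm_tn_normal.yaml"):
    """Build one (environment, command) pair per GPU index.

    The build directory name build_gpu<N> is an illustrative choice.
    """
    runs = []
    for idx in range(num_devices):
        # Restrict the HIP runtime to a single device for this run.
        env = dict(os.environ, HIP_VISIBLE_DEVICES=str(idx))
        cmd = ["./Tensile", "--runtime-language", "HIP", config, f"build_gpu{idx}"]
        runs.append((env, cmd))
    return runs


def run_all(runs, log_dir="tmp"):
    """Execute each run from Tensile/bin and save its output to tmp/gpu<N>.txt."""
    os.makedirs(log_dir, exist_ok=True)
    for idx, (env, cmd) in enumerate(runs):
        with open(os.path.join(log_dir, f"gpu{idx}.txt"), "w") as log:
            subprocess.run(cmd, env=env, stdout=log,
                           stderr=subprocess.STDOUT, check=False)
```

`run_all(per_device_runs(8))` would then leave one log per GPU for an eight-GPU system.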

@seungmanhan
Author

First, I will try to test it on other agents as you suggested, but I have some questions. Could it be that the GPU and CPU servers are made by different manufacturers?

@seungmanhan
Author

I've tested it on other devices, but validation always fails. Can you show me the successful results?

@tcgu-amd

tcgu-amd commented Jan 7, 2025

> I've tested it on other devices, but validation always fails. Can you show me the successful results?

Sure, here are the results we got:

################################################################################
#
#  Tensile v4.43.0
#  Config: /home/test/git/Tensile/Tensile/bin/config/bgemm_tn_normal.yaml
#  Date & Time: 03/01/2025 17:11:52
#
################################################################################

# Restoring default globalParameters
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
> UserWarning: HardwareMonitor currently disabled for gfx941, gfx942, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201
# Found hipcc version 6.2.41134-65d174c3e
> UserWarning: ISA (12, 0, 0) isn't supported for ROCm stack 6.2, skipping...
> UserWarning: ISA (12, 0, 1) isn't supported for ROCm stack 6.2, skipping...
# Command-line override: RuntimeLanguage

Overriding RuntimeLanguage=HIP
Overriding CxxCompiler=amdclang++

################################################################################
# Converting Config to BenchmarkProcess Object
################################################################################

# Filling in Parameters With Defaults
# Convert Parameters to Benchmark Step(s)
# Benchmark Final
# NumBenchmarkSteps: 1

################################################################################
# Done Creating BenchmarkProcess Object
################################################################################


################################################################################
# Benchmark Step: Cijk_Alik_Bljk_BBS_BH_00 - 00_Final 24.415s
# Num Sizes: 1
# Fork Parameters:
#     1LDSBuffer: [0, 1]
#     DepthU: [32, 64, 128]
#     GlobalReadVectorWidth: [2, 4, 8]
#     MatrixInstruction: [[32, 32, 4, 1, 1, 4, 2, 2, 2], [32, 32, 8, 1, 1, 4, 2, 2, 2]]
#     StaggerUStride: [128, 256]
#     StoreRemapVectorWidth: [0, 4]
#     ThreadTile: [[1, 32], [2, 32], [4, 32], [1, 64], [2, 64]]
# Using cached solution data
loading config file /home/test/git/Tensile/Tensile/bin/build/1_BenchmarkProblems/Cijk_Alik_Bljk_BBS_BH_00/00_Final/build/../source/ClientParameters.ini
Loading /home/test/git/Tensile/Tensile/bin/build/1_BenchmarkProblems/Cijk_Alik_Bljk_BBS_BH_00/00_Final/source/library/Kernels.so-000-gfx942.hsaco
Loading /home/test/git/Tensile/Tensile/bin/build/1_BenchmarkProblems/Cijk_Alik_Bljk_BBS_BH_00/00_Final/source/library/TensileLibrary_gfx942.co
Log level: Debug
run,problem-progress,solution-progress,operation,problem-sizes,solution,validation,time-us,gflops,empty,total-gran,tiles-per-cu,num-cus,tile0-gran,tile1-gran,cu-gran,wave-gran,mem-read-bytes,mem-write-bytes,temp-edge,clock-sys,clock-soc,clock-mem,fan-rpm,hardware-samples,gfx-frequency(median),power(median),hotspot-temperature(median),enqueue-time
0,0/0,0/5,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA2_GLVWB2_GRVW2_K1_LPA2_LPB2_SRVW0_VW2_VWB2,PASSED,4038.58,136126,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:12:55.
0,0/0,1/5,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA4_GLVWB4_GRVW4_K1_LPA2_LPB2_SRVW0_VW2_VWB2,PASSED,4158.49,132201,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:12:55.
0,0/0,2/5,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS3_GLVWA8_GLVWB8_GRVW8_K1_LPA2_LPB2_SRVW0_VW2_VWB2,PASSED,4142.57,132709,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:12:55.
0,0/0,3/5,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA2_GLVWB2_GRVW2_K1_LPA1_LPB1_SRVW4_VW1_VWB1,PASSED,3291.42,167027,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:12:55.
0,0/0,4/5,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA4_GLVWB4_GRVW4_K1_LPA1_LPB1_SRVW4_VW1_VWB1,PASSED,3352.08,164005,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:12:55.
0,0/0,5/5,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_AMAS0_GLVWA8_GLVWB8_GRVW8_K1_LPA1_LPB1_SRVW4_VW1_VWB1,PASSED,3085.7,178162,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:12:56.
################################################################################
# Cijk_Alik_Bljk_BBS_BH_00
# 00_Final: End - 66.291s
################################################################################

clientExit=0 (PASS) for ['/home/test/git/Tensile/Tensile/bin/config/bgemm_tn_normal.yaml']


################################################################################
# Analysing data in 2_BenchmarkData - 66.293s
################################################################################

# Analyzing: Cijk_Alik_Bljk_BBS_BH
# Read: /home/test/git/Tensile/Tensile/bin/build/2_BenchmarkData/Cijk_Alik_Bljk_BBS_BH_00.yaml
# Merging Solutions:

                                                    
# LogicAnalyzer .............................. 100.0%
# NumProblemSizes: [0, 0, 0, 0, 0, 0, 0, 0]
reading datafile /home/test/git/Tensile/Tensile/bin/build/2_BenchmarkData/Cijk_Alik_Bljk_BBS_BH_00.csv
# ExactWinners: {(8192, 8192, 32, 128, 8192, 8192, 128, 128): [5, 178162.0]}
problemIndicesForGlobalRange []
Winners {5}
( 0) Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_AMAS0_GLVWA8_GLVWB8_GRVW8_K1_LPA1_LPB1_SRVW4_VW1_VWB1 : Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_1LDSB0_APM1_AAIGTEn1_AAILTEn1_AAV0_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASBE01_ASCE01_ASDE01_ASEM1_AAC0_BL1_BS1_CDO0_CTDA0_CLR0_DSK0_DRA0_DU32_DULD1_DTLA0_DTLB0_DTVA0_DTVB0_DAF0_DKP0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_F16AI0_F16AIR0_FL0_GLVWA8_GLVWB8_GR2A1_GR2B1_GRCGA1_GRCGB1_GRCVA1_GRCVB1_GRPM1_GRVW8_GSU1_GSUASB_GSUAA0_GSUSARR1_GSUWGMRR0_GLS0_ISA942_IU1_IA0_KLA_LEL0_LBSPPA2048_LBSPPB1024_LPA1_LPB1_LDL1_LR2A1_LR2B1_LRVW4_LW2A1_LW2B1_LWPMn1_LDW0_LT1_FMA_MIAV0_MTSM64_MTSM1_MDA2_MI32_32_8_1_MO40_MVN256_MMFSC_MKFGSU256_MVN0_NR0_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PBD0_PFD1_PG2_PSD0_PSLn1_PWCn1_PWLn1_PK0_PKAB0_PAP0_PAPM0_PGR1_PLR1_PKA0_RK0_SGR1_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIU0_SCIUE0_SCIUI1_SCIUPL0_SPO0_SRVW4_SSO0_SVW4_SK0_SKA0_SKXCCM0_SNLL0_TSGRA0_TSGRB0_TT4_64_TLDS0_UIIDU0_UMLDSA0_UMLDSB0_UMF0_U64SL1_UIOFGRO0_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG64_4_1_WGM0_WGMTB

[] > UserWarning: [] MultiProblem & NotLastIndex :: nextRule==None; returning

# Score: 0 ms
# Exact Logic:

{(8192, 8192, 32, 128, 8192, 8192, 128, 128): [0, 178162.0]}
################################################################################
# Finish Analysing data to /home/cbrownle/git/Tensile/Tensile/bin/build in 66.327s
################################################################################


LogicFiles: ['/home/test/git/Tensile/Tensile/bin/build/3_LibraryLogic/aquavanjaram_Cijk_Alik_Bljk_BBS_BH.yaml']

################################################################################
# Tensile Create Library
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
# Detected local GPU with ISA: gfx942
> UserWarning: HardwareMonitor currently disabled for gfx941, gfx942, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201
# Found hipcc version 6.2.41134-65d174c3e
> UserWarning: ISA (12, 0, 0) isn't supported for ROCm stack 6.2, skipping...
> UserWarning: ISA (12, 0, 1) isn't supported for ROCm stack 6.2, skipping...
# CodeObjectVersion: default
# CxxCompiler:       amdclang++
# Architecture:      all
# LibraryFormat:     msgpack
# LibraryLogicFiles: found 1 files
#      set --verbose=2 to view all files
# Writing Custom CMake
# Writing Kernels...
# Kernel Building elapsed time = 11.9 secs
# Tensile Library Writer DONE
################################################################################

loading config file /home/cbrownle/git/Tensile/Tensile/bin/build/4_LibraryClient/source/ClientParameters_Cijk_Alik_Bljk_BBS_BH.ini
Loading /home/test/git/Tensile/Tensile/bin/build/4_LibraryClient/library/Kernels.so-000-gfx942.hsaco
Loading /home/test/git/Tensile/Tensile/bin/build/4_LibraryClient/library/TensileLibrary_gfx942.co
Log level: Debug
run,problem-progress,solution-progress,operation,problem-sizes,solution,validation,time-us,gflops,empty,total-gran,tiles-per-cu,num-cus,tile0-gran,tile1-gran,cu-gran,wave-gran,mem-read-bytes,mem-write-bytes,temp-edge,clock-sys,clock-soc,clock-mem,fan-rpm,hardware-samples,gfx-frequency(median),power(median),hotspot-temperature(median),enqueue-time
0,0/0,1/1,Contraction_l_Alik_Bljk_Cijk_Dijk,"(8192,8192,32,128)",Cijk_Alik_Bljk_BBS_BH_MT256x128x32_MI32x32x8x1_SN_K1,PASSED,3087.15,178079,,0.998051,215.579,304,1,1,0.998051,1,10737418240,4294967296,,,,,,,,,,2025-01-03 17:14:15.

@tcgu-amd

tcgu-amd commented Jan 7, 2025

> First, I will try to test it on other agents as you suggested, but I have some questions. Could it be that the GPU and CPU servers are made by different manufacturers?

Could you elaborate? Thanks!

@seungmanhan
Author

When we previously verified the performance of MI300, there were issues with Supermicro and Gigabyte servers, but no issues with AMD's Dell servers.
This was probably due to hardware or firmware differences. So at the time, we were able to resolve this issue through software, bypassing hardware settings.
Could a similar server-dependent validation issue be occurring this time as well?

@tcgu-amd

Hi @seungmanhan, sorry for the late update. Despite our best efforts, we are still unable to reproduce this error, and we are still unsure of the potential causes. However, one interesting thing in your log
[0] elem=0 idx=0: 0!=17
[1] elem=131101 idx=131101: 0!=72
[2] elem=262202 idx=262202: 0!=-64
[3] elem=393303 idx=393303: 0!=63
is that the reported mismatches are exactly 131101 indices apart and the device returns 0 each time instead of the correct value, which makes me wonder whether the exact same indices fail on other cards as well. If they do, that would strongly suggest a failure in Tensile and give us a lead. If every card is different, the problem could instead be hardware- or system-related.
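
The spacing and the cross-card comparison can be checked mechanically. The following is a small sketch, assuming mismatch lines keep the `[n]  elem=X idx=Y: device!=reference` shape shown above; the helper names are hypothetical. Whether the constant spacing comes from the kernel or from how the client samples elements for validation is not established in this thread.

```python
import re

# Matches lines such as: [1]  elem=131101 idx=131101: 0!=72
MISMATCH = re.compile(r"\[\d+\]\s+elem=\d+\s+idx=(\d+):\s*(\S+)!=(\S+)")


def mismatches(log_text):
    """Return (index, device_value, reference_value) for each reported mismatch."""
    return [(int(m.group(1)), m.group(2), m.group(3))
            for m in MISMATCH.finditer(log_text)]


def strides(indices):
    """Differences between consecutive mismatch indices."""
    return [b - a for a, b in zip(indices, indices[1:])]


def same_indices(log_a, log_b):
    """True if two logs report mismatches at exactly the same set of indices."""
    return ({i for i, _, _ in mismatches(log_a)}
            == {i for i, _, _ in mismatches(log_b)})


sample = """[0]  elem=0 idx=0: 0!=17
[1]  elem=131101 idx=131101: 0!=72
[2]  elem=262202 idx=262202: 0!=-64
[3]  elem=393303 idx=393303: 0!=63"""

found = mismatches(sample)
print(strides([i for i, _, _ in found]))   # spacing between reported indices
print(all(dev == "0" for _, dev, _ in found))  # does the device always return 0?
```

Running `same_indices` on the logs from two different cards answers the question above directly.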

Regarding issues with other servers, we are not really sure what the problems were, so we can only say that firmware and hardware incompatibility could be a potential cause. If you could provide more details then we can try to look into them.

Thanks!

@seungmanhan
Author

seungmanhan commented Jan 15, 2025

I will share our hardware and vbios as well. I will also check the idx error you mentioned. Thank you.

$ cat /sys/class/dmi/id/product_name
G593-ZX1-AAX1-000
$ rocm-smi --showhw
====================================== ROCm System Management Interface ======================================
=========================================== Concise Hardware Info ============================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS             BUS           PARTITION ID  
0    2     0x74a1  33267  gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:06:00.0  0             
1    3     0x74a1  51499  gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:27:00.0  0             
2    4     0x74a1  29122  gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:46:00.0  0             
3    5     0x74a1  43483  gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:66:00.0  0             
4    6     0x74a1  8594   gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:86:00.0  0             
5    7     0x74a1  63883  gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:A6:00.0  0             
6    8     0x74a1  53667  gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:C6:00.0  0             
7    9     0x74a1  2490   gfx942   ENABLED  ENABLED   ENABLED  113-MI3SRIOV-001  0000:E6:00.0  0             
==============================================================================================================
============================================ End of ROCm SMI Log =============================================
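
To confirm that every GPU reports the same VBIOS, the table can be parsed with a few lines of Python. This is a sketch that assumes the exact column layout printed above (11 whitespace-separated fields per GPU row, with VBIOS as the ninth); other `rocm-smi` versions may print different columns.

```python
def vbios_by_gpu(showhw_text):
    """Map GPU index -> VBIOS string from `rocm-smi --showhw` concise output.

    Assumes data rows have 11 whitespace-separated fields and start with the
    numeric GPU index; header and separator lines are skipped.
    """
    result = {}
    for line in showhw_text.splitlines():
        fields = line.split()
        if len(fields) == 11 and fields[0].isdigit():
            result[int(fields[0])] = fields[8]
    return result


def all_same_vbios(showhw_text):
    """True if every GPU row reports an identical VBIOS string."""
    return len(set(vbios_by_gpu(showhw_text).values())) == 1
```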

@seungmanhan
Author

And if possible, please share the manufacturer and vbios information you are using.

@tcgu-amd

tcgu-amd commented Jan 15, 2025

> And if possible, please share the manufacturer and vbios information you are using.

Unfortunately, we are not allowed to share hardware details of our test systems, but thank you for sharing the information. We will see if there is anything we can find.
