@@ -77,38 +77,39 @@ Hyesun Hong,
 * CXL-PNM is the CXL variant for PNM, can work with multiple PIM
 
 SYCL Extension for PIM/PNM
- * Goals
- * Seamlessly integrate PIM/PNM operation into SYCL
- * Allow combination of xGPU and PIM/PNM in one device kernel
- * Not specific to one hardware
- * Design
- * Vector operations seem like a natural fit, but no convergence guarantee and vector size explicit
- * Model as special function unit
- * Aligns with trends to model special functional units inside accelerators
- * Compiler automatic mapping often not possible
- * joint_matrix
- * Group functions
- * Easy to use
- * Can easily be combined with device code
- * Give necessary convergence guarantees
- * Recap of SYCL work-item, work-group and group functions
- * Group functions must be encountered in converged control flow
+ * Work in collaboration with Codeplay Software team
+ * Goals
+ * Seamlessly integrate PIM/PNM operation into SYCL
+ * Allow combination of xGPU and PIM/PNM in one device kernel
+ * Not specific to one hardware
+ * Design
+ * Vector operations seem like a natural fit, but no convergence guarantee and vector size explicit
+ * Model as special function unit
+ * Aligns with trends to model special functional units inside accelerators
+ * Compiler automatic mapping often not possible
+ * joint_matrix
+ * Group functions
+ * Easy to use
+ * Can easily be combined with device code
+ * Give necessary convergence guarantees
+ * Recap of SYCL work-item, work-group and group functions
+ * Group functions must be encountered in converged control flow
 * Extension
- * Extended group functions with additional overload of joint_reduce and new joint_transform and joint_inner_product
- * Block size as template parameter, number of blocks as runtime parameter -> allows calculation of number of elements to process
+ * Extended group functions with additional overload of joint_reduce and new joint_transform and joint_inner_product
+ * Block size as template parameter, number of blocks as runtime parameter -> allows calculation of number of elements to process
 * Extension for PNM
- * Added new overloads of joint_exclusive_scan, joint_inclusive_scan, reduce_over_group
+ * Added new overloads of joint_exclusive_scan, joint_inclusive_scan, reduce_over_group
 * PNM standalone has less opportunity for parallelism, also limited by memory controller
- * -> Combine PNM and PIM, PNM generates commands for PIM blocks
+ * -> Combine PNM and PIM, PNM generates commands for PIM blocks
 * Two modes
  * PIM mode: PIM blocks can operate independently, can choose number of blocks
  * PNM mode: Synchronized execution on multiple PIM blocks
 * Mapping
  * Every PIM block is one work-item
  * PNM with attached PIM blocks forms one work-group
 * Execution
- * Work-item operations map to PIM operation
- * Group functions map to PNM operation
+ * Work-item operations map to PIM operation
+ * Group functions map to PNM operation
 * Example
  * work-item execution maps to PIM
  * group function maps to PNM
@@ -117,15 +118,15 @@ SYCL Extension for PIM/PNM
 
 Q&A
 * Are the proposed functions specific to PIM or could also be used with other HW?
- * Can also be used with other hardware. Semantics not PIM-specific, but translation of C++ to SYCL
- * Can also map nicely to other types of hardware, for example vector processor
+ * Can also be used with other hardware. Semantics not PIM-specific, but translation of C++ to SYCL
+ * Can also map nicely to other types of hardware, for example vector processor
 * Why have the user explicitly specify a block-size?
- * Not a hardware detail
- * Rather a promise by the user that data-blocks will always be at least that big
- * Promise allows device compiler to perform optimizations, efficient looping inside PIM unit
+ * Not a hardware detail
+ * Rather a promise by the user that data-blocks will always be at least that big
+ * Promise allows device compiler to perform optimizations, efficient looping inside PIM unit
 * Could the num_blocks runtime parameter be replaced by an iterator, requiring the length to be divisible by block-size?
- * Yes, that is possible, mainly a design question
- * Current version might have additional implications regarding alignment
+ * Yes, that is possible, mainly a design question
+ * Current version might have additional implications regarding alignment
 
 
 2023-06-05