Thanks @keryell1, that was quite helpful.
I’ve been able to get a boilerplate working, for tiling based matmul
/* M = N = P */
/* window size TS x TS */
size_t TS = 4;
/* Create buffers */
buffer<T, 1> a(a_host.data(), range<1>{a_host.size()});
buffer<T, 1> b(b_host.data(), range<1>{b_host.size()});
buffer<T, 1> c(c_gpu.data(), range<1>{c_gpu.size()});
auto e = q.submit([&](handler& h) {
/* Create accessors */
auto A = a.template get_access<access::mode::read>(h);
auto B = b.template get_access<access::mode::read>(h);
auto C = c.template get_access<access::mode::write>(h);
/* Local accessor TILES: hyperfast cache */
accessor<T, 2, access::mode::read_write, access::target::local> Asub(range<2>{TS, TS}, h);
accessor<T, 2, access::mode::read_write, access::target::local> Bsub(range<2>{TS, TS}, h);
/* Create kernel */
sycl::stream out(1024, 256, h);
h.parallel_for(nd_range<2>(range<2>(M, P), range<2>(TS, TS)), [=](nd_item<2> item) {
/* row, col thread identifier for each tile */
size_t row = item.get_local_id(0);
size_t col = item.get_local_id(1);
/* row, col thread identifer of C */
size_t globalRow = TS * item.get_global_id(0) + row;
size_t globalCol = TS * item.get_global_id(1) + col;
auto acc = 0;
/* loop over all tiles */
const size_t num_tiles = P / TS;
for (size_t t = 0; t < num_tiles; t++) {
/* Load one tile of A and B into cache */
const size_t tiledRow = TS * t + row;
const size_t tiledCol = TS * t + col;
Asub[row][col] = A[globalRow * M + tiledRow];
Bsub[row][col] = B[tiledCol * N + globalCol];
/* Barrier to sync the read-write */
item.barrier(access::fence_space::local_space);
/* Do the matmul between Asub and Bsub */
for (size_t k = 0; k < TS; k++) {
acc += Asub[row][k] * Bsub[k][col];
}
/* Barrier to sync the read-write */
item.barrier(access::fence_space::local_space);
/* Write from cache to host memory */
C[globalRow * M + globalCol] = acc;
}
});
});
e.wait();
But, now I seem to have a problem in deducing get_group_id(0)
equivalent in SYCL. One option I found out to be, was to use item.get_group().get_id(0)
. That didn’t give me correct results either.
Plus, another important info: I am trying to do the matmul in row-major format, as opposed to col-major, used by cederic in his openCL implementation.
Any suggestions, what is going wrong here?