Version | Cost (ms) |
---|---|
g++ -std=c++11 SLIC.cpp -o SLIC + srun -p amd_256 -w fa0814 -n 1 ./SLIC | p1 cost : 3351 ms p2 cost : 283 ms p3 cost : 21350 ms p4 cost : 377 ms Computing time=25364 ms |
Compiler flags + OpenMP for p1
Version | Cost (ms) |
---|---|
g++ -std=c++11 SLIC.cpp -o SLIC -O3 + srun -p amd_256 -w fa0814 -n 1 ./SLIC | p1 cost : 3048 ms p2 cost : 50 ms p3 cost : 5435 ms p4 cost : 118 ms Computing time=8654 ms |
P1 omp thread 1 | p1 cost : 3028 ms p2 cost : 31 ms p3 cost : 5339 ms p4 cost : 113 ms Computing time=8512 ms |
P1 omp thread 2 | p1 cost : 1548 ms p2 cost : 35 ms p3 cost : 5370 ms p4 cost : 115 ms Computing time=7070 ms |
P1 omp thread 4 | p1 cost : 808 ms p2 cost : 32 ms p3 cost : 5376 ms p4 cost : 114 ms Computing time=6332 ms |
P1 omp thread 8 | p1 cost : 408 ms p2 cost : 33 ms p3 cost : 5366 ms p4 cost : 112 ms Computing time=5922 ms |
P1 omp thread 16 | p1 cost : 208 ms p2 cost : 34 ms p3 cost : 5359 ms p4 cost : 115 ms Computing time=5718 ms |
P1 omp thread 32 | p1 cost : 107 ms p2 cost : 33 ms p3 cost : 5384 ms p4 cost : 113 ms Computing time=5639 ms |
P1 omp thread 64 | p1 cost : 57 ms p2 cost : 34 ms p3 cost : 5382 ms p4 cost : 114 ms Computing time=5589 ms |
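
The table can be read as a speedup calculation. The sketch below is only a helper for that arithmetic; the function names `speedup`, `efficiency` and `amdahl_limit` are made up for illustration and are not part of SLIC.cpp. Using the numbers above, p1 goes from 3028 ms to 57 ms at 64 threads (about 53x), while the total only improves from 8512 ms to 5589 ms (about 1.5x), which is close to the Amdahl bound when p1 is the only parallel phase.

```cpp
#include <cassert>

// Speedup of a run relative to a baseline; both times in milliseconds.
double speedup(double baseline_ms, double current_ms) {
    assert(current_ms > 0.0);
    return baseline_ms / current_ms;
}

// Parallel efficiency: achieved speedup divided by the number of threads.
double efficiency(double baseline_ms, double current_ms, int threads) {
    assert(threads > 0);
    return speedup(baseline_ms, current_ms) / threads;
}

// Amdahl bound: best possible whole-program speedup when only a fraction
// `parallel_fraction` of the runtime is accelerated, however many threads.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

For example, `amdahl_limit(3028.0 / 8512.0)` is about 1.55, so after this step p3 is the phase worth attacking.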
![image-20210630233122134](/Users/ylf9811/Library/Application Support/typora-user-images/image-20210630233122134.png)
![image-20210701000128734](/Users/ylf9811/Library/Application Support/typora-user-images/image-20210701000128734.png)
![image-20210701000840184](/Users/ylf9811/Library/Application Support/typora-user-images/image-20210701000840184.png)
![image-20210701002611742](/Users/ylf9811/Library/Application Support/typora-user-images/image-20210701002611742.png)
P3 omp
Version | Cost (ms) |
---|---|
Enumerate i first, then n, to eliminate write conflicts, thread 1 | p1 cost : 86 ms p1.5 cost : 33 ms p2 cost : 0 ms offset 227 minCost : 40205 p3 cost : 40956 ms p4 cost : 113 ms Computing time=41191 ms |
Enumerate i first, then n, to eliminate write conflicts, thread 64 | p1 cost : 86 ms p1.5 cost : 34 ms p2 cost : 0 ms offset 227 minCost : 831.787 p3 cost : 1624 ms p4 cost : 114 ms Computing time=1860 ms |
Still enumerate n first, parallelize over the y loop, thread 64 | p1 cost : 87 ms p1.5 cost : 33 ms p2 cost : 0 ms offset 227 minCost : 488.989 minCost2 : 0 minCost3 : 148.904 minCost4 : 0.032 minCost5 : 599.678 minCost6 : 0.053 p3 cost : 1263 ms p4 cost : 136 ms Computing time=1522 ms |
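
The two loop orders in the table can be sketched as below. This is a simplified illustration, not the code in SLIC.cpp: the distance uses only the spatial part, every cluster is compared against every pixel (no search window), and the names `Center`, `dist2`, `assign_cluster_major` and `assign_pixel_major` are hypothetical. The point is only that with the pixel index `i` outermost each iteration writes its own `labels[i]`, so the outer loop can be parallelized without write conflicts, at the price of much more work per pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Center { double x, y; };  // hypothetical: spatial part of a cluster center only

// Squared spatial distance from pixel (px, py) to a center.
static double dist2(int px, int py, const Center& c) {
    const double dx = px - c.x, dy = py - c.y;
    return dx * dx + dy * dy;
}

// Cluster-major order (n outer, pixels inner). Different clusters update the
// same pixel, so parallelizing the n loop would race on best[] and labels[].
std::vector<int> assign_cluster_major(int w, int h, const std::vector<Center>& ks) {
    assert(w > 0 && h > 0);
    std::vector<double> best(static_cast<std::size_t>(w) * h,
                             std::numeric_limits<double>::max());
    std::vector<int> labels(best.size(), -1);
    for (std::size_t n = 0; n < ks.size(); ++n)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                const std::size_t i = static_cast<std::size_t>(y) * w + x;
                const double d = dist2(x, y, ks[n]);
                if (d < best[i]) { best[i] = d; labels[i] = static_cast<int>(n); }
            }
    return labels;
}

// Pixel-major order (i outer, n inner). Each iteration writes only labels[i],
// so the outer loop is safe to run in parallel.
std::vector<int> assign_pixel_major(int w, int h, const std::vector<Center>& ks) {
    assert(w > 0 && h > 0);
    std::vector<int> labels(static_cast<std::size_t>(w) * h, -1);
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(labels.size()); ++i) {
        const int x = static_cast<int>(i % w), y = static_cast<int>(i / w);
        double best = std::numeric_limits<double>::max();
        for (std::size_t n = 0; n < ks.size(); ++n) {
            const double d = dist2(x, y, ks[n]);
            if (d < best) { best = d; labels[i] = static_cast<int>(n); }
        }
    }
    return labels;
}
```

Both functions break ties in favour of the lower cluster index, so they return the same labels. Without `-fopenmp` the pragma is ignored and the code runs serially.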
g++ -S -fverbose-asm -g -std=c++11 -O3 -fopenmp -march=native SLIC.cpp -o SLIC.s
as -alhnd SLIC.s > SLIC.lst
The listing still shows only xmm registers, so the code has not been vectorized yet.
//TODO replace vector with plain arrays
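
A minimal sketch of the kind of loop the vector-to-array TODO is aiming for, assuming the three LAB channels are stored in separate plain `double` arrays. The function `lab_dist2` and its parameter names are hypothetical and not taken from SLIC.cpp. With raw pointers and `#pragma omp simd` (honoured by g++ when `-fopenmp` or `-fopenmp-simd` is given, ignored otherwise) the compiler is allowed to vectorize the loop; in the `.lst` listing, packed instructions (`pd` suffix, typically on `ymm` registers with AVX2) would indicate that it did, while scalar `sd` instructions on `xmm` would indicate that it did not.

```cpp
#include <cassert>
#include <cstddef>

// Squared LAB distance from every pixel to one center (cl, ca, cb).
// l, a, b and out are arrays of length n; out must not overlap the inputs.
void lab_dist2(const double* l, const double* a, const double* b,
               double cl, double ca, double cb,
               double* out, std::size_t n) {
    assert(l && a && b && out);
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        const double dl = l[i] - cl;
        const double da = a[i] - ca;
        const double db = b[i] - cb;
        out[i] = dl * dl + da * da + db * db;
    }
}
```

GCC can also report its vectorization decisions directly with `-fopt-info-vec`, which may be quicker than reading the listing.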
Replace some of the vectors, remove unused variables | p1 cost : 101 ms p1.5 cost : 31 ms p2 cost : 0 ms offset 227 minCost1 : 436.859 minCost2 : 0 minCost3 : 118.949 minCost4 : 0.024 minCost5 : 585.219 minCost6 : 0.048 p3 cost : 1141 ms p4 cost : 153 ms Computing time=1429 ms |
Replace all vectors; parallelize minCost3 and minCost5 (with per-thread copies) | p1 cost : 86 ms p1.5 cost : 33 ms p2 cost : 0 ms offset 227 minCost0 : 0.611 minCost1 : 412.394 minCost2 : 0 minCost3 : 12.588 minCost4 : 0 minCost5 : 24.805 minCost6 : 0.04 p3 cost : 450 ms p4 cost : 159 ms Computing time=730 ms |
Parallelize everything in p3 | p1 cost : 72 ms p1.5 cost : 34 ms p2 cost : 0 ms offset 227 minCost0 : 0.526 minCost1 : 321.75 minCost2 : 0 minCost3 : 12.799 minCost4 : 0 minCost5 : 24.613 minCost6 : 0.172 p3 cost : 359 ms p4 cost : 160 ms Computing time=627 ms |
After minCost1, record which pixels belong to each cluster center in G[numk] | p1 cost : 87 ms p1.5 cost : 33 ms p2 cost : 0 ms offset 227 minCost0 : 0.4 minCost1 : 393.509 minCost2 : 56.777 minCost3 : 23.661 minCost4 : 0 minCost5 : 52.05 minCost6 : 0 p3 cost : 527 ms minCost0 : 4.177 minCost1 : 124.07 p4 cost : 132 ms Computing time=780 ms |
icpc -par-affinity=compact | p1 cost : 87 ms p1.5 cost : 33 ms p2 cost : 0 ms offset 227 minCost : 488.989 minCost2 : 0 minCost3 : 148.904 minCost4 : 0.032 minCost5 : 599.678 minCost6 : 0.053 p3 cost : 1263 ms p4 cost : 136 ms Computing time=1522 ms |
p1 cost : 28 ms p1.5 cost : 31 ms p2 cost : 0 ms offset 227 minCost0 : 0.521 minCost1 : 300.9 minCost2 : 0 minCost3 : 18.944 minCost4 : 0 minCost5 : 26.783 minCost6 : 0.078 p3 cost : 347 ms minCost0 : 4.071 minCost1 : 133.902 p4 cost : 141 ms Computing time=550 ms |
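
The "per-thread copies" approach used for minCost3 and minCost5 can be sketched as follows. This assumes those phases accumulate per-cluster sums, which the notes do not state explicitly; the function `sum_by_label` and its parameters are hypothetical. Each thread accumulates into its own private array and the copies are merged once at the end, so the hot loop has no shared writes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum values[i] into the bin selected by labels[i].
// Every label must lie in [0, num_labels).
std::vector<double> sum_by_label(const std::vector<int>& labels,
                                 const std::vector<double>& values,
                                 int num_labels) {
    assert(labels.size() == values.size());
    assert(num_labels >= 0);
    std::vector<double> total(static_cast<std::size_t>(num_labels), 0.0);
    #pragma omp parallel
    {
        // Private copy of the bins for this thread.
        std::vector<double> local(static_cast<std::size_t>(num_labels), 0.0);
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(labels.size()); ++i)
            local[labels[i]] += values[i];
        // Merge the private copy into the shared result, one thread at a time.
        #pragma omp critical
        for (int k = 0; k < num_labels; ++k)
            total[k] += local[k];
    }
    return total;
}
```

The merge costs `num_labels` additions per thread, which is negligible when there are a few hundred clusters and millions of pixels. Note that the floating-point summation order depends on the thread count, so results may differ in the last bits between runs.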
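//TODO minCost1 still uses xmm; check whether auto-vectorization can be made to work, and if not, start writing a manual version
//TODO p4 seems to have some dependencies; I have not read the code carefully yet and have not parallelized it
For the last hotspot I tried a parallel BFS, but the amount of data per search and the branching factor (== 4) are too small, so it did not help much.
Today I had an idea: precompute which pixels belong to each block (that is, the final nlabels). Then walk through the original code's logic as before, but replace the BFS with a direct loop over the precomputed arrays. The preprocessing can be parallelized by class (that is, by the clustering result): there are no dependencies between classes, and within a class we iterate over all of its points. Note that many of the OpenMP regions in this part are given their own thread count, because the data is small or memory access is slower. In the end p4 goes from roughly 140 ms to 40 ms, and the total time is around 450 ms.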
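
A minimal sketch of the preprocessing idea, under the assumption that it amounts to grouping pixel indices by label and then looping over each group independently. The names `LabelBuckets`, `bucket_by_label` and `relabel` are hypothetical, and the per-group work here (writing a new label) is only a stand-in for whatever the real p4 code does per block. The `num_threads` clause shows how one region can be given its own thread count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pixels grouped by label: group k is pixels[start[k]] .. pixels[start[k+1]-1].
struct LabelBuckets {
    std::vector<int> start;
    std::vector<int> pixels;
};

// Counting sort of pixel indices by label; labels must lie in [0, num_labels).
LabelBuckets bucket_by_label(const std::vector<int>& labels, int num_labels) {
    assert(num_labels >= 0);
    LabelBuckets b;
    b.start.assign(static_cast<std::size_t>(num_labels) + 1, 0);
    for (int l : labels) {
        assert(l >= 0 && l < num_labels);
        ++b.start[l + 1];                       // count pixels per label
    }
    for (int k = 0; k < num_labels; ++k)
        b.start[k + 1] += b.start[k];           // prefix sum gives group offsets
    b.pixels.resize(labels.size());
    std::vector<int> next(b.start.begin(), b.start.end() - 1);  // write cursors
    for (std::size_t i = 0; i < labels.size(); ++i)
        b.pixels[next[labels[i]]++] = static_cast<int>(i);
    return b;
}

// Give every pixel of group k the label new_label[k]. Groups are disjoint,
// so the loop over k has no dependencies between iterations.
std::vector<int> relabel(const LabelBuckets& b, const std::vector<int>& new_label,
                         int threads) {
    const int num_labels = static_cast<int>(b.start.size()) - 1;
    assert(threads > 0);
    assert(static_cast<int>(new_label.size()) == num_labels);
    std::vector<int> out(b.pixels.size(), -1);
    #pragma omp parallel for num_threads(threads)
    for (int k = 0; k < num_labels; ++k)
        for (int j = b.start[k]; j < b.start[k + 1]; ++j)
            out[b.pixels[j]] = new_label[k];
    return out;
}
```

Within a group the pixel indices come out in increasing order, so a loop over a group visits pixels in the same scan order as the original image.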
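//TODO try the if-removal trick that Ming suggested; the preprocessing might then vectorize.
Today I changed the loop order of the biggest hotspot and did some simple constant-factor optimization. On case 1 it is about as fast as the version that parallelizes the y loop, but thread scalability is much better (in other words, the single-thread run is slow), which makes it a better fit for a multi-node implementation.
Single-node version after swapping the loop order: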
m_spcount is 200
input image is input_image.ppm
check image is check.ppm
p1 cost : 31 ms
p1.5 cost : 34 ms
p2 cost : 0 ms
offset 227
minCost0 : 0.535
minCost1 : 329.563
minCost2 : 0
minCost3 : 8.379
minCost4 : 0
minCost5 : 17.494
minCost6 : 0.068
p3 cost : 356 ms
minCost0 : 2.764
minCost1 : 11.048
minCost2 : 15.592
minCost3 : 12.864
minCost4 : 0.018
p4 cost : 45 ms
Computing time=468 ms
There are 0 points' labels are different from original file.
m_spcount is 400
input image is input_image2.ppm
check image is check2.ppm
p1 cost : 57 ms
p1.5 cost : 75 ms
p2 cost : 0 ms
offset 246
minCost0 : 0.741
minCost1 : 1298.71
minCost2 : 0
minCost3 : 26.344
minCost4 : 0
minCost5 : 45.359
minCost6 : 0.068
p3 cost : 1372 ms
minCost0 : 4.262
minCost1 : 27.086
minCost2 : 44.895
minCost3 : 32.516
minCost4 : 0.028
p4 cost : 129 ms
Computing time=1636 ms
There are 0 points' labels are different from original file.
m_spcount is 150
input image is input_image3.ppm
check image is check3.ppm
p1 cost : 24 ms
p1.5 cost : 31 ms
p2 cost : 0 ms
offset 222
minCost0 : 0.502
minCost1 : 184.263
minCost2 : 0
minCost3 : 3.988
minCost4 : 0
minCost5 : 11.883
minCost6 : 0.071
p3 cost : 201 ms
minCost0 : 2.701
minCost1 : 7.967
minCost2 : 15.395
minCost3 : 15.243
minCost4 : 0.01
p4 cost : 43 ms
Computing time=301 ms
There are 0 points' labels are different from original file.
Two-node version:
Process 1 of 2 ,processor name is fb0602.para.bscc
Process 0 of 2 ,processor name is fb0506.para.bscc
m_spcount is 200
input image is input_image.ppm
check image is check.ppm
p1 cost : 32 ms
process 1 word 5065451 to 10130902
p1.5 cost : 35 ms
p2 cost : 0 ms
offset 227
process 0 word 0 to 5065451
p3 cost : 244 ms
minCost0 : 1.83
minCost1 : 186.935
minCost2 : 2.856
minCost3 : 6.815
minCost4 : 3.358
minCost5 : 0
minCost6 : 0
p3 cost : 236 ms
numlabels 196
1done
1Computing time=312 ms
mx 195
minCost0 : 4.034
minCost1 : 13.293
minCost2 : 13.104
minCost3 : 12.306
minCost4 : 0.011
p4 cost : 45 ms
0done
0Computing time=350 ms
0There are 0 points' labels are different from original file.
Process 1 of 2 ,processor name is fb0602.para.bscc
Process 0 of 2 ,processor name is fa1013.para.bscc
m_spcount is 400
input image is input_image2.ppm
check image is check2.ppm
p1 cost : 58 ms
process 1 word 12000000 to 24000000
p1.5 cost : 79 ms
p2 cost : 0 ms
offset 246
process 0 word 0 to 12000000
minCost0 : 1.622
p3 cost : 943 ms
1done
1Computing time=1084 ms
minCost1 : 752.359
minCost2 : 9.655
minCost3 : 36.731
minCost4 : 6.81
minCost5 : 0
minCost6 : 0
p3 cost : 925 ms
numlabels 384
mx 383
minCost0 : 10.138
minCost1 : 38.086
minCost2 : 42.72
minCost3 : 31.405
minCost4 : 0.028
p4 cost : 142 ms
0done
0Computing time=1206 ms
0There are 0 points' labels are different from original file.
Process 0 of 2 ,processor name is fa1013.para.bscc
Process 1 of 2 ,processor name is fb0602.para.bscc
m_spcount is 150
input image is input_image3.ppm
check image is check3.ppm
p1 cost : 26 ms
p1.5 cost : 30 ms
p2 cost : 0 ms
offset 222
process 1 word 3657528 to 7315056
process 0 word 0 to 3657528
minCost0 : 1.579
p3 cost : 141 ms
1done
minCost1 : 105.331
minCost2 : 2.069
minCost3 : 5.511
minCost4 : 2.912
minCost5 : 0
minCost6 : 0
p3 cost : 142 ms
numlabels 147
1Computing time=195 ms
mx 146
minCost0 : 4.52
minCost1 : 11.935
minCost2 : 15.542
minCost3 : 17.845
minCost4 : 0.016
p4 cost : 52 ms
0done
0Computing time=252 ms
0There are 0 points' labels are different from original file.
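
The `process r word a to b` lines in the two-node logs are consistent with splitting the `wi * he` pixels evenly across ranks (for example 2599 x 3898 = 10130902, split at 5065451). A minimal sketch of such a block partition is given below; the function `block_range` is hypothetical, and whether the real code treats the upper bound as inclusive or exclusive is not visible from the log, so the half-open convention here is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Half-open range [begin, end) of items owned by `rank` out of `nranks`.
// When total is not divisible by nranks, the first (total % nranks) ranks
// each get one extra item.
std::pair<std::int64_t, std::int64_t> block_range(std::int64_t total,
                                                  int rank, int nranks) {
    assert(total >= 0);
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    const std::int64_t base = total / nranks;
    const std::int64_t extra = total % nranks;
    const std::int64_t begin =
        rank * base + std::min<std::int64_t>(rank, extra);
    const std::int64_t end = begin + base + (rank < extra ? 1 : 0);
    return {begin, end};
}
```

In an MPI program `rank` and `nranks` would typically come from `MPI_Comm_rank` and `MPI_Comm_size`; the function itself does not depend on MPI.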
For comparison, timings of the version two revisions back:
wi : 2599 he : 3898
p1 cost : 32 ms
p1.5 cost : 37 ms
p2 cost : 0 ms
offset 227
minCost0 : 0.548
minCost1 : 295.072
minCost2 : 0
minCost3 : 15.939
minCost4 : 0
minCost5 : 25.714
minCost6 : 0.079
p3 cost : 338 ms
minCost0 : 2.867
minCost1 : 10.801
minCost2 : 15.867
minCost3 : 12.675
minCost4 : 0.011
p4 cost : 45 ms
Computing time=453 ms
There are 0 points' labels are different from original file.
wi : 4000 he : 6000
p1 cost : 61 ms
p1.5 cost : 84 ms
p2 cost : 0 ms
offset 246
minCost0 : 0.875
minCost1 : 805.074
minCost2 : 0
minCost3 : 35.548
minCost4 : 0
minCost5 : 46.064
minCost6 : 0.086
p3 cost : 888 ms
minCost0 : 5.028
minCost1 : 27.845
minCost2 : 43.202
minCost3 : 31.55
minCost4 : 0.03
p4 cost : 128 ms
Computing time=1164 ms
There are 0 points' labels are different from original file.
wi : 2419 he : 3024
p1 cost : 25 ms
p1.5 cost : 29 ms
p2 cost : 0 ms
offset 222
minCost0 : 0.579
minCost1 : 233.842
minCost2 : 0
minCost3 : 15.255
minCost4 : 0
minCost5 : 28.462
minCost6 : 0.074
p3 cost : 301 ms
minCost0 : 3.359
minCost1 : 8.267
minCost2 : 18.033
minCost3 : 18.496
minCost4 : 0.017
p4 cost : 52 ms
Computing time=409 ms
There are 0 points' labels are different from original file.