M1 now has native NumPy and SciPy

NumPy

scipy

conda

native

42 replies • 2021-04-23 04:02:49 +08:00

1

pb941129

2020-12-09 15:39:45 +08:00 via iPhone

I'd like to know how much faster it is than the MKL build of NumPy on an Intel i9...

2

NoobX

2020-12-09 16:42:16 +08:00 via iPhone

But it tops out at 16 GB...

3

Goldilocks

2020-12-09 16:45:04 +08:00 via Android

Looking forward to benchmarks. My guess is it gets crushed by AVX-512.

4

felixcode

2020-12-09 19:43:51 +08:00 via Android

My VRAM is bigger than your RAM.

5

YUX

OP

2020-12-09 19:49:07 +08:00

@pb941129
@NoobX
@Goldilocks
@felixcode

I found a NumPy benchmark script and ran it: https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276

```
Dotted two 4096x4096 matrices in 0.53 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.59 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.74 s.

This was obtained using the following Numpy configuration:
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
```

P.S. Python version 3.9.1 (arm64). I closed all background apps before running.

6

pb941129

2020-12-09 19:58:15 +08:00

1

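@YUX Thx. These are the results from my 16-inch MBP with the i9, without closing background apps. Environment: Anaconda 3.8. It looks a bit faster than the M1. (Otherwise Intel really would cry.)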

```
Dotted two 4096x4096 matrices in 0.45 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.53 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/xxx/anaconda/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/xxx/anaconda/include']

```

7

changepc90

2020-12-09 20:12:20 +08:00

M1: Dotted two vectors of length 524288 in 0.25 ms.
MBP16: Dotted two vectors of length 524288 in 0.05 ms.
That's a big gap on this item.
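
For reference, the per-item ratios between the two result sets posted so far (M1 in reply 5, 16-inch MBP i9 in reply 6) can be worked out in a few lines of Python. This is only a sketch over the numbers quoted in this thread; the dictionary keys are labels made up here, not names from the benchmark script.

```python
# Timings copied from reply 5 (M1) and reply 6 (i9 MBP), in seconds.
# The dictionary keys are illustrative labels, not names from the benchmark script.
m1 = {"matmul": 0.53, "vecdot": 0.25e-3, "svd": 0.59, "cholesky": 0.08, "eig": 4.74}
i9 = {"matmul": 0.45, "vecdot": 0.05e-3, "svd": 0.32, "cholesky": 0.08, "eig": 3.53}

def slowdown(a, b):
    """Return how many times slower `a` is than `b` for each item."""
    return {name: a[name] / b[name] for name in a}

ratios = slowdown(m1, i9)
for name, r in ratios.items():
    print(f"{name:9s} M1 / i9 = {r:.2f}x")
```

On these numbers the vector dot product is the outlier at about 5x, while the other items fall between 1x and 2x.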

8

YUX

OP

2020-12-09 20:13:27 +08:00

@pb941129 Nice, the i9 is still stronger 😂 Were all 8 cores / 16 threads maxed out while it ran?

9

YUX

OP

2020-12-09 20:15:42 +08:00

@changepc90 That's probably down to instruction-set differences.

10

Aspector

2020-12-09 20:19:41 +08:00

1

i7-8550U in a T480s; the library is mkl_rt.

Dotted two 4096x4096 matrices in 1.07 s.
Dotted two vectors of length 524288 in 0.13 ms.
SVD of a 2048x1024 matrix in 0.53 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 5.07 s.

HWMonitor reports the 8550U's real-time power draw at roughly 40-45 W. The M1 is probably only about 20 W (sad).

11

YUX

OP

2020-12-09 20:21:59 +08:00

Sharing a friend's 16-inch with the 2.6 GHz 6-core Intel Core i7:

Dotted two 4096x4096 matrices in 0.49 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.07 s.
Eigendecomposition of a 2048x2048 matrix in 3.16 s.

12

YUX

OP

2020-12-09 20:24:36 +08:00

@Aspector The M1 in the Air is capped at 10 W 😂

13

pb941129

2020-12-09 20:25:33 +08:00 via iPhone

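@YUX I didn't check the task monitor, but knowing how NumPy behaves, I doubt it. We could wait until LightGBM is ported and then both run the CPU version (I once ran a hyperparameter search for a small project that pegged an entire 8700K for three hours).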

14

rock_cloud

2020-12-09 20:25:53 +08:00

1

2017 iMac 3.4Ghz Intel i5
Dotted two 4096x4096 matrices in 1.04 s.
Dotted two vectors of length 524288 in 0.17 ms.
SVD of a 2048x1024 matrix in 0.58 s.
Cholesky decomposition of a 2048x2048 matrix in 0.12 s.
Eigendecomposition of a 2048x2048 matrix in 5.37 s.
No background apps were closed.

15

YUX

OP

2020-12-09 20:26:54 +08:00

@pb941129 A three-hour stress test? Can I run it in the fridge 😂 With no fan I'm afraid it will get scorched.

16

sxd96

2020-12-09 20:31:25 +08:00

1

2018 13-inch MBP, i5-8259U

Dotted two 4096x4096 matrices in 0.80 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.35 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 3.39 s.

17

sxd96

2020-12-09 20:35:06 +08:00

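@sxd96 I feel a little better now. Again, no background apps closed, MKL library. But I've noticed that with all cores fully loaded the MBP produces a slight coil whine. ARM may be slightly behind on this for now, but in terms of performance per watt it may not be behind at all. I think performance per watt is what matters most for mobile devices.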

18

Gandum

2020-12-09 20:35:15 +08:00 via iPhone

It's still an early version. But it's winter now, so there's no rush; the fan isn't too noisy. I'll buy one next summer.

19

IgniteWhite

2020-12-09 20:35:29 +08:00 via iPhone

1

Haha, I posted about this five months ago: /t/688402

20

rock_cloud

2020-12-09 20:36:02 +08:00

1

Intel Xeon Silver 4114 2.2Ghz
Dotted two 4096x4096 matrices in 0.60 s.
Dotted two vectors of length 524288 in 0.04 ms.
SVD of a 2048x1024 matrix in 0.66 s.
Cholesky decomposition of a 2048x2048 matrix in 0.26 s.
Eigendecomposition of a 2048x2048 matrix in 6.67 s.

21

YUX

OP

2020-12-09 20:38:09 +08:00

1

@IgniteWhite You were way ahead of the curve 😂 It really is a good thing.

22

Tilie

2020-12-09 20:54:48 +08:00

1

Mac mini with 8th-gen i7
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.09 ms.
SVD of a 2048x1024 matrix in 0.56 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 5.20 s.

23

YUX

OP

2020-12-09 21:03:39 +08:00

Google Colab - 2 Intel(R) Xeon(R) CPU @ 2.20GHz

Dotted two 4096x4096 matrices in 4.16 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 1.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
Eigendecomposition of a 2048x2048 matrix in 13.11 s.

24

zr86

2020-12-09 21:14:01 +08:00

M1 Mac mini

Dotted two 4096x4096 matrices in 0.69 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.68 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.82 s.

25

wydinhk

2020-12-09 22:21:48 +08:00

M1 MacBook Pro

Dotted two 4096x4096 matrices in 0.68 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.71 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 5.03 s.

I also measured power draw with powermetrics: about 26 W for the first two items and about 16 W for the last three.
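
As a rough illustration of what these power figures mean per task, energy is power multiplied by time. The sketch below applies that to the timings in this reply. It assumes the 26 W and 16 W readings held constant for the whole duration of each item, which the post does not state.

```python
# Energy per benchmark item = average power (W) x elapsed time (s) = joules.
# Timings and power readings come from this reply; constant power per item is an assumption.
items = [
    ("matmul 4096x4096", 0.68, 26.0),
    ("vector dot", 0.25e-3, 26.0),
    ("SVD", 0.71, 16.0),
    ("Cholesky", 0.08, 16.0),
    ("eigendecomposition", 5.03, 16.0),
]

def energy_joules(seconds, watts):
    """Energy in joules for a task that runs `seconds` at a steady `watts`."""
    return seconds * watts

energy = {name: energy_joules(t, p) for name, t, p in items}
for name, joules in energy.items():
    print(f"{name:20s} {joules:8.4f} J")
```

The same function could be applied to any other machine in the thread for which both a timing and a power reading are available.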

26

lovestudykid

2020-12-10 03:17:17 +08:00

This test doesn't separate machines by much.
MF839: only about half as fast as the OP's M1.
Dotted two 4096x4096 matrices in 2.33 s.
Dotted two vectors of length 524288 in 0.54 ms.
SVD of a 2048x1024 matrix in 1.05 s.
Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
Eigendecomposition of a 2048x2048 matrix in 8.38 s.

Intel(R) Xeon(R) Gold 6134
Dotted two 4096x4096 matrices in 0.32 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 0.89 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 8.19 s.
The NumPy that Anaconda installed by default isn't using MKL and doesn't have AVX-512 enabled, so this CPU is going to waste.
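
One rough way to see which BLAS backend a NumPy build reports is to scan the text that `numpy.show_config()` prints, which is the kind of configuration dump pasted in several replies here. The sketch below is a heuristic only: the function name `guess_blas_backend` and its keyword list are assumptions made for illustration, and generic `blas`/`cblas` library names (as in the Miniforge output earlier in the thread) do not reveal the underlying implementation.

```python
import contextlib
import io

def guess_blas_backend(config_text):
    """Heuristically guess the BLAS backend from a NumPy config dump.

    Hypothetical helper: plain keyword matching, not an official NumPy API.
    Only lines that list linked libraries are inspected, so section headers
    such as "mkl_info: NOT AVAILABLE" do not cause false matches.
    """
    relevant = [
        line.lower()
        for line in config_text.splitlines()
        if any(key in line.lower() for key in ("libraries", "extra_link_args", "name:"))
    ]
    joined = "\n".join(relevant)
    for keyword in ("mkl", "openblas", "accelerate"):
        if keyword in joined:
            return keyword
    return "unknown"

if __name__ == "__main__":
    try:
        import numpy as np
    except ImportError:
        print("NumPy is not installed")
    else:
        # show_config() prints to stdout, so capture it into a string first.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            np.show_config()
        print(guess_blas_backend(buffer.getvalue()))
```

A result of `unknown` only means the dump names no recognisable backend; it does not prove that an unoptimised BLAS is in use.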

27

pubby

2020-12-10 10:01:09 +08:00

3700X Hackintosh

Dotted two 4096x4096 matrices in 0.46 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 7.37 s.
Cholesky decomposition of a 2048x2048 matrix in 0.82 s.
Eigendecomposition of a 2048x2048 matrix in 49.05 s.

This was obtained using the following Numpy configuration:
atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3', '-I/AppleInternal/BuildRoot/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.Internal.sdk/System/Library/Frameworks/vecLib.framework/Headers']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE

I'm probably not using it right....

28

bnuliujing

2020-12-10 10:18:09 +08:00

Results on an i7-6950X

Dotted two 4096x4096 matrices in 0.35 s.
Dotted two vectors of length 524288 in 0.03 ms.
SVD of a 2048x1024 matrix in 0.27 s.
Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
Eigendecomposition of a 2048x2048 matrix in 3.39 s.

29

NoobX

2020-12-10 11:05:02 +08:00

Results on the i5 Mac mini

Dotted two 4096x4096 matrices in 0.58 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.32 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 3.30 s.

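The M1 results aren't all that impressive either...
Still, 16 GB of RAM remains a big problem. The system usually eats 4 GB on its own, so only 12 GB of the 16 GB is left for the dataset, which honestly isn't enough for me.
A slower processor isn't a big deal, but once swap fills up, the speed is a real nightmare.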

30

MisakaTian

2020-12-10 11:58:25 +08:00

Speaking as a data grunt: I'm in as soon as Anaconda is sorted.

31

Goldilocks

2020-12-10 12:06:11 +08:00

Processor Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz, 3600 Mhz, 4 Core

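Dotted two 4096x4096 matrices in 0.33 s, twice as fast as the M1. But the M1 has 8 cores. So at the same frequency and the same core count, Intel is still roughly 3-4x faster than the M1, and this is a three-year-old product.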
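
The per-core estimate above can be reproduced as follows. This is a sketch of the post's own reasoning: it assumes throughput scales linearly with core count and treats all 8 M1 cores as equal, and it uses the 0.53 s M1 result posted by the OP. Clock speed is not adjusted for.

```python
# Per-core normalisation of the 4096x4096 matmul timings quoted in this thread.
# Assumption (the post's reasoning, not a measurement): speed scales linearly with cores.
m1_seconds, m1_cores = 0.53, 8      # OP's M1 result
xeon_seconds, xeon_cores = 0.33, 4  # Xeon W-2123 result in this reply

def per_core_speedup(t_a, cores_a, t_b, cores_b):
    """How much faster machine B is than machine A per core."""
    whole_chip = t_a / t_b                   # whole-chip speedup of B over A
    return whole_chip * (cores_a / cores_b)  # scale by the core-count ratio

raw = m1_seconds / xeon_seconds
normalised = per_core_speedup(m1_seconds, m1_cores, xeon_seconds, xeon_cores)
print(f"whole chip: {raw:.2f}x, per core: {normalised:.2f}x")
```

With these inputs the whole-chip ratio comes out at about 1.6x and the per-core figure at about 3.2x.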

32

YUX

OP

2020-12-10 12:12:50 +08:00 via iPhone

@MisakaTian Just use mamba.

33

Goldilocks

2020-12-10 12:18:45 +08:00

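It's 2020. If Intel released a 2-core 3.6 GHz CPU, you certainly wouldn't think much of its performance. You should be thinking about Intel's 10-core and 20-core parts. AMD is about to release a 64-core desktop CPU, and Apple is still stuck at the 2-core level.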

34

meloyang05

2020-12-10 13:35:48 +08:00

@Goldilocks

“Mac mini with 8th-gen i7
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.09 ms.
SVD of a 2048x1024 matrix in 0.56 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 5.20 s.

M1 Mac mini

Dotted two 4096x4096 matrices in 0.69 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.68 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 4.82 s.”

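Are you selectively ignoring the other results? Timings at the millisecond level can have large errors to begin with, and NumPy for M1 may currently have a bug. What does singling out the vector result prove?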
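
One common way to judge how noisy a millisecond-scale timing is: repeat the measurement and compare the fastest run with the spread. Below is a minimal stdlib-only sketch. The pure-Python `dot` function is a stand-in workload chosen for illustration, not the NumPy call from the benchmark script, and the repeat counts are arbitrary.

```python
import statistics
import timeit

def dot(xs, ys):
    """Pure-Python dot product, used only as a small stand-in workload."""
    return sum(x * y for x, y in zip(xs, ys))

def time_repeated(func, repeats=7, number=10):
    """Return per-call timings in seconds, one value per repeat."""
    totals = timeit.repeat(func, repeat=repeats, number=number)
    return [total / number for total in totals]

xs = [float(i) for i in range(10_000)]
ys = [1.0] * len(xs)
samples = time_repeated(lambda: dot(xs, ys))

best = min(samples)
spread = (max(samples) - best) / best  # relative gap between slowest and fastest repeat
print(f"best {best * 1e3:.3f} ms, median {statistics.median(samples) * 1e3:.3f} ms, "
      f"spread {spread:.1%}")
```

A single-shot timing, as in the posted results, gives no way to tell a small spread from a large one.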

35

Goldilocks

2020-12-10 13:38:09 +08:00

The error won't be large; it's usually within 1%, because matrix multiplication is bounded by just two things:

1. CPU FLOPS
2. Memory bandwidth
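
The two limits listed above can be made concrete with a rough arithmetic-intensity estimate (floating-point operations per byte moved). The sketch assumes float64 data, counts each input element as read once, and ignores caches. Those simplifications are assumptions, not something stated in the thread.

```python
BYTES_PER_FLOAT64 = 8

def dot_intensity(n):
    """FLOPs per byte for a dot product of two length-n float64 vectors."""
    flops = 2 * n                            # n multiplies plus about n adds
    bytes_moved = 2 * n * BYTES_PER_FLOAT64  # read both vectors once
    return flops / bytes_moved

def matmul_intensity(n):
    """FLOPs per byte for an n x n float64 matrix product with ideal data reuse."""
    flops = 2 * n ** 3                           # about 2n^3 operations
    bytes_moved = 3 * n * n * BYTES_PER_FLOAT64  # read A and B, write C
    return flops / bytes_moved

def implied_bandwidth_gb_s(n, seconds):
    """Data rate implied by a dot-product timing, in GB/s (10^9 bytes per second)."""
    return 2 * n * BYTES_PER_FLOAT64 / seconds / 1e9

print(dot_intensity(524288))
print(matmul_intensity(4096))
print(implied_bandwidth_gb_s(524288, 0.25e-3))  # the 0.25 ms results
print(implied_bandwidth_gb_s(524288, 0.05e-3))  # the 0.05 ms results
```

On this estimate the dot product does very little work per byte, so it is limited by how fast data can be fetched, while the large matrix product has enough reuse to be limited by FLOPS.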

36

Goldilocks

2020-12-10 13:45:33 +08:00

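Numerical computations like matrix multiplication are a very mature field that has been studied thoroughly. See: https://en.wikichip.org/wiki/flops

Assuming memory bandwidth can keep up with the CPU, the only ways to go faster are:
1. Add more cores
2. Widen the SIMD units

For example, Skylake can reach 64 FLOPs/cycle, while AMD CPUs of the same era manage only 16 FLOPs/cycle. Clock speeds are all about the same, so that 4x factor accounts for most of the gap. And a gap like that is very hard to close; you could say there is no hope of it in a lifetime.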
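
The relationship described above can be sketched as peak FLOPS = cores x clock x FLOPs per cycle, and set against the rate that a matmul timing implies. The 4-core, 3.0 GHz chip in the example is hypothetical. Only the 64 and 16 FLOPs/cycle figures come from this post, and the two timings come from results posted earlier in the thread.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle

def achieved_gflops(n, seconds):
    """GFLOPS implied by an n x n matrix product, counted as about 2*n^3 operations."""
    return 2 * n ** 3 / seconds / 1e9

# Hypothetical 4-core, 3.0 GHz chip at the two per-cycle figures quoted in the post.
wide = peak_gflops(4, 3.0, 64)
narrow = peak_gflops(4, 3.0, 16)
print(wide, narrow, wide / narrow)  # the gap comes only from FLOPs/cycle

# Rates implied by two 4096x4096 timings posted in this thread.
print(f"{achieved_gflops(4096, 0.53):.0f} GFLOPS")  # OP's M1
print(f"{achieved_gflops(4096, 0.33):.0f} GFLOPS")  # Xeon W-2123
```

Comparing an achieved figure against a peak figure for the same chip would show how close a given BLAS build gets to the hardware limit, but that needs verified per-chip specifications, which this thread does not provide.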

37

Harry1993

2020-12-10 14:08:58 +08:00

I tried Apple's NumPy (https://github.com/apple/tensorflow_macos):

Dotted two 4096x4096 matrices in 0.84 s.
Dotted two vectors of length 524288 in 0.11 ms.
SVD of a 2048x1024 matrix in 0.54 s.
Cholesky decomposition of a 2048x2048 matrix in 0.06 s.
Eigendecomposition of a 2048x2048 matrix in 6.29 s.

38

IgniteWhite

2020-12-10 23:07:30 +08:00

@MisakaTian Isn't miniforge's package manager just conda... The only difference is that the default channel is conda-forge.

39

lly0514

2020-12-11 15:35:01 +08:00

@Goldilocks In practice the variation is very large. In my own tests the performance gap between MKL and OpenBLAS is more than 2x.

40

Richardyyz

2020-12-13 09:58:14 +08:00

@Goldilocks Zen 2 is already at 32 FLOPs/cycle. Is your lifetime that short? AVX-512, with its heavy downclocking, doesn't have much of an advantage over Zen 3.

41

YUX

OP

2021-01-24 20:05:33 +08:00

Adding one from a Raspberry Pi 😂

Dotted two 4096x4096 matrices in 10.18 s.
Dotted two vectors of length 524288 in 2.27 ms.
SVD of a 2048x1024 matrix in 6.67 s.
Cholesky decomposition of a 2048x2048 matrix in 0.85 s.
Eigendecomposition of a 2048x2048 matrix in 37.83 s.

This was obtained using the following Numpy configuration:
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
include_dirs = ['/root/mambaforge/envs/maths/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
include_dirs = ['/root/mambaforge/envs/maths/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/mambaforge/envs/maths/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/root/mambaforge/envs/maths/include']

42

YRInc

2021-04-23 04:02:49 +08:00

Here's a Chinese-made chip for reference: Kunpeng 920

12-core Kunpeng 920, 24 GB RAM:
-------------------
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:45:16)

Dotted two 4096x4096 matrices in 1.48 s.
Dotted two vectors of length 524288 in 0.49 ms.
SVD of a 2048x1024 matrix in 1.10 s.
Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
Eigendecomposition of a 2048x2048 matrix in 8.36 s.
-------------------

24-core Kunpeng 920, 48 GB RAM:
-------------------
Dotted two 4096x4096 matrices in 0.76 s.
Dotted two vectors of length 524288 in 0.48 ms.
SVD of a 2048x1024 matrix in 0.93 s.
Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
Eigendecomposition of a 2048x2048 matrix in 7.66 s.

Same environment as on the M1 Mac, Miniforge3. The relevant acceleration libraries are:
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
include_dirs = ['/root/miniforge3/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
include_dirs = ['/root/miniforge3/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/root/miniforge3/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/root/miniforge3/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/root/miniforge3/include']