memorycache v1.1.5 update: 写入速度达到 Ristretto 五倍

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

现在注册

已注册用户请登录

The Go Programming Language

› http://golang.org/

› Go Playground

Go Projects

› Revel Web Framework

这是一个创建于 181 天前的主题，其中的信息可能已经有所发展或是发生改变。

https://github.com/lxzan/memorycache

https://pkg.go.dev/github.com/lxzan/memorycache

本次更新, 引入了时间戳缓存, 优势进一步扩大:

// 白嫖 github action 做个测试

goos: linux
goarch: amd64
pkg: github.com/lxzan/memorycache/benchmark
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkMemoryCache_Set-4      11106261    100.6 ns/op	      18 B/op	       0 allocs/op
BenchmarkMemoryCache_Get-4      635988      77.30 ns/op	       0 B/op	       0 allocs/op
BenchmarkRistretto_Set-4        7933663     491.8 ns/op	     170 B/op	       2 allocs/op
BenchmarkRistretto_Get-4        11085688    98.92 ns/op	      18 B/op	       1 allocs/op
PASS

29 条回复 • 2023-12-06 21:58:27 +08:00

limpo

181 天前

不愧是卷王之王😀

Nazz

181 天前

@limpo 有个 gws 粉丝在鞭策我 🤣

matrix1010

176 天前

无意中看到了另一个快 5 倍的 https://github.com/maypok86/otter (看 bench 写入比我的 Theine 也快 5 倍), 可以研究一下

Nazz

176 天前

@matrix1010 源码比我的 MemoryCache 复杂许多, 没看明白它的 TTL 是怎么实现的

Nazz

176 天前

在 mac 上测了下

```
goos: darwin
goarch: arm64
pkg: github.com/lxzan/memorycache/benchmark
BenchmarkMemoryCache_Set-8 13383870 81.57 ns/op 15 B/op 0 allocs/op
BenchmarkMemoryCache_Get-8 14529171 69.02 ns/op 0 B/op 0 allocs/op
BenchmarkRistretto_Set-8 13331672 286.1 ns/op 139 B/op 2 allocs/op
BenchmarkRistretto_Get-8 15165708 81.48 ns/op 18 B/op 1 allocs/op
BenchmarkOtter_Set-8 10669878 114.8 ns/op 39 B/op 0 allocs/op
BenchmarkOtter_Get-8 14930769 114.3 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/lxzan/memorycache/benchmark 34.781s
```

matrix1010

175 天前

我测了测纯 GET 确实很快，因为用的是 xsync map( https://github.com/puzpuzpuz/xsync). 不过按照 xsync 作者的说法 GC 压力会更高( https://github.com/puzpuzpuz/xsync/issues/94)，这点从 benchmark 的 allocs 上倒是不太能看出来

Nazz

175 天前 via Android

@matrix1010 GC 压力高那还不如用内置 map 了，纯 Get 大家表现都差不多。

Nazz

175 天前 via Android

我自己也实现过 map ，但是扩容那块写得太简单粗暴，后面干粹用内置 map 了

maypok86

157 天前

Hey guys, the author of Otter is here. I only understand Chinese with google translator so using this site was quite difficult for me but I think I did well. :)

So I'll try to answer the discussion (at least what I understood) in English, if you don't mind. I will call the Ristretto/Theine/Otter set of libraries as RTO.

1. I think the benchmarks in memorycache are absolutely not representative because they use a terrible trace for all caches with a good eviction policy, since a trace of completely random strings makes the cache roll into single thread execution, constantly preempting keys due to lack of matching. In such conditions all the techniques used in RTO for acceleration not only do not help, but also harm + the cost of eviction policy also harms, simply because such a policy wastes extra CPU time and does not give any benefits. As an example: with high probability in Java, a map with mutex and heap to store the expiration time will overtake Caffeine on a trace of random strings but you can't go to Ben Manes and say that you killed his Caffeine. :) RTO libraries use zipfian (Theine) or more representative scrambled zipfian (Ristretto and Otter) for this reason and on such traces memorycache already loses seriously.

BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-10 25730020 43.56 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-10 11104183 95.58 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-10 24799726 46.79 ns/op 16 B/op 1 allocs/op
BenchmarkCache/Zipf_Memory_reads=100%,writes=0%-10 8501631 143.7 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-10 18628352 62.30 ns/op 6 B/op 0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-10 2528754 456.8 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-10 2830969 418.2 ns/op 45 B/op 1 allocs/op
BenchmarkCache/Zipf_Memory_reads=75%,writes=25%-10 3291450 376.4 ns/op 9 B/op 0 allocs/op

I will not give results for other load types, but the results are not better there. As a result, we have that the results of these benchmarks simply cannot be trusted.

2. About code complexity. Yes, Otter's code is more complicated to understand, but it demonstrates better perfomance and hit ratio. Because of this, the advantage of simple code is not very clear if a user doesn't need to understand the library and its intricacies. The user just wants to start using the library and have it show good results, and Otter does that. (I could give an example about DB or caffeine, for example, but I'm lazy).

3. About the hash table used in Otter and the pressure on the gc. Spoiler: I found this argument rather strange, but let's look into it. First of all about the hash table in Otter: yes, Otter uses a hash table based on map from xsync library, but it uses a different architecture based on seq locks and a number of hacks that puzpuzpuz couldn't use when implementing the hash table. Now what was puzpuzpuz talking about when mentioning the higher gc pressure? It's about the fact that sync.Map uses two standard maps inside itself, where key and value are stored without creating additional structure for key-value pair (aka sync.Map does not create structure of &Entry{key, value} kind, which is leaked to the heap). xsync.Map does create this structure, which causes additional allocations and increases the load on gc which has to keep track of pointers). Okay, what about Otter? When inserting a key-value pair, Otter creates a structure like &Entry{key, value} and gives this structure to the hash table, which already knows how to work with these structures (compare keys, etc.). Wait, but then Otter additionally pressure gc, as described above. Unfortunately, yes, but other cache libraries do even worse things. Let's break down why.

Here are some links to the main data structures in the rest of the cache libraries:

https://github.com/Yiling-J/theine-go/blob/main/internal/store.go#L39 (here things are even sadder from gc's point of view, because theine not only stores pointers to a key-value pair, but also duplicates the value of keys in two places (if the key is int it's still ok, but if the key is a string it will end up storing two pointers to strings (we'll forget about the overhead with len and cap) and *Entry[K, V]. Which puts more pressure on gc than Otter, obviously.

https://github.com/dgraph-io/ristretto/blob/main/store.go#L118 (ristretto doesn't seem to store pointers, but there's a catch), because 1. they replace the key with a hash, which can lead to terrible errors 2. stores the time completely (along with the location pointer) 3. allocation when adding a key still happens https://github.com/dgraph-io/ristretto/blob/main/cache.go#L285

https://github.com/lxzan/memorycache/blob/main/cache.go#L289 and https://github.com/lxzan/memorycache/blob/main/types.go#L18 are sad here. 1. there is not even a hint of generics 2. the key is again stored twice and chokes gc 3. the value is stored as type any (which is actually interface{}). If you dive a bit deeper into the implementation of interfaces in go, you'll learn that interfaces store a pointer to the value contained in them. As a result, we have that even the value separately leaks into the heap. This puts terrible pressure on gc.

4. About the speed of the otter.
In fact, this is due to several factors: Otter uses hash table, which is faster than the standard one, queue, which is faster than the standard one, wait-free buffers, which are faster than the implementation from Theine, counters, which outperform atomic.Int64 in cases of frequent writes, approximate time for TTL, honest batching for access to the eviction policy (Otter does not waste time on frequent lock/unlock operations with full read and write buffers) and a few more ideas.

Unfortunately, I haven't had time lately to finish the work for Otter, but in the next few weeks, everything should be calmer and I hope to finally finish it.

And a huge thank you for even noticing Otter, because I got the impression that even the eye-catching title doesn't interest anyone and everyone just walks past it.

P.S. how long you have to wait on this site to be able to write something...

Nazz

156 天前

@maypok86 The only competitor to MemoryCache (MC) is Ristretto, neither of which is GC optimized. GC optimization is not all positive gain, Codec overhead is not small. MC seeks to replace Redis in light-use scenarios, where indexed priority queues are the best data structure, simple and efficient.

maypok86

156 天前

@Nazz

1.) any cache library is capable of replacing redis in light-use scenarios and it doesn't give MC any advantage :)

2.) The claim that MC is a rival to Ristretto at all is very funny because MC simply doesn't have an eviction policy (MC just removes the element with the smallest expiration time, which isn't even LRU as written in the README). And in fact the main problem that RTOs solve. As a result we have that MC loses to RTO by hit ratio, loses by performance on real traces Otter and has serious pressure on GC. It seems that a cache library with no eviction policy should offer something in return (like no pressure on GC like bigcache https://github.com/allegro/bigcache). In the end we have that MC is slightly faster than Ristretto but still has a nasty hit ratio because of which the system will spend much more time. And if we compare it to Otter, it will lose on all parameters.

3.) Do you really think that Dgraph labs, PostgreSQL authors and Ben Manes & CO couldn't think of slicing the map into shards with mutex and bolting a heap on there? That's just stupid

Nazz

156 天前

@maypok86 It's not like ristretto didn't consider using a minimal heap https://github.com/dgraph-io/ristretto/blob/5239be55a219459d7ce4f6e9f99c8cea2c0b9d3a/policy.go#L163

maypok86

156 天前

@Nazz lol, the heap isn't there for what you're using. The heap is there just to avoid going through all the values in lfu to find the most rare element. You're using the heap for ttl (no question about that) and as an eviction policy you're just removing the element with the lowest expiry time, which is terrible. This gives you the speed to beat ristretto (since ristretto has to completely lock the eviction policy for all shards) but the hit ratio of such a cache will be horrible, because such actions are not really different from removing the first inserted item

Nazz

156 天前

@maypok86 It's stupid to try and convince people tirelessly. I don't see a problem with using indexed priority queues to implement memory caching with TTL, MC works well in my company's projects. I also don't see a problem with benchmarking based on random strings, redis keys often contain data ids. In MC, if you only use GetWithTTL and SetWithTTL, it's just LRU.

maypok86

155 天前

@Nazz Ok, maybe I'm misunderstanding something then I'd like to understand how it works in general.

1.) Suppose you use heap for the list in lru. Then why do it at all for O(logn) when you can do it for O(1)? And then how does deleting expirated items even work if the heap is not ordered by expiry time? It seems to me there is no way and memorycache leaves a large number of items alive until it reaches them in lru.

2.) Cutting lru into shards looks questionable in terms of hit ratio because one of the shards may simply have more items than the others and will constantly evict good items.

3.) Yes, redis can store id's but that's not what I was writing about. In a trace of random strings, all elements occur only once. And querying each of them is a cache miss but in reality there is a set of most frequent elements, which is what we want to store. And in memorycache benchmarks, any new item causes the caches to constantly evict items which would not happen in reality. It would also be cool to see benchmarks with a combination of set and get operations. And hit ratio comparison too.

maypok86

155 天前

Anyway, I fiddled around and tested benchmarks of mine and memorycache and the verdict is pretty interesting. Here I won't take trace types into account, because on pure set and get they really don't affect much, but only on mixed operations.

1.) There is a bug in memorycache benchmarks: https://github.com/lxzan/memorycache/blob/main/benchmark/benchmark_test.go#L31 this add eats about half of the benchmark time, since a bunch of goroutines compete for this atomic, it is better to use a local counter per goroutine or fastrand. And on such benchmarks the result is about the same:
goos: darwin
goarch: arm64
pkg: github.com/lxzan/memorycache/benchmark
BenchmarkMemoryCache_Set-8 27370962 83.20 ns/op
BenchmarkMemoryCache_Get-8 62720229 18.51 ns/op
BenchmarkRistretto_Set-8 19263478 101.4 ns/op
BenchmarkRistretto_Get-8 39636226 26.51 ns/op
BenchmarkOtter_Set-8 26425458 39.04 ns/op
BenchmarkOtter_Get-8 40814108 28.50 ns/op
PASS
ok github.com/lxzan/memorycache/benchmark 22.620s

Also on my benchmarks I got proportional results, memorycache was about a third faster than otter on pure get and about twice as slow on pure set, but things got sad when mixed load was applied
BenchmarkCache/Zipf_Memory_reads=75%,writes=25%-8 4453213 289.6 ns/op 13 B/op 0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8 13881976 75.53 ns/op 6 B/op 0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8 2853026 469.0 ns/op 0 B/op 0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8 3112142 390.5 ns/op 46 B/op 1 allocs/op

2.) I really wondered how regular locking with map was able to defeat wait-free ristretto and otter with rare request blocking. I went to the memorycache code and saw this https://github.com/lxzan/memorycache/blob/main/cache.go#L124 that is memorycache in no way accounts for reads of non-expired items in the eviction policy. This is not LRU at all and I don't know what it is because any eviction policy has to take into account both reads and writes. Like for example here https://github.com/hashicorp/golang-lru/blob/main/simplelru/lru.go#L72. I'll try to check the hit ratio tomorrow, but I suspect it will be very small

I'd like some kind of answer to this, to be honest, because I've been told so much about how only memorycache can fight ristretto, but so far it's been disappointing.

Nazz

155 天前

@maypok86 The heap is used to quickly remove expired elements. Redis seems to randomly check for expired elements, which is not as efficient as the heap. In fact, users don't care if a library strictly implements the LRU algorithm, all they want is a KV store with TTL.

I've updated the benchmarks. My local cachebench hit test was too much of a pain in the ass to run, so I gave up on it.

goos: linux
goarch: amd64
pkg: github.com/lxzan/memorycache/benchmark
cpu: AMD EPYC 7763 64-Core Processor
BenchmarkMemoryCache_Set-4 7657497 133.3 ns/op 27 B/op 0 allocs/op
BenchmarkMemoryCache_Get-4 23179834 58.10 ns/op 0 B/op 0 allocs/op
BenchmarkMemoryCache_SetAndGet-4 20667798 59.09 ns/op 0 B/op 0 allocs/op
BenchmarkRistretto_Set-4 7739505 321.4 ns/op 135 B/op 2 allocs/op
BenchmarkRistretto_Get-4 12482400 97.67 ns/op 18 B/op 1 allocs/op
BenchmarkRistretto_SetAndGet-4 7265832 140.4 ns/op 31 B/op 1 allocs/op
PASS
ok github.com/lxzan/memorycache/benchmark 31.137s

Nazz

155 天前

@maypok86

> I'd like some kind of answer to this, to be honest, because I've been told so much about how only memorycache can fight ristretto, but so far it's been disappointing.

MC is just an obscure library, and I'm sure hardly anyone would say that.

Nazz

155 天前

@maypok86 It seems to have triggered a bug in the otter, and it's slowing it down terribly.

go test -benchmem -run=^$ -bench . github.com/lxzan/memorycache/benchmark
goos: linux
goarch: amd64
pkg: github.com/lxzan/memorycache/benchmark
cpu: AMD Ryzen 5 PRO 4650G with Radeon Graphics
BenchmarkMemoryCache_Set-12 21004170 72.60 ns/op 9 B/op 0 allocs/op
BenchmarkMemoryCache_Get-12 43787251 40.11 ns/op 0 B/op 0 allocs/op
BenchmarkMemoryCache_SetAndGet-12 45939994 45.35 ns/op 0 B/op 0 allocs/op
BenchmarkRistretto_Set-12 12190314 122.2 ns/op 112 B/op 2 allocs/op
BenchmarkRistretto_Get-12 25565082 44.60 ns/op 16 B/op 1 allocs/op
BenchmarkRistretto_SetAndGet-12 11713868 97.06 ns/op 27 B/op 1 allocs/op
BenchmarkOtter_SetAndGet-12 13760 89816 ns/op 13887 B/op 0 allocs/op
PASS
ok github.com/lxzan/memorycache/benchmark 44.081s

func BenchmarkOtter_SetAndGet(b *testing.B) {
var builder, _ = otter.NewBuilder[string, int](1000)
builder.ShardCount(128)
mc, _ := builder.Build()
for i := 0; i < benchcount; i++ {
mc.SetWithTTL(benchkeys[i%benchcount], 1, time.Hour)
}

b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
var i = atomic.Int64{}
for pb.Next() {
index := i.Add(1) % benchcount
if index&7 == 0 {
mc.SetWithTTL(benchkeys[index], 1, time.Hour)
} else {
mc.Get(benchkeys[index])
}
}
})
}

maypok86

155 天前

1.) Yes, users of cache libraries usually want a map with limited size, ttl and maybe some other set of features and don't want to care about preemptive policies within that library. But if this library doesn't show a good hit ratio, then the rest of its features are of no use at all, because cache misses will waste a huge amount of time on the system. And lru sharding and ignoring get operations in the eviction policy looks like a very small hit ratio (many times smaller than the others). Also as I wrote before, it's not very clear how lru and ttl coexist together in the same heap.

2.) Yes, measuring hit ratio is not a very nice thing to do. But if you don't want to write these checks yourself, you can check them on ristretto, theine or otter benchmarks. All these benchmarks have approximately the same values.

3.) Yes, I once got such strange memorycache benchmarks on otter and it was on pure get when benchcount was very large. I'll try to look at this today tentatively looks like a gc loop, or some bug when accounting for expiration. It would be cool if you could share the full benchmark code so I can profile easier.

4.) I've noticed here that they unironically take off some balance for posts, so let's continue in this issue https://github.com/lxzan/memorycache/issues/13

Nazz

155 天前

@matrix1010 otter 作者被你炸出来了

matrix1010

155 天前

@Nazz 好家伙，这个展开出乎我意料。可能这是 V2EX 为数不多的外国朋友

Nazz

147 天前

@matrix1010 swiss table 的 gc 压力相比内置 map 怎么样?

matrix1010

147 天前

@Nazz 知识盲区，也许这个能参考一下: https://github.com/golang/go/issues/54766

Nazz

147 天前

@matrix1010 可以简单说下 Theine 是怎么维护过期时间和 LFU 吗?

Nazz

147 天前

MC 使用最小四叉堆高效地维护了过期时间, 但是只实现了 LRU, 命中率方面不如 LFU

Nazz

147 天前

我想到了, 再加一个堆, 以查询次数作为比较基准

matrix1010

147 天前

@Nazz Hierarchical Timing Wheels, 我是照着 caffeine 的 java 代码翻译的，也可以 google 。LFU 就复杂些了, 建议去看 W-TinyLFU 的论文。简单来说 frequency 数据是存在 Count-Min Sketch 这种概率类数据结构里的，所以占用空间很小

matrix1010

147 天前

缓存策略相关的论文很多，包括各种改进版的 lru 策略也很多