복붙노트

[REDIS] 비동기 AOF fsync를 너무 오래 (디스크가 사용 중이?)하고있다. 전체에 fsync를 기다리지 않고 AOF 버퍼를 작성,이 레디 스 느려질 수 있습니다

REDIS

비동기 AOF fsync를 너무 오래 (디스크가 사용 중이?)하고있다. 전체에 fsync를 기다리지 않고 AOF 버퍼를 작성,이 레디 스 느려질 수 있습니다

나는 아직도 우리가 우리의 처리 실패의 몇 가지 강조 표시된 오류-1 & 2 메시지와 클러스터가 몇 시간 동안 실패를 참조 지정된 레디 스 구성 값으로 성능 테스트를위한 이상 실행을위한 테스트-1 및 시험-2 이하를 실행했다. 어떻게이 문제를 해결합니다.

누구든지 피하기 클러스터에 대한 제안은 더 이상 십초 이상 진행되는 실패가하십시오, 클러스터가 우리가 재시도 메커니즘 시도 횟수에 사용하는 3 명 재시도 (봄 재시도 템플릿 내에 오지 않아 3으로 설정하고 다음 재시도 후 5 초, 기하 급수적 인 방법입니다 Jedis 클라이언트를 사용하여 시도).

시험-1 :

Run the test with Redis Setting: 

"appendfsync"="yes" 
"appendonly"="no"


[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6379

127.0.0.1:6379> CONFIG GET **aof***

1) "auto-aof-rewrite-percentage"
2) "30"
3) "auto-aof-rewrite-min-size"
4) "67108864"
5) "aof-rewrite-incremental-fsync"
6) "yes"
7) "aof-load-truncated"
8) "yes"
127.0.0.1:6379> exit

[root@rdcapdev1-redis-cache3 redis-3.2.5]# src/redis-cli -p 6380

127.0.0.1:6380> CONFIG GET **aof***

1) "auto-aof-rewrite-percentage"
2) "30"
3) "auto-aof-rewrite-min-size"
4) "67108864"
5) "aof-rewrite-incremental-fsync"
6) "yes"
7) "aof-load-truncated"
8) "yes"
127.0.0.1:6380> clear

관측:

  1. The redis failover occurred for ~40 sec.
  2. There are around 20 documents failed on FX and OCR level. Due to inability to write/read the files to Redis.
  3. This has been happened when ~50% of RAM got utilized.
  4. The Master-slave configuration has be reshuffled as below after this failover.
  5. Below are the few highlights of the redis log, plese refer that attached log for more detail.
  6. I have logs for this, for more details: 30Per_AofRW_2.zip

Redis1 마스터 로그 :

2515:C 05 May 11:06:30.343 * DB saved on disk
2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
837:S 05 May 11:06:30.429 * Background saving terminated with success
837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
837:S 05 May 11:11:31.067 * Background saving started by pid 2534
837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
2534:C 05 May 11:12:31.110 * DB saved on disk

Redis2 마스터 로그 :

837:M 05 May 10:30:22.216 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
837:M 05 May 10:30:22.216 # Cluster state changed: fail
837:M 05 May 10:30:23.148 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
837:M 05 May 10:30:23.188 # Cluster state changed: ok
837:M 05 May 10:30:27.227 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.
837:M 05 May 10:35:22.017 * 10 changes in 300 seconds. Saving...
.
.
.
837:M 05 May 11:12:23.592 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:M 05 May 11:12:27.045 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.

Redis3 마스터 로그 :

833:M 05 May 10:30:22.217 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba
833:M 05 May 10:30:22.217 # Cluster state changed: fail
833:M 05 May 10:30:23.149 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 12
833:M 05 May 10:30:23.189 # Cluster state changed: ok
1822:C 05 May 10:30:27.397 * DB saved on disk
1822:C 05 May 10:30:27.428 * RDB: 8 MB of memory used by copy-on-write
833:M 05 May 10:30:27.528 * Background saving terminated with success
833:M 05 May 10:30:27.828 * Clear FAIL state for node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba: slave is reachable again.

HOST: localhost PORT: 6379
machine master slave
10.2.1.233 0.00 2.00
10.2.1.46 2.00 0.00
10.2.1.202 1.00 1.00

MASTER SLAVE INFO
hashCode master slave hashSlot
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

시험-2 :

Run the test with Redis Setting: 

"appendfsync"="no" 
"appendonly"="yes"

관측:

1. The redis failover occurred for ~40 sec.
2. There are around 20 documents failed on FX and OCR level. Due to inability to write/read the files to Redis.
3. This has been happened when ~50% of RAM got utilized.
4. The Master-slave configuration has be reshuffled as below after this failover.
5. Below are the few highlights of the redis log, plese refer that attached log for more detail.
30Per_AofRW_2.zip

Redis1 마스터 로그 :

2515:C 05 May 11:06:30.343 * DB saved on disk
2515:C 05 May 11:06:30.379 * RDB: 15 MB of memory used by copy-on-write
837:S 05 May 11:06:30.429 * Background saving terminated with success
837:S 05 May 11:11:31.024 * 10 changes in 300 seconds. Saving...
837:S 05 May 11:11:31.067 * Background saving started by pid 2534
837:S 05 May 11:12:24.802 * FAIL message received from 6b8d49e9db288b13071559c667e95e3691ce8bd0 about ce62a26102ef54f43fa7cca64d24eab45cf42a61
837:S 05 May 11:12:27.049 * Clear FAIL state for node ce62a26102ef54f43fa7cca64d24eab45cf42a61: slave is reachable again.
2534:C 05 May 11:12:31.110 * DB saved on disk

Redis2 마스터 로그 :

5306:M 03 Apr 09:02:36.947 * Background saving terminated with success
5306:M 03 Apr 09:02:49.574 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:02:49.583 * Background append only file rewriting started by pid 12864
5306:M 03 Apr 09:02:54.050 * AOF rewrite child asks to stop sending diffs.
12864:C 03 Apr 09:02:54.051 * Parent agreed to stop sending diffs. Finalizing AOF...
12864:C 03 Apr 09:02:54.051 * Concatenating 0.00 MB of AOF diff received from parent.
12864:C 03 Apr 09:02:54.051 * SYNC append only file rewrite performed
12864:C 03 Apr 09:02:54.058 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:02:54.098 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:02:54.098 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:02:54.098 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:04:01.843 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:04:01.853 * Background append only file rewriting started by pid 12867
5306:M 03 Apr 09:04:11.657 * AOF rewrite child asks to stop sending diffs.
12867:C 03 Apr 09:04:11.657 * Parent agreed to stop sending diffs. Finalizing AOF...
12867:C 03 Apr 09:04:11.657 * Concatenating 0.00 MB of AOF diff received from parent.
12867:C 03 Apr 09:04:11.657 * SYNC append only file rewrite performed
12867:C 03 Apr 09:04:11.664 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:04:11.675 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:04:11.675 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:04:11.675 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:04:48.054 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 09:05:28.571 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:05:28.581 * Background append only file rewriting started by pid 12873
5306:M 03 Apr 09:05:33.300 * AOF rewrite child asks to stop sending diffs.
12873:C 03 Apr 09:05:33.300 * Parent agreed to stop sending diffs. Finalizing AOF...
12873:C 03 Apr 09:05:33.300 * Concatenating 2.09 MB of AOF diff received from parent.
12873:C 03 Apr 09:05:33.329 * SYNC append only file rewrite performed
12873:C 03 Apr 09:05:33.336 * AOF rewrite: 11 MB of memory used by copy-on-write
5306:M 03 Apr 09:05:33.395 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:05:33.395 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:05:33.395 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:07:37.082 * 10 changes in 300 seconds. Saving...
5306:M 03 Apr 09:07:37.092 * Background saving started by pid 12875
12875:C 03 Apr 09:07:47.016 * DB saved on disk
12875:C 03 Apr 09:07:47.024 * RDB: 5 MB of memory used by copy-on-write
5306:M 03 Apr 09:07:47.113 * Background saving terminated with success
5306:M 03 Apr 09:07:51.622 * Starting automatic rewriting of AOF on 3% growth
5306:M 03 Apr 09:07:51.632 * Background append only file rewriting started by pid 12876
5306:M 03 Apr 09:07:56.559 * AOF rewrite child asks to stop sending diffs.
12876:C 03 Apr 09:07:56.559 * Parent agreed to stop sending diffs. Finalizing AOF...
12876:C 03 Apr 09:07:56.559 * Concatenating 0.00 MB of AOF diff received from parent.
12876:C 03 Apr 09:07:56.559 * SYNC append only file rewrite performed
12876:C 03 Apr 09:07:56.567 * AOF rewrite: 2 MB of memory used by copy-on-write
5306:M 03 Apr 09:07:56.645 * Background AOF rewrite terminated with success
5306:M 03 Apr 09:07:56.645 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
5306:M 03 Apr 09:07:56.645 * Background AOF rewrite finished successfully
5306:M 03 Apr 09:12:48.071 * 10 changes in 300 seconds. Saving...
5306:M 03 Apr 09:12:48.081 * Background saving started by pid 12882
12882:C 03 Apr 09:12:58.381 * DB saved on disk
12882:C 03 Apr 09:12:58.389 * RDB: 5 MB of memory used by copy-on-write
5306:M 03 Apr 09:12:58.403 * Background saving terminated with success
5306:M 03 Apr 10:17:33.005 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 10:22:42.042 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 10:27:51.039 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
5306:M 03 Apr 11:10:10.606 * Marking node a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba as failing (quorum reached).
5306:M 03 Apr 11:10:10.607 # Cluster state changed: fail
5306:M 03 Apr 11:10:10.608 * FAIL message received from 83f6a9589aa1bce8932a367894fa391edd0ce269 about ac630108d1556786a4df74945cfe35db981d15fa
5306:M 03 Apr 11:10:11.594 # Failover auth granted to 6b8d49e9db288b13071559c667e95e3691ce8bd0 for epoch 7

HOST: localhost PORT: 6379
machine master slave
10.2.1.233 0.00 2.00
10.2.1.46 2.00 0.00
10.2.1.202 1.00 1.00

MASTER SLAVE 정보

hashCode master slave hashSlot
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 10.2.1.233:6380, 10923-16383
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 10.2.1.233:6379, 0-5460
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 10.2.1.202:6380, 5461-10922
6b8d49e9db288b13071559c667e95e3691ce8bd0 10.2.1.46:6380 master - 0 1493981044497 12 connected 0-5460
81ae2d757f57f36fa1df6e930af3b072084ba3e8 10.2.1.202:6379 master - 0 1493981045500 3 connected 10923-16383
ce62a26102ef54f43fa7cca64d24eab45cf42a61 10.2.1.202:6380 slave 83f6a9589aa1bce8932a367894fa391edd0ce269 0 1493981043495 10 connected
ac630108d1556786a4df74945cfe35db981d15fa 10.2.1.233:6380 slave 81ae2d757f57f36fa1df6e930af3b072084ba3e8 0 1493981042492 11 connected
83f6a9589aa1bce8932a367894fa391edd0ce269 10.2.1.46:6379 master - 0 1493981044497 2 connected 5461-10922
a523100ddfbf844c6d1cc7e0b6a4b3a2aa970aba 10.2.1.233:6379 myself,slave 6b8d49e9db288b13071559c667e95e3691ce8bd0 0 0 1 connected

해결법

    from https://stackoverflow.com/questions/47693975/asynchronous-aof-fsync-is-taking-too-long-disk-is-busy-writing-the-aof-buffe by cc-by-sa and MIT license