问题描述
新安装不久的PostgreSQL数据库,断电后重启,查看日志如下
2019-01-08 08:44:19.989 UTC [7493] LOG: database system was interrupted; last known up at 2018-12-24 10:56:28 UTC
2019-01-08 08:44:19.989 UTC [7493] PANIC: could not read file "pg_logical/replorigin_checkpoint": Success
2019-01-08 08:44:19.989 UTC [7491] LOG: startup process (PID 7493) was terminated by signal 6: Aborted
2019-01-08 08:44:19.989 UTC [7491] LOG: aborting startup due to startup process failure
2019-01-08 08:44:20.011 UTC [7491] LOG: database system is shut down
其中致命的错误是这句。
2019-01-08 08:44:19.989 UTC [7493] PANIC: could not read file "pg_logical/replorigin_checkpoint": Success
这个提示很明显,与文件replorigin_checkpoint有关。
于是找到这个文件,在数据库正常启动的情况下,尝试通过删除和修改的方式重现该问题。
无论是删除,还是修改,在数据库重启的情况下都会被重建。无法复现错误。
之所以如此,是因为一直认为该文件是在数据库启动时重建的。
直到意外的,在关闭数据库后清空文件内容,才成功复现了该问题。
注意:pg_logical/replorigin_checkpoint,文件是在数据库关闭时重建的,并且不能删除,必须清空文件内容。
问题已经清楚了,那么我们还有两个问题需要解决
- 文件的作用,以及数据库相应的处理流程?
- 如何避免该问题?
1.文件的作用
网上对于这部分的资料非常少,好在PG是开源的数据库,结合代码得到了如下的这些信息。
- 与逻辑复制有关,Logical replication(逻辑复制)。
- 通过名称pg_logical/replorigin_checkpoint 查找源代码,出现在如下两个函数中。
截取StartupReplicationOrigin()函数的一段代码
if (readBytes != sizeof(magic))
{
if (readBytes < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",
path)));
else
ereport(PANIC,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("could not read file \"%s\": read %d of %zu",
path, readBytes, sizeof(magic))));
}
代码中出现了我们日志中的错误 could not read file....,: Success
错误的原因是内容的大小与预设值不一致。与我们之前的测试结果类似,只有更改内容才会报错。
CheckPointReplicationOrigin()
其中,截取CheckPointReplicationOrigin函数的代码
{
const char *tmppath = "pg_logical/replorigin_checkpoint.tmp";
const char *path = "pg_logical/replorigin_checkpoint";
int tmpfd;
int i;
uint32 magic = REPLICATION_STATE_MAGIC;
pg_crc32c crc;
if (max_replication_slots == 0)
return;
INIT_CRC32C(crc);
/* make sure no old temp file is remaining */
if (unlink(tmppath) < 0 && errno != ENOENT)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m",
tmppath)));
/*
* no other backend can perform this at the same time, we're protected by
* CheckpointLock.
*/
tmpfd = OpenTransientFile(tmppath,
O_CREAT | O_EXCL | O_WRONLY | PG_BINARY);
if (tmpfd < 0)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m",
tmppath)));
/* write magic */
errno = 0;
if ((write(tmpfd, &magic, sizeof(magic))) != sizeof(magic))
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
errno = ENOSPC;
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",
tmppath)));
}
COMP_CRC32C(crc, &magic, sizeof(magic));
/* prevent concurrent creations/drops */
LWLockAcquire(ReplicationOriginLock, LW_SHARED);
/* write actual data */
for (i = 0; i < max_replication_slots; i++)
{
ReplicationStateOnDisk disk_state;
ReplicationState *curstate = &replication_states[i];
XLogRecPtr local_lsn;
if (curstate->roident == InvalidRepOriginId)
continue;
/* zero, to avoid uninitialized padding bytes */
memset(&disk_state, 0, sizeof(disk_state));
LWLockAcquire(&curstate->lock, LW_SHARED);
disk_state.roident = curstate->roident;
disk_state.remote_lsn = curstate->remote_lsn;
local_lsn = curstate->local_lsn;
LWLockRelease(&curstate->lock);
/* make sure we only write out a commit that's persistent */
XLogFlush(local_lsn);
errno = 0;
if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
sizeof(disk_state))
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
errno = ENOSPC;
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",
tmppath)));
}
COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
}
LWLockRelease(ReplicationOriginLock);
/* write out the CRC */
FIN_CRC32C(crc);
errno = 0;
if ((write(tmpfd, &crc, sizeof(crc))) != sizeof(crc))
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
errno = ENOSPC;
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",
tmppath)));
}
if (CloseTransientFile(tmpfd))
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not close file \"%s\": %m",
tmppath)));
/* fsync, rename to permanent file, fsync file and directory */
durable_rename(tmppath, path, PANIC);
}
- 当参数max_replication_slots设置为0时,不会创建和修改文件,而是直接返回了。
if (max_replication_slots == 0)
return;
参数max_replication_slots的设置,与 logical decoding有关
Before you can use logical decoding, you must set wal_level to logical and max_replication_slots
to at least 1.
在PostgreSQL 9.5.16 Documentation的文档中,我们找到如下描述的内容。
Replication slots provide an automated way to ensure that the master does not remove WAL segments until they have been received by all standbys, and that the master does not remove rows which could cause a recovery conflict even when the standby is disconnected.
复制槽提供了一种自动化的方法,以确保主服务器在所有备用服务器接收到WAL段之前不会删除它们,并且主服务器不会删除可能导致恢复冲突的行,即使在备用服务器断开连接时也是如此。
In lieu of using replication slots, it is possible to prevent the removal of old WAL segments using wal_keep_segments, or by storing the segments in an archive using archive_command. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments known to be needed. An advantage of these methods is that they bound the space requirement for pg_xlog; there is currently no way to do this using replication slots.
Similarly, hot_standby_feedback and vacuum_defer_cleanup_age provide protection against relevant rows being removed by vacuum, but the former provides no protection during any time period when the standby is not connected, and the latter often needs to be set to a high value to provide adequate protection. Replication slots overcome these disadvantages.
接下来的两段,是在说明,在没有使用复制槽的时候,如何避免上面提到的两个问题
- 确保主服务器在所有备用服务器接收到WAL段之前不会删除它们
- 主服务器不会删除可能导致恢复冲突的行
很显然,使用复制槽能够更好的避免这些问题。
而我们通过源码,能够发现,将复制槽中内容,持续写入到文件的过程
/* write actual data */
for (i = 0; i < max_replication_slots; i++)
{
ReplicationStateOnDisk disk_state;
ReplicationState *curstate = &replication_states[i];
XLogRecPtr local_lsn;
if (curstate->roident == InvalidRepOriginId)
continue;
/* zero, to avoid uninitialized padding bytes */
memset(&disk_state, 0, sizeof(disk_state));
LWLockAcquire(&curstate->lock, LW_SHARED);
disk_state.roident = curstate->roident;
disk_state.remote_lsn = curstate->remote_lsn;
local_lsn = curstate->local_lsn;
LWLockRelease(&curstate->lock);
/* make sure we only write out a commit that's persistent */
XLogFlush(local_lsn);
errno = 0;
if ((write(tmpfd, &disk_state, sizeof(disk_state))) !=
sizeof(disk_state))
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
errno = ENOSPC;
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",
tmppath)));
}
COMP_CRC32C(crc, &disk_state, sizeof(disk_state));
}
2.如何避免该问题
避免该问题需要从两方面考虑
第一个,如果我们需要使用逻辑复制,也就是需要使用logical decoding,那么必须设置参数max_replication_slots大于1
当出现突然断电的情况,pg_logical/replorigin_checkpoint只创建了文件,没有来得及写数据的情况下,仍然会出现这种问题。
所以可能的解决办法是,当数据库启动异常时,将文件删除,确保数据库能够正常启动。
第二个,当我们不需要使用逻辑复制,那么只需要将参数max_replication_slots设置为0,则数据库自动不创建该文件,从而避免该问题。