Python: 如何读取大文件的信息

本人在处理实验结果的数据时，发现结果的文本文件过大（好几个G），不管是用Notepad或者vscode读取都存在一些问题，最终想到用python读取。但是正常的读取整个文本，大概需要1min左右，但实际上需要的只是最后输出的结果，后面发现可以直接采用二进制的方式读取，大大提升效率。

1 高效读取文件的最后几行#

我们可以使用以下方法来高效地读取大文件的最后几行：

def read_last_lines(file_path):
    with open(file_path, 'rb') as file:
        try:
            file.seek(-1024, 2)  # 将file指针从文件末尾向前移动1024字节
        except IOError:
            file.seek(0)  # 如果文件小于1024字节，就从头开始读
        last_lines = file.read().splitlines()[-3:]
        return [line.decode('utf-8', errors='replace') for line in last_lines]

这个函数的工作原理：

以二进制模式打开文件。
尝试从文件末尾向前移动 1024 字节。如果文件小于 1024 字节，就从头开始读。
读取这部分内容，分割成行，并取最后三行。
尝试用 UTF-8 解码这些行，如果遇到无法解码的字符，就替换它们。

2 处理多个文件#

要处理目录中的多个文件，我们可以使用以下代码：

import os

# 获取所有符合条件的文件并排序
matching_files = [
    file for file in os.listdir("results")
    if file.startswith("xxx") and file.endswith(".txt") and not file.endswith("xxxx.txt")
]
matching_files.sort()  # 按字母顺序排序文件名

# 按排序后的顺序处理文件
for file in matching_files:
    file_path = os.path.join("results", file)
    last_three_lines = read_last_lines(file_path)
    
    # 打印文件名和最后三行
    print(f"\nFile: {file}")
    for line in last_three_lines:
        print(line)

这段代码的特点：

使用列表推导式找出所有符合条件的文件。
对文件名进行排序，确保输出的顺序一致。
遍历文件，读取每个文件的最后三行并打印。

3 处理编码问题#

在处理文件时，我们可能会遇到 UnicodeDecodeError。这通常发生在文件不是用 UTF-8 编码保存的情况下。除了上面使用的 errors='replace' 参数外，还有其他几种方法可以处理这个问题：

尝试多种编码：

def read_with_multiple_encodings(file_path):
    encodings = ['utf-8', 'gbk', 'gb18030', 'iso-8859-1', 'latin1']
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
    raise UnicodeDecodeError(f"Unable to decode the file {file_path} with any of the tried encodings")

使用第三方库如 chardet 来检测文件编码：

import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as file:
        raw_data = file.read()
    return chardet.detect(raw_data)['encoding']