Reading binary files with Python - IT科技


2025-02-21 08:50:00

admin

Original


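Problem description:

I find reading binary files with Python particularly difficult. Can you give me a hand? I need to read this file, which in Fortran 90 is read by: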

int*4 n_particles, n_groups
real*4 group_id(n_particles)
read (*) n_particles, n_groups
read (*) (group_id(j),j=1,n_particles)

In detail, the file format is:

Bytes 1-4 -- The integer 8.
Bytes 5-8 -- The number of particles, N.
Bytes 9-12 -- The number of groups.
Bytes 13-16 -- The integer 8.
Bytes 17-20 -- The integer 4*N.
Next many bytes -- The group ID numbers for all the particles.
Last 4 bytes -- The integer 4*N.

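How can I read this with Python? I tried everything, but nothing worked. Alternatively, could I call an f90 program from Python to read this binary file and then save the data I need?

Solution 1:

Read the binary file content like this: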

with open(fileName, mode='rb') as file: # b is important -> binary
    fileContent = file.read()

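Then "unpack" the binary data with struct.unpack:

Leading bytes: struct.unpack("iiiii", fileContent[:20])

Body: ignore the header bytes and the trailing bytes (24 in total). What remains is the body. Integer-divide its length in bytes by 4 to get the number of values, then multiply the string 'i' by that quotient to build the format string for unpack. Because the question declares group_id as real*4, use 'f' in place of 'i' here if the values should be decoded as floats:

struct.unpack("i" * ((len(fileContent) - 24) // 4), fileContent[20:-4])

Trailing bytes: struct.unpack("i", fileContent[-4:])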
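
Putting these steps together for the layout in the question, a minimal sketch could look like the following. It assumes little-endian data and 4-byte record markers (the question states neither), decodes the body as float32 because group_id is declared real*4, and uses a synthetic sample instead of the real file. The helper name parse_particle_file is hypothetical.

```python
import struct


def parse_particle_file(raw, byteorder="<"):
    """Decode the two-record layout described in the question from a bytes object."""
    # Record 1: marker (8), number of particles, number of groups, marker (8)
    head_start, n_particles, n_groups, head_end = struct.unpack_from(byteorder + "4i", raw, 0)
    if head_start != 8 or head_end != 8:
        raise ValueError("unexpected header markers; the byte order may be wrong")

    # Record 2: marker (4*N), N float32 values, marker (4*N)
    (body_start,) = struct.unpack_from(byteorder + "i", raw, 16)
    if body_start != 4 * n_particles:
        raise ValueError("body length marker does not match the particle count")
    group_ids = struct.unpack_from(f"{byteorder}{n_particles}f", raw, 20)
    (body_end,) = struct.unpack_from(byteorder + "i", raw, 20 + body_start)
    if body_end != body_start:
        raise ValueError("trailing marker does not match the leading marker")

    return n_particles, n_groups, list(group_ids)


# Synthetic sample standing in for the real file (assumption: little-endian)
sample = struct.pack("<4i", 8, 5, 3, 8) + struct.pack("<i5fi", 20, 1.0, 0.0, 2.0, 0.0, 1.0, 20)
n_particles, n_groups, group_ids = parse_particle_file(sample)
print(n_particles, n_groups, group_ids)
```

If the marker checks fail on a real file, passing byteorder=">" is the first thing to try.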

Solution 2:

To read a binary file into a bytes object:

from pathlib import Path
data = Path('/path/to/file').read_bytes()  # Python 3.5+

To create an int from bytes 0-3 of the data:

i = int.from_bytes(data[:4], byteorder='little', signed=False)

To unpack multiple ints from the data:

import struct
ints = struct.unpack('iiii', data[:16])

References: pathlib, int.from_bytes(), struct
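
As a small illustration of int.from_bytes, the sketch below decodes the four header fields from the question's layout and shows how a wrong byteorder reveals itself. The sample bytes are synthetic and assumed little-endian, and read_u32_fields is a hypothetical helper name.

```python
# Synthetic 16-byte header: 8, N=5, groups=3, 8 (assumption: little-endian)
header = bytes([8, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 0, 8, 0, 0, 0])


def read_u32_fields(buf, count, byteorder="little"):
    """Decode `count` consecutive unsigned 4-byte integers from the start of buf."""
    return [
        int.from_bytes(buf[4 * k:4 * k + 4], byteorder=byteorder, signed=False)
        for k in range(count)
    ]


fields = read_u32_fields(header, 4)
print(fields)

# Reading the same bytes with the opposite byte order gives implausibly
# large values, which is a quick hint that the byte order is wrong.
wrong = read_u32_fields(header, 4, byteorder="big")
print(wrong)
```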

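Solution 3:

In general, I would recommend looking into Python's struct module for this. It is part of the standard library, and it should be easy to translate the question's specification into a format string suitable for struct.unpack().

Note that if there is "invisible" padding between or around the fields, you will need to figure that out and include it in the unpack() call, or you will read the wrong bits.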
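
One way to account for such padding is the x (pad byte) format character, which consumes bytes without producing a value. The sketch below treats the 4-byte markers around the first record as padding. It assumes little-endian data and uses a synthetic record; the last lines show that native alignment (@) can also add padding of its own, in a platform-dependent amount.

```python
import struct

# Synthetic first record: marker, N, number of groups, marker (assumption: little-endian)
record = struct.pack("<4i", 8, 5, 3, 8)

# "4x" skips 4 bytes on each side, so only the two payload integers are returned
n_particles, n_groups = struct.unpack("<4x2i4x", record)
print(n_particles, n_groups)

# With "<" (standard sizes) no alignment padding is inserted between fields;
# with "@" (native) the struct module may insert padding, as a C compiler would.
standard_size = struct.calcsize("<ci")
native_size = struct.calcsize("@ci")
print(standard_size, native_size)
```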

Reading the contents of the file, so that there is something to unpack, is fairly trivial:

import struct

with open("from_fortran.bin", "rb") as f:
    data = f.read()

# unpack() needs exactly 8 bytes for "@II", so slice off the first two fields
(eight, N) = struct.unpack("@II", data[:8])

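This unpacks the first two fields, assuming they start at the very beginning of the file (no padding or extraneous data), and assuming native byte order (the @ symbol). The I characters in the format string mean "unsigned integer, 32 bits".

Solution 4:

You could use numpy.fromfile, which can read data from both text and binary files. First construct a data type that represents your file format with numpy.dtype, then read that type from the file with numpy.fromfile.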
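
A sketch of that approach for the layout in the question might look as follows. The field names, the little-endian byte order, and the temporary sample file are all assumptions made for illustration; the body dtype can only be built once N is known, so the file is read in two steps.

```python
import os
import struct
import tempfile

import numpy as np

# Write a small synthetic file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "particles.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4i", 8, 5, 3, 8))
    f.write(struct.pack("<i5fi", 20, 1.0, 0.0, 2.0, 0.0, 1.0, 20))

# First record: marker, N, number of groups, marker (field names are hypothetical)
header_dtype = np.dtype([
    ("rec1_start", "<i4"),
    ("n_particles", "<i4"),
    ("n_groups", "<i4"),
    ("rec1_end", "<i4"),
])

with open(path, "rb") as f:
    header = np.fromfile(f, dtype=header_dtype, count=1)[0]
    n = int(header["n_particles"])

    # Second record: marker, N float32 values, marker
    body_dtype = np.dtype([
        ("rec2_start", "<i4"),
        ("group_id", "<f4", (n,)),
        ("rec2_end", "<i4"),
    ])
    body = np.fromfile(f, dtype=body_dtype, count=1)[0]

group_ids = body["group_id"]
print(n, int(header["n_groups"]), group_ids)
```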

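Solution 5:

I too found Python lacking when it comes to reading and writing binary files, so I wrote a small module (for Python 3.6+).

With binaryfile you would do something like this (I am guessing, since I do not know Fortran):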

import binaryfile

def particle_file(f):
    f.array('group_ids')  # Declare group_ids to be an array (so we can use it in a loop)
    f.skip(4)  # Bytes 1-4
    num_particles = f.count('num_particles', 'group_ids', 4)  # Bytes 5-8
    f.int('num_groups', 4)  # Bytes 9-12
    f.skip(8)  # Bytes 13-20
    for i in range(num_particles):
        f.struct('group_ids', '>f')  # 4 bytes x num_particles
    f.skip(4)

with open('myfile.bin', 'rb') as fh:
    result = binaryfile.read(fh, particle_file)
print(result)

This produces output like the following:

{
    'group_ids': [(1.0,), (0.0,), (2.0,), (0.0,), (1.0,)],
    '__skipped': [b'\x00\x00\x00\x08', b'\x00\x00\x00\x08\x00\x00\x00\x14', b'\x00\x00\x00\x14'],
    'num_particles': 5,
    'num_groups': 3
}

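I used skip() to skip the extra data that Fortran adds, but you may want to add a utility that handles Fortran records properly instead. If you do, a pull request would be welcome.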
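
For readers who want such a record utility without the module, here is a minimal sketch using only the standard library. It is not part of binaryfile. It assumes each record is wrapped in matching 4-byte length markers, as in the question's layout, and defaults to big-endian, which is what the skipped bytes in the sample output above suggest. The name read_fortran_record is hypothetical.

```python
import io
import struct


def read_fortran_record(fh, byteorder=">"):
    """Return the payload of the next length-marked record, or None at end of file."""
    marker = fh.read(4)
    if not marker:
        return None  # clean end of file
    if len(marker) != 4:
        raise ValueError("truncated record marker")
    (length,) = struct.unpack(byteorder + "i", marker)
    payload = fh.read(length)
    trailer = fh.read(4)
    # The trailing marker must repeat the leading one
    if (len(payload) != length or len(trailer) != 4
            or struct.unpack(byteorder + "i", trailer)[0] != length):
        raise ValueError("corrupt record, or wrong byte order")
    return payload


# Synthetic big-endian sample standing in for the real file
sample = struct.pack(">4i", 8, 5, 3, 8) + struct.pack(">i5fi", 20, 1.0, 0.0, 2.0, 0.0, 1.0, 20)
fh = io.BytesIO(sample)

n_particles, n_groups = struct.unpack(">2i", read_fortran_record(fh))
group_ids = struct.unpack(f">{n_particles}f", read_fortran_record(fh))
print(n_particles, n_groups, group_ids)
```

Checking both markers costs little and catches a wrong byte order early, since a byte-swapped length almost never matches the data that follows.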

Solution 6:

If the data is array-like, I like to use numpy.memmap to load it.

Here is an example that loads 1000 samples from 64 channels, stored as two-byte integers.

import numpy as np
mm = np.memmap(filename, np.int16, 'r', shape=(1000, 64))

You can then slice the data along either axis:

mm[5, :] # sample 5, all channels
mm[:, 5] # all samples, channel 5

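All the usual formats are available, including C and Fortran order, various dtypes, endianness, and so on.

Some advantages of this approach:

No data is loaded into memory until it is actually used (that is what memmap is for).
More intuitive syntax (no need to generate a struct.unpack format string of 64000 characters).
The data can be given any shape that makes sense for your application.

For non-array data (for example, compiled code), heterogeneous formats ("10 chars, then 3 ints, then 5 floats, ..."), or similar, one of the other approaches given above probably makes more sense.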
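
Applied to the file from the question, a memmap can map just the group IDs by skipping the first 20 bytes with the offset argument. This is a sketch under assumptions: little-endian data, the 20-byte prefix described in the question, and a synthetic temporary file in place of the real one.

```python
import os
import struct
import tempfile

import numpy as np

# Write a small synthetic file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "particles.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4i", 8, 5, 3, 8))
    f.write(struct.pack("<i5fi", 20, 1.0, 0.0, 2.0, 0.0, 1.0, 20))

# The particle count is needed for the shape, so read the small header normally
with open(path, "rb") as f:
    _, n_particles, n_groups, _ = struct.unpack("<4i", f.read(16))

# Map only the float32 payload: 16 header bytes + 4-byte marker = offset 20
group_ids = np.memmap(path, dtype="<f4", mode="r", offset=20, shape=(n_particles,))
print(group_ids[:3])
```

Giving an explicit shape keeps the trailing 4-byte marker out of the mapped array.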

Solution 7:

#!/usr/bin/python

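import array

# 'f' holds C floats (4 bytes each) in the machine's native byte order
data = array.array('f')
# Open in binary mode and read the first 5 floats from the file
with open('c:\\code\\c_code\\no1.dat', 'rb') as f:
    data.fromfile(f, 5)
print(data)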
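
The array module always reads values in native byte order and knows nothing about record markers, so for the file in the question the markers have to be skipped with seek() and the values byte-swapped when the file's byte order differs from the machine's. The sketch below assumes a little-endian file, a 4-byte C int, and a synthetic temporary file; FILE_BYTEORDER is a hypothetical name.

```python
import array
import struct
import sys
import tempfile

FILE_BYTEORDER = "little"  # assumption: not stated in the question

with tempfile.TemporaryFile() as f:
    # Synthetic stand-in for the real file
    f.write(struct.pack("<4i", 8, 5, 3, 8))
    f.write(struct.pack("<i5fi", 20, 1.0, 0.0, 2.0, 0.0, 1.0, 20))

    counts = array.array("i")
    if counts.itemsize != 4:
        raise RuntimeError("this sketch assumes a 4-byte C int")
    f.seek(4)               # skip the first record marker
    counts.fromfile(f, 2)   # number of particles, number of groups
    if sys.byteorder != FILE_BYTEORDER:
        counts.byteswap()

    group_ids = array.array("f")
    f.seek(20)              # skip the rest of record 1 and the next marker
    group_ids.fromfile(f, counts[0])
    if sys.byteorder != FILE_BYTEORDER:
        group_ids.byteswap()

print(list(counts), list(group_ids))
```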

Solution 8:

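import pickle

# Only works for files written with pickle.dump, not for raw Fortran output
with open("filename.dat", "rb") as f:
    try:
        while True:
            # Load and print objects until the file is exhausted
            x = pickle.load(f)
            print(x)
    except EOFError:
        pass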