Reading the contents of a tar file in a Python script without extracting it

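Problem description:

I have a tar file that contains multiple files. I need to write a Python script that reads the contents of those files and counts the total number of characters, including letters, spaces, newlines, and everything else, without extracting the tar file.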

Solution 1:

You can use getmembers():

>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()

After that, you can use extractfile() to obtain a member as a file object. Just an example:

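import os
import tarfile

os.chdir("/tmp/foo")
tar = tarfile.open("test.tar")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is None:
        # directories and other non-file members have no content
        continue
    content = f.read()  # bytes in Python 3
    print("%s has %d newlines" % (member.name, content.count(b"\n")))
    print("%s has %d spaces" % (member.name, content.count(b" ")))
    # len() of bytes counts bytes; decode first to count characters
    print("%s has %d characters" % (member.name, len(content)))
    break  # just an example: stop after the first regular file
tar.close()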

With the file object f from the example above, you can use read(), readlines(), and so on.
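To connect this back to the question, here is a minimal sketch that totals the characters of every regular file in an archive without extracting anything to disk. The helper names count_characters and build_demo_archive are made up for this illustration, and the sketch assumes the members are UTF-8 text; the demo archive is built in memory only so that the example runs on its own.

```python
import io
import tarfile


def count_characters(tar, encoding="utf-8"):
    """Total characters across all regular files of an open TarFile."""
    total = 0
    for member in tar:
        if not member.isfile():
            continue  # skip directories, links, devices
        with tar.extractfile(member) as f:
            # decode so multi-byte characters count once (assumes text files)
            total += len(f.read().decode(encoding))
    return total


def build_demo_archive():
    """Build a small in-memory tar archive (illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in [("a.txt", "héllo\n"), ("b.txt", "two words\n")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive()) as tar:
    total = count_characters(tar)
print(total)
```

For a real archive you would pass tarfile.open("test.tar") instead of the in-memory buffer. If the files are not text, drop the decode step and count bytes.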

Solution 2:

You need to use the tarfile module. Specifically, use an instance of the TarFile class to access the archive, then call TarFile.getnames() to get the file names.

 |  getnames(self)
 |      Return the members of the archive as a list of their names. It has
 |      the same order as the list returned by getmembers().

If you want to read the contents, you can use this method:

 |  extractfile(self, member)
 |      Extract a member from the archive as a file object. `member' may be
 |      a filename or a TarInfo object. If `member' is a regular file, a
 |      file-like object is returned. If `member' is a link, a file-like
 |      object is constructed from the link's target. If `member' is none of
 |      the above, None is returned.
 |      The file-like object is read-only and provides the following
 |      methods: read(), readline(), readlines(), seek() and tell()
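As the docstring says, extractfile() returns None for members that are neither regular files nor links. A small sketch of what that looks like in practice, and of wrapping the binary file object in io.TextIOWrapper to read decoded lines. The archive contents ("docs", "docs/notes.txt") are invented for the example and built in memory.

```python
import io
import tarfile

# Build a tiny archive containing one directory and one text file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    directory = tarfile.TarInfo("docs")
    directory.type = tarfile.DIRTYPE
    tar.addfile(directory)

    data = b"first line\nsecond line\n"
    info = tarfile.TarInfo("docs/notes.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    dir_obj = tar.extractfile("docs")             # a directory: no file object
    file_obj = tar.extractfile("docs/notes.txt")  # read-only binary file object
    # Wrap the binary object to iterate over decoded text lines.
    text = io.TextIOWrapper(file_obj, encoding="utf-8")
    lines = [line.rstrip("\n") for line in text]

print(dir_obj)
print(lines)
```

Checking for None before calling read() avoids an AttributeError when an archive contains directories.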

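Solution 3:

Previously, this post showed an example that used "dict(zip(...))" to combine the member names with the member list. That was silly and caused the archive to be read more than necessary. To achieve the same thing, we can use a dictionary comprehension: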

index = {i.name: i for i in my_tarfile.getmembers()}

More information on how to use tarfile

Extracting a tarfile member

#!/usr/bin/env python3
import tarfile

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

print(my_tarfile.extractfile('./path/to/file.png').read())

Indexing a tar file

#!/usr/bin/env python3
import tarfile
import pprint

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

index = my_tarfile.getnames()  # a list of strings, each member's name
# or
# index = {i.name: i for i in my_tarfile.getmembers()}

pprint.pprint(index)

Indexing a tar file and reading a dynamically chosen member

#!/usr/bin/env python3

import tarfile
import base64
import textwrap
import random

# note, indexing a tar file requires reading it completely once
# if we want to do anything after indexing it, it must be a file
# that can be seeked (not a stream), so here we open a file we
# can seek
my_tarfile = tarfile.open('/path/to/mytar.tar')


# tarfile.getmembers is similar to os.stat kind of, it will
# give you the member names (i.name) as well as TarInfo attributes:
#
# chksum,devmajor,devminor,gid,gname,linkname,linkpath,
# mode,mtime,name,offset,offset_data,path,pax_headers,
# size,sparse,tarfile,type,uid,uname
#
# here we use a dictionary comprehension to index all TarInfo
# members by the member name
index = {i.name: i for i in my_tarfile.getmembers()}

print(index.keys())

# pick your member
# note: if you can pick your member before indexing the tar file,
# you don't need to index it to read that file, you can directly
# my_tarfile.extractfile(name)
# or my_tarfile.getmember(name)

# pick your filename from the index dynamically
# dict views are not indexable in Python 3, so build a list first
my_file_name = random.choice(list(index))

my_file_tarinfo = index[my_file_name]
my_file_size = my_file_tarinfo.size
my_file_buf = my_tarfile.extractfile( 
    my_file_name
    # or my_file_tarinfo
)

print('file_name: {}'.format(my_file_name))
print('file_size: {}'.format(my_file_size))
print('----- BEGIN FILE BASE64 -----')
print(
    textwrap.fill(
        base64.b64encode(
            my_file_buf.read()
        ).decode(),
        72
    )
)
print('----- END FILE BASE64 -----')

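A tarfile with duplicate members

If we have an oddly constructed tar, in this example one made by appending several versions of the same file to the same archive, we can still work through it carefully. I have annotated which member holds which text. Suppose we want the fourth member (index 3), "capturetheflag\n".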

tar -tf mybadtar.tar
mymember.txt  # "version 1\n"
mymember.txt  # "version 1\n"
mymember.txt  # "version 2\n"
mymember.txt  # "capturetheflag\n"
mymember.txt  # "version 3\n"

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')

# >>> my_tarfile.getnames()
# ['mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt']

# if we use extractfile on a name, we get the last entry: tarfile treats the
# last occurrence of a duplicated name as the most up-to-date version

# >>> my_tarfile.extractfile('mymember.txt').read()
# b'version 3\n'

# >>> my_tarfile.extractfile(my_tarfile.getmembers()[3]).read()
# b'capturetheflag\n'
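The duplicate-name behaviour can be reproduced without an existing file. The following sketch builds a comparable archive in memory (the four versions below are illustrative, not the exact contents of mybadtar.tar) and shows that looking a member up by name yields the last occurrence, while getmembers() keeps every entry in archive order.

```python
import io
import tarfile

versions = [b"version 1\n", b"version 2\n", b"capturetheflag\n", b"version 3\n"]

# Write the same member name several times, as repeated appends would.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for data in versions:
        info = tarfile.TarInfo("mymember.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    # Lookup by name resolves to the last entry with that name.
    by_name = tar.extractfile("mymember.txt").read()
    # Passing TarInfo objects reaches every stored version.
    all_versions = [tar.extractfile(m).read() for m in tar.getmembers()]

print(by_name)
print(all_versions)
```

So an earlier version has to be selected through its TarInfo object (for example by position in getmembers()), not through its name.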

Or we can iterate over the tar file:

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')
# note, if we do anything to the tarfile object that will
# cause a full read, the tarfile.next() method will return None,
# so call next in a loop as the first thing you do if you want to
# iterate

while True:
    my_member = my_tarfile.next()
    if not my_member:
        break
    print((my_member.offset, my_tarfile.extractfile(my_member).read(),))

# (0, b'version 1\n')
# (1024, b'version 1\n')
# (2048, b'version 2\n')
# (3072, b'capturetheflag\n')
# (4096, b'version 3\n')
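Iterating member by member also works when the archive cannot be seeked, such as data arriving over a pipe or socket. A hedged sketch using tarfile's stream mode ("r|*"): the in-memory gzip archive and the member names are stand-ins for a real stream. In stream mode each member should be read while the loop is positioned on it, because the reader cannot go back.

```python
import io
import tarfile

# Stand-in for a stream: a gzip-compressed archive held in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in [("one.txt", b"1\n"), ("two.txt", b"22\n")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

sizes = {}
# "r|*" reads the archive sequentially with transparent compression.
with tarfile.open(fileobj=buf, mode="r|*") as tar:
    for member in tar:
        if member.isfile():
            # read the member now; it is not reachable after moving on
            sizes[member.name] = len(tar.extractfile(member).read())

print(sizes)
```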

Solution 4:

You can use getnames().

Here is a code sample you can use:

import tarfile
tar_open = tarfile.open("filename.tar")
contents = tar_open.getnames()

# contents will contain a list of filenames inside filename.tar
print(contents)

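Solution 5:

You can use TarFile.list(), for example:

import tarfile

filename = "abc.tar.bz2"
with tarfile.open(filename, mode='r:bz2') as f1:
    f1.list()  # prints the table of contents to stdout and returns None

list() only prints the listing. If you want to manipulate that data or write it to a file, as your requirements dictate, redirect the output or collect the names with getnames().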
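Since list() writes to standard output rather than returning data, one way to get hold of the listing is to redirect stdout while it runs, or to call getnames() for a plain list. A minimal sketch, assuming a bz2-compressed archive; the archive here is generated in memory with made-up member names so the example is runnable.

```python
import contextlib
import io
import tarfile

# Build a small bz2-compressed archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
    for name in ("a.txt", "b.txt"):
        data = b"x"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

captured = io.StringIO()
with tarfile.open(fileobj=buf, mode="r:bz2") as tar:
    # list() prints; capture what it prints instead of its return value.
    with contextlib.redirect_stdout(captured):
        result = tar.list(verbose=False)
    names = tar.getnames()  # the same names as a Python list

listing = captured.getvalue()
print(result)
print(names)
```

The captured text in listing can then be written to a file; names is usually the easier form to process further.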

Solution 6:

import tarfile

targzfile = "path to the file"

tar = tarfile.open(targzfile)

for item in tar.getnames():
    if "README.txt" in item:
        file_content = tar.extractfile(item).read()
        fileout = open("output file path", 'wb')
        fileout.write(file_content)
        fileout.close()
        break
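For large members, reading everything into memory with read() can be avoided by copying in chunks. The following is a sketch of the same idea using shutil.copyfileobj. The archive, the member name "project/README.txt", and the temporary output path are all assumptions made so the example runs on its own.

```python
import io
import os
import posixpath
import shutil
import tarfile
import tempfile

# Build a small gzip-compressed archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"read me first\n"
    info = tarfile.TarInfo("project/README.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

out_path = os.path.join(tempfile.mkdtemp(), "README.copy.txt")
with tarfile.open(fileobj=buf) as tar:
    for member in tar:
        # member names inside a tar archive use "/" as the separator
        if member.isfile() and posixpath.basename(member.name) == "README.txt":
            with tar.extractfile(member) as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)  # chunked copy, no full read()
            break

with open(out_path, "rb") as f:
    copied = f.read()
print(copied)
```

Matching on the base name instead of using a substring test also avoids picking up unrelated entries such as "README.txt.bak".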