How to handle multi-level column names downloaded with yfinance

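Problem description:

I have a list of ticker symbols (tickerStrings) that I need to download all at once. When I read the saved CSV file back with pandas' read_csv, the result does not have the same structure as the data I get directly from yfinance.

I usually access my data with code like data['AAPL'] or data['AAPL'].Close, but when I read the data from the CSV file, that no longer works.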

if path.exists(data_file):
    data = pd.read_csv(data_file, low_memory=False)
    data = pd.DataFrame(data)
    print(data.head())
else:
    data = yf.download(tickerStrings, group_by="Ticker", period=prd, interval=intv)
    data.to_csv(data_file)

This is the printed output:

                  Unnamed: 0                 OLN               OLN.1               OLN.2               OLN.3  ...                 W.1                 W.2                 W.3                 W.4     W.5
0                        NaN                Open                High                 Low               Close  ...                High                 Low               Close           Adj Close  Volume
1                   Datetime                 NaN                 NaN                 NaN                 NaN  ...                 NaN                 NaN                 NaN                 NaN     NaN
2  2020-06-25 09:30:00-04:00    11.1899995803833  11.220000267028809  11.010000228881836  11.079999923706055  ...   201.2899932861328   197.3000030517578  197.36000061035156  197.36000061035156  112156
3  2020-06-25 09:45:00-04:00  11.130000114440918  11.260000228881836  11.100000381469727   11.15999984741211  ...  200.48570251464844  196.47999572753906  199.74000549316406  199.74000549316406   83943
4  2020-06-25 10:00:00-04:00  11.170000076293945  11.220000267028809  11.119999885559082  11.170000076293945  ...  200.49000549316406  198.19000244140625   200.4149932861328   200.4149932861328   88771

The error raised when trying to access the data:

Traceback (most recent call last):
File "getdata.py", line 49, in processData
    avg = data[x].Close.mean()
AttributeError: 'Series' object has no attribute 'Close'

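Solution 1:

When working with financial data for multiple tickers using yfinance and pandas, the process breaks down into a few key steps: downloading the data, organizing it in a structured format, and accessing it in the way you need. The answer below is organized into sections along those lines.

Downloading data for multiple tickers

Download directly and create a DataFrame

Single ticker, single DataFrame approach:

+ For a single ticker, the DataFrame downloaded directly from `yfinance` has single-level column names but no ticker column (at least in the yfinance versions this answer was written for; newer releases may return multi-level columns even for one ticker). By iterating over the tickers, adding a ticker column to each result, and then combining the results into a single DataFrame, the data for each ticker stays clearly labeled.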

import yfinance as yf
import pandas as pd

tickerStrings = ['AAPL', 'MSFT']
df_list = []
for ticker in tickerStrings:
    data = yf.download(ticker, group_by="Ticker", period='2d')
    data['ticker'] = ticker  # Add ticker column
    df_list.append(data)

# Combine all dataframes into a single dataframe
df = pd.concat(df_list)
df.to_csv('ticker.csv')

Condensed single DataFrame approach:

+ A list comprehension achieves the same result as above in one line, simplifying the download and merge steps.

# Download 2 days of data for each ticker in tickerStrings, add a 'ticker' column for identification, and concatenate into a single DataFrame.
# ignore_index is left at its default so the Date index is kept, matching the loop version above.
df = pd.concat([yf.download(ticker, group_by="Ticker", period='2d').assign(ticker=ticker) for ticker in tickerStrings])
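Once the data is in this long format, per-ticker access replaces the data['AAPL'].Close pattern from the question. The following is a minimal sketch on a synthetic frame (the prices are made up and no download is performed); it assumes a 'Close' column and the 'ticker' column added above.

```python
import pandas as pd

# Synthetic stand-in for the combined DataFrame; the values are made up for illustration.
dates = pd.to_datetime(['2024-01-02', '2024-01-03'])
aapl = pd.DataFrame({'Close': [10.0, 12.0], 'Volume': [100, 200]}, index=dates).assign(ticker='AAPL')
msft = pd.DataFrame({'Close': [20.0, 24.0], 'Volume': [300, 400]}, index=dates).assign(ticker='MSFT')
df = pd.concat([aapl, msft]).rename_axis('Date')

# Rows for one ticker: filter on the 'ticker' column.
aapl_rows = df[df['ticker'] == 'AAPL']

# Mean close per ticker, the long-format counterpart of data[x].Close.mean().
avg_close = df.groupby('ticker')['Close'].mean()
print(avg_close)
```

Filtering gives one ticker at a time, while groupby computes the statistic for every ticker in one pass.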

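Multiple tickers, structured DataFrame approach

When data for several tickers is downloaded at once, yfinance groups the data by ticker and returns a DataFrame with multi-level column headers. This structure can be reorganized for easier access.

Splitting the column levels: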

# Define a list of ticker symbols to download
tickerStrings = ['AAPL', 'MSFT']

# Download 2 days of data for each ticker, grouping by 'Ticker' to structure the DataFrame with multi-level columns
df = yf.download(tickerStrings, group_by='Ticker', period='2d')

# Transform the DataFrame: stack the ticker symbols to create a multi-index (Date, Ticker), then reset the 'Ticker' level to turn it into a column
df = df.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)
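Before any reshaping, the multi-level frame can also be queried directly, which is the access pattern the question relies on. This is a small sketch on a synthetic frame shaped like a group_by='Ticker' download; the values and the reduced set of price columns are assumptions made for illustration.

```python
import pandas as pd

# Synthetic frame with the ticker on the top column level; values are made up.
columns = pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']])
dates = pd.to_datetime(['2024-01-02', '2024-01-03'])
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], index=dates, columns=columns)
df.index.name = 'Date'

aapl = df['AAPL']                            # DataFrame with columns Open and Close
aapl_close = df['AAPL']['Close']             # Series of AAPL closing prices
all_close = df.xs('Close', axis=1, level=1)  # one Close column per ticker

print(all_close)
```

Selecting a top-level key drops that level, so df['AAPL'] behaves like a single-ticker DataFrame.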

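Handling CSV files with multi-level column names

To read a CSV file of saved yfinance data (which usually contains multi-level column headers), some adjustments are needed so that the DataFrame can be accessed in the desired format.

Reading and adjusting the multi-level columns: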

# Read the CSV file. The file has multi-level headers, hence header=[0, 1].
df = pd.read_csv('test.csv', header=[0, 1])

# Drop the first row as it contains only the Date information in one column, which is redundant after setting the index.
df.drop(index=0, inplace=True)

# Convert the 'Unnamed: 0_level_0', 'Unnamed: 0_level_1' column (which represents dates) to datetime format.
# This assumes the dates are in the 'YYYY-MM-DD' format.
df[('Unnamed: 0_level_0', 'Unnamed: 0_level_1')] = pd.to_datetime(df[('Unnamed: 0_level_0', 'Unnamed: 0_level_1')])

# Set the datetime column as the index of the DataFrame. This makes time series analysis more straightforward.
df.set_index(('Unnamed: 0_level_0', 'Unnamed: 0_level_1'), inplace=True)

# Clear the name of the index to avoid confusion, as it previously referred to the multi-level column names.
df.index.name = None
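To see why the header argument matters, the sketch below reads the same small CSV twice: once with the defaults, which reproduces the question's AttributeError, and once with header=[0, 1]. The CSV text is a made-up sample in the two-header-row layout shown later in this answer, not real market data.

```python
import io

import pandas as pd

# Made-up CSV in the layout produced by saving a multi-ticker DataFrame.
csv_text = (
    ",AAPL,AAPL,MSFT,MSFT\n"
    ",Open,Close,Open,Close\n"
    "Date,,,,\n"
    "2024-01-02,1.0,2.0,3.0,4.0\n"
    "2024-01-03,5.0,6.0,7.0,8.0\n"
)

# Default read: only the first row is a header, so repeated ticker names get numeric suffixes.
flat = pd.read_csv(io.StringIO(csv_text))
print(list(flat.columns))  # ['Unnamed: 0', 'AAPL', 'AAPL.1', 'MSFT', 'MSFT.1']

# flat['AAPL'] is a single Series, so flat['AAPL'].Close raises AttributeError.
print(type(flat['AAPL']).__name__)

# Reading both header rows restores the (ticker, field) column structure.
multi = pd.read_csv(io.StringIO(csv_text), header=[0, 1])
# Without index_col, the 'Date' row is read as a data row; dropna() skips it here.
print(multi['AAPL']['Close'].dropna().tolist())
```

The leftover 'Date' row is the reason the code above drops index 0 before converting the date column.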

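Flattening multi-level columns for easier access

Depending on the initial structure of the DataFrame, multi-level columns may need to be flattened to a single level to make the dataset clearer and simpler to work with.

Flattening and reorganizing by ticker level:

For DataFrames where the ticker symbol is at the top level of the column headers: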

df.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)

If the ticker symbol is at the bottom level:

df.stack(level=1).rename_axis(['Date', 'Ticker']).reset_index(level=1)
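The two variants can be compared on a synthetic frame. In this sketch the values are made up, and swaplevel is used only to build the "ticker at the bottom" layout from the same data; both calls should yield the same long table. Recent pandas releases may print a FutureWarning about the stack implementation.

```python
import pandas as pd

dates = pd.to_datetime(['2024-01-02', '2024-01-03'])

# Ticker on the top column level; values are made up.
ticker_top = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']]),
)

# Same data with the ticker moved to the bottom column level.
ticker_bottom = ticker_top.swaplevel(axis=1)

# Stack whichever level holds the ticker, then turn it into a regular column.
long_a = ticker_top.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)
long_b = ticker_bottom.stack(level=1).rename_axis(['Date', 'Ticker']).reset_index(level=1)

print(long_a)
```

The only difference between the two calls is the level passed to stack, which has to match wherever the ticker symbols sit.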

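Managing individual ticker files

If you prefer to manage each ticker's data separately, downloading each ticker and saving it to its own file is a straightforward approach.

Downloading and saving data for individual tickers: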

for ticker in tickerStrings:
    # Downloads historical market data from Yahoo Finance for the specified ticker.
    # 'prd' and 'intv' are string variables holding the period and interval (e.g. prd='2d', intv='15m'); define them before the loop.
    data = yf.download(ticker, group_by="Ticker", period=prd, interval=intv)

    # Adds a new column named 'ticker' to the DataFrame. This column is filled with the ticker symbol.
    # This step is helpful for identifying the source ticker when multiple DataFrames are combined or analyzed separately.
    data['ticker'] = ticker

    # Saves the DataFrame to a CSV file. The file name is dynamically generated using the ticker symbol,
    # allowing each ticker's data to be saved in a separate file for easy access and identification.
    # For example, if the ticker symbol is 'AAPL', the file will be named 'ticker_AAPL.csv'.
    data.to_csv(f'ticker_{ticker}.csv')

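Combining multiple ticker files into a single DataFrame

If the data for each ticker is stored in a separate file, the files can be read and concatenated into a single DataFrame.

Reading multiple files into one DataFrame: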

# Import the Path class from the pathlib module, which provides object-oriented filesystem paths
from pathlib import Path

# Create a Path object 'p' that represents the directory containing the CSV files
p = Path('path_to_files')

# Use the .glob method to create an iterator over all files in the 'p' directory that match the pattern 'ticker_*.csv'.
# This pattern will match any files that start with 'ticker_' and end with '.csv', which are presumably files containing ticker data.
files = p.glob('ticker_*.csv')

# Read each CSV file matched by the glob pattern into a separate pandas DataFrame, then concatenate all these DataFrames into one.
# The 'ignore_index=True' parameter is used to reindex the new DataFrame, preventing potential index duplication.
# This results in a single DataFrame 'df' that combines all the individual ticker data files into one comprehensive dataset.
df = pd.concat([pd.read_csv(file) for file in files], ignore_index=True)
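The save-then-combine cycle can be tried without any downloads. This sketch writes two made-up single-ticker frames to a temporary directory and reads them back; the temporary directory, the sorted() call, and parse_dates are choices made for this illustration rather than part of the code above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up single-level frames standing in for per-ticker downloads.
dates = pd.to_datetime(['2024-01-02', '2024-01-03'])
frames = {
    'AAPL': pd.DataFrame({'Close': [10.0, 12.0]}, index=dates),
    'MSFT': pd.DataFrame({'Close': [20.0, 24.0]}, index=dates),
}

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp)
    for ticker, frame in frames.items():
        # Name the index so the CSV gets a 'Date' header, then tag each row with its ticker.
        frame.rename_axis('Date').assign(ticker=ticker).to_csv(p / f'ticker_{ticker}.csv')

    # sorted() makes the file order, and therefore the row order, deterministic.
    files = sorted(p.glob('ticker_*.csv'))
    df = pd.concat([pd.read_csv(f, parse_dates=['Date']) for f in files], ignore_index=True)

print(df)
```

Because ignore_index=True renumbers the rows, the dates survive as an ordinary 'Date' column instead of the index.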

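This structured approach means that, regardless of the initial data format or how the data is stored, you can organize and access financial data for multiple tickers effectively with yfinance and pandas.

Data representation overview

This section shows examples of financial data represented with multi-level and single-level columns. Seeing both layouts helps in understanding how the data structure affects analysis.

Multi-level column data

Multi-level column data can be complex, but it allows related data to be organized under broader categories. This structure is particularly useful for datasets in which each entity (for example, a ticker) has several attributes (for example, open, high, low, and close prices).

Example: DataFrame with multi-level columns

Below is a sample DataFrame showing multi-level column data for two tickers, AAPL and MSFT. Each ticker has several attributes: Open, High, Low, Close, Adj Close, and Volume. The MSFT columns are NaN here because these dates precede the first MSFT row in the data.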

                AAPL                                                    MSFT                                
                Open      High       Low     Close Adj Close     Volume Open High Low Close Adj Close Volume
Date                                                                                                        
1980-12-12  0.513393  0.515625  0.513393  0.513393  0.405683  117258400  NaN  NaN NaN   NaN       NaN    NaN
1980-12-15  0.488839  0.488839  0.486607  0.486607  0.384517   43971200  NaN  NaN NaN   NaN       NaN    NaN
1980-12-16  0.453125  0.453125  0.450893  0.450893  0.356296   26432000  NaN  NaN NaN   NaN       NaN    NaN
1980-12-17  0.462054  0.464286  0.462054  0.462054  0.365115   21610400  NaN  NaN NaN   NaN       NaN    NaN
1980-12-18  0.475446  0.477679  0.475446  0.475446  0.375698   18362400  NaN  NaN NaN   NaN       NaN    NaN

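Example: CSV format of multi-level columns

Representing the DataFrame above as CSV brings its own challenge, as shown below. The multi-level structure is flattened into two header rows, followed by a row that holds only the index name (Date) and then the data rows.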

,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,MSFT,MSFT,MSFT,MSFT,MSFT,MSFT
,Open,High,Low,Close,Adj Close,Volume,Open,High,Low,Close,Adj Close,Volume
Date,,,,,,,,,,,,
1980-12-12,0.5133928656578064,0.515625,0.5133928656578064,0.5133928656578064,0.40568336844444275,117258400,,,,,,
1980-12-15,0.4888392984867096,0.4888392984867096,0.4866071343421936,0.4866071343421936,0.3845173120498657,43971200,,,,,,
1980-12-16,0.453125,0.453125,0.4508928656578064,0.4508928656578064,0.3562958240509033,26432000,,,,,,

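Single-level column data

For datasets in which every entity shares a uniform set of attributes, a single-level column structure is a good fit. This simpler format is easier to manipulate and analyze, which makes it a common choice.

Example: DataFrame with single-level columns

Below is a sample DataFrame showing single-level column data for the MSFT ticker. It includes the Open, High, Low, Close, Adj Close, and Volume attributes, plus the ticker symbol for each row. Every attribute can be accessed directly by its column name.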

                Open      High       Low     Close  Adj Close      Volume ticker
Date                                                                            
1986-03-13  0.088542  0.101562  0.088542  0.097222   0.062205  1031788800   MSFT
1986-03-14  0.097222  0.102431  0.097222  0.100694   0.064427   308160000   MSFT
1986-03-17  0.100694  0.103299  0.100694  0.102431   0.065537   133171200   MSFT
1986-03-18  0.102431  0.103299  0.098958  0.099826   0.063871    67766400   MSFT
1986-03-19  0.099826  0.100694  0.097222  0.098090   0.062760    47894400   MSFT

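Example: CSV format of single-level columns

Exporting single-level column data to CSV produces a simple, readable file. Each row corresponds to one date, and each column header names one attribute of the stock data, which makes the file easy to use for both people and software.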

Date,Open,High,Low,Close,Adj Close,Volume,ticker
1986-03-13,0.0885416641831398,0.1015625,0.0885416641831398,0.0972222238779068,0.0622050017118454,1031788800,MSFT
1986-03-14,0.0972222238779068,0.1024305522441864,0.0972222238779068,0.1006944477558136,0.06442664563655853,308160000,MSFT
1986-03-17,0.1006944477558136,0.1032986119389534,0.1006944477558136,0.1024305522441864,0.0655374601483345,133171200,MSFT
1986-03-18,0.1024305522441864,0.1032986119389534,0.0989583358168602,0.0998263880610466,0.06387123465538025,67766400,MSFT
1986-03-19,0.0998263880610466,0.1006944477558136,0.0972222238779068,0.0980902761220932,0.06276042759418488,47894400,MSFT

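These examples show how single-level column data is organized. Whether as a DataFrame or as CSV, the single-level structure keeps data processing and analysis straightforward.

Solution 2:

Turn it into a dictionary, d[ticker] = df: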

df = yf.download(tickers, group_by="ticker")
d = {idx: gp.xs(idx, level=0, axis=1) for idx, gp in df.groupby(level=0, axis=1)}
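Grouping with axis=1 is deprecated in recent pandas releases, so the same dictionary can also be built by selecting each top-level key directly. This is a sketch on a synthetic frame with made-up values, assuming the ticker is on the top column level as with group_by="ticker".

```python
import pandas as pd

# Synthetic frame with the ticker on the top column level; values are made up.
columns = pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']])
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=pd.to_datetime(['2024-01-02', '2024-01-03']),
    columns=columns,
)

# Selecting a top-level key drops that level, leaving one single-level frame per ticker.
d = {ticker: df[ticker] for ticker in df.columns.get_level_values(0).unique()}

print(d['AAPL'].Close.mean())
```

Each value in d is an ordinary single-level DataFrame, so d['AAPL'].Close works the way the question expects.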

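Solution 3:

Another option is to keep the pandas DataFrame but drop the data you don't need, that is, change the column index from a MultiIndex to a single-level index. Since you only care about the "Close" column, the first step is to drop the other columns: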

df = yf.download(...)
df = df[['Close']]

That works, but it leaves every column with a multi-level label such as (Close, AAPL) or (Close, MSFT). What you really want is just the ticker symbol.

df.columns = [col[1] for col in df.columns]

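Now, if you want to split the DataFrame into a separate object for each column, a list comprehension will do it. Each element of the resulting list is a Series.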

separated = [df.iloc[:,i] for i in range(len(df.columns))]
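A slightly shorter route to the same result is to select 'Close' with a single pair of brackets, which drops the top level in one step. The sketch below uses a synthetic frame with made-up values and assumes the price field is on the top column level, as the code in this solution implies; it stores the pieces in a dictionary keyed by ticker instead of a list.

```python
import pandas as pd

# Synthetic frame with the price field on top and the ticker below; values are made up.
columns = pd.MultiIndex.from_product([['Close', 'Open'], ['AAPL', 'MSFT']])
df = pd.DataFrame(
    [[2.0, 4.0, 1.0, 3.0], [6.0, 8.0, 5.0, 7.0]],
    index=pd.to_datetime(['2024-01-02', '2024-01-03']),
    columns=columns,
)

# Single brackets select the 'Close' group and drop that level: columns become ticker symbols.
close = df['Close']

# One Series per ticker, keyed by symbol so lookups do not depend on column position.
separated = {ticker: close[ticker] for ticker in close.columns}

print(close)
```

Keying by ticker avoids having to remember which position in the list belongs to which symbol.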

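Solution 4:

Use the lines below to write and read the CSV file. The DataFrame read back has the same multi-level column layout as the one downloaded through the yfinance API.

Writing the file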

data.to_csv('file_loc')

Reading the file

data = pd.read_csv('file_loc', header=[0, 1], index_col=[0])
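The round trip can be checked without touching the network. This sketch writes a synthetic multi-level frame (made-up values) to an in-memory buffer instead of 'file_loc' and reads it back with the same arguments; the explicit pd.to_datetime call is an addition for this example, since the index is read back as text.

```python
import io

import pandas as pd

# Synthetic frame with the ticker on the top column level; values are made up.
columns = pd.MultiIndex.from_product([['AAPL', 'MSFT'], ['Open', 'Close']])
data = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    index=pd.to_datetime(['2024-01-02', '2024-01-03']),
    columns=columns,
)
data.index.name = 'Date'

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
data.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] rebuilds the two column levels; index_col=[0] restores the Date index.
restored = pd.read_csv(buffer, header=[0, 1], index_col=[0])

# The index is read back as text, so convert it to datetimes explicitly.
restored.index = pd.to_datetime(restored.index)

print(restored['AAPL'].Close.mean())
```

After this, restored['AAPL'] and restored['AAPL'].Close behave as they do on the freshly downloaded frame.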