Analyzing Chords with CQT - TadaoYamaoka's Development Diary

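Extracting the constituent notes of a chord from a mixed recording is not easy. The classical approach takes a feature such as the CQT (constant-Q transform) or HPCP (harmonic pitch class profile), folds it into pitch classes, and estimates the chord from that.
With this approach, harmonics (overtones) can be mistaken for chord tones, which makes it hard to improve accuracy.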
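To make the harmonics problem concrete, here is a small sketch (not from the original post) that lists the nearest pitch class for the first few harmonics of a single note. The helper name `harmonic_pitch_classes` and the choice of six harmonics are illustrative assumptions.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def harmonic_pitch_classes(midi_note, n_harmonics=6):
    """Return the pitch-class name nearest to each harmonic of one note."""
    f0 = 440.0 * 2.0 ** ((midi_note - 69) / 12.0)   # equal-tempered fundamental
    harmonics = f0 * np.arange(1, n_harmonics + 1)  # integer multiples of f0
    midi = 69 + 12 * np.log2(harmonics / 440.0)     # back to fractional MIDI numbers
    return [NOTE_NAMES[int(round(m)) % 12] for m in midi]


print(harmonic_pitch_classes(60))  # C4
# ['C', 'C', 'G', 'C', 'E', 'G']
```

A single C therefore also puts energy into the G and E pitch classes, which together spell a C major triad, so one note can look like a chord once the bins are folded into pitch classes.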

The most accurate recent methods are based on deep learning, such as Deep Chroma.

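Deep-learning-based methods are computationally heavy and not well suited to real-time analysis on a smartphone, so I looked for a lightweight method that still gives reasonable accuracy and found one described in "Approximate Note Transcription for the Improved Identification of Difficult Chords".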

NNLS Chroma

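The method in the paper above estimates chords from a feature called NNLS Chroma.

NNLS stands for Non-Negative Least Squares: a least-squares problem constrained so that every component of the solution is non-negative (zero is allowed).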
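Written out for a single frame, the problem looks roughly like the following. The symbols E (note dictionary) and x (note activations) follow the variable names in the code later in this post; y, standing for one frame of the preprocessed CQT spectrum, is an assumed label.

```latex
\hat{x} = \operatorname*{arg\,min}_{x \in \mathbb{R}^{n}} \lVert E x - y \rVert_2^2
\quad \text{subject to} \quad x_i \ge 0 \ \ \text{for all } i
```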

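Starting from the CQT spectrum, it solves an optimization problem to determine which of the notes from A0 to G#7 are sounding (the 84 semitones, MIDI 21 to 104, covered by the 7 octaves used in the code below).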

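The spectrum is modeled as a mixture of several notes, each with an idealized harmonic (overtone) profile, and the problem is to find how strongly each of those notes is sounding.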
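The idea can be tried on a toy problem. The sketch below is not the post's `_build_note_dictionary` (which is not shown); `build_toy_dictionary`, the Gaussian peak shape, the two-octave range, and the eight harmonics are all assumptions for illustration. Only the geometric decay of harmonic k as `s**(k-1)` with s = 0.6 mirrors the `s_param` used later.

```python
import numpy as np
from scipy.optimize import nnls


def build_toy_dictionary(n_notes=24, bins_per_semitone=3, n_harmonics=8, s=0.6):
    """Hypothetical note dictionary: rows are log-frequency bins, columns are notes."""
    n_bins = n_notes * bins_per_semitone
    bins = np.arange(n_bins)
    E = np.zeros((n_bins, n_notes))
    for note in range(n_notes):
        for k in range(1, n_harmonics + 1):
            # Bin position of the k-th harmonic on the log-frequency axis
            center = (note + 12.0 * np.log2(k)) * bins_per_semitone + bins_per_semitone // 2
            # Narrow Gaussian peak (assumed shape) with height s**(k-1)
            E[:, note] += s ** (k - 1) * np.exp(-0.5 * ((bins - center) / 0.7) ** 2)
    return E


E = build_toy_dictionary()

# Synthetic noiseless "spectrum": a major triad on notes 0, 4 and 7, equally loud
x_true = np.zeros(E.shape[1])
x_true[[0, 4, 7]] = 1.0
y = E @ x_true

# Reading the spectrum at each note's center bin also picks up harmonics
center_bins = np.arange(E.shape[1]) * 3 + 1
naive = y[center_bins]

# NNLS explains the same spectrum with non-negative note activations
x_hat, residual = nnls(E, y)

print("naive energy one octave above the root:", round(naive[12], 3))
print("NNLS activation one octave above the root:", round(x_hat[12], 3))
print("notes found by NNLS:", np.flatnonzero(x_hat > 0.5))
```

On this noiseless input, NNLS should attribute the energy one octave above the root to the root's second harmonic rather than to a separate note. Real spectra are noisy and never match the dictionary exactly, so the separation is only approximate in practice.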

NNLS has no closed-form solution in general, so it has to be solved with an iterative algorithm.

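Algorithms for solving NNLS include the one in Solving Least Squares Problems (Lawson, C.L. & Hanson, R.J.) and its refinement, A Fast Non-Negativity-Constrained Least Squares Algorithm (Bro, R. & de Jong, S.).
The latter is implemented in scipy.optimize.nnls, although older SciPy releases wrapped the original Lawson-Hanson routine, so the exact algorithm depends on the installed version.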
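As a quick illustration of what the constraint changes, the sketch below compares `scipy.optimize.nnls` with ordinary least squares on a tiny made-up system (the matrix `A` and vector `b` are arbitrary example values, not data from this post).

```python
import numpy as np
from scipy.optimize import nnls

# Two orthogonal columns, so the solution is easy to check by hand
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([2.0, 1.0, -1.0])

# Unconstrained least squares is free to use a negative coefficient
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# NNLS returns the solution and the residual norm ||A x - b||
x_nn, rnorm = nnls(A, b)

print("least squares:", x_ls)
print("NNLS:", x_nn, "residual norm:", rnorm)
```

The unconstrained solution has a negative second component, while NNLS pins it at zero and accepts a larger residual. In the chord setting this means a note can never be given a negative loudness to cancel out another note's harmonics.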

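Experiment

We compare pitch classes folded directly from the CQT with those obtained through NNLS Chroma.

CQT only

First, here is the result of folding the CQT into pitch classes.

The audio is nutcracker, the example clip bundled with librosa ("Dance of the Sugar Plum Fairy" from The Nutcracker).
The piece is in E minor.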

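import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the example clip bundled with librosa, resampled to 11025 Hz mono
file_path = librosa.ex("nutcracker")
target_sr = 11025
y, sr = librosa.load(file_path, sr=target_sr, mono=True)

# Constant-Q transform: 7 octaves at 36 bins per octave (3 bins per semitone).
# chroma_cqt expects a magnitude spectrogram, so take the absolute value.
C = np.abs(
    librosa.cqt(
        y=y,
        sr=sr,
        hop_length=2048,
        bins_per_octave=36,
        n_bins=36 * 7,
    )
)
# Fold the CQT bins into 12 pitch classes
librosa_chroma = librosa.feature.chroma_cqt(
    C=C, sr=sr, hop_length=2048, bins_per_octave=36
)

librosa.display.specshow(
    librosa_chroma,
    sr=sr,
    hop_length=2048,
    x_axis="time",
    y_axis="chroma",
    cmap="viridis",
)
plt.show()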

Result

NNLS Chroma

Next, here is the result of folding NNLS Chroma into pitch classes.

Following the paper, the NNLS implementation computes Chroma and Bass chroma separately.

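from dataclasses import dataclass

import librosa
import numpy as np
from scipy.optimize import nnls

# Note: _running_mean_std_1d, _build_note_dictionary, TREBLE_WINDOW and
# BASS_WINDOW are used below but are not shown in this post.


@dataclass
class NNLSChroma: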
    sr: int = 11025
    hop_length: int = 2048
    bpo: int = 36
    n_octaves: int = 7
    s_param: float = 0.6
    preprocessing: str = "std"
    chroma_norm: int = 3

    def compute(self, y):
        n_bins = self.bpo * self.n_octaves
        fmin = librosa.midi_to_hz(21)
        tuning = librosa.estimate_tuning(y=y, sr=self.sr, bins_per_octave=self.bpo)
        C = librosa.cqt(
            y=y,
            sr=self.sr,
            hop_length=self.hop_length,
            fmin=fmin,
            n_bins=n_bins,
            bins_per_octave=self.bpo,
            tuning=tuning,
        )
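        Y = np.abs(C).T  # magnitude spectrogram, shape (n_frames, n_bins)
        half_span = 18  # half an octave at the default 36 bins per octave
        proc_Y = np.zeros_like(Y)
        for t in range(Y.shape[0]):
            # Subtract the running mean along frequency and clip negatives to zero
            mu, std = _running_mean_std_1d(Y[t], half_span)
            sub = Y[t] - mu
            sub[sub < 0] = 0.0
            if self.preprocessing == "std":
                # Also divide by the running standard deviation
                std_safe = np.where(std > 1e-6, std, 1.0)
                proc_Y[t] = sub / std_safe
            else:  # "o" (original spectrum) or "sub" (mean subtraction only)
                proc_Y[t] = Y[t] if self.preprocessing == "o" else sub
        # Note dictionary E: one column per semitone, holding that note's
        # idealized harmonic profile on the CQT bin axis
        E = _build_note_dictionary(
            n_bins, bpo=self.bpo, n_octaves=self.n_octaves, s=self.s_param, fmin=fmin
        )
        n_semitones = self.n_octaves * 12
        S = np.zeros((proc_Y.shape[0], n_semitones), dtype=float)
        for t in range(proc_Y.shape[0]):
            # Solve min ||E x - proc_Y[t]|| subject to x >= 0, frame by frame
            x, _ = nnls(E, proc_Y[t])
            S[t] = x
        # Fold the semitone activations into 12 pitch classes, weighted by
        # the treble and bass windows
        chroma = np.zeros((S.shape[0], 12), dtype=float)
        bass_chroma = np.zeros((S.shape[0], 12), dtype=float)
        pc_offset = 21 % 12  # 9, because fmin is A0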
        for t in range(S.shape[0]):
            for i in range(n_semitones):
                chroma[t, (i + pc_offset) % 12] += S[t, i] * TREBLE_WINDOW[i]
                bass_chroma[t, (i + pc_offset) % 12] += S[t, i] * BASS_WINDOW[i]
            # Normalize each frame in place: 1 = max, 2 = sum (L1), 3 = Euclidean (L2)
            for v in (chroma[t], bass_chroma[t]):
                if self.chroma_norm == 1:
                    m = v.max()
                    if m > 1e-6:
                        v /= m
                elif self.chroma_norm == 2:
                    s_norm = v.sum()
                    if s_norm > 1e-6:
                        v /= s_norm
                elif self.chroma_norm == 3:
                    n = np.linalg.norm(v)
                    if n > 1e-6:
                        v /= n
        times = librosa.frames_to_time(
            np.arange(S.shape[0]), sr=self.sr, hop_length=self.hop_length
        )
        return times, chroma, bass_chroma, S, tuning

extractor = NNLSChroma(
    sr=target_sr, preprocessing="std", s_param=0.6, chroma_norm=3
)
times, nnls_chroma, nnls_bass_chroma, semitone, tuning = extractor.compute(y)
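The class calls `_running_mean_std_1d(Y[t], half_span)`, which the post does not show. One plausible reading, assuming it returns the local mean and standard deviation over a window of `half_span` bins on each side with the window truncated at the edges, is sketched below. The name `running_mean_std_1d` and the edge handling are assumptions, not the author's implementation.

```python
import numpy as np


def running_mean_std_1d(x, half_span):
    """Mean and std of x over a sliding window of up to 2 * half_span + 1 samples."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu = np.empty(n)
    std = np.empty(n)
    for i in range(n):
        # Window is clipped at both ends of the array (assumed edge handling)
        lo = max(0, i - half_span)
        hi = min(n, i + half_span + 1)
        window = x[lo:hi]
        mu[i] = window.mean()
        std[i] = window.std()
    return mu, std


mu, std = running_mean_std_1d(np.arange(5.0), half_span=1)
print(mu)
print(std)
```

With `half_span=18` and 36 bins per octave the window is roughly one octave wide in total, so subtracting the mean removes the broad spectral envelope while keeping narrow peaks.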

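import matplotlib.pyplot as plt

# Three stacked panels: CQT chroma (for comparison), NNLS chroma, NNLS bass chroma
fig, axes = plt.subplots(3, 1, figsize=(10, 9), sharex=True)

librosa.display.specshow(
    librosa_chroma,
    sr=sr,
    hop_length=2048,
    x_axis="time",
    y_axis="chroma",
    ax=axes[0],
    cmap="viridis",
)

librosa.display.specshow(
    nnls_chroma.T,
    sr=sr,
    hop_length=extractor.hop_length,
    x_axis="time",
    y_axis="chroma",
    ax=axes[1],
    cmap="viridis",
)

librosa.display.specshow(
    nnls_bass_chroma.T,
    sr=sr,
    hop_length=extractor.hop_length,
    x_axis="time",
    y_axis="chroma",
    ax=axes[2],
    cmap="viridis",
)
plt.tight_layout()
plt.show()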

Result

Note: the CQT result is shown alongside for comparison.