| Abstract | Optical Music Recognition (OMR) has seen considerable advancements with the adoption of Transformer-based architectures, following the successes witnessed in related fields like Optical Character Recognition (OCR) and Handwriting Recognition (HWR). However, the inherently hierarchical and complex nature of Common Western Music Notation (CWMN) poses significant challenges in framing OMR as a traditional sequence-to-sequence translation process, because such notation cannot be easily linearized into a sequential representation. Current end-to-end approaches in the field rely on adapted score exchange formats as token encodings, which are not specifically optimized for machine learning models. These formats often encode information in a way that introduces long-range dependencies, ambiguous element orders, and extensive dictionaries, making it challenging for statistical models to learn effectively. In this work, I introduce a new structured encoding for CWMN, named ScoreCode, that addresses these challenges and is specifically designed for modern sequence-to-sequence architectures. ScoreCode focuses on directly encoding the visual information present in the notation, while deferring the extraction of composite attributes such as pitch and duration to a subsequent post-processing step. This hybrid approach eliminates long-range dependencies and simplifies the learning process, ultimately resulting in improved performance and precision. Formalizing this encoding as a context-free grammar (CFG) makes it possible to enforce specific musical engraving rules and ensures that sentences produced within this grammar can be parsed into a hierarchical data representation of music notation. The encoding is validated with a Transformer-based architecture designed to generate output sentences in the ScoreCode grammar using a split encoding approach.
To support the training of this model, I developed a comprehensive data generation pipeline that leverages open-source tools to automate the creation of large and diverse datasets from MusicXML files. The approach presented in this work achieves state-of-the-art results, reducing the error rate by approximately 90% compared to the current baseline. This highlights the encoding’s potential to enhance the performance of OMR systems and sets a new benchmark for future research and applications in the field.
|