Activities – User-Focused Marian

The following activities will extend the Marian NMT Toolkit:

Activity 1: Implementation of Factored Translation

Activity 1 will implement factors for both source and target tokens in Marian. As a result, the user will be able to provide factored parallel data and factored input during inference.

Tasks:

T1.1. Benchmarking of quality and speed with factors

T1.2. Extension of the software to support user-defined factors

T1.3. Extension of output vocabularies for each factor
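
As an illustration of what factored input could look like, the sketch below attaches a capitalization factor to each lowercased token and parses such text back into (word, factors) pairs. The `|` separator, the factor names, and all helper functions are assumptions made for this example only; they are not taken from Marian's actual factor format.

```python
from typing import List, Tuple

FACTOR_SEP = "|"  # assumed separator for this sketch, not Marian's documented format


def case_factor(word: str) -> str:
    """Return a hypothetical capitalization factor for one token."""
    if len(word) > 1 and word.isupper():
        return "all_caps"
    if word[:1].isupper():
        return "init_cap"
    return "lower"


def add_case_factors(sentence: str) -> str:
    """Lowercase every token and append its capitalization factor."""
    return " ".join(
        f"{word.lower()}{FACTOR_SEP}{case_factor(word)}"
        for word in sentence.split()
    )


def parse_factored(sentence: str) -> List[Tuple[str, List[str]]]:
    """Split 'word|f1|f2 ...' tokens into (word, [factors]) pairs."""
    parsed = []
    for token in sentence.split():
        word, *factors = token.split(FACTOR_SEP)
        parsed.append((word, factors))
    return parsed


factored = add_case_factors("The NMT toolkit")
print(factored)
print(parse_factored(factored))
```

A token that itself contains the separator character would need escaping, which this sketch does not handle.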

Activity 2: Forced Translation

This functionality aims to support users in improving translation quality by injecting translations from their own bilingual terminology dictionaries. To expand coverage and support morphologically rich languages, it will also support inflecting entries in the bilingual terminology dictionary.

Tasks:

T2.1. Study of the existing approaches and selection of the one that best suits the technological aspects of Marian.

T2.2. Implementation of a Forced Translation Module as a sub-product of the Action, in the Marian package, including an API.
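
The first step of any such module is finding which source tokens have an entry in the user's terminology dictionary. The sketch below shows that lookup only, with a hypothetical lemma table standing in for source-side inflection handling. It does not represent the approach selected in T2.1 or any Marian API; the function name, the data structures, and the example entries are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


def find_term_constraints(
    tokens: List[str],
    glossary: Dict[str, str],
    lemmas: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Pair each source token with its glossary translation, if any.

    `lemmas` is a hypothetical map from inflected forms to base forms,
    so that inflected source words still match a glossary entry.
    """
    lemmas = lemmas or {}
    constraints = []
    for token in tokens:
        key = token.lower()
        key = lemmas.get(key, key)  # fall back to the surface form
        constraints.append((token, glossary.get(key)))
    return constraints


glossary = {"valve": "Ventil"}    # illustrative English-German entry
lemmas = {"valves": "valve"}      # illustrative inflection table
print(find_term_constraints(["Open", "the", "valves"], glossary, lemmas))
```

The lookup returns the dictionary's base form of the target term. Producing the correctly inflected target form, which the activity also aims to support, would require morphological generation that is not shown here.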

Activity 3: On-the-Fly Domain Adaptation

Under this activity, the self-adaptive feature already implemented in Marian will be integrated with a translation memory to enable real-world use. The activity will develop a workflow and supporting tools that allow Marian’s self-adaptive feature to be used in concert with translation memories.

Tasks:

T3.1. Development of a workflow and supporting tools that allow Marian’s self-adaptive feature to be used in concert with translation memories.

T3.2. Extension of Marian’s input interface to support receiving JSON-encoded structured data.

T3.3. Creation of a web service that implements a REST API for interfacing with Marian and the translation memory.

T3.4. Preparation and extension of documented user guides for using Marian’s self-adaptive feature in concert with translation memories.
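
One way to picture the workflow is as follows: for each input sentence, similar entries are retrieved from the translation memory and sent along with the sentence as JSON-encoded structured data. The sketch below is a minimal illustration under that assumption. The JSON field names (`input`, `context`), the similarity threshold, and the use of Python's `difflib` for fuzzy matching are choices made for this example, not part of Marian's interface.

```python
import json
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

TranslationMemory = List[Tuple[str, str]]  # (source, target) pairs


def retrieve_context(
    source: str, tm: TranslationMemory, k: int = 2, threshold: float = 0.5
) -> List[Dict[str, str]]:
    """Return up to k translation memory entries most similar to `source`."""
    scored = [
        (SequenceMatcher(None, source, tm_source).ratio(), tm_source, tm_target)
        for tm_source, tm_target in tm
    ]
    # Keep only sufficiently similar entries, best match first.
    scored = [entry for entry in scored if entry[0] >= threshold]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return [{"source": s, "target": t} for _, s, t in scored[:k]]


def build_request(source: str, tm: TranslationMemory) -> str:
    """Encode the sentence and its retrieved context as a JSON string."""
    payload = {"input": source, "context": retrieve_context(source, tm)}
    return json.dumps(payload, ensure_ascii=False)


tm = [
    ("Close the valve.", "Schließen Sie das Ventil."),
    ("Good morning.", "Guten Morgen."),
]
print(build_request("Open the valve.", tm))
```

A REST service such as the one described in T3.3 could accept a payload of this shape, although the actual schema is left to the implementation.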

Activity 4: Development of the Documentation

This activity ensures compliance with ELRC-SHARE guidelines by providing both high-level usage documentation of the Marian toolkit for end users and code documentation for automated translation users, developers, and researchers.

Tasks:

T4.1. Developing low-level tutorials for new practitioners of the machine translation field covering training and translation pipelines.

T4.2. Improving installation guidelines and providing the list of tested system environment configurations.

T4.3. Creating end-to-end pipeline examples on popular and publicly available data sets in the form of easy-to-run scripts.

T4.4. Preparing and extending documented user guides to training and translation features offered by Marian.

Activity 5: GPU Efficiency

This activity will integrate 8-bit integer support on GPUs using Tensor Cores on Turing and newer GPUs, enabling faster processing. It will also implement support for 8-bit training in Marian to determine its feasibility for production use.

Tasks:

T5.1. Implementation of inference on GPUs based on 8-bit matrix multiplication with Tensor Cores.

T5.2. Measurement of the impact of the T5.1 implementation on inference quality and speed.

T5.3. Implementation of support for 8-bit operations on GPUs in Marian training.
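
The arithmetic behind 8-bit matrix multiplication can be sketched on the CPU in plain Python: each matrix is scaled into the signed 8-bit range, the products are accumulated as integers, and the result is rescaled to floating point. The example below assumes symmetric per-matrix quantization, which is only one of several possible schemes, and the function names are hypothetical. It illustrates the numerics only and is not Tensor Core code or Marian's implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize(m: Matrix) -> Tuple[List[List[int]], float]:
    """Map floats to integers in [-127, 127] using one scale per matrix."""
    max_abs = max(abs(x) for row in m for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in m]
    return q, scale


def int8_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (m x k) by b (k x n) using quantized integer products."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    cols_b = list(zip(*qb))  # columns of b for the dot products
    return [
        # Integer accumulation first, then one rescale back to float.
        [sum(x * y for x, y in zip(row, col)) * scale_a * scale_b for col in cols_b]
        for row in qa
    ]


a = [[1.0, -2.0], [0.5, 0.25]]
b = [[0.5, 1.0], [-1.0, 2.0]]
print(int8_matmul(a, b))  # close to the exact product [[2.5, -3.0], [0.0, 1.0]]
```

The small differences from the exact product are the quantization error whose effect on translation quality T5.2 sets out to measure.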

Activity 6: Dissemination

The activity will contribute to the wider deployment and take-up of eTranslation by reaching out to stakeholders at national, regional, and local levels and by raising awareness of the importance of eTranslation for cross-border interaction and cooperation across the EU and associated countries. Public administrations, as the primary users of eTranslation, will be addressed via ELRC and national initiatives.

Tasks:

T6.1. Dissemination Strategy and Plan

T6.2. Creation of dissemination materials

T6.3. Stakeholder board engagement

T6.4. Planning and organization of a project in a hackathon

T6.5. Planning and dissemination at machine translation community events