Machine translation literacy for academics
Description
Objectives
This project investigates the potential of neural machine translation (NMT) for academic texts (abstracts, papers, etc.) intended for publication.

Initial situation and hypothesis
Well-known issues with neural machine translation are text cohesion, terminology and hedging (the use of hedge terms). These issues could be due, among other things, to syntactic differences between languages. This project investigates the syntactic features of scientific abstracts and examines how they are treated by common free translation systems.

Corpus
300 abstracts of German-language dissertations (DE, CH, AT) and their English machine translations (DeepL and Google Translate), with morphosyntactic annotation (TreeTagger / TagAnt) and sentence-by-sentence alignment (AntConc and AntPConc).

Methodology: recursive-emergent approach
1. Observations from preliminary studies served as a basis for explorations in AntConc and AntPConc.
2. Identification of potentially problematic constructions in German:
   a. Modal verbs ("sollen" and "können")
   b. Presenting constructions: X verb Y (semantic subject)
      i. X verb Y (subordinate clause as semantic subject), e.g.: "Überdies wird der Frage nachgegangen, ob die Übertragbarkeit der Inhaberaktie […]"
      ii. X verb Y (nominal phrase as semantic subject), e.g.: "Es liegen bioklastische, homogen strukturierten Wacke- bis Mudstones vor, deren Kalk-Mergel-Wechsellagerung auf einem „Verdünnungseffekt“ der Karbonatproduktion beruht."
3. Qualitative analysis (involving English language experts) of the corresponding English constructions produced by DeepL and Google Translate with regard to semantic agreement, syntactic idiomaticity and ambiguity.
4. Quantitative evaluation based on the qualitative analysis.

Findings
1. Text cohesion: "Dabei" and "So" at the beginning of a sentence are omitted more often than average in the NMT process, which means that a device linking the sentence to the preceding text is lost. The overall text cohesion is thus reduced.
2. Hedging: "sollen" in scientific abstracts carries the risk of an ambiguous translation into English. In about 50% of cases, the study found either an additional weakening of the statement or a possibility of misinterpretation.
3. Presenting constructions: NMT mostly translates them in such a way that the target text is semantically inaccurate, unidiomatic or ambiguous.
4. Formulaic language: the more fixed the expression, the higher the probability that the NMT systems will generate a correct, idiomatic and unambiguous text. Example: extended hedged performatives ("Zusammenfassend kann festgehalten werden, dass…") are all translated correctly in the corpus.

These findings apply only to the corpus of this study, which focussed solely on scientific abstracts.
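The cohesion finding can be illustrated with a minimal Python sketch that counts, over sentence-aligned pairs, how often a German sentence opening with "Dabei" or "So" is rendered without a sentence-initial linking device in English. This is not the procedure used in the project, which relied on AntConc and AntPConc. The function names, the list of English connectors and the sample sentences are all assumptions made for illustration.

```python
import re

# Sentence-initial German linking adverbs named in the findings.
GERMAN_CONNECTORS = ("dabei", "so")

# Hypothetical list of English linking devices; an assumption for
# illustration, not the inventory used in the study.
ENGLISH_CONNECTORS = (
    "thus", "in doing so", "here", "in this context",
    "in this way", "for example", "for instance",
)


def first_word(sentence):
    """Return the first word of a sentence in lower case ('' if none)."""
    match = re.match(r"\W*(\w+)", sentence)
    return match.group(1).lower() if match else ""


def starts_with_connector(sentence, connectors):
    """Check whether the sentence begins with one of the given connectors."""
    lowered = sentence.strip().lower()
    return any(re.match(rf"{re.escape(c)}\b", lowered) for c in connectors)


def connector_omission_rate(pairs):
    """Return (omitted, total) for pairs whose German side opens with a connector."""
    omitted = 0
    total = 0
    for german, english in pairs:
        if first_word(german) not in GERMAN_CONNECTORS:
            continue  # German sentence has no sentence-initial connector
        total += 1
        if not starts_with_connector(english, ENGLISH_CONNECTORS):
            omitted += 1
    return omitted, total


# Invented sample pairs, not taken from the project corpus.
sample_pairs = [
    ("Dabei wird die Methode geprüft.", "The method is tested."),
    ("So zeigt sich ein Effekt.", "Thus, an effect becomes apparent."),
    ("Die Studie untersucht das Verfahren.", "The study examines the procedure."),
]

omitted, total = connector_omission_rate(sample_pairs)
print(f"{omitted} of {total} sentence-initial connectors omitted")
```

The sketch only inspects the beginning of the English sentence, so a linking device moved to a later position would be counted as omitted. A qualitative check by language experts, as in step 3 of the methodology, would still be needed.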
Key Data
Project lead
Project status
completed, 08/2020 - 01/2021
Funding partner
Kanton Zürich / Digitalisierungsinitiative DIZH