Mohammad Sadegh Rasooli
Principal applied scientist, Speech and Language Group at Microsoft, Mountain View, CA
Former postdoctoral researcher, University of Pennsylvania
Former research scientist at Facebook AI
PhD in Computer Science, Columbia University
mrasooli-at-microsoft.[com]
CV (Updated: October 2024)

Publications

Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, and Jinyu Li.
arXiv:2410.13198, 2024. [abstract] [bibtex]@misc{ghosh2024failingforwardimprovinggenerative, title={Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation}, author={Sreyan Ghosh and Mohammad Sadegh Rasooli and Michael Levit and Peidong Wang and Jian Xue and Dinesh Manocha and Jinyu Li}, year={2024}, eprint={2410.13198}, archivePrefix={arXiv}, primaryClass={eess.AS}, url={https://arxiv.org/abs/2410.13198}, }
Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon is amplified for named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction by augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8%-30% relative WER improvements in ID and 10%-33% improvements in OOD settings.
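The following toy sketch (Python) illustrates the general shape of retrieval-augmented correction as described above: retrieve database entities similar to the ASR hypothesis and expose them in the correction prompt. The entity store, similarity heuristic, and prompt wording are illustrative assumptions, not the DARAG implementation.

import difflib

NE_DATABASE = ["Punta Cana", "Jinyu Li", "Dinesh Manocha"]  # hypothetical entity store

def retrieve_entities(hypothesis, k=2):
    # Crude string similarity between each stored entity and the hypothesis.
    scored = [(difflib.SequenceMatcher(None, e.lower(), hypothesis.lower()).ratio(), e)
              for e in NE_DATABASE]
    return [e for _, e in sorted(scored, reverse=True)[:k]]

def build_gec_prompt(nbest):
    entities = retrieve_entities(nbest[0])
    lines = ["Correct the ASR transcript. Possibly relevant entities: " + ", ".join(entities)]
    lines += ["Hypothesis %d: %s" % (i + 1, h) for i, h in enumerate(nbest)]
    lines.append("Corrected transcript:")
    return "\n".join(lines)

print(build_gec_prompt(["call gin you lee about the demo", "call jin yu li about the demo"]))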
External Language Model Integration for Factorized Neural Transducers
Michael Levit, Sarangarajan Parthasarathy, Cem Aksoylar, Mohammad Sadegh Rasooli, and Shuangyu Chang.
arXiv:2305.17304, 2023. [abstract] [bibtex]@misc{levit2023external, title={External Language Model Integration for Factorized Neural Transducers}, author={Michael Levit and Sarangarajan Parthasarathy and Cem Aksoylar and Mohammad Sadegh Rasooli and Shuangyu Chang}, year={2023}, eprint={2305.17304}, archivePrefix={arXiv}, primaryClass={cs.CL} }
We propose an adaptation method for factorized neural transducers (FNT) with external language models. We demonstrate that both neural and n-gram external LMs add significantly more value when linearly interpolated with the predictor output compared to shallow fusion, thus confirming that FNT forces the predictor to act like a regular language model. Further, we propose a method to integrate class-based n-gram language models into the FNT framework, resulting in accuracy gains similar to a hybrid setup. We show average gains of 18% WERR with lexical adaptation across various scenarios and additive gains of up to 60% WERR in one entity-rich scenario through a combination of class-based n-gram and neural LMs.
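A toy numeric illustration (Python) of the two integration points contrasted above: shallow fusion adds a weighted external-LM score to the final model score, while the proposed scheme linearly interpolates the external LM with the predictor output before combining it with the rest of the model. All scores below are made up, and real decoding operates over beams rather than a single token.

import math

def log_interp(lp_a, lp_b, lam):
    # Linear interpolation of two probabilities, computed in log space.
    return math.log((1 - lam) * math.exp(lp_a) + lam * math.exp(lp_b))

lp_predictor = math.log(0.20)  # FNT vocabulary predictor P(y | history)
lp_external = math.log(0.35)   # external LM P(y | history)
lp_rest = math.log(0.50)       # remaining transducer score for this token

shallow_fusion = lp_rest + lp_predictor + 0.3 * lp_external
predictor_level = lp_rest + log_interp(lp_predictor, lp_external, lam=0.3)

print("shallow fusion:  %.3f" % shallow_fusion)
print("predictor-level: %.3f" % predictor_level)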
Bidirectional Language Models Are Also Few-shot Learners
Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, Chris Callison-Burch.
ICLR, 2023. [abstract] [bibtex]@inproceedings{Patel2022BidirectionalLM, title={Bidirectional Language Models Are Also Few-shot Learners}, author={Ajay Patel and Bryan Li and Mohammad Sadegh Rasooli and Noah Constant and Colin Raffel and Chris Callison-Burch}, year={2022} }
Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5 having approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.
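A greatly simplified sketch (Python, assuming the Hugging Face mT5 checkpoint) of the SAP idea: repeatedly append a mask sentinel, let the span-denoising model fill it, keep a short chunk of the fill, and iterate, which yields left-to-right generation from a bidirectional model. The prompt, chunk size, and stopping rule are illustrative, not the paper's exact recipe.

from transformers import AutoTokenizer, MT5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def sap_generate(prompt, steps=6, chunk_words=3):
    text = prompt
    for _ in range(steps):
        inputs = tok(text + " <extra_id_0>", return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=10)
        # The fill for <extra_id_0>; sentinel/pad tokens are stripped by decode.
        fill = tok.decode(out[0], skip_special_tokens=True).strip().split()
        if not fill:
            break
        text += " " + " ".join(fill[:chunk_words])  # keep only a short prefix
    return text

# Few-shot translation prompt in the spirit of the paper (content illustrative):
print(sap_generate("English: I like tea. German: Ich mag Tee. English: The house is small. German:"))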
Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation
Bryan Li, Mohammad Sadegh Rasooli, Ajay Patel, Chris Callison-Burch.
LoResMT, pp 16-31, 2023. [abstract] [bibtex]@article{li2022multilingual, title={Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation}, author={Li, Bryan and Patel, Ajay and Callison-Burch, Chris and Rasooli, Mohammad Sadegh}, journal={arXiv preprint arXiv:2209.02821}, year={2022} }
We propose a two-stage training approach for developing a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model with pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 25 languages to English. We find this model can generalize to zero-shot translations on unseen languages. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then train with successive rounds of back-translation. The final model extends to the English-to-Many direction, while retaining Many-to-English performance. We term our approach EcXTra (English-centric Crosslingual (X) Transfer). Our approach sequentially leverages auxiliary parallel data and monolingual data, and is conceptually simple, only using a standard cross-entropy objective in both stages. The final EcXTra model is evaluated on unsupervised NMT on 8 low-resource languages achieving a new state-of-the-art for English-to-Kazakh (22.3 > 10.4 BLEU), and competitive performance for the other 15 translation directions.
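A schematic sketch (Python) of one stage-two round: the current model translates monolingual text in both directions to create synthetic parallel data, and training continues on the union with the standard cross-entropy objective. The translate and train callables are hypothetical stand-ins for a real NMT toolkit.

from typing import Callable, List, Tuple

def back_translation_round(
    translate: Callable[[str, str], str],            # (sentence, direction) -> translation
    train: Callable[[List[Tuple[str, str]]], None],  # one training round on (src, tgt) pairs
    mono_target: List[str],
    mono_english: List[str],
) -> None:
    # Synthetic (target -> English) pairs from target-language monolingual text.
    to_english = [(s, translate(s, "xx-en")) for s in mono_target]
    # Synthetic (English -> target) pairs extend the model to the reverse direction.
    from_english = [(translate(s, "en-xx"), s) for s in mono_english]
    train(to_english + from_english)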
The Persian Dependency Treebank Made Universal
Mohammad Sadegh Rasooli, Pegah Safari, Amirsaeid Moloodi, and Alireza Nourian.
LREC 2022. [abstract] [bibtex][code+data]@inproceedings{safari-etal-2022-persian, title = "The {P}ersian Dependency Treebank Made Universal", author = "Safari, Pegah and Rasooli, Mohammad Sadegh and Moloodi, Amirsaeid and Nourian, Alireza", booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference", month = jun, year = "2022", address = "Marseille, France", publisher = "European Language Resources Association", url = "https://aclanthology.org/2022.lrec-1.766", pages = "7078--7087", abstract = "We describe an automatic method for converting the Persian Dependency Treebank (Rasooli et al., 2013) to Universal Dependencies. This treebank contains 29107 sentences. Our experiments along with manual linguistic analysis show that our data is more compatible with Universal Dependencies than the Uppsala Persian Universal Dependency Treebank (Seraji et al., 2016), larger in size and more diverse in vocabulary. Our data brings in labeled attachment F-score of 85.2 in supervised parsing. Also, our delexicalized Persian-to-English parser transfer experiments show that a parsing model trained on our data is {\mbox{$\approx$}}2{\%} absolutely more accurate than that of Seraji et al. (2016) in terms of labeled attachment score.", }
We describe an automatic method for converting the Persian Dependency Treebank (Rasooli et al., 2013) to Universal Dependencies. This treebank contains 29107 sentences. Our experiments along with manual linguistic analysis show that our data is more compatible with Universal Dependencies than the Uppsala Persian Universal Dependency Treebank (Seraji et al., 2016), and is larger in size and more diverse in vocabulary. Our data brings in a labeled attachment F-score of 85.2 in supervised parsing. Our delexicalized Persian-to-English parser transfer experiments show that a parsing model trained on our data is ~2% absolutely more accurate than that of Seraji et al. (2016) in terms of labeled attachment score.
"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks
Mohammad Sadegh Rasooli, Chris Callison-Burch and Derry Tanti Wijaya.
EMNLP 2021. [abstract] [bibtex][code]@inproceedings{rasooli-etal-2021-wikily, title = "{``}Wikily{''} Supervised Neural Translation Tailored to Cross-Lingual Tasks", author = "Rasooli, Mohammad Sadegh and Callison-Burch, Chris and Wijaya, Derry Tanti", booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing", month = nov, year = "2021", address = "Online and Punta Cana, Dominican Republic", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.emnlp-main.124", doi = "10.18653/v1/2021.emnlp-main.124", pages = "1655--1670", abstract = "We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for a seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong \textit{supervised} baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our \textit{wikily} translation models to unsupervised image captioning, and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English from which the Arabic training data is a \textit{wikily} translation of the English captioning data. Our captioning results on Arabic are slightly \textit{better} than that of its supervised model. In dependency parsing, we translate a large amount of monolingual text, and use it as an artificial training data in an \textit{annotation projection} framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.", }
We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as the cross-lingual tasks of image captioning and dependency parsing, without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, provide strong signals for seed parallel data used to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a wikily translation of the English captioning data. Our captioning results on Arabic are slightly better than those of the supervised model. In dependency parsing, we translate a large amount of monolingual text, and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.
Cultural and Geographical Influences on Image Translatability of Words across Languages
Nikzad Khani, Isidora Chara Tourni, Mohammad Sadegh Rasooli, Chris Callison-Burch and Derry Tanti Wijaya.
NAACL 2021. [abstract] [bibtex]tbd
Neural Machine Translation (NMT) models have been observed to produce poor translations when there are few/no parallel sentences to train the models. In the absence of parallel data, several approaches have turned to the use of images to learn translations. Since images of words, e.g., horse, may be unchanged across languages, translations can be identified via images associated with words in different languages that have a high degree of visual similarity. However, translating via images has been shown to improve upon text-only models only marginally. To better understand when images are useful for translation, we study image translatability of words, which we define as the translatability of words via images, by measuring intra- and inter-cluster similarities of image representations of words that are translations of each other. We find that images of words are not always invariant across languages, and that language pairs with shared culture, meaning having either a common language family, ethnicity or religion, have improved image translatability (i.e., have more similar images for similar words) compared to its converse, regardless of their geographic proximity. In addition, in line with previous works showing that images help more in translating concrete words, we find that concrete words have improved image translatability compared to abstract ones.
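A hedged numeric sketch (Python/NumPy) of the intra- vs. inter-cluster measurement described above: embed images associated with a word and with its translation, then compare average within-word similarity to across-word similarity. The random vectors below are stand-ins for real image features.

import numpy as np

def mean_cosine(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

rng = np.random.default_rng(0)
imgs_src = rng.normal(size=(10, 512))  # e.g. features of images for "horse"
imgs_tgt = rng.normal(size=(10, 512))  # features of images for its translation

intra = (mean_cosine(imgs_src, imgs_src) + mean_cosine(imgs_tgt, imgs_tgt)) / 2
inter = mean_cosine(imgs_src, imgs_tgt)
# Inter-cluster similarity close to intra-cluster similarity suggests the
# word pair is highly "image translatable".
print("intra: %.3f  inter: %.3f" % (intra, inter))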
ParsiNLU: A Suite of Language Understanding Challenges for Persian
Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabadi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, and Yadollah Yaghoobzadeh.
Transactions of the ACL, 9:1147–1162, 2021. [abstract] [bibtex][code+data]@article{khashabi-etal-2021-parsinlu, title = "{P}arsi{NLU}: A Suite of Language Understanding Challenges for {P}ersian", author = "Khashabi, Daniel and Cohan, Arman and Shakeri, Siamak and Hosseini, Pedram and Pezeshkpour, Pouya and Alikhani, Malihe and Aminnaseri, Moin and Bitaab, Marzieh and Brahman, Faeze and Ghazarian, Sarik and Gheini, Mozhdeh and Kabiri, Arman and Mahabagdi, Rabeeh Karimi and Memarrast, Omid and Mosallanezhad, Ahmadreza and Noury, Erfan and Raji, Shahab and Rasooli, Mohammad Sadegh and Sadeghi, Sepideh and Azer, Erfan Sadeqi and Samghabadi, Niloofar Safi and Shafaei, Mahsa and Sheybani, Saber and Tazarv, Ali and Yaghoobzadeh, Yadollah", journal = "Transactions of the Association for Computational Linguistics", volume = "9", year = "2021", address = "Cambridge, MA", publisher = "MIT Press", url = "https://aclanthology.org/2021.tacl-1.68", doi = "10.1162/tacl_a_00419", pages = "1147--1162", abstract = "Abstract Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian language that includes a range of language understanding tasks{---}reading comprehension, textual entailment, and so on. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results on state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.1", }
Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, most of this progress has been concentrated on resource-rich languages like English. This work focuses on Persian, a widely spoken language for which few NLU datasets are available. The availability of high-quality evaluation datasets is a necessity for reliable assessment of progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark for Persian that includes a range of high-level tasks: reading comprehension, textual entailment, and more. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. In addition, we present the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.
Automatic Standardization of Colloquial Persian
Mohammad Sadegh Rasooli, Farzane Bakhtyari, Fatemeh Shafiei, Mahsa Ravanbakhsh, and Chris Callison-Burch.
arXiv:2012.05879, Dec. 2020. [abstract] [bibtex][code+data]@misc{rasooli2020automatic, title={Automatic Standardization of Colloquial Persian}, author={Mohammad Sadegh Rasooli and Farzane Bakhtyari and Fatemeh Shafiei and Mahsa Ravanbakhsh and Chris Callison-Burch}, year={2020}, eprint={2012.05879}, archivePrefix={arXiv}, primaryClass={cs.CL} }
The Iranian Persian language has two varieties: standard and colloquial. Most natural language processing tools for Persian assume that the text is in standard form; this assumption fails in many real applications, especially for web content. This paper describes a simple and effective standardization approach based on sequence-to-sequence translation. We design an algorithm for generating artificial parallel colloquial-to-standard data for learning a sequence-to-sequence model. Moreover, we annotate a publicly available evaluation set consisting of 1912 sentences from a diverse set of domains. Our intrinsic evaluation shows a higher BLEU score of 62.8 versus 61.7 for an off-the-shelf rule-based standardization model, where the original text scores 46.4. We also show that our model improves English-to-Persian machine translation when the training data comes from colloquial Persian, by 1.4 absolute BLEU on the development data and 0.8 on the test data.
Multitask Learning for Cross-Lingual Transfer of Broad-coverage Semantic Dependencies
Maryam Aminian, Mohammad Sadegh Rasooli, and Mona Diab.
EMNLP 2020. [abstract] [bibtex]@inproceedings{aminian-etal-2020-multitask, title = "Multitask Learning for Cross-Lingual Transfer of Broad-coverage Semantic Dependencies", author = "Aminian, Maryam and Rasooli, Mohammad Sadegh and Diab, Mona", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.663", doi = "10.18653/v1/2020.emnlp-main.663", pages = "8268--8274", abstract = "We describe a method for developing broad-coverage semantic dependency parsers for languages for which no semantically annotated resource is available. We leverage a multitask learning framework coupled with annotation projection. We use syntactic parsing as the auxiliary task in our multitask setup. Our annotation projection experiments from English to Czech show that our multitask setup yields 3.1{\%} (4.2{\%}) improvement in labeled F1-score on in-domain (out-of-domain) test set compared to a single-task baseline.", }
We describe a method for developing broad-coverage semantic dependency parsers for languages for which no semantically annotated resource is available. We leverage a multitask learning framework coupled with annotation projection. We use syntactic parsing as the auxiliary task in our multitask setup. Our annotation projection experiments from English to Czech show that our multitask setup yields 3.1% (4.2%) improvement in labeled F1-score on in-domain (out-of-domain) test set compared to a single-task baseline.
Low-Resource Syntactic Transfer with Unsupervised Source Reordering
Mohammad Sadegh Rasooli, and Michael Collins.
NAACL 2019. [abstract] [bibtex]@inproceedings{rasooli-collins-2019-low, title = "Low-Resource Syntactic Transfer with Unsupervised Source Reordering", author = "Rasooli, Mohammad Sadegh and Collins, Michael", booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)", month = jun, year = "2019", address = "Minneapolis, Minnesota", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N19-1385", doi = "10.18653/v1/N19-1385", pages = "3845--3856", abstract = "We describe a cross-lingual transfer method for dependency parsing that takes into account the problem of word order differences between source and target languages. Our model only relies on the Bible, a considerably smaller parallel data than the commonly used parallel data in transfer methods. We use the concatenation of projected trees from the Bible corpus, and the gold-standard treebanks in multiple source languages along with cross-lingual word representations. We demonstrate that reordering the source treebanks before training on them for a target language improves the accuracy of languages outside the European language family. Our experiments on 68 treebanks (38 languages) in the Universal Dependencies corpus achieve a high accuracy for all languages. Among them, our experiments on 16 treebanks of 12 non-European languages achieve an average UAS absolute improvement of 3.3{\%} over a state-of-the-art method.", }
We describe a cross-lingual transfer method for dependency parsing that takes into account the problem of word order differences between source and target languages. Our model relies only on the Bible, a considerably smaller parallel corpus than those commonly used in transfer methods. We use the concatenation of projected trees from the Bible corpus and the gold-standard treebanks in multiple source languages, along with cross-lingual word representations. We demonstrate that reordering the source treebanks before training on them for a target language improves the accuracy of languages outside the European language family. Our experiments on 68 treebanks (38 languages) in the Universal Dependencies corpus achieve a high accuracy for all languages. Among them, our experiments on 16 treebanks of 12 non-European languages achieve an average UAS absolute improvement of 3.3% over a state-of-the-art method.
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
Maryam Aminian, Mohammad Sadegh Rasooli, and Mona Diab.
IWCS 2019. [abstract] [bibtex]@inproceedings{aminian-etal-2019-cross, title = "Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles", author = "Aminian, Maryam and Rasooli, Mohammad Sadegh and Diab, Mona", booktitle = "Proceedings of the 13th International Conference on Computational Semantics - Long Papers", month = may, year = "2019", address = "Gothenburg, Sweden", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/W19-0417", doi = "10.18653/v1/W19-0417", pages = "200--210", abstract = "We describe a transfer method based on annotation projection to develop a dependency-based semantic role labeling system for languages for which no supervised linguistic information other than parallel data is available. Unlike previous work that presumes the availability of supervised features such as lemmas, part-of-speech tags, and dependency parse trees, we only make use of word and character features. Our deep model considers using character-based representations as well as unsupervised stem embeddings to alleviate the need for supervised features. Our experiments outperform a state-of-the-art method that uses supervised lexico-syntactic features on 6 out of 7 languages in the Universal Proposition Bank.", }
We describe a transfer method based on annotation projection to develop a dependency-based semantic role labeling system for languages for which no supervised linguistic information other than parallel data is available. Unlike previous work that presumes the availability of supervised features such as lemmas, part-of-speech tags, and dependency parse trees, we only make use of word and character features. Our deep model considers using character-based representations as well as unsupervised stem embeddings to alleviate the need for supervised features. Our experiments outperform a state-of-the-art method that uses supervised lexico-syntactic features on 6 out of 7 languages in the Universal Proposition Bank.
Cross-Lingual Transfer of Natural Language Processing Systems
Mohammad Sadegh Rasooli.
PhD Thesis, Columbia University, 2019. [abstract] [bibtex]@PhdThesis{rasoolithesis, author = {Mohammad Sadegh Rasooli}, title = {Cross-Lingual Transfer of Natural Language Processing Systems}, school = {Columbia University}, year = {2018}, address = {New York}, month = {December}, }
Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages.
In this thesis, we demonstrate different methods for transfer of dependency parsers and sentiment analysis systems. We propose an annotation projection method that performs well in the scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we propose an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. Finally, we conduct a diverse set of experiments for the transfer of sentiment analysis systems in different data settings.
A summary of our contributions is as follows:
• We develop accurate dependency parsers using parallel text in an annotation projection framework. We make use of the fact that the density of word alignments is a valuable indicator of reliability in annotation projection.
• We develop accurate dependency parsers in the absence of a large amount of parallel data. We use the Bible data, which is orders of magnitude smaller than a conventional parallel dataset, to provide minimal cues for creating cross-lingual word representations. Our model is also capable of boosting the performance of annotation projection with a large amount of parallel data. Our model develops cross-lingual word representations to go beyond traditional delexicalized direct transfer methods. Moreover, we propose a simple but effective word translation approach that brings in explicit lexical features from the target language in our direct transfer method.
• We develop different syntactic reordering models that can change the source treebanks in rich-resource languages, thus preventing the parser from learning a wrong word order for an unrelated target language. Our experimental results show substantial improvements on non-European languages.
• We develop transfer methods for sentiment analysis in different data availability scenarios. We show that we can leverage cross-lingual word embeddings to create accurate sentiment analysis systems in the absence of annotated data in the target language of interest.
We believe that the novelties that we introduce in this thesis indicate the usefulness of transfer methods. This is appealing in practice, especially since we suggest eliminating the requirement for new annotated datasets in low-resource languages, which are expensive, if not impossible, to obtain.
Entity-Aware Language Model as an Unsupervised Reranker
Mohammad Sadegh Rasooli, and Sarangarajan Parthasarathy.
INTERSPEECH 2018. [abstract] [bibtex]@inproceedings{Rasooli2018, author={Mohammad Sadegh Rasooli and Sarangarajan Parthasarathy}, title={Entity-Aware Language Model as an Unsupervised Reranker}, year=2018, booktitle={Proc. Interspeech 2018}, pages={406--410}, doi={10.21437/Interspeech.2018-62}, url={http://dx.doi.org/10.21437/Interspeech.2018-62} }
In language modeling, it is difficult to incorporate entity relationships from a knowledge-base. One solution is to use a reranker trained with global features, in which global features are derived from n-best lists. However, training such a reranker requires manually annotated n-best lists, which is expensive to obtain. We propose a method based on the contrastive estimation method that alleviates the need for such data. Experiments in the music domain demonstrate that global features, as well as features extracted from an external knowledge-base, can be incorporated into our reranker. Our final model achieves a 0.44 absolute word error rate improvement on the blind test data.
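A toy sketch (Python) in the spirit of the approach: instead of manually annotated n-best lists, treat each observed sentence as preferable to artificially perturbed "neighbor" sentences and learn reranker weights from that contrast. The features, neighborhood, and perceptron-style update are simplified stand-ins for the paper's contrastive estimation setup.

import random

def features(sent):
    # Toy global features: bigram indicators (a real system would add
    # knowledge-base and entity-relationship features).
    return {"bigram:%s_%s" % (a, b): 1.0 for a, b in zip(sent, sent[1:])}

def score(w, sent):
    return sum(w.get(k, 0.0) * v for k, v in features(sent).items())

def neighbors(sent, n=4):
    out = []
    for _ in range(n):
        s = list(sent)
        i, j = random.sample(range(len(s)), 2)
        s[i], s[j] = s[j], s[i]  # word-swap neighborhood
        out.append(s)
    return out

def train(corpus, epochs=3, lr=0.1):
    w = {}
    for _ in range(epochs):
        for sent in corpus:
            for neg in neighbors(sent):
                if score(w, neg) >= score(w, sent):  # observed should outscore neighbor
                    for k, v in features(sent).items():
                        w[k] = w.get(k, 0.0) + lr * v
                    for k, v in features(neg).items():
                        w[k] = w.get(k, 0.0) - lr * v
    return w

weights = train([["play", "some", "jazz", "music"], ["call", "my", "mom"]])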
Cross-Lingual Sentiment Transfer with Limited Resources
Mohammad Sadegh Rasooli, Noura Farra, Axinia Radeva, Tao Yu and Kathleen McKeown.
Machine Translation, Volume 32, Issue 1–2, pp 143–165, 2018. [abstract] [bibtex] [code]@Article{Rasooli2018, author="Rasooli, Mohammad Sadegh and Farra, Noura and Radeva, Axinia and Yu, Tao and McKeown, Kathleen", title="Cross-lingual sentiment transfer with limited resources", journal="Machine Translation", year="2018", month="Jun", day="01", volume="32", number="1", pages="143--165", issn="1573-0573", doi="10.1007/s10590-017-9202-6", url="https://doi.org/10.1007/s10590-017-9202-6" }
We describe two transfer approaches for building sentiment analysis systems without having gold labeled data in the target language. Unlike previous work that is focused on using only English as the source language and a small number of target languages, we use multiple source languages to learn a more robust sentiment transfer model for 16 languages from different language families. Our approaches explore the potential of using an annotation projection approach and a direct transfer approach using cross-lingual word representations and neural networks. Whereas most previous work relies on machine translation, we show that we can build cross-lingual sentiment analysis systems without machine translation or even high quality parallel data. We have conducted experiments assessing the availability of different resources such as in-domain parallel data, out-of-domain parallel data, and in-domain comparable data. Our experiments show that we can build a robust transfer system whose performance can in some cases approach that of a supervised system.
Transferring Semantic Roles Using Translation and Syntactic Information
Maryam Aminian, Mohammad Sadegh Rasooli, and Mona Diab.
IJCNLP 2017. [abstract] [bibtex]@InProceedings{I17-2003, author = "Aminian, Maryam and Rasooli, Mohammad Sadegh and Diab, Mona", title = "Transferring Semantic Roles Using Translation and Syntactic Information", booktitle = "Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)", year = "2017", publisher = "Asian Federation of Natural Language Processing", pages = "13--19", location = "Taipei, Taiwan", url = "http://aclweb.org/anthology/I17-2003" }
Annotation projection for semantic role labeling is a transfer method that aims to develop systems for resource-poor languages using supervised annotations of a resource-rich language through parallel data. We propose a method that employs information from source and target syntactic dependencies as well as word alignment density to improve the quality of an iterative bootstrapping method. Our experiments yield a 3.5 absolute labeled F-score improvement over a standard annotation projection method.
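Both this method and the density-driven parsing work below rely on word-alignment density as a confidence signal. A minimal sketch (Python) of that filtering idea, with illustrative data structures: keep a projected sentence only when a large fraction of its words actually receive a projected annotation.

from typing import List, Optional

def alignment_density(projected: List[Optional[int]]) -> float:
    # Fraction of words whose annotation survived projection through alignments.
    covered = sum(1 for p in projected if p is not None)
    return covered / len(projected)

def filter_dense(corpus, threshold=0.9):
    # corpus: list of (words, projected_annotations); keep densely projected ones.
    return [(w, p) for w, p in corpus if alignment_density(p) >= threshold]

words = ["او", "کتاب", "را", "خواند"]  # "He read the book"
projected = [4, 4, None, 0]            # one word received no projection
print(alignment_density(projected))    # 0.75 -> filtered out at threshold 0.9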
Cross-Lingual Syntactic Transfer with Limited Resources
Mohammad Sadegh Rasooli and Michael Collins.
Transactions of the ACL, 5:279--293, 2017. [abstract] [bibtex] [code]@article{rasooli_16, author = {Rasooli, Mohammad Sadegh and Collins, Michael }, title = {Cross-Lingual Syntactic Transfer with Limited Resources}, journal = {Transactions of the Association for Computational Linguistics}, volume = {5}, year = {2017}, keywords = {}, issn = {2307-387X}, url = {https://transacl.org/ojs/index.php/tacl/article/view/922}, pages = {279--293} }
We describe a simple but effective method for cross-lingual syntactic transfer of dependency parsers, in the scenario where a large amount of translation data is not available. The method makes use of three steps: 1) a method for deriving cross-lingual word clusters, which can then be used in a multilingual parser; 2) a method for transferring lexical information from a target language to source language treebanks; 3) a method for integrating these steps with the density-driven annotation projection method of Rasooli and Collins (2015). Experiments show improvements over the state-of-the-art in several languages used in previous work, in a setting where the only source of translation data is the Bible, a considerably smaller corpus than the Europarl corpus used in previous work. Results using the Europarl corpus as a source of translation data show additional improvements over the results of Rasooli and Collins (2015). We conclude with results on 38 datasets from the Universal Dependencies corpora.
Density-Driven Cross-Lingual Transfer of Dependency Parsers
Mohammad Sadegh Rasooli and Michael Collins.
EMNLP 2015. [abstract] [bibtex] [Slides] [Video] [Models & Runnable jar]@InProceedings{rasooli-collins:2015:EMNLP, author = {Rasooli, Mohammad Sadegh and Collins, Michael}, title = {Density-Driven Cross-Lingual Transfer of Dependency Parsers}, booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing}, month = {September}, year = {2015}, address = {Lisbon, Portugal}, publisher = {Association for Computational Linguistics}, pages = {328--338}, url = {http://aclweb.org/anthology/D15-1039} }
We present a novel method for the crosslingual transfer of dependency parsers. Our goal is to induce a dependency parser in a target language of interest without any direct supervision: instead we assume access to parallel translations between the target and one or more source languages, and to supervised parsers in the source language(s). Our key contributions are to show the utility of dense projected structures when training the target language parser, and to introduce a novel learning algorithm that makes use of dense structures. Results on several languages show an absolute improvement of 5.51% in average dependency accuracy over the state-of-the-art method of (Ma and Xia, 2014). Our average dependency accuracy of 82.18% compares favourably to the accuracy of fully supervised methods.
On the Importance of Ezafe Construction in Persian Parsing
Alireza Nourian, Mohammad Sadegh Rasooli, Mohsen Imany and Heshaam Faili.
ACL-IJCNLP 2015. [abstract] [bibtex] [Poster]@InProceedings{nourian-EtAl:2015:ACL-IJCNLP, author = {Nourian, Alireza and Rasooli, Mohammad Sadegh and Imany, Mohsen and Faili, Heshaam}, title = {On the Importance of Ezafe Construction in Persian Parsing}, booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)}, month = {July}, year = {2015}, address = {Beijing, China}, publisher = {Association for Computational Linguistics}, pages = {877--882}, url = {http://www.aclweb.org/anthology/P15-2144} }
Ezafe construction is an idiosyncratic phenomenon in the Persian language. It is a good indicator for phrase boundaries and dependency relations but mostly does not appear in the text. In this paper, we show that adding information about Ezafe construction can give 4.6% relative improvement in dependency parsing and 9% relative improvement in shallow parsing. For evaluation purposes, Ezafe tags are manually annotated in the Persian dependency treebank. Furthermore, to be able to conduct experiments on shallow parsing, we develop a dependency to shallow phrase structure converter based on the Persian dependencies.
Yara Parser: A Fast and Accurate Dependency Parser
Mohammad Sadegh Rasooli and Joel Tetreault.
arXiv:1503.06733v2 [cs.CL], 2015. [abstract] [bibtex] [Code]@article{DBLP:journals/corr/RasooliT15, author = {Mohammad Sadegh Rasooli and Joel R. Tetreault}, title = {Yara Parser: {A} Fast and Accurate Dependency Parser}, journal = {CoRR}, volume = {abs/1503.06733}, year = {2015}, url = {http://arxiv.org/abs/1503.06733}, timestamp = {Thu, 09 Apr 2015 11:33:20 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/journals/corr/RasooliT15}, bibsource = {dblp computer science bibliography, http://dblp.org} }
Dependency parsers are among the most crucial tools in natural language processing as they have many important applications in downstream tasks such as information retrieval, machine translation and knowledge acquisition. We introduce the Yara Parser, a fast and accurate open-source dependency parser based on the arc-eager algorithm and beam search. It achieves an unlabeled accuracy of 93.32 on the standard WSJ test set which ranks it among the top dependency parsers. At its fastest, Yara can parse about 4000 sentences per second when in greedy mode (1 beam). When optimizing for accuracy (using 64 beams and Brown cluster features), Yara can parse 45 sentences per second. The parser can be trained on any syntactic dependency treebank and different options are provided in order to make it more flexible and tunable for specific tasks. It is released with the Apache version 2.0 license and can be used for both commercial and academic purposes. The parser can be found at https://github.com/yahoo/YaraParser.
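For readers unfamiliar with the algorithm, a minimal arc-eager transition loop (Python) is sketched below; Yara adds a trained classifier in place of the oracle, rich features, and beam search, none of which are shown here.

def parse(words, oracle):
    # oracle(stack, buffer, heads) -> "SH" (shift), "RA" (right-arc),
    # "LA" (left-arc) or "RE" (reduce); a real parser scores these with a model.
    stack = [0]                              # 0 is the artificial root
    buffer = list(range(1, len(words) + 1))  # token indices
    heads = {}
    while buffer:
        action = oracle(stack, buffer, heads)
        if action == "SH":
            stack.append(buffer.pop(0))
        elif action == "RA":                 # stack top governs buffer front
            heads[buffer[0]] = stack[-1]
            stack.append(buffer.pop(0))
        elif action == "LA":                 # buffer front governs stack top
            heads[stack[-1]] = buffer[0]
            stack.pop()
        elif action == "RE":                 # stack top already has a head
            stack.pop()
    return heads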
Persian Syntactic Treebank: a Research Based on Dependency Grammar
Mohammad Sadegh Rasooli, Manouchehr Kouhestani and Amirsaeid Moloodi.
SCICT; in Persian; ISBN 313-388-3388-81-3.
Improving Deep Neural Network Acoustic Modeling For Audio Corpus Indexing Under The IARPA Babel Program
Xiaodong Cui, Brian Kingsbury, Jia Cui, Bhuvana Ramabhadran, Andrew Rosenberg, Mohammad Sadegh Rasooli, Owen Rambow, Nizar Habash and Vaibhava Goel.
INTERSPEECH 2014. [abstract] [bibtex]@inproceedings{cui2014improving, title={Improving deep neural network acoustic modeling for audio corpus indexing under the IARPA babel program.}, author={Cui, Xiaodong and Kingsbury, Brian and Cui, Jia and Ramabhadran, Bhuvana and Rosenberg, Andrew and Rasooli, Mohammad Sadegh and Rambow, Owen and Habash, Nizar and Goel, Vaibhava}, booktitle={INTERSPEECH}, pages={2103--2107}, year={2014} }
This paper is focused on several techniques that improve deep neural network (DNN) acoustic modeling for audio corpus indexing in the context of the IARPA Babel program. Specifically, fundamental frequency variation (FFV) and channel-aware (CA) features and data augmentation based on stochastic feature mapping (SFM) are investigated not only for improved automatic speech recognition (ASR) performance but also for their impact on the final spoken term detection on the pre-indexed audio corpus. Experimental results on development languages of Babel option period one show that the improved DNN acoustic models can reduce word error rates in ASR and also help the keyword search performance compared to already competitive DNN baseline systems.
Unsupervised Morphology-Based Vocabulary Expansion
Mohammad Sadegh Rasooli, Thomas Lippincott, Nizar Habash and Owen Rambow.
ACL 2014. [abstract] [bibtex] [Poster]@InProceedings{rasooli-EtAl:2014:P14-1, author = {Rasooli, Mohammad Sadegh and Lippincott, Thomas and Habash, Nizar and Rambow, Owen}, title = {Unsupervised Morphology-Based Vocabulary Expansion}, booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = {June}, year = {2014}, address = {Baltimore, Maryland}, publisher = {Association for Computational Linguistics}, pages = {1349--1359}, url = {http://www.aclweb.org/anthology/P14-1127} }
We present a novel way of generating unseen words, which is useful for certain applications such as automatic speech recognition or optical character recognition in low-resource languages. We test our vocabulary generator on seven low-resource languages by measuring the decrease in out-of-vocabulary word rate on a held-out test set. The languages we study have very different morphological properties; we show how our results differ depending on the morphological complexity of the language. In our best result (on Assamese), our approach can predict 29% of the token-based out-of-vocabulary with a small amount of unlabeled training data.
Non-Monotonic Parsing of Fluent Umm I mean Disfluent Sentences
Mohammad Sadegh Rasooli and Joel Tetreault.
EACL 2014. [abstract] [bibtex] [Slides]@InProceedings{rasooli-tetreault:2014:EACL2014-SP, author = {Rasooli, Mohammad Sadegh and Tetreault, Joel}, title = {Non-Monotonic Parsing of Fluent Umm I mean Disfluent Sentences}, booktitle = {Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers}, month = {April}, year = {2014}, address = {Gothenburg, Sweden}, publisher = {Association for Computational Linguistics}, pages = {48--53}, url = {http://www.aclweb.org/anthology/E14-4010} }
Parsing disfluent sentences is a challenging task which involves detecting disfluencies as well as identifying the syntactic structure of the sentence. While there have been several studies recently into solely detecting disfluencies at a high performance level, there has been relatively little work into joint parsing and disfluency detection that has reached state-of-the-art performance in disfluency detection. We improve upon recent work in this joint task through the use of novel features and learning cascades to produce a model which performs at 82.6 F-score. It outperforms the previous best in disfluency detection on two different evaluations.
Joint Parsing and Disfluency Detection in Linear Time
Mohammad Sadegh Rasooli and Joel Tetreault.
EMNLP 2013. [abstract] [bibtex] [Slides]@InProceedings{rasooli-tetreault:2013:EMNLP, author = {Rasooli, Mohammad Sadegh and Tetreault, Joel}, title = {Joint Parsing and Disfluency Detection in Linear Time}, booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing}, month = {October}, year = {2013}, address = {Seattle, Washington, USA}, publisher = {Association for Computational Linguistics}, pages = {124--129}, url = {http://www.aclweb.org/anthology/D13-1013} }
We introduce a novel method to jointly parse and detect disfluencies in spoken utterances. Our model can use arbitrary features for parsing sentences and adapt itself with out-of-domain data. We show that our method, based on transition-based parsing, performs at a high level of accuracy for both the parsing and disfluency detection tasks. Additionally, our method is the fastest for the joint task, running in linear time.
Orthographic and Morphological Processing for Persian to English Statistical Machine Translation
Mohammad Sadegh Rasooli, Ahmed El Kholy and Nizar Habash.
IJCNLP 2013. [abstract] [bibtex] [Poster]@InProceedings{rasooli-elkholy-habash:2013:IJCNLP, author = {Rasooli, Mohammad Sadegh and El Kholy, Ahmed and Habash, Nizar}, title = {Orthographic and Morphological Processing for Persian-to-English Statistical Machine Translation}, booktitle = {Proceedings of the Sixth International Joint Conference on Natural Language Processing}, month = {October}, year = {2013}, address = {Nagoya, Japan}, publisher = {Asian Federation of Natural Language Processing}, pages = {1047--1051}, url = {http://www.aclweb.org/anthology/I13-1144} }
In statistical machine translation, data sparsity is a challenging problem especially for languages with rich morphology and inconsistent orthography, such as Persian. We show that orthographic preprocessing and morphological segmentation of Persian verbs in particular improves the translation quality of Persian-English by 1.9 BLEU points on a blind test set.
Development of a Persian Syntactic Dependency Treebank
Mohammad Sadegh Rasooli, Manouchehr Kouhestani and Amirsaeid Moloodi.
NAACL 2013. [abstract] [bibtex] [Poster] [Data]@InProceedings{rasooli-kouhestani-moloodi:2013:NAACL-HLT, author = {Rasooli, Mohammad Sadegh and Kouhestani, Manouchehr and Moloodi, Amirsaeid}, title = {Development of a Persian Syntactic Dependency Treebank}, booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2013}, address = {Atlanta, Georgia}, publisher = {Association for Computational Linguistics}, pages = {306--314}, url = {http://www.aclweb.org/anthology/N13-1031} }
This paper describes the annotation process and linguistic properties of the Persian syntactic dependency treebank. The treebank consists of approximately 30,000 sentences annotated with syntactic roles in addition to morpho-syntactic features. One of the unique features of this treebank is that there are almost 4800 distinct verb lemmas in its sentences making it a valuable resource for educational goals. The treebank is constructed with a bootstrapping approach by means of available tagging and parsing tools and manually correcting the annotations. The data is split into standard train, development and test sets in the CoNLL dependency format and is freely available to researchers.
Unsupervised Induction of Persian Semantic Verb Classes Based on Syntactic Information
Maryam Aminian, Mohammad Sadegh Rasooli and Hossein Sameti.
International Conference on Language Processing and Intelligent Information Systems, 2013. [abstract] [bibtex]@inproceedings{aminian2013unsupervised, title={Unsupervised Induction of Persian Semantic Verb Classes Based on Syntactic Information}, author={Aminian, Maryam and Rasooli, Mohammad Sadegh and Sameti, Hossein}, booktitle={Language Processing and Intelligent Information Systems: 20th International Conference, IIS 2013, Warsaw, Poland, June 17-18, 2013, Proceedings}, volume={7912}, pages={112}, year={2013}, organization={Springer} }
Automatic induction of semantic verb classes is one of the most challenging tasks in computational lexical semantics with a wide variety of applications in natural language processing. The large number of Persian speakers and the lack of such semantic classes for Persian verbs have motivated us to use unsupervised algorithms for Persian verb clustering. In this paper, we have done experiments on inducing the semantic classes of Persian verbs based on Levin's theory for verb classes. Syntactic information extracted from dependency trees is used as base features for clustering the verbs. Since there has been no manual classification of Persian verbs prior to this paper, we have prepared a manual classification of 265 verbs into 43 semantic classes. We show that the spectral clustering algorithm outperforms K-means and improves on the baseline algorithm by about 17% in F-measure and 0.13 in Rand index.
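A hedged sketch (Python/scikit-learn) of the clustering step just described: represent each verb by syntactic-slot statistics and cluster with spectral clustering. The verbs and the feature matrix below are toy stand-ins for counts extracted from dependency trees.

import numpy as np
from sklearn.cluster import SpectralClustering

verbs = ["خوردن", "نوشیدن", "رفتن", "آمدن"]  # eat, drink, go, come
# Rows: verbs; columns: illustrative relative frequencies of, e.g.,
# subject, direct-object, and prepositional-complement slots.
X = np.array([[0.9, 0.8, 0.1],
              [0.9, 0.7, 0.2],
              [0.8, 0.1, 0.7],
              [0.9, 0.1, 0.6]])

labels = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0).fit_predict(X)
print(dict(zip(verbs, labels)))  # transitive verbs vs. motion verbs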
Unsupervised Extraction of Verb Valency in Persian
Mohammad Sadegh Rasooli, Behrouz Minaei-Bidgoli, Heshaam Faili and Maryam Aminian.
Journal of Signal and Data Processing, 2(18), pp. 3-12, 2013; in Persian.
Fast Unsupervised Dependency Parsing with Arc-Standard Transitions
Mohammad Sadegh Rasooli and Heshaam Faili.
Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP, 2012. [abstract] [bibtex]@InProceedings{rasooli-faili:2012:ROBUS-UNSUP2012, author = {Rasooli, Mohammad Sadegh and Faili, Heshaam}, title = {Fast Unsupervised Dependency Parsing with Arc-Standard Transitions}, booktitle = {Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP}, month = {April}, year = {2012}, address = {Avignon, France}, publisher = {Association for Computational Linguistics}, pages = {1--9}, url = {http://www.aclweb.org/anthology/W12-0701} }
Unsupervised dependency parsing is one of the most challenging tasks in natural language processing. The task involves finding the best possible dependency trees from raw sentences without any aid from annotated data. In this paper, we illustrate that by applying a supervised incremental parsing model to unsupervised parsing, parsing with linear time complexity is faster than other methods. With only 15 training iterations with linear time complexity, we gain results comparable to those of other state-of-the-art methods. By employing two simple universal linguistic rules inspired by classical dependency grammar, we improve the results in some languages and achieve state-of-the-art results. We also test our model on a part of the ongoing Persian dependency treebank. This is the first such work on the Persian language.
Persian Verb Valency Lexicon: An Attempt Toward Teaching Persian to Non-native Persian Speakers
Manouchehr Kouhestani, Amirsaeid Moloodi and Mohammad Sadegh Rasooli.
International Conference on Spread of Persian Language and Literature, 2012; in Persian.
Unsupervised Identification of Persian Compound Verbs
Mohammad Sadegh Rasooli, Heshaam Faili and Behrouz Minaei-Bidgoli.
10th Mexican International Conference on Artificial Intelligence (MICAI 2011). [abstract] [bibtex]@inproceedings{Rasooli:2011:UIP:2178197.2178234, author = {Rasooli, Mohammad Sadegh and Faili, Heshaam and Minaei-Bidgoli, Behrouz}, title = {Unsupervised Identification of Persian Compound Verbs}, booktitle = {Proceedings of the 10th Mexican International Conference on Advances in Artificial Intelligence - Volume Part I}, series = {MICAI'11}, year = {2011}, isbn = {978-3-642-25323-2}, location = {Puebla, Mexico}, pages = {394--406}, numpages = {13}, url = {http://dx.doi.org/10.1007/978-3-642-25324-9_34}, doi = {10.1007/978-3-642-25324-9_34}, acmid = {2178234}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, keywords = {K-means, Persian, bootstrapping, light verb constructions, multiword expression, unsupervised identification}, }
One of the main tasks related to multiword expressions (MWEs) is compound verb identification. There has been substantial work on unsupervised identification of multiword verbs in many languages, but no notable work on Persian yet. Persian multiword verbs (known as compound verbs) are a kind of light verb construction (LVC) with syntactic flexibility, such as unrestricted word distance between the light verb and the nonverbal element. Furthermore, the nonverbal element can be inflected. These characteristics have made the task in Persian very difficult. In this paper, two different unsupervised methods are proposed to automatically detect compound verbs in Persian. In the first method, a bootstrapping method that extends the pointwise mutual information (PMI) measure is applied. In the second, the K-means clustering algorithm is used. Our experiments show that the proposed approaches outperform the baseline, which uses the PMI measure as its association metric.
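A toy computation (Python) of the PMI association measure at the core of the first method; all counts are invented, and the actual approach extends PMI with bootstrapping over a large corpus.

import math

def pmi(c_pair, c_nonverbal, c_light_verb, n):
    # PMI of a (nonverbal element, light verb) pair under corpus size n.
    return math.log2((c_pair / n) / ((c_nonverbal / n) * (c_light_verb / n)))

# e.g. "حرف زدن" (lit. "word hit", i.e. to talk) vs. an unrelated pair:
print(pmi(400, 500, 20000, 1_000_000))  # strongly associated -> high PMI
print(pmi(5, 3000, 20000, 1_000_000))   # weakly associated -> low/negative PMI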
Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents
Mohammad Sadegh Rasooli, Omid Kashefi and Behrouz Minaei-Bidgoli.
The Seventh Asia Information Retrieval Societies Conference (AIRS 2011). [abstract] [bibtex]@inproceedings{Rasooli:2011:EPP:2189339.2189398, author = {Rasooli, Mohammad Sadegh and Kashefi, Omid and Minaei-Bidgoli, Behrouz}, title = {Extracting Parallel Paragraphs and Sentences from English-persian Translated Documents}, booktitle = {Proceedings of the 7th Asia Conference on Information Retrieval Technology}, series = {AIRS'11}, year = {2011}, isbn = {978-3-642-25630-1}, location = {Dubai, United Arab Emirates}, pages = {574--583}, numpages = {10}, url = {http://dx.doi.org/10.1007/978-3-642-25631-8_52}, doi = {10.1007/978-3-642-25631-8_52}, acmid = {2189398}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, keywords = {English, Persian, bilingual corpus, machine translation, paragraph alignment, parallel corpus, sentence alignment}, }
The task of sentence and paragraph alignment is essential for preparing parallel texts that are needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian is a challenging issue. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents based on simple linguistic features as well as length similarity between sentences and paragraphs of the source and target languages. We apply a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the linear equation weights. Evaluation results show that the extracted features improve the baseline model, which uses only length.
A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank
Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani and Behrouz Minaei-Bidgoli.
5th Language & Technology Conference (LTC 2011). [bibtex] [Data]@inproceedings{rasooli2011syntactic, title={A syntactic valency lexicon for Persian verbs: The first steps towards Persian dependency treebank}, author={Rasooli, Mohammad Sadegh and Moloodi, Amirsaeid and Kouhestani, Manouchehr and Minaei-Bidgoli, Behrouz}, booktitle={5th Language \& Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics}, pages={227--231}, year={2011} }
Effect of Adaptive Spell Checking in Persian
Mohammad Sadegh Rasooli, Omid Kashefi and Behrouz Minaei-Bidgoli.
7th Conference on Natural Language Processing and Knowledge Engineering (NLPKE 2011). [abstract] [bibtex]@INPROCEEDINGS{6138186, author={Mohammad Sadegh Rasooli and Omid Kashefi and Behrouz Minaei-Bidgoli}, booktitle={2011 7th International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE)}, title={Effect of adaptive spell checking in Persian}, year={2011}, pages={161-164}, doi={10.1109/NLPKE.2011.6138186}, month={Nov},}
In the digital era, the volume of produced documents has overwhelmed traditional manual spell checking, and machine text production has introduced a new class of misspellings: typographical errors. Given the intolerable load of checking digital text by hand and the accuracy and speed of computers, automatic spell checking is an important application of computer systems. Different users have their own misspelling patterns and habits, so a traditional spell checker with a fixed set of rules may not perform well for all misspelling patterns. In this paper, we therefore investigate the effect of adaptive spell checking for Persian compared with non-adaptive traditional spell checking. Evaluation results show that, after a short period of usage, adaptive spell checking is superior to and more efficient than traditional spell checking with a fixed set of rules.
A New Approach for Persian Spellchecking
Mohammad Sadegh Rasooli and Behrouz Minaei-Bidgoli.
2nd Data Mining Conference (IDMC 2008); in Persian. [abstract] [bibtex]@inproceedings{rasooli2008new, title={A new approach for Persian spellchecking}, author={Mohammad Sadegh Rasooli and Behrouz Minaei-Bidgoli}, booktitle={2nd Data Mining Conference}, address = {Tehran, Iran}, year={2008} }
In this paper, a spellchecking method is developed after surveying several approaches to spellchecking in Persian and reviewing the challenges and problems these approaches face. The method also resolves the problem of Persian characters that have more than one code in computer editors, which removes the program's portability problem entirely. After checking, the program presents correction suggestions to the user; different approaches for producing suggestions are studied and implemented. To find the right suggestions for a misspelled word, the neighboring words are used and the misspelled word itself is analyzed to derive three candidate corrections. Stemming of Persian nouns, adjectives, adverbs, and verbs is studied and implemented in this spellchecker: verb infinitives are categorized by tense and stemmed accordingly, giving two separate ways of recovering Persian verbs, and for nouns, singular/plural, definite/indefinite, and affixation are handled. The program can be integrated into Microsoft Office.