# Nikola I. Nikolov, Yuhuang Hu, Mi Xue Tan and Richard H.R. Hahnloser

### Note

This project started as my (Yuhuang's) personal project. With Nikola and Mi Xue's help, we eventually put together a nice paper on Chinese-English translation using the Wubi encoding scheme. Part of the content on this page covers the initial motivation and some text-generation results that I produced myself.

### What is THIS?

A way of doing Chinese text generation and translation at the character level.

There are already many character-level natural language processing papers on English translation, and their performance seems to match that of word-level translation. Progress on Chinese translation, however, lags behind.

One reason might be that there are simply TOO MANY CHINESE CHARACTERS! (According to one book, 《中华字海》, there are 85,568 characters in total.) Modern Unicode has more than 20,000 Chinese characters. Semantic embedding algorithms struggle to represent that many symbols.

In this project, I tried a different approach: using a Chinese typing method (the Wubi input scheme) as the character embedding. It turns out that this embedding is surprisingly good at memorizing content.
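A minimal sketch of the idea, assuming a lookup table from characters to Wubi key sequences: each Chinese character is replaced by a short string of ASCII letters, so the model only has to handle a small alphabet. The table `WUBI_TABLE`, the separator `SEP`, and the function names are hypothetical and for illustration only. The three codes are meant to follow the common Wubi 86 scheme and should be checked against a full dictionary before use.

```python
# Hypothetical, hand-written excerpt of a Wubi lookup table. A real system
# would load a full dictionary covering thousands of characters.
WUBI_TABLE = {
    "风": "mqi",
    "中": "khk",
    "国": "lgyi",
}

# Assumed separator between characters; not taken from the original project.
SEP = "|"


def encode(text, table=WUBI_TABLE, sep=SEP):
    """Replace each known character by its Wubi code; pass others through."""
    return sep.join(table.get(ch, ch) for ch in text)


def decode(encoded, table=WUBI_TABLE, sep=SEP):
    """Invert encode(), assuming every code in the table is unique."""
    reverse = {code: ch for ch, code in table.items()}
    return "".join(reverse.get(tok, tok) for tok in encoded.split(sep))


print(encode("中国"))
print(decode(encode("风中国")))
```

In a full Wubi dictionary several characters can share one code, so a real decoder needs some way to disambiguate them. This sketch sidesteps that by keeping the table tiny.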

### Challenges in Chinese Translation

Although companies like Google, Baidu and Microsoft are willing to collect data for and invest in Chinese translation, there is no public dataset available for people to compete on.

We have found several datasets that are fairly modern and well structured for this purpose, and we developed a pipeline to work with them. Some early experiments suggest that the Wubi embedding has an advantage over plain embeddings, even when compared with word-level embeddings.

We would like to explore character-level translation with convolutional sequence-to-sequence models. Let us know if you are interested!

### en2pinyin

The code is available in this repository:

git clone https://github.com/duguyue100/en2pinyin


### Text Translation

We applied this idea to English-Chinese translation. Please check out our paper:

N.I. Nikolov, Y. Hu, MX. Tan, R.H.R. Hahnloser, “Character-level Chinese-English Translation through ASCII Encoding” in The Third Conference on Machine Translation (WMT18), Brussels, Belgium, 2018.

We further investigated character-level multilingual translation, where Chinese is modeled with the Wubi encoding:

Y. Gao, N.I. Nikolov, Y. Hu, R.H.R. Hahnloser, “Character-Level Translation with Self-attention” in 2020 Annual Conference of the Association for Computational Linguistics (ACL), Seattle, Washington, 2020.

### Text-generation

#### Example 1: 《史记 赵世家》

The text is taken from a chapter of a famous Chinese history book, Records of the Grand Historian.

The book is written in Literary Chinese, the language of classical literature. Because the writing style is so different from modern Chinese, it is recognized as a particularly difficult task to learn for both humans and machines. However, our simple learning model performs particularly well.

#### Example 3: Poem Generation from Classical Chinese Poetry Form

I collected 200 classical Chinese poems, each in the form known as “七言律诗” (seven-character regulated verse), where each poem has 8 sentences and each sentence has 7 characters.
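The form constraint is easy to check mechanically. The sketch below is an illustration only, not part of the original pipeline: it splits a poem on common Chinese punctuation and checks for 8 sentences of 7 characters each. The function names and the punctuation set are assumptions.

```python
import re

LINES_PER_POEM = 8   # sentences per poem in this form
CHARS_PER_LINE = 7   # characters per sentence


def split_sentences(poem):
    """Split on common Chinese punctuation and whitespace, dropping empties."""
    parts = re.split(r"[，。！？；、\s]+", poem)
    return [p for p in parts if p]


def has_expected_form(poem):
    """Return True if the poem has 8 sentences of exactly 7 characters."""
    sentences = split_sentences(poem)
    return (len(sentences) == LINES_PER_POEM
            and all(len(s) == CHARS_PER_LINE for s in sentences))


# Placeholder text with the right shape, not a real poem.
placeholder = "，".join(["字" * CHARS_PER_LINE] * LINES_PER_POEM) + "。"
print(has_expected_form(placeholder))
```

A check like this could be used to filter a corpus before training, or to test whether generated text still respects the form.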

Instead of being creative, the network turns out to be pretty good at storing and recalling the poems.

The seed is “风”, the Chinese character for “wind”.

The 10 poems recalled by the network are identical to the originals, even though the network only uses a history of roughly the previous two sentences.

