en2pinyin - Character-level Chinese text generation and translation

Yuhuang Hu

What is THIS?

A way of doing Chinese text generation and translation at the character level.

There are many papers on character-level natural language processing for English translation, and these models seem to match the performance of word-level translation. Progress on Chinese translation, however, lags behind.

One reason might be that there are simply TOO MANY CHINESE CHARACTERS! According to one book, 《中华字海》, there are 85,568 characters in total, and modern Unicode encodes more than 20,000 of them. Learning a semantic embedding for each of them is close to hopeless.

In this project, I tried a different approach: using a Chinese typing (input) method as the character embedding. It turns out this embedding is surprisingly good at memorizing content.
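To make the idea concrete, here is a minimal sketch of what a typing-method embedding could look like. It assumes the input method is pinyin with a tone digit (suggested by the project name, but not confirmed here), and the lookup table KEYSTROKES, the padding scheme, and the function names are all hypothetical. The repository may implement this differently.

```python
import string

# Hypothetical hand-written lookup (assumption): a real system would need a
# full character-to-keystroke table for the chosen input method.
KEYSTROKES = {"赵": "zhao4", "氏": "shi4", "孤": "gu1", "风": "feng1"}

# The whole vocabulary is 26 letters + 10 digits, instead of tens of
# thousands of distinct characters.
ALPHABET = string.ascii_lowercase + string.digits
PAD = 0      # id reserved for padding
MAX_LEN = 7  # assumed upper bound on keystrokes per character


def encode_char(ch):
    """Turn one Chinese character into a fixed-length list of keystroke ids."""
    keys = KEYSTROKES[ch]
    ids = [ALPHABET.index(k) + 1 for k in keys]  # +1 keeps 0 free for PAD
    return ids + [PAD] * (MAX_LEN - len(ids))


def encode_text(text):
    """Encode a string character by character."""
    return [encode_char(ch) for ch in text]


if __name__ == "__main__":
    for ch, vec in zip("赵氏孤", encode_text("赵氏孤")):
        print(ch, vec)
```

The point of the sketch is only that every character, however rare, is expressed with the same 36 keystroke symbols, so the model never needs a separate embedding per character.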

Get Involved!

Currently I’m working on some other computer vision projects, so this project is not very active. If you want to continue the work on translation, you are very welcome to contribute.

I’m open to any form of collaboration.

Challenges in Chinese Translation

Although companies like Google, Baidu and Microsoft are willing to collect data for and invest in Chinese translation, there is no public dataset available for people to compete on.

(MAYBE I DIDN’T FIND SUCH A DATASET SIMPLY BECAUSE I’M NOT AN NLP PERSON. IF YOU ARE, PLEASE TELL ME WHICH DATASET YOU ARE USING FOR CHINESE TRANSLATION.)

en2pinyin

The code is in this repository:

git clone https://github.com/duguyue100/en2pinyin

Text-generation

Example 1: 《史记 赵世家》

The text is taken from a chapter of a famous Chinese history book: Records of the Grand Historian.

The book is written in Literary Chinese, the language of classical literature. Because the writing style is so different from modern Chinese, it is considered particularly difficult to learn for both humans and machines. However, our simple learning model performs particularly well.

赵氏孤兒良已死,皆喜。然赵氏真孤乃反在,程婴卒与俱匿山中。

居十五年,晋景公疾,卜之,大业之後不遂者为祟。 景公问韩厥,厥知赵孤在,乃曰:“大业之後在晋绝祀者,其赵氏乎? 夫自中衍者皆嬴姓也。中衍人面鸟噣,降佐殷帝大戊,及周天子,皆有明德。 下及幽厉无道,而叔带去周適晋,事先君文侯,至于成公,世有立功,未尝绝祀。 今吾君独灭赵宗,国人哀之,故见龟策。唯君图之。”景公问:“赵尚有後子孙乎?” 韩厥具以实告。於是景公乃与韩厥谋立赵孤兒,召而匿之宫中。 诸将入问疾,景公因韩厥之众以胁诸将而见赵孤。赵孤名曰武。 诸将不得已,乃曰:“昔下宫之难,皆能死。我非不能死,我思立赵氏之後。 今赵武既立,为成人,复故位,我将下报赵宣孟与公孙杵臼。” 赵武啼泣顿首固请,曰:“武原苦筋骨以报子至死,而子忍去我死乎!” 程婴曰:“死易,立孤难耳。”公孙杵臼曰:“立孤与死孰难?”

Example 2: A larger volume of 《史记》

十一年,秦将白起遂拔我郢,烧先王墓夷陵。楚襄王兵散,遂不复战,东北保于陈城。 二十二年,秦复拔我巫、黔中郡。

二十三年,襄王乃收东地兵,得十馀万,复西取秦所拔我江旁十五邑以为郡,距秦。 二十七年,使三万人助三晋伐燕。复与秦平,而入太子为质于秦。楚使左徒侍太子于秦。

三十六年,顷襄王病,太子亡归。秋,顷襄王卒,太子熊元代立,是为考烈王。 考烈王以左徒为令尹,封以吴,号春申君。

考烈王元年,纳州于秦以平。是时楚益弱。

六年,秦围邯郸,赵告急楚,楚使子西救郑,受赂而去。白公胜怒, 乃遂与勇力死士石乞等袭杀令尹子西、子綦于朝,因劫惠王,置之高府,欲弑之。 惠王从者屈固负王亡走昭王夫人宫。白公自立为王。月馀,会叶公来救楚, 楚惠王之徒与共攻白公,杀之。惠王乃复位。是岁也,灭陈而县之。

十三年,吴王夫差强,陵齐、晋,来伐楚。十六年,越灭吴。四十二年,楚灭蔡。 四十四年,楚灭杞。与秦平。是时越已灭吴而不能正江、淮北;楚东侵,广地至泗上。

五十七年,惠王卒,子王安立。

王安五年,秦攻韩,韩急,使韩非使秦,秦留非,因杀之。

九年,秦虏王安,尽入其地,为颍州郡。韩遂亡。

太史公曰:韩厥之感晋景公,绍赵孤之子武,以成程婴、公孙杵臼之义, 此天下之阴德也。韩氏之功,于晋未睹其大者也。然与赵、魏终为诸侯十馀世, 宜乎哉!

召公奭与周同姓,姓姬氏。其後苗裔事晋,得封於韩原,曰韩武子。 武子後三世有韩厥,从封姓为韩氏。

韩厥,晋景公之三年,晋司寇屠岸贾将作乱,诛灵公之贼赵盾。赵盾已死矣, 欲诛其子赵朔。韩厥止贾,贾不听。厥告赵朔令亡。朔曰:“子必能不绝赵祀, 死不恨矣。”韩厥许之。及贾诛赵氏,厥称疾不出。程婴、公孙杵臼之义, 此天下之阴德也。韩氏之功,于晋未睹其大者也。然与赵、魏终为诸侯十馀世, 宜乎哉!

Example 3: Poem Generation from Classical Chinese Poetry Form

I collected 200 classical Chinese poems, each in the form known as “七言律诗” (seven-character regulated verse), in which each poem has 8 sentences and each sentence has 7 characters.
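The 8-by-7 shape described above is easy to check mechanically. The following is a small sketch of such a check. The function names and the set of punctuation marks treated as sentence breaks are assumptions for illustration and are not taken from the repository.

```python
import re

# Assumed sentence breaks: ASCII and full-width commas, Chinese full stop,
# question and exclamation marks, semicolons, and any whitespace.
SENTENCE_BREAKS = r"[,,。??!!;;\s]+"


def split_sentences(poem):
    """Split a poem into its sentences, dropping punctuation and blanks."""
    return [s for s in re.split(SENTENCE_BREAKS, poem) if s]


def is_qiyan_lvshi(poem):
    """Return True if the poem has exactly 8 sentences of 7 characters each."""
    sentences = split_sentences(poem)
    return len(sentences) == 8 and all(len(s) == 7 for s in sentences)


if __name__ == "__main__":
    poem = (
        "红军不怕远征难,万水千山只等闲。五岭逶迤腾细浪,乌蒙磅礴走泥丸。"
        "金沙水拍云崖暖,大渡桥横铁索寒。更喜岷山千里雪,三军过后尽开颜。"
    )
    print(is_qiyan_lvshi(poem))
```

A check like this only validates the length pattern. It says nothing about tone or rhyme rules, which regulated verse also follows.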

Instead of being creative, it turns out the network is pretty good at storing and recalling the poems.

The seed is “风”, which means “wind” in Chinese.

风波浩荡足行吟。

别梦依稀咒逝川,故园三十二年前。红旗卷起农奴戟,黑手高悬霸主鞭。

为有牺牲多壮志,敢叫日月换新天。喜看稻菽千重浪,遍地英雄下夕烟。

红军不怕远征难,万水千山只等闲。五岭逶迤腾细浪,乌蒙磅礴走泥丸。

金沙水拍云崖暖,大渡桥横铁索寒。更喜岷山千里雪,三军过后尽开颜。

记得当年草上飞,红军队里每相违。长征不是难堪日,战锦方为大问题。

斥鷃每闻欺大鸟,昆鸡长笑老鹰非。君今不幸离人世,国有疑难可问谁?

九嶷山上白云飞,帝子乘风下翠微。斑竹一枝千滴泪,红霞万朵百重衣。

洞庭波涌连天雪,长岛人歌动地诗。我欲因之梦寥廓,芙蓉国里尽朝晖。

饮茶粤海未能忘,索句渝州叶正黄。三十一年还旧国,落花时节读华章。

牢骚太盛防肠断,风物长宜放眼量。莫道昆明池水浅,观鱼胜过富春江。

钟山风雨起苍黄,百万雄师过大江。虎踞龙盘今胜昔,天翻地覆慨而慷。

宜将剩勇追穷寇,不可沽名学霸王。天若有情天亦老,人间正道是沧桑。

季冬除夜接新年,帝子王孙捧御筵。宫阙星河低拂树,殿廷灯烛上熏天。

弹弦奏节梅风入,对局探钩柏酒传。欲向正元歌万寿,暂留欢赏寄春前。

今年游寓独游秦,愁思看春不当春。上林苑里花徒发,细柳营前叶漫新。

公子南桥应尽兴,将军西第几留宾。寄语洛城风日道,明年春色倍还人。

迢峣太华俯咸京,天外三峰削不成。武帝祠前云欲散,仙人掌上雨初晴。

河山北枕秦关险,驿树西连汉畤平。借问路傍名利客,无如此处学长生。

嗟君此别意何如,驻马衔杯问谪居。巫峡啼猿数行泪,衡阳归雁几封书。

青枫江上秋帆远,白帝城边古木疏。圣代即今多雨露,暂时分手莫踌躇。

The 10 poems above, as recalled by the network, are identical to the originals, even though the network sees only roughly the previous two sentences as history.
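The generation procedure behind such a recall can be sketched as a sliding-window loop: start from a seed, repeatedly feed the last few characters to a next-character predictor, and append its answer. In the sketch below the trained network is replaced by a toy lookup (make_memorizer) that simply searches the training text. That stand-in, the function names, and the window size of 16 (two 7-character sentences plus punctuation) are assumptions for illustration, not the project's actual model or settings.

```python
from typing import Callable


def generate(seed: str, predict_next: Callable[[str], str],
             history: int, length: int) -> str:
    """Grow `seed` one character at a time, showing the predictor only
    the last `history` characters at each step."""
    text = seed
    while len(text) < length:
        text += predict_next(text[-history:])
    return text


def make_memorizer(corpus: str) -> Callable[[str], str]:
    """Toy stand-in for a trained network (assumption): return the character
    that follows the longest suffix of the context found in `corpus`."""
    def predict_next(context: str) -> str:
        for start in range(len(context)):
            suffix = context[start:]
            pos = corpus.find(suffix)
            if pos != -1 and pos + len(suffix) < len(corpus):
                return corpus[pos + len(suffix)]
        return corpus[0]  # nothing matched: fall back to the first character
    return predict_next


if __name__ == "__main__":
    corpus = "钟山风雨起苍黄,百万雄师过大江。"
    recalled = generate("风", make_memorizer(corpus), history=16, length=14)
    print(recalled)
    print(recalled in corpus)  # verbatim recall check
```

With a real model, predict_next would wrap the network's forward pass and a sampling or argmax step; the loop and the verbatim check stay the same.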

Contacts

Yuhuang Hu
Email: duguyue100@gmail.com