Nltk: NLTK for Vietnamese

Created on 31 May 2015 · 22 Comments · Source: nltk/nltk

Does NLTK support the Vietnamese language?

If it doesn't, how can I contribute to add Vietnamese support to NLTK?

The usage would look something like this:

>>> import nltk
>>> sentence = "Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe."
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Vào', 'tám', 'giờ', 'sáng', 'thứ sáu', ',', 'tôi', 'cảm thấy', 'không', 'được', 'khỏe', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:5]
[('Vào', 'IN'), ('tám', 'CD'), ('giờ', 'JJ'), ('sáng', 'NN'), ('thứ sáu', 'NNP')]
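
To make the gap concrete, here is a minimal sketch of what plain whitespace-and-punctuation splitting produces for that sentence (using the variant that includes 'sáng', matching the token list). The function name `syllable_tokenize` is hypothetical and not part of NLTK. The point is that Vietnamese words such as 'thứ sáu' and 'cảm thấy' span several space-separated syllables, so this kind of splitting yields syllables rather than words.

```python
import re
import unicodedata


def syllable_tokenize(text):
    """Split text into space-delimited syllables and punctuation marks.

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    # Normalize to NFC so each accented letter is a single code point
    # that the \w character class can match.
    text = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe."
syllables = syllable_tokenize(sentence)
print(syllables)
```

The multi-syllable words come out as separate items ('thứ', 'sáu' and 'cảm', 'thấy'), which is why a dedicated word segmentation step is needed before POS tagging.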

Labels: corpus, enhancement, inactive, nice idea

rain1024

Most helpful comment

@u8621011 Glad you asked. Our next steps in underthesea are to integrate more modules, such as speech synthesis, machine translation and a (simple) chatbot for Vietnamese, and to improve the speed and accuracy of the current modules (word segmentation, POS tagging, chunking, named entity recognition, text classification and sentiment analysis).

As for the NLTK porting plan, I think for now we can write the word segmentation code in pure Python (perhaps with Cython to speed it up). My friend @trungtv and I had a pull request accepted in spaCy two months ago.

rain1024 on 30 May 2018
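
As a rough illustration of what a pure-Python word segmenter could look like, here is a minimal sketch of greedy longest matching against a dictionary. This is a classic baseline technique, not the method used by underthesea or any other tool mentioned in this thread. The two-entry `LEXICON` and the `segment` function are hypothetical; a real system would need a large dictionary and a way to resolve ambiguous matches.

```python
import re
import unicodedata

# Hypothetical toy lexicon of multi-syllable words (assumption, for illustration).
LEXICON = {unicodedata.normalize("NFC", w) for w in ("thứ sáu", "cảm thấy")}


def segment(text, lexicon=LEXICON):
    """Greedy longest-match segmentation over space-delimited syllables."""
    text = unicodedata.normalize("NFC", text)
    syllables = re.findall(r"\w+|[^\w\s]", text)
    # Longest dictionary entry, measured in syllables.
    max_len = max((len(w.split()) for w in lexicon), default=1)

    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate)
                i += n
                break
    return words


words = segment("Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe.")
print(words)
```

Greedy matching is simple and fast, but it cannot handle overlapping ambiguities or words missing from the dictionary, which is part of why statistical and neural approaches are discussed elsewhere in the thread.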

All 22 comments

Hi @stevenbird,
What do you think? We could probably port these:
http://jvntextpro.sourceforge.net/

longdt219 on 8 Jun 2015

@rain1024 would you like to do some porting, or contribute wrappers for external Java libraries?

stevenbird on 10 Jun 2015

@stevenbird: Yes, I'd be glad to do this.

@longdt219: can we do this together?

rain1024 on 10 Jun 2015

Yes, sure @rain1024

longdt219 on 10 Jun 2015

Hi @longdt219,

Could I have your email? I'll contact you with more information :smile:

rain1024 on 10 Jun 2015

Hi @rain1024,
I emailed you, but we can probably discuss it here so that others can join the discussion.

longdt219 on 11 Jun 2015

@rain1024 @longdt219,

How about porting https://github.com/rockkhuya/DongDu as a first step? It is a word segmentation tool, written in C++ by the way.

I don't know C++ or Java, but that tool seems to have the best performance so far, according to http://xltiengviet.wikia.com/wiki/K%E1%BB%B7_l%E1%BB%A5c_t%C3%A1ch_t%E1%BB%AB

manhtai on 18 Jun 2015

Hi, me again,

After searching around for a while I found that word segmentation in Vietnamese is a really hard problem, not to mention POS tagging.

I had an idea, inspired by https://github.com/mesnilgr/is13, to use deep learning to learn word embeddings, and I'll try to implement it. Something interesting may come of it, or not :smile_cat:

manhtai on 18 Jun 2015

I've implemented a neural net for Vietnamese word segmenting here https://github.com/manhtai/vietseg. Have a look!

It's not so good for now. But at least I've tried, huh? :smile:

manhtai on 23 Jun 2015

As for the performance, it looks OK. However, what's the baseline?
What are the dependencies? Using network.py from https://github.com/mnielsen/neural-networks-and-deep-learning is probably not a good approach with respect to maintenance and licensing. The idea is that we don't want to rely on external code.
Using Theano (Python-based) for this might be a better (and simpler) solution.

longdt219 on 23 Jun 2015
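
For comparing a segmenter against a baseline, one common way to score word segmentation is precision, recall and F1 over word spans. The sketch below assumes that metric and uses a trivial "every syllable is its own word" baseline. Neither the metric nor the baseline is specified in this thread, and the gold segmentation shown is a made-up toy example.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) syllable-index spans."""
    spans = set()
    start = 0
    for word in words:
        length = len(word.split())
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F1) of predicted word spans against gold spans."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy gold segmentation (assumption, for illustration only).
gold = ["thứ sáu", "tôi", "cảm thấy", "khỏe"]
# Trivial baseline: treat every syllable as a separate word.
baseline = [syllable for word in gold for syllable in word.split()]

precision, recall, f1 = segmentation_prf(gold, baseline)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Any proposed segmenter should at least beat the score of such a trivial baseline on a held-out annotated corpus.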

Thanks, I'm looking for a baseline and will add it soon.

Theano may be better but not simpler; network.py is a standalone file with fewer than 300 lines of code.

Anyway, it's only a quick and dirty implementation. I've added a list of future work to the README file, and that's for working on in the future :smile_cat:

manhtai on 23 Jun 2015

@longdt219 @rain1024 I have been using jvntextpro2 for a while and it's pretty decent. It's written in Java and is also an open-source project. We could choose to port this as well.

letuananh on 14 Sep 2015

Bumping the issue ;P

I wrote a JVnTextPro wrapper some time ago. It's not properly documented and the coding style is outdated, but I hope it helps.

It would be great to see wrappers/ports of annotators for other Asian languages too =)

alvations on 28 Feb 2016

@alvations: are you interested in porting JVnTextPro to NLTK :P ?

letuananh on 29 Feb 2016

@letuananh after much thinking, yes. Once the new PTB tokenizer is merged, an interface to JVN will be on my todo list. Care to help?

alvations on 5 May 2017

:+1: It would be great to support Vietnamese

stevenbird on 25 May 2017

wow... this is awesome stuff. Would love to have Vietnamese support!

toannguyenle on 4 Jun 2017

@manhtai do you plan to continue your project? It sounds awesome.

vietzerg on 27 Jul 2017

Coming back to this issue after the next minor release =)
But meanwhile take a look at https://github.com/magizbox/underthesea

alvations on 6 Sep 2017

@rain1024 How about your original porting plan? I ended up here because I have ported vnTokenizer to Python and am wondering whether it could be ported into NLTK. I've also seen your continued good work on underthesea and have a question about your next steps.

u8621011 on 30 May 2018

@u8621011 underthesea isn't my work but they're doing a good job =)

I'm not sure how much mileage we can get if we start porting from Jvntextpro. But I think I won't be able to take another try at porting until late July.

Vietnamese support is surely on the list of things I personally would like to see and work on in NLTK.

alvations on 30 May 2018