Nltk: NLTK for Vietnamese

Created on 31 May 2015  ·  22 Comments  ·  Source: nltk/nltk

Does NLTK support the Vietnamese language?

If it doesn't, how can I contribute to make NLTK support Vietnamese?

It would work like this:

>>> import nltk
>>> sentence = "Vào tám giờ sáng thứ sáu, tôi cảm thấy không được khỏe."
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Vào', 'tám', 'giờ', 'sáng', 'thứ sáu', ',', 'tôi', 'cảm thấy', 'không', 'được', 'khỏe', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:5]
[('Vào', 'IN'), ('tám', 'CD'), ('giờ', 'JJ'), ('sáng', 'NN'), ('thứ sáu', 'NNP')]
Labels: corpus, enhancement, inactive, nice idea

All 22 comments

Hi @stevenbird,
What do you think? We could probably port this:
http://jvntextpro.sourceforge.net/

@rain1024 would you like to do some porting, or contribute wrappers for external Java libraries?

@stevenbird: Yes, I'm glad to do this.

@longdt219: can we do this together?

Yes sure @rain1024

Hi @longdt219,

Can I have your email? I will contact you for more information :smile:

Hi @rain1024,
I emailed you, but we can probably discuss it here so that others can join the discussion.

@rain1024 @longdt219,

How about porting https://github.com/rockkhuya/DongDu as a first step? It is aimed at word segmentation and is written in C++, by the way.

I don't know C++ or Java, but that tool must have the best performance so far, according to http://xltiengviet.wikia.com/wiki/K%E1%BB%B7_l%E1%BB%A5c_t%C3%A1ch_t%E1%BB%AB

Hi, me again,

After searching around for a while I found that word segmentation in Vietnamese is a really hard problem, not to mention POS tagging.

I had an idea, inspired by https://github.com/mesnilgr/is13, to use deep learning to learn word embeddings, and I'll try to implement it. Something interesting may come of it, or not :smile_cat:

I've implemented a neural net for Vietnamese word segmenting here https://github.com/manhtai/vietseg. Have a look!

It's not so good for now. But at least I've tried, huh? :smile:

The performance looks OK. However, what's the baseline?
What are the dependencies? Using network.py from https://github.com/mnielsen/neural-networks-and-deep-learning is probably not a good idea with respect to maintenance and licensing. The idea is that we don't want to rely on external code.
Using Theano (Python-based) for this might be a better (and simpler) solution.
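To make the baseline question concrete, here is a minimal sketch of how a segmenter could be scored against a gold segmentation using word-level precision, recall and F1. It is not the evaluation used by vietseg or any tool mentioned in this thread; the function names `word_spans` and `segmentation_prf` and the "one word per syllable" baseline are assumptions made for illustration only.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) syllable spans.

    Each word is a string of one or more whitespace-separated syllables.
    """
    spans = set()
    start = 0
    for word in words:
        end = start + len(word.split())
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F1) over exactly matching word spans.

    Hypothetical helper: both inputs must cover the same syllable sequence.
    """
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy gold segmentation and a trivial baseline that treats every
# syllable as its own word (an assumed baseline, for illustration).
gold = ["tôi", "cảm thấy", "không", "được", "khỏe"]
baseline = [syllable for word in gold for syllable in word.split()]

precision, recall, f1 = segmentation_prf(gold, baseline)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Any proposed segmenter would need to beat a trivial baseline like this one on the same test data before its numbers mean much.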

Thanks, I'm looking for a baseline and will add it soon.

Theano may be better, but not simpler: network.py is an independent file with fewer than 300 lines of code.

Anyway, it's only a quick and dirty implementation. I've added future work to the README file, and that's for working on in the future :smile_cat:

@longdt219 @rain1024 I have been using jvntextpro2 for a while and it's pretty decent. It's written in Java and is also an open-source project. We may choose to port this as well.

Bumping the issue ;P

I wrote a JVnTextPro wrapper some time ago. It's not properly documented and the coding style is outdated, but I hope it helps.

It would be great to see wrappers/ports of annotators for other Asian languages too =)

@alvations: are you interested in porting JVnTextPro to NLTK :P ?

@letuananh After much thinking, yes. After the new PTB tokenizer is merged, an interface to JVN would be something on my todo list. Care to help?

:+1: It would be great to support Vietnamese

wow... this is awesome stuff. Would love to have Vietnamese support!

@manhtai Do you plan to continue your project? It sounds awesome.

Coming back to this issue after the next minor release =)
But meanwhile take a look at https://github.com/magizbox/underthesea

@rain1024 How about your original porting plan? I ended up here because I have ported vnTokenizer to Python and am considering whether it could be ported into NLTK. I also saw your continued good work on underthesea and have a question about your next steps.

@u8621011 underthesea isn't my work but they're doing a good job =)

I'm not sure how much mileage we can get if we start porting from Jvntextpro. But I think I won't be able to take another try at porting until late July.

Vietnamese support is surely on the list of things I personally would like to see and work on in NLTK.

@u8621011 Glad you asked. Our next steps in underthesea are integrating more modules, such as speech synthesis, machine translation and a (simple) chatbot for Vietnamese, and improving speed and accuracy in the current modules (word segmentation, POS tagging, chunking, named entity recognition, text classification and sentiment analysis).

As for the porting plan in NLTK, I think for now we can write code in pure Python to do the word segmentation task (perhaps with Cython to speed it up). My friend @trungtv and I had a pull request accepted in spaCy 2 months ago.
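As a rough illustration of what a pure-Python word segmenter could look like, here is a sketch of greedy longest matching against a lexicon. This is a simple dictionary-based technique swapped in for illustration, not the method used by underthesea, vnTokenizer or any other tool in this thread. The function names, the `max_syllables` parameter and the two-entry toy lexicon are all assumptions.

```python
import unicodedata


def normalize(text):
    """Lowercase and NFC-normalize so composed and decomposed
    Vietnamese diacritics compare equal."""
    return unicodedata.normalize("NFC", text).lower()


def segment(sentence, lexicon, max_syllables=4):
    """Greedy left-to-right longest matching over whitespace-separated syllables.

    `lexicon` is a set of already-normalized multi-syllable words.
    Punctuation handling is deliberately left out of this sketch.
    """
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_syllables, len(syllables) - i)
        # Try the longest candidate first; a single syllable always matches.
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or normalize(candidate) in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon (assumption); a usable one would hold tens of thousands of entries.
LEXICON = {normalize(word) for word in ["thứ sáu", "cảm thấy"]}

print(segment("tôi cảm thấy không được khỏe", LEXICON))
# ['tôi', 'cảm thấy', 'không', 'được', 'khỏe']
```

Greedy matching cannot resolve overlapping candidates or words missing from the lexicon, which is part of why statistical and neural approaches are discussed earlier in the thread.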
