Method and arrangement for SIM algorithm automatic charset detection

Bibliographic Details
Main Author DIAO LILI
Format Patent
Language English
Published 02.11.2010
Summary: The invention relates, in an embodiment, to a computer-implemented method for handling a target document that has been transmitted electronically and involves an encoding scheme. The method includes training on a plurality of text document samples to obtain a set of machine learning models. Training uses SIM (Similarity Algorithm) to generate the set of machine learning models from feature vectors obtained from the text document samples. The method also includes applying the set of machine learning models to a set of feature vectors converted from the target document in order to detect the encoding scheme. The method further includes decoding the target document, based on at least the detected encoding scheme, to obtain the decoded content of the document.
Bibliography: Application Number: US20100714392
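
The train, detect and decode flow described in the Summary can be illustrated with a minimal Python sketch. This is not the patented implementation: the record does not say how SIM builds feature vectors or measures similarity, so the byte-bigram features, the per-encoding centroid "models", the cosine similarity measure, and every function name below are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def byte_bigram_vector(data: bytes) -> dict:
    """Turn raw bytes into a normalized byte-bigram frequency vector (assumed feature)."""
    counts = Counter(zip(data, data[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    if not a or not b:
        return 0.0
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


def train(samples: dict) -> dict:
    """Build one model per encoding: the mean feature vector of its training samples.

    `samples` maps an encoding name to a non-empty list of text strings.
    """
    models = {}
    for encoding, texts in samples.items():
        centroid = Counter()
        for text in texts:
            for key, value in byte_bigram_vector(text.encode(encoding)).items():
                centroid[key] += value
        models[encoding] = {key: value / len(texts) for key, value in centroid.items()}
    return models


def detect_encoding(models: dict, target: bytes) -> str:
    """Return the encoding whose model is most similar to the target's feature vector."""
    vector = byte_bigram_vector(target)
    return max(models, key=lambda enc: cosine_similarity(models[enc], vector))


def decode_document(models: dict, target: bytes) -> str:
    """Detect the encoding, then decode the target document with it."""
    return target.decode(detect_encoding(models, target), errors="replace")


if __name__ == "__main__":
    # Hypothetical toy training set; a real system would need far more samples.
    texts = ["こんにちは世界", "日本語のテキスト"]
    models = train({"utf-8": texts, "shift_jis": texts})
    received = "世界のテキスト".encode("shift_jis")
    print(detect_encoding(models, received))
    print(decode_document(models, received))
```

Averaging the training vectors into one centroid per encoding keeps the sketch short. A nearest-neighbour comparison against individual sample vectors would be an equally plausible reading of "similarity", and the record does not indicate which approach, if either, the patent uses.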