The API can classify documents supplied as text or as a URL. Classification takes one input parameter, the URL or document you are classifying, and returns two output parameters: tags and original tags.
Original tags are tags that are very close to the document you provided. They are picked based on the similarity between the input document and a set of user-labeled documents; those user labels are the original tags.
However, original tags tend to be too specific. For example, a document about vegetables could be very close to a document about diets, so you might get confusing results in certain cases. That is why we implemented another layer of generalization: in this case we map both diets and vegetables to food, so both documents will be about food.
This is the second output, the tags. They are a generalization of the original tags and tend to be very accurate at picking up the general topic of the document. We have around 500 general tags and 120,000 original tags. Not all original tags are necessarily generalized, and we keep adding more tags over time, which is why we decided not to release a full list of tags.
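The generalization step can be pictured as a lookup from specific tags to broader ones. The following is only a minimal sketch of that idea, not the service's implementation: the `GENERAL_TAG` table, its entries beyond the diets/vegetables example, and the `generalize` function are all hypothetical names invented for illustration.

```python
from collections import Counter

# Hypothetical mapping from specific "original" tags to general tags.
# Only diets -> food and vegetables -> food come from the text; the rest is invented.
GENERAL_TAG = {
    "diets": "food",
    "vegetables": "food",
    "football": "sports",
}


def generalize(original_tags):
    """Return the general tags for a list of original tags, most frequent first.

    Original tags with no known general tag are skipped, since not every
    original tag is necessarily generalized.
    """
    counts = Counter(GENERAL_TAG[t] for t in original_tags if t in GENERAL_TAG)
    return [tag for tag, _ in counts.most_common()]


# Two documents with different original tags end up with the same general tag.
vegetables_doc = generalize(["vegetables", "gardening"])
diets_doc = generalize(["diets"])
print(vegetables_doc, diets_doc)
```

In this sketch "gardening" has no general tag and is simply dropped, which is one possible way to handle original tags that have not been generalized yet.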
To get started with the classification API, you can check out this Python example.
To evaluate similarity we use word vectors (also known as word embeddings), one of the modern techniques in NLP.
We trained several such models and choose among them according to the specific text input. Most text similarity services perform static analysis on the texts they compare: for example, they count common words, look for similarities in sentence structure, or count the usage of certain word types such as verbs. These techniques are very effective for certain problems, for example when the two texts are very short, when you are validating grammar, or when you are looking for syntactic similarities. Our similarity API evaluates semantic similarity instead, that is, the degree to which two documents are about the same subject (whichever subject that is). As input you provide two documents as plain text or URLs, and as a result you get a similarity score from 0 to 1, where 0 means the documents are completely different and 1 means they are identical (0 is very hard to reach, but identical texts have a similarity coefficient of 1).
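To make the idea of semantic similarity with word vectors more concrete, here is a small sketch. It is not how the service computes its score, which is not described here. It assumes a common textbook approach: average the word vectors of each document and take the cosine of the two averages, clipped to the 0 to 1 range. The `VECTORS` table holds invented 3-dimensional toy values, and `document_vector` and `similarity` are hypothetical helper names.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# Real models have hundreds of dimensions and are learned from large corpora.
VECTORS = {
    "carrot": (0.9, 0.1, 0.0),
    "salad": (0.8, 0.2, 0.1),
    "diet": (0.7, 0.3, 0.1),
    "engine": (0.0, 0.1, 0.9),
    "car": (0.1, 0.0, 0.95),
}


def document_vector(text):
    """Average the vectors of all known words in the text, or None if there are none."""
    words = [w for w in text.lower().split() if w in VECTORS]
    if not words:
        return None
    dims = len(next(iter(VECTORS.values())))
    return tuple(sum(VECTORS[w][i] for w in words) / len(words) for i in range(dims))


def similarity(text_a, text_b):
    """Cosine similarity of the two averaged document vectors, clipped to [0, 1]."""
    a, b = document_vector(text_a), document_vector(text_b)
    if a is None or b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return max(0.0, min(1.0, dot / norm))


# Two food-related texts share no words but still score higher than an unrelated pair.
print(similarity("carrot salad", "diet"))
print(similarity("carrot salad", "car engine"))
```

Note that the two food-related texts have no words in common, so a word-counting approach would treat them as unrelated, while the vector-based score still places them close together.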
Our similarity example is a great starting point if you need such a service.
If you need to analyse very short sentences, such as chat messages or tweets, the classification and similarity APIs won’t work very well, because they need sufficient context to pick up the subject. (To be more precise, the similarity API will still evaluate whether two short texts are similar, but to it all short texts look quite similar, even if you use antonyms, for example.) You can use this service to summarize a sentence in several words, which is very useful for data mining, chat bots, auto-correct and auto-suggestion, or paraphrasing. We have several examples on our GitHub page:
- Summarize a sentence using the Python client.
- Cluster data based on its semantic features.
- Define a group of items and check whether some arbitrary elements belong to that group.
These are just a few examples of what you can do; we plan to add more. The endpoint for evaluating similarity vectors is also exposed and can be used by users with more knowledge of NLP and word vectors.
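For readers who want to experiment with vectors directly, the group-membership idea from the list above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the GitHub examples: the 2-dimensional `VECTORS` values are invented, the 0.8 threshold is arbitrary, and `cosine`, `centroid`, and `belongs` are hypothetical helper names. In practice the vectors would come from a trained model or from the exposed endpoint.

```python
import math

# Invented 2-dimensional toy vectors, for illustration only.
VECTORS = {
    "apple": (0.9, 0.1),
    "pear": (0.85, 0.2),
    "banana": (0.8, 0.15),
    "truck": (0.1, 0.9),
}


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid(words):
    """Component-wise mean of the vectors of the given words."""
    vectors = [VECTORS[w] for w in words]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def belongs(word, group, threshold=0.8):
    """True if the word's vector is close enough to the centre of the group.

    The threshold is an arbitrary choice for this sketch and would need tuning.
    """
    return cosine(VECTORS[word], centroid(group)) >= threshold


fruits = ["apple", "pear"]
print(belongs("banana", fruits))
print(belongs("truck", fruits))
```

Comparing against the group's centroid is one simple design choice; comparing against the nearest group member instead is an equally reasonable alternative when groups are not tightly clustered.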
We hope this information is helpful. If you have more questions or something is not clear, just ping us and we will help you as best we can.