# Core Functions
Core functions are the main functions needed for any type of NLP task. We have defined a few core functions in the berkelium library.
# Tokenizer
The tokenizer function is used to tokenize a text input and returns an Array<string>
of tokens. This is a core function and one of the basic steps in any NLP task.
To use the tokenizer, use the following code:

```js
const tokens = berkelium.tokenize(sentence);
```
# Parameters
sentence string
A string input to be tokenized.
# Returns
tokens Array<string>
Returns a string array of tokens.
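As a quick illustration, the sketch below tokenizes a short sentence. The require-style import and the sample output are assumptions for illustration; the actual token boundaries depend on berkelium's tokenizer rules.

```js
const berkelium = require('berkelium'); // import style assumed for illustration

const sentence = 'How is the weather today?';
const tokens = berkelium.tokenize(sentence);

// With a typical whitespace/punctuation tokenizer this might print
// something like: [ 'how', 'is', 'the', 'weather', 'today' ]
// (the exact output depends on berkelium's tokenizer rules)
console.log(tokens);
```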
# Encoder
The encoder function vectorizes string tokens: each token is assigned a unique number. This is important when preparing text data to be fed into a machine learning model. This function is usually used after tokenizing your string data with the tokenize function.
To vectorize the text, use the code below:
```js
const vocab = await berkelium.encode(tokens);
```
# Parameters
tokens Array<string>
A string array of tokens.
# Returns
vocab DICTIONARY_BOOK
A dictionary of vocabulary entries with their assigned unique numeric values.
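Putting the two steps together, here is a hedged end-to-end sketch that tokenizes a sentence and encodes the tokens. Only tokenize and encode come from this page; the import style and the shape of the returned DICTIONARY_BOOK shown in the comment are assumptions.

```js
const berkelium = require('berkelium'); // import style assumed for illustration

async function buildVocab(sentence) {
  const tokens = berkelium.tokenize(sentence);
  const vocab = await berkelium.encode(tokens); // token -> unique number
  return vocab;
}

// A DICTIONARY_BOOK might look like { how: 1, is: 2, ... };
// the exact shape and numbering scheme are assumptions here.
buildVocab('How is the weather today?').then(console.log);
```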
# Preprocessor
The preprocessor function processes the text data and creates a DATASET
object that can be used to train a machine learning model.
To preprocess the text data, use the code below:
```js
const dataset = await berkelium.preprocess(textData);
```
# Parameters
textData Array<Array<string>>
The training data prepared in the Preparing Data step.
# Returns
dataset DATASET
Returns a dataset object that contains the following properties:
- x (Array<Array<number>>): feature data for training (vectorized)
- y (Array<Array<number>>): label data for training (vectorized)
- labels (Array<string>): label data in string format
- vocab (DICTIONARY_BOOK): dictionary of vocabulary found in the dataset
- length (number): sequence length
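To tie it together, below is a minimal sketch that preprocesses text data and reads the resulting DATASET properties. The shape of textData (text/label pairs) is an assumption based on the Preparing Data step, and the sample rows are made up for illustration.

```js
const berkelium = require('berkelium'); // import style assumed for illustration

// Sample rows assumed to pair an input text with its label,
// as prepared in the Preparing Data step.
const textData = [
  ['how is the weather today', 'weather'],
  ['see you tomorrow', 'farewell'],
];

async function run() {
  const dataset = await berkelium.preprocess(textData);
  console.log(dataset.x);      // vectorized feature data
  console.log(dataset.y);      // vectorized label data
  console.log(dataset.labels); // labels in string format
  console.log(dataset.vocab);  // DICTIONARY_BOOK for the dataset
  console.log(dataset.length); // sequence length
}

run();
```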