📦 @bntk/tokenization
tokenizeToSentences()
function tokenizeToSentences(text): string[];
Defined in: sentence.ts:50
Tokenizes a Bangla text into an array of sentences.
Parameters
Parameter | Type | Description |
---|---|---|
text | string | The input Bangla text to tokenize. Can contain mixed content including URLs, emails, and special characters. |
Returns
string[]
An array of cleaned and tokenized sentences, with duplicates removed.
Description
This function performs the following steps:
- Splits text by line breaks
- Further splits by Bangla sentence separators
- Cleans each sentence by:
  - Removing text within parentheses, brackets, braces, and angle brackets
  - Removing URLs and email addresses
  - Removing HTML entities
  - Removing Latin characters
  - Keeping only Bangla characters, spaces, and essential punctuation
  - Normalizing spaces and punctuation
- Filters sentences based on the following criteria:
  - Must contain Bangla characters (Unicode range: \u0980-\u09FF)
  - Must have more than 3 words
  - Must not be empty
- Removes duplicates with a Set and returns the remaining sentences as an array
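The steps above can be sketched as a small standalone function. This is not the package's source code: the names `sketchTokenizeToSentences` and `cleanSentenceSketch`, the separator set (`।`, `?`, `!`), the list of punctuation that is kept, and every regular expression are assumptions made for illustration, so the real implementation may behave differently in edge cases.

```typescript
// Hypothetical sketch of the documented pipeline; NOT the @bntk/tokenization source.
const BANGLA_CHAR = /[\u0980-\u09FF]/;

// Cleaning step (all patterns are assumptions).
function cleanSentenceSketch(sentence: string): string {
  return sentence
    .replace(/\([^)]*\)|\[[^\]]*\]|\{[^}]*\}|<[^>]*>/g, " ") // bracketed text
    .replace(/https?:\/\/\S+|www\.\S+/g, " ")                // URLs
    .replace(/\S+@\S+\.\S+/g, " ")                           // email addresses
    .replace(/&[a-zA-Z0-9#]+;/g, " ")                        // HTML entities
    .replace(/[^\u0980-\u09FF\s,;:'"-]/g, " ")               // keep Bangla, spaces, some punctuation
    .replace(/\s+/g, " ")                                    // normalize spaces
    .trim();
}

function sketchTokenizeToSentences(text: string): string[] {
  const kept: string[] = [];
  for (const line of text.split(/\r?\n/)) {          // 1. split by line breaks
    for (const part of line.split(/[\u0964?!]/)) {   // 2. split by separators (U+0964 is the danda)
      const sentence = cleanSentenceSketch(part);    // 3. clean
      const wordCount = sentence.split(" ").length;
      // 4. filter: non-empty, contains Bangla, more than 3 words
      if (sentence.length > 0 && BANGLA_CHAR.test(sentence) && wordCount > 3) {
        kept.push(sentence);
      }
    }
  }
  return Array.from(new Set(kept));                  // 5. remove duplicates
}

const sampleLine = "আমি বাংলায় গান গাই";
const sampleText =
  sampleLine + "।\n" + sampleLine + "। Visit https://example.com (note)";
const sampleSentences = sketchTokenizeToSentences(sampleText);
console.log(sampleSentences.length); // 1
```

The sketch applies the word-count rule literally (more than 3 words), so shorter fragments are dropped. The package may count words or apply the threshold differently.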
Examples
Basic usage with simple Bangla text:
const text = "আমি বাংলায় গান গাই। তুমি কি শুনবে?";
console.log(tokenizeToSentences(text));
// Output: ["আমি বাংলায় গান গাই", "তুমি কি শুনবে"]
Handling mixed content:
const mixedText =
"আমি বাংলায় গান গাই। Visit https://example.com or email@example.com";
console.log(tokenizeToSentences(mixedText));
// Output: ["আমি বাংলায় গান গাই"]
Handling text with special characters:
const specialText =
"বাংলা টেক্সট (ইংরেজি টেক্সট) [বন্ধনী টেক্সট] {কোঁকড়া টেক্সট}";
console.log(tokenizeToSentences(specialText));
// Output: ["বাংলা টেক্সট"]
tokenizeToWords()
function tokenizeToWords(text): string[];
Defined in: word.ts:57
Tokenizes a Bangla text string into an array of words.
Parameters
Parameter | Type | Description |
---|---|---|
text | string | The input Bangla text to tokenize. Can contain mixed content including punctuation and special characters. |
Returns
string[]
An array of cleaned and tokenized words, with empty strings removed.
Description
This function performs the following steps:
- Cleans the input text by:
  - Removing non-Bangla characters (keeping only Unicode range: \u0980-\u09FF)
  - Preserving essential punctuation marks (।, ,, ;, :, ', ", ?, !)
  - Preserving hyphens for compound words
- Splits the text by whitespace
- Further splits each segment by punctuation (excluding hyphens)
- Cleans each word by:
  - Removing trailing hyphens
  - Removing Bangla digits from start and end
  - Trimming whitespace
- Filters out empty strings
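As a rough illustration of these steps, here is a minimal standalone sketch. It is not the package's implementation: the function name `sketchTokenizeToWords`, the regular expressions, and the choice to delete (rather than replace with a space) non-Bangla characters are all assumptions.

```typescript
// Hypothetical sketch of the documented word pipeline; NOT the @bntk/tokenization source.
// \u0964 is the danda; \u09E6-\u09EF are the Bangla digits 0-9.
const WORD_PUNCTUATION = /[\u0964,;:'"?!]/;

function sketchTokenizeToWords(text: string): string[] {
  // 1. Keep Bangla characters, whitespace, the listed punctuation, and hyphens.
  const cleaned = text.replace(/[^\u0980-\u09FF\s\u0964,;:'"?!-]/g, "");
  const words: string[] = [];
  // 2. Split by whitespace.
  for (const segment of cleaned.split(/\s+/)) {
    // 3. Split each segment by punctuation (hyphens are not separators).
    for (const piece of segment.split(WORD_PUNCTUATION)) {
      // 4. Clean each word.
      const word = piece
        .replace(/-+$/, "")                                  // trailing hyphens
        .replace(/^[\u09E6-\u09EF]+|[\u09E6-\u09EF]+$/g, "") // leading/trailing Bangla digits
        .trim();
      // 5. Drop empty strings.
      if (word.length > 0) {
        words.push(word);
      }
    }
  }
  return words;
}

console.log(sketchTokenizeToWords("১টি বই, ২টি খাতা।"));
// ["টি", "বই", "টি", "খাতা"]
```

Because hyphens are excluded from the punctuation pattern, a compound written with an internal hyphen stays a single token in this sketch, and only a trailing hyphen is stripped.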
Examples
Basic usage with simple Bangla text:
const text = "আমি বাংলায় গান গাই";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই"]
Handling text with punctuation:
const text = "আমি, বাংলায় গান গাই। তুমি কি শুনবে?";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই", "তুমি", "কি", "শুনবে"]
Handling compound words with hyphens:
const text = "আমি-তুমি বাংলা-ভাষা শিখছি";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি-তুমি", "বাংলা-ভাষা", "শিখছি"]
Handling text with Bangla digits:
const text = "১টি বই ২টি খাতা";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["টি", "বই", "টি", "খাতা"]