📦 @bntk/tokenization

tokenizeToSentences()

function tokenizeToSentences(text): string[];

Defined in: sentence.ts:50

Tokenizes Bangla text into an array of sentences.

Parameters

Parameter | Type | Description
text | string | The input Bangla text to tokenize. Can contain mixed content including URLs, emails, and special characters.

Returns

string[]

An array of cleaned and tokenized sentences, with duplicates removed.

Description

This function performs the following steps:

  1. Splits text by line breaks
  2. Further splits each line on Bangla sentence separators, such as the danda (।)
  3. Cleans each sentence by:
    • Removing text within parentheses, brackets, braces, and angle brackets
    • Removing URLs and email addresses
    • Removing HTML entities
    • Removing Latin characters
    • Keeping only Bangla characters, spaces, and essential punctuation
    • Normalizing spaces and punctuation
  4. Filters sentences based on the following criteria:
    • Must contain Bangla characters (Unicode range: \u0980-\u09FF)
    • Must have more than 3 words
    • Must not be empty
  5. Removes duplicates by passing the sentences through a Set, then returns them as an array
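
The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's source: the names `cleanSentence` and `sketchSentences`, every regular expression, and the choice of separators and retained punctuation are assumptions made for the example. The `minWords` default of 4 mirrors the "more than 3 words" criterion.

```typescript
// Illustrative sketch only: names and regexes are assumptions,
// not the implementation shipped in @bntk/tokenization.
const BANGLA_CHAR = /[\u0980-\u09FF]/;

function cleanSentence(raw: string): string {
  return raw
    .replace(/\([^)]*\)|\[[^\]]*\]|\{[^}]*\}|<[^>]*>/g, " ") // bracketed text
    .replace(/https?:\/\/\S+/g, " ") // URLs
    .replace(/\S+@\S+\.\S+/g, " ") // email addresses
    .replace(/&#?\w+;/g, " ") // HTML entities
    .replace(/[^\u0980-\u09FF\s,;:-]/g, " ") // keep Bangla, spaces, some punctuation
    .replace(/\s+/g, " ") // normalize spaces
    .trim();
}

function sketchSentences(text: string, minWords: number = 4): string[] {
  const kept: string[] = [];
  // Step 1: split by line breaks
  for (const line of text.split(/\r?\n/)) {
    // Step 2: split on danda, double danda, ? and ! (assumed separators)
    for (const piece of line.split(/[\u0964\u0965?!]/)) {
      // Step 3: clean each candidate sentence
      const sentence = cleanSentence(piece);
      // Step 4: non-empty, contains Bangla, and has enough words
      const wordCount = sentence.length === 0 ? 0 : sentence.split(" ").length;
      if (BANGLA_CHAR.test(sentence) && wordCount >= minWords) {
        kept.push(sentence);
      }
    }
  }
  // Step 5: a Set drops duplicates while keeping first-seen order
  return Array.from(new Set(kept));
}
```

Because a `Set` iterates in insertion order, the deduplicated array keeps each sentence at the position where it first appeared.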

Examples

Basic usage with simple Bangla text:

import { tokenizeToSentences } from "@bntk/tokenization";

const text = "আমি বাংলায় গান গাই। তুমি কি শুনবে?";
console.log(tokenizeToSentences(text));
// Output: ["আমি বাংলায় গান গাই", "তুমি কি শুনবে"]

Handling mixed content:

const mixedText =
  "আমি বাংলায় গান গাই। Visit https://example.com or email@example.com";
console.log(tokenizeToSentences(mixedText));
// Output: ["আমি বাংলায় গান গাই"]

Handling text with special characters:

const specialText =
  "বাংলা টেক্সট (ইংরেজি টেক্সট) [বন্ধনী টেক্সট] {কোঁকড়া টেক্সট}";
console.log(tokenizeToSentences(specialText));
// Output: ["বাংলা টেক্সট"]

tokenizeToWords()

function tokenizeToWords(text): string[];

Defined in: word.ts:57

Tokenizes a Bangla text string into an array of words.

Parameters

Parameter | Type | Description
text | string | The input Bangla text to tokenize. Can contain mixed content including punctuation and special characters.

Returns

string[]

An array of cleaned and tokenized words, with empty strings removed.

Description

This function performs the following steps:

  1. Cleans the input text by:
    • Removing non-Bangla characters (keeping only Unicode range: \u0980-\u09FF)
    • Preserving essential punctuation marks (।, ,, ;, :, ', ", ?, !)
    • Preserving hyphens for compound words
  2. Splits the text by whitespace
  3. Further splits each segment by punctuation (excluding hyphens)
  4. Cleans each word by:
    • Removing trailing hyphens
    • Removing Bangla digits from start and end
    • Trimming whitespace
  5. Filters out empty strings
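
As a rough illustration of these steps, the sketch below walks through them in order. It is not the package's source: `sketchWords` is a hypothetical name, and the regular expressions are assumptions derived from the description (Bangla digits are taken to be U+09E6 through U+09EF, and the danda is U+0964).

```typescript
// Illustrative sketch only: the name and regexes are assumptions,
// not the implementation shipped in @bntk/tokenization.
function sketchWords(text: string): string[] {
  // Step 1: keep Bangla characters, whitespace, the listed punctuation, and hyphens
  const cleaned = text.replace(/[^\u0980-\u09FF\s\u0964,;:'"?!-]/g, " ");
  const words: string[] = [];
  // Step 2: split on whitespace
  for (const segment of cleaned.split(/\s+/)) {
    // Step 3: split on punctuation, leaving hyphens inside compound words
    for (const part of segment.split(/[\u0964,;:'"?!]/)) {
      // Step 4: strip trailing hyphens, then Bangla digits at either end
      const word = part
        .replace(/-+$/, "")
        .replace(/^[\u09E6-\u09EF]+|[\u09E6-\u09EF]+$/g, "")
        .trim();
      // Step 5: drop empty strings
      if (word.length > 0) {
        words.push(word);
      }
    }
  }
  return words;
}
```

Splitting on punctuation only after the whitespace split keeps hyphenated compounds intact, since the hyphen is deliberately left out of the punctuation class in step 3.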

Examples

Basic usage with simple Bangla text:

import { tokenizeToWords } from "@bntk/tokenization";

const text = "আমি বাংলায় গান গাই";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই"]

Handling text with punctuation:

const text = "আমি, বাংলায় গান গাই। তুমি কি শুনবে?";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি", "বাংলায়", "গান", "গাই", "তুমি", "কি", "শুনবে"]

Handling compound words with hyphens:

const text = "আমি-তুমি বাংলা-ভাষা শিখছি";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["আমি-তুমি", "বাংলা-ভাষা", "শিখছি"]

Handling text with Bangla digits:

const text = "১টি বই ২টি খাতা";
const words = tokenizeToWords(text);
console.log(words);
// Output: ["টি", "বই", "টি", "খাতা"]