Link Search Menu Expand Document

Datasets

Table of contents

  1. Data Generation Technique
  2. Download the datasets
    1. Indic-to-Indic Dataset
    2. English-to-Indic Dataset

Data Generation Technique

For this work, we curated two datasets for performing the Indic-to-Indic decoding and English-to-Indic decoding tasks across the 7 Indic languages used in the study. The first dataset contains Indic words obtained from the Leipzig Corpora Collection. along with corresponding coordinate sequences depicting the path traced on an Indic character smartphone keyboard to input the word. The second dataset contains Indic-English transliteration pairs scraped from Wikidata along with corresponding coordinate sequences for the path traced on an English character keyboard. We use the principle that human motor control tends to maximise smoothness in movement to develop a synthetic path generation technique that models the gesture typing trajectory as a path that minimises the total jerk.

For simulating the path between two characters, a minimum jerk trajectory is first plotted between the character locations. We then add two types of noise to the path. The first type adds noise to the starting and ending positions of the path. The (x, y) coordinates for these positions are sampled from a 2D Gaussian distribution centred at the middle of the corresponding key on the keyboard. The second type adds noise to the path between the characters. To perform this, we first sample a coordinate pair (x′, y′) from the uniform distribution of points bounded by the x and y coordinates of the starting and ending points of the path. Following this, we sample points on a path of minimum jerk passing through the 3 points uniformly over time. This process is repeated for every pair of adjacent characters in the word to obtain a sequence of coordinates that describes the trace. The figure below shows the trace created for the word budhi using this method. Apart from the (x, y) coordinate pair, we augment the input features for each sampled point with the x and y derivatives at the point and a one-hot vector with value 1 at the index i corresponding to the character on which the point lies on the keyboard.

Download the datasets

Indic-to-Indic Dataset

We provide datafiles containing simulated swipe data for the Indic words used in the study. Each datafile contains Indic words with corresponding numerical sequences which describe the (x,y) coordinate locations over time where a touch is sensed on a touch-based Indic character keyboard. These sequences are generated by the simulation method described above. We also provide the code that is used to generate these datasets. The overall dataset contains keyboard traces for 193,658 words across 7 Indic languages.

LanguageSizeDataset (small size)Dataset (full size)Code for generation
Hindi32415downloaddownloadcode
Tamil24094downloaddownloadcode
Bangla15320downloaddownloadcode
Telugu28996downloaddownloadcode
Kannada26551downloaddownloadcode
Gujrati30024downloaddownloadcode
Malayalam36258downloaddownloadcode

English-to-Indic Dataset

We provide datafiles containing simulated swipe data for the Indic words used in the study. Each datafile contains Indic words, their English transliterations and corresponding numerical sequences which describe the (x,y) coordinate locations over time where a touch is sensed on a touch-based English character keyboard. These sequences are generated by the simulation method described above. The overall dataset contains keyboard traces for 104,412 words across 7 Indic languages. Note that the code links provided above for dataset generation can be used for the English-to-Indic decoding task as well because the swipe data is only generated for Indic words.

LanguageSizeDataset (small size)Dataset (full size)
Hindi29074downloaddownload
Tamil21523downloaddownload
Bangla17332downloaddownload
Telugu12733downloaddownload
Kannada4877downloaddownload
Gujrati8776downloaddownload
Malayalam10097downloaddownload