Datasets

Data Generation Technique
Download the datasets
1. Indic-to-Indic Dataset
2. English-to-Indic Dataset

Data Generation Technique

For this work, we curated two datasets for performing the Indic-to-Indic decoding and English-to-Indic decoding tasks across the 7 Indic languages used in the study. The first dataset contains Indic words obtained from the Leipzig Corpora Collection. along with corresponding coordinate sequences depicting the path traced on an Indic character smartphone keyboard to input the word. The second dataset contains Indic-English transliteration pairs scraped from Wikidata along with corresponding coordinate sequences for the path traced on an English character keyboard. We use the principle that human motor control tends to maximise smoothness in movement to develop a synthetic path generation technique that models the gesture typing trajectory as a path that minimises the total jerk.

For simulating the path between two characters, a minimum jerk trajectory is first plotted between the character locations. We then add two types of noise to the path. The first type adds noise to the starting and ending positions of the path. The (x, y) coordinates for these positions are sampled from a 2D Gaussian distribution centred at the middle of the corresponding key on the keyboard. The second type adds noise to the path between the characters. To perform this, we first sample a coordinate pair (x′, y′) from the uniform distribution of points bounded by the x and y coordinates of the starting and ending points of the path. Following this, we sample points on a path of minimum jerk passing through the 3 points uniformly over time. This process is repeated for every pair of adjacent characters in the word to obtain a sequence of coordinates that describes the trace. The figure below shows the trace created for the word budhi using this method. Apart from the (x, y) coordinate pair, we augment the input features for each sampled point with the x and y derivatives at the point and a one-hot vector with value 1 at the index i corresponding to the character on which the point lies on the keyboard.

Download the datasets

Indic-to-Indic Dataset

We provide datafiles containing simulated swipe data for the Indic words used in the study. Each datafile contains Indic words with corresponding numerical sequences which describe the (x,y) coordinate locations over time where a touch is sensed on a touch-based Indic character keyboard. These sequences are generated by the simulation method described above. We also provide the code that is used to generate these datasets. The overall dataset contains keyboard traces for 193,658 words across 7 Indic languages.

Language	Size	Dataset (small size)	Dataset (full size)	Code for generation
Hindi	32415	download	download	code
Tamil	24094	download	download	code
Bangla	15320	download	download	code
Telugu	28996	download	download	code
Kannada	26551	download	download	code
Gujrati	30024	download	download	code
Malayalam	36258	download	download	code

English-to-Indic Dataset

We provide datafiles containing simulated swipe data for the Indic words used in the study. Each datafile contains Indic words, their English transliterations and corresponding numerical sequences which describe the (x,y) coordinate locations over time where a touch is sensed on a touch-based English character keyboard. These sequences are generated by the simulation method described above. The overall dataset contains keyboard traces for 104,412 words across 7 Indic languages. Note that the code links provided above for dataset generation can be used for the English-to-Indic decoding task as well because the swipe data is only generated for Indic words.

Language	Size	Dataset (small size)	Dataset (full size)
Hindi	29074	download	download
Tamil	21523	download	download
Bangla	17332	download	download
Telugu	12733	download	download
Kannada	4877	download	download
Gujrati	8776	download	download
Malayalam	10097	download	download

Datasets

Table of contents

Data Generation Technique

Download the datasets

Indic-to-Indic Dataset

English-to-Indic Dataset