Computer Networking Research Lab

1. Search Terms - Raw Data

This dataset contains BitTorrent search terms extracted from kickasstorrents.com, which is the only website that provides both search terms and time information.

1.1 Data Format

search_term<tab>unixtime

search_term - Search term as a Unicode string. Search terms can be in any language.
unixtime - Collection time as UNIX time, recorded in Colorado (MDT). For every dataset other than dataset1, to get the actual Colorado time, convert unixtime to GMT and then subtract 7 hours. An online converter is available here.
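
The record format and the time conversion described above can be sketched in Python as follows. This is only an illustrative sketch, not one of the lab's scripts: the function names and the sample record are made up for the example, and the 7-hour offset is taken directly from the description above.

```python
from datetime import datetime, timedelta, timezone

# Offset taken from the text above ("convert to GMT, then subtract 7 hours").
COLORADO_OFFSET = timedelta(hours=7)


def parse_search_term_line(line):
    """Split one 'search_term<tab>unixtime' record into (term, unixtime)."""
    # rpartition splits on the last tab, so a stray tab inside a term is kept.
    term, _, stamp = line.rstrip("\r\n").rpartition("\t")
    return term, int(stamp)


def to_colorado_time(unixtime, offset=COLORADO_OFFSET):
    """Convert unixtime to GMT, then subtract the offset, as described above."""
    gmt = datetime.fromtimestamp(unixtime, tz=timezone.utc)
    return (gmt - offset).replace(tzinfo=None)


# Made-up sample record, not taken from the datasets.
term, stamp = parse_search_term_line("ubuntu 10.04\t1277837580\n")
print(term, stamp, to_colorado_time(stamp))
```

Files should be opened with an explicit encoding (for example encoding="utf-8", which is an assumption about how the archives were written) because search terms can be in any language.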

1.2 Datasets

dataset1.tgz
	Start time: 2010-06-08 18:00 MDT
	End time: 2010-06-27 09:41 MDT
	Duration: 18 days 15 hours 41 minutes
	Sampling interval: 5 seconds
	No. of samples: 320,239
	No. of queries: 9,669,035
	No. of distinct search terms: 1,353,662
dataset2.tgz
	Needs to be extracted. Has some missing samples.
	Start time: 2010-06-29 11:53 MDT
	End time: 2010-07-05
dataset3.tgz
	Still being collected.
	Start time: 2010-07-05 09:00 MDT
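
The dataset1 figures listed above can be cross-checked with a few lines of Python. This is a rough sketch that assumes exactly one sample is taken every 5 seconds, which the description does not state explicitly; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Values copied from the dataset1 entry above (both times are MDT).
start = datetime(2010, 6, 8, 18, 0)
end = datetime(2010, 6, 27, 9, 41)
sampling_interval = timedelta(seconds=5)
reported_samples = 320239
reported_queries = 9669035

duration = end - start
# Assumption: one sample per interval with no gaps.
expected_samples = duration // sampling_interval
queries_per_sample = reported_queries / reported_samples

print("duration:", duration)
print("expected samples:", expected_samples)
print("reported samples:", reported_samples)
print("average queries per sample: %.2f" % queries_per_sample)
```

Any gap between the expected and reported sample counts would be consistent with occasional missed samples, but the description does not say so for dataset1.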

2. Search Clouds - Raw Data

This dataset contains BitTorrent search clouds from several websites. Some sites indicate only the font size and color (the color indicates the type of content), while 2 sites also provide the number of search requests for each search term. However, the data collection time period is unknown.

2.1 Data Format

Line 1 - unixtime
Rest of the lines - search_term<tab>font_size<tab>no_of_queries

unixtime - Collection time as UNIX time, recorded in Colorado (MDT). To get the actual Colorado time, convert unixtime to GMT and then subtract 7 hours. An online converter is available here.
search_term - Search term as a Unicode string. Search terms can be in any language.
font_size - Font size, given either as an absolute value or as a percentage (%).
no_of_queries - Number of queries for each term. Only some of the datasets include this field.
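
A search cloud file in the format above could be read roughly as shown below. This is a sketch under stated assumptions: the function name and the sample text are hypothetical, font_size is kept as a string because it may be either an absolute value or a percentage, and a missing no_of_queries field is returned as None.

```python
def parse_search_cloud(text):
    """Parse one search cloud: first line is unixtime, the rest are terms."""
    # Split on "\n" only, so unusual Unicode characters in terms are untouched.
    lines = text.split("\n")
    unixtime = int(lines[0].strip())
    entries = []
    for line in lines[1:]:
        line = line.rstrip("\r")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        term = fields[0]
        font_size = fields[1] if len(fields) > 1 else None
        # no_of_queries is present only in some of the datasets.
        has_queries = len(fields) > 2 and fields[2].strip()
        queries = int(fields[2]) if has_queries else None
        entries.append((term, font_size, queries))
    return unixtime, entries


# Made-up sample content, not taken from the datasets.
sample = "1277843426\nubuntu\t150%\t1200\nmovie\t12\n"
collected_at, cloud = parse_search_cloud(sample)
print(collected_at, cloud)
```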

2.2 Datasets

extratorrent.com	extratorrent_dataset1.tgz	Start time: 2010-06-29 16:30:26 EDT
fenopy.com	fenopy_dataset1.tgz
seedpeer.com	seedpeer_dataset1.tgz
tapedown.com	tapedown_dataset1.tgz
torrentbit.net	torrentbit_dataset1.tgz
torrentscan.com	torrentscan_dataset1.tgz
torrentsection.com	torrentsection_dataset1.tgz
torrentreactor.to	torrenttractor_dataset1.tgz
yourbittorrent.com	youbittorrent_dataset1.tgz

3. Scripts to Extract/Process

3.1 For Search Terms

buildHistogram.py - Extracts keywords from the dataset and puts them into a histogram based on common search terms and/or the time at which the searches were issued.
buildHistogramKeyword.py - Generates a histogram of the query rate per hour/minute for a given search term. Note: this code needs to be updated.
sampleKeywords.py - Samples keywords from the keyword histogram so that the Zipf distribution can be plotted, since more than 1 million points cannot be plotted directly.
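
The general idea behind these scripts can be illustrated with the sketch below. It is not the code of buildHistogram.py, buildHistogramKeyword.py, or sampleKeywords.py; the function names are hypothetical, each input line is assumed to count as one query, and the log-spaced rank sampling is just one possible way to thin out a rank-frequency curve before plotting.

```python
from collections import Counter


def build_term_histogram(lines):
    """Count occurrences of each term in 'search_term<tab>unixtime' lines."""
    counts = Counter()
    for line in lines:
        term, sep, _stamp = line.rstrip("\r\n").rpartition("\t")
        if sep:  # ignore lines without a tab separator
            counts[term] += 1
    return counts


def hourly_query_counts(lines, wanted_term):
    """Count queries for one term per hour bucket (unixtime // 3600)."""
    buckets = Counter()
    for line in lines:
        term, sep, stamp = line.rstrip("\r\n").rpartition("\t")
        if sep and term == wanted_term:
            buckets[int(stamp) // 3600] += 1
    return buckets


def sample_rank_frequency(counts, points_per_decade=10):
    """Return (rank, frequency) pairs at roughly log-spaced ranks."""
    freqs = sorted(counts.values(), reverse=True)
    ranks = set()
    rank = 1.0
    step = 10 ** (1.0 / points_per_decade)
    while rank <= len(freqs):
        ranks.add(int(rank))
        rank *= step
    if freqs:
        ranks.add(len(freqs))  # always keep the last rank (the tail)
    return [(r, freqs[r - 1]) for r in sorted(ranks)]
```

Plotting the sampled pairs on log-log axes keeps the shape of the rank-frequency curve while reducing more than a million points to a few dozen.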

3.2 For Search Clouds

compare.py - Compares 2 search clouds to see how many search terms they have in common.
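
A comparison of this kind can be sketched with set operations. This is not the code of compare.py; the function name is hypothetical, and normalizing terms by trimming whitespace and ignoring case is an assumption made for the example.

```python
def common_search_terms(cloud_a, cloud_b):
    """Return the shared terms of two clouds and their Jaccard similarity."""
    # Assumption: terms match after trimming whitespace and case folding.
    terms_a = {term.strip().casefold() for term in cloud_a}
    terms_b = {term.strip().casefold() for term in cloud_b}
    common = terms_a & terms_b
    union = terms_a | terms_b
    jaccard = len(common) / len(union) if union else 0.0
    return common, jaccard


# Made-up term lists, not taken from the datasets.
shared, similarity = common_search_terms(
    ["Ubuntu", "movie", "lost"],
    ["ubuntu", "Lost ", "music"],
)
print(sorted(shared), similarity)
```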

Computer Networking Research Laboratory

Dept. of Electrical & Computer Engineering, Colorado State University