1. Search Terms - Raw Data
This dataset contains BitTorrent search terms extracted from kickasstorrents.com, which is the only website that provides both search terms and time information.
1.1 Data Format
search_term<tab>unixtime
search_term
- Search term as a Unicode string. Search terms can be given in any language.
unixtime
- Time represented as UNIX time, based on when the data was collected in Colorado (MDT). Other than for dataset1, to get the actual Colorado time, convert unixtime to GMT, then subtract 7:00 hours. Online converter is available here.
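A minimal sketch of reading one record in the format above and applying the conversion rule as stated. The function names `parse_search_line` and `to_colorado_time` are hypothetical and are not part of the scripts listed in this document. It assumes that timestamps are integer seconds. The 7-hour offset is taken from the text as-is; because MDT is nominally UTC-6, the offset is left as a parameter.

```python
from datetime import datetime, timedelta, timezone


def parse_search_line(line):
    """Split one 'search_term<tab>unixtime' record into (term, unixtime)."""
    # Split on the LAST tab so a tab inside the term does not break parsing.
    term, unixtime = line.rstrip("\r\n").rsplit("\t", 1)
    return term, int(unixtime)  # assumption: integer seconds


def to_colorado_time(unixtime, offset_hours=7):
    """Convert unixtime to GMT, then subtract offset_hours (7 per the text)."""
    gmt = datetime.fromtimestamp(unixtime, tz=timezone.utc)
    # Drop the tzinfo so the result reads as a plain local wall-clock time.
    return (gmt - timedelta(hours=offset_hours)).replace(tzinfo=None)
```

For example, `parse_search_line("ubuntu 10.04\t1277840000\n")` yields the term and its timestamp as a tuple, which can then be passed to `to_colorado_time`. As the text notes, this conversion does not apply to dataset1.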
1.2 Datasets
dataset1.tgz | Start time: 2010-06-08 18:00 MDT; End time: 2010-06-27 09:41 MDT; Duration: 18 days 15 hours 41 minutes; Sampling interval: 5 seconds; No. of samples: 320,239; No. of queries: 9,669,035; No. of distinct search terms: 1,353,662 |
dataset2.tgz | Need to extract. Has some missing samples. Start time: 2010-06-29 11:53 MDT; End time: 2010-07-05 |
dataset3.tgz | Still collecting. Start time: 2010-07-05 09:00 MDT |
2. Search Clouds - Raw Data
This dataset contains BitTorrent search clouds from several websites. Some sites indicate only the font size and color (the color indicates the type of content), while 2 sites also provide the number of search requests for each search term. However, the data collection time period is unknown.
2.1 Data Format
Line 1 - unixtime
Rest of the lines - search_term<tab>font_size<tab>no_of_queries
unixtime
- Time represented as UNIX time, based on when the data was collected in Colorado (MDT). To get the actual Colorado time, convert unixtime to GMT, then subtract 7:00 hours. Online converter is available here.
search_term
- Search term as a Unicode string. Files can be searched in any language.
font_size
- Font size given either as an absolute value or as a percentage (%).
no_of_queries
- No. of queries for each term. Only some of the datasets have this.
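The file layout above could be read along the following lines. This is a sketch under stated assumptions, not one of the scripts listed in this document: `parse_search_cloud` is a hypothetical name, the timestamp is assumed to be integer seconds, and `font_size` is kept as a string because it may be either an absolute value or a percentage.

```python
def parse_search_cloud(lines):
    """Parse a search cloud: line 1 is unixtime, the rest are tab-separated rows.

    Returns (unixtime, [(search_term, font_size, no_of_queries), ...]).
    font_size stays a string (e.g. "14" or "150%"); no_of_queries is None
    when the site does not provide it.
    """
    lines = [ln.rstrip("\r\n") for ln in lines]
    collected_at = int(lines[0])  # assumption: integer seconds
    entries = []
    for ln in lines[1:]:
        if not ln:
            continue  # skip blank lines
        fields = ln.split("\t")
        term = fields[0]
        font_size = fields[1] if len(fields) > 1 else None
        has_count = len(fields) > 2 and fields[2] != ""
        no_of_queries = int(fields[2]) if has_count else None
        entries.append((term, font_size, no_of_queries))
    return collected_at, entries
```

The function accepts any iterable of lines, so an open file object can be passed directly.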
2.2 Datasets
extratorrent.com | extratorrent_dataset1.tgz | Start time: 2010-06-29 16:30:26 EDT |
fenopy.com | fenopy_dataset1.tgz | |
seedpeer.com | seedpeer_dataset1.tgz | |
tapedown.com | tapedown_dataset1.tgz | |
torrentbit.net | torrentbit_dataset1.tgz | |
torrentscan.com | torrentscan_dataset1.tgz | |
torrentsection.com | torrentsection_dataset1.tgz | |
torrentreactor.to | torrenttractor_dataset1.tgz | |
yourbittorrent.com | youbittorrent_dataset1.tgz |
3. Scripts to Extract/Process
3.1 For Search Terms
- buildHistogram.py - Extract keywords from the dataset and put them into a histogram based on common search terms and/or the time at which the searches were issued.
- buildHistogramKeyword.py - Generate the histogram of query rate for each hour/minute for a given search term. !!! Code needs to be updated.
- sampleKeywords.py - Sample keywords from the keyword histogram to plot the Zipf distribution, as 1+ million points can't be plotted.
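The three steps described above can be illustrated with a short sketch. It is not the code of the listed scripts, whose internals are not shown here. All function names are hypothetical, the input is assumed to be a list of `(search_term, unixtime)` tuples, and log-spaced rank sampling is one possible way to thin a rank-frequency curve, chosen here as an assumption.

```python
import math
from collections import Counter


def term_histogram(records):
    """Count how often each search term occurs in (term, unixtime) records."""
    return Counter(term for term, _ in records)


def hourly_rate(records, keyword):
    """Count queries for one keyword per hour bucket (unixtime // 3600)."""
    return Counter(ts // 3600 for term, ts in records if term == keyword)


def sample_rank_frequency(histogram, points_per_decade=10):
    """Return (rank, count) pairs at roughly log-spaced ranks.

    This keeps a Zipf (rank vs. frequency) plot readable when the
    histogram has too many distinct terms to plot every point.
    """
    counts = sorted(histogram.values(), reverse=True)
    if not counts:
        return []
    steps = int(math.log10(len(counts)) * points_per_decade) + 1
    ranks = {len(counts)}  # always keep the last rank
    for i in range(steps + 1):
        rank = int(round(10 ** (i / points_per_decade)))
        if rank <= len(counts):
            ranks.add(rank)
    return [(r, counts[r - 1]) for r in sorted(ranks)]
```

With 10 points per decade, a histogram of about 1.3 million distinct terms reduces to a few dozen points, which is enough to show the slope on a log-log plot.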
3.2 For Search Clouds
- compare.py - Compare 2 search clouds to see how many search terms are common between them.
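A comparison of this kind can be sketched as a set intersection. This is an illustration only and not the code of compare.py. The name `common_terms` is hypothetical, and matching terms case-insensitively after trimming whitespace is an assumption.

```python
def common_terms(cloud_a, cloud_b):
    """Return the set of search terms that appear in both clouds.

    Terms are normalized by trimming whitespace and lowercasing
    (an assumption; exact matching may be preferred for some data).
    """
    def normalize(term):
        return term.strip().lower()

    terms_a = {normalize(t) for t in cloud_a}
    terms_b = {normalize(t) for t in cloud_b}
    return terms_a & terms_b
```

The number of common terms is then `len(common_terms(a, b))`, where `a` and `b` are the term lists of the two clouds.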