In our second blog post, we present our research findings on how different countries regulate and view the Right to be Forgotten (RTBF). We also give some details on the data pre-processing behind the RTBF-bot we plan to deliver.
Since the concept of the RTBF was put forward, countries worldwide have held different attitudes towards it. The EU countries, Australia and Japan are among its supporters. The EU, as the initiator of the RTBF, has developed practical laws to support the concept. More concretely, in 1995 the EU introduced the Data Protection Directive (95/46/EC), which provides a legal basis for data protection. These rules have applied for more than 20 years, which has given the EU considerable knowledge and experience in data protection. In the majority of the cases examined, EU court decisions have helped to uphold the right to privacy.
Australia has also developed its national legal system in the direction of the RTBF. More precisely, Australia has introduced a very similar right under the name “Right To Be Deleted”. The difference is that this right only allows people to delete information they uploaded themselves. After the EU formally recognized the RTBF, a heated debate on the same issue started in Japan, where similar lawsuits have often been filed. Decisions by Japanese local courts in such cases suggest support for the RTBF.
The opposing camp is represented mainly by the USA and the UK. In the US, the First Amendment gives freedom of expression very strong protection. Since the RTBF conflicts with freedom of expression, it has little chance of surviving in the US. Although it is an EU member, the UK is strongly opposed to the RTBF. In July 2014, the British House of Lords stated that the RTBF is a misreading of the EU data protection regulation.
So far, most countries have taken a neutral stance towards the RTBF, although they are already aware of the importance of data protection. Finally, some countries and regions, such as Argentina and Hong Kong, have taken steps in the direction of the RTBF and are trying to adjust their legal systems so that they can support cases related to personal data protection and deletion in the future.
After collecting the desired tweets via Twitter’s Search API, we need to extract their contents. Since tweets are very noisy, they need a lot of pre-processing. We therefore perform the following steps before computing the polarity of each tweet:
- Tokenization: Given a character sequence, tokenization is the task of separating it into pieces, called tokens. In case the aforementioned sequence is a sentence, the resulting tokens are words.
- Stemming: Often we want to count the inflected forms of a word together. This procedure is referred to as stemming. For example, stemming an English text treats the words “forest”, “forests”, “forested”, “forest’s” and “forests’” as instances of the word “forest”. Stemming reduces the number of unique vocabulary items that need to be tracked and consequently speeds up a variety of computational operations.
- Replacing emoticons: Emoticons play an important role in determining the polarity of a tweet. For that reason, we replace each emoticon with a description that best conveys its polarity. For example, 😁 is replaced by “grinning face with smiling eyes”.
- Removing URLs from tweet text: All URLs posted in tweets are shortened by the t.co service. We extract each URL from the tweet, expand it to its original form, and keep it so that we can later perform sentiment analysis on the text of the linked web page.
- Removing target mentions: Target mentions in a tweet, written with ‘@’, are usually the Twitter handles of people or organisations. This information is not needed to determine the sentiment of the tweet, so the mentions are removed.
- Removing numbers: Numbers aren’t important when measuring sentiment.
- Removing stop words: Stop words do not carry any sentiment information, so they are removed. Examples are the words ‘the’, ‘of’, ‘and’, ‘to’, etc.
- Handling negative mentions: Negation plays an important role in determining the sentiment of each tweet. To handle it, we create our own negation word list that includes words like ‘not’, ‘no’, ‘nothing’, etc. When a word from the list appears in a tweet, we look for the word associated with that negation word and change its polarity accordingly.
- Expanding acronyms: We expand acronyms to their full meaning in each tweet, if any are present, using our dictionary for acronyms.
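
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions, not the code of our bot: the lookup tables (`EMOTICONS`, `ACRONYMS`, `STOP_WORDS`, `NEGATIONS`) are tiny hypothetical stand-ins for the real dictionaries, `toy_stem` is a crude suffix stripper standing in for a proper stemmer, and negation is handled by prefixing the next kept token with `NOT_`, which is one common convention rather than necessarily the one we will end up using. Expanding t.co links to their original form needs a network request and is left out; the URLs are simply set aside.

```python
import re

# Tiny, hypothetical lookup tables; real ones would be much larger.
EMOTICONS = {"😁": "grinning face with smiling eyes", ":)": "smiling face", ":(": "sad face"}
ACRONYMS = {"rtbf": "right to be forgotten", "imo": "in my opinion"}
STOP_WORDS = {"the", "of", "and", "to", "a", "is", "be", "for", "in"}
NEGATIONS = {"not", "no", "nothing", "never"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def toy_stem(token):
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("'s", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Return (tokens, urls) for one raw tweet."""
    # Set URLs aside, then drop them and the @mentions from the text.
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)

    # Replace emoticons by their textual description.
    for emoticon, description in EMOTICONS.items():
        text = text.replace(emoticon, f" {description} ")

    # Remove numbers, then tokenize the lower-cased text into words.
    text = NUMBER_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())

    # Expand acronyms into their constituent words.
    expanded = []
    for token in tokens:
        expanded.extend(ACRONYMS.get(token, token).split())

    # Handle negation, drop stop words, and stem what remains.
    result = []
    negate = False
    for token in expanded:
        if token in NEGATIONS:
            negate = True          # flag the next kept token as negated
            continue
        if token in STOP_WORDS:
            continue
        stem = toy_stem(token)
        result.append("NOT_" + stem if negate else stem)
        negate = False
    return result, urls


if __name__ == "__main__":
    tokens, urls = preprocess(
        "@EU_Commission The RTBF is not good for forests :( 2014 https://t.co/abc123"
    )
    print(tokens)
    print(urls)
```

Note that the negation words are deliberately kept out of the stop-word list here; otherwise they would be discarded before the negation step could see them.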
Giannakopoulos Athanasios, Kyritsis Georgios, Zhang Fuzhi, Zhong Hua