Indico.io API + text analysis of definitions: privacy, surveillance, interrogation

This was an interesting introduction to thinking a bit more deeply about how algorithms are engineered to learn sentiment and meaning.  The small CSV data set was a collection of definitions of 'privacy', 'surveillance', and 'interrogation'.

I analyzed each definition as a line using the Indico API for "political analysis" and also "sentiment."  The results seem pretty arbitrary, but I suppose the models gather their information from tags and things like that. A point to explore further: the internet is quite two-dimensional as a way to gather understanding of complex human sentiment, issues, and tropes with a spatio-temporal history.  If AIs are the children of humanity, is that any way to teach a child? Come now.

It is fun, though, to play with the browser/GUI version and run words like "God" or "visage" or "face" or "help" (100% positive sentiment) or "hurt" (1% positive sentiment): https://indico.io

I reused some ngram examples I had worked with a little in Python, from the Spring 2015 Python-based course Reading and Writing Electronic Text.  I dabbled a bit in using the API to analyze each definition (one row of the CSV) as a line.  Then I also analyzed each individual word that showed up more than twice in the text.
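A minimal sketch of that workflow, not the original script: it reads definitions from a CSV, finds the words that appear more than twice, and passes both the lines and the frequent words to a scorer. The sample rows are placeholders rather than the real data set, and `score_text` is a hypothetical stand-in for the remote sentiment or political-analysis call.

```python
import csv
import io
import re
from collections import Counter

# Placeholder rows standing in for the CSV of definitions (not the real data set).
SAMPLE_CSV = """term,definition
privacy,the state of being free from being observed by other people
surveillance,close observation of a person or group of people
interrogation,the act of questioning a person closely
"""


def score_text(text):
    """Hypothetical stand-in for a remote analysis call; returns a dummy score."""
    # Assumption: a real scorer would send `text` to an API and return its result.
    return 0.5


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequent_words(lines, min_count=3):
    """Return words appearing at least `min_count` times, i.e. more than twice."""
    counts = Counter(word for line in lines for word in tokenize(line))
    return {word: n for word, n in counts.items() if n >= min_count}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
definitions = [row["definition"] for row in rows]

# Score every definition as one line, then every frequent word on its own.
line_scores = [(d, score_text(d)) for d in definitions]
word_scores = {w: score_text(w) for w in frequent_words(definitions)}
```

With a small data set, most of the words that clear the "more than twice" threshold tend to be function words like "of", so filtering out stop words first may give more meaningful per-word scores.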

  

FUTURE OF NEW MEDIA : FUTURE OF DATA COLLECTION & IMPLICATIONS FOR PRIVACY

Brief exploratory research paper for Art Kleiner's Future of New Media Seminar, March 9, 2015

Meticulous archiving and record-keeping are the mark of economically powerful and innovative societies through time.  Since its invention by far-flung ancient civilizations (the Romans, the Mesopotamians, the Shang dynasty) and on through modern European colonial powers, archiving has, as UNESCO puts it, historically served practical needs:

    “to prove your right to the possession of a certain piece of land you needed title deeds; to determine the size of population being governed and therefore the taxes that should be collected required records of birth and death; to enforce government laws and regulations it was necessary to keep a record of the laws, decrees and edicts. The keeping of records and archives was therefore not a luxury but a necessity on which depended one's ability to continue to rule and to have rights and privileges. The records and archives were also preserved in order to prove the rights and privileges of those who were being governed.”

 

The industries of archiving and its lesser, traditionally less permanent sibling, record-keeping and collecting, therefore seem to have a heavy hand in what rights a being has to different types of resources and self-control, as well as control over others and rights to resources over others. Records supersede, subjectively and temporally, the physical or immediate social evidence of these rights for an individual existing in a society or societies. Record-keeping is a pillar of whatever socio-economic system we currently exist within.  An important place to start when investigating trends in widespread technological innovation is at the top levels of research and development, because of a trickle-down effect. This paper is a superficial survey of the increasingly massive data collections we have been living with, and it seeks to address what implications those might have for the individual and for entities like Facebook and Google.

 

    Looking into a very different space from personal user data as Big Data leads to a summary of a 2014 Big Data in Materials Science workshop held by NASA, DoD, DHS, DARPA and other top scientific and technological bodies, where a number of telling priorities came up:  “How do we store, capture, and transmit data from extreme environments? How do we triage massive data for archiving? How can we use advanced data science methods to systematically derive scientific inferences from massive, distributed science measurements and models?”  The collaboration between these (perhaps) amply-funded research bodies is largely focused on the problems of increasingly massive amounts of data and the need to analyze that data in a less centralized fashion, rather than constantly sending it to a server somewhere. Data from places such as a space station would be algorithmically analyzed in situ because “onboard data collection systems are likely to reach a capacity constraint in the near future, which will force a change in system implementation.” This raises interesting questions about whether these technological and physical considerations might shift to big data across consumer markets and industries. Daniel Crichton of the Jet Propulsion Lab pointed out that a system developed for data analysis for NASA or the Air Force “could be used to study Twitter patterns for security purposes, for example.” He also predicted that distributing data would become an obsolete approach, with things moving toward a service-based approach of analyzing and reducing data to relevant subsets.
Other discussions at the 2014 workshop centered on “how to systematize massive data analysis and increase efficiencies…aided by international cooperation and coordination.” The Department of Homeland Security presented a model for an automatic data-capturing project in biosecurity known as algorithms for analysis, an algorithm-oriented project to identify emerging technologies that could be used against our nation. It uses natural-language processing software to find descriptors in scientific literature and documentation.  It seems that, at this level of sophistication, these algorithms could also be used to detect emerging social trends and give private and public sector leaders a heads-up on how to handle situations before they materialize.  These trends in the industrial and scientific fields point to increasing anticipation of ever larger data loads, which Moore’s law is effectively yielding in our shared world.

 

    As of 2012 there were 6 billion mobile phone subscriptions worldwide.  “Even the simplest phone leaves evidence of its owner’s location every time it pings a communication tower.” (Reality Mining, MIT, Eagle & Greene) Data from mobile phones provide “insight into when and where people move from one location to another - information that can be critical when developing models of the spread of diseases such as malaria and flu.”  The authors take the view that using personal data can make people’s lives easier and healthier, ignoring the darker implications of this level of personal monitoring by private and public entities.  The MIT-based authors of Reality Mining posit that tracked “changes in movement and conversation patterns captured by a phone with the appropriate sensors and software can indicate the onset of illnesses such as depression or Parkinson’s disease earlier than other medical tests.”   This is a Pollyannaish, narrow view of the way this kind of inferential information will be used.

 

    Another MIT experiment, begun in 2004, interestingly preceded the PRISM revelations. It used the Nokia 6600 phone, which allowed remote software updates and was capable of logging details such as when an individual’s phone battery was being charged.  The experiment also included a component called Bluedar or BlueAware, installable remotely over General Packet Radio Service, which scanned for Bluetooth devices near a user’s phone.  “Findings indicated that a person’s location, proximity to other people, call logs, and phone activity at the beginning of the day often indicate only a small possibility of behaviors later in the evening.”  Another component of the long-running experiment involved SoundSense, which used the microphone of a person’s mobile phone to infer location and activity information that could be used to provide simple status updates, “such as whether a person is in a coffee shop, taking a walk outside, or brushing his or her hair.” On a related note, a May 2014 article in the International Business review (“Electronic Surveillance Experts React To Smartphone Mic Data Collection”) reported that “Though Facebook guaranteed users that ‘no sound is stored’ by the new opt-in feature, the social media giant confirmed that ‘data is saved, but all data is anonymized and aggregated.’” There was “no indication” in the article or in Facebook’s official commentary “on how that data would be used once it was gleaned.”  Another company, LocAid, was recently in the popular news as a service that lets advertisers and other firms track a user’s location and connect it to an identifier (phone number) via cell towers, without smartphone GPS settings being turned ‘on.’  Verizon, a company with roughly 123 million paid-subscription users, employs “supercookies”, a type of identifier, to let any website know who you are when you visit over Verizon’s cellular network. It is not anonymous.
Google’s AdID and iPhone’s iAd are “anonymous” universal identifiers that work around the limitations of mobile internet tracking compared with the traditional first- and third-party cookies of desktop internet activity.

 

    There are at least 140 Google Analytics Technology Partners, listed on the Google Analytics site, who provide data analysis services for desktop- and mobile-based business partners.  A telling way to learn about the goals, if not the current legal and technological means, of the vast information collection techniques being used by the private-sector information dominator Google is to read their Google Analytics website copy:

 

    “Know your audience: No two people think exactly alike. Google Analytics helps you analyze visitor traffic and paint a complete picture of your audience and their needs, wherever they are along the path to purchase.”

 

    “Trace the customer path: Where customers are can be as critical as who they are. Tools like Traffic Sources and Visitor Flow help you track the routes people take to reach you, and the devices they use to get there, so you can meet them where they are and improve the visitor experience.”

 

    “See what they're up to: Do some types of people give you better results? You'll find out with tools like In-Page Analytics, which lets you make a visual assessment of how visitors interact with your pages. Learn what they're looking for and what they like, then tailor all your marketing activities - from your site to your apps to your ad campaigns - for maximum impact.”

 

    Google and the Culture of Search (Hillis, Petit, and Jarrett) suggests people have “a notion of cyberspace which is omniscient and infinite.” That is, Google could be a symbolic stand-in for the authority of science, technology, and God. The presumed objectivity of that authority is problematic, since the intense development of personalized algorithms based on user data collection creates a positive feedback loop that generalizes search result relevance, despite Larry Page’s (disturbing) comment that “‘search’ will be included in people’s minds one day.” It is disturbing in light of the fact that Google not only parses information we want instantaneous access to, but also scrapes and auto-requests information about our habits, thoughts, and curiosities to change how that information is algorithmically served back to us.  What happens when that information becomes so dense and comprehensive that there is a marked conflict of interest between Google’s corporate business partners and the user’s concept of Google’s utility, which is impartial, objective, “trustworthy” search returns?  Halavais (2009) notes “that there has been a loss in serendipity” when search engines determine the relevance of retrieved information on the user’s behalf, especially when that relevance is paid for by advertising money.   Thus Google’s PageRank algorithms risk creating an over-generalizing, reductive cycle that becomes useless to users and potentially to advertisers, and it is imaginable that this could have an intentional or unintentional impact on culture at large.

 

    Google’s public privacy policy informs the user that the company collects and uses information about the “services you use, how you use them, and how you view and interact with ads and content.”  This personally-linked data also includes: account holder phone number, IP address, account-identifying cookies, search queries, telephone log information like calling-party numbers, time, date, duration and types of calls as well as SMS routing information, real-time actual GPS/IP location information, as well as “various technologies” which provide Google with information on nearby devices and information about installed software or services which periodically contact Google’s servers.   These data sets match up precisely with PRISM’s specifications for the types of information they received from US tech companies (including Google, Facebook, Microsoft, Yahoo, AT&T, Apple, etc.). In addition, Google has access to locally stored browser storage and app data caches.  A disturbing sub-category on Google’s list includes “pixel tags” (aka “clear GIF”, “JavaScript tag”, “tracking pixel” or “1x1 gif”), which they describe as “a type of technology placed on a website you visit or when in the body of an email for the purpose of tracking activity…on an opened or accessed email or site.”  A tracking pixel sits in the HTML of a served page; when the page loads, the image request goes back to the pixel’s originating server, sending information about the interaction without your express consent.
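To make the pixel mechanism concrete, here is a minimal Python sketch, not Google's implementation. The domain `tracker.example`, the parameter names `uid` and `page`, and both function names are made up for illustration. It shows that the image URL itself carries the identifying data, and that the server learns more from the headers that accompany any image request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: "tracker.example" and the parameter names are invented for this sketch.
PIXEL_ENDPOINT = "https://tracker.example/pixel.gif"


def pixel_tag(user_id, page):
    """Build the invisible 1x1 <img> tag that would be embedded in a page or email."""
    query = urlencode({"uid": user_id, "page": page})
    return f'<img src="{PIXEL_ENDPOINT}?{query}" width="1" height="1" alt="">'


def log_pixel_request(request_url, headers):
    """Show what the originating server can learn from the image request alone."""
    params = parse_qs(urlsplit(request_url).query)
    return {
        "uid": params.get("uid", [None])[0],
        "page": params.get("page", [None])[0],
        "user_agent": headers.get("User-Agent"),
        "referer": headers.get("Referer"),
    }


# Simulate a mail client or browser fetching the embedded image.
tag = pixel_tag("abc123", "/newsletter/march")
request_url = tag.split('src="')[1].split('"')[0]
record = log_pixel_request(request_url, {"User-Agent": "ExampleBrowser/1.0"})
```

The reader never clicks anything: rendering the page or opening the email is enough to trigger the request, which is why blocking remote images in an email client defeats this kind of tracking.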

 

    Google’s treatment of sensitive categories such as “race, religion, sexual orientation, or health” is minimal and simply says that they don’t tailor ads to these categories.  However, it does not say explicitly whether Google refrains from actively collecting this type of sensitive personal information about their users.  Notably, Google claims “we will ask for your consent before using information for a purpose other than those that are set out in this Privacy Policy.”  Collection of this information is taken for granted. A distinction is made between sharing “personal information” with outside parties and sharing “sensitive personal information”; for the latter they require “opt-in” consent.  Google reserves the right to share the above information from your Google account in response to “enforceable governmental” or legal requests.  Personal user information is also provided to “trusted business” for “external processing” outside Google, but ostensibly within user instructions and in accordance with the Privacy Policy agreement.  They reserve the right to share your personal information with companies, organizations, or individuals outside of Google if they believe it is “reasonably necessary to…protect against harm to the rights, property or safety of Google.”  Google’s infrastructure is also able to store voice “utterances” through Voice Search in concert with personal account ID information, although they claim not to unless you explicitly opt in.  Google’s facial recognition software, implemented in the Find My Face feature, can compare “known faces against a new face and see if there is a probable match or similarity.”

 

    An interesting development to note alongside Google’s data dominance over the personal realm is Facebook’s logging of posts that a Facebook user drafts but does not publish (with the advent of HTML5 and Ajax). A 2013 TIME magazine article (Grossman, Dec. 2013) called these “unpublished thoughts”; Facebook calls the behavior “self-censorship.”  The direct link in Google’s search results to the original study and reportage by Slate Magazine comes up with a “Not Found” page.

 

    Within a few years of the large-scale commercialization of the internet, Microsoft, Google, Facebook, Apple, Yahoo, AOL, PalTalk and Dropbox cooperated with the NSA, as was publicly revealed in 2013. If one assumes that any and all user information is being collected, or that infrastructure is engineered to be able to collect data at any point in time, PRISM can be used as another measure of what that information is. A screen grab from the once ‘top secret’ PRISM website collection details page lists “email, chat (both video and voice), videos, photos, stored data, VoIP, file transfers, video conferencing, notifications of [a targeted user’s] activity such as logins, online social networking details,” as well as any “special requests.”  Essentially every form of communication and activity on the internet by users via one of the above platforms or corporate properties was potentially made available to the United States government, and it must be assumed that the potential is there for the continuation and future honing of this practice for a multitude of ends.  The question that remains is one of innovation in application and legislation.