
Syllabus

chris wiggins edited this page Mar 7, 2020 · 95 revisions

Data: Past, Present, and Future

Instructors

Matthew L. Jones (A&S) and Chris Wiggins (SEAS)

TA/graders: TBD

Course description

Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students both to critical thinking and practice in understanding how we got here, and the future we now are building together as scholars, scientists, and citizens.

The intellectual content of the class will comprise

  • the history of human use of data;

  • functional literacy in how data are used to reveal insight and support decisions;

  • critical literacy in investigating how data and data-powered algorithms shape, constrain, and manipulate our commercial, civic, and personal transactions and experiences; and

  • rhetorical literacy in how exploration and analysis of data have become part of our logic and rhetoric of communication and persuasion, especially including visual rhetoric.

While introducing students to recent trends in the computational exploration of data, the course will survey the key concepts of "small data" statistics.

Requirements

All students will be required to:

  • participate in laboratory hours including posting to Slack as assigned during class (10%)
  • respond to readings each week on Slack (15%)
  • write one 750-word op-ed on the ethics and practice of using data by midterm (15%)

Students will be assigned, typically based on their major, to one of two tracks. Students with less technical majors will do more technical work, including problem sets; students with more technical backgrounds will do more humanistic work, including longer writing assignments. Students for whom there is ambiguity are encouraged to clarify with instructors and TAs before the first assignment is due.

a) more technical background track (60%)

  • pursue a semester-long project culminating in a 15-page paper and any associated code

  • complete 3 problem sets

b) more humanistic background track (60%)

  • write a 10-page paper on a topic of their choice

  • complete 5 problem sets; these will involve both computational and written work

Syllabus

Tentative and subject to change

Lecture 1 : intro to course

(See Slides)

Lecture 2 : setting the stakes

  1. Hanna Wallach (2014, December). Big data, machine learning, and the social sciences: Fairness, accountability, and transparency. In NeurIPS Workshop on Fairness, Accountability, and Transparency in Machine Learning. Available via https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d . Dr. Wallach ( http://dirichlet.net/about/ ) is a former CS professor now working in NYC at Microsoft Research. She's been a leader both in machine learning research and in the emerging discipline of computational social science. This piece is an early example of technologists beginning to question data and propose a new research field.

  2. danah boyd and Kate Crawford. "Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon." Information, communication & society 15, no. 5 (2012): 662-679. Available via https://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878 .

  3. 14 very readable pages from 2014 by Zeynep Tufekci on tech & politics. This will wrap up our "setting the stakes" readings on our data-driven present.

Tufekci, Zeynep. "Engineering the public: Big data, surveillance and computational politics." First Monday 19, no. 7 (2014).

Lecture 3 : risk and social physics

  1. Desrosières, Alain. "Correlation and the Realism of Causes," in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998: Excerpt on 'statistics' vs 'vulgar statistics', pp19-23 (top of 23)

I have to warn you: Desrosières is not messing around. The book is now a standard in the history of statistics, but it's pretty scholarly, i.e., dense. It's so good, though, that we had to assign it, or at least the "Vulgar Statistics" part and the Galton part (next week).

  2. Gigerenzer, Gerd, ed. The Empire of Chance: How Probability Changed Science and Everyday Life. Ideas in Context. Cambridge: Cambridge University Press, 1989, Section 1.6 ("Risk and Insurance"). 8 very quick-moving and readable pages from a very quick-moving and readable book.

  3. Porter, Theodore. The Rise of Statistical Thinking, 1820-1900 (Princeton, N.J.: Princeton University Press, 1986), chap. 2 (40-70) + 100-109. Porter (one of the authors of The Empire of Chance) with lots of context around Quetelet's role in shaping our thinking about data, people, and policy.

  4. Quetelet, Adolphe. "Preface" and "Introductory," A Treatise on Man (1842) (full book available via https://ia801409.us.archive.org/27/items/treatiseonmandev00quet/treatiseonmandev00quet_bw.pdf ). A "game changer", one would now call it. Enjoy!

OPTIONAL

If you dig harder reading, get to know Desrosières even better:

Desrosières, Alain. "Averages and the Realism of Aggregates" in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998, ch 3. Desrosières on Quetelet, w/brief Durkheim.

Lecture 4 : statecraft and quantitative racism

  1. Desrosières, Alain. "Correlation and the Realism of Causes," in The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge, Mass.: Harvard University Press, 1998: excerpt from ch 4 on Galton, pp112-127

These are great references as well; if you're inspired to dig in, please do enjoy!

  2. Galton, Francis. "Typical Laws of Heredity," Royal Institution of Great Britain. Notices of the Proceedings at the Meetings of the Members 8 (February 16, 1877): 282ff. Primary text.

  3. Stephen J. Gould, Mismeasure of Man, ch. 3. Gould brings a sword in this chapter. Great stuff. Opens with a bang, doesn't let up for 40 pages. A warning: he deals head-on with the racism of quantitative research of the late 19th c.

OPTIONAL

  1. Gillham, Nicholas. "Sir Francis Galton and the Birth of Eugenics." Ann. Rev. Genet. 35 (2001): 83-101. A scholarly engagement with Galton, his interests, and the consequences.

Lecture 5 : intelligence, causality, and policy

  1. Gould, Stephen Jay. The mismeasure of man. WW Norton & Company, 1996. ONLY pp: 280-2, 286-288, 291-302, 347-350. This is Gould's treatment of Spearman, plus a 3-page addendum about a late 20th century revival, but minus two mathy bits (which is why I list the pages in 4 chunks).

Spearman invented PCA in order to come up with a single number representing "general intelligence". If of interest, see:

  1. pp. 272-277 ONLY of: Spearman, Charles. "'General Intelligence,' objectively determined and measured." The American Journal of Psychology 15, no. 2 (1904): 201-292 (available at https://web.archive.org/web/20140407100036/http://www.psych.umn.edu/faculty/waller/classes/FA2010/Readings/Spearman1904.pdf )

  2. Freedman, David. From association to causation: some remarks on the history of statistics. Journal de la société française de statistique, Volume 140 (1999) no. 3, pp. 5-32. http://www.numdam.org/item/JSFS_1999__140_3_5_0/ Freedman ( https://en.wikipedia.org/wiki/David_A._Freedman ) was a great statistician as well as an expositor and historian of statistics. He writes so well! In this piece please focus on the parts about Yule (Sec 4, pp 11-14), but really the whole thing is great and sets you up well for the next several weeks (and life in general!)

Optional: Yule's actual paper

Yule, G. (1899). An Investigation into the Causes of Changes in Pauperism in England, Chiefly During the Last Two Intercensal Decades (Part I.). Journal of the Royal Statistical Society, 62(2), 249-295. doi:10.2307/2979889 (or at least read footnote 25:

Strictly, for "due to" read "associated with." Changes in proportion of old and in population were much the same in the two groups in all cases. The assumption could not have been made, I imagine, had the kingdom been taken as a whole.

)
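Spearman's single-number summary, discussed in the Gould reading above, amounts in modern terms to taking the leading eigenvector of a correlation matrix of test scores. A minimal pure-Python sketch, with an invented correlation matrix and invented scores (illustration only, not Spearman's actual data or procedure):

```python
# Toy sketch: a "general factor" as the leading eigenvector of a
# correlation matrix, found by power iteration (stdlib only).
# The matrix and scores below are invented for illustration.
import math

R = [  # correlations among four hypothetical test scores
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.5, 0.4],
    [0.5, 0.5, 1.0, 0.4],
    [0.4, 0.4, 0.4, 1.0],
]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Power iteration converges to the leading eigenvector: the "g" loadings.
loadings = normalize([1.0] * len(R))
for _ in range(100):
    loadings = normalize(matvec(R, loadings))

# One person's "g score": loading-weighted sum of standardized scores.
scores = [1.2, 0.8, 0.3, -0.5]   # invented standardized test scores
g = sum(l * s for l, s in zip(loadings, scores))
print([round(l, 3) for l in loadings], round(g, 3))
```

Note that when all the tests are positively correlated (the "positive manifold" that so struck Spearman), the loadings all come out positive, so the single number looks compellingly like a ranking of people.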

Lecture 6 : data gets real: mathematical baptism

Our reading for Tuesday 2020-02-25 moves from Spearman's mathematical definition of intelligence, and Yule's work implicitly applying mathematical models to policy decisions, to an explicitly mathematical "baptism" of statistics, applied directly both to decisions and to defining what is true, what is proven, and what is "science."

  1. Box, Joan Fisher. “Guinness, Gosset, Fisher, and Small Samples.” Statistical Science 2, no. 1 (1987): 45–52.

Joan was one of Fisher's 6 children with Eileen Guinness. She married the statistician George Box, now best known for the quip "All models are wrong, but some are useful," which you should keep in mind while thinking about hypothesis testing. This brief piece gives context on Fisher's writing and his half-century fight with Neyman.

  2. "The Controversy", 10 page excerpt from "The Empire of Chance", a secondary text you've already encountered (the book is a collaboration among the distinguished historians John Beatty, Lorraine Daston, Gerd Gigerenzer, Lorenz Kruger, Theodore Porter, and Zeno Swijtink). This section makes clear the ideas and animus of the fight.

Next, two documents from the two belligerents, summarizing the bitter mathbattle:

  1. Fisher, R. A. "Scientific thought and the refinement of human reasoning." (1960); also available online.

  2. Neyman, J. (1961). Silver jubilee of my dispute with Fisher; also available online.
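To make the substance of the fight concrete, here is a toy sketch (sample and null value invented for illustration; the critical values are the standard two-sided t-table entries for 9 degrees of freedom) contrasting the two attitudes toward the same small-sample t statistic: Fisher brackets a p-value against printed tables and reports strength of evidence, while Neyman–Pearson fixes α in advance and outputs only a decision.

```python
# Toy contrast of Fisher-style vs. Neyman-Pearson-style readings of
# one t statistic. Sample data and the null value are invented.
import math
import statistics

sample = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.4, 1.0, 1.1]
mu0 = 1.0   # null hypothesis: the true mean is 1.0
t = (statistics.mean(sample) - mu0) / (
    statistics.stdev(sample) / math.sqrt(len(sample)))

# Two-sided critical values of t with 9 degrees of freedom,
# as one would read them off a printed table.
TABLE = {0.10: 1.833, 0.05: 2.262, 0.01: 3.250}

# Fisher: bracket the p-value against the table and report it.
fisher = "p > 0.10"
for alpha in sorted(TABLE, reverse=True):
    if abs(t) > TABLE[alpha]:
        fisher = f"p < {alpha}"

# Neyman-Pearson: fix alpha = 0.05 up front; output only a decision.
decision = "reject H0" if abs(t) > TABLE[0.05] else "accept H0"
print(round(t, 3), "|", fisher, "|", decision)
```

Same arithmetic, two incompatible interpretations of what the number means -- which is much of what the fight was about.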

optional:

A special treat: a preprint, "Inference Rituals: Algorithms and the History of Statistics" (so please do not distribute), from Chris Phillips of CMU's history department. Great summary with lots of citations.

Lecture 7 : WWII, dawn of digital computation

This week we go somewhat back in time, from last week's "bratty" mathbattle of 1935-1960 in the academy to the actual behind-the-fence birth of computational statistics at Bletchley Park during WWII.

  1. We’ll open this week with “Bayes goes to War”, Ch 4 of “The Theory That Would Not Die” (2011), a popular book written by the science writer Sharon Bertsch McGrayne. (please also read the 2-page dessert, ch 5) While you're reading this, consider

    1. the scientific and intellectual networks the characters came from: computing was another "trading zone", bringing together hardware engineers and mathematicians, but no statisticians. The vigorous academic debate we saw raging in "mathematical statistics" 1935-1960 is absent from the dawn of computing with data
    2. the role of hardware and physical labor vs. mathematics and philosophy, different from prior authors
    3. the focus on a job to be done (especially "decisions") rather than a scientific inquiry ("truth")
    4. the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware

McGrayne, Sharon Bertsch. The theory that would not die: how Bayes' rule cracked the enigma code, hunted down Russian submarines, & emerged triumphant from two centuries of controversy. Yale University Press, 2011.

  2. McGrayne doesn't go into much detail about the technical matters -- what they actually did at Bletchley. For this, please read the first 11 pages of an unclassified report by the NSA.

Mowry, David P. German Cipher Machines of World War II: Description Based on Print Version Record. National Security Agency, Center for Cryptologic History, 2003.

  3. Mar Hicks has a great new book, "Programmed Inequality", which you have hopefully heard of. She also has an article specifically about Bletchley and the dawn of computation. In it, please focus on pages 19-42.

Hicks, Marie. "War Machines: Women's Computing Work and the Underpinnings of the Data-Driven State, 1930–1946." (2017)

  4. The above give some view of the role of math, hardware, and physical labor at the UK dawn of digital computing -- which was computing with data! However, this doesn't shed light on the "special relationship" between UK intelligence and US corporations -- the corporate contractors who participated in the early military-industrial complex. As we will see in the remaining weeks, these companies dominated what would become data science, particularly Bell Labs here in NYC (after the war Bell moved to NJ). In light of that, please enjoy this breezy excerpt from Hodges's biography of Alan Turing.

Hodges, Andrew. Alan Turing: The Enigma. Random House, 2012.

While you're reading this, consider

  1. the "scaling up" of computing with data
  2. the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware and....
  3. who has the infrastructure to build and maintain such hardware? Communications (e.g., postal service in the UK, AT&T / Bell labs in the US) feature prominently in this chapter and in succeeding weeks!

Modern day relevance:

This is the birth of digital computation!! And computation -- including that driving the device on which you're reading these words -- was born of data; that is, the first arena of attack for building digital computers is breaking codes (data) using statistical methods (what we would now call a Bayesian inference problem), along with abundant "shoe leather" work (domain expertise). Completely breaking from the academic tribe setting the tone and values for making sense of the world through data, martial and industrial (as contractors to the military) concerns and skills are the primary movers at this point through present day. I claim that these readings reveal the origin of modern data science --- applied computational statistics, driven by concerns of a domain of application, including engineering concerns, rather than being driven by mathematical rigor or scientific hypotheses.
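The "Bayesian inference problem" above can be caricatured in a few lines. Turing's Banburismus accumulated the logarithm of the Bayes factor over many letter-coincidence observations -- he named the base-10 unit the "ban" -- until the posterior odds crossed a decision threshold. A toy sketch with made-up parameters (1/17 and 1/26 are roughly the classical letter-coincidence rates for aligned vs. independent streams, used here purely for illustration):

```python
# Toy sequential Bayesian evidence accumulation in the spirit of
# Turing's "weight of evidence". All parameters are illustrative.
import math

p_match_h1 = 1 / 17   # letter coincidence rate if streams are aligned (H1)
p_match_h0 = 1 / 26   # coincidence rate for independent streams (H0)

def bans(matched: bool) -> float:
    """Weight of evidence (log10 Bayes factor) from one observation."""
    if matched:
        return math.log10(p_match_h1 / p_match_h0)
    return math.log10((1 - p_match_h1) / (1 - p_match_h0))

threshold = 2.0   # decide for H1 at posterior odds of 100:1 (even prior)
observations = [True, False, True, True, False, True, True, True] * 20

log_odds, decided_at = 0.0, None
for i, obs in enumerate(observations, start=1):
    log_odds += bans(obs)   # matches add evidence; mismatches subtract a bit
    if log_odds >= threshold:
        decided_at = i
        break

print(decided_at, round(log_odds, 2))
```

Note the shape of the computation: no closed-form mathematics, just a running sum compared to a threshold over a stream of observations -- exactly the kind of job-to-be-done procedure suited to machines and clerical labor rather than journals.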

Optional: Mar Hicks's chapter complements the great direct quotes which drive "Breaking Codes and Finding Trajectories: Women at the Dawn of the Digital Age", Ch 1 of "Recoding Gender: Women's Changing Participation in Computing" (2012) by Janet Abbate, a professor at Virginia Tech.

Abbate, Janet. Recoding gender: Women's changing participation in computing. MIT Press, 2012.

Of this, please focus on

  • pp. 14-16,
  • pp. 21-22,
  • bottom of 26-29, and
  • pp. 33-35.

While you're reading this, consider

  1. the role of physical labor and how it is valued
  2. the biases about the different skills needed
  3. the marked break-point in how we deal with data as it becomes a digital computation, with specialized hardware

Lecture 8 : birth and death of AI

This week's reading is about another present-shaping, history-changing innovation from Alan Turing (in addition to computation itself): artificial intelligence.

  1. Secondary reading: Artificial Intelligence by Stephanie Dick https://hdsr.mitpress.mit.edu/pub/0aytgrau

Context, including the primary literature

  1. Alan Turing: "Computing machinery and intelligence." Mind 59, no. 236 (1950): 433. ( https://en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence ) In these few pages, Turing lays out a plan for what would later be called AI, and thinks through the necessary hardware, the computation, the capabilities, and even many of the critiques and doubts people would raise about the possibility of AI.

There are two ways to interpret the many connections between this almost-70-year-old document and the present: the first is to say "wow! Turing was so prescient to have realized back then everything that would happen for the next 50 years!" The second is to say "wow! For the next 50 years all people did was execute the plan laid out in this document!" Either way you can see in this document what would be the future (and our present!)

questions to ask yourself:

  • What was "the Turing test" initially? How did this relate to Turing's biography?
  • What objections did he expect against AI? Did he address them well?
  • Can you tell that there are multiple ways to achieve AI in this work?

  2. McCarthy, John, Marvin Minsky, Nathaniel Rochester, and Claude Shannon. "A proposal for the Dartmouth summer research project on artificial intelligence, August 1955." (1955). This is the only reading this term that is actually a funding proposal, not a scholarly work, but it has had tremendous historical impact. It first introduced the phrase "artificial intelligence" itself. By McCarthy's own admission:

"I invented the term artificial intelligence. I invented it because ...we were trying to get money for a summer study in 1956...aimed at the long term goal of achieving human-level intelligence." -- JCM, during the Lighthill debate [1]

There are two ways to interpret the many connections between this 64-year-old document and the present: the first is to say "wow! They were so prescient to have realized back then everything that would happen for the next 50 years!" The second is to say "wow! For the next 50 years all people did was execute the plan laid out in this document!" Either way you can see in this document what would be the future (and our present!)

questions to ask yourself:

  • What part of their goal overlaps with what we now think of as AI?
  • What parts relate to CS as a field?
  • Which goals have we not yet achieved?

The backlash against AI -- sometimes called "the first AI winter" -- accelerated in the early 1970s. A well-documented example from the UK is "the Lighthill report". There are copious assets from and reactions to this report, and it is a great illustration of how incumbents in the scientific establishment, including Lighthill himself [2], had a role in smiting the upstart field of AI. If you have time I strongly encourage you to watch the videos of the televised debate itself (see the Optional extras below), just amazing stuff, featuring a who's who of British AI work [3] along with McCarthy himself, trying to defend the nascent field. In prior years we've assigned the report along with reactions; for Tuesday please just read the report itself. A note: don't expect it to make sense. It's a rant by someone who had been cramming on AI for 3 months before pontificating. He gets plenty wrong.

Let's just read JCM's reaction, written in 1973 (though typeset in 2000; ignore the year):

questions to ask yourself:

  • What did Lighthill get right? What did he get wrong?
  • How well do you think he argued against AI as a field?

Optional extras for further reading

  1. Professor Sir James Lighthill, FRS. "Artificial intelligence: a paper symposium." Science Research Council, 1973: 317-322.

  2. The full debate: Lighthill, J. "BBC TV–June 1973 'Debate at the Royal Institution'" https://www.youtube.com/watch?v=03p2CADwGF8

  3. Pamela McCorduck (2009). Machines who think: A personal inquiry into the history and prospects of artificial intelligence. AK Peters/CRC Press.

a) Chapter 5, "The Dartmouth Conference". McCorduck is an unusual secondary source in that she knows personally almost all the people she writes about -- unparalleled access to the minds and interests of the participants.

questions to ask yourself:

  • What were the interests and goals of the participants?

b) Chapter 9, "L’Affaire Dreyfus"

The backlash wasn't just in the UK, of course. A prominent example is Hubert L. Dreyfus's book What Computers Can't Do (1972), discussed at length by Pamela McCorduck.

References:

[1] Lighthill, J. "BBC TV–June 1973 'Debate at the Royal Institution'" https://www.youtube.com/watch?v=pyU9pm1hmYs&t=266s , 3'00"

[2] https://en.wikipedia.org/wiki/James_Lighthill, who was Lucasian Professor of Mathematics ( https://en.wikipedia.org/wiki/Lucasian_Professor_of_Mathematics )

[3] One of the audience members called upon to speak ( https://youtu.be/3GZWFnWOqkA?t=407 ) is Chris Strachey ( https://en.wikipedia.org/wiki/Christopher_Strachey ), whose whole life story is amazing.

Lecture 9 : big data, old school (1958-1980)

It's time to get to know big data from a historical view. In our opening reading Hanna Wallach warned that big data is creepy because it's granular and because it's about people. This week's reading takes a historical look.

  1. Hans Peter Luhn. "A business intelligence system." IBM Journal of research and development 2, no. 4 (1958): 314-319.

Luhn's brief paper is credited with introducing the phrase Business Intelligence, now an important organizational unit at most major companies, and also a mindset about how corporations should make data a first-order priority. We'll come back to this mindset soon when we meet its modern incarnation.

Think through: who was his audience, and what was the status quo beforehand?

  2. Sarah Igo. The Known Citizen: A History of Privacy in Modern America. Harvard University Press, 2018. Chapter 6: The Record Prison.

Igo traces battles over corporate and state control of data, and public reactions, in the 1960s and 1970s.

  3. Arthur Raphael Miller. "Assault on privacy: Computers, Data Banks, and Dossiers" (1971), excerpt from Ch 2: "The New Technology's Threat to Personal Privacy".

Even in 1971 it was clear that technology was going to conflict directly with our notions of privacy. Miller was ahead of his time in sounding the alarm.

(This Arthur Miller: https://en.wikipedia.org/wiki/Arthur_R._Miller ; not this one: https://en.wikipedia.org/wiki/Arthur_Miller )

As you read, think through what has changed and what hasn't, both in tech and in public attitudes.

  4. Martha Poon. "Scorecards as devices for consumer credit: the case of Fair, Isaac & Company Incorporated." The Sociological Review 55 (2007): 284-306.

Extra creepy is when data about people are reduced to a single number. In the case of credit scoring it's not only a single number -- it's a very powerful single number. Poon reviews the development, breaking it into three stages:

  • 1958-1974,
  • 1980-1985,
  • 1986-1991.

As you read think about

  • statistical measures: what is the "score" meant for as a summary? Is it a description? A prediction? A prescription?
  • ethical principles of fairness and beneficence
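Poon's subject -- compressing a person into one decisive number -- can be caricatured as a points-based scorecard. Everything below (attributes, point values, cutoff) is invented for illustration, not taken from Fair, Isaac:

```python
# Toy points-based credit scorecard (all attributes, points, and the
# cutoff are invented). Attributes earn points; the total is "the
# score"; a cutoff turns the score into a decision.

SCORECARD = {
    "years_at_address": {(0, 2): 10, (2, 10): 25, (10, 99): 40},
    "has_bank_account": {True: 30, False: 0},
    "occupation": {"professional": 35, "clerical": 25, "laborer": 15},
}
CUTOFF = 70   # invented: accept applications scoring 70 or more

def score(applicant: dict) -> int:
    total = 0
    for attr, table in SCORECARD.items():
        value = applicant[attr]
        if isinstance(next(iter(table)), tuple):   # banded numeric attribute
            total += next(pts for (lo, hi), pts in table.items()
                          if lo <= value < hi)
        else:                                      # categorical attribute
            total += table[value]
    return total

applicant = {"years_at_address": 5, "has_bank_account": True,
             "occupation": "clerical"}
s = score(applicant)
print(s, "accept" if s >= CUTOFF else "decline")
```

Notice how the three readings of "the score" coexist: the table describes past repayment data, the total is used as a prediction, and the cutoff is a prescription.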

Please comment by 11:59 pm on Sunday Mar 24.

optional:

  1. Joanna Radin. "“Digital Natives”: How Medical and Indigenous Histories Matter for Big Data." Osiris 32, no. 1 (2017): 43-64.

Another creepy context is health data. This one touches directly on fair use and fair distribution of benefits by focusing on the members of the Gila River Indian Community Reservation.

Lecture 10: data science, 1962-2017

The last several weeks were about how communities other than statisticians thought about data. (In fact, the last time we read from or about statisticians was in February!). We traced the military-crypto-computational effort (March 5), the cognitive-computational birth of AI (March 12), and the government-industrial birth of big data (March 26). This week is about the reaction of industrial statisticians to big data and computation, and traces the origin of 'data science' both as a term and a mindset.

These readings span 55 years and are organized as:

  1. (1962) Tukey, the original east coast heretical statistician ( https://en.wikipedia.org/wiki/John_Tukey )

  2. (2001) Breiman, the west coast heretic ( https://en.wikipedia.org/wiki/Leo_Breiman )

  3. (2017) Donoho, the Respected Academic who baptizes data science as statistics ( https://en.wikipedia.org/wiki/David_Donoho )

  4. a recent piece by Professor Gina Neff (cc'93!!) with an astute study of the non-technical aspects that differentiate data science from statistics in practice. ( https://en.wikipedia.org/wiki/Gina_Neff )

  1. John W. Tukey. "The future of data analysis." The Annals of Mathematical Statistics 33, no. 1 (1962).

This is a 67-page paper, so please read only the first and last sections (12 pages and 4 pages): pp. 2-14 (end at "II. Spotty Data") and pp. 60-64 (start at "VIII. How shall we Proceed?").

Tukey spent his whole career split between Princeton and Bell Labs. (You might recall that we encountered Tukey in our discussion of exploratory data analysis -- he wrote the book defining the field in 1977.) His PhD (1939) was in topology, but as he later put it "By the end of late 1945, I was a statistician rather than a topologist," having worked on code-breaking and other martial applications.

Despite being the founding chair of Princeton's statistics department, he had an adversarial relationship with mathematical statistics. This paper is the most-cited attack on the field.

  2. Leo Breiman. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)." Statistical Science 16, no. 3 (2001): 199-231.

Please read ONLY pp. 199-215; optionally, read the stats-fight on pp. 215-231.

Breiman, like Tukey, was trained as a proper mathematician, then, like Tukey, worked on extremely applied problems. Also like Tukey, he wrote and spoke trying to get academic mathematical statisticians to embrace computation and data rather than just math.

  3. David Donoho. "50 years of data science." Journal of Computational and Graphical Statistics 26, no. 4 (2017): 745-766.

Donoho worked with Tukey as an undergraduate and is at this point possibly the most anointed living computational statistician. This paper baptizes the heresy, bringing it into the church of statistics by providing his view of how statistics should be defined so as to include data science as he sees it.

  4. Gina Neff, Anissa Tanweer, Brittany Fiore-Gartland, and Laura Osburn. "Critique and contribute: A practice-based framework for improving critical data studies and data science." Big Data 5, no. 2 (2017): 85-97.

Professor Neff has a long history of understanding technology and scientific communities. Her PhD here at Columbia involved very applied ethnographic work hanging out with Silicon Alley people in the late 1990s and understanding their values and networks. In this piece she ties together data science in theory with data science in practice. Enjoy!!

Lecture 11: AI2.0

For the triumphant rise of AI2.0:

  1. Jones, Matthew L. "Querying the Archive: Data Mining from Apriori to PageRank." Science in the Archives: Pasts, Presents, Futures (2017): 311.

  2. Fast-forwarding to the (near) present, we'll look at a 2015 review article by two of the most famous living names in machine learning -- Michael Jordan [1] and Tom Mitchell [2]. Mitchell founded the first Department of Machine Learning (at least until they change the name to "AI") and has been writing definitive books defining the field since the early 1980s (the Simon lecture we discussed appears in print in an edited volume of which Mitchell is editor). This article lays out the "modern" partitioning of machine learning into

  • unsupervised (descriptive),
  • supervised (predictive), and
  • reinforcement (prescriptive) learning,

though the roots of this partitioning go back to the early 1970s (for supervised/unsupervised) and the 1980s for "reinforcement learning" (though some would say that RL is just control theory or sequential decision processes, re-branded).

Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260.
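The three-way partition can be made concrete with toy stdlib Python (all data and payoffs invented): unsupervised methods describe structure in the inputs alone, supervised methods learn to predict labels, and reinforcement learning prescribes actions from reward feedback.

```python
# Toy illustrations of the three paradigms (all numbers invented).
import random
import statistics

# Unsupervised (descriptive): no labels; split points around their mean.
xs = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
center = statistics.mean(xs)
clusters = [[x for x in xs if x < center], [x for x in xs if x >= center]]

# Supervised (predictive): least-squares fit y = a*x + b from labeled pairs.
pairs = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
mx = sum(x for x, _ in pairs) / len(pairs)
my = sum(y for _, y in pairs) / len(pairs)
a = (sum((x - mx) * (y - my) for x, y in pairs)
     / sum((x - mx) ** 2 for x, _ in pairs))
b = my - a * mx

def predict(x):
    return a * x + b

# Reinforcement (prescriptive): epsilon-greedy choice between two actions
# with hidden payoff rates, learning from reward alone.
random.seed(0)
payoff = {"A": 0.3, "B": 0.7}          # hidden from the learner
estimates = {"A": 0.0, "B": 0.0}
counts = {"A": 0, "B": 0}
for _ in range(2000):
    if random.random() < 0.1:          # explore
        action = random.choice(["A", "B"])
    else:                              # exploit current estimates
        action = max(estimates, key=estimates.get)
    reward = 1.0 if random.random() < payoff[action] else 0.0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(clusters, round(predict(5), 2), max(estimates, key=estimates.get))
```

Describe, predict, prescribe -- the same trichotomy the review article organizes the field around.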

  3. Missing from the above documents is the contemporary explosion of interest in "deep learning." An enjoyable nontechnical introduction appeared in The New York Times Magazine in 2016. The piece is long (though readable and enjoyable!). For this class, please make sure you read
  • "Prologue", and
  • "A Deep Explanation of Deep Learning".

Lewis-Kraus, Gideon. "The great AI awakening." The New York Times Magazine (2016): 1-37. available online via http://publicservicesalliance.org/wp-content/uploads/2016/12/The-Great-A.I.-Awakening-The-New-York-Times.pdf

  4. The inscrutability of deep nets -- far more complex than the random forests discussed by Breiman -- raises questions about interpretability, trust, and "black boxes." An expert assessment of the risks of black-box models is in this recent piece by Prof. Cynthia Rudin, whose work was mentioned in the optional reading by Radin a few weeks ago. Feel free to skip over the mathy bits.

Cynthia Rudin "Please Stop Explaining Black Box Models for High Stakes Decisions." arXiv preprint arXiv:1811.10154 (2018)

OPTIONAL:

In addition to the problems interpretability raises that we mentioned in class (trust, causality, fairness, etc.), there is the practitioners' problem of not knowing how to use, tune, and optimize deep nets. This led two experts to accuse the field of being "alchemy" at the most prestigious and high-visibility conference in machine learning, NeurIPS. To that end, as optional "reading", enjoy the video in which they call it out as such before a shocked and hushed crowd of true-believing deep learning techno-solutionists. Good fun, in 22 min or so: https://www.youtube.com/watch?v=Qi1Yry33TQE (gloves come off at 12'00": https://youtu.be/Qi1Yry33TQE?t=724 )

If you want to see the written version of the "alchemy" accusation, Ben Recht, a professor, cynic, and great writer, does so in blog form here:

http://www.argmin.net/2017/12/05/kitchen-sinks/

http://www.argmin.net/2017/12/11/alchemy-addendum/

I think the above speak to Gina Neff's comment about "reflexive" data scientists!

[1] https://en.wikipedia.org/wiki/Michael_I._Jordan

[2] https://en.wikipedia.org/wiki/Tom_M._Mitchell

Lecture 12: ethics

The readings all semester, but particularly this week, point to the need to refine ethics and make it not just an interjection ("that's unethical!") but a clearly-defined tool for problem solving. We hope these readings help clarify things.

  1. Salganik Ch 6: Salganik, Matthew J. Bit by bit: social research in the digital age. Princeton University Press, 2017.

Matt Salganik [1] got his PhD at Columbia in 2006 in Sociology. Chapter 6 of his book Bit by Bit is explicitly about ethics. For Tuesday, please read sections 6.1 "Introduction", 6.2 "Three examples", 6.4 "Four Principles", and 6.6 "Difficulty". (Fun fact: Tukey invented the word 'bit'!)

  2. Sweeney on re-identification: Sweeney, Latanya. "Only you, your doctor, and many others may know." Technology Science 2015092903 (2015). https://techscience.org/a/2015092903/

I've mentioned Professor Latanya Sweeney [2] several times in class, e.g., when she was a grad student and sent the Governor of Massachusetts his medical records. This piece is technical but also legible. The section "approaches" makes clear one of the problems of "anonymous" data, as we discussed in the lab on the Netflix prize.
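The linkage problem fits in a few lines. A toy sketch with invented records: a "de-identified" medical table still carries ZIP code, birth date, and sex, so joining it to a public voter roll on those quasi-identifiers restores names (the combination Sweeney famously showed is unique for a large fraction of Americans):

```python
# Toy re-identification by record linkage (all records invented).

medical = [  # "anonymous": names removed, quasi-identifiers kept
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "dx": "hypertension"},
    {"zip": "02139", "dob": "1962-03-12", "sex": "M", "dx": "asthma"},
]
voters = [   # public record: names plus the same quasi-identifiers
    {"name": "A. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "J. Doe",   "zip": "02139", "dob": "1971-01-01", "sex": "M"},
]

def quasi_id(rec):
    return (rec["zip"], rec["dob"], rec["sex"])

roll = {quasi_id(v): v["name"] for v in voters}
reidentified = {roll[quasi_id(m)]: m["dx"]
                for m in medical if quasi_id(m) in roll}
print(reidentified)   # the first "anonymous" record now has a name
```

Removing the name column was never the hard part; the defender has to reason about what else in the record is unique, which is what k-anonymity-style approaches formalize.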

  3. "How will AI change your life? AI Now Institute founders Kate Crawford and Meredith Whittaker explain." https://www.recode.net/podcasts/2019/4/8/18299736/artificial-intelligence-ai-meredith-whittaker-kate-crawford-kara-swisher-decode-podcast-interview

Just posted on Monday is a conversation with AI Now Institute [3] founders Kate Crawford [4] (whose piece with danah boyd appeared in the 1st week) and Meredith Whittaker [5]. You'll find that the issues in Salganik and this week's readings are at the fore.

[1] https://en.wikipedia.org/wiki/Matthew_J._Salganik

[2] https://en.wikipedia.org/wiki/Latanya_Sweeney

[3] https://en.wikipedia.org/wiki/AI_Now_Institute

[4] https://en.wikipedia.org/wiki/Kate_Crawford

[5] https://en.wikipedia.org/wiki/Meredith_Whittaker

Lecture 13: present problems: attention economy+VC=dumpsterfire

Two weeks ago we caught up to the present, including AI + personal data; last week we set a frame for thinking analytically about ethics. This week it's time to see how the power of AI and personal data has been monetized. The readings cover:

  • the data-enabled advertising model;
  • the venture model, which accelerates the monetization of data; and
  • the consequences for information platforms.

Data will be in the background this week, making the transformations possible but mentioned only as an assumed accelerant (e.g., in references to machine learning and personalization).

  1. The advertising model, 1997: Michael Goldhaber "The attention economy and the net." First Monday 2, no. 4 (1997).

This is a very prescient article, written in the early days of the Web. It's entertaining to see which predictions proved right, and which did not!

  1. The venture model, 2019 "The fundamental problem with Silicon Valley’s favorite growth strategy", February 5 2019 by Tim O'Reilly ( https://en.wikipedia.org/wiki/Tim_O%27Reilly )

This piece contrasts growth as a venture-backed startup with other models and points out potentially harmful consequences for society. The author has been a technical writer since 1977 and has also been a venture capitalist.

  1. consequences of the above two for information platforms

3a) long, with problems: Grimmelmann, James. "The platform is the message." Georgetown Law Technology Review (2018): 18-30.

3b) short, with suggestions for solutions: "Yes, Big Platforms Could Change Their Business Models" By Zeynep Tufekci December 17, 2018 ( https://en.wikipedia.org/wiki/Zeynep_Tufekci )

PS: Some optional extra readings if you're interested in more on any of the 3 subjects:

On Advertising

  1. Richard Serra and Carlotta Schoolman. Television Delivers People. Castelli-Sonnabend Films and Tapes, 1973.

On Venture:

  1. AnnaLee Saxenian. Regional Advantage. Harvard University Press, 1996. Ch. 1. A great book from AnnaLee Saxenian ( https://en.wikipedia.org/wiki/AnnaLee_Saxenian ), Dean of the UCB I-School, which accurately predicted that Silicon Valley would overtake Boston. This chapter is a brief history of venture capital from WWII to the 1970s.

Consequences for information platforms:

  1. "Ethical Principles, OKRs, and KPIs: what YouTube and Facebook could learn from Tukey" Chris Wiggins, April 2018 https://datascience.columbia.edu/ethical-principles-okrs-and-kpis-what-youtube-and-facebook-could-learn-tukey

Lecture 14: future solutions

Lecture 13 was about problems: specifically about the volatile mixture of the advertising model, the venture model, and data. Lecture 14 is about solutions.

A repeated framing of power used in this class is organized around 3 loci:

  • state power
  • people power
  • corporate power.

So we'll use this framing for our final 4 readings:

  1. a brief 2017 news item about one example of corporate power;
  2. a brief 2019 blog post (with lots of links!!!) about state power;
  3. a 2019 historical/ethnographic article (with a timeline!!) about one example of people power; and
  4. to follow these three, sections 2 and 3 (about "solutions") of the 2018 AI Now Institute report, which contains examples relevant to all three.

1. Corporate power

Corporate power: corporations compete with each other by

  • 1.1 using consumer protection as a competitive advantage;
  • 1.2 "de-platforming" other companies to cast light on bad-for-user policies; and
  • 1.3 in the case of the press: reporting on bad behavior by other companies.

We've seen many examples of press coverage (1.3) over the last few weeks; as a brief analysis of the first of these (1.1), please read this brief 2017 FT piece from Rana Foroohar (BC'92!):

2. State power

State power: this is a large subject but the most visible and oft-discussed state powers are 2.1 regulation and 2.2 antitrust.

Renée DiResta does a good job giving a brief overview of current discussions of regulation (2.1), specifically with respect to disinformation, in this piece:

OPTIONAL: For current changes in antitrust, please see this piece in your local paper on Lina Khan (currently at Columbia):

https://www.nytimes.com/2018/09/07/technology/monopoly-antitrust-lina-khan-amazon.html

3. People power

People can deny companies 3.1 their money, 3.2 their data, and 3.3 their talent.

On the last of these, please read "The Tech Revolt — The California Sunday Magazine", January 23, 2019. https://story.californiasunday.com/tech-revolt

  1. All three of the above appear in various forms in the 2018 AI Now Institute report. Part 1 is about problems, which we've pretty well covered. Please read pages 24-44 (Parts 2 and 3), which are about solutions:

https://ainowinstitute.org/AI_Now_2018_Report.pdf

Hopefully this can give you some hope for the future, or at least a list of some of the weapons at your disposal.
