Data Mining

  • Most Topular Stories

  • Getting Along

    Kevin Hillstrom: MineThatData
    Kevin Hillstrom
    20 Apr 2014 | 8:15 pm
    Our industry is prioritizing tactical nonsense over human interactions. Have you ever actually listened to employees? Have you ever actually listened to what they say? Sit in on an "omnichannel excellence meeting" someday, and then an hour later, interview the folks who attended. That's where truth emerges. IT Executive: "I read an article about this big data thing, but nobody wants to invest in the infrastructure necessary for my team to glean insights from big data. And my team could easily outperform the analytics team, we know analysis and code and database structure." Marketing Executive:…
  • The Good and the Bad of the Median

    Data Mining in MATLAB
    31 Jul 2012 | 9:08 am
    Perhaps the most fundamental statistical summary beyond simple counting or totaling is the mean. The mean reduces a collection of numbers to a single value, and is one of a number of measures of location. The mean is by far the most commonly used and widely understood way of averaging data, but it is not the only one, nor is it always the "best" one. In terms of popularity, the median is a distant second, but it offers a mixture of behaviors which make it an appealing alternative in many circumstances. One important property of the median is that it is not affected - at all -…
  • PolitiFact: Rubio's claim that Snowden scandal 'most damaging' U.S. espionage not definitive

    Data mining News
    22 Apr 2014 | 10:01 pm
    … information detailed surveillance programs and data mining operations against world leaders and …
  • The Internet of Things meets disruptive technologies

    Computerworld BI and Analytics News
    21 Apr 2014 | 6:58 am
    How will the IoT affect social, mobile, the cloud and analytics? (Insider; registration required)
  • Probability: A Halloween Puzzle

    Data Mining in MATLAB
    12 Apr 2014 | 7:22 pm
    Introduction: Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution. Recently, I was given the following probability question to answer: Halloween Probability Puzzler: The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces…
 

    Data Mining in MATLAB

  • Probability: A Halloween Puzzle

    12 Apr 2014 | 7:22 pm
    Introduction: Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution. Recently, I was given the following probability question to answer: Halloween Probability Puzzler: The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces…
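The setup in this excerpt can be explored with a small simulation. The sketch below assumes 24 independent five-minute intervals between 6 and 8pm and reads "one piece plus a Poisson(1) extra" literally. The question itself is cut off above, so this only estimates the distribution of total candy taken in one evening; the function names and the Knuth-style sampler are illustrative choices, not the author's solution.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def candy_taken_one_night(rng, intervals=24, mean_children=5.0, mean_extra=1.0):
    """Total pieces taken over the evening: each child takes 1 + Poisson(mean_extra)."""
    total = 0
    for _ in range(intervals):
        children = poisson(mean_children, rng)
        for _ in range(children):
            total += 1 + poisson(mean_extra, rng)
    return total

def simulate(nights=2000, seed=0):
    """Simulate many evenings and return the sorted totals."""
    rng = random.Random(seed)
    return sorted(candy_taken_one_night(rng) for _ in range(nights))

totals = simulate()
mean_total = sum(totals) / len(totals)
print("simulated mean pieces per night:", round(mean_total, 1))
print("theoretical mean: 24 * 5 * (1 + 1) =", 24 * 5 * 2)
```

Under these assumptions the total is a compound Poisson sum, so its mean is 240 and its variance is 120 * E[X^2] = 120 * 5 = 600, where X = 1 + Poisson(1). Any quantile of interest can be read off the sorted `totals` list.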
  • Reading lately (Nov-2013)

    23 Nov 2013 | 6:23 pm
    I read a great deal of technical literature and highly recommend the same for anyone interested in developing their skill in this field. While many recent publications have proven worthwhile (2011's update of Data Mining by Witten and Frank; and 2010's Fundamentals of Predictive Text Mining, by Weiss, Indurkhya and Zhang are good examples), I confess being less than overwhelmed by many current offerings in the literature. I will not name names, but the first ten entries returned from my search of books at Amazon for "big data" left me unimpressed. While this field has enjoyed popular…
  • Ranking as a Pre-Processing Tool

    13 Jan 2013 | 10:13 am
    To squeeze the most from data, analysts will often modify raw variables using mathematical transformations. For example, in Data Analysis and Regression, Tukey described what he terms a "ladder of re-expression" which included a series of low-order roots and powers (and the logarithm) intended to adjust distributions to permit better results using linear regression. Univariate adjustments using those particular transformations are fairly popular today, and are even incorporated directly into some machine learning software as part of the solution search. The modified…
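The post title points to replacing raw values with their ranks, but the excerpt is cut off before any detail. The following is therefore a generic rank-transform sketch (tied values share an average rank), offered as an assumption about what such pre-processing might look like rather than the author's procedure. The function name and sample values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

# Hypothetical skewed variable: one extreme value dominates the raw scale
incomes = [32_000, 41_000, 41_000, 2_500_000]
print(average_ranks(incomes))
```

The ranks depend only on ordering, so the extreme value ends up one step above its neighbor instead of orders of magnitude away.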
  • The Good and the Bad of the Median

    31 Jul 2012 | 9:08 am
    Perhaps the most fundamental statistical summary beyond simple counting or totaling is the mean. The mean reduces a collection of numbers to a single value, and is one of a number of measures of location. The mean is by far the most commonly used and widely understood way of averaging data, but it is not the only one, nor is it always the "best" one. In terms of popularity, the median is a distant second, but it offers a mixture of behaviors which make it an appealing alternative in many circumstances. One important property of the median is that it is not affected - at all -…
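The excerpt breaks off mid-sentence, so the exact property the author goes on to describe is not shown. A commonly cited one is that changing an extreme value leaves the median untouched while it moves the mean. The sketch below illustrates that with made-up numbers, using Python's standard `statistics` module.

```python
from statistics import mean, median

# Made-up readings; the second list replaces the largest value with a gross outlier
readings = [10, 11, 12, 13, 14]
corrupted = [10, 11, 12, 13, 1400]

print("clean:    ", mean(readings), median(readings))
print("corrupted:", mean(corrupted), median(corrupted))
```

The median is the middle value in both lists, while the mean of the corrupted list is pulled far above every typical reading.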
  • Linear Discriminant Analysis (LDA)

    11 Dec 2010 | 2:03 am
    Overview: Linear discriminant analysis (LDA) is one of the oldest mechanical classification systems, dating back to statistical pioneer Ronald Fisher, whose original 1936 paper on the subject, The Use of Multiple Measurements in Taxonomic Problems, can be found online (for example, here). The basic idea of LDA is simple: for each class to be identified, calculate a (different) linear function of the attributes. The class function yielding the highest score represents the predicted class. There are many linear classification models, and they differ largely in how the coefficients are established.
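The scoring rule described in this excerpt (one linear function per class, highest score wins) can be sketched as follows. The coefficients are hypothetical placeholders. Real LDA estimates them from training data, which this sketch does not attempt, and the function and variable names are illustrative only.

```python
def predict(attributes, class_functions):
    """Score every class with its own linear function and return the best-scoring label."""
    scores = {}
    for label, (intercept, weights) in class_functions.items():
        scores[label] = intercept + sum(w * x for w, x in zip(weights, attributes))
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical (intercept, weights) pairs for two classes; not fitted to any data
class_functions = {
    "A": (-1.0, [0.5, 0.2]),
    "B": (0.5, [-0.3, 0.4]),
}

label, scores = predict([4.0, 1.0], class_functions)
print(label, scores)
```

Only the prediction step is shown here. How the coefficients are established is what distinguishes LDA from other linear classifiers, as the excerpt notes.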
 

    natural language processing blog

  • Waaaah! EMNLP six months late :)

    14 Apr 2014 | 9:56 am
    Okay, so I've had this file called emnlp.txt sitting in my home directory since Oct 24 (last modification), and since I want to delete it, I figured I'd post it here first. I know this is super belated, but oh well, if anyone actually reads this blog any more, you're the first to know how I felt 6 months ago. I wonder if I would make the same calls today... :) A Log-Linear Model for Unsupervised Text Normalization (Yi Yang and Jacob Eisenstein); Parsing entire discourses as very long strings: Capturing topic continuity in grounded language learning [TACL] (Minh-Thang Luong, Michael C. Frank,…
  • Active learning for positive examples

    16 Sep 2013 | 11:59 am
    I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning. But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible. That is to say: this is really a find-a-needle-in-a-haystack problem. The best…
  • The *SEM 2013 Panel on Language Understanding (aka semantics)

    8 Jul 2013 | 2:41 pm
    One of the highlights for me at NAACL was the *SEM panel on "Toward Deep NLU", which had the following speakers: Kevin Knight (USC/ISI), Chris Manning (Stanford), Martha Palmer (UC Boulder), Owen Rambow (Columbia) and Dan Roth (UIUC). I want to give a bit of an overview of the panel, interspersed with some opinion. I gratefully acknowledge my wonderful colleague Bonnie Dorr for taking great notes (basically a transcript) and sharing them with me to help my failing memory. For what it's worth, this basically seemed like the "here's what I'm doing for DEFT panel". Here's the basic gist that I got…
  • My NAACL 2013 list...

    17 Jun 2013 | 12:11 pm
    I feel a bit odd doing my "what I liked at NAACL 2013" as one of the program chairs, but not odd enough to skip what seems to be the most popular type of post :). First, though, since Katrin Kirchhoff (my co-chair) and I never got a chance to formally thank Lucy Vanderwende (the general chair) and give her flowers (or wine or...) let me take this opportunity to say that Lucy was an amazing general chair and that working with her made even the least pleasant parts of PCing fun. So: thanks Lucy -- I can't imagine having someone better to have worked with! And all of the rest of you: if you see…
  • What is a sparse difference in probability distributions?

    30 Apr 2013 | 7:36 am
    Sparsity has been all the rage for a couple of years now. The standard notion of a "sparse" vector u is that the number of non-zeros in u is small. This is simply the l_0 norm of u, ||u||_0. This norm is well studied, known to be non-convex, and often relaxed to the l_1 norm of u, ||u||_1: the sum of absolute values. (Which has the nice property of being the "tightest" convex approximation to l_0.) In some circumstances, it might not be that most of u is zero, but simply that most of u is some fixed scalar constant a. The "non-constant" norm of u would be something like "the number of components…
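The two quantities defined in this excerpt are easy to compute directly. The sketch below covers only the l_0 and l_1 definitions as stated. The "non-constant" variant is cut off above, so it is deliberately left out. The function names and the sample vector are hypothetical.

```python
def l0_norm(u):
    """Number of non-zero components of u."""
    return sum(1 for x in u if x != 0)

def l1_norm(u):
    """Sum of absolute values of the components of u."""
    return sum(abs(x) for x in u)

# Hypothetical vector: mostly zeros, so it is sparse in the l_0 sense
u = [0.0, 3.0, 0.0, -1.5, 0.0]
print("l0:", l0_norm(u))
print("l1:", l1_norm(u))
```

Rescaling u changes its l_1 value but not its l_0 value, which is one way to see that the two measure different things even though l_1 is used as the convex stand-in.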

    Kevin Hillstrom: MineThatData

  • Last Chance Email Messages

    Kevin Hillstrom
    22 Apr 2014 | 8:15 pm
    97% of the folks in the email marketing community are trying to do their best - they're limited by company strategy and customer apathy. I know, it's hard. You have half your list who never bothers to pay attention to messages. It's easier to just get rid of them ... boom ... your metrics improve (but sales don't improve), and you can tell your boss you improved open/click/conversion rates. You get a bonus. Woo-hoo! There are two alternate strategies we can employ. Don't anger your customer - simply reduce frequency quietly. Why publicly suggest that you no longer value the customer enough to…
  • Gold, Green, Blue, and Red

    Kevin Hillstrom
    21 Apr 2014 | 8:20 pm
    I am always amazed at how infrequently catalogers analyze the performance of individual spreads. Here's something I was introduced to at Lands' End, back in 1992 (yes, 1992). We calculated the profitability of every spread. If a spread performed at 30% or greater variable profit, it earned a color code of GOLD. If a spread performed at 20% to 29% variable profit, it earned a color code of GREEN. If a spread performed at 10% to 19% variable profit, it earned a color code of BLUE. Finally, if the spread performed below 10% variable profit, it earned a color code of RED. You give the image above a…
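The color-coding rule in this excerpt maps directly onto a threshold function. This is a minimal sketch: the function name and the sample spreads are hypothetical, and fractional values such as 29.5% are assumed to fall in the lower band, since the excerpt only lists whole-percent ranges.

```python
def spread_color(variable_profit_pct):
    """Map a spread's variable profit (in percent) to the color code from the excerpt."""
    if variable_profit_pct >= 30:
        return "GOLD"
    if variable_profit_pct >= 20:
        return "GREEN"
    if variable_profit_pct >= 10:
        return "BLUE"
    return "RED"

# Hypothetical catalog spreads and their variable profit percentages
spreads = {"pages 2-3": 34.0, "pages 4-5": 22.5, "pages 6-7": 11.0, "pages 8-9": 4.0}
for pages, pct in spreads.items():
    print(pages, spread_color(pct))
```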
  • The Wall St. Journal Loves Catalogs

    Kevin Hillstrom
    21 Apr 2014 | 8:15 pm
    Of course, you've already read this - heck, you are probably one of the folks who forwarded the article to me! (click here to read the article). Since you don't get to see many positive catalog messages, I thought I'd share this one with you. Put a smile on your face, and go generate some profit for your business!
  • Getting Along

    Kevin Hillstrom
    20 Apr 2014 | 8:15 pm
    Our industry is prioritizing tactical nonsense over human interactions. Have you ever actually listened to employees? Have you ever actually listened to what they say? Sit in on an "omnichannel excellence meeting" someday, and then an hour later, interview the folks who attended. That's where truth emerges. IT Executive: "I read an article about this big data thing, but nobody wants to invest in the infrastructure necessary for my team to glean insights from big data. And my team could easily outperform the analytics team, we know analysis and code and database structure." Marketing Executive:…
  • Attracting A Younger Customer

    Kevin Hillstrom
    17 Apr 2014 | 8:15 pm
    Inside a book store, we see a nice display, don't we? Top Teen Picks. New Teen Fiction. What is missing in this picture? Teens. It is good to offer merchandise designed to attract a younger audience. Then, we have to find a younger audience. It's really hard to do that within a framework designed to attract the loyal, core customer.
 

    Neoformix

  • Markham Winter of 2014

    1 Apr 2014 | 4:30 am
    Winter has finally ended in Markham where I live and it has seemed a very long and cold season this year. I decided to take a look at the weather data from Environment Canada and see whether my impression is supported by the data. The result is the graphic below. Click on it to see a larger version. Yes, 2014 was the coldest winter in Markham since 1994. We had an average temperature during the winter of -8.2 C this year and in 1994 it was -9.2 C. Both last year and especially 2012 were warmer than usual so it likely felt that much worse in comparison. We also had the 4th most snow in the…
  • Toronto Visible Minorities

    27 Sep 2013 | 4:30 am
    Toronto is the most multicultural city in the world. According to the 2011 National Household Survey, 46% of the population were foreign-born immigrants and 47% are members of a visible minority. (ref) These immigrants come from a wide variety of places across the globe and their diversity makes the city a truly remarkable place. I have created a Dot Map that shows a single point for every person in the Toronto area, coloured by visible minority status. There are 5,700,628 in all and they are positioned at their place of residence and coloured based on the information from the 2011 census and…
  • Toronto 311 Visualization

    6 Sep 2013 | 4:20 am
    The calls people make into the 311 service line in Toronto give an interesting glimpse into the pulse of the city. The City of Toronto makes this data available through their Open Data initiative. I did some analysis and design work with it to produce a visualization for illuminating time-based patterns during 2012. The visualization is a set of small multiple calendar heatmaps, one for each data series. The one shown above is for reports about 'long grass and weeds'. I was inspired to use this visual form by this example: Vehicles involved in fatal crashes by Nathan Yau. I experimented with…
  • Visual Book Selector

    8 May 2013 | 5:00 am
    One common pattern I see in many interactive applications is to support a person who is selecting a few items from some larger set. Often these items have various characteristics that the person wants to use in some way to guide their selection process. The characteristics can be numeric quantities, dates, categories, or names of things. Showing all the items in a list and allowing the person to sort by one of the attributes is often a decent default solution. In other cases it's more useful to consider multiple attributes at a time during the selection process. Maybe you want items that are…
  • Star Wars Movie Fingerprints

    27 Mar 2013 | 4:35 am
    Recently YouTube had a video that showed all six Star Wars movies at once. They were placed in a 2 by 3 matrix and had an audio track of all the movies superimposed. It was an interesting experiment that has since been removed based on copyright grounds. Before it was removed I was able to do some simple analysis on the video and extract some details of the individual episodes of the Star Wars series. Basically, I produced something very similar to a classic work called Cinema Redux™ by Brendan Dawes, done in 2004. Each individual movie in the series was reduced to a collection of small…

    Trends and Outliers

  • Location, Location, Location: Making Better Use of Geographic Data

    Spotfire Blogging Team
    22 Apr 2014 | 5:55 am
    Oil and gas companies rely heavily on geographic data to determine where to drill, contending with the location and positioning of rock formations before extraction, etc. Likewise, retailers are heavily dependent on locational data to ensure that they’re placing stores close to the target customers they’re trying to reach. Meanwhile, retailers also regularly use locational data to strategically position inventory in store aisles. The use of locational data across various vertical sectors can provide decision makers with meaningful insights they can use to optimize operations or…
  • Transforming Store Operations with Analytics

    Spotfire Blog Editor
    21 Apr 2014 | 5:55 am
    Retail stores today pulsate with activity and generate streams of data about customer behaviors, customer traffic, employee performance, and inventory placement and sales. This creates incredible opportunities for retailers, their regional managers and their store managers to leverage analytics to better understand shifts in customer preferences, employee productivity, and merchandising strategies that can help them gain a competitive advantage. Point-of-sales (POS) systems offer retailers a font of information about customer behaviors. Store managers and other retail executives can analyze…
  • Analytics Maturity, Stage 2: Diagnose: The Root Cause of Business, Operational Conditions

    Spotfire Blog Editor
    17 Apr 2014 | 5:55 am
    In a previous post, we explained how the first stage of the Analytics Maturity Model, “Measure,” enables executives and front-line managers to obtain a quick, current status of the operational and business performance of their company. The second stage of the Analytics Maturity Model, “Diagnose,” is where business leaders are able to visually interact and drill into their data to discover additional answers to questions that arose in the Measure stage, e.g., an increase or decrease in monthly revenue for a particular region. Executives and other decision makers may also diagnose why…
  • Conquering the Key Challenges of Big Data

    Spotfire Blog Editor
    16 Apr 2014 | 5:55 am
    Big data offers companies a number of useful benefits, including opportunities for decision makers to gain deep insights about customers and market opportunities. When used effectively with analytics tools, big data can also help business leaders identify and stem emerging issues (e.g., a developing bottleneck in a company’s supply chain) – even before they’ve reached the surface. Companies Still Need to Address Big Data Challenges: Still, despite the tremendous opportunities that it offers, big data also presents some heady challenges to organizations. These include struggles among…
  • Transform Energy 2014 – The Value of Shared Experiences

    Spotfire Blog Editor
    15 Apr 2014 | 5:55 am
    By Steve Farr, Oil & Gas Industry Expert. Everyone in our industry attends functions, conventions, and shows that hint at the future of data analytics. Sure, we go to shows, but let’s be honest: who really remembers every detail of every presentation, 12 months earlier on a particular day? Very few people. But, sometimes you do. And when that happens, it’s because the presentation is insightful and has a deep impact. And that’s just what happened to me at last year’s Transform 2013. Transform Energy: Then & Now. My Wow moment happened during the presentation of the MaraDrill…
 

    PolicyMap

  • Kentucky Department for Public Health: PolicyMap for Preparedness

    Morgan Robinson
    22 Apr 2014 | 12:44 pm
    The Kentucky Department for Public Health is one of PolicyMap’s newest subscribers in the public health realm. The Public Health Preparedness office uses PolicyMap’s health and demographic data offerings to provide useful information for citizens and practitioners. Health Risks and Resources: Kentucky Public Health developed its Health Risks and Resources map to give local public health personnel greater knowledge of population characteristics to facilitate response to vulnerable populations in case of an emergency. By overlaying health resources with needs, public health practitioners…
  • PolicyMap’s new interface featured on Generocity.org

    Katie Nelson
    22 Apr 2014 | 7:05 am
    Read more on Generocity.org. The Reinvestment Fund Updates PolicyMap for Easier Use. By Andy Sharpe | Posted on Monday, April 21st, 2014. Philadelphia-based community development financial institution (CDFI) The Reinvestment Fund came out with a re-designed version of PolicyMap. The data mapping and analysis tool was tweaked to make it simpler to use and navigate. The website also has more functions, including advanced multi-tiered mapping and custom data tools, according to Maggie McCullough, president of PolicyMap. The mapping tool represents various data sets, including demographic, job,…
  • The Changing Face of the United States

    Kristin Crandall
    17 Apr 2014 | 9:31 am
    When it comes to data, some demographic trends are more easily captured than others. The country’s shifting racial and ethnic makeup is perhaps towards the top of this list. The fact that the US is an increasingly multiracial country has been discussed in many forums, such as the Smithsonian, the New York Times, and other news outlets. Last October, National Geographic published an interesting article called “The Changing Face of America” in which the article’s author, Lise Funderburg, and photographer, Martin Schoeller, attempt to put a human face on the country’s increasingly…
  • Use the Data Loader and upload your data today!

    Phil Vu
    17 Apr 2014 | 7:55 am
    PolicyMap’s data loader lets subscribers easily load their own address level files to view on top of any of the over 15,000 indicators available in PolicyMap. Choose to keep your data private, share it confidentially within your organization or post it for the public to access. Watch our video to see just how easy it is to use! But, don’t take our word for it. Try it out for yourself with a FREE 7-day trial! Visit our support page! Want to learn more about the many features available to you on PolicyMap? Go to the Support Page to find: the calendar of available training sessions free for…
  • Take a Virtual Historic Tax Credit Road Trip

    Morgan Robinson
    11 Apr 2014 | 1:05 pm
    The Historic Tax Credit program brings history to life, providing a 20% tax credit for the restoration of a certified historic structure that complies with rehabilitation guidelines. Good news for those of us who are both historic preservation nerds and PolicyMap users: historic tax credit sites have recently been updated to include projects approved during the 2013 fiscal year, making it easy to plan your next road trip from the comfort of your home or office. If you travel along Route 66, for example, you’ll pass through a great deal of history. Although many iconic Route 66 buildings…

    Revolutions

  • Simpson's Paradox in a nutshell

    David Smith
    22 Apr 2014 | 4:02 pm
    Norm Matloff points us to a pithy example that sums up Simpson's Paradox perfectly, captured in the title of a medical paper: "Good for Women, Good for Men, Bad for People". He explains how Simpson's Paradox isn't a paradox at all, but just the consequence of including a minor variable in a model ahead of a more significant variable, and illustrates this with an R analysis of the UCB admissions data. You can also see an interactive analysis of the same data here.
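The "Good for Women, Good for Men, Bad for People" pattern can be reproduced with a few lines of arithmetic. The counts below are invented purely for illustration. They are not taken from the medical paper or from the UCB admissions data, and the names are hypothetical. The sketch only shows how unequal group sizes can reverse a pooled comparison.

```python
# Invented (recovered, total) counts for illustration only; not real study data
groups = {
    "women": {"treated": (8, 10), "control": (70, 100)},
    "men": {"treated": (40, 100), "control": (3, 10)},
}

def rate(recovered, total):
    """Recovery rate as a fraction."""
    return recovered / total

def pooled(arm):
    """Recovery rate for one arm after merging both groups."""
    recovered = sum(groups[g][arm][0] for g in groups)
    total = sum(groups[g][arm][1] for g in groups)
    return rate(recovered, total)

for name, arms in groups.items():
    print(name, "treated:", rate(*arms["treated"]), "control:", rate(*arms["control"]))
print("pooled treated:", round(pooled("treated"), 3), "control:", round(pooled("control"), 3))
```

With these numbers the treated arm does better within each group, yet worse once the groups are merged, because most treated patients come from the group with the lower baseline rate.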
  • Webinar: Big-Data Trees for R

    David Smith
    21 Apr 2014 | 8:10 am
    If you missed last week's webinar presented by Revolution Analytics' US Chief Scientist Mario Inchiosa, Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise, the slides and webinar replay are now available for download. The webinar includes a demo of building decision trees and regression trees in Revolution R Enterprise, and using the Tree Viewer to inspect the resulting tree, starting at the 30:20 mark. If you'd like to learn more about the big-data tree models in Revolution R Enterprise, check out the white paper, Big Data Decision Trees…
  • Because it's Friday: This is why dogs hate wizards

    David Smith
    18 Apr 2014 | 1:48 pm
    What happens when you offer a dog a treat, but then make it vanish via sleight of hand? This: Like Sullivan, I'm surprised these dogs are fooled at all, and can't tell where the treat is by scent. That's all for this week. See you on Monday!
  • R and the weather in the local news

    David Smith
    18 Apr 2014 | 10:13 am
    The Mountain View Voice is a weekly newspaper serving the Silicon Valley area, and is a familiar sight to anyone wandering the streets of Palo Alto or Menlo Park. Angela Hey writes for 'Hey Tech!', an online blog of the Voice, and has just published a feature on R and the local Bay Area User Group (BARUG). It includes a nice history of R, and an in-depth recap of Ram Narasimhan's lightning talk on the weatherData package and his weatherCompare app at the last BARUG meeting. (You can read about other talks at that BARUG meetup in Joe Rickert's recap.) Read Angela…
  • DM Radio on Data Science

    David Smith
    18 Apr 2014 | 9:31 am
    A couple of weeks ago, I participated in a panel discussion for DM Radio: "Still Sexy? How's that Data Scientist Gig Working Out?". The title was provocative, but the discussion mostly revolved around the rise of data science and how advanced analytics (often implemented with R) is changing the way many companies do business today. Also on the panel hosted by Eric Kavanagh were Geoffrey Malafsky of Phasic Systems, John Whittaker of Dell, and Chandran Saravana of SAP. The podcast is now available online, which you can listen to at the link below. (And the answer is: Yes, still…

    iTrend Blog

  • Twitter analysis made easier

    iTrend LLC
    21 Apr 2014 | 6:27 am
    I was just sent a link to the following article on Gigaom: http://gigaom.com/2014/04/19/i-analyzed-more-than-a-million-bitcoin-tweets-heres-what-that-looks-like/ While the effectiveness of some of the charts presented is debatable, it was a great analysis overall – and you should definitely check it out. It also sounded like a lot of work. The author states: I’m neither a statistician nor a programmer, so I used relatively simple tools for analysis and worked with companies on other aspects. Gnip, which is now part of Twitter, supplied the data. I used Chartio’s cloud-based…
  • microtrending tool update – compress your Twitter timeline into Top 5

    iTrend LLC
    20 Jan 2014 | 6:05 pm
    ‘Basic’ version is free, you can try it here: http://micro.itrend.tv Now you can tell what it’s doing too – see status in the header.
  • Problems using our new micro trending tool? Twitter is having API issues.

    iTrend LLC
    31 Dec 2013 | 12:54 pm
    If our new micro trending tool (http://micro.itrend.tv/) isn’t displaying the Top 5 most interesting tweets properly – that’s because Twitter has been having performance issues with their API. We’ve already designed a workaround and will be updating the app over the next week or so. Happy New Year!
  • Following hundreds (or thousands) of people on Twitter? You must try this app

    iTrend LLC
    27 Dec 2013 | 9:11 pm
    Try beta version here: http://micro.itrend.tv
  • Sneak preview: new micro trending tool

    iTrend LLC
    19 Dec 2013 | 3:14 pm
    We are nearing the final stage of testing our new Twitter micro trending tool.  It turns keeping track of your friends (people you follow) into a magical experience. Try beta version here: http://micro.itrend.tv
 