Data Mining

  • Most Topular Stories

  • Why Overfitting is More Dangerous than Just Poor Accuracy, Part I

    Data Mining and Predictive Analytics
    1 May 2014 | 1:20 pm
    Arguably, the most important safeguard in building predictive models is complexity regularization to avoid overfitting the data. When models are overfit, their accuracy is lower on new data that wasn’t seen during training, and therefore when these models are deployed, they will disappoint, sometimes even leading decision makers to believe that predictive modeling “doesn’t work”. Overfit, however, is thankfully a well-known problem and every algorithm has ways to avoid it. CART® and C5 trees use pruning to remove branches that are prone to overfitting, CHAID trees require splits are…
  • Data Mining's Forgotten Step-Children

    Data Mining and Predictive Analytics
    17 Jul 2015 | 8:15 pm
    Depending on whose definition one reads, the list of activities which comprise data mining will vary, but the first two items are always the same... Number 1: Prediction. The most common data mining function, by far, is prediction (or, more esoterically, supervised learning), which is sometimes listed twice, depending on the type of variable being predicted: classification (when the target is categorical) vs. regression (when the target is numerical). Predictive models learned by machines from historical examples easily occupy the most of almost any measure of data mining: time, money,…
  • 5th International Workshop on Mobile Entity Localization and Tracking in GPS-less Environments

    Data mining News
    27 Jul 2015 | 4:54 pm
    … mapping, efficient fingerprinting techniques, location data mining and prediction, location data acquisition …
  • Behind The Numbers: GDP, the Sequel

    The Numbers
    Brian Hershberg
    24 Jul 2015 | 9:57 am
    When the Commerce Department releases its first reading of second-quarter gross domestic product this coming Thursday, there will also be a spate of annual revisions. This happens every year and serves to give more retroactive perspective on the U.S.'s economic growth trajectory.
  • The Screen Savers

    Kevin Hillstrom: MineThatData
    Kevin Hillstrom
    26 Jul 2015 | 8:10 pm
    I'll bet almost none of you remember this little-watched show from 1999. Here, take a look at this episode, assuming you have nothing better to do with the next 42 minutes of your time (click here if you do not see the video box). Pretty compelling stuff, wouldn't you say? How about that CD-R / CD-RW technology they're talking about? Or converting sound to digital? Wow. Or go to the 43 minute mark and take a look at the sophistication of the websites that are being featured. Why in the name of all that is good am I interrupting your pending omnichannel integration meeting (#seamlesscommerce) to…
 

    The Numbers

  • Behind The Numbers: GDP, the Sequel

    Brian Hershberg
    24 Jul 2015 | 9:57 am
    When the Commerce Department releases its first reading of second-quarter gross domestic product this coming Thursday, there will also be a spate of annual revisions. This happens every year and serves to give more retroactive perspective on the U.S.'s economic growth trajectory.
  • Pi the Numbers: 22/7

    Brian Hershberg
    22 Jul 2015 | 8:27 am
    In the day/month format that is used in some parts of the world, 22/7 is celebrated in some circles (ahem) as Pi Approximation Day, where 22 divided by 7 yields a quotient that represents pi.
  • Behind The Numbers: Baby-Name Data

    Jo Craven McGinty
    17 Jul 2015 | 10:00 am
    One reason social scientists and others use U.S. baby names to study cultural trends is simply because the data are readily available.
  • Behind The Numbers: a PSA on SPFs

    Jo Craven McGinty
    10 Jul 2015 | 9:06 am
    One thing fair-skinned beach lovers should keep in mind is that the surest way to prevent sunburn is to avoid exposure entirely: Seek out shade and wear protective clothing and sunglasses that filter out UV rays.
  • June Jobs Report – The Numbers

    Kate Davidson
    2 Jul 2015 | 6:06 am
    Jobs, wages, labor-force participation and more.
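
For the Pi Approximation Day item above, a small sketch of how close 22 divided by 7 actually comes to pi. This is an illustration added here, not something from the original post, and the variable names are arbitrary.

```python
import math

# 22/7 is the classic fractional approximation of pi.
approx = 22 / 7

# Absolute and relative error against the floating-point value of pi.
abs_error = abs(approx - math.pi)
rel_error = abs_error / math.pi

print(f"22/7           = {approx:.6f}")
print(f"pi             = {math.pi:.6f}")
print(f"absolute error = {abs_error:.6f}")
print(f"relative error = {rel_error:.4%}")
```

The quotient overshoots pi in the third decimal place, a relative error of roughly 0.04%, so 22/7 approximates pi rather than equaling it.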

    Data Mining and Predictive Analytics

  • Data Mining's Forgotten Step-Children

    17 Jul 2015 | 8:15 pm
    Depending on whose definition one reads, the list of activities which comprise data mining will vary, but the first two items are always the same... Number 1: Prediction. The most common data mining function, by far, is prediction (or, more esoterically, supervised learning), which is sometimes listed twice, depending on the type of variable being predicted: classification (when the target is categorical) vs. regression (when the target is numerical). Predictive models learned by machines from historical examples easily occupy the most of almost any measure of data mining: time, money,…
  • Similarities and Differences Between Predictive Analytics and Business Intelligence

    24 Jul 2014 | 10:56 pm
    I’ve been reminded recently of the overlap between business intelligence and predictive analytics. Of course any reader of this blog (or at least the title of the blog) knows I live in the world of data mining (DM) and predictive analytics (PA), not the world of business intelligence (BI). In general, I don’t make comments about BI because I am an outsider looking in. Nevertheless, I view BI as a sibling to PA because we share so much in common: we use the same data, often use similar metrics and even sometimes use the same tools in our analyses. I was interviewed by Victoria Garment of…
  • Why Overfitting is More Dangerous than Just Poor Accuracy, Part II

    26 May 2014 | 9:07 am
    In Part I, I explained one problem with overfitting the data: estimates of the target variable in regions without any training data can be unstable, whether those regions require the model to interpolate or extrapolate. Accuracy is a problem, but more precisely, the problems in interpolation and extrapolation are not revealed using any accuracy metrics and only arise when new data points are encountered after the model is deployed. This month, a second problem with overfitting is described: unreliable model interpretation. Predictive modeling algorithms find variables that associate or…
  • Why Overfitting is More Dangerous than Just Poor Accuracy, Part I

    1 May 2014 | 1:20 pm
    Arguably, the most important safeguard in building predictive models is complexity regularization to avoid overfitting the data. When models are overfit, their accuracy is lower on new data that wasn’t seen during training, and therefore when these models are deployed, they will disappoint, sometimes even leading decision makers to believe that predictive modeling “doesn’t work”. Overfit, however, is thankfully a well-known problem and every algorithm has ways to avoid it. CART® and C5 trees use pruning to remove branches that are prone to overfitting, CHAID trees require splits are…
  • Data Science and Big Data Search Trends

    16 Jan 2014 | 7:20 am
    These are from Google Trends. Data science is growing, but still way behind other traditional terms for our field such as data mining, predictive analytics and machine learning. Big data on the other hand is growing rapidly and outpacing other fields. This page for now is just an FYI that I may refer to in today's DM Radio show, "Blinded by Data Science" (http://bit.ly/KfK74l). I hope to turn it into a more cogent blog post soon (but no promises!) “data science”; “data science” vs. “data mining”; “data science vs. predictive analytics”; “data science” vs. “machine…
 

    Kevin Hillstrom: MineThatData

  • The Screen Savers

    Kevin Hillstrom
    26 Jul 2015 | 8:10 pm
    I'll bet almost none of you remember this little-watched show from 1999. Here, take a look at this episode, assuming you have nothing better to do with the next 42 minutes of your time (click here if you do not see the video box). Pretty compelling stuff, wouldn't you say? How about that CD-R / CD-RW technology they're talking about? Or converting sound to digital? Wow. Or go to the 43 minute mark and take a look at the sophistication of the websites that are being featured. Why in the name of all that is good am I interrupting your pending omnichannel integration meeting (#seamlesscommerce) to…
  • The Omnichannel Customer Value Query

    Kevin Hillstrom
    23 Jul 2015 | 8:10 pm
    When a customer purchases for the first time, the customer has a whopping total of one purchase. This means that the customer cannot have purchased from more than one channel. When a customer purchases for the second time, the customer may elect to use the same channel (an affiliate), or the customer may use a different channel (email). If you are a credible marketer, you probably encouraged the customer to sign up for email marketing messages, so it is very possible that you nudged the customer into a second purchase via a second channel ... even if the customer was going to purchase…
  • A Contrast Between Catalog Marketers and E-Commerce Marketers

    Kevin Hillstrom
    22 Jul 2015 | 8:10 pm
    Two recent conversations illustrate the significant difference between how catalogers operate, and how e-commerce companies operate. I will outline each comment, and then not offer my thoughts about each comment. Instead, I am asking you to think - think about the downstream impact of each comment. Catalog Marketer Comment: "We're in the process of merchandising our Spring 2016 catalogs, and we're plotting out page counts for Fall 2016 catalogs. It's important that we know what merchandise will be in the catalogs, so that we can accurately forecast segment-level demand. Then, we can get busy…
  • Revenue Streams / Tolls

    Kevin Hillstrom
    21 Jul 2015 | 8:10 pm
    When you watch something on ESPN, you are part of a large toll-based ecosystem. Click here to see how much you pay, each month, even if you never ever watch ESPN. Basically, you are spending somewhere between $5.75 per month and $7.00 per month, depending upon how many ESPN channels are available to you. Then think about what ESPN does: Collect monthly money from individuals who never, ever watch the channel (eight billion dollars per year ... how much money do you collect from folks who never buy your merchandise?) Collect advertising revenue from brands looking to connect with those who love…
  • The Death Of An Item

    Kevin Hillstrom
    20 Jul 2015 | 8:10 pm
    Every item has a birth, and a death. Maybe that's the way everything is on Planet Earth. Now, the life of an item is messy. Undeserving items receive disproportionate attention, and consequently, perform above their pay grade. Highly deserving items are ignored, literally dying from starvation. And as anybody knows, when a Management change happens, items die and items are born. The key, of course, is to identify "peak life". For so many companies I work with, the curve looks something like what is depicted in the bar chart above. Often, items show significant promise, from the get-go. That…
 

    TIBCO Spotfire's Trends and Outliers

  • Drilling Into Insights at the 13th Annual TIBCO Houston Energy Forum

    Spotfire Blogging Team
    27 Jul 2015 | 6:49 am
    Executives and professionals in the energy industry are continuously uncovering new opportunities to leverage data sources and analytics to find new energy sources, drive growth opportunities, and improve operational efficiency. On September 1 and 2, TIBCO will bring together industry leaders from all segments of the energy industry for our 13th Annual Houston Energy Forum where attendees will network, listen to industry luminaries, and learn best practices directly from TIBCO customers. This must-attend event, taking place at the Norris Conference Center in downtown Houston, will spotlight…
  • Spotfire Tips & Tricks: Adding R Graphics to Spotfire (Case Study: Dendrograms)

    Jagrata Minardi
    23 Jul 2015 | 7:32 am
    In this post we explore adding R graphics to Spotfire. This is a case study of using dendrograms for the Spotfire Classification Modeling tool. We will see how to augment the output of Classification Tree modeling in Spotfire with a commonly used graphic that summarizes complex models in an intuitive way. The steps to accomplish this will reveal several very useful ideas. The Spotfire Analysis that we create can be used either on the desktop or through the web. Usage through the web requires TIBCO Statistics Services, configured to run on TERR. The Spotfire Classification Modeling tool (Tools…
  • Patient-Generated Health Data, Analytics Boost Organizations’ ROI

    Spotfire Blogging Team
    22 Jul 2015 | 6:54 am
    Patient-generated health data from such things as wearable devices, mHealth apps, and remote sensors is poised to become a key asset for healthcare organizations, particularly those that understand and embrace the benefits that Big Data analytics have to offer. In fact, 73% of organizations that have adopted personalization technologies have seen positive financial results, according to a new report from Accenture. The “Accenture Healthcare Technology Vision 2015,” report highlights emerging technology trends that will affect the health industry in the next three to five years. Just about…
  • 4 Ways Big Data Is Transforming the Insurance Industry

    Spotfire Blogging Team
    20 Jul 2015 | 6:46 am
    In order to succeed and be competitive in the ever-changing insurance industry, it’s a no-brainer that insurers must leverage Big Data and analytics. The insights gleaned from Big Data play a pivotal role in helping insurance companies solve some of the industry’s biggest challenges, according to an article in Investopedia. Capturing and analyzing structured data associated with their policyholders and unstructured data from various sources—including social media—can help insurers evaluate the risks of insuring a particular person and set the premium for the policy accordingly, the…
  • Citibank Brazil Banks on Digital Business and TIBCO for Increased Competitiveness

    Ann Scheuerell
    16 Jul 2015 | 5:55 am
    In Brazil, Citibank provides banking services like insurance, loans, credit cards, and investment products, to the commercial banking market with a focus on medium to large businesses. Banking in Brazil is competitive. “We’re competing against other large banks that are in more cities than we are,” says Roberto Mercadante, senior vice president of Citibank Brazil. “They have branches, and we are going digital. We’re looking to significantly improve customer satisfaction, and for that, we need efficiency gains.” In 2013, to gain a competitive edge, Citibank began implementing a…

    PolicyMap

  • Winner of the First #DataWiz Contest!

    Lauren Parker
    27 Jul 2015 | 11:33 am
    We just wrapped up our first DataWiz contest and wow, the responses definitely set the bar high. We received submissions in the form of tweets, emails, Census FIPS codes, and even original maps. It was, in a word, impressive. In case you didn’t catch the quiz question, we asked, “What are the most racially diverse places in the U.S.?” Granted, this was an open-ended question and that was intentional on our part. We know that diversity and place can be interpreted in a number of different ways: is diversity most meaningfully measured at the scale of a neighborhood? County? City? The…
  • Looking for the Local Boundaries?

    Phil Vu
    24 Jul 2015 | 12:50 pm
    You might have noticed that our Map Boundaries menu has been cleaned up a bit. We have removed the Local Boundaries menu. As the name implies, these boundaries are mainly localized and can only be seen if you are in the city or state that they represent, which is why we have removed them for all users. We have only removed them from display, so don’t worry: they are still there and can be added to your account easily. Below is the list of available Local Boundaries (in alphabetical order) that can be added to your account. Updated on 7/24/2015: Camden Parcels Chicago Community…
  • Americans with Disabilities Act Turns 25

    Aaron King
    24 Jul 2015 | 10:09 am
    This upcoming Sunday, July 26th marks the 25th anniversary of the signing of the Americans with Disabilities Act (ADA) by President George H.W. Bush.  Since its signing into law, the ADA has enabled millions of Americans with disabilities to participate in the workforce by removing legal barriers to employment. This landmark piece of legislation prohibits discrimination against people with disabilities within the workplace as well as in receiving services from federal, state, and local governments. According to the US Census Bureau, in 2010, there were 56.7 million Americans with…
  • Think You Know Data? Show Your Skills In Our Data Wizards (#DataWiz) Contests

    Lauren Parker
    22 Jul 2015 | 10:35 am
    Knowing about data is just too much fun. Often, too much fun to just keep it to yourself. We know how you feel. Show off your nerd knowledge in our new series of #DataWiz contests! We have officially announced our first #DataWiz contest, and it is well underway! The question is: What are the most racially diverse places in America? We’ve had great responses so far – keep them coming! Submit your answers by the stroke of midnight on Sunday, July 26th for a chance to win fame, glory, and even a prize from PolicyMap! Answers can be submitted via Tweet with the hashtag #datawiz, email…
  • The Art of Maps: Making Effective Visualizations

    Bernie Langer
    21 Jul 2015 | 12:21 pm
    Data is the key to maps. But show that data in a confusing, unattractive, or misleading way, and the power of your data is lost. Normally, Mapchats focus on using good data, but this time we’ll focus on the nuts and bolts of making good maps. Tuesday, 7/28 | 3 PM EST PolicyMap’s popular Mapchats series continues next week with a panel of leaders in online mapping, including Robert Cheetham from Azavea, Jake Garcia from Foundation Center, and PolicyMap’s own Bernie Langer. The topics to be discussed will include picking the right colors for a map, choosing the right map for the right…

    Revolutions

  • Hadley Wickham on why he created all those R packages

    David Smith
    27 Jul 2015 | 11:50 am
    Priceonomics published on Friday an in-depth profile of Hadley Wickham, author of many of the most popular R packages including ggplot2, dplyr and devtools. In the article, he reveals that his motivation for creating these packages was primarily to provide better ways of accomplishing routine tasks in R, an immensely useful contribution that sadly wasn't recognized in an academic setting. He said: “There are definitely some academic statisticians who just don’t understand why what I do is statistics, but basically I think they are all wrong. What I do is fundamentally statistics.
  • Because it's Friday: Good vibrations

    David Smith
    24 Jul 2015 | 1:50 pm
    If you've ever wanted to see how guitar strings actually move as they make music, it turns out you don't need an expensive high-speed camera. All you need to do is set your smartphone to record video, and put it inside: While some have suggested this is due to the rolling shutter in an iPhone camera, I'm not so sure. I think this is just an example of motion frequencies resonating with the video frame rate — the same effect that makes wagon wheels appear to move without rotating in old black-and-white films. Whatever the cause, it's cool to see the shapes the strings…
  • R #6 in IEEE 2015 Top Programming Languages, Rising 3 Places

    David Smith
    24 Jul 2015 | 8:05 am
    IEEE Spectrum has published its 2015 list of Top Programming Languages, and R ranks in 6th place, jumping 3 places from its 2014 ranking. Here's what the IEEE has to say about the top 10 from the table above: The big five—Java, C, C++, Python, and C#—remain on top, with their ranking undisturbed, but C has edged to within a whisper of knocking Java off the top spot. The big mover is R, a statistical computing language that’s handy for analyzing and visualizing big data, which comes in at sixth place. Last year it was in ninth place, and its move reflects the growing importance of…
  • Revolution R Open 3.2.1 now available

    David Smith
    23 Jul 2015 | 9:52 am
    The latest update to Revolution R Open, RRO 3.2.1, is now available for download from MRAN. This release upgrades to the latest R engine (3.2.1), enables package downloads via HTTPS by default, and adds new supported Linux platforms. Revolution R Open 3.2.1 includes:  The latest R engine, R 3.2.1. Improvements in this release include more flexible character string handling, reduced memory usage, and some minor bug fixes. Multi-threaded math processing, reducing the time for some numerical operations on multi-core systems. A focus on reproducibility, with access to a fixed CRAN snapshot…
  • Sunbelt XXXV, Social Network Analysis, Statnet and R

    Joseph Rickert
    23 Jul 2015 | 8:30 am
    by Joseph Rickert The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.) A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that: and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton. Social network…
 

    Data Science Notes

  • Kansas Teacher Salaries: The $7,000 Mistake

    27 Jul 2015 | 8:13 am
    Last Friday, Governor Brownback in Kansas held a press conference to cover various topics from abortion to education. Recent news stories in Kansas had held that teachers were leaving the state (and profession) in record numbers due to a de-funding of education. Brownback moved to address this issue with the following chart: Brownback touting teacher pay in KS. Recent reports say teachers may be leaving KS for MO. #ksleg pic.twitter.com/c1T8h4qmyb — Stephen Koranda (@kprkoranda) July 24, 2015. So, if you follow this blog you know I am already annoyed at the chart that's not…
  • Text Mining: Lafayette Shooting

    24 Jul 2015 | 2:39 pm
    Today I learned that Zayn Malik may be coming back to One Direction. That's not the point of this post, but unfortunately, more on that later. This morning I noticed that the #Lafayetteshooting hashtag was getting a lot of action, and at least on my Twitter timeline, the talk was skewed towards a gun control conversation. I've heard two kinds of accusations around this, that people are either too willing or not willing enough to talk about gun control after a mass shooting. People are obviously talking gun control already, but can we quantify it? Hey, that code I've been…
  • Text Mining: Mining the BlackLivesMatter Hashtag

    24 Jul 2015 | 7:32 am
    Last night I asked my wife if she had read yesterday's post on topic mining the #ksleg hashtag, remarking I have never had topic models converge to news stories quite that well, blah blah blah; nerd nerd nerd. Honestly, she looked a little unimpressed. And then finally said, "why don't you do that on the #blacklivesmatter hashtag?" She was right. I had mined a policy-wonk centered, Kansas specific hashtag. Why? I should do something with more volume, that people actually care about. And so I went to downloading. NERD STUFF: Just a few nerd notes on mining…
  • Text Mining Part Three: A Better Use Case

    22 Jul 2015 | 6:00 pm
    Mining my own tweets yesterday was interesting, but wasn't the most illustrative example of good text analysis. There are a few reasons for that, but it's mainly about the weird way that I use twitter. So what happens with a "tighter" and more topic friendly list of tweets? Here's a quick look using some Kansas government tweets. DATA SET: For this dataset I used a more "topical" set of tweets, specifically with the hashtag: "#ksleg" meaning that they have something to do with Kansas government. I've blogged about these kind of topics before regarding taxes, elections,…
  • Text Mining Part 2: Fitting Topic Models

    22 Jul 2015 | 8:36 am
    Finally have time to write my second post on text mining. My first post was late last week, and dealt with the rather boring subject of importing and cleaning data. This one gets into some actual analysis. THE BASICS: So what types of models will I be creating here? Topic models. Topic models are a way to analyze the terms (words) used in documents (for this purpose, tweets) and group the documents together into topics. In simple terms, the algorithms used find which words are often used together and cluster those word commonalities into topics. It's a way to…
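
The "words often used together" idea in the topic-model excerpt above can be sketched with plain co-occurrence counts. This is only a rough illustration under stated assumptions: the toy documents and the `cooccurrence` helper are hypothetical, the excerpt does not say which algorithm or library the author used, and counting pairs is just the raw signal a topic model builds on, not a fitted topic model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy "tweets"; not data from the original post.
docs = [
    "school funding budget cut",
    "teacher pay school budget",
    "tax cut income tax",
    "income tax budget",
]


def cooccurrence(documents):
    """Count how often each unordered word pair shares a document."""
    counts = Counter()
    for text in documents:
        # De-duplicate and sort so each pair has one canonical order.
        words = sorted(set(text.split()))
        counts.update(combinations(words, 2))
    return counts


pairs = cooccurrence(docs)

# Pairs that recur across documents hint at a shared topic.
for (first, second), n in pairs.most_common(3):
    print(first, second, n)
```

In this toy data the pairs ("budget", "school") and ("income", "tax") each appear in two documents, which is the kind of regularity a real topic model would group into an education topic and a tax topic.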