Website overview

On this website, you will find:

You can also reach us by clicking the email icon in the top right corner of the website. Our GitHub page can be found by clicking the GitHub icon.

Research question

We divide our main research topic, the relationship between smoking and health insurance, into two parts:

  • Smoker insurance preferences: whether smokers tend to get insurance and what type of insurance they prefer.
  • Insurance charges prediction: personal factors that affect insurance purchasing and premiums.

In the first part, insurance preferences, we explore people’s insurance preferences together with their smoking status to investigate whether there is a potential association between smoking and insurance. We also investigate other variables that could act as confounders or interaction factors in the relationship between smoking and insurance.

The second part, insurance charges, supports our main research topic: we use personal factors (age, BMI, smoker, etc.) not only to explain which personal characteristics affect insurance cost, but also to build a prediction model for health insurance charges. Doing so gives us better insight into the underlying relationship between health insurance and smoking, along with other important variables that contribute to the variation in insurance cost.


Datasets for analysis:

  • Community Health Survey Public Use Data. This dataset contains survey questions regarding smoking and health insurance. We extracted the relevant questions from the survey for analysis.

  • Medical Cost Personal Dataset. This dataset sheds light on how insurance charges vary with personal characteristics such as smoking, age, BMI, and region.

Datasets for mapping:


  • Jing Lyu
  • Mengfan Luo
  • Yushan Wang
  • Yiqun Jin

We are Biostatistics students from Columbia University Mailman School of Public Health. This is a project for the course P8105 Data Science.