Analysis of Iris data
The dataset we have contains information about three species of iris: Iris setosa, Iris versicolor, and Iris virginica. The data include measurements of four features: the length and width of the sepal, and the length and width of the petal. Each row in the dataset represents a single flower, and the measurement values are given in centimeters.
To begin with, it would be good to get an idea of what we are actually talking about. What are these "irises"? Well, the iris is an ornamental plant of the temperate zone, meaning it is also found in Europe. It is closely related to saffron, freesia, and gladiolus. Since it is not very demanding, it is very easy to cultivate. It is valued for its interesting and striking flowers. It is used, among other things, in the perfume industry and herbal medicine. The photo below shows a typical representative of the species.
There are several hundred species of iris in the world. Today, we will focus on just three of them: Iris setosa, Iris versicolor, and Iris virginica:
Looking at them through the eyes of a layperson, I have to admit that they do indeed look quite pretty and could be the ornament of any garden. And from what I can see, it takes a professional botanist to tell them apart.
Returning to our numbers, we have a dataset that contains 150 records. Each record represents the data of a single plant. This data consists of measurements in centimeters of four features: the length and width of the flower petal, as well as the length and width of the sepal. And again, just so we know what we're talking about: flower petals are the elements located higher up, they are fragrant, and their function is to attract pollinating insects. On the other hand, sepals are located below. They are harder and more rigid than the petals. Their purpose, in the early stage, is to protect the flower bud. After the flower opens, they support the petals from below by bending downwards. The image below explains the situation:
# imports
import pandas as pd
import os
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# load the dataset into a DataFrame
df = pd.read_csv('25__iris.csv', sep=",", encoding='utf8')
Hypothesis:
Is there any relationship between the species of iris and the size of its sepal or petal? We will check this by analyzing the available dataset.
1.1 Conclusions from the analysis of basic data information:
Our dataset contains 150 records, 50 for each iris species. Besides the 'class' column (iris species), it has 4 columns of numerical data: sepal length, sepal width, petal length, and petal width. The measurements are given in centimeters, to the nearest millimeter. The data are complete: there are no missing values, so no imputation is needed.
Looking at the basic statistics, we immediately notice a huge spread between the minimum and maximum petal width: 0.1 cm versus 2.5 cm, a 25-fold difference. The first and third quartiles (0.3 cm and 1.8 cm) confirm that this is not the effect of a few outliers but a feature of the data as a whole. The standard deviation (0.76 cm against a mean of 1.20 cm) also shows that the values are highly dispersed. We will therefore take a closer look at this later.
Petal length also varies several-fold (from 1.0 cm to 6.9 cm), though less dramatically. The sepal measurements, by contrast, are much more concentrated: the largest values are only about twice the smallest. This means that even small flowers, with small petals, are protected by relatively large sepals. In other words, until a given flower opens, we have little idea of how large its petals will be.
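The spread figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration: it copies the mean, std, min and max values from the summary table in section 1.2 into a small DataFrame, uses shortened English row labels of my own (not the column names of the CSV file), and derives the max-to-min ratio and the coefficient of variation (standard deviation divided by the mean).

```python
import pandas as pd

# Values copied from the df.describe() table in section 1.2.
# The short row labels are illustrative, not the CSV column names.
summary = pd.DataFrame(
    {
        "mean": [5.84, 3.05, 3.76, 1.20],
        "std": [0.83, 0.43, 1.76, 0.76],
        "min": [4.3, 2.0, 1.0, 0.1],
        "max": [7.9, 4.4, 6.9, 2.5],
    },
    index=["sepal length", "sepal width", "petal length", "petal width"],
)

# How many times larger the maximum is than the minimum
summary["max_to_min"] = (summary["max"] / summary["min"]).round(1)
# Coefficient of variation: standard deviation relative to the mean
summary["cv"] = (summary["std"] / summary["mean"]).round(2)

print(summary[["max_to_min", "cv"]])
```

With these inputs the petal width ratio comes out at 25 and both sepal ratios at roughly 2, which is the contrast described above.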
1.2 Analysis of basic information about the data:
# basic information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   długość kielicha (sepal length)   150 non-null    float64
 1   szerokość kielicha (sepal width)  150 non-null    float64
 2   długość płatka (petal length)     150 non-null    float64
 3   szerokość płatka (petal width)    150 non-null    float64
 4   klasa (class)                     150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
# list of columns
df.columns
Index(['długość kielicha (sepal length)', 'szerokość kielicha (sepal width)',
'długość płatka (petal length)', 'szerokość płatka (petal width)',
'klasa (class)'],
dtype='object')
# a few random records, to get a feel for what we are dealing with
df.sample(5)
| | długość kielicha (sepal length) | szerokość kielicha (sepal width) | długość płatka (petal length) | szerokość płatka (petal width) | klasa (class) |
|---|---|---|---|---|---|
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor |
| 99 | 5.7 | 2.8 | 4.1 | 1.3 | Iris-versicolor |
| 78 | 6.0 | 2.9 | 4.5 | 1.5 | Iris-versicolor |
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor |
| 121 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
# number of unique values in each column
df.nunique()
długość kielicha (sepal length)     35
szerokość kielicha (sepal width)    23
długość płatka (petal length)       43
szerokość płatka (petal width)      22
klasa (class)                        3
dtype: int64
# check for missing values
df.isnull().sum()
długość kielicha (sepal length)     0
szerokość kielicha (sepal width)    0
długość płatka (petal length)       0
szerokość płatka (petal width)      0
klasa (class)                       0
dtype: int64
# number of records for each iris species
df['klasa (class)'].value_counts()
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: klasa (class), dtype: int64
# basic statistics
df.describe().round(2).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| długość kielicha (sepal length) | 150.0 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 |
| szerokość kielicha (sepal width) | 150.0 | 3.05 | 0.43 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 |
| długość płatka (petal length) | 150.0 | 3.76 | 1.76 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| szerokość płatka (petal width) | 150.0 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 |
2.1 Conclusions from the analysis of individual variables:
When analyzing the available data for each species separately, we immediately see that Iris-setosa has very small flower petals that fall within very narrow ranges, both in length and in width. Additionally, the shortest petals of the other two species (3.0 cm) are over 50% longer than its longest ones (1.9 cm). For petal width the gap is even greater: the narrowest petals of the other two species (1.0 cm) are about 67% wider than its widest (0.6 cm). This is immediately visible in the charts below. Therefore, it will be quite easy to distinguish this species by measuring its flower petals. We can thus assume that, at least in this sample, an iris with petals less than 2.5 cm long and less than 0.9 cm wide belongs to the Iris-setosa species.
This partially confirms our hypothesis that species can be distinguished by flower size measurements. Unfortunately, for the other two species (Iris-versicolor and Iris-virginica) the situation is not as clear. However, the charts below show that Iris-virginica tends to surpass its cousin in both petal length and petal width (mean 5.55 cm versus 4.26 cm, and 2.03 cm versus 1.33 cm).
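To make the separation of Iris-setosa concrete, the sketch below recomputes the gap from the per-species minima and maxima listed in section 2.2 and wraps the thresholds named above (2.5 cm and 0.9 cm) in a small helper. The names `percent_larger` and `is_setosa` and the constants are my own illustrative choices, and the rule is only known to hold for this 150-flower sample.

```python
# Per-species petal extremes in cm, copied from the table in section 2.2
SETOSA_MAX_LENGTH, SETOSA_MAX_WIDTH = 1.9, 0.6
OTHERS_MIN_LENGTH, OTHERS_MIN_WIDTH = 3.0, 1.0


def percent_larger(larger: float, smaller: float) -> float:
    """Return by how many percent `larger` exceeds `smaller`."""
    return round(100 * (larger - smaller) / smaller, 1)


def is_setosa(petal_length: float, petal_width: float) -> bool:
    """Hypothetical helper: the threshold rule from the text (2.5 cm, 0.9 cm)."""
    return petal_length < 2.5 and petal_width < 0.9


# Gap between setosa's largest petals and the smallest petals of the other species
print(percent_larger(OTHERS_MIN_LENGTH, SETOSA_MAX_LENGTH))
print(percent_larger(OTHERS_MIN_WIDTH, SETOSA_MAX_WIDTH))

# The rule accepts setosa's largest petals and rejects the others' smallest
print(is_setosa(SETOSA_MAX_LENGTH, SETOSA_MAX_WIDTH))
print(is_setosa(OTHERS_MIN_LENGTH, OTHERS_MIN_WIDTH))
```

With the table values the gap is roughly 58% for petal length and 67% for petal width, so both thresholds sit comfortably inside the empty band between the species.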
2.2 Analysis of individual variables:
# group the data by species
pd.set_option('display.max_colwidth', None)
# keep the species as the index, as in the table shown below
grouped = df.groupby('klasa (class)')
statystyki_gatunkami = grouped.agg(['mean', 'std', 'min', 'max']).round(2)
statystyki_gatunkami
| klasa (class) | długość kielicha (sepal length) | | | | szerokość kielicha (sepal width) | | | | długość płatka (petal length) | | | | szerokość płatka (petal width) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | mean | std | min | max | mean | std | min | max | mean | std | min | max | mean | std | min | max |
| Iris-setosa | 5.01 | 0.35 | 4.3 | 5.8 | 3.42 | 0.38 | 2.3 | 4.4 | 1.46 | 0.17 | 1.0 | 1.9 | 0.24 | 0.11 | 0.1 | 0.6 |
| Iris-versicolor | 5.94 | 0.52 | 4.9 | 7.0 | 2.77 | 0.31 | 2.0 | 3.4 | 4.26 | 0.47 | 3.0 | 5.1 | 1.33 | 0.20 | 1.0 | 1.8 |
| Iris-virginica | 6.59 | 0.64 | 4.9 | 7.9 | 2.97 | 0.32 | 2.2 | 3.8 | 5.55 | 0.55 | 4.5 | 6.9 | 2.03 | 0.27 | 1.4 | 2.5 |
# create a grid of plots with 1 row and 4 columns
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
columns = df.columns[0:4]
# draw one KDE curve per species for each feature
for i, col in enumerate(columns):
    sns.kdeplot(data=df, x=col, hue=df.columns[4], ax=axes[i])
Hypothesis
Perhaps the two remaining iris species can be distinguished using some relationship between the features we have? For example, the ratio of petal length to petal width might differ markedly between the two species.
Therefore, in the next step, we will look at all possible relationships between the data.
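As a rough first look at the ratio idea, the sketch below computes the petal length-to-width ratio per species. It is self-contained and uses only eight Iris-versicolor and Iris-virginica rows that are printed in this notebook's outputs (the `df.sample(5)` rows in section 1.2 and the three Iris-virginica rows listed in section 3.2). The short column names and the helper `petal_ratio_by_species` are my own, so this shows the mechanics rather than a result for the full dataset.

```python
import pandas as pd


def petal_ratio_by_species(frame, length_col, width_col, class_col):
    """Summarise the petal length / width ratio for each species."""
    ratio = frame[length_col] / frame[width_col]
    return ratio.groupby(frame[class_col]).agg(["mean", "min", "max"]).round(2)


# Eight rows copied from outputs printed in this notebook.
# They are NOT a representative sample of the 150 flowers.
demo = pd.DataFrame(
    {
        "petal_length": [4.5, 4.1, 4.5, 3.3, 4.9, 4.5, 5.0, 5.1],
        "petal_width": [1.3, 1.3, 1.5, 1.0, 2.0, 1.7, 1.5, 1.5],
        "species": ["Iris-versicolor"] * 4 + ["Iris-virginica"] * 4,
    }
)

ratios = petal_ratio_by_species(demo, "petal_length", "petal_width", "species")
print(ratios)
```

On the notebook's own DataFrame the same helper could be called as `petal_ratio_by_species(df, 'długość płatka (petal length)', 'szerokość płatka (petal width)', 'klasa (class)')`. In these eight rows the ratio ranges of the two species overlap, which hints that the ratio alone may not be enough, but that would need to be checked on all 150 records.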
3.1 Conclusions from the analysis of dependencies between the data:
So far, we have determined how to distinguish the species Iris-setosa from our sample. Its petals are so small that a simple size criterion is sufficient in this case. Therefore, in further analysis, we can focus on the remaining two species of iris.
I started the analysis of dependencies between the data in a somewhat unusual way, beginning with checking for outliers. I wanted to see if it was these values that might be distorting our perception of the data. However, a glance at the boxplots showed that outliers are not relevant here. The petal and sepal sizes for the species Iris-versicolor and Iris-virginica are quite similar to each other. When it comes to sepal parameters, these values are so intermixed that their separation is rather unlikely. However, in the case of the petals, it seems there may be some room for differentiation.
However, to be completely certain, I generated scatterplots for all possible pairs of features. The resulting images make it clear that the most promising pair is petal length and petal width. I then enlarged this plot, removing the Iris-setosa species for clarity. Immediately, I noticed that it is possible to outline areas occupied by only one species. I chose the simplest possible variant: a rectangle.
I arbitrarily set its boundaries at 0 to 1.7 cm for petal width and 0 to 5.1 cm for petal length. Nearly all Iris-versicolor samples fall within this area (the per-species table in section 2.2 lists a maximum petal width of 1.8 cm, so at least one specimen lies just outside it), along with only 3 specimens of Iris-virginica. I concluded that an error of 3 in 50 (6%) for one species and close to zero for the other is quite promising. Of course, we are operating on a sample of only 150 individuals. With a larger sample, the boundaries of the area may well need to be modified.
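The rectangle test itself is just two comparisons. The sketch below applies it to a small hand-made DataFrame and tabulates the result per species with `pd.crosstab`. The rows are copied from outputs shown in this notebook, the short column names are my own, and `in_box` is a hypothetical helper, so this illustrates the mechanics rather than re-running the full analysis.

```python
import pandas as pd

# Boundaries chosen in the text (cm)
MAX_WIDTH_CM = 1.7
MAX_LENGTH_CM = 5.1


def in_box(frame: pd.DataFrame) -> pd.Series:
    """Label each flower as inside or outside the petal-size rectangle."""
    inside = (frame["petal_length"] <= MAX_LENGTH_CM) & (frame["petal_width"] <= MAX_WIDTH_CM)
    return inside.map({True: "inside", False: "outside"}).rename("box")


# Eight rows copied from outputs printed in this notebook (not the full dataset)
demo = pd.DataFrame(
    {
        "petal_length": [4.5, 4.1, 4.5, 3.3, 4.9, 4.5, 5.0, 5.1],
        "petal_width": [1.3, 1.3, 1.5, 1.0, 2.0, 1.7, 1.5, 1.5],
        "species": ["Iris-versicolor"] * 4 + ["Iris-virginica"] * 4,
    }
)

# Rows: true species, columns: position relative to the rectangle
table = pd.crosstab(demo["species"], in_box(demo))
print(table)

# Share of Iris-virginica inside the box, using the counts reported in the text
virginica_error = 3 / 50
print(f"{virginica_error:.0%}")
```

Because the comparisons use `<=`, points lying exactly on the boundary (for example a petal width of 1.7 cm) count as inside, which matches the filter used in section 3.2.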
3.2 Analysis of the relationships between the data:
# create a grid of plots with 1 row and 4 columns
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
columns = df.columns[0:4]
fig.suptitle('Boxplots of individual flower features by class (species)', fontsize=20)
# draw the boxplots
for i, col in enumerate(columns):
    sns.boxplot(x='klasa (class)', y=col, data=df, ax=axes[i])
# draw every possible scatterplot to see how the data are distributed
# 4 features give 12 ordered (y, x) pairs, one for each subplot in the 2 x 6 grid
fig, axs = plt.subplots(2, 6, figsize=(18, 6))
for ax, (y_col, x_col) in zip(axs.flat, itertools.permutations(columns, 2)):
    sns.scatterplot(data=df, x=x_col, y=y_col, hue='klasa (class)', ax=ax, legend=False)
plt.tight_layout()
plt.show()
# draw one large scatterplot for the selected pair of features
plt.figure(figsize=(8, 6))
# drop 'Iris-setosa' from the data
filtered_df = df[~df['klasa (class)'].isin(['Iris-setosa'])]
# strongly contrasting colors
colors = {'Iris-versicolor': 'red', 'Iris-virginica': 'green'}
# draw the plot
plt.scatter(filtered_df['szerokość płatka (petal width)'],
            filtered_df['długość płatka (petal length)'],
            c=filtered_df['klasa (class)'].map(colors))
# labels and title
plt.xlabel('Petal width (cm)')
plt.ylabel('Petal length (cm)')
plt.title('Scatter plot: petal width vs. petal length')
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10, label=label)
                    for label, color in colors.items()])
plt.grid(True)
# read the axis limits once; axvline/axhline take their extents as axis fractions
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
# vertical segment at x = 1.7, from the x-axis up to y = 5.1
plt.axvline(x=1.7, ymin=0, ymax=(5.1 - y_min) / (y_max - y_min), color='black', linestyle='--')
# horizontal segment at y = 5.1, from the y-axis across to x = 1.7
plt.axhline(y=5.1, xmin=0, xmax=(1.7 - x_min) / (x_max - x_min), color='black', linestyle='--')
plt.show()
# count the Iris-virginica records that fall inside the rectangle (misclassified)
df[(df['długość płatka (petal length)'] <= 5.1) & (df['szerokość płatka (petal width)'] <= 1.7) & (df['klasa (class)'] == 'Iris-virginica')].value_counts()
długość kielicha (sepal length)  szerokość kielicha (sepal width)  długość płatka (petal length)  szerokość płatka (petal width)  klasa (class)
4.9                              2.5                               4.5                            1.7                             Iris-virginica    1
6.0                              2.2                               5.0                            1.5                             Iris-virginica    1
6.3                              2.8                               5.1                            1.5                             Iris-virginica    1
dtype: int64
4.1 Final conclusions from the data analysis:
We were given a dataset describing three species of iris: Iris setosa, Iris versicolor, and Iris virginica (150 records, 50 for each species). Each record represents a single flower and includes only four features: sepal length and width, and petal length and width (accurate to 0.1 cm). We have shown that the flowers in this sample can largely be assigned to their respective species on the basis of these measurements.
Specifically, Iris setosa has petals between 1.0 cm and 1.9 cm long and between 0.1 cm and 0.6 cm wide. Of the remaining flowers, those with petals at most 5.1 cm long and at most 1.7 cm wide are assigned to Iris versicolor, and all others to Iris virginica.
By somewhat arbitrarily setting the boundary petal length at 5.1 cm and the width at 1.7 cm, we captured nearly all representatives of the Iris-versicolor species (its maximum petal width of 1.8 cm in the per-species table means at least one specimen falls just outside the boundary). Unfortunately, these limits also capture three representatives of the Iris-virginica species. However, an error of 3 in 50 (6%) with such a simplified rule should be considered acceptable. Additionally, the analysis showed that sepal size contributes little to telling Iris-versicolor and Iris-virginica apart, because their sepal measurements overlap heavily.
Thus, the hypothesis posed at the beginning, that the species can be distinguished by their measurements, is confirmed for this sample.
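Put together, the rules from this analysis amount to a three-branch decision function. The sketch below is one possible reading of them: the name `classify_iris` is hypothetical, the thresholds are the ones quoted in the text, and they were chosen by eye on this 150-flower sample, so they should not be expected to carry over unchanged to other data.

```python
def classify_iris(petal_length: float, petal_width: float) -> str:
    """Assign a species from petal size in cm, using the thresholds from the text."""
    # Iris-setosa: clearly smaller petals than the other two species
    if petal_length < 2.5 and petal_width < 0.9:
        return "Iris-setosa"
    # Iris-versicolor: inside the 5.1 cm x 1.7 cm rectangle
    if petal_length <= 5.1 and petal_width <= 1.7:
        return "Iris-versicolor"
    # Everything else is treated as Iris-virginica
    return "Iris-virginica"


# Rows taken from outputs shown in this notebook:
# (petal length, petal width, true species)
examples = [
    (4.5, 1.3, "Iris-versicolor"),
    (4.9, 2.0, "Iris-virginica"),
    (4.5, 1.7, "Iris-virginica"),  # one of the three known misclassifications
]
for length, width, truth in examples:
    predicted = classify_iris(length, width)
    print(f"{length} x {width}: predicted {predicted}, actual {truth}")
```

The order of the checks matters: the setosa test runs first because every setosa petal would also fit inside the larger rectangle.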
Finally, below are schematic drawings of all three species. Now, the differences between them are immediately visible :-)