-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathStrategy.txt
60 lines (37 loc) · 2.15 KB
/
Strategy.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
Hi Pauline,
I was thinking that maybe we should devise a strategy wrt analyzing the missing data.
We should focus on those features that show a **strong** correlation with the `Overall_Experience`, and have
a **large number** of missing values.
1. I think we should focus on the correlated features that are purple, and maybe green, if time permits.
2. Let's say features X and Y are correlated.
Then,
df.groupby([X, Y])[X, Y].count().reset_index()
Will give us an indication of how the values are distributed.
3. Now, we need to identify the missing values.
df[df[X].isna()].groupby(Y)[Y].count()
Should give us the values for Y when X is null.
Use the distribution on step #2 to imput values for X.
Of course, there's the case when both X and Y are null, and that should be treated further (or not).
4. Repeat step #3 for the Y feature.
====================
Added on April 21st.
====================
Hi Pauline,
I looked at the impute documentation and I came across the following.
My understanding is that it does what I've described above, and that makes me happy!
https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation
======================
A more sophisticated approach is to use the IterativeImputer class, which models each
feature with missing values as a function of other features, and uses that estimate for
imputation. It does so in an iterated round-robin fashion: at each step, a feature column
is designated as output y and the other feature columns are treated as inputs X.
A regressor is fit on (X, y) for known y.
Then, the regressor is used to predict the missing values of y.
This is done for each feature in an iterative fashion, and then is repeated for max_iter
imputation rounds. The results of the final imputation round are returned.
**Note**
This estimator is still experimental for now: default parameters or details of behaviour
might change without any deprecation cycle. Resolving the following issues would help
stabilize IterativeImputer: convergence criteria (#14338), default estimators (#13286),
and use of random state (#15611). To use it, you need to explicitly
import enable_iterative_imputer.