Train-Test Split and Data Preparation
What is Train Test Split?
A train-test split divides a dataset into two parts: a training set that the model learns from, and a testing set that is held back to check performance.
In practice, the held-back testing set shows how a model performs on data it did not see during training. Learn the idea first, then type the program yourself and compare the output.
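The idea can be sketched without any library: shuffle the rows, then slice off part of them as the testing set. This is only an illustration of the concept, not how scikit-learn implements it; the helper name `simple_split` and its parameters are hypothetical.

```python
import random

def simple_split(rows, test_fraction, seed):
    """Hypothetical helper: shuffle a copy of rows, then slice off the test part."""
    shuffled = list(rows)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

data = [(1, 40), (2, 50), (3, 60), (4, 70), (5, 80)]  # (feature, target) pairs
train, test = simple_split(data, test_fraction=0.4, seed=1)

print("Train rows:", len(train))  # 3
print("Test rows:", len(test))    # 2
```

Every row ends up in exactly one of the two parts, which is the property that makes the test score a check on unseen data.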
| Point | Details |
|---|---|
| Course Area | Machine Learning + AI Concepts used for prediction, classification, clustering and AI-based projects. |
| Main Use | Testing on unseen data |
| Example File | train-test-split.py |
| Practice Focus | Run, change values, and explain the output line by line. |
Why should you learn this?
- It lets you evaluate a model on data it did not see during training.
- It gives a fairer estimate of performance than scoring the model on its own training data.
- It improves your ability to read, write and debug Python programs.
Important Terms
These terms are used directly in this lesson. Understand them before memorising the code.
| Term | Meaning |
|---|---|
| training set | The part of the data the model learns from. |
| testing set | The held-back part of the data, used only to check performance after training. |
| test_size | Parameter of train_test_split that sets how much data goes to the testing set. A float between 0 and 1 is a fraction of the data; an integer is an absolute number of samples. |
| random_state | Fixed seed used to make data splitting repeatable. |
| generalization | How well a model performs on new data it was not trained on. |
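A short sketch of how test_size and random_state behave, using a made-up dataset of 8 samples (the data is an assumption for illustration, not part of the lesson's example file):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(8)]      # 8 samples, one feature each
y = [10 * i for i in range(8)]   # matching targets

# A float test_size is a fraction of the data: 0.25 of 8 samples -> 2 test samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
print(len(X_train), len(X_test))  # 6 2

# An int test_size is an absolute number of test samples.
_, X_test_int, _, _ = train_test_split(X, y, test_size=3, random_state=1)
print(len(X_test_int))  # 3

# The same random_state gives the same split on every run.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_test == X_test_again)  # True
```

Changing random_state to another number would most likely put different samples in the testing set, while the sizes stay the same.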
Syntax / Basic Pattern
The simple pattern is: prepare data, apply the concept, then show the result.
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4], [5]]
y = [40, 50, 60, 70, 80]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
print("Train size:", len(X_train))
print("Test size:", len(X_test))
Complete Example Program
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4], [5]]
y = [40, 50, 60, 70, 80]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
print("Train size:", len(X_train))
print("Test size:", len(X_test))
Expected Output
Train size: 3
Test size: 2
Program Explanation
- from sklearn.model_selection import train_test_split: imports the train_test_split function from scikit-learn.
- X = [[1], [2], [3], [4], [5]]: stores the input features in X, five samples with one feature each.
- y = [40, 50, 60, 70, 80]: stores the matching target values in y.
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1): splits X and y together, so each sample keeps its own target. test_size=0.4 sends 40% of the samples (2 of 5) to the testing set, and random_state=1 makes the shuffle repeatable.
- print("Train size:", len(X_train)): displays the number of training samples.
- print("Test size:", len(X_test)): displays the number of testing samples.
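One way to convince yourself that the split keeps each feature with its own target is to check a rule that holds in the sample data. In the lesson's data every target happens to equal 10 * feature + 30, so a mismatched pair would break that rule. The check below is a small sketch built on that observation:

```python
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5]]
y = [40, 50, 60, 70, 80]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# In this sample data every target equals 10 * feature + 30,
# so a pair that got mixed up during the split would print False.
for features, target in zip(X_train + X_test, y_train + y_test):
    print(features, "->", target, "aligned:", target == 10 * features[0] + 30)
```

Splitting X and y in two separate calls with different seeds would not give this guarantee, which is why both are passed to train_test_split together.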
Where will you use it?
- Testing on unseen data.
- Checking fair performance.
- Detecting overfitting, where a model scores well on training data but poorly on testing data.
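The uses above come down to comparing the score on training data with the score on held-back data. The sketch below assumes synthetic data from make_classification with some label noise and an unrestricted decision tree; neither is part of the original lesson, they are only chosen because they make the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration); flip_y adds label noise.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# An unrestricted tree can memorise the training set, noise included.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-back data

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A training accuracy far above the testing accuracy is the usual sign of overfitting. Without the held-back set, only the flattering training number would be visible.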
Common Mistakes
- Training and testing the model on the same data.
- Using an algorithm without understanding the input features.
- Reporting only accuracy without checking actual mistakes and limitations.
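The last mistake can be illustrated with hand-written labels. In this sketch the values of y_test and y_pred are made up: they imitate a model that always predicts class 0 on a testing set where class 1 is rare.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 8 samples of class 0 and 2 samples of class 1.
y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts class 0.
y_pred = [0] * 10

print("Accuracy:", accuracy_score(y_test, y_pred))  # 0.8

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
# [[8 0]
#  [2 0]]
```

The accuracy looks acceptable, but the second row of the confusion matrix shows that both class 1 samples were missed, which accuracy alone would hide.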
Practice Tasks
- Type the program in train-test-split.py and run it.
- Change input values or sample data and observe the new output.
- Create one example related to testing on unseen data.
- Write 5 lines explaining the logic in your own words.
Summary
Train Test Split is not a theory-only topic. You should be able to explain the meaning, write the example, run it successfully, and use it in a small practical program.
What is Train Test Split?
A train-test split divides data into training data for learning and testing data for checking performance. In simple words, this topic is used directly when writing practical Python programs.
Practise this topic not only for the definition, but for real examples such as testing on unseen data.
Why is it important to learn this?
- It is used for testing on unseen data.
- It is also connected with checking fair performance.
- It helps you understand the output and errors of your code better.
Practice Tasks
- Type the program into the train-test-split.py file and run it.
- Change the values and compare the output.
- Build a small example on testing on unseen data.
- Write the logic in your own words in 5 lines.
Summary
Consider Train Test Split complete only when you can clearly explain its meaning, example, output and practical use.