A detailed walkthrough of the scikit-learn workflow and the advantages of its integrated design

Introduction to scikit-learn

Scikit-learn is Python's most popular machine learning library. It has the following attractive features:

A simple, efficient, and remarkably rich set of data mining and data analysis algorithms;

Built on NumPy, SciPy, and matplotlib, so the whole process, from exploratory data analysis and visualization to algorithm implementation, is integrated;

This integrated design is especially valuable when we want to compare and evaluate the performance of several algorithms against each other.

Since scikit-learn is such an important module, let's skip the small talk and dive right in!

Project organization and file loading

Project organization

Working path: `D:\my_python_workfile\Thesis\sklearn_exercise`

```
D:\my_python_workfile\Thesis\sklearn_exercise
|-- data                        : used to store data
    |-- 20news-bydate           : practice data set
        |-- 20news-bydate-train : training set
        |-- 20news-bydate-test  : test set
```

File loading

Suppose the data we need to load is organized as follows:

```
container_folder/
    category_1_folder/
        file_1.txt
        file_2.txt
        ...
        file_42.txt
    category_2_folder/
        file_43.txt
        file_44.txt
        ...
```

You can use the following function to load the data:

```python
sklearn.datasets.load_files(container_path, description=None, categories=None,
                            load_content=True, shuffle=True, encoding=None,
                            decode_error='strict', random_state=0)
```

Parameter explanation:

`container_path`: the path to `container_folder`;

`load_content=True`: whether to load the contents of the files into memory;

`encoding=None`: the encoding. Current text files are generally encoded as "utf-8". If no encoding is specified (`encoding=None`), the file contents are treated as bytes rather than Unicode.

Return value: a `Bunch`, a dictionary-like object. Its main attributes are:

`data`: raw data;

`filenames`: the name of each file;

`target`: the category label (an integer index starting from 0);

`target_names`: the specific meaning of each category label (determined by the subfolder names `category_1_folder`, etc.).
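To see how these pieces fit together, here is a minimal sketch that loads the hypothetical `container_folder` layout shown above and inspects the returned `Bunch` (the path and the `utf-8` encoding are assumptions for illustration only):

```python
from sklearn.datasets import load_files

# hypothetical path: assumes the container_folder/category_*_folder layout shown above
bunch = load_files("container_folder", encoding="utf-8", decode_error="ignore",
                   shuffle=True, random_state=0)

print(bunch.target_names)   # e.g. ['category_1_folder', 'category_2_folder']
print(len(bunch.data))      # number of documents loaded into memory
print(bunch.filenames[0])   # path of the first (shuffled) file
print(bunch.target[0])      # integer label of that file, indexing into target_names
```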

Now let's run through an example using the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). First download the data set from the Internet, then load it locally.

```python
# load libraries
import os
import sys

## configure utf-8 output environment (Python 2 only)
# reload(sys)
# sys.setdefaultencoding("utf-8")

# set the current working path
os.chdir("D:\\my_python_workfile\\Thesis\\sklearn_exercise")

# load the data
from sklearn import datasets
twenty_train = datasets.load_files("data/20news-bydate/20news-bydate-train")
twenty_test = datasets.load_files("data/20news-bydate/20news-bydate-test")
```

```python
len(twenty_train.target_names), len(twenty_train.data), len(twenty_train.filenames), len(twenty_test.data)
```

(20, 11314, 11314, 7532)

```python print("".join(twenty_train.data[0].split("")[:3])) ```

From: keley.edu ( )

Subject: Re: Cubs behind Marlins? How?

Article-ID: agate.1pt592$f9a

```python
print(twenty_train.target_names[twenty_train.target[0]])
```

rec.sport.baseball

```python
twenty_train.target[:10]
```

array([ 9, 4, 11, 4, 0, 4, 5, 5, 13, 12])

It can be seen that the file has been successfully loaded.

Of course, as a beginner exercise, we can also use the example data sets that come with `scikit-learn` for testing and experimenting. Below, I will show how to load one of these bundled data sets.

```python
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
```

Text feature extraction

Text data is unstructured, and it is generally converted into structured data so that machine learning algorithms can be applied for text classification.

A common practice is to convert the text into a "document-term matrix". The elements of the matrix can be word frequencies, TF-IDF values, and so on.
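As a quick, made-up illustration of what such a matrix looks like, here is a two-document toy corpus turned into a document-term matrix of raw counts (the corpus is invented for this sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

# invented two-document corpus, just to visualize the document-term matrix
corpus = ["the cat sat on the mat",
          "the dog ate my homework"]
vect = CountVectorizer()
dtm = vect.fit_transform(corpus)   # rows = documents, columns = vocabulary terms

print(dtm.toarray())               # each entry is the frequency of a term in a document
```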

Calculating word frequency

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words="english", decode_error='ignore')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
```

(11314, 129783)

Feature extraction using TF-IDF

```python
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
```

(11314, 129783)

The program above represents the text in two steps: first the `fit()` method fits the transformer to the data; then the `transform()` method re-expresses the word-frequency matrix as TF-IDF.

The two steps can also be combined into one, as shown below.

```python
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
```

(11314, 129783)

Classifier training

```python
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
```

```python
# predicting new samples
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, twenty_train.target_names[category]))
```

'God is love' => soc.religion.christian

'OpenGL on the GPU is fast' => comp.graphics

Classification effect evaluation

Building a pipeline

```python
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words="english", decode_error='ignore')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
```

Test set classification accuracy

```python
import numpy as np

docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
```

0.81691449814126393

Using the naive Bayes classifier, the classification accuracy on the test set is 81.7%; not a bad result!
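The same figure can also be computed with scikit-learn's built-in accuracy helper; a quick equivalent sketch:

```python
from sklearn import metrics

# equivalent to np.mean(predicted == twenty_test.target)
metrics.accuracy_score(twenty_test.target, predicted)
```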

Next, let's see how a linear support vector machine performs.

```python
from sklearn.linear_model import SGDClassifier

# note: newer scikit-learn versions use max_iter instead of n_iter
text_clf_2 = Pipeline([('vect', CountVectorizer(stop_words='english', decode_error='ignore')),
                       ('tfidf', TfidfTransformer()),
                       ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                             alpha=1e-3, n_iter=5, random_state=42)),
                       ])
_ = text_clf_2.fit(twenty_train.data, twenty_train.target)
predicted = text_clf_2.predict(docs_test)
np.mean(predicted == twenty_test.target)
```

0.82355284121083383

The classification accuracy of the support vector machine has been improved.

`scikit-learn` also provides more detailed evaluation metrics, such as per-category precision, recall, and F1 score.

Below, let's take a look at these more detailed metrics.

```python
from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
```

```
                          precision    recall  f1-score   support

             alt.atheism       0.71      0.71      0.71       319
           comp.graphics       0.81      0.69      0.74       389
 comp.os.ms-windows.misc       0.72      0.79      0.75       394
comp.sys.ibm.pc.hardware       0.73      0.66      0.69       392
   comp.sys.mac.hardware       0.82      0.83      0.82       385
          comp.windows.x       0.86      0.77      0.81       395
            misc.forsale       0.80      0.87      0.84       390
               rec.autos       0.91      0.90      0.90       396
         rec.motorcycles       0.93      0.97      0.95       398
      rec.sport.baseball       0.88      0.91      0.90       397
        rec.sport.hockey       0.87      0.98      0.92       399
               sci.crypt       0.85      0.96      0.90       396
         sci.electronics       0.80      0.62      0.70       393
                 sci.med       0.90      0.87      0.88       396
               sci.space       0.84      0.96      0.90       394
  soc.religion.christian       0.75      0.93      0.83       398
      talk.politics.guns       0.70      0.93      0.80       364
   talk.politics.mideast       0.92      0.92      0.92       376
      talk.politics.misc       0.89      0.56      0.69       310
      talk.religion.misc       0.81      0.39      0.53       251

             avg / total       0.83      0.82      0.82      7532
```

Precision and recall on the test set both look good.

Let's take a look at the results of the "confusion matrix".

```python
metrics.confusion_matrix(twenty_test.target, predicted)
```

Use grid search for parameter optimization

When classifying text with these classifiers, several parameters have to be specified: `use_idf` in `TfidfTransformer()`, the smoothing parameter `alpha` in `MultinomialNB()`, and the penalty coefficient `alpha` in `SGDClassifier()`. These parameters cannot simply be picked off the top of your head, because different settings may lead to different results.

Rather than degenerating into a full-time "parameter-tuning grunt", let's see how to use brute-force grid search to let the computer optimize the parameters for us.

```python
from sklearn.grid_search import GridSearchCV
# note: in newer scikit-learn versions, GridSearchCV lives in sklearn.model_selection

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }
```

Exhaustively trying every combination of parameters can mean a long wait for results. A deep-pocketed reader may wonder: can I trade money for time?

The answer is yes. If you have an 8-core computer, use all the cores!

```python
gs_clf = GridSearchCV(text_clf_2, parameters, n_jobs=-1)
```

```python
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
```

Set `n_jobs = -1` and the computer will automatically detect and use all your cores for parallel computing.

```python
# grid_scores_ is the old API; newer versions expose cv_results_, best_params_, and best_score_
best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
```

clf__alpha: 0.01

tfidf__use_idf: True

vect__ngram_range: (1, 1)

```python
score
```

0.90516174650875025
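The fitted `GridSearchCV` object can itself be used as a classifier, since by default it refits the pipeline on the whole training set with the best parameter combination. A small sketch reusing `gs_clf` from above (the sample sentence is just an illustration):

```python
# gs_clf acts as a classifier refit with the best parameters found by the search
print(gs_clf.best_params_)            # best parameter combination

doc = ['God is love']                 # illustrative new document
category = gs_clf.predict(doc)[0]     # predicted integer label
print(twenty_train.target_names[category])
```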
