Python Notes

[TOC]

Basics

Operators

Ternary operator

Other languages:
a = x > y ? x : y
Python:
a = x if x > y else y

set()

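Decorators and the magic @

  • Decorator
    A decorator is a way to add behavior to a function while the program is running. It is simply a higher-order function: it receives the original function as its argument and returns a closure that replaces the original function. A decorator can do preprocessing before the wrapped function runs, such as checking argument types.

To apply a decorator, put @ followed by the decorator name on the line just before the function or method definition.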

  • A simple decorator and how decoration works

    >>> def log(func):
    ...     def wrapper(*args, **kw):
    ...         print('call %s():' % func.__name__)
    ...         return func(*args, **kw)
    ...     return wrapper
    ...
    >>> @log  # same as writing now = log(now) after the definition
    ... def now():
    ...     print('2018-7-24')
    ...     return 'done'
    ...
    >>> now()  # now is bound to wrapper, so this calls wrapper(), which calls the original now()
    call now():
    2018-7-24
    'done'
  • Decorators that take arguments

    >>> def log(text):
    ...     def decorator(func):
    ...         def wrapper(*args, **kw):
    ...             print('%s %s():' % (text, func.__name__))
    ...             return func(*args, **kw)
    ...         return wrapper
    ...     return decorator
    ...
    >>> @log('execute')
    ... def now():
    ...     print('2015-10-26')
    ...     return 'done'
    ...
    >>> now()  # equivalent to now = log('execute')(now), so this calls wrapper()
    execute now():
    2015-10-26
    'done'
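
One side effect of replacing a function with a closure is that the function's metadata (its __name__ and docstring) now belongs to the closure. The sketch below shows the usual remedy, functools.wraps from the standard library. The names log_plain and later are made up for the comparison and are not part of the notes above.

```python
import functools


def log(func):
    """Print the wrapped function's name before calling it."""
    @functools.wraps(func)  # copy __name__, __doc__, etc. onto wrapper
    def wrapper(*args, **kw):
        print('call %s():' % func.__name__)
        return func(*args, **kw)
    return wrapper


def log_plain(func):
    """Hypothetical variant without functools.wraps, for comparison."""
    def wrapper(*args, **kw):
        return func(*args, **kw)
    return wrapper


@log
def now():
    """Return a fixed marker string."""
    return 'done'


@log_plain
def later():
    return 'done'


print(now.__name__)    # the original function's name is preserved
print(later.__name__)  # the inner closure's name leaks out
```

With functools.wraps, tools that rely on __name__ or __doc__ (logging, help(), debuggers) keep seeing the original function.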

Tips

list to str

''.join(str(x) for x in items)  # items is any iterable; avoid naming a variable "list", which shadows the built-in

Contents of builtins

'ArithmeticError'
'AssertionError'
'AttributeError'
'BaseException'
'BlockingIOError'
'BrokenPipeError'
'BufferError'
'BytesWarning'
'ChildProcessError'
'ConnectionAbortedError'
'ConnectionError'
'ConnectionRefusedError'
'ConnectionResetError'
'DeprecationWarning'
'EOFError'
'Ellipsis'
'EnvironmentError'
'Exception'
'False'
'FileExistsError'
'FileNotFoundError'
'FloatingPointError'
'FutureWarning'
'GeneratorExit'
'IOError'
'ImportError'
'ImportWarning'
'IndentationError'
'IndexError'
'InterruptedError'
'IsADirectoryError'
'KeyError'
'KeyboardInterrupt'
'LookupError'
'MemoryError'
'NameError'
'None'
'NotADirectoryError'
'NotImplemented'
'NotImplementedError'
'OSError'
'OverflowError'
'PendingDeprecationWarning'
'PermissionError'
'ProcessLookupError'
'RecursionError'
'ReferenceError'
'ResourceWarning'
'RuntimeError'
'RuntimeWarning'
'StopAsyncIteration'
'StopIteration'
'SyntaxError'
'SyntaxWarning'
'SystemError'
'SystemExit'
'TabError'
'TimeoutError'
'True'
'TypeError'
'UnboundLocalError'
'UnicodeDecodeError'
'UnicodeEncodeError'
'UnicodeError'
'UnicodeTranslateError'
'UnicodeWarning'
'UserWarning'
'ValueError'
'Warning'
'WindowsError'
'ZeroDivisionError'
'_'
'__build_class__'
'__debug__'
'__doc__'
'__import__'
'__loader__'
'__name__'
'__package__'
'__spec__'
'abs'
'all'
'any'
'ascii'
'bin'
'bool'
'bytearray'
'bytes'
callable()
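    Checks whether an object is callable. If it returns True, a call to the object may still fail; if it returns False, calling the object will never succeed.

It returns True for functions, methods, lambda functions, classes, and instances of classes that implement __call__.

callable(object)
Parameter: object -- the object to test
Returns: True if the object is callable, otherwise False.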
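
A small sketch of the cases listed above. The classes Greeter and Plain are invented for the illustration.

```python
class Greeter:
    """Instances are callable because the class defines __call__."""
    def __call__(self, name):
        return 'Hello %s!' % name


class Plain:
    """No __call__, so instances are not callable."""


results = {
    'function': callable(len),
    'lambda': callable(lambda x: x),
    'class': callable(Plain),
    'instance with __call__': callable(Greeter()),
    'plain instance': callable(Plain()),
    'integer': callable(42),
}
for label, value in results.items():
    print(label, value)
```

Only the last two entries are False: a plain instance and an integer cannot be called.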
'chr'
'classmethod'
'compile'
'complex'
'copyright'
'credits'
'delattr'
'dict'
'dir'
'divmod'
'enumerate'
eval()

Evaluates the input string as a Python expression and returns the result.
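
A minimal sketch of eval() in use. Because eval() runs arbitrary code, the second half shows ast.literal_eval from the standard library, which is a common substitute when the string should only contain literals; that substitution is a suggestion, not something the notes above prescribe.

```python
import ast

# eval() parses the string as an expression and evaluates it.
total = eval('1 + 2 * 3')
print(total)

# Names can be supplied through the optional globals mapping.
scaled = eval('x * 2', {'x': 21})
print(scaled)

# For input that should only contain literals, ast.literal_eval()
# is the safer choice: it refuses anything that is not a literal.
data = ast.literal_eval("{'a': [1, 2, 3]}")
print(data)

try:
    ast.literal_eval('__import__("os").getcwd()')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```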

'exec'
'exit'
'filter'
'float'
'format'
'frozenset'
'getattr'
'globals'
'hasattr'
'hash'
'help'
'hex'
'id'
'input'
'int'

int()

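int() converts to an integer by discarding the fractional part (truncation toward zero); it does not round to the nearest integer.
round() rounds to the nearest integer; exact halves go to the nearest even integer, so round(2.5) is 2.
The trick round(value + 0.5) is sometimes used to round up, but it is unreliable (round(1.0 + 0.5) gives 2, not 1); math.ceil() is the dependable way to round up.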

>>> int(1.3)
1
>>> int(1.8)
1
>>> round(1.3)
1
>>> round(1.8)
2
>>> round(1.3 + 0.5)
2
>>>
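
To make the differences concrete, the sketch below compares int(), math.floor(), math.ceil(), and round() on a few values, including negative ones where truncation and flooring disagree. The sample values are arbitrary.

```python
import math

values = [1.3, 1.8, -1.3, -1.8, 2.5]

# One row per value: (value, int, floor, ceil, round)
table = [(v, int(v), math.floor(v), math.ceil(v), round(v)) for v in values]

for row in table:
    print(row)
```

For -1.8, int() gives -1 while math.floor() gives -2, and round(2.5) gives 2 because ties go to the even neighbour.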

'isinstance'
'issubclass'
'iter'
'len'
'license'
'list'
'locals'
'map'
'max'
'memoryview'
'min'
'next'
'object'
'oct'
'open'
'ord'
'pow'
'print'
property()
  • Turn a method of a class into a read-only attribute
  • Re-implement the getter and setter of an attribute

class Person(object):

    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    @property
    def full_name(self):
        return "%s %s" % (self.first_name, self.last_name)

>>> person = Person("Mike", "Driscoll")
>>> person.full_name
'Mike Driscoll'
>>> person.first_name
'Mike'
>>> person.full_name = "Jackalope"
Traceback (most recent call last):
  File "<string>", line 1, in <fragment>
AttributeError: can't set attribute
>>> person.first_name = "Dan"
>>> person.full_name
'Dan Driscoll'

# A class with explicit getter and setter methods
>>> class Fees(object):
...     def __init__(self):
...         self._fee = None
...     def get_fee(self):
...         return self._fee
...     def set_fee(self, value):
...         self._fee = value
...
>>> f = Fees()
>>> f.set_fee(1)
>>> f.get_fee()
1

# The same class, with property() tying the two methods to one attribute
>>> class Fees(object):
...     def __init__(self):
...         self._fee = None
...     def get_fee(self):
...         return self._fee
...     def set_fee(self, value):
...         self._fee = value
...     fee = property(get_fee, set_fee)
...
>>> f = Fees()
>>> f.fee = 1
>>> f.fee
1

# The same again, written with the @property decorator and its setter
>>> class Fees(object):
...     def __init__(self):
...         self._fee = None
...     @property
...     def fee(self):
...         return self._fee
...     @fee.setter
...     def fee(self, value):
...         self._fee = value
...
>>> f = Fees()
>>> f.fee = 2
>>> f.fee
2
'quit'
'range'
'repr'
'reversed'
round()

round() rounds to the nearest integer; see int() above.

'set'
'setattr'
'slice'
'sorted'
'staticmethod'
'str'

sum()

>>> a = [1, 2, 3]
>>> sum(a)
6
'super'
'tuple'
'type'
'vars'
'zip'

To solve this problem (finding the median of two sorted arrays A and B), we need to understand what a median is used for. In statistics, the median divides a set into two subsets of equal length such that every element of one subset is greater than or equal to every element of the other. Once we see the median as a way of dividing, we are very close to the answer.

First let's cut A into two parts at a random position i:

      left_A             |        right_A
A[0], A[1], ..., A[i-1]  |  A[i], A[i+1], ..., A[m-1]

Since A has m elements, there are m+1 ways to cut it (i = 0 ~ m). We know that len(left_A) = i and len(right_A) = m - i. Note: when i = 0, left_A is empty, and when i = m, right_A is empty.

In the same way, cut B into two parts at a random position j:

      left_B             |        right_B
B[0], B[1], ..., B[j-1]  |  B[j], B[j+1], ..., B[n-1]

Put left_A and left_B into one set, and put right_A and right_B into another set. Let's name them left_part and right_part:

      left_part          |        right_part
A[0], A[1], ..., A[i-1]  |  A[i], A[i+1], ..., A[m-1]
B[0], B[1], ..., B[j-1]  |  B[j], B[j+1], ..., B[n-1]

If we can ensure:

1) len(left_part) == len(right_part)
2) max(left_part) <= min(right_part)

then we have divided all elements in {A, B} into two parts of equal length, and every element of one part is no smaller than every element of the other. Then median = (max(left_part) + min(right_part))/2.

To ensure these two conditions, we just need to ensure:

(1) i + j == m - i + n - j (or: m - i + n - j + 1)
    if n >= m, we just need to set: i = 0 ~ m, j = (m + n + 1)/2 - i
    (here and below, / between integers means integer division)
(2) B[j-1] <= A[i] and A[i-1] <= B[j]

ps.1 For simplicity, I presume A[i-1], B[j-1], A[i], B[j] are always valid even if i=0/i=m/j=0/j=n. I will talk about how to deal with these edge values at the end.

ps.2 Why n >= m? Because I have to make sure j is non-negative, since 0 <= i <= m and j = (m + n + 1)/2 - i. If n < m, then j may be negative, which leads to a wrong result.

So, all we need to do is:

Search for i in [0, m] to find an object `i` such that:
    B[j-1] <= A[i] and A[i-1] <= B[j], (where j = (m + n + 1)/2 - i)

We can do a binary search following the steps described below:

<1> Set imin = 0, imax = m, then start searching in [imin, imax]

<2> Set i = (imin + imax)/2, j = (m + n + 1)/2 - i

<3> Now we have len(left_part) == len(right_part). There are only 3 situations
    that we may encounter:
    <a> B[j-1] <= A[i] and A[i-1] <= B[j]
        This means we have found the object `i`, so stop searching.
    <b> B[j-1] > A[i]
        This means A[i] is too small. We must `adjust` i to get `B[j-1] <= A[i]`.
        Can we `increase` i?
            Yes. Because when i is increased, j will be decreased.
            So B[j-1] is decreased and A[i] is increased, and `B[j-1] <= A[i]` may
            be satisfied.
        Can we `decrease` i?
            `No!` Because when i is decreased, j will be increased.
            So B[j-1] is increased and A[i] is decreased, and B[j-1] <= A[i] will
            never be satisfied.
        So we must `increase` i. That is, we must adjust the searching range to
        [i+1, imax]. So, set imin = i+1, and goto <2>.
    <c> A[i-1] > B[j]
        This means A[i-1] is too big, and we must `decrease` i to get `A[i-1] <= B[j]`.
        That is, we must adjust the searching range to [imin, i-1].
        So, set imax = i-1, and goto <2>.

When the object i is found, the median is:

max(A[i-1], B[j-1]) (when m + n is odd)
or (max(A[i-1], B[j-1]) + min(A[i], B[j]))/2 (when m + n is even)

Now let's consider the edge values i=0, i=m, j=0, j=n, where A[i-1], B[j-1], A[i], B[j] may not exist. This situation is actually easier than you might think.

What we need to do is ensure that max(left_part) <= min(right_part). So, if i and j are not edge values (meaning A[i-1], B[j-1], A[i], B[j] all exist), then we must check both B[j-1] <= A[i] and A[i-1] <= B[j]. But if some of A[i-1], B[j-1], A[i], B[j] don't exist, then we don't need to check one (or both) of these two conditions. For example, if i=0, then A[i-1] doesn't exist, so we don't need to check A[i-1] <= B[j]. So, what we need to do is:

Search for i in [0, m] to find an object `i` such that:
    (j == 0 or i == m or B[j-1] <= A[i]) and
    (i == 0 or j == n or A[i-1] <= B[j])
    where j = (m + n + 1)/2 - i

In the search loop, we will encounter only three situations:

<a> (j == 0 or i == m or B[j-1] <= A[i]) and
    (i == 0 or j == n or A[i-1] <= B[j])
    This means i is perfect, and we can stop searching.

<b> j > 0 and i < m and B[j - 1] > A[i]
    This means i is too small, and we must increase it.

<c> i > 0 and j < n and A[i - 1] > B[j]
    This means i is too big, and we must decrease it.

Thanks to @Quentin.chen, who pointed out that i < m ==> j > 0 and i > 0 ==> j < n. Because:

m <= n, i < m ==> j = (m+n+1)/2 - i > (m+n+1)/2 - m >= (2*m+1)/2 - m >= 0
m <= n, i > 0 ==> j = (m+n+1)/2 - i < (m+n+1)/2 <= (2*n+1)/2 <= n

So in situations <b> and <c>, we don't need to check whether j > 0 and whether j < n.

Below is the accepted code:

def median(A, B):
    m, n = len(A), len(B)
    if m > n:
        # make sure A is the shorter array, so that j is never negative
        A, B, m, n = B, A, n, m
    if n == 0:
        raise ValueError

    # use // so the indices stay integers in Python 3
    imin, imax, half_len = 0, m, (m + n + 1) // 2
    while imin <= imax:
        i = (imin + imax) // 2
        j = half_len - i
        if i < m and B[j-1] > A[i]:
            # i is too small, must increase it
            imin = i + 1
        elif i > 0 and A[i-1] > B[j]:
            # i is too big, must decrease it
            imax = i - 1
        else:
            # i is perfect

            if i == 0: max_of_left = B[j-1]
            elif j == 0: max_of_left = A[i-1]
            else: max_of_left = max(A[i-1], B[j-1])

            if (m + n) % 2 == 1:
                return max_of_left

            if i == m: min_of_right = B[j]
            elif j == n: min_of_right = A[i]
            else: min_of_right = min(A[i], B[j])

            return (max_of_left + min_of_right) / 2.0

Common data-processing libraries

numpy

Concatenation

>>> import numpy as np
>>> x = np.array([1,2,3])
>>> y = np.array([4,5,6])
>>> np.concatenate([x,y],axis = 0)
array([1, 2, 3, 4, 5, 6])
>>> np.vstack((x,y))
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.hstack((x,y))
array([1, 2, 3, 4, 5, 6])

pandas

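Key abbreviations and imports
This cheat sheet uses the following abbreviations:

df: any pandas DataFrame object
s: any pandas Series object
We also need the following imports:


import pandas as pd
import numpy as np

Importing data

  • pd.read_csv(filename): import data from a CSV file
  • pd.read_table(filename): import data from a delimited text file
  • pd.read_excel(filename): import data from an Excel file
  • pd.read_sql(query, connection_object): import data from a SQL table or database
  • pd.read_json(json_string): import data from a JSON-formatted string
  • pd.read_html(url): parse a URL, string, or HTML file and extract the tables in it
  • pd.read_clipboard(): take the contents of the clipboard and pass them to read_table()
  • pd.DataFrame(dict): build a DataFrame from a dict; the keys are column names and the values are the data

Exporting data

  • df.to_csv(filename): export data to a CSV file
  • df.to_excel(filename): export data to an Excel file
  • df.to_sql(table_name, connection_object): export data to a SQL table
  • df.to_json(filename): export data to a text file in JSON format

Creating test objects

  • pd.DataFrame(np.random.rand(20,5)): create a DataFrame of random numbers with 20 rows and 5 columns
  • pd.Series(my_list): create a Series from the iterable my_list
  • df.index = pd.date_range('1900/1/30', periods=df.shape[0]): add a date index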

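Viewing and inspecting data

  • df.head(n): view the first n rows of the DataFrame
  • df.tail(n): view the last n rows of the DataFrame
  • df.shape: the number of rows and columns (an attribute, not a method)
  • df.info(): index, data types, and memory information
  • df.describe(): summary statistics for the numeric columns
  • s.value_counts(dropna=False): the unique values of a Series and their counts
  • df.apply(pd.Series.value_counts): the unique values and counts for every column of the DataFrame

Selecting data

  • df[col]: select a column by name, returned as a Series
  • df[[col1, col2]]: select several columns, returned as a DataFrame
  • s.iloc[0]: select by position
  • s.loc['index_one']: select by index label
  • df.iloc[0,:]: the first row
  • df.iloc[0,0]: the first element of the first column
  • df.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
    n is the number of rows to draw (for example, n=20000 draws 20,000 rows).

    frac is the fraction of rows to draw. Sometimes the exact number of rows does not matter and you want a percentage instead; for example, frac=0.8 draws 80% of the rows. n and frac cannot both be given.

    replace controls whether sampling is done with replacement, that is, whether the same row may be drawn more than once. It does not modify the original DataFrame.

    weights gives the weight of each sample; see the official documentation for details.

    random_state seeds the random number generator so that the same sample can be reproduced.

    axis chooses whether rows or columns are sampled: axis=0 draws n rows at random, axis=1 draws n columns at random.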
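
A short sketch of the df.sample parameters described above, on a made-up 10-row frame (the column names x and y are placeholders). It assumes pandas is installed.

```python
import pandas as pd

df = pd.DataFrame({'x': range(10), 'y': list('abcdefghij')})

five_rows = df.sample(n=5, random_state=0)       # exactly 5 rows
most_rows = df.sample(frac=0.8, random_state=0)  # 80% of the rows
# replace=True: a row may be drawn more than once, so n can exceed len(df)
bootstrap = df.sample(n=15, replace=True, random_state=0)
one_column = df.sample(n=1, axis=1, random_state=0)  # sample columns instead of rows

print(five_rows.shape, most_rows.shape, bootstrap.shape, one_column.shape)
```

Passing the same random_state again returns the same rows, which is handy when a sample has to be reproduced later.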

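Data cleaning

  • df.columns = ['a','b','c']: rename the columns
  • pd.isnull(df): check for null values in the DataFrame and return a Boolean array
  • pd.notnull(df): check for non-null values in the DataFrame and return a Boolean array
  • df.dropna(): drop all rows that contain null values
  • df.dropna(axis=1): drop all columns that contain null values
  • df.dropna(axis=1,thresh=n): drop all columns that have fewer than n non-null values
  • df.fillna(x): replace all null values in the DataFrame with x
  • s.astype(float): convert the data type of the Series to float
  • s.replace(1,'one'): replace all values equal to 1 with 'one'
  • s.replace([1,3],['one','three']): replace 1 with 'one' and 3 with 'three'
  • df.rename(columns=lambda x: x + 1): rename columns in bulk
  • df.rename(columns={'old_name': 'new_name'}): rename selected columns
  • df.set_index('column_one'): change the index column
  • df.rename(index=lambda x: x + 1): rename index labels in bulk
  • df.drop(labels, axis=…): drop rows or columns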
    >>> df
         A      B         C         D
    0  foo    one  0.016336  0.087302
    1  bar    one -0.394784 -2.609699
    2  foo    two -0.241163  0.429637
    3  bar  three -1.450263  1.574793
    4  foo    two -0.436486  0.047045
    5  bar    two  0.378663 -0.596585
    6  foo    one  0.576077  0.036312
    7  foo  three -1.507273  0.212231
    >>> df.drop('A',axis = 1)
           B         C         D
    0    one  0.016336  0.087302
    1    one -0.394784 -2.609699
    2    two -0.241163  0.429637
    3  three -1.450263  1.574793
    4    two -0.436486  0.047045
    5    two  0.378663 -0.596585
    6    one  0.576077  0.036312
    7  three -1.507273  0.212231
    >>> df.drop(['A','B'],axis = 1)
              C         D
    0  0.016336  0.087302
    1 -0.394784 -2.609699
    2 -0.241163  0.429637
    3 -1.450263  1.574793
    4 -0.436486  0.047045
    5  0.378663 -0.596585
    6  0.576077  0.036312
    7 -1.507273  0.212231
    >>> df.drop([1,2,4],axis = 0)
         A      B         C         D
    0  foo    one  0.016336  0.087302
    3  bar  three -1.450263  1.574793
    5  bar    two  0.378663 -0.596585
    6  foo    one  0.576077  0.036312
    7  foo  three -1.507273  0.212231

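Data processing: Filter, Sort, and GroupBy

  • df[df[col] > 0.5]: select the rows where column col is greater than 0.5
  • df.sort_values(col1): sort by column col1, ascending by default
  • df.sort_values(col2, ascending=False): sort by column col2 in descending order
  • df.sort_values([col1,col2],ascending=[True,False]): sort by col1 ascending, then by col2 descending
  • df.groupby(col): return a GroupBy object grouped by column col
  • df.groupby([col1,col2]): return a GroupBy object grouped by several columns
  • df.groupby(col1)[col2].mean(): the mean of col2 within each group of col1
  • df.pivot_table(index=col1,values=[col2,col3],aggfunc=max): a pivot table grouped by col1 that holds the maximum of col2 and col3
  • df.groupby(col1).agg(np.mean): the mean of all columns within each group of col1
  • df.apply(np.mean): apply np.mean to every column of the DataFrame
  • df.apply(np.max,axis=1): apply np.max to every row of the DataFrame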
    # Shift each value of C onto the next row of the same group, then take the difference
    df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8),
                       'E' : np.random.randn(8)})

    >>> df
         A      B         C         D         E
    0  foo    one -0.660824  0.758762  0.097938
    1  bar    one  1.673826 -0.369888  1.303938
    2  foo    two  1.151669  0.750455  1.902124
    3  bar  three -0.902216 -0.344720 -1.246936
    4  foo    two -0.232781 -1.256137 -1.488918
    5  bar    two  0.387244 -0.671663 -0.284419
    6  foo    one -1.199822  0.078424 -1.121398
    7  foo  three -0.404454  0.271658 -0.132796
    >>> df['shift'] = df.groupby('A')['C'].shift(1)
    >>> df
         A      B         C         D         E     shift
    0  foo    one -0.660824  0.758762  0.097938       NaN
    1  bar    one  1.673826 -0.369888  1.303938       NaN
    2  foo    two  1.151669  0.750455  1.902124 -0.660824
    3  bar  three -0.902216 -0.344720 -1.246936  1.673826
    4  foo    two -0.232781 -1.256137 -1.488918  1.151669
    5  bar    two  0.387244 -0.671663 -0.284419 -0.902216
    6  foo    one -1.199822  0.078424 -1.121398 -0.232781
    7  foo  three -0.404454  0.271658 -0.132796 -1.199822
    >>> df['diff'] = df['C'] - df['shift']
    >>> df
         A      B         C         D         E     shift      diff
    0  foo    one -0.660824  0.758762  0.097938       NaN       NaN
    1  bar    one  1.673826 -0.369888  1.303938       NaN       NaN
    2  foo    two  1.151669  0.750455  1.902124 -0.660824  1.812493
    3  bar  three -0.902216 -0.344720 -1.246936  1.673826 -2.576042
    4  foo    two -0.232781 -1.256137 -1.488918  1.151669 -1.384450
    5  bar    two  0.387244 -0.671663 -0.284419 -0.902216  1.289460
    6  foo    one -1.199822  0.078424 -1.121398 -0.232781 -0.967042
    7  foo  three -0.404454  0.271658 -0.132796 -1.199822  0.795368

    # Rescale the 'C' values within each group of 'A' to remove their units

    >>> df['shift'] = df.groupby('A')['C'].transform(lambda i: (i - i.min()) / i.std())
    # Note: the line above subtracts the group minimum. For a proper z-score, subtract the mean instead:
    # df['shift'] = df.groupby('A')['C'].transform(lambda i: (i - i.mean()) / i.std())
    >>> df
         A      B         C         D         E     shift      diff
    0  foo    one -0.660824  0.758762  0.097938  0.616460       NaN
    1  bar    one  1.673826 -0.369888  1.303938  2.000000       NaN
    2  foo    two  1.151669  0.750455  1.902124  2.689432  1.812493
    3  bar  three -0.902216 -0.344720 -1.246936  0.000000 -2.576042
    4  foo    two -0.232781 -1.256137 -1.488918  1.106018 -1.384450
    5  bar    two  0.387244 -0.671663 -0.284419  1.001117  1.289460
    6  foo    one -1.199822  0.078424 -1.121398  0.000000 -0.967042
    7  foo  three -0.404454  0.271658 -0.132796  0.909673  0.795368

    # Probability of each value of A
    >>> rating_A = df.groupby('A').size().div(len(df))
    >>> rating_A
    A
    bar    0.375
    foo    0.625
    dtype: float64

    # Conditional probability of B given A
    >>> df.groupby(['A', 'B']).size().div(len(df)).div(rating_A, axis=0, level='A')
    A    B
    bar  one      0.333333
         three    0.333333
         two      0.333333
    foo  one      0.400000
         three    0.200000
         two      0.400000
    dtype: float64

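Combining data

  • df1.append(df2): append the rows of df2 to the end of df1 (removed in pandas 2.0; use pd.concat([df1, df2]) instead)
  • pd.concat([df1, df2],axis=1): append the columns of df2 to the end of df1
  • df1.join(df2,on=col1,how='inner'): perform a SQL-style join between the columns of df1 and the columns of df2

Statistics

  • df.describe(): summary statistics for the numeric columns
  • df.mean(): the mean of every column
  • df.corr(): the correlation coefficients between columns
  • df.count(): the number of non-null values in each column
  • df.max(): the maximum of each column
  • df.min(): the minimum of each column
  • df.median(): the median of each column
  • df.std(): the standard deviation of each column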
    pd.factorize()  # map labels to integer codes
    >>> df = pd.DataFrame({"id":[1,2,3,4,5,6,3,2], "raw_grade":['a', 'b', 'b','a', 'a','e','c','a']})
    >>> x,y = pd.factorize(df.raw_grade)
    >>> x
    array([0, 1, 1, 0, 0, 2, 3, 0], dtype=int64)
    >>> y
    Index(['a', 'b', 'e', 'c'], dtype='object')
    >>> pd.factorize(df.raw_grade)
    (array([0, 1, 1, 0, 0, 2, 3, 0], dtype=int64), Index(['a', 'b', 'e', 'c'], dtype='object'))

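Filtering on multiple conditions

When processing data with a DataFrame you often need to select rows by condition. With a single condition you can write:

df[df['one'] > 5]

With several conditions, write it like this:

import numpy as np

df[np.logical_and(df['one'] > 5, df['two'] > 5)]

or like this:

df[(df['one'] > 5) & (df['two'] > 5)]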
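
A runnable sketch of the two spellings on a small made-up frame, plus the matching "or" (|) and "not" (~) forms. The parentheses around each comparison are required because & and | bind more tightly than > in Python. It assumes pandas and numpy are installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 6, 7, 9], 'two': [8, 3, 6, 7]})

# Rows where both conditions hold
both = df[(df['one'] > 5) & (df['two'] > 5)]
# Rows where at least one condition holds
either = df[(df['one'] > 5) | (df['two'] > 5)]
# Rows where neither condition holds
neither = df[~((df['one'] > 5) | (df['two'] > 5))]
# np.logical_and gives the same mask as &
same = df[np.logical_and(df['one'] > 5, df['two'] > 5)]

print(both)
print(len(either), len(neither))
```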

Numeric optimization (downcasting)

  • Numeric columns

    # downcast integer columns to the smallest unsigned type that fits
    df_int = df_t.select_dtypes(include=['int'])
    df_t[df_int.columns] = df_int.apply(pd.to_numeric, downcast='unsigned')

    # downcast float columns to the smallest float type that fits
    df_float = df_t.select_dtypes(include=['float'])
    df_t[df_float.columns] = df_float.apply(pd.to_numeric, downcast='float')
  • object columns

  1. You can use one-hot encoding
  2. You can use categoricals
    In practice either one does the job.
# An approach borrowed from an expert
def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    # one-hot encode every column whose dtype is object
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

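Python utility libraries

itertools

The defining feature of an iterator is lazy evaluation: a value is computed only when iteration reaches it. This makes iterators well suited to traversing large files or infinite collections, because the data never has to be held in memory all at once.

The built-in itertools module contains a set of functions and classes that produce different kinds of iterators. Each of them returns an iterator, which can be consumed with a for loop or with next().

The iterator functions in itertools fall into these groups:

  • Infinite iterators: generate an infinite sequence, such as the natural numbers 1, 2, 3, 4, …;
  • Finite iterators: take one or more sequences as arguments and combine, group, or filter them;
  • Combinatoric generators: permutations and combinations of a sequence, Cartesian products, and so on.

Infinite iterators

count(start=0, step=1)

Creates an infinite iterator of evenly spaced values that starts at start (default 0) and advances by step (default 1).

>>> import itertools
>>> nums = itertools.count(10, 2)    # explicit start value and step
>>> for i in nums:
...     if i > 20:
...         break
...     print(i)
...
10
12
14
16
18
20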

cycle(iterable)

Loops over the elements of iterable again and again, returning an iterator.

>>> cycle_strings = itertools.cycle('ABC')
>>> i = 1
>>> for string in cycle_strings:
...     if i == 10:
...         break
...     print(i, string)
...     i += 1
...
1 A
2 B
3 C
4 A
5 B
6 C
7 A
8 B
9 C

repeat(object[, times])

Yields object over and over: times times if times is given, otherwise forever.

>>> for item in itertools.repeat('hello world', 3):
...     print(item)
...
hello world
hello world
hello world
>>>
>>> for item in itertools.repeat([1, 2, 3, 4], 3):
...     print(item)
...
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]

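Finite iterators

itertools provides a number of functions (classes) that take one or more iterables as arguments and combine, group, or filter them:

chain(iterable1, iterable2, iterable3, …)

chain takes several iterables as arguments and "links" them together into a single new iterator.

>>> from itertools import chain
>>> for item in chain([1, 2, 3], ['a', 'b', 'c']):
...     print(item)
...
1
2
3
a
b
c

chain has another common form:

chain.from_iterable(iterable)
Takes a single iterable whose items are themselves iterables and returns an iterator over their elements:

>>> from itertools import chain
>>>
>>> string = chain.from_iterable('ABCD')
>>> next(string)
'A'

compress(data, selectors)

compress filters data: an element of data is kept when the element of selectors at the same position is true, and dropped otherwise:

>>> from itertools import compress
>>> list(compress('ABCDEF', [1, 1, 0, 1, 0, 1]))
['A', 'B', 'D', 'F']

dropwhile(predicate, iterable)

Here predicate is a function and iterable is an iterable. Elements are discarded for as long as predicate(item) is true; from the first element for which it is false, that element and all later ones are returned.

>>> from itertools import dropwhile
>>> list(dropwhile(lambda x: x < 5, [1, 3, 6, 2, 1]))
[6, 2, 1]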

groupby(iterable[, keyfunc])

Here iterable is an iterable and keyfunc is the grouping function, which is applied to consecutive items of iterable. If keyfunc is omitted, consecutive identical items are grouped together. The result is an iterator of (key, sub-iterator) pairs. Because only consecutive items are grouped, the same key can appear more than once unless the data is sorted by that key first.

>>> from itertools import groupby
>>> data = ['a', 'bb', 'ccc', 'dd', 'eee', 'f']
>>> for key, value_iter in groupby(data, len):
...     print(key, ':', list(value_iter))
...
1 : ['a']
2 : ['bb']
3 : ['ccc']
2 : ['dd']
3 : ['eee']
1 : ['f']
>>> data = ['a', 'bb', 'cc', 'ddd', 'eee', 'f']
>>> for key, value_iter in groupby(data, len):
...     print(key, ':', list(value_iter))
...
1 : ['a']
2 : ['bb', 'cc']
3 : ['ddd', 'eee']
1 : ['f']

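filter(function or None, iterable) (itertools.ifilter in Python 2)

Returns an iterator over the elements of iterable for which function(item) is true. If function is None, it returns the elements that are themselves true. In Python 3 the built-in filter() is already lazy, so itertools.ifilter no longer exists.

>>> list(filter(lambda x: x < 6, range(10)))
[0, 1, 2, 3, 4, 5]
>>>
>>> list(filter(None, [0, 1, 2, 0, 3, 4]))
[1, 2, 3, 4]

filterfalse() (ifilterfalse in Python 2)

The counterpart of filter(): as the name suggests, it keeps the elements for which the function is false.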
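
A brief sketch of the "opposite" behaviour. In Python 3 the function is spelled itertools.filterfalse (Python 2 called it ifilterfalse), and the built-in filter() plays the role of ifilter. The sample inputs mirror the ones used for filtering above.

```python
from itertools import filterfalse

numbers = range(10)

# filter keeps the items for which the predicate is true ...
kept = list(filter(lambda x: x < 6, numbers))
# ... filterfalse keeps the items for which it is false.
dropped = list(filterfalse(lambda x: x < 6, numbers))

# With None as the predicate, filterfalse returns the falsy items.
falsy = list(filterfalse(None, [0, 1, 2, 0, 3, 4]))

print(kept)
print(dropped)
print(falsy)
```

Together, kept and dropped partition the input: every element lands in exactly one of the two lists.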

islice(iterable, [start,] stop [, step])

Here iterable is an iterable, start is the start index, stop is the end index, and step is the step size; start and step are optional.

>>> from itertools import count, islice
>>> list(islice([10, 6, 2, 8, 1, 3, 9], 5))
[10, 6, 2, 8, 1]
>>>
>>> list(islice(count(), 6))
[0, 1, 2, 3, 4, 5]
>>>
>>> list(islice(count(), 3, 10))
[3, 4, 5, 6, 7, 8, 9]
>>> list(islice(count(), 3, 10, 2))
[3, 5, 7, 9]

map() (itertools.imap in Python 2)

map(func, iter1, iter2, iter3, …)
map returns an iterator whose elements are func(i1, i2, i3, …), where i1, i2, and so on come from iter1, iter2, and so on. In Python 3 the built-in map() is already lazy, so itertools.imap no longer exists.

>>> map(str, [1, 2, 3, 4])
<map object at 0x10556d050>
>>>
>>> list(map(str, [1, 2, 3, 4]))
['1', '2', '3', '4']
>>>
>>> list(map(pow, [2, 3, 10], [4, 2, 3]))
[16, 9, 1000]

starmap()
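
starmap(function, iterable) calls function(*args) for each tuple args taken from iterable, so it suits data that is already grouped into argument tuples. A minimal sketch, with arbitrary sample pairs:

```python
from itertools import starmap

pairs = [(2, 4), (3, 2), (10, 3)]

# Each tuple is unpacked into the arguments: pow(2, 4), pow(3, 2), pow(10, 3)
powers = list(starmap(pow, pairs))
print(powers)

# The same result with map(), which wants one iterable per argument instead
same = list(map(pow, *zip(*pairs)))
print(same)
```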

tee()

tee(iterable [,n])
tee creates n independent iterators from iterable and returns them as a tuple; n defaults to 2.

>>> from itertools import tee
>>> tee('abcd')   # n defaults to 2, so two independent iterators are created
(<itertools._tee object at 0x1049957e8>, <itertools._tee object at 0x104995878>)
>>>
>>> iter1, iter2 = tee('abcde')
>>> list(iter1)
['a', 'b', 'c', 'd', 'e']
>>> list(iter2)
['a', 'b', 'c', 'd', 'e']
>>>
>>> tee('abc', 3) # create three independent iterators
(<itertools._tee object at 0x104995998>, <itertools._tee object at 0x1049959e0>, <itertools._tee object at 0x104995a28>)

takewhile()

takewhile(predicate, iterable)
Here predicate is a function and iterable is an iterable. Elements are kept for as long as predicate(item) is true; iteration stops as soon as predicate(item) is false.

>>> from itertools import takewhile
>>> list(takewhile(lambda x: x < 5, [1, 3, 6, 2, 1]))
[1, 3]
>>> list(takewhile(lambda x: x > 3, [2, 1, 6, 5, 4]))
[]

zip() (itertools.izip in Python 2)

zip(iter1, iter2, …, iterN)
Iteration stops as soon as one of the iterables is exhausted. In Python 3 the built-in zip() is already lazy, so itertools.izip no longer exists.

>>> for item in zip('ABCD', 'xy'):
...     print(item)
...
('A', 'x')
('B', 'y')
>>> for item in zip([1, 2, 3], ['a', 'b', 'c', 'd', 'e']):
...     print(item)
...
(1, 'a')
(2, 'b')
(3, 'c')

zip_longest() (izip_longest in Python 2)

zip_longest(iter1, iter2, …, iterN, [fillvalue=None])
Iteration continues until the longest iterable is exhausted. Missing values are filled with fillvalue if it is given, otherwise with None.

>>> from itertools import zip_longest
>>> for item in zip_longest('ABCD', 'xy'):
...     print(item)
...
('A', 'x')
('B', 'y')
('C', None)
('D', None)
>>>
>>> for item in zip_longest('ABCD', 'xy', fillvalue='-'):
...     print(item)
...
('A', 'x')
('B', 'y')
('C', '-')
('D', '-')

Combinatoric generators

itertools also provides several combinatoric generator functions for producing the permutations, combinations, and so on of a sequence:

  • product
  • permutations
  • combinations
  • combinations_with_replacement

    product(iter1, iter2, … iterN, [repeat=1])

    Here repeat is a keyword argument that says how many times the input sequences are repeated.

    >>> from itertools import product
    >>> for item in product('ABCD', 'xy'):
    ...     print(item)
    ...
    ('A', 'x')
    ('A', 'y')
    ('B', 'x')
    ('B', 'y')
    ('C', 'x')
    ('C', 'y')
    ('D', 'x')
    ('D', 'y')
    >>>
    >>> list(product('ab', range(3)))
    [('a', 0), ('a', 1), ('a', 2), ('b', 0), ('b', 1), ('b', 2)]
    >>>
    >>> list(product((0,1), (0,1), (0,1)))
    [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
    >>>
    >>> list(product('ABC', repeat=2))
    [('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'B'), ('C', 'C')]

permutations(iterable[, r])

Here r is the length of each permutation; if it is omitted, it defaults to the number of elements in the iterable.

>>> from itertools import permutations
>>> permutations('ABC', 2)
<itertools.permutations object at 0x1074d9c50>
>>>
>>> list(permutations('ABC', 2))
[('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
>>>
>>> list(permutations('ABC'))
[('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]

combinations(iterable, r)

Here r is the length of each combination.

>>> from itertools import combinations
>>> list(combinations('ABC', 2))
[('A', 'B'), ('A', 'C'), ('B', 'C')]
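
The list of combinatoric generators also names combinations_with_replacement, which has no example of its own. A minimal sketch of how it differs from combinations, using the same 'ABC' input:

```python
from itertools import combinations, combinations_with_replacement

# Without replacement: an element is used at most once per combination
plain = list(combinations('ABC', 2))

# With replacement: an element may repeat within a combination
with_repeat = list(combinations_with_replacement('ABC', 2))

print(plain)
print(with_repeat)
```

The second list contains everything in the first, plus the pairs that repeat an element, such as ('A', 'A').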

click

click wraps a large amount of functionality for building command-line tools. Here is a brief tour; see the official documentation for detailed usage.

  • Program

    # hello.py
    import click

    @click.command()
    @click.option('--count', default=1, help='Number of greetings.')
    @click.option('--name', prompt='Your name', help='The person to greet.')
    def hello(count, name):
        for x in range(count):
            click.echo('Hello %s!' % name)

    if __name__ == '__main__':
        hello()
  • Running it

    PS G:\test> python .\hello.py
    Your name: a
    Hello a!

In effect, click wraps the original function so that it can take command-line arguments.

PS G:\test> python .\hello.py --help
Usage: hello.py [OPTIONS]

Options:
  --count INTEGER  Number of greetings.
  --name TEXT      The person to greet.
  --help           Show this message and exit.

  • click.group()

    import click

    @click.group()
    def cli():
        pass

    @click.command()
    def initdb():
        click.echo('Initialized the database')

    @click.command()
    def dropdb():
        click.echo('Dropped the database')

    # register the commands with the group
    cli.add_command(initdb)
    cli.add_command(dropdb)

    if __name__ == '__main__':
        cli()
  • Running it
    The functions instantly become subcommands.

    PS G:\test> python .\hello.py
    Usage: hello.py [OPTIONS] COMMAND [ARGS]...

    Options:
      --help  Show this message and exit.

    Commands:
      dropdb
      initdb
    PS G:\test> python hello.py initdb
    Initialized the database
  • Group.command()
    Attaches the command to the group directly. The program above, upgraded, becomes:

    import click

    @click.group()
    def cli():
        pass

    @cli.command()
    def initdb():
        click.echo('Initialized the database')

    @cli.command()
    def dropdb():
        click.echo('Dropped the database')

    cli()
  • Result

    PS G:\test> python .\hello.py
    Usage: hello.py [OPTIONS] COMMAND [ARGS]...

    Options:
      --help  Show this message and exit.

    Commands:
      dropdb
      initdb
    PS G:\test> python .\hello.py initdb
    Initialized the database

yaml(pickle)

Loading a YAML file

>>> import yaml
>>> f = open('test.yml')
>>> f
<_io.TextIOWrapper name='test.yml' mode='r' encoding='UTF-8'>
>>> x = yaml.safe_load(f)  # plain yaml.load(f) without a Loader argument is rejected by recent PyYAML versions
>>> x
{'name': 'Tom Smith', 'age': 37, 'spouse': {'name': 'Jane Smith', 'age': 25}, 'children': [{'name': 'Jimmy Smith', 'age': 15}, {'name1': 'Jenny Smith', 'age1': 12}]}
>>>

Contents of test.yml:

name: Tom Smith
age: 37
spouse:
    name: Jane Smith
    age: 25
children:
    - name: Jimmy Smith
      age: 15
    - name1: Jenny Smith
      age1: 12

easydict

Lets you access dictionary values as attributes. It works even better together with yaml for reading configuration files.

>>> import yaml
>>> from easydict import EasyDict as edict
>>> y = open('test.yml')
>>> x = yaml.safe_load(y)
>>> x
{'name': 'Tom Smith', 'age': 37, 'spouse': {'name': 'Jane Smith', 'age': 25}, 'children': [{'name': 'Jimmy Smith', 'age': 15}, {'name1': 'Jenny Smith', 'age1': 12}]}
>>> d = edict(x)
>>> d
{'name': 'Tom Smith', 'age': 37, 'spouse': {'name': 'Jane Smith', 'age': 25}, 'children': [{'name': 'Jimmy Smith', 'age': 15}, {'name1': 'Jenny Smith', 'age1': 12}]}
>>> d.name
'Tom Smith'
>>> x.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'name'

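Common databases

SQLite

SQLite is an embedded relational database; an entire database is a single file. Python 2.5 and later ship with the sqlite3 module, so you only need import sqlite3 to use it.
Tutorial

Workflow

sqlite3.connect(database[, timeout, …])

Opens a connection to an SQLite database file: an existing file is opened, and a missing one is created.

connection.cursor()

Creates a cursor on the connection, which can be used to create tables, for example.

Inserts, deletes, updates, and queries can all be done with connection.execute()…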
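
The workflow above can be sketched end to end with the standard-library sqlite3 module. The table name users and its columns are made up for the example, and ':memory:' is used so that nothing is written to disk.

```python
import sqlite3

# ":memory:" keeps the database in RAM; a file path would open or create that file.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Create a table, then insert, update, and delete rows.
cur.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)')
cur.executemany('INSERT INTO users (name, age) VALUES (?, ?)',
                [('Tom', 37), ('Jane', 25), ('Jimmy', 15)])
cur.execute('UPDATE users SET age = ? WHERE name = ?', (38, 'Tom'))
cur.execute('DELETE FROM users WHERE name = ?', ('Jimmy',))
conn.commit()  # make the changes permanent

# Query: execute() returns a cursor, fetchall() returns a list of row tuples.
rows = conn.execute('SELECT name, age FROM users ORDER BY id').fetchall()
print(rows)

conn.close()
```

The ? placeholders let sqlite3 bind the values itself, which avoids building SQL strings by hand.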

MySQL

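Also a relational database and much like SQLite to use: create a connection and a cursor object, run SQL statements with execute(), commit the transaction with commit(), and get query results with fetchall().

LMDB

A key-value (non-relational) database.

Workflow

Open the environment with env = lmdb.open()
Start a transaction with txn = env.begin()
Insert and update with txn.put(key, value)
Delete with txn.delete(key)
Look up with txn.get(key)
Iterate with txn.cursor()
Commit the changes with txn.commit()

LevelDB

Also a key-value database.

Workflow

LevelDB is used much like LMDB, with Put/Get/Delete, but it is simpler (no transaction txn and no commit), and it also supports the range iterator RangeIter.

SQLite and MySQL are both relational databases: you create a connection object and a cursor object, run SQL statements with execute, commit changes with commit, and get query results with fetch. LMDB and LevelDB are both key-value databases: you open a connection to the database, change data with put/delete, and read data with get. The difference is that LMDB has transactions that must be committed, while LevelDB does not.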